Table of Papers 2023

Arxiv ID | Title | Authors | Abstract | What | Why | How | Result | LF (Limitations & Future Work) | Tags
2312.17742 Report Learning Vision from Models Rivals Learning Vision from Data Yonglong Tian, Lijie Fan, Kaifeng Chen, Dina Katabi, Dilip Krishnan, Phillip Isola We introduce SynCLR, a novel approach for learning visual representations exclusively from synthetic images and synthetic captions, without any real data. We synthesize a large dataset of image captions using LLMs, then use an off-the-shelf text-to-image model to generate multiple images corresponding to each synthetic caption. We perform visual representation learning on these synthetic images via contrastive learning, treating images sharing the same caption as positive pairs. The resulting representations transfer well to many downstream tasks, competing favorably with other general-purpose visual representation learners such as CLIP and DINO v2 in image classification tasks. Furthermore, in dense prediction tasks such as semantic segmentation, SynCLR outperforms previous self-supervised methods by a significant margin, e.g., improving over MAE and iBOT by 6.2 and 4.3 mIoU on ADE20k for ViT-B/16. This paper introduces SynCLR, a novel approach for learning visual representations exclusively from synthetic images and captions generated by LLMs and text-to-image models, without relying on real data. SynCLR addresses the limitations of relying on large-scale real datasets for visual representation learning, which can be costly, ethically challenging, and may introduce biases. The methodology involves synthesizing a large dataset of image captions using LLMs, generating multiple images per caption using a text-to-image model, and training a visual representation model via contrastive learning and masked image modeling on this synthetic dataset. SynCLR achieves comparable performance to OpenAI's CLIP and DINO v2 on ImageNet linear evaluation and fine-grained classification tasks, despite relying solely on synthetic data. SynCLR demonstrates strong transferability to dense prediction tasks, outperforming MAE and iBOT on ADE20K semantic segmentation. SynCLR's performance scales with the volume of synthetic data, with larger models benefiting more from increased data scale. While SynCLR shows promising results, it still lags behind models like DINO v2, which benefit from distillation from larger architectures and high-resolution training. The dependence on the quality of the synthetic data introduces a limitation, as improvements in generative models can directly influence SynCLR's performance. representation learning, synthetic data, contrastive learning, masked image modeling, text-to-image generation
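The contrastive step in the SynCLR entry above treats images generated from the same synthetic caption as positive pairs. Below is a minimal PyTorch sketch of such a multi-positive InfoNCE-style loss; the batch layout, temperature, and averaging over positives are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(embeddings, caption_ids, temperature=0.1):
    """Contrastive loss where samples sharing a caption id are positives.

    embeddings:  (N, D) features from the image encoder.
    caption_ids: (N,) integer id of the synthetic caption each image came from.
    """
    z = F.normalize(embeddings, dim=1)
    logits = z @ z.t() / temperature                       # (N, N) similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    logits = logits.masked_fill(self_mask, float('-inf'))  # exclude self-pairs

    pos_mask = caption_ids.unsqueeze(0) == caption_ids.unsqueeze(1)
    pos_mask = pos_mask & ~self_mask                       # positives: same caption, not self

    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0)    # keep only positive terms
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -pos_log_prob.sum(dim=1) / pos_counts           # mean over each anchor's positives
    return loss[pos_mask.any(dim=1)].mean()                # anchors with at least one positive

# Toy usage: 6 images drawn from 3 synthetic captions (2 images per caption).
feats = torch.randn(6, 128)
ids = torch.tensor([0, 0, 1, 1, 2, 2])
print(multi_positive_contrastive_loss(feats, ids))
```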
2312.17681 Report FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis Feng Liang, Bichen Wu, Jialiang Wang, Licheng Yu, Kunpeng Li, Yinan Zhao, Ishan Misra, Jia-Bin Huang, Peizhao Zhang, Peter Vajda, Diana Marculescu Diffusion models have transformed the image-to-image (I2I) synthesis and are now permeating into videos. However, the advancement of video-to-video (V2V) synthesis has been hampered by the challenge of maintaining temporal consistency across video frames. This paper proposes a consistent V2V synthesis framework by jointly leveraging spatial conditions and temporal optical flow clues within the source video. Contrary to prior methods that strictly adhere to optical flow, our approach harnesses its benefits while handling the imperfection in flow estimation. We encode the optical flow via warping from the first frame and serve it as a supplementary reference in the diffusion model. This enables our model for video synthesis by editing the first frame with any prevalent I2I models and then propagating edits to successive frames. Our V2V model, FlowVid, demonstrates remarkable properties: (1) Flexibility: FlowVid works seamlessly with existing I2I models, facilitating various modifications, including stylization, object swaps, and local edits. (2) Efficiency: Generation of a 4-second video with 30 FPS and 512x512 resolution takes only 1.5 minutes, which is 3.1x, 7.2x, and 10.5x faster than CoDeF, Rerender, and TokenFlow, respectively. (3) High-quality: In user studies, our FlowVid is preferred 45.7% of the time, outperforming CoDeF (3.5%), Rerender (10.2%), and TokenFlow (40.4%). FlowVid, a novel video-to-video synthesis method that leverages optical flow as a soft constraint in conjunction with spatial conditions to enhance temporal consistency in generated videos. Maintaining temporal consistency in video-to-video synthesis is challenging. Existing methods relying solely on spatial-temporal attention or rigidly constrained optical flow often fall short. FlowVid addresses this by jointly using spatial conditions and flexibly incorporating optical flow, leading to more consistent and high-quality video synthesis. FlowVid employs a two-stage edit-propagate approach: 1) Edit the first frame using existing image-to-image models. 2) Propagate edits to subsequent frames using a trained video diffusion model conditioned on spatial information (e.g., depth maps) and temporal information from flow-warped frames. Outperforms state-of-the-art methods like CoDeF, Rerender, and TokenFlow in user studies, exhibiting superior prompt alignment and overall video quality. Significantly faster than competing methods, generating a 4-second, 512x512 resolution video at 30 FPS in just 1.5 minutes. Flexible enough to support various video editing tasks, including stylization, object swaps, and local edits. Relies heavily on the structural alignment of the edited first frame with the original. May struggle with large occlusions caused by rapid camera or object motion. video-to-video synthesis, diffusion models, optical flow, temporal consistency, spatial conditions
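FlowVid's temporal cue comes from warping the (edited) first frame to later frames with estimated optical flow and feeding the warped result as a supplementary condition. The sketch below shows backward warping with `torch.nn.functional.grid_sample`; the flow convention (per-pixel offsets from each target frame back to frame 0) and tensor shapes are assumptions for illustration, not FlowVid's actual interface.

```python
import torch
import torch.nn.functional as F

def warp_from_first_frame(first_frame, flow_to_first):
    """Backward-warp the first frame to a target frame using optical flow.

    first_frame:   (B, C, H, W) edited first frame.
    flow_to_first: (B, 2, H, W) per-pixel (dx, dy) offsets mapping target-frame
                   pixels to their source location in the first frame.
    Returns the flow-warped frame, usable as a soft temporal condition.
    """
    b, _, h, w = first_frame.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=first_frame.device, dtype=first_frame.dtype),
        torch.arange(w, device=first_frame.device, dtype=first_frame.dtype),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow_to_first[:, 0]     # sample locations in frame 0
    grid_y = ys.unsqueeze(0) + flow_to_first[:, 1]
    # Normalize to [-1, 1] as required by grid_sample.
    grid = torch.stack(
        (2.0 * grid_x / (w - 1) - 1.0, 2.0 * grid_y / (h - 1) - 1.0), dim=-1
    )
    return F.grid_sample(first_frame, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

# Toy usage: zero flow returns the frame unchanged (up to interpolation).
frame0 = torch.rand(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)
print(torch.allclose(warp_from_first_frame(frame0, flow), frame0, atol=1e-5))
```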
2312.17561 Report Informative Rays Selection for Few-Shot Neural Radiance Fields Marco Orsingher, Anthony Dell'Eva, Paolo Zani, Paolo Medici, Massimo Bertozzi Neural Radiance Fields (NeRF) have recently emerged as a powerful method for image-based 3D reconstruction, but the lengthy per-scene optimization limits their practical usage, especially in resource-constrained settings. Existing approaches solve this issue by reducing the number of input views and regularizing the learned volumetric representation with either complex losses or additional inputs from other modalities. In this paper, we present KeyNeRF, a simple yet effective method for training NeRF in few-shot scenarios by focusing on key informative rays. Such rays are first selected at camera level by a view selection algorithm that promotes baseline diversity while guaranteeing scene coverage, then at pixel level by sampling from a probability distribution based on local image entropy. Our approach performs favorably against state-of-the-art methods, while requiring minimal changes to existing NeRF codebases. This paper proposes KeyNeRF, a method to improve the efficiency of Neural Radiance Fields (NeRF) in few-shot scenarios by focusing on key informative cameras and pixels during training. Standard NeRF training is computationally expensive, especially when the number of input views is limited. Existing few-shot NeRF methods address this issue by introducing complex losses or relying on additional input modalities, which adds complexity and might hinder practicality. KeyNeRF employs a two-stage selection process: (1) View selection: A minimal set of cameras covering the entire scene is selected and augmented with additional views based on baseline diversity. (2) Rays sampling: For each selected camera, pixels are sampled from a probability distribution based on local image entropy to prioritize high-frequency details. KeyNeRF outperforms state-of-the-art few-shot NeRF methods on both synthetic (Blender) and real-world (CO3D) datasets in terms of rendering quality. The view selection strategy demonstrates faster and more stable convergence, especially with very few input views. Entropy-based rays sampling leads to better rendering of fine-grained details and intricate structures compared to uniform sampling. The current method assumes an object-centric acquisition trajectory with the object having higher entropy than the background. Future work will focus on addressing these limitations and extending the approach to other neural reconstruction methods. neural radiance fields, novel view synthesis, few-shot learning, 3d reconstruction, view selection
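KeyNeRF's pixel-level selection samples rays with probability proportional to local image entropy. A small NumPy sketch of that idea follows; the window size, the intensity quantization, and sampling without replacement are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def local_entropy(gray, win=9, bins=32):
    """Shannon entropy of intensities in a win x win window around each pixel (gray in [0, 1])."""
    h, w = gray.shape
    pad = win // 2
    padded = np.pad(gray, pad, mode="reflect")
    q = np.clip((padded * bins).astype(int), 0, bins - 1)   # quantize intensities
    ent = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            patch = q[i:i + win, j:j + win].ravel()
            counts = np.bincount(patch, minlength=bins).astype(float)
            p = counts / counts.sum()
            nz = p > 0
            ent[i, j] = -(p[nz] * np.log(p[nz])).sum()
    return ent

def sample_ray_pixels(gray, num_rays, rng=np.random.default_rng(0)):
    """Draw pixel coordinates with probability proportional to local entropy."""
    ent = local_entropy(gray)
    prob = ent.ravel() + 1e-8                  # avoid an all-zero distribution
    prob /= prob.sum()
    idx = rng.choice(prob.size, size=num_rays, replace=False, p=prob)
    return np.stack(np.unravel_index(idx, gray.shape), axis=1)  # (num_rays, 2) as (row, col)

# Toy usage on a random grayscale image in [0, 1].
img = np.random.default_rng(1).random((48, 48))
print(sample_ray_pixels(img, num_rays=128).shape)
```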
2312.17505 Report Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation Tuan-Anh Vu, Duc Thanh Nguyen, Qing Guo, Binh-Son Hua, Nhat Minh Chung, Ivor W. Tsang, Sai-Kit Yeung Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions. This indicates that there exists a strong correlation between the visual and textual domains. In addition, text-image discriminative models such as CLIP excel in image labelling from text prompts, thanks to the rich and diverse information available from open concepts. In this paper, we leverage these technical advances to solve a challenging problem in computer vision: camouflaged instance segmentation. Specifically, we propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary to learn multi-scale textual-visual features for camouflaged object representations. Such cross-domain representations are desirable in segmenting camouflaged objects where visual cues are subtle to distinguish the objects from the background, especially in segmenting novel objects which are not seen in training. We also develop technically supportive components to effectively fuse cross-domain features and engage relevant features towards respective foreground objects. We validate our method and compare it with existing ones on several benchmark datasets of camouflaged instance segmentation and generic open-vocabulary instance segmentation. Experimental results confirm the advances of our method over existing ones. We will publish our code and pre-trained models to support future research. This paper proposes a novel method for Camouflaged Instance Segmentation (CIS) that leverages text-to-image diffusion and text-image transfer techniques with open-vocabulary. CIS is a challenging problem in computer vision due to the subtle visual differences between camouflaged objects and their surroundings. Existing methods often struggle with novel objects unseen during training. This method aims to overcome these challenges by incorporating rich textual information from open-vocabulary. The method combines a pre-trained Stable Diffusion model for image feature extraction, a pre-trained CLIP model for text embedding generation, and a mask generator based on Mask2Former. It uses a multi-scale feature fusion module to integrate image features and text embeddings at different scales. A textual-visual aggregation module highlights object-relevant features, while a camouflaged instance normalisation module refines the segmentation masks. The proposed method achieves state-of-the-art performance on benchmark camouflaged object datasets (COD10K-v3 and NC4K). It demonstrates strong generalization ability, effectively segmenting novel object categories. The method achieves comparable performance to state-of-the-art open-vocabulary instance segmentation methods on generic datasets (ADE20K and Cityscapes) while using significantly fewer parameters. The method may face difficulty in separating touching/overlapping instances with highly similar appearances. Severe object occlusions can hinder accurate segmentation, leading to potential misclassifications. camouflaged instance segmentation, open-vocabulary, text-to-image diffusion, text-image transfer, computer vision
2312.17448 Report Tracking with Human-Intent Reasoning Jiawen Zhu, Zhi-Qi Cheng, Jun-Yan He, Chenyang Li, Bin Luo, Huchuan Lu, Yifeng Geng, Xuansong Xie Advances in perception modeling have significantly improved the performance of object tracking. However, the current methods for specifying the target object in the initial frame are either by 1) using a box or mask template, or by 2) providing an explicit language description. These manners are cumbersome and do not allow the tracker to have self-reasoning ability. Therefore, this work proposes a new tracking task -- Instruction Tracking, which involves providing implicit tracking instructions that require the trackers to perform tracking automatically in video frames. To achieve this, we investigate the integration of knowledge and reasoning capabilities from a Large Vision-Language Model (LVLM) for object tracking. Specifically, we propose a tracker called TrackGPT, which is capable of performing complex reasoning-based tracking. TrackGPT first uses LVLM to understand tracking instructions and condense the cues of what target to track into referring embeddings. The perception component then generates the tracking results based on the embeddings. To evaluate the performance of TrackGPT, we construct an instruction tracking benchmark called InsTrack, which contains over one thousand instruction-video pairs for instruction tuning and evaluation. Experiments show that TrackGPT achieves competitive performance on referring video object segmentation benchmarks, such as achieving a new state-of-the-art performance of 66.5 $\mathcal{J}\&\mathcal{F}$ on Refer-DAVIS. It also demonstrates a superior performance of instruction tracking under new evaluation protocols. The code and models are available at https://github.com/jiawen-zhu/TrackGPT. This paper introduces the task of "instruction tracking", a new paradigm in object tracking where implicit instructions, rather than explicit bounding boxes or language descriptions, guide the tracker to locate and follow a target object in a video. Current tracking methods rely on cumbersome and impractical methods (bounding boxes, precise masks, detailed descriptions) to specify the object to be tracked. Instruction tracking aims to make this interaction more natural and intuitive, mimicking how humans would guide a tracker. The authors propose TrackGPT, a novel tracker powered by a Large Vision-Language Model (LVLM). TrackGPT leverages the LVLM's reasoning ability to interpret human instructions and translate them into referring cues for object tracking. The system also features a 'rethinking mechanism' to adjust tracking based on how well the results match the instruction's intent, and a 'cross-frame referring propagation' module for temporal consistency. TrackGPT achieves state-of-the-art performance on the newly proposed InsTrack benchmark for instruction tracking. It also demonstrates competitive performance on established referring video object segmentation benchmarks, including a new state-of-the-art result on Refer-DAVIS₁₇. Ablation studies confirm the effectiveness of the rethinking mechanism, cross-frame referring propagation, and instruction tuning strategies. TrackGPT currently doesn't support tracking multiple objects from a single instruction. Future work could explore multi-object instruction tracking and improve efficiency for real-time applications. instruction tracking, object tracking, large vision-language model, referring video object segmentation, reasoning
2312.17432 Report Video Understanding with Large Language Models: A Survey Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, Chenliang Xu With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given the remarkable capabilities of Large Language Models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of the recent advancements in video understanding harnessing the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended spatial-temporal reasoning combined with commonsense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into four main types: LLM-based Video Agents, Vid-LLMs Pretraining, Vid-LLMs Instruction Tuning, and Hybrid Methods. Furthermore, this survey presents a comprehensive study of the tasks, datasets, and evaluation methodologies for Vid-LLMs. Additionally, it explores the expansive applications of Vid-LLMs across various domains, highlighting their remarkable scalability and versatility in real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers are recommended to visit the repository at https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding. This paper surveys recent advancements in video understanding using Large Language Models (Vid-LLMs), analyzing models, datasets, tasks, and applications. The integration of LLMs with video understanding is crucial due to the growing volume of video content and the demand for intelligent analysis tools. The paper categorizes Vid-LLM approaches into four types: LLM-based Video Agents, Vid-LLM Pretraining, Vid-LLM Instruction Tuning, and Hybrid Methods. It also examines common tasks, datasets, and evaluation metrics in video understanding. Vid-LLMs demonstrate promising capabilities in various video understanding tasks, including recognition, captioning, grounding, and question answering. The use of LLMs enables more sophisticated multimodal understanding, allowing for the processing of complex interactions between visual, textual, and auditory data. Existing Vid-LLMs have limitations in fine-grained and long-term video understanding, multi-modal integration, and addressing hallucination issues. Current Vid-LLMs still face challenges in fine-grained and long-term video understanding, multi-modal integration, and human interaction. Future research should focus on addressing hallucination issues, improving computational efficiency, and developing more robust evaluation metrics. video understanding, large language models, multimodal learning, computer vision, artificial intelligence
2312.17250 Report iFusion: Inverting Diffusion for Pose-Free Reconstruction from Sparse Views Chin-Hsuan Wu, Yen-Chun Chen, Bolivar Solarte, Lu Yuan, Min Sun We present iFusion, a novel 3D object reconstruction framework that requires only two views with unknown camera poses. While single-view reconstruction yields visually appealing results, it can deviate significantly from the actual object, especially on unseen sides. Additional views improve reconstruction fidelity but necessitate known camera poses. However, assuming the availability of pose may be unrealistic, and existing pose estimators fail in sparse view scenarios. To address this, we harness a pre-trained novel view synthesis diffusion model, which embeds implicit knowledge about the geometry and appearance of diverse objects. Our strategy unfolds in three steps: (1) We invert the diffusion model for camera pose estimation instead of synthesizing novel views. (2) The diffusion model is fine-tuned using provided views and estimated poses, turned into a novel view synthesizer tailored for the target object. (3) Leveraging registered views and the fine-tuned diffusion model, we reconstruct the 3D object. Experiments demonstrate strong performance in both pose estimation and novel view synthesis. Moreover, iFusion seamlessly integrates with various reconstruction methods and enhances them. Presents iFusion, a novel 3D object reconstruction framework that requires only two views with unknown camera poses, leveraging a pre-trained diffusion model (Zero123) to estimate poses and enhance reconstruction fidelity. Existing 3D reconstruction methods either rely on single-view inference leading to ambiguity or require accurate camera poses typically unavailable from sparse views. 1) Inverts Zero123 to estimate relative camera pose by minimizing differences in denoised latent visual features. 2) Fine-tunes Zero123 with estimated poses and given views for object-specific novel view synthesis. 3) Integrates estimated poses and fine-tuned diffusion model with a differentiable renderer (e.g., NeRFs, Gaussian Splatting) for 3D reconstruction. Significantly outperforms state-of-the-art pose estimation methods with only two views. Demonstrates superior novel view synthesis quality compared to Zero123 and 3D-based methods. Consistently enhances the performance of existing single-view reconstruction methods leading to more accurate 3D models. Pose estimation is slower than feed-forward methods due to optimization-based approach. Lacks complete 3D consistency due to limitations of Zero123's 2D-based architecture. 3d reconstruction, pose estimation, novel view synthesis, diffusion models, sparse view
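iFusion's first step inverts a pose-conditioned novel-view diffusion model (Zero123) to recover the relative camera pose: the pose is optimized so that the frozen denoiser best explains the second view given the first. The sketch below captures only that optimization pattern; the tiny stand-in denoiser, the toy noise schedule, and the three-parameter pose are placeholders, whereas the real method works in Zero123's latent space with its actual conditioning.

```python
import torch
import torch.nn as nn

class DummyPoseConditionedDenoiser(nn.Module):
    """Placeholder standing in for a pose-conditioned diffusion model such as Zero123."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(8, 3, 3, padding=1))
        self.pose_proj = nn.Linear(3, 3)

    def forward(self, noisy_target, source, pose, t):
        # The timestep t is ignored here; a real model would embed it.
        bias = self.pose_proj(pose).view(1, 3, 1, 1)
        return self.net(noisy_target + source) + bias    # predicted noise

def estimate_relative_pose(model, source, target, steps=200, lr=1e-2):
    """Optimize (azimuth, elevation, radius) so the frozen denoiser best explains `target`."""
    pose = torch.zeros(3, requires_grad=True)            # relative pose parameters
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        t = torch.randint(0, 1000, (1,))
        noise = torch.randn_like(target)
        alpha = 1.0 - t.float() / 1000.0                 # toy noise schedule
        noisy = alpha.sqrt() * target + (1 - alpha).sqrt() * noise
        eps_pred = model(noisy, source, pose, t)
        loss = torch.mean((eps_pred - noise) ** 2)       # denoising error as the pose score
        opt.zero_grad(); loss.backward(); opt.step()
    return pose.detach()

model = DummyPoseConditionedDenoiser()
src, tgt = torch.rand(1, 3, 32, 32), torch.rand(1, 3, 32, 32)
print(estimate_relative_pose(model, src, tgt, steps=20))
```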
2312.17243 Report Unsupervised Universal Image Segmentation Dantong Niu, Xudong Wang, Xinyang Han, Long Lian, Roei Herzig, Trevor Darrell Several unsupervised image segmentation approaches have been proposed which eliminate the need for dense manually-annotated segmentation masks; current models separately handle either semantic segmentation (e.g., STEGO) or class-agnostic instance segmentation (e.g., CutLER), but not both (i.e., panoptic segmentation). We propose an Unsupervised Universal Segmentation model (U2Seg) adept at performing various image segmentation tasks -- instance, semantic and panoptic -- using a novel unified framework. U2Seg generates pseudo semantic labels for these segmentation tasks via leveraging self-supervised models followed by clustering; each cluster represents different semantic and/or instance membership of pixels. We then self-train the model on these pseudo semantic labels, yielding substantial performance gains over specialized methods tailored to each task: a +2.6 AP$^{\text{box}}$ boost vs. CutLER in unsupervised instance segmentation on COCO and a +7.0 PixelAcc increase (vs. STEGO) in unsupervised semantic segmentation on COCOStuff. Moreover, our method sets up a new baseline for unsupervised panoptic segmentation, which has not been previously explored. U2Seg is also a strong pretrained model for few-shot segmentation, surpassing CutLER by +5.0 AP$^{\text{mask}}$ when trained on a low-data regime, e.g., only 1% COCO labels. We hope our simple yet effective method can inspire more research on unsupervised universal image segmentation. This paper introduces U2Seg, a novel unified framework for Unsupervised Universal image Segmentation capable of performing instance, semantic, and panoptic segmentation without human annotations. Existing unsupervised image segmentation methods only address either semantic or class-agnostic instance segmentation, but not both, limiting comprehensive scene understanding. U2Seg leverages self-supervised models and clustering to generate pseudo semantic labels for both instance and semantic segmentation. Then, it trains a universal segmentation model on these pseudo labels to perform various segmentation tasks. U2Seg outperforms previous state-of-the-art methods specialized for individual tasks, achieving +2.6 AP^box improvement in instance segmentation on COCO and +7.0 PixelAcc increase in semantic segmentation on COCOStuff. The method establishes a new baseline for unsupervised panoptic segmentation, previously unexplored. U2Seg shows superior performance as a pretrained model for few-shot segmentation, outperforming CutLER by +5.0 AP^mask when trained on 1% COCO labels. The universal model shows slightly lower performance compared to task-specific models. Future work focuses on improving model versatility to handle multiple tasks effectively with single training. unsupervised learning, image segmentation, instance segmentation, semantic segmentation, panoptic segmentation
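U2Seg's pseudo semantic labels come from clustering features produced by self-supervised models. A minimal sketch of that step with scikit-learn's KMeans is shown below; using one pooled feature per class-agnostic mask, the feature dimension, and the cluster count are assumptions for illustration and not U2Seg's exact pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

def pseudo_semantic_labels(features, num_clusters=27, seed=0):
    """Cluster self-supervised features; each cluster id acts as a pseudo semantic label.

    features: (N, D) array, e.g. one pooled DINO feature per class-agnostic mask.
    Returns an (N,) array of pseudo labels in [0, num_clusters).
    """
    km = KMeans(n_clusters=num_clusters, n_init=10, random_state=seed)
    return km.fit_predict(features)

# Toy usage: 500 fake mask features of dimension 384 (ViT-S token width).
rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 384))
labels = pseudo_semantic_labels(feats)
print(labels.shape, labels.min(), labels.max())
```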
2312.17241 Report Compact Neural Graphics Primitives with Learned Hash Probing Towaki Takikawa, Thomas Müller, Merlin Nimier-David, Alex Evans, Sanja Fidler, Alec Jacobson, Alexander Keller Neural graphics primitives are faster and achieve higher quality when their neural networks are augmented by spatial data structures that hold trainable features arranged in a grid. However, existing feature grids either come with a large memory footprint (dense or factorized grids, trees, and hash tables) or slow performance (index learning and vector quantization). In this paper, we show that a hash table with learned probes has neither disadvantage, resulting in a favorable combination of size and speed. Inference is faster than unprobed hash tables at equal quality while training is only 1.2-2.6x slower, significantly outperforming prior index learning approaches. We arrive at this formulation by casting all feature grids into a common framework: they each correspond to a lookup function that indexes into a table of feature vectors. In this framework, the lookup functions of existing data structures can be combined by simple arithmetic combinations of their indices, resulting in Pareto optimal compression and speed. This paper introduces a novel compression technique for neural graphics primitives, termed "compact neural graphics primitives," which leverages learned hash probing to achieve a favorable balance between compactness and speed. Existing spatial data structures used to enhance neural graphics primitives often compromise either memory efficiency or performance. This work aims to address this limitation by developing a technique that combines the strengths of hash tables and index learning. The authors propose a learned hash probing scheme where a spatial hash function determines the most significant bits of an index, while the remaining bits are learned via an auxiliary index codebook. This strategy enables efficient collision resolution and feature reuse. The method is trained using a straight-through estimator to handle the non-differentiable nature of the indexing process. Compact neural graphics primitives demonstrate faster inference than unprobed hash tables (Instant NGP) at comparable quality levels due to improved cache utilization. The technique achieves competitive compression rates compared to state-of-the-art methods, including JPEG for images and masked wavelet representations for NeRFs, while maintaining random access capability and differentiability. Experiments reveal that a small probing range is sufficient for effective compression, and the training overhead associated with learned probing is manageable (1.26x to 2.61x slower than Instant NGP). The approach currently relies on a straight-through estimator for training, which might not be optimal. Exploring alternative techniques, such as sparse or stochastic variants, could be beneficial. While spatial hashing provides application agnosticism, utilizing data structures that better exploit spatial locality might enhance compression efficiency further. neural graphics primitives, compression, hash tables, index learning, deep learning
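The core idea above is that a spatial hash fixes the coarse bucket of a feature-table index while a small learned codebook chooses the probe offset within that bucket, trained through a straight-through estimator. The sketch below mirrors that structure with illustrative table sizes and an assumed way of indexing the probe codebook; the paper's exact bit layout and training details differ.

```python
import torch
import torch.nn as nn

PRIMES = (1, 2654435761, 805459861)  # spatial-hash primes as in Instant NGP

def spatial_hash(coords, table_size):
    """XOR-of-primes hash of integer 3D grid coordinates into [0, table_size)."""
    h = coords[..., 0] * PRIMES[0]
    h ^= coords[..., 1] * PRIMES[1]
    h ^= coords[..., 2] * PRIMES[2]
    return h % table_size

class ProbedHashGrid(nn.Module):
    """Feature lookup where the hash picks a bucket and a learned probe picks the slot."""
    def __init__(self, table_size=2**14, feat_dim=2, probe_range=8, codebook_size=2**12):
        super().__init__()
        self.table = nn.Parameter(torch.randn(table_size, feat_dim) * 1e-2)
        self.probe_logits = nn.Parameter(torch.zeros(codebook_size, probe_range))
        self.table_size, self.probe_range, self.codebook_size = table_size, probe_range, codebook_size

    def forward(self, coords):
        base = spatial_hash(coords, self.table_size)                  # most-significant part
        logits = self.probe_logits[spatial_hash(coords, self.codebook_size)]
        soft = torch.softmax(logits, dim=-1)
        hard = torch.nn.functional.one_hot(soft.argmax(-1), self.probe_range).float()
        probe = hard + soft - soft.detach()                           # straight-through estimator
        slots = (base.unsqueeze(-1) + torch.arange(self.probe_range)) % self.table_size
        return (probe.unsqueeze(-1) * self.table[slots]).sum(dim=-2)  # (N, feat_dim)

grid = ProbedHashGrid()
xyz = torch.randint(0, 128, (5, 3))
print(grid(xyz).shape)  # torch.Size([5, 2])
```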
2312.17240 Report LISA++: An Improved Baseline for Reasoning Segmentation with Large Language Model Senqiao Yang, Tianyuan Qu, Xin Lai, Zhuotao Tian, Bohao Peng, Shu Liu, Jiaya Jia While LISA effectively bridges the gap between segmentation and large language models to enable reasoning segmentation, it poses certain limitations: unable to distinguish different instances of the target region, and constrained by the pre-defined textual response formats. In this work, we introduce LISA++, an update to the existing LISA model, focusing on improving core functionalities while keeping the base architecture intact. The main enhancements in LISA++ include: 1) Enhanced Segmentation: The instance segmentation ability has been added, providing a more detailed scene analysis along with the existing multi-region semantic segmentation. 2) More Natural Conversation: Improved capability for multi-turn dialogue, with the ability to incorporate segmentation results directly into text responses, i.e., Segmentation in Dialogue (SiD). These improvements are achieved by curating the existing samples of generic segmentation datasets, aimed specifically at enhancing the segmentation and conversational skills without structural change and additional data sources. Comparative analysis with the original LISA model shows significant advancements in these areas, positioning LISA++ as a notable upgrade in visual understanding and interaction. LISA++'s adaptability and improved features highlight the versatility of the mask-as-embedding paradigm proposed by LISA, and the potential as a foundational model for diverse applications. Introduces LISA++, an enhanced version of LISA for visual understanding and interaction, focusing on improving instance segmentation and natural conversation capabilities. Addresses limitations of existing multimodal models in providing detailed positional information and engaging in natural dialogue with segmentation results. Leverages existing segmentation datasets to reconstruct instruction-tuning data, enabling instance segmentation and Segmentation in Dialogue (SiD) without architectural changes. Extends the ReasonSeg benchmark to evaluate instance segmentation. LISA++ demonstrates significant improvements in instance segmentation compared to the original LISA. LISA++ maintains comparable performance to LISA in semantic segmentation, indicating the generalizability of the framework. LISA++ exhibits the ability to integrate segmentation results naturally within dialogue, enhancing its conversational capabilities. The performance on low-resolution images and small objects requires further investigation. Future work includes extending LISA++ to more complex scenarios, such as video understanding and 3D scene analysis. instance segmentation, visual reasoning, multimodal learning, large language models, segmentation in dialogue
2312.17232 Report Segment3D: Learning Fine-Grained Class-Agnostic 3D Segmentation without Manual Labels Rui Huang, Songyou Peng, Ayca Takmaz, Federico Tombari, Marc Pollefeys, Shiji Song, Gao Huang, Francis Engelmann Current 3D scene segmentation methods are heavily dependent on manually annotated 3D training datasets. Such manual annotations are labor-intensive, and often lack fine-grained details. Importantly, models trained on this data typically struggle to recognize object classes beyond the annotated classes, i.e., they do not generalize well to unseen domains and require additional domain-specific annotations. In contrast, 2D foundation models demonstrate strong generalization and impressive zero-shot abilities, inspiring us to incorporate these characteristics from 2D models into 3D models. Therefore, we explore the use of image segmentation foundation models to automatically generate training labels for 3D segmentation. We propose Segment3D, a method for class-agnostic 3D scene segmentation that produces high-quality 3D segmentation masks. It improves over existing 3D segmentation models (especially on fine-grained masks), and enables easily adding new training data to further boost the segmentation performance -- all without the need for manual training labels. Introduces Segment3D, a novel method for fine-grained, class-agnostic 3D point cloud segmentation that doesn't require manually annotated labels. Existing 3D segmentation methods depend heavily on manual labels which are costly and time-consuming to acquire, and often lack fine-grained details, limiting their generalizability. Leverages pre-trained 2D foundation models (SAM) for automatic mask generation. Employs a two-stage training approach: 1) Pre-training on partial RGB-D point clouds supervised by projected SAM masks, 2) Self-supervised fine-tuning on full 3D point clouds using high-confidence predictions from the pre-trained model. Segment3D achieves state-of-the-art performance on ScanNet++, surpassing existing methods, including fully supervised ones. It exhibits superior performance in segmenting small objects and fine-grained details compared to methods trained on manual labels. Demonstrates strong generalization ability, effectively segmenting unseen objects in both indoor and outdoor scenarios. Limited exploration of incorporating even larger and more diverse datasets during pre-training. Further investigation into the impact of the number of queries on the model's performance is needed. 3d point cloud segmentation, class-agnostic segmentation, foundation models, unsupervised learning, open-vocabulary scene understanding
2312.17225 Report 4DGen: Grounded 4D Content Generation with Spatial-temporal Consistency Yuyang Yin, Dejia Xu, Zhangyang Wang, Yao Zhao, Yunchao Wei Aided by text-to-image and text-to-video diffusion models, existing 4D content creation pipelines utilize score distillation sampling to optimize the entire dynamic 3D scene. However, as these pipelines generate 4D content from text or image inputs, they incur significant time and effort in prompt engineering through trial and error. This work introduces 4DGen, a novel, holistic framework for grounded 4D content creation that decomposes the 4D generation task into multiple stages. We identify static 3D assets and monocular video sequences as key components in constructing the 4D content. Our pipeline facilitates conditional 4D generation, enabling users to specify geometry (3D assets) and motion (monocular videos), thus offering superior control over content creation. Furthermore, we construct our 4D representation using dynamic 3D Gaussians, which permits efficient, high-resolution supervision through rendering during training, thereby facilitating high-quality 4D generation. Additionally, we employ spatial-temporal pseudo labels on anchor frames, along with seamless consistency priors implemented through 3D-aware score distillation sampling and smoothness regularizations. Compared to existing baselines, our approach yields competitive results in faithfully reconstructing input signals and realistically inferring renderings from novel viewpoints and timesteps. Most importantly, our method supports grounded generation, offering users enhanced control, a feature difficult to achieve with previous methods. Project page: https://vita-group.github.io/4DGen/ Introduces 4DGen, a novel pipeline for grounded 4D content creation allowing control over motion and appearance using monocular video as input. Addresses limitations of previous 4D generation methods like restricted motion capabilities, reliance on unreliable prompt engineering, and low-resolution outputs. Leverages deformable 3D Gaussians for 4D representation, employs spatial-temporal pseudo labels from a multi-view diffusion model, and ensures consistency via 3D-aware score distillation sampling and smoothness regularization. Outperforms baselines in video-to-4D tasks, demonstrating superior spatial and temporal consistency. Enables faithful generation of input signals and plausible novel view synthesis at arbitrary timesteps. Supports image-to-4D and text-to-4D generation via integration with video diffusion models. Limited to single object generation due to the object-centric nature of the pre-trained diffusion prior. Future work will focus on extending the framework to handle multi-object and scene-level generation. 4d content creation, grounded generation, 3d gaussian splatting, score distillation sampling, diffusion models
2312.17161 Report Restoration by Generation with Constrained Priors Zheng Ding, Xuaner Zhang, Zhuowen Tu, Zhihao Xia The inherent generative power of denoising diffusion models makes them well-suited for image restoration tasks where the objective is to find the optimal high-quality image within the generative space that closely resembles the input image. We propose a method to adapt a pretrained diffusion model for image restoration by simply adding noise to the input image to be restored and then denoise. Our method is based on the observation that the space of a generative model needs to be constrained. We impose this constraint by finetuning the generative model with a set of anchor images that capture the characteristics of the input image. With the constrained space, we can then leverage the sampling strategy used for generation to do image restoration. We evaluate against previous methods and show superior performances on multiple real-world restoration datasets in preserving identity and image quality. We also demonstrate an important and practical application on personalized restoration, where we use a personal album as the anchor images to constrain the generative space. This approach allows us to produce results that accurately preserve high-frequency details, which previous works are unable to do. Project webpage: https://gen2res.github.io. This paper proposes a novel image restoration method that leverages the generative power of pre-trained diffusion models by adding noise to a degraded input image and then denoising it using the diffusion model, with the generative space constrained by a set of anchor images. This method addresses limitations of existing supervised restoration methods that rely on paired training data and struggle to generalize to real-world degradations. The method constrains the diffusion model's generative space by fine-tuning it with either a 'generative album' (generated from the input image with skip guidance) for single-image restoration or a 'personal album' (provided set of clean images of the same subject) for personalized restoration. The method achieves state-of-the-art results on standard blind face restoration benchmarks, outperforming supervised methods in FID and MUSIQ. It exhibits strong generalization to real-world degradations like motion blur, even without explicit training on such data. In personalized restoration, the method effectively leverages the personal album to preserve identity and recover high-frequency details, surpassing both single-image and exemplar-based approaches. Single-image restoration requires per-image fine-tuning, which is computationally expensive. The method's effectiveness on general image restoration is yet to be explored, relying on the availability of high-quality pre-trained diffusion models for diverse image domains. image restoration, diffusion models, generative models, blind image restoration, personalized image restoration
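The restoration procedure summarized above amounts to diffusing the degraded input to an intermediate noise level and then running the (anchor-constrained, fine-tuned) model's reverse process from there. The sketch below shows that noise-then-denoise loop with a deterministic DDIM-style update; the linear beta schedule and the tiny placeholder noise predictor are assumptions standing in for the fine-tuned diffusion model.

```python
import torch
import torch.nn as nn

class TinyEpsModel(nn.Module):
    """Placeholder noise predictor; the paper uses a diffusion model fine-tuned on anchor images."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, 3, padding=1)
    def forward(self, x, t):
        return self.net(x)

def restore_by_generation(eps_model, degraded, num_steps=1000, start_step=400):
    """Add noise up to `start_step`, then run a DDIM-style reverse pass back to a clean image."""
    betas = torch.linspace(1e-4, 2e-2, num_steps)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)

    # 1) Forward: diffuse the degraded input to an intermediate noise level.
    a = alphas_bar[start_step]
    x = a.sqrt() * degraded + (1 - a).sqrt() * torch.randn_like(degraded)

    # 2) Reverse: deterministic DDIM updates from start_step back to 0.
    for t in range(start_step, 0, -1):
        a_t, a_prev = alphas_bar[t], alphas_bar[t - 1]
        eps = eps_model(x, t)
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
    return x

model = TinyEpsModel()
degraded = torch.rand(1, 3, 32, 32)
print(restore_by_generation(model, degraded, start_step=50).shape)
```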
2312.17142 Report DreamGaussian4D: Generative 4D Gaussian Splatting Jiawei Ren, Liang Pan, Jiaxiang Tang, Chi Zhang, Ang Cao, Gang Zeng, Ziwei Liu Remarkable progress has been made in 4D content generation recently. However, existing methods suffer from long optimization time, lack of motion controllability, and a low level of detail. In this paper, we introduce DreamGaussian4D, an efficient 4D generation framework that builds on 4D Gaussian Splatting representation. Our key insight is that the explicit modeling of spatial transformations in Gaussian Splatting makes it more suitable for the 4D generation setting compared with implicit representations. DreamGaussian4D reduces the optimization time from several hours to just a few minutes, allows flexible control of the generated 3D motion, and produces animated meshes that can be efficiently rendered in 3D engines. DreamGaussian4D is an efficient 4D generation framework based on 4D Gaussian Splatting, which reduces optimization time from hours to minutes while enabling controllable 3D motion. Existing 4D content generation methods suffer from long optimization times, lack of motion controllability, and low detail. The framework leverages a static 3D Gaussian Splatting model, optimized using an enhanced DreamGaussianHD method. A deformation network learns motion from a driving video, allowing for controllable dynamics. An optional video-to-video pipeline refines textures on exported animated meshes. Significantly faster optimization compared to existing methods (minutes instead of hours). Controllable 3D motion generation by leveraging driving videos. High-quality animated meshes with refined textures, suitable for real-world applications. Limited diversity in the generated shapes due to single-image input. Reliance on external image-to-video models for driving video generation. 4d content generation, gaussian splatting, motion control, image-to-4d, texture refinement
2312.16886 Report MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, Chunhua Shen We present MobileVLM, a competent multimodal vision language model (MMVLM) targeted to run on mobile devices. It is an amalgamation of a myriad of architectural designs and techniques that are mobile-oriented, which comprises a set of language models at the scale of 1.4B and 2.7B parameters, trained from scratch, a multimodal vision model that is pre-trained in the CLIP fashion, cross-modality interaction via an efficient projector. We evaluate MobileVLM on several typical VLM benchmarks. Our models demonstrate on par performance compared with a few much larger models. More importantly, we measure the inference speed on both a Qualcomm Snapdragon 888 CPU and an NVIDIA Jetson Orin GPU, and we obtain state-of-the-art performance of 21.5 tokens and 65.3 tokens per second, respectively. Our code will be made available at: https://github.com/Meituan-AutoML/MobileVLM. The paper introduces MobileVLM, a multimodal vision language model designed for efficient execution on mobile and IoT devices. Large multimodal models (LMMs) are resource-intensive and challenging to deploy on edge devices. MobileVLM addresses this by offering comparable performance to larger models while being optimized for mobile platforms. The model consists of a pre-trained CLIP ViT-L/14 visual encoder, a lightweight downsample projector (LDP) for aligning visual and textual features, and a tailored LLM (MobileLLaMA) based on a downscaled LLaMA architecture. It is trained using a two-stage approach involving pretraining and instruction tuning. MobileVLM achieves comparable performance to larger VLMs on benchmarks like GQA, POPE, and MMBench. The efficient projector design in MobileVLM reduces visual tokens by 75% without compromising performance. MobileVLM exhibits superior inference speed on Snapdragon 888 CPU and Jetson Orin GPU compared to similar-sized models, achieving up to 21.5 tokens/s and 65.3 tokens/s respectively. The performance of MobileVLM on tasks like ScienceQA and MME, which require extensive training data, shows room for improvement. Future work includes exploring neural architecture search for optimizing the LLM component. vision language model, mobile deployment, efficient projector, multimodal learning, edge ai
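The lightweight downsample projector (LDP) mentioned above aligns CLIP patch tokens with the LLM width and reduces the visual token count by 75%. The sketch below captures one way to get that 4-to-1 token reduction with a stride-2 depthwise convolution; the layer sizes, activation, and exact block structure are illustrative and not the published LDP design.

```python
import torch
import torch.nn as nn

class LightweightDownsampleProjector(nn.Module):
    """Project ViT patch tokens to LLM width and merge each 2x2 neighborhood (75% fewer tokens)."""
    def __init__(self, vit_dim=1024, llm_dim=2048, grid=24):
        super().__init__()
        self.grid = grid
        self.proj = nn.Sequential(nn.Linear(vit_dim, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))
        # Depthwise stride-2 conv merges each 2x2 neighborhood of tokens.
        self.down = nn.Conv2d(llm_dim, llm_dim, kernel_size=2, stride=2, groups=llm_dim)

    def forward(self, patch_tokens):                       # (B, grid*grid, vit_dim)
        b = patch_tokens.size(0)
        x = self.proj(patch_tokens)                        # (B, N, llm_dim)
        x = x.transpose(1, 2).reshape(b, -1, self.grid, self.grid)
        x = self.down(x)                                   # (B, llm_dim, grid/2, grid/2)
        return x.flatten(2).transpose(1, 2)                # (B, N/4, llm_dim)

proj = LightweightDownsampleProjector()
tokens = torch.randn(2, 24 * 24, 1024)                     # e.g. CLIP ViT-L/14 at 336px -> 24x24 patches
print(proj(tokens).shape)                                  # torch.Size([2, 144, 2048])
```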
2312.16862 Report TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones Zhengqing Yuan, Zhaoxu Li, Weiran Huang, Yanfang Ye, Lichao Sun In recent years, multimodal large language models (MLLMs) such as GPT-4V have demonstrated remarkable advancements, excelling in a variety of vision-language tasks. Despite their prowess, the closed-source nature and computational demands of such models limit their accessibility and applicability. This study introduces TinyGPT-V, a novel open-source MLLM, designed for efficient training and inference across various vision-language tasks, including image captioning (IC) and visual question answering (VQA). Leveraging a compact yet powerful architecture, TinyGPT-V integrates the Phi-2 language model with pre-trained vision encoders, utilizing a unique mapping module for visual and linguistic information fusion. With a training regimen optimized for small backbones and employing a diverse dataset amalgam, TinyGPT-V requires significantly lower computational resources (24GB for training and as little as 8GB for inference) without compromising on performance. Our experiments demonstrate that TinyGPT-V, with its 2.8-billion-parameter language model, achieves comparable results in VQA and image inference tasks to its larger counterparts while being uniquely suited for deployment on resource-constrained devices through innovative quantization techniques. This work not only paves the way for more accessible and efficient MLLMs but also underscores the potential of smaller, optimized models in bridging the gap between high performance and computational efficiency in real-world applications. Additionally, this paper introduces a new approach to multimodal large language models using smaller backbones. Our code and training weights are available at https://github.com/DLYuanGod/TinyGPT-V. This paper introduces TinyGPT-V, an open-source multimodal large language model (MLLM) designed for efficient training and inference across various vision-language tasks, despite having a significantly smaller size compared to existing models. Existing MLLMs, while powerful, often require substantial computational resources, limiting their accessibility and applicability. TinyGPT-V addresses this by achieving comparable performance with lower computational demands, making it suitable for resource-constrained devices. TinyGPT-V integrates the Phi-2 language model with pre-trained vision encoders, utilizing a unique mapping module for visual and linguistic information fusion. It's trained using a novel methodology optimized for small backbones and a diverse dataset. TinyGPT-V achieves competitive performance on various benchmarks like VQA and referring expression comprehension, comparable to models with much larger parameter sizes. It requires only 24GB GPU memory for training and can be deployed on devices with as little as 8GB memory. TinyGPT-V demonstrates superior efficiency in inference speed and memory occupancy compared to models like LLaVA and MiniGPT-4. TinyGPT-V's performance on certain benchmarks, while strong, still lags behind the absolute top performers, suggesting room for improvement. The study primarily focuses on a limited set of vision-language tasks, leaving its capabilities in other areas unexplored. multimodal large language models, vision-language tasks, efficient training, resource-constrained devices, open-source
2312.16837 Report DiffusionGAN3D: Boosting Text-guided 3D Generation and Domain Adaptation by Combining 3D GANs and Diffusion Priors Biwen Lei, Kai Yu, Mengyang Feng, Miaomiao Cui, Xuansong Xie Text-guided domain adaptation and generation of 3D-aware portraits find many applications in various fields. However, due to the lack of training data and the challenges in handling the high variety of geometry and appearance, the existing methods for these tasks suffer from issues like inflexibility, instability, and low fidelity. In this paper, we propose a novel framework DiffusionGAN3D, which boosts text-guided 3D domain adaptation and generation by combining 3D GANs and diffusion priors. Specifically, we integrate the pre-trained 3D generative models (e.g., EG3D) and text-to-image diffusion models. The former provides a strong foundation for stable and high-quality avatar generation from text. And the diffusion models in turn offer powerful priors and guide the 3D generator finetuning with informative direction to achieve flexible and efficient text-guided domain adaptation. To enhance the diversity in domain adaptation and the generation capability in text-to-avatar, we introduce the relative distance loss and case-specific learnable triplane respectively. Besides, we design a progressive texture refinement module to improve the texture quality for both tasks above. Extensive experiments demonstrate that the proposed framework achieves excellent results in both domain adaptation and text-to-avatar tasks, outperforming existing methods in terms of generation quality and efficiency. The project homepage is at https://younglbw.github.io/DiffusionGAN3D-homepage/. This paper presents DiffusionGAN3D, a novel framework that boosts the performance of text-guided 3D generation and domain adaptation by combining 3D GANs and diffusion priors. This approach addresses the limitations of existing text-to-3D methods, which often suffer from low-quality results, instability, and poor texture details. DiffusionGAN3D employs a Score Distillation Sampling (SDS) strategy to guide the generation process of a 3D GAN, along with a progressive texture refinement mechanism to further enhance the quality of the generated 3D assets. DiffusionGAN3D demonstrates superior performance in both domain adaptation and text-to-avatar tasks, producing high-fidelity results with fine-grained textures and diverse geometry. The framework exhibits strong generalization capabilities across various domains, including human heads, animals, and stylized avatars. DiffusionGAN3D also enables 3D-aware local editing on both synthetic and real images while preserving details and identity. The performance of DiffusionGAN3D is reliant on the quality and capabilities of the base 3D generator used. The method currently struggles with local editing tasks that involve significant deformation. text-to-3d, 3d generation, domain adaptation, diffusion models, generative adversarial networks (gans)
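Score Distillation Sampling uses a frozen text-to-image diffusion prior to provide gradients for a differentiable 3D generator: noise the rendering, predict the noise, and push the rendering so the residual shrinks. The sketch below shows the common "detach trick" form of that gradient; the timestep range, the weighting w(t) = 1 - alpha_bar, and the placeholder prior are assumptions, and DiffusionGAN3D's additional relative distance loss is not shown.

```python
import torch

def sds_loss(rendered, eps_model, text_emb, alphas_bar, guidance_weight=1.0):
    """Score distillation: push a differentiable rendering toward the diffusion prior.

    rendered:  (B, C, H, W) image rendered from the 3D generator (requires grad).
    eps_model: callable(noisy_image, t, text_emb) -> predicted noise (frozen prior).
    """
    t = torch.randint(20, len(alphas_bar) - 20, (1,))      # avoid extreme timesteps
    a = alphas_bar[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(rendered)
    noisy = a.sqrt() * rendered + (1 - a).sqrt() * noise
    with torch.no_grad():
        eps_pred = eps_model(noisy, t, text_emb)
    grad = guidance_weight * (1 - a) * (eps_pred - noise)  # w(t) * (eps_hat - eps)
    # "Detach trick": gradients of this loss w.r.t. `rendered` equal `grad`.
    return (grad.detach() * rendered).sum()

# Toy usage with a frozen placeholder prior (stands in for a text-to-image diffusion model).
eps_model = lambda x, t, c: x.mean(dim=1, keepdim=True).expand_as(x)
alphas_bar = torch.cumprod(1 - torch.linspace(1e-4, 2e-2, 1000), dim=0)
rendered = torch.rand(1, 3, 64, 64, requires_grad=True)
loss = sds_loss(rendered, eps_model, text_emb=None, alphas_bar=alphas_bar)
loss.backward()
print(rendered.grad.shape)
```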
2312.16812 Report Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis Zhan Li, Zhang Chen, Zhong Li, Yi Xu Novel view synthesis of dynamic scenes has been an intriguing yet challenging problem. Despite recent advancements, simultaneously achieving high-resolution photorealistic results, real-time rendering, and compact storage remains a formidable task. To address these challenges, we propose Spacetime Gaussian Feature Splatting as a novel dynamic scene representation, composed of three pivotal components. First, we formulate expressive Spacetime Gaussians by enhancing 3D Gaussians with temporal opacity and parametric motion/rotation. This enables Spacetime Gaussians to capture static, dynamic, as well as transient content within a scene. Second, we introduce splatted feature rendering, which replaces spherical harmonics with neural features. These features facilitate the modeling of view- and time-dependent appearance while maintaining small size. Third, we leverage the guidance of training error and coarse depth to sample new Gaussians in areas that are challenging to converge with existing pipelines. Experiments on several established real-world datasets demonstrate that our method achieves state-of-the-art rendering quality and speed, while retaining compact storage. At 8K resolution, our lite-version model can render at 60 FPS on an Nvidia RTX 4090 GPU. Our code is available at https://github.com/oppo-us-research/SpacetimeGaussians. This paper introduces Spacetime Gaussian Feature Splatting, a novel dynamic scene representation for real-time, high-resolution dynamic view synthesis with compact model size. Existing methods struggle to simultaneously achieve high-resolution photorealistic results, real-time rendering, and compact storage for dynamic novel view synthesis. The method extends 3D Gaussians to 4D spacetime using temporal opacity and polynomial motion/rotation parameters. It replaces spherical harmonics with splatted neural features and a lightweight MLP for efficient view- and time-dependent radiance encoding. Guided sampling of Gaussians based on training error and coarse depth improves rendering quality in sparsely covered areas. Achieves state-of-the-art rendering quality and speed while maintaining compact model size on multiple datasets. Outperforms baselines on PSNR, DSSIM, and LPIPS metrics. Enables 8K resolution rendering at 60 FPS with the lite-version model on an NVIDIA RTX 4090 GPU. The method currently lacks on-the-fly training capability. It currently focuses on multi-view video inputs and adapting to monocular settings is left for future work. dynamic view synthesis, neural rendering, spacetime gaussian, feature splatting, guided sampling
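A Spacetime Gaussian as described above augments a 3D Gaussian with a temporal opacity (a 1D Gaussian over time) and polynomial motion of its center. The sketch below evaluates position and effective opacity at a query time; the polynomial degree and parameterization are illustrative, and the parametric rotation and splatted neural features from the paper are omitted.

```python
import torch

def evaluate_spacetime_gaussians(mu, motion_coeffs, sigma_t, mu_t, base_opacity, t):
    """Evaluate per-Gaussian position and opacity at time t.

    mu:            (N, 3) base centers.
    motion_coeffs: (N, K, 3) polynomial motion coefficients (degree K).
    sigma_t, mu_t: (N,) temporal scale and temporal center of each Gaussian's lifespan.
    base_opacity:  (N,) spatial opacity.
    """
    dt = t - mu_t                                                                        # (N,)
    powers = torch.stack([dt ** (k + 1) for k in range(motion_coeffs.size(1))], dim=1)   # (N, K)
    position = mu + (motion_coeffs * powers.unsqueeze(-1)).sum(dim=1)                    # (N, 3)
    temporal_opacity = torch.exp(-0.5 * (dt / sigma_t) ** 2)                             # 1D Gaussian in time
    return position, base_opacity * temporal_opacity

# Toy usage: 4 Gaussians with cubic motion, queried at t = 0.3.
n, k = 4, 3
pos, opac = evaluate_spacetime_gaussians(
    mu=torch.rand(n, 3), motion_coeffs=0.1 * torch.randn(n, k, 3),
    sigma_t=torch.full((n,), 0.2), mu_t=torch.rand(n), base_opacity=torch.rand(n), t=0.3,
)
print(pos.shape, opac.shape)
```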
2312.16794 Report ZONE: Zero-Shot Instruction-Guided Local Editing Shanglin Li, Bohan Zeng, Yutang Feng, Sicheng Gao, Xuhui Liu, Jiaming Liu, Li Lin, Xu Tang, Yao Hu, Jianzhuang Liu, Baochang Zhang Recent advances in vision-language models like Stable Diffusion have shown remarkable power in creative image synthesis and editing. However, most existing text-to-image editing methods encounter two obstacles: First, the text prompt needs to be carefully crafted to achieve good results, which is not intuitive or user-friendly. Second, they are insensitive to local edits and can irreversibly affect non-edited regions, leaving obvious editing traces. To tackle these problems, we propose a Zero-shot instructiON-guided local image Editing approach, termed ZONE. We first convert the editing intent from the user-provided instruction (e.g., "make his tie blue") into specific image editing regions through InstructPix2Pix. We then propose a Region-IoU scheme for precise image layer extraction from an off-the-shelf segment model. We further develop an edge smoother based on FFT for seamless blending between the layer and the image. Our method allows for arbitrary manipulation of a specific region with a single instruction while preserving the rest. Extensive experiments demonstrate that our ZONE achieves remarkable local editing results and user-friendliness, outperforming state-of-the-art methods. Code is available at https://github.com/lsl001006/ZONE. ZONE is a zero-shot, instruction-guided approach for local image editing that enables users to modify specific regions of real or synthetic images using simple instructions while preserving non-edited areas. Existing text-to-image editing methods often require complex prompt engineering or struggle to confine edits locally, leading to undesired alterations in non-targeted image regions. ZONE addresses these limitations by enabling intuitive, localized edits with user-friendly instructions. ZONE leverages a pre-trained InstructPix2Pix model to identify and edit regions based on user instructions. It introduces a Region-IoU scheme for precise mask refinement using SAM and employs an FFT-based edge smoother for seamless blending of edited layers with the original image. ZONE achieves superior performance in local image editing compared to state-of-the-art methods, as demonstrated by quantitative metrics like L1, L2, LPIPS, CLIP-I, and CLIP-T. It effectively preserves non-edited regions, avoiding distortions commonly found in other instruction-guided methods. Human evaluations confirm ZONE's effectiveness, with users showing a strong preference for its editing results and a higher success rate in achieving desired edits. The editing capabilities of ZONE are limited by the capacity of the underlying instruction-guided diffusion models, which may not always perform optimally. Localization can be challenging in complex scenes with multiple similar objects or very small objects, requiring further research to improve. image editing, local editing, instruction-guided editing, diffusion models, zero-shot learning
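ZONE's final compositing step smooths the extracted mask's edges in the frequency domain before blending the edited layer back into the image. The sketch below low-pass filters a binary mask with an FFT and uses it as soft alpha weights; the circular cutoff and its ratio are assumptions rather than ZONE's exact edge smoother.

```python
import torch

def fft_smooth_mask(mask, cutoff_ratio=0.1):
    """Low-pass filter a binary mask in the frequency domain to get soft blending weights."""
    h, w = mask.shape
    freq = torch.fft.fftshift(torch.fft.fft2(mask))
    yy, xx = torch.meshgrid(torch.arange(h) - h // 2, torch.arange(w) - w // 2, indexing="ij")
    radius = torch.sqrt(yy.float() ** 2 + xx.float() ** 2)
    lowpass = (radius <= cutoff_ratio * min(h, w)).float()
    smoothed = torch.fft.ifft2(torch.fft.ifftshift(freq * lowpass)).real
    return smoothed.clamp(0.0, 1.0)

def blend_edited_layer(original, edited, mask, cutoff_ratio=0.1):
    """Composite the edited layer into the original image with FFT-smoothed edges."""
    alpha = fft_smooth_mask(mask, cutoff_ratio).unsqueeze(0)      # (1, H, W)
    return alpha * edited + (1.0 - alpha) * original              # (C, H, W)

# Toy usage: paste a brightened square region with soft edges.
img = torch.rand(3, 128, 128)
edited = (img * 1.5).clamp(0, 1)
mask = torch.zeros(128, 128)
mask[40:90, 40:90] = 1.0
print(blend_edited_layer(img, edited, mask).shape)
```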
2312.16720 Report Prompt Expansion for Adaptive Text-to-Image Generation Siddhartha Datta, Alexander Ku, Deepak Ramachandran, Peter Anderson Text-to-image generation models are powerful but difficult to use. Users craft specific prompts to get better images, though the images can be repetitive. This paper proposes a Prompt Expansion framework that helps users generate high-quality, diverse images with less effort. The Prompt Expansion model takes a text query as input and outputs a set of expanded text prompts that are optimized such that when passed to a text-to-image model, generates a wider variety of appealing images. We conduct a human evaluation study that shows that images generated through Prompt Expansion are more aesthetically pleasing and diverse than those generated by baseline methods. Overall, this paper presents a novel and effective approach to improving the text-to-image generation experience. This paper introduces Prompt Expansion, a framework that enhances text-to-image generation by expanding user queries into detailed prompts, improving image quality and diversity. Existing text-to-image models often produce repetitive outputs and necessitate elaborate prompt engineering. This framework addresses these limitations by promoting diverse and high-quality image generation with less user effort. The authors create a Prompt Expansion dataset by inverting aesthetically pleasing images to text prompts and then mapping them to high-level user queries. They train a text-to-text model on this dataset and fine-tune it using a downstream text-to-image model. Prompt Expansion generates more aesthetically pleasing and diverse images compared to straight-query generation, as evidenced by automatic metrics and human evaluation. The fine-tuned model excels in aesthetics, demonstrating the significance of aligning the framework with the downstream text-to-image model. Prompt Expansion maintains reasonable text-image alignment, ensuring the expanded prompts stay true to the user's original intent. The diversity improvements, while consistent, are relatively small in magnitude. The model's performance relies heavily on the quality and diversity of the training dataset. text-to-image generation, prompt engineering, image diversity, image aesthetics, text-image alignment
2312.16693 Report I2V-Adapter: A General Image-to-Video Adapter for Diffusion Models Xun Guo, Mingwu Zheng, Liang Hou, Yuan Gao, Yufan Deng, Pengfei Wan, Di Zhang, Yufan Liu, Weiming Hu, Zhengjun Zha, Haibin Huang, Chongyang Ma Text-guided image-to-video (I2V) generation aims to generate a coherent video that preserves the identity of the input image and semantically aligns with the input prompt. Existing methods typically augment pretrained text-to-video (T2V) models by either concatenating the image with noised video frames channel-wise before being fed into the model or injecting the image embedding produced by pretrained image encoders in cross-attention modules. However, the former approach often necessitates altering the fundamental weights of pretrained T2V models, thus restricting the model's compatibility within the open-source communities and disrupting the model's prior knowledge. Meanwhile, the latter typically fails to preserve the identity of the input image. We present I2V-Adapter to overcome such limitations. I2V-Adapter adeptly propagates the unnoised input image to subsequent noised frames through a cross-frame attention mechanism, maintaining the identity of the input image without any changes to the pretrained T2V model. Notably, I2V-Adapter only introduces a few trainable parameters, significantly alleviating the training cost and also ensures compatibility with existing community-driven personalized models and control tools. Moreover, we propose a novel Frame Similarity Prior to balance the motion amplitude and the stability of generated videos through two adjustable control coefficients. Our experimental results demonstrate that I2V-Adapter is capable of producing high-quality videos. This performance, coupled with its agility and adaptability, represents a substantial advancement in the field of I2V, particularly for personalized and controllable applications. This paper proposes I2V-Adapter, a lightweight, plug-and-play adapter for image-to-video generation that efficiently adapts pretrained text-to-video diffusion models without altering their original weights. Existing methods for adapting text-to-video models for image-to-video tasks require significant modifications, leading to training instability and incompatibility with personalized models or control tools. I2V-Adapter leverages pretrained model knowledge by feeding the unnoised input image and noised frames in parallel. It employs a cross-frame attention mechanism to propagate identity information, preserving the first frame's identity. Additionally, a Frame Similarity Prior balances motion and stability in generated videos. I2V-Adapter generates high-quality videos with consistency between input images and subsequent frames while adhering to text prompts. It outperforms existing methods in quantitative metrics, demonstrating superior image consistency, motion range, and motion accuracy. The method's plug-and-play nature ensures compatibility with personalized T2I models and control tools like ControlNet. Limited to generating 16-frame, 512x512 videos due to constraints from pretrained base models and video data. Future work aims to incorporate frame interpolation and super-resolution modules for longer, higher-resolution videos. image-to-video generation, diffusion models, adapter, cross-frame attention, frame similarity prior
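A minimal single-head sketch of the cross-frame attention idea follows: noised frame tokens query the keys/values of the unnoised first frame, so identity information propagates without modifying the pretrained T2V weights. The (B, T, N, C) token layout and externally supplied projection layers are assumptions; multi-head splitting and the Frame Similarity Prior are omitted.

```python
import torch

def cross_frame_attention(frame_tokens, first_frame_tokens, to_q, to_k, to_v):
    """frame_tokens: (B, T, N, C) tokens of the noised video frames.
    first_frame_tokens: (B, N, C) tokens of the unnoised input image.
    to_q/to_k/to_v: linear projections (assumed shared with the pretrained UNet)."""
    B, T, N, C = frame_tokens.shape
    q = to_q(frame_tokens)                          # (B, T, N, C)
    k = to_k(first_frame_tokens).unsqueeze(1)       # (B, 1, N, C), broadcast over T
    v = to_v(first_frame_tokens).unsqueeze(1)       # (B, 1, N, C)
    attn = torch.softmax(q @ k.transpose(-1, -2) / C ** 0.5, dim=-1)  # (B, T, N, N)
    return attn @ v                                 # (B, T, N, C)
```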
2312.16649 Report Forgery-aware Adaptive Transformer for Generalizable Synthetic Image Detection Huan Liu, Zichang Tan, Chuangchuang Tan, Yunchao Wei, Yao Zhao, Jingdong Wang In this paper, we study the problem of generalizable synthetic image detection, aiming to detect forgery images from diverse generative methods, e.g., GANs and diffusion models. Cutting-edge solutions start to explore the benefits of pre-trained models, and mainly follow the fixed paradigm of solely training an attached classifier, e.g., combining frozen CLIP-ViT with a learnable linear layer in UniFD. However, our analysis shows that such a fixed paradigm is prone to yield detectors with insufficient learning regarding forgery representations. We attribute the key challenge to the lack of forgery adaptation, and present a novel forgery-aware adaptive transformer approach, namely FatFormer. Based on the pre-trained vision-language spaces of CLIP, FatFormer introduces two core designs for the adaption to build generalized forgery representations. First, motivated by the fact that both image and frequency analysis are essential for synthetic image detection, we develop a forgery-aware adapter to adapt image features to discern and integrate local forgery traces within image and frequency domains. Second, we find that considering the contrastive objectives between adapted image features and text prompt embeddings, a previously overlooked aspect, results in a nontrivial generalization improvement. Accordingly, we introduce language-guided alignment to supervise the forgery adaptation with image and text prompts in FatFormer. Experiments show that, by coupling these two designs, our approach tuned on 4-class ProGAN data attains a remarkable detection performance, achieving an average of 98% accuracy to unseen GANs, and surprisingly generalizes to unseen diffusion models with 95% accuracy. This paper proposes FatFormer, a forgery-aware adaptive transformer, for generalizable synthetic image detection, aiming to effectively detect fake images generated by various methods like GANs and diffusion models. Existing methods relying on fixed pre-trained models with attached classifiers show limitations in learning robust forgery representations, resulting in poor generalization ability to unseen generation methods. FatFormer leverages a forgery-aware adapter (FAA) to extract and integrate forgery traces in both image and frequency domains. It also introduces language-guided alignment (LGA) that utilizes contrastive objectives between adapted image features and text prompts for supervising the learning of generalized forgery representations. FatFormer consistently outperforms state-of-the-art methods on detecting fake images from various GANs, achieving 98.4% accuracy and 99.7% AP. It demonstrates remarkable generalization ability by effectively detecting images from unseen diffusion models with 95.0% accuracy and 98.8% AP. Ablation studies validate the contributions of FAA, LGA, and their components in enhancing detection performance and generalizability. There is still room for improvement in detecting fake images generated by specific diffusion models like Guided. Exploring better pre-training tasks specifically designed for synthetic image detection could further enhance FatFormer's performance. synthetic image detection, forgery detection, generative models, adaptive transformer, vision-language model
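As a rough illustration of the "image and frequency domain" idea (not FatFormer's actual forgery-aware adapter), feature maps can be lifted into the frequency domain with a 2D FFT so that a learnable adapter can also inspect amplitude and phase for forgery traces:

```python
import torch

def frequency_branch(image_feats: torch.Tensor) -> torch.Tensor:
    """image_feats: (B, C, H, W). Returns amplitude and phase stacked on the
    channel axis, (B, 2C, H, W), as a simple frequency-domain view of the features."""
    spec = torch.fft.fft2(image_feats, norm="ortho")
    amplitude = torch.abs(spec)
    phase = torch.angle(spec)
    return torch.cat([amplitude, phase], dim=1)
```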
2312.16486 Report PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion Guansong Lu, Yuanfan Guo, Jianhua Han, Minzhe Niu, Yihan Zeng, Songcen Xu, Zeyi Huang, Zhao Zhong, Wei Zhang, Hang Xu Current large-scale diffusion models represent a giant leap forward in conditional image synthesis, capable of interpreting diverse cues like text, human poses, and edges. However, their reliance on substantial computational resources and extensive data collection remains a bottleneck. On the other hand, the integration of existing diffusion models, each specialized for different controls and operating in unique latent spaces, poses a challenge due to incompatible image resolutions and latent space embedding structures, hindering their joint use. Addressing these constraints, we present "PanGu-Draw", a novel latent diffusion model designed for resource-efficient text-to-image synthesis that adeptly accommodates multiple control signals. We first propose a resource-efficient Time-Decoupling Training Strategy, which splits the monolithic text-to-image model into structure and texture generators. Each generator is trained using a regimen that maximizes data utilization and computational efficiency, cutting data preparation by 48% and reducing training resources by 51%. Secondly, we introduce "Coop-Diffusion", an algorithm that enables the cooperative use of various pre-trained diffusion models with different latent spaces and predefined resolutions within a unified denoising process. This allows for multi-control image synthesis at arbitrary resolutions without the necessity for additional data or retraining. Empirical validations of PanGu-Draw show its exceptional prowess in text-to-image and multi-control image generation, suggesting a promising direction for future model training efficiencies and generation versatility. The largest 5B T2I PanGu-Draw model is released on the Ascend platform. Project page: https://pangu-draw.github.io Presents "PanGu-Draw", a novel latent diffusion model for resource-efficient text-to-image synthesis that supports multiple control signals. Addresses the high computational cost and data requirements of large-scale diffusion models, as well as the challenge of integrating existing models with different controls, resolutions, and latent spaces. Introduces two key innovations: (1) Time-Decoupling Training Strategy: Splits the model into structure and texture generators trained separately for efficiency. (2) Coop-Diffusion Algorithm: Enables cooperative use of pre-trained diffusion models with different latent spaces and resolutions. PanGu-Draw achieves state-of-the-art text-to-image generation quality, surpassing models like DALL-E 2 and SDXL on English benchmarks. It excels in Chinese text-to-image generation, achieving superior scores across FID, IS, and CN-CLIP-score metrics. Coop-Diffusion enables multi-control and multi-resolution image generation by effectively fusing different diffusion models without retraining. The paper primarily focuses on efficiency and quality, with limited exploration of novel control mechanisms. Future work could investigate the generalization of Coop-Diffusion to an even wider range of pre-trained models. text-to-image synthesis, diffusion models, multi-control generation, multi-resolution synthesis, efficient training
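The time-decoupling idea — a structure generator for the early, high-noise timesteps and a texture generator for the late, low-noise ones — can be sketched as a simple routing rule during sampling. The `split_ratio` and the one-call-per-step model interface are assumptions for illustration, not the paper's training schedule.

```python
def time_decoupled_denoise(latent, timesteps, structure_model, texture_model, split_ratio=0.5):
    """Route early (high-noise) denoising steps to the structure generator and
    late (low-noise) steps to the texture generator."""
    cut = int(len(timesteps) * split_ratio)
    for i, t in enumerate(timesteps):
        model = structure_model if i < cut else texture_model
        latent = model(latent, t)   # each callable performs one denoising step
    return latent
```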
2312.16414 Report Bellman Optimal Stepsize Straightening of Flow-Matching Models Bao Nguyen, Binh Nguyen, Viet Anh Nguyen Flow matching is a powerful framework for generating high-quality samples in various applications, especially image synthesis. However, the intensive computational demands of these models, especially during the finetuning process and sampling processes, pose significant challenges for low-resource scenarios. This paper introduces Bellman Optimal Stepsize Straightening (BOSS) technique for distilling flow-matching generative models: it aims specifically for a few-step efficient image sampling while adhering to a computational budget constraint. First, this technique involves a dynamic programming algorithm that optimizes the stepsizes of the pretrained network. Then, it refines the velocity network to match the optimal step sizes, aiming to straighten the generation paths. Extensive experimental evaluations across image generation tasks demonstrate the efficacy of BOSS in terms of both resource utilization and image quality. Our results reveal that BOSS achieves substantial gains in efficiency while maintaining competitive sample quality, effectively bridging the gap between low-resource constraints and the demanding requirements of flow-matching generative models. Our paper also fortifies the responsible development of artificial intelligence, offering a more sustainable generative model that reduces computational costs and environmental footprints. Our code can be found at https://github.com/nguyenngocbaocmt02/BOSS. This paper introduces Bellman Optimal Stepsize Straightening (BOSS), a technique to distill flow-matching generative models for efficient image sampling under low-resource constraints. Flow matching models, while powerful, demand significant computational resources for finetuning and sampling, making them challenging to use in low-resource settings. BOSS aims to bridge this gap. BOSS uses a two-phase approach: 1) a dynamic programming algorithm finds optimal stepsizes for the pretrained model, and 2) the velocity network is retrained to match these optimal stepsizes, straightening the generation paths. Bellman optimal stepsizes substantially improve image quality (lower FID) compared to uniform stepsizes, especially for high-resolution datasets. BOSS achieves comparable or better image quality than standard reflow techniques with significantly fewer retraining iterations (reduced computational cost). Low-Rank Adaptation (LoRA) can be effectively used during the straightening process, achieving competitive results while finetuning only a small fraction of model parameters. The method requires additional training to determine optimal stepsizes. Future work could explore extending BOSS to guided velocity networks and developing computationally cheaper algorithms for stepsize calculation. generative models, flow matching, image generation, efficient sampling, low-resource settings
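The first phase is a dynamic program over candidate step schedules. Below is a generic sketch, assuming a precomputed `local_cost[i, j]` matrix that measures the error of jumping directly from time index i to j along the pretrained flow; the authors' actual cost definition may differ.

```python
import numpy as np

def bellman_stepsizes(local_cost: np.ndarray, num_steps: int):
    """Return the indices of an optimal `num_steps`-step schedule from index 0
    to index N-1, minimizing the summed local jump costs via dynamic programming."""
    N = local_cost.shape[0]
    INF = float("inf")
    dp = np.full((num_steps + 1, N), INF)
    parent = np.full((num_steps + 1, N), -1, dtype=int)
    dp[0, 0] = 0.0
    for k in range(1, num_steps + 1):
        for j in range(1, N):
            for i in range(j):
                cand = dp[k - 1, i] + local_cost[i, j]
                if cand < dp[k, j]:
                    dp[k, j] = cand
                    parent[k, j] = i
    # backtrack from the final time index
    path, j = [N - 1], N - 1
    for k in range(num_steps, 0, -1):
        j = parent[k, j]
        path.append(j)
    return path[::-1]
```

The velocity network is then finetuned so that single jumps between consecutive schedule indices reproduce the fine-grained trajectory (the "straightening" phase).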
2312.16274 Report Towards Flexible, Scalable, and Adaptive Multi-Modal Conditioned Face Synthesis Jingjing Ren, Cheng Xu, Haoyu Chen, Xinran Qin, Lei Zhu Recent progress in multi-modal conditioned face synthesis has enabled the creation of visually striking and accurately aligned facial images. Yet, current methods still face issues with scalability, limited flexibility, and a one-size-fits-all approach to control strength, not accounting for the differing levels of conditional entropy, a measure of unpredictability in data given some condition, across modalities. To address these challenges, we introduce a novel uni-modal training approach with modal surrogates, coupled with an entropy-aware modal-adaptive modulation, to support a flexible, scalable, and adaptive multi-modal conditioned face synthesis network. Our uni-modal training with modal surrogates, which only leverages uni-modal data, uses the modal surrogates to decorate conditions with modal-specific characteristics and to serve as linkers for inter-modal collaboration; it fully learns each modality's control over the face synthesis process as well as inter-modal collaboration. The entropy-aware modal-adaptive modulation finely adjusts the diffusion noise according to modal-specific characteristics and the given conditions, enabling well-informed steps along the denoising trajectory and ultimately leading to synthesis results of high fidelity and quality. Our framework improves multi-modal face synthesis under various conditions, surpassing current methods in image quality and fidelity, as demonstrated by our thorough experimental results. This paper proposes a novel uni-modal training approach with modal surrogates and an entropy-aware modal-adaptive modulation mechanism for scalable, flexible, and adaptive multi-modal conditioned face synthesis. Existing methods suffer from poor scalability, limited flexibility in handling modal combinations, and a lack of adaptivity to the varying control strength required for different modalities. The method uses modal surrogates to decorate conditions with modal-specific characteristics and facilitate inter-modal collaboration. It also dynamically adjusts noise levels based on conditional entropy for each modality, ensuring effective utilization of information from all modalities. The proposed framework demonstrates superior performance in multi-modal face synthesis, outperforming existing methods in terms of image quality and condition alignment. It supports a wide range of face synthesis applications, including diverse uni-modal synthesis and flexible combinations of multi-modal conditions. The method achieves high flexibility and scalability by enabling the synthesis of facial images under various modal combinations within a single sampling process of a unified diffusion model. The method currently relies on pre-trained encoders for certain modalities like text and low-resolution images. Further exploration is needed to extend the approach to even more modalities and higher-resolution image synthesis. multi-modal face synthesis, diffusion models, uni-modal training, modal surrogates, entropy-aware modulation
2312.16272 Report SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation Yuxuan Zhang, Yiren Song, Jiaming Liu, Rui Wang, Jinpeng Yu, Hao Tang, Huaxia Li, Xu Tang, Yao Hu, Han Pan, Zhongliang Jing Recent advancements in subject-driven image generation have led to zero-shot generation, yet precise selection and focus on crucial subject representations remain challenging. Addressing this, we introduce the SSR-Encoder, a novel architecture designed for selectively capturing any subject from single or multiple reference images. It responds to various query modalities including text and masks, without necessitating test-time fine-tuning. The SSR-Encoder combines a Token-to-Patch Aligner that aligns query inputs with image patches and a Detail-Preserving Subject Encoder for extracting and preserving fine features of the subjects, thereby generating subject embeddings. These embeddings, used in conjunction with original text embeddings, condition the generation process. Characterized by its model generalizability and efficiency, the SSR-Encoder adapts to a range of custom models and control modules. Enhanced by the Embedding Consistency Regularization Loss for improved training, our extensive experiments demonstrate its effectiveness in versatile and high-quality image generation, indicating its broad applicability. Project page: https://ssr-encoder.github.io This paper introduces SSR-Encoder, a novel finetuning-free method for selective subject-driven image generation using text or mask queries. Existing methods either lack the flexibility to selectively generate subjects from single or multiple images without test-time fine-tuning or fail to fully capture and leverage the detailed representation of subjects. The SSR-Encoder consists of a Token-to-Patch Aligner for precise query-subject alignment and a Detail-Preserving Subject Encoder for extracting multi-scale subject embeddings. An Embedding Consistency Regularization Loss enhances token-to-patch alignment during training. SSR-Encoder outperforms state-of-the-art finetuning-free methods in subject and image-text alignment, subject exclusivity, and image quality. It demonstrates competitive performance even compared to finetuning-based methods. Ablation studies validate the contribution of each component, showcasing improved expressiveness and precision in subject-driven generation. The fidelity of generated images can be affected by the uneven distribution of training data. Future work includes addressing data distribution limitations and extending the approach to 3D generation. image generation, subject-driven generation, text-to-image, diffusion models, selective representation
2312.16256 Report DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, Xuanmao Li, Xingpeng Sun, Rohan Ashok, Aniruddha Mukherjee, Hao Kang, Xiangrui Kong, Gang Hua, Tianyi Zhang, Bedrich Benes, Aniket Bera We have witnessed significant progress in deep learning-based 3D vision, ranging from neural radiance field (NeRF) based 3D representation learning to applications in novel view synthesis (NVS). However, existing scene-level datasets for deep learning-based 3D vision, limited to either synthetic environments or a narrow selection of real-world scenes, are quite insufficient. This insufficiency not only hinders a comprehensive benchmark of existing methods but also caps what could be explored in deep learning-based 3D analysis. To address this critical gap, we present DL3DV-10K, a large-scale scene dataset, featuring 51.2 million frames from 10,510 videos captured from 65 types of point-of-interest (POI) locations, covering both bounded and unbounded scenes, with different levels of reflection, transparency, and lighting. We conducted a comprehensive benchmark of recent NVS methods on DL3DV-10K, which revealed valuable insights for future research in NVS. In addition, we have obtained encouraging results in a pilot study to learn generalizable NeRF from DL3DV-10K, which manifests the necessity of a large-scale scene-level dataset to forge a path toward a foundation model for learning 3D representation. Our DL3DV-10K dataset, benchmark results, and models will be publicly accessible at https://dl3dv-10k.github.io/DL3DV-10K/. This paper introduces DL3DV-10K, a large-scale, real-world multi-view scene dataset for novel view synthesis and 3D representation learning, containing 51.2 million 4K resolution frames across 65 point-of-interest categories with fine-grained annotations for scene complexity. Existing scene-level datasets are limited in scale and diversity, hindering comprehensive benchmarking of NVS methods and the development of generalizable 3D representation learning models. The dataset was created by capturing high-resolution videos of diverse real-world scenes using consumer mobile devices and drones, followed by a detailed annotation process for scene complexity. Zip-NeRF and 3DGS demonstrated the best overall performance on the benchmark, with Zip-NeRF excelling in most scenarios but consuming more memory. Outdoor (unbounded) scenes and scenes with high-frequency details posed significant challenges for all evaluated NVS methods. Pretraining a generalizable NeRF model on DL3DV-10K significantly improved performance compared to training from scratch or using a smaller dataset, highlighting the dataset's potential for learning universal scene priors. The presence of moving objects in some scenes, inherent to mobile phone video capture, presents challenges for static view synthesis. Future work includes expanding the dataset with dynamic scenes and exploring the development of robust learning-based 3D models for dynamic NVS. novel view synthesis, 3d representation learning, dataset, benchmark, neural radiance fields
2312.16218 Report Hyper-VolTran: Fast and Generalizable One-Shot Image to 3D Object Structure via HyperNetworks Christian Simon, Sen He, Juan-Manuel Perez-Rua, Mengmeng Xu, Amine Benhalloum, Tao Xiang Solving image-to-3D from a single view is an ill-posed problem, and current neural reconstruction methods addressing it through diffusion models still rely on scene-specific optimization, constraining their generalization capability. To overcome the limitations of existing approaches regarding generalization and consistency, we introduce a novel neural rendering technique. Our approach employs the signed distance function as the surface representation and incorporates generalizable priors through geometry-encoding volumes and HyperNetworks. Specifically, our method builds neural encoding volumes from generated multi-view inputs. We adjust the weights of the SDF network conditioned on an input image at test-time to allow model adaptation to novel scenes in a feed-forward manner via HyperNetworks. To mitigate artifacts derived from the synthesized views, we propose the use of a volume transformer module to improve the aggregation of image features instead of processing each viewpoint separately. Through our proposed method, dubbed as Hyper-VolTran, we avoid the bottleneck of scene-specific optimization and maintain consistency across the images generated from multiple viewpoints. Our experiments show the advantages of our proposed approach with consistent results and rapid generation. This paper introduces Hyper-VolTran, a novel neural rendering technique for fast and generalizable single-view 3D reconstruction, employing HyperNetworks and a Volume Transformer. Single-view 3D reconstruction is ill-posed, and existing diffusion-based methods lack generalization due to scene-specific optimization. The method uses a diffusion model to synthesize multi-view images, a HyperNetwork to generate SDF network weights from input image embeddings, and a Volume Transformer (VolTran) for consistent feature aggregation across inconsistent views. Hyper-VolTran demonstrates superior generalization ability compared to existing methods in single image-to-3D reconstruction. The method achieves fast 3D mesh generation in 45 seconds without per-scene optimization. Ablation studies confirm the contribution of both the HyperNetwork and VolTran modules to the overall performance. The performance of Hyper-VolTran relies on the quality and consistency of the multi-view images generated by the diffusion model. Future work could explore incorporating semantic information or alternative 3D representations for enhanced reconstruction accuracy. 3d reconstruction, single-view reconstruction, neural rendering, hypernetworks, diffusion models
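A minimal sketch of the HyperNetwork idea follows: an image embedding is mapped to the weights of a small SDF MLP, which is then evaluated on query points in a feed-forward manner. The layer sizes are illustrative, and the real network is deeper and also conditioned on geometry-encoding volumes.

```python
import torch
import torch.nn as nn

class SDFHyperNetwork(nn.Module):
    """Maps an image embedding to the weights of a tiny one-hidden-layer SDF MLP."""
    def __init__(self, embed_dim=512, in_dim=3, hidden=64):
        super().__init__()
        self.in_dim, self.hidden = in_dim, hidden
        n_params = (in_dim * hidden + hidden) + (hidden * 1 + 1)
        self.head = nn.Sequential(nn.Linear(embed_dim, 256), nn.ReLU(),
                                  nn.Linear(256, n_params))

    def forward(self, embedding, points):
        """embedding: (B, embed_dim); points: (B, P, 3) -> SDF values (B, P, 1)."""
        w = self.head(embedding)
        i = 0
        w1 = w[:, i:i + self.in_dim * self.hidden].view(-1, self.in_dim, self.hidden)
        i += self.in_dim * self.hidden
        b1 = w[:, i:i + self.hidden].unsqueeze(1); i += self.hidden
        w2 = w[:, i:i + self.hidden].view(-1, self.hidden, 1); i += self.hidden
        b2 = w[:, i:i + 1].unsqueeze(1)
        h = torch.relu(points @ w1 + b1)   # per-sample generated weights
        return h @ w2 + b2
```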
2312.16204 Report Iterative Prompt Relabeling for diffusion model with RLDF Jiaxin Ge, Xinyan Chen, Tianjun Zhang, Shanghang Zhang Diffusion models have shown impressive performance in many domains, including image generation, time series prediction, and reinforcement learning. The algorithm demonstrates superior performance over the traditional GAN and transformer-based methods. However, the model's capability to follow natural language instructions (e.g., spatial relationships between objects, generating complex scenes) is still unsatisfactory. This has been an important research area to enhance such capability. Prior works adopt reinforcement learning to adjust the behavior of the diffusion models. However, RL methods not only require careful reward design and complex hyperparameter tuning, but also fail to incorporate rich natural language feedback. In this work, we propose iterative prompt relabeling (IP-RLDF), a novel algorithm that aligns images to text through iterative image sampling and prompt relabeling. IP-RLDF first samples a batch of images conditioned on the text, then relabels the text prompts of unmatched text-image pairs with classifier feedback. We conduct thorough experiments on three different models, including SDv2, GLIGEN, and SDXL, testing their capability to generate images following instructions. With IP-RLDF, we improved up to 15.22% (absolute improvement) on the challenging spatial relation VISOR benchmark, demonstrating superior performance compared to previous RL methods. This paper proposes IP-RLDF, a novel algorithm to improve the spatial understanding and rendering capabilities of text-to-image diffusion models. Current diffusion models struggle to accurately interpret and execute complex instructions involving spatial relationships between objects. IP-RLDF uses an iterative process of: 1) Sampling images from a diffusion model. 2) Using an object detection model to analyze spatial relationships and relabel inaccurate text prompts. 3) Retraining the model with the augmented dataset, iteratively refining its spatial understanding. IP-RLDF achieves up to 15.22% absolute improvement in spatial accuracy on the VISOR benchmark. The algorithm shows consistent improvement across different diffusion models (SDv2, GLIGEN, SDXL) and fine-tuning techniques. Ablation studies confirm that each component (prompt relabeling, iterative training, detection-based reward) contributes to performance gains. The current implementation focuses on object count and spatial layouts; expanding to more complex language feedback is a potential area. Balancing the trade-off between spatial accuracy and maintaining overall image fidelity (CLIP score) requires further investigation. diffusion models, text-to-image generation, spatial understanding, prompt relabeling, reinforcement learning
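The core loop can be sketched as sample → detect → relabel → finetune. Here `model.sample`, `model.finetune`, `detector` (which returns the spatial relation it actually sees), and `relabel` (which rewrites the prompt to match it) are hypothetical stand-ins for the paper's components, not its actual API.

```python
def iterative_prompt_relabeling(model, detector, relabel, prompts, rounds=3):
    """Sketch of the iterative sample -> check -> relabel -> finetune loop."""
    dataset = []
    for _ in range(rounds):
        for prompt in prompts:
            image = model.sample(prompt)
            observed_relation = detector(image)        # e.g. "a cat to the left of a dog"
            # keep the image, but caption it with what is actually depicted
            corrected = relabel(prompt, observed_relation)
            dataset.append((image, corrected))
        model.finetune(dataset)                         # retrain on relabeled pairs
    return model
```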
2312.16197 Report INFAMOUS-NeRF: ImproviNg FAce MOdeling Using Semantically-Aligned Hypernetworks with Neural Radiance Fields Andrew Hou, Feng Liu, Zhiyuan Ren, Michel Sarkis, Ning Bi, Yiying Tong, Xiaoming Liu We propose INFAMOUS-NeRF, an implicit morphable face model that introduces hypernetworks to NeRF to improve the representation power in the presence of many training subjects. At the same time, INFAMOUS-NeRF resolves the classic hypernetwork tradeoff of representation power and editability by learning semantically-aligned latent spaces despite the subject-specific models, all without requiring a large pretrained model. INFAMOUS-NeRF further introduces a novel constraint to improve NeRF rendering along the face boundary. Our constraint can leverage photometric surface rendering and multi-view supervision to guide surface color prediction and improve rendering near the surface. Finally, we introduce a novel, loss-guided adaptive sampling method for more effective NeRF training by reducing the sampling redundancy. We show quantitatively and qualitatively that our method achieves higher representation power than prior face modeling methods in both controlled and in-the-wild settings. Code and models will be released upon publication. INFAMOUS-NeRF, an implicit morphable face model, leverages hypernetworks to learn subject-specific NeRF MLP weights, enhancing representation power while preserving editability through semantically-aligned latent spaces. Existing face models struggle to balance high-fidelity rendering with the capacity to represent diverse subjects and enable editing. INFAMOUS-NeRF addresses this by improving representation power and maintaining editability. The method employs a two-stage approach: 1) a NeRF model with hypernetworks to learn subject-specific MLPs and semantically aligned latent codes for shared attributes like expressions, and 2) a conditional DDPM for novel view refinement. It also introduces a photometric surface constraint for rendering accuracy at face boundaries and an adaptive sampling technique for efficient NeRF training. Achieves state-of-the-art novel view synthesis and 3DMM fitting on FaceScape, FFHQ, and CelebAHQ datasets, demonstrating superior representation power. Successfully transfers expressions between subjects, proving semantic alignment of latent spaces despite using hypernetworks. Demonstrates improved rendering quality, especially at face boundaries, thanks to the novel photometric surface constraint and adaptive sampling. Handling out-of-distribution expressions remains challenging, potentially requiring training data from more diverse datasets. Initial latent code optimization for a new image is computationally expensive, necessitating exploration of faster mapping techniques. face modeling, neural radiance fields, hypernetworks, 3d morphable models, adaptive sampling
2312.16171 Report Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4 Sondos Mahmoud Bsharat, Aidar Myrzakhan, Zhiqiang Shen This paper introduces 26 guiding principles designed to streamline the process of querying and prompting large language models. Our goal is to simplify the underlying concepts of formulating questions for various scales of large language models, examining their abilities, and enhancing user comprehension on the behaviors of different scales of large language models when feeding into different prompts. Extensive experiments are conducted on LLaMA-1/2 (7B, 13B and 70B), GPT-3.5/4 to verify the effectiveness of the proposed principles on instructions and prompts design. We hope that this work can provide a better guide for researchers working on the prompting of large language models. Project page is available at https://github.com/VILA-Lab/ATLAS. This paper introduces 26 guiding principles for crafting effective prompts for large language models (LLMs). The goal is to simplify the process of prompting LLMs, enhance users' understanding of their behavior, and ultimately improve the quality of LLM responses. The authors conducted extensive experiments on various LLM scales (LLaMA-1/2, GPT-3.5/4) using the manually designed ATLAS benchmark to evaluate the effectiveness of the proposed principles. The principled prompts led to an average improvement of 57.7% in LLM response quality and 36.4% in accuracy on GPT-4. Larger models exhibited greater performance gains from the principled prompts, exceeding 20% improvement when moving from LLaMA-2-7B to GPT-4. The principles consistently improved the conciseness, factuality, and clarity of LLM responses across different scales. The effectiveness of the principles may be limited when applied to highly complex or specialized questions. The evaluation was conducted on a limited set of questions and LLM architectures, potentially affecting the generalizability of the findings. large language models, prompt engineering, prompting principles, llm evaluation, atlas benchmark
2312.16145 Report One-Dimensional Adapter to Rule Them All: Concepts, Diffusion Models and Erasing Applications Mengyao Lyu, Yuhong Yang, Haiwen Hong, Hui Chen, Xuan Jin, Yuan He, Hui Xue, Jungong Han, Guiguang Ding The prevalent use of commercial and open-source diffusion models (DMs) for text-to-image generation prompts risk mitigation to prevent undesired behaviors. Existing concept erasing methods in academia are all based on full parameter or specification-based fine-tuning, from which we observe the following issues: 1) Generation alternation towards erosion: Parameter drift during target elimination causes alternations and potential deformations across all generations, even eroding other concepts at varying degrees, which is more evident with multi-concept erased; 2) Transfer inability & deployment inefficiency: Previous model-specific erasure impedes the flexible combination of concepts and the training-free transfer towards other models, resulting in linear cost growth as the deployment scenarios increase. To achieve non-invasive, precise, customizable, and transferable elimination, we ground our erasing framework on one-dimensional adapters to erase multiple concepts from most DMs at once across versatile erasing applications. The concept-SemiPermeable structure is injected as a Membrane (SPM) into any DM to learn targeted erasing, and meantime the alteration and erosion phenomenon is effectively mitigated via a novel Latent Anchoring fine-tuning strategy. Once obtained, SPMs can be flexibly combined and plug-and-play for other DMs without specific re-tuning, enabling timely and efficient adaptation to diverse scenarios. During generation, our Facilitated Transport mechanism dynamically regulates the permeability of each SPM to respond to different input prompts, further minimizing the impact on other concepts. Quantitative and qualitative results across ~40 concepts, 7 DMs and 4 erasing applications have demonstrated the superior erasing of SPM. Our code and pre-tuned SPMs are available on the project page https://lyumengyao.github.io/projects/spm. This paper introduces SPM, a one-dimensional adapter framework for erasing concepts from pre-trained diffusion models (DMs) in a precise, customizable, and transferable manner. Existing concept erasing methods often lead to undesirable generation alterations and concept erosion, especially when multiple concepts are erased. They also lack transferability across different DM architectures. The framework utilizes 1-dim SPMs injected into DMs to learn concept-specific semi-permeability. It employs Latent Anchoring during training to preserve non-target concepts and Facilitated Transport during inference to dynamically regulate SPM activation based on input prompts. SPMs successfully erase concrete objects, abstract styles, sexual content, and memorized images while minimizing impact on non-target generations. The method effectively mitigates generation alterations and alleviates concept erosion, even with multiple concepts erased. SPMs exhibit training-free transferability to other DMs, enabling efficient adaptation to diverse models and regulatory requirements. Challenges remain in precisely defining and erasing interconnected concepts with nuanced attributes. Further research is needed to enhance the robustness of nudity removal, especially when transferred to community-trained DMs. concept erasing, diffusion models, generative safety, parameter-efficient fine-tuning, transfer learning
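One minimal reading of a "one-dimensional adapter" is a rank-1 residual branch wrapped around a frozen layer whose contribution can be gated at inference (the Facilitated Transport idea). The module below is a sketch under that assumption, not the released SPM code; in practice the permeability would be computed from the similarity between the input prompt and the erased concept.

```python
import torch
import torch.nn as nn

class OneDimAdapter(nn.Module):
    """Rank-1 residual adapter around a frozen linear layer of a diffusion UNet."""
    def __init__(self, frozen_linear: nn.Linear):
        super().__init__()
        self.base = frozen_linear
        for p in self.base.parameters():
            p.requires_grad_(False)                    # keep the pretrained weights intact
        self.down = nn.Linear(frozen_linear.in_features, 1, bias=False)
        self.up = nn.Linear(1, frozen_linear.out_features, bias=False)

    def forward(self, x, permeability: float = 1.0):
        # permeability = 0 reproduces the original model exactly
        return self.base(x) + permeability * self.up(self.down(x))
```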
2312.16109 Report fMPI: Fast Novel View Synthesis in the Wild with Layered Scene Representations Jonas Kohler, Nicolas Griffiths Sanchez, Luca Cavalli, Catherine Herold, Albert Pumarola, Alberto Garcia Garcia, Ali Thabet In this study, we propose two novel input processing paradigms for novel view synthesis (NVS) methods based on layered scene representations that significantly improve their runtime without compromising quality. Our approach identifies and mitigates the two most time-consuming aspects of traditional pipelines: building and processing the so-called plane sweep volume (PSV), which is a high-dimensional tensor of planar re-projections of the input camera views. In particular, we propose processing this tensor in parallel groups for improved compute efficiency as well as super-sampling adjacent input planes to generate denser, and hence more accurate scene representation. The proposed enhancements offer significant flexibility, allowing for a balance between performance and speed, thus making substantial steps toward real-time applications. Furthermore, they are very general in the sense that any PSV-based method can make use of them, including methods that employ multiplane images, multisphere images, and layered depth images. In a comprehensive set of experiments, we demonstrate that our proposed paradigms enable the design of an NVS method that achieves state-of-the-art on public benchmarks while being up to $50x$ faster than existing state-of-the-art methods. It also beats the current forerunner in terms of speed by over $3x$, while achieving significantly better rendering quality. This paper introduces two novel input processing methods for layered scene representation-based novel view synthesis (NVS) to significantly improve runtime performance without sacrificing quality. Existing NVS methods, while effective, suffer from high computational complexity, hindering their application in real-time scenarios like VR and immersive telepresence. The authors propose (1) Plane Grouping: splitting the computationally expensive plane sweep volume (PSV) into groups for parallel processing and (2) Plane Super-Sampling: enabling the network to leverage PSV redundancies and predict denser MPIs from a sparser input, reducing computation. The proposed 'fast MPI' method achieves state-of-the-art quality on public NVS benchmarks while being up to 50x faster than existing methods. Plane Grouping shows superior performance compared to processing planes independently or jointly, enabling an optimal speed-performance trade-off. Super-Sampling significantly reduces runtime by predicting denser MPIs from sparser PSVs without compromising quality. The method lacks temporal consistency, potentially leading to inconsistencies in video view synthesis. Memory requirements for layered representations remain high, posing challenges for resource-constrained environments. novel view synthesis, multiplane images, layered scene representation, real-time rendering, computer vision
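Plane grouping amounts to reshaping the plane sweep volume so that groups of adjacent planes become independent batch elements a 2D network can process in parallel. The tensor layout below is an assumption for illustration; the paper's grouping and the super-sampling head are more involved.

```python
import torch

def group_psv(psv: torch.Tensor, group_size: int) -> torch.Tensor:
    """psv: (B, P, C, H, W) with P planes.
    Returns (B * P // group_size, group_size * C, H, W), i.e. each plane group
    is stacked on the channel axis and treated as its own batch element."""
    B, P, C, H, W = psv.shape
    assert P % group_size == 0, "number of planes must be divisible by the group size"
    groups = psv.view(B, P // group_size, group_size * C, H, W)
    return groups.flatten(0, 1)
```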
2312.16084 Report LangSplat: 3D Language Gaussian Splatting Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, Hanspeter Pfister Humans live in a 3D world and commonly use natural language to interact with a 3D scene. Modeling a 3D language field to support open-ended language queries in 3D has gained increasing attention recently. This paper introduces LangSplat, which constructs a 3D language field that enables precise and efficient open-vocabulary querying within 3D spaces. Unlike existing methods that ground CLIP language embeddings in a NeRF model, LangSplat advances the field by utilizing a collection of 3D Gaussians, each encoding language features distilled from CLIP, to represent the language field. By employing a tile-based splatting technique for rendering language features, we circumvent the costly rendering process inherent in NeRF. Instead of directly learning CLIP embeddings, LangSplat first trains a scene-wise language autoencoder and then learns language features on the scene-specific latent space, thereby alleviating substantial memory demands imposed by explicit modeling. Existing methods struggle with imprecise and vague 3D language fields, which fail to discern clear boundaries between objects. We delve into this issue and propose to learn hierarchical semantics using SAM, thereby eliminating the need for extensively querying the language field across various scales and the regularization of DINO features. Extensive experimental results show that LangSplat significantly outperforms the previous state-of-the-art method LERF by a large margin. Notably, LangSplat is extremely efficient, achieving a 199x speedup compared to LERF at the resolution of 1440x1080. We strongly recommend readers to check out our video results at https://langsplat.github.io/ This paper introduces LangSplat, a novel method for building 3D language fields using 3D Gaussian Splatting, enabling fast and accurate open-vocabulary querying in 3D scenes. Modeling 3D language fields allows for versatile interaction with 3D scenes using natural language, benefiting applications like robotics, autonomous driving, and AR/VR. LangSplat leverages 3D Gaussian Splatting for efficient rendering, incorporates a scene-specific language autoencoder for memory efficiency, and employs SAM for accurate and hierarchical semantic learning. LangSplat significantly outperforms previous state-of-the-art methods like LERF in open-vocabulary 3D object localization and semantic segmentation tasks. The method exhibits remarkable speed improvements, achieving up to 199x faster query times compared to LERF. LangSplat effectively learns a precise 3D language field, as demonstrated by its ability to accurately capture object boundaries and reduce noise in segmentation results. The current implementation relies on a pre-trained SAM model, which may limit its generalizability to unseen object categories. Future work could explore incorporating temporal information for dynamic scene understanding and interaction. 3d language field, open-vocabulary querying, 3d gaussian splatting, segment anything model (sam), scene understanding
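The scene-wise language autoencoder can be sketched as a small MLP pair that compresses 512-d CLIP features into a few latent dimensions stored per Gaussian, with a decoder mapping rendered latents back to CLIP space at query time. The 3-d latent and hidden sizes are illustrative choices, not the paper's exact configuration.

```python
import torch.nn as nn

class LanguageAutoencoder(nn.Module):
    """Compress CLIP features to a small per-scene latent so each 3D Gaussian
    only needs to store a few extra values."""
    def __init__(self, clip_dim=512, latent_dim=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(clip_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, clip_dim))

    def forward(self, clip_features):
        latent = self.encoder(clip_features)
        return self.decoder(latent), latent   # reconstruction loss trains both halves
```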
2312.16047 Report 2D-Guided 3D Gaussian Segmentation Kun Lan, Haoran Li, Haolin Shi, Wenjun Wu, Yong Liao, Lin Wang, Pengyuan Zhou Recently, 3D Gaussian, as an explicit 3D representation method, has demonstrated strong competitiveness over NeRF (Neural Radiance Fields) in terms of expressing complex scenes and training duration. These advantages signal a wide range of applications for 3D Gaussians in 3D understanding and editing. Meanwhile, the segmentation of 3D Gaussians is still in its infancy. The existing segmentation methods are not only cumbersome but also incapable of segmenting multiple objects simultaneously in a short amount of time. In response, this paper introduces a 3D Gaussian segmentation method implemented with 2D segmentation as supervision. This approach uses input 2D segmentation maps to guide the learning of the added 3D Gaussian semantic information, while nearest neighbor clustering and statistical filtering refine the segmentation results. Experiments show that our concise method can achieve comparable performances on mIOU and mAcc for multi-object segmentation as previous single-object segmentation methods. This paper introduces a novel 3D Gaussian segmentation method guided by 2D segmentation, enhancing efficiency and accuracy in multi-object segmentation within 3D scenes. Existing 3D Gaussian segmentation methods are either computationally intensive or incapable of segmenting multiple objects efficiently. This work addresses these limitations, aiming for a fast and accurate multi-object segmentation approach. The method leverages pre-trained 2D segmentation models to guide the learning of semantic information (object code) attached to 3D Gaussians. It then employs KNN clustering to refine semantic information and optionally uses statistical filtering to remove erroneously segmented Gaussians. The method achieves comparable mean Intersection over Union (mIOU) and mean Accuracy (mAcc) to previous single-object segmentation techniques while enabling multi-object segmentation. It demonstrates superior detail preservation compared to NeRF-based segmentation methods due to the explicit representation of 3D Gaussians. The approach is efficient, requiring less than two minutes for semantic information learning and 1-2 seconds for multi-object segmentation from a given viewpoint. The method's reliance on 2D segmentation maps might limit its performance in scenarios where 2D segmentation is challenging. Future work can explore incorporating depth information to further enhance segmentation accuracy in complex scenes. 3d gaussian, 3d segmentation, 2d segmentation guidance, knn clustering, statistical filtering
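A generic sketch of the refinement stage follows, assuming per-Gaussian positions and hard integer object labels and using SciPy's KD-tree; the paper's exact voting scheme and filtering thresholds may differ.

```python
import numpy as np
from scipy.spatial import cKDTree

def refine_labels(positions: np.ndarray, labels: np.ndarray, k: int = 8, std_factor: float = 2.0):
    """Majority-vote each Gaussian's label among its k nearest neighbours, then
    flag Gaussians whose mean neighbour distance is a statistical outlier."""
    tree = cKDTree(positions)
    dists, idx = tree.query(positions, k=k + 1)   # first neighbour is the point itself
    votes = labels[idx[:, 1:]]
    refined = np.array([np.bincount(v).argmax() for v in votes])
    mean_d = dists[:, 1:].mean(axis=1)
    keep = mean_d < mean_d.mean() + std_factor * mean_d.std()
    return refined, keep
```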
2312.15980 Report HarmonyView: Harmonizing Consistency and Diversity in One-Image-to-3D Sangmin Woo, Byeongjun Park, Hyojun Go, Jin-Young Kim, Changick Kim Recent progress in single-image 3D generation highlights the importance of multi-view coherency, leveraging 3D priors from large-scale diffusion models pretrained on Internet-scale images. However, the aspect of novel-view diversity remains underexplored within the research landscape due to the ambiguity in converting a 2D image into 3D content, where numerous potential shapes can emerge. Here, we aim to address this research gap by simultaneously addressing both consistency and diversity. Yet, striking a balance between these two aspects poses a considerable challenge due to their inherent trade-offs. This work introduces HarmonyView, a simple yet effective diffusion sampling technique adept at decomposing two intricate aspects in single-image 3D generation: consistency and diversity. This approach paves the way for a more nuanced exploration of the two critical dimensions within the sampling process. Moreover, we propose a new evaluation metric based on CLIP image and text encoders to comprehensively assess the diversity of the generated views, which closely aligns with human evaluators' judgments. In experiments, HarmonyView achieves a harmonious balance, demonstrating a win-win scenario in both consistency and diversity. HarmonyView, a novel diffusion sampling technique for single-image 3D generation, balances multi-view consistency and novel-view diversity. Balancing consistency and diversity is crucial for high-quality 3D generation from single images, but existing methods struggle to optimize both aspects effectively. HarmonyView decomposes the diffusion sampling process using two implicit classifiers to guide visual consistency with the input view and diversity in novel views, achieving a harmonious balance. HarmonyView outperforms state-of-the-art methods in novel-view synthesis and 3D reconstruction tasks across quantitative metrics. HarmonyView generates high-quality, coherent 3D meshes even for complex objects and scenes. A newly proposed metric, CD score, effectively quantifies novel-view diversity and aligns well with human evaluator judgments. Completely eliminating the trade-off between consistency and diversity remains a challenge. Expanding HarmonyView to handle multi-object scenes with complex backgrounds needs further research. 3d generation, diffusion models, multi-view consistency, novel-view diversity, single-image 3d reconstruction
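HarmonyView's decomposition can be pictured as combining two implicit-classifier (classifier-free-guidance-style) terms with separate scales, one steering consistency with the input view and one controlling diversity. The conditioning split and scale values below are illustrative, not the paper's exact formulation.

```python
def harmonized_noise_estimate(eps_uncond, eps_image, eps_image_text, w_c=2.0, w_d=1.0):
    """Combine unconditional, image-conditioned, and image+text-conditioned noise
    predictions with independent guidance scales for the two aspects."""
    return (eps_uncond
            + w_c * (eps_image - eps_uncond)          # consistency with the input view
            + w_d * (eps_image_text - eps_image))     # diversity-related guidance
```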
2312.15905 Report Cross Initialization for Personalized Text-to-Image Generation Lianyu Pang, Jian Yin, Haoran Xie, Qiping Wang, Qing Li, Xudong Mao Recently, there has been a surge in face personalization techniques, benefiting from the advanced capabilities of pretrained text-to-image diffusion models. Among these, a notable method is Textual Inversion, which generates personalized images by inverting given images into textual embeddings. However, methods based on Textual Inversion still struggle with balancing the trade-off between reconstruction quality and editability. In this study, we examine this issue through the lens of initialization. Upon closely examining traditional initialization methods, we identified a significant disparity between the initial and learned embeddings in terms of both scale and orientation. The scale of the learned embedding can be up to 100 times greater than that of the initial embedding. Such a significant change in the embedding could increase the risk of overfitting, thereby compromising the editability. Driven by this observation, we introduce a novel initialization method, termed Cross Initialization, that significantly narrows the gap between the initial and learned embeddings. This method not only improves both reconstruction and editability but also reduces the optimization steps from 5000 to 320. Furthermore, we apply a regularization term to keep the learned embedding close to the initial embedding. We show that when combined with Cross Initialization, this regularization term can effectively improve editability. We provide comprehensive empirical evidence to demonstrate the superior performance of our method compared to the baseline methods. Notably, in our experiments, Cross Initialization is the only method that successfully edits an individual's facial expression. Additionally, a fast version of our method allows for capturing an input image in roughly 26 seconds, while surpassing the baseline methods in terms of both reconstruction and editability. Code will be made publicly available. The paper proposes a new initialization method named "Cross Initialization" for personalized text-to-image generation using diffusion models, specifically addressing the overfitting issue observed in Textual Inversion. Textual Inversion, a popular method for personalizing text-to-image generation, often suffers from overfitting, limiting its ability to generate images that accurately reflect both the input concept and user prompt. This paper seeks to solve this issue by improving the initialization of the process. The method leverages the observation that learned textual embeddings tend to align with the output of the CLIP text encoder. Thus, it initializes the textual embedding with the output of the text encoder, fed with a mean embedding derived from a set of well-known names. Additionally, a regularization term is used to keep the learned embedding close to the initial embedding during optimization. Cross Initialization significantly reduces the optimization time compared to Textual Inversion (from 106 minutes to 6 minutes). It demonstrates superior performance in both identity preservation and prompt similarity compared to baseline methods like DreamBooth, NeTI, and Celeb Basis. A fast version of the method allows for learning a new concept in only 26 seconds while surpassing baselines in reconstruction and editability. The effectiveness of Cross Initialization for general concepts beyond human faces needs further investigation. Future work will focus on exploring the applicability of the method to a broader range of concepts. text-to-image generation, diffusion models, textual inversion, personalization, cross initialization
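A sketch of the initialization and regularization, assuming access to the token embeddings of a name list and a callable `text_encoder`; the exact token placement and encoder call are simplified here relative to the paper.

```python
import torch

def cross_initialization(text_encoder, celeb_token_embeddings: torch.Tensor) -> torch.Tensor:
    """Initialize the learnable token as the text encoder's *output* for the mean
    of a set of name embeddings. celeb_token_embeddings: (num_names, C)."""
    mean_embed = celeb_token_embeddings.mean(dim=0, keepdim=True)   # (1, C)
    with torch.no_grad():
        init = text_encoder(mean_embed)                              # (1, C)
    return init.clone().requires_grad_(True)

def regularized_loss(diffusion_loss, learned_embed, init_embed, lam=1e-4):
    """Keep the learned embedding close to its initialization during optimization."""
    return diffusion_loss + lam * torch.norm(learned_embed - init_embed) ** 2
```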
2312.15895 Report Semantic-aware SAM for Point-Prompted Instance Segmentation Zhaoyang Wei, Pengfei Chen, Xuehui Yu, Guorong Li, Jianbin Jiao, Zhenjun Han Single-point annotation in visual tasks, with the goal of minimizing labelling costs, is becoming increasingly prominent in research. Recently, visual foundation models, such as Segment Anything (SAM), have gained widespread usage due to their robust zero-shot capabilities and exceptional annotation performance. However, SAM's class-agnostic output and high confidence in local segmentation introduce 'semantic ambiguity', posing a challenge for precise category-specific segmentation. In this paper, we introduce a cost-effective category-specific segmenter using SAM. To tackle this challenge, we have devised a Semantic-Aware Instance Segmentation Network (SAPNet) that integrates Multiple Instance Learning (MIL) with matching capability and SAM with point prompts. SAPNet strategically selects the most representative mask proposals generated by SAM to supervise segmentation, with a specific focus on object category information. Moreover, we introduce the Point Distance Guidance and Box Mining Strategy to mitigate inherent challenges: 'group' and 'local' issues in weakly supervised segmentation. These strategies serve to further enhance the overall segmentation performance. The experimental results on Pascal VOC and COCO demonstrate the promising performance of our proposed SAPNet, emphasizing its semantic matching capabilities and its potential to advance point-prompted instance segmentation. The code will be made publicly available. This paper introduces SAPNet, a novel end-to-end point-prompted instance segmentation framework that leverages the power of visual foundation models like Segment Anything (SAM) while overcoming their limitations in semantic understanding for precise category-specific segmentation. Instance segmentation often requires costly pixel-level annotations. This paper addresses this challenge by proposing a cost-effective category-specific segmenter that utilizes point annotations, significantly reducing annotation costs while maintaining competitive performance compared to fully-supervised methods. SAPNet integrates SAM with point prompts and a dual-branch selection mechanism to choose the most semantically representative mask proposals. It introduces Point Distance Guidance (PDG) and a Positive-Negative Proposals Generator (PNPG) to tackle semantic ambiguity and localization errors, further refined by a Box Mining Strategy (BMS). SAPNet achieves state-of-the-art performance in Point-Prompted Instance Segmentation (PPIS), significantly outperforming previous methods on COCO and VOC2012 benchmarks. The proposed method effectively addresses the semantic ambiguity of SAM and the localization challenges in MIL-based selection, leading to high-quality segmentation results. SAPNet exhibits strong performance even with limited annotation, bridging the gap between point-prompted and fully-supervised instance segmentation techniques. The performance of SAPNet might be further improved by exploring different visual backbones or integrating more advanced prompting techniques. Future work could investigate the generalization ability of SAPNet on other complex datasets and real-world applications with more challenging scenarios. instance segmentation, point supervision, weakly supervised learning, visual foundation models, segment anything (sam)
2312.15770 Report A Recipe for Scaling up Text-to-Video Generation with Text-free Videos Xiang Wang, Shiwei Zhang, Hangjie Yuan, Zhiwu Qing, Biao Gong, Yingya Zhang, Yujun Shen, Changxin Gao, Nong Sang Diffusion-based text-to-video generation has witnessed impressive progress in the past year yet still falls behind text-to-image generation. One of the key reasons is the limited scale of publicly available data (e.g., 10M video-text pairs in WebVid10M vs. 5B image-text pairs in LAION), considering the high cost of video captioning. Instead, it could be far easier to collect unlabeled clips from video platforms like YouTube. Motivated by this, we come up with a novel text-to-video generation framework, termed TF-T2V, which can directly learn with text-free videos. The rationale behind is to separate the process of text decoding from that of temporal modeling. To this end, we employ a content branch and a motion branch, which are jointly optimized with weights shared. Following such a pipeline, we study the effect of doubling the scale of training set (i.e., video-only WebVid10M) with some randomly collected text-free videos and are encouraged to observe the performance improvement (FID from 9.67 to 8.19 and FVD from 484 to 441), demonstrating the scalability of our approach. We also find that our model could enjoy sustainable performance gain (FID from 8.19 to 7.64 and FVD from 441 to 366) after reintroducing some text labels for training. Finally, we validate the effectiveness and generalizability of our ideology on both native text-to-video generation and compositional video synthesis paradigms. Code and models will be publicly available at https://tf-t2v.github.io/. This paper introduces TF-T2V, a novel text-to-video generation framework that can be trained directly on text-free videos by separating temporal and spatial modeling in video diffusion models. Current text-to-video generation methods are limited by the scarcity of large-scale video-text datasets. TF-T2V addresses this by leveraging readily available text-free videos, opening possibilities for improved scalability and applicability. TF-T2V utilizes a two-branch architecture: a content branch trained on image-text data for spatial appearance and a motion branch trained on text-free videos for temporal dynamics. The model jointly optimizes both branches and incorporates a temporal coherence loss to ensure smooth transitions between frames. TF-T2V achieves state-of-the-art performance on text-to-video generation benchmarks, outperforming methods trained on labeled video-text datasets. Scaling the training set with additional text-free videos leads to consistent performance improvement, demonstrating the method's scalability. TF-T2V effectively incorporates into compositional video synthesis frameworks, enabling control over video generation using depth, sketch, and motion vectors. The scaling experiments are limited to doubling the dataset size due to computational constraints, leaving larger-scale scalability unexplored. The paper primarily focuses on short video generation, with future work aimed at extending the method to long video sequences. text-to-video generation, video diffusion models, text-free video learning, compositional video synthesis, temporal coherence
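The temporal coherence idea can be sketched as matching frame-to-frame differences between the predicted and reference clips, which encourages smooth transitions without requiring captions. This generic loss is an assumption about the form, not the paper's exact objective.

```python
import torch

def temporal_coherence_loss(pred_frames: torch.Tensor, target_frames: torch.Tensor) -> torch.Tensor:
    """pred_frames, target_frames: (B, T, C, H, W). Penalize mismatch between
    consecutive-frame differences of the prediction and the reference clip."""
    pred_delta = pred_frames[:, 1:] - pred_frames[:, :-1]
    target_delta = target_frames[:, 1:] - target_frames[:, :-1]
    return torch.mean((pred_delta - target_delta) ** 2)
```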
2312.15736 Report Towards Real-World Blind Face Restoration with Generative Diffusion Prior Xiaoxu Chen, Jingfan Tan, Tao Wang, Kaihao Zhang, Wenhan Luo, Xiaochun Cao Blind face restoration is an important task in computer vision and has gained significant attention due to its wide-range applications. Previous works mainly exploit facial priors to restore face images and have demonstrated high-quality results. However, generating faithful facial details remains a challenging problem due to the limited prior knowledge obtained from finite data. In this work, we delve into the potential of leveraging the pretrained Stable Diffusion for blind face restoration. We propose BFRffusion which is thoughtfully designed to effectively extract features from low-quality face images and could restore realistic and faithful facial details with the generative prior of the pretrained Stable Diffusion. In addition, we build a privacy-preserving face dataset called PFHQ with balanced attributes like race, gender, and age. This dataset can serve as a viable alternative for training blind face restoration networks, effectively addressing privacy and bias concerns usually associated with the real face datasets. Through an extensive series of experiments, we demonstrate that our BFRffusion achieves state-of-the-art performance on both synthetic and real-world public testing datasets for blind face restoration and our PFHQ dataset is an available resource for training blind face restoration networks. The codes, pretrained models, and dataset are released at https://github.com/chenxx89/BFRffusion. This paper proposes BFRffusion, a blind face restoration method leveraging the generative prior of pretrained Stable Diffusion, and introduces PFHQ, a privacy-preserving face dataset with balanced attributes. Blind face restoration is essential for various applications but faces challenges in generating faithful details and ethical concerns with real face datasets. BFRffusion utilizes a four-module architecture (SDRM, MFEM, TTPM, PDUM) to extract features from low-quality images and guide the restoration process with Stable Diffusion priors. PFHQ is constructed using ControlNet with face parsing maps for image generation and carefully selected for balanced attributes. BFRffusion achieves state-of-the-art performance on synthetic and real-world datasets for blind face restoration. The proposed multi-scale feature extraction module and trainable time-aware prompt module effectively improve restoration quality and efficiency. PFHQ dataset demonstrates comparable performance to real face datasets while addressing privacy and bias concerns. BFRffusion faces challenges in restoring severely degraded images and handling watermarks. Future work includes developing low-cost training strategies and exploring more practical synthetic data methods. blind face restoration, diffusion models, generative prior, face dataset, privacy-preserving
2312.15715 Report UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces Jiannan Wu, Yi Jiang, Bin Yan, Huchuan Lu, Zehuan Yuan, Ping Luo The reference-based object segmentation tasks, namely referring image segmentation (RIS), few-shot image segmentation (FSS), referring video object segmentation (RVOS), and video object segmentation (VOS), aim to segment a specific object by utilizing either language or annotated masks as references. Despite significant progress in each respective field, current methods are task-specifically designed and developed in different directions, which hinders the activation of multi-task capabilities for these tasks. In this work, we end the current fragmented situation and propose UniRef++ to unify the four reference-based object segmentation tasks with a single architecture. At the heart of our approach is the proposed UniFusion module which performs multiway-fusion for handling different tasks with respect to their specified references. And a unified Transformer architecture is then adopted for achieving instance-level segmentation. With the unified designs, UniRef++ can be jointly trained on a broad range of benchmarks and can flexibly complete multiple tasks at run-time by specifying the corresponding references. We evaluate our unified models on various benchmarks. Extensive experimental results indicate that our proposed UniRef++ achieves state-of-the-art performance on RIS and RVOS, and performs competitively on FSS and VOS with a parameter-shared network. Moreover, we showcase that the proposed UniFusion module could be easily incorporated into the current advanced foundation model SAM and obtain satisfactory results with parameter-efficient finetuning. Codes and models are available at \url{https://github.com/FoundationVision/UniRef}. UniRef++ is a unified model capable of performing four reference-based object segmentation tasks (RIS, FSS, RVOS, and VOS) with the same model weights. Current methods for these tasks are task-specific, requiring separate training and leading to redundant parameters. A unified model promotes synergy between tasks, reduces computational costs, and allows for flexible multi-task execution. UniRef++ leverages a UniFusion module to inject reference information (language or mask) into visual features. A unified Transformer architecture then performs instance-level segmentation. The model is trained jointly on datasets across all four tasks. Achieves state-of-the-art performance on RIS and RVOS. Performs competitively on FSS and VOS with a single parameter-shared network. Demonstrates efficiency for long-term video segmentation. Performance on FSS slightly lower than specialized models due to data scale. Future work includes exploring the combination of UniFusion with other foundation models. unified model, reference-based segmentation, referring image segmentation, few-shot segmentation, video object segmentation
2312.15707 Report High-Fidelity Diffusion-based Image Editing Chen Hou, Guoqiang Wei, Zhibo Chen Diffusion models have attained remarkable success in the domains of image generation and editing. It is widely recognized that employing more inversion and denoising steps in diffusion models leads to improved image reconstruction quality. However, the editing performance of diffusion models does not improve correspondingly, even with more denoising steps. The deficiency in editing could be attributed to the conditional Markovian property of the editing process, where errors accumulate throughout denoising steps. To tackle this challenge, we first propose an innovative framework where a rectifier module is incorporated to modulate diffusion model weights with residual features, thereby providing compensatory information to bridge the fidelity gap. Furthermore, we introduce a novel learning paradigm aimed at minimizing error propagation during the editing process, which trains the editing procedure in a manner similar to denoising score-matching. Extensive experiments demonstrate that our proposed framework and training strategy achieve high-fidelity reconstruction and editing results across various levels of denoising steps, while exhibiting exceptional performance in both quantitative metrics and qualitative assessments. Moreover, we explore our model's generalization through several applications like image-to-image translation and out-of-domain image editing. This paper proposes a novel method to enhance the fidelity of image reconstruction and editing in diffusion models by introducing a rectifier module and a new editing training paradigm. Existing diffusion-based editing methods suffer from distortion and low fidelity, particularly with increasing denoising steps, due to error accumulation. The method utilizes a hypernetwork-based rectifier to modulate diffusion model weights with residual features, bridging the fidelity gap. It also trains the editing process like denoising score matching, minimizing error propagation during editing. The proposed method achieves high-fidelity reconstruction and editing results across various levels of denoising steps. The rectifier module proves beneficial for other diffusion-based tasks like image-to-image translation. The method generalizes well to out-of-domain images without requiring fine-tuning. The paper mainly focuses on semantic editing, leaving exploration of other editing types for future work. Some attributes remain challenging to edit due to their low frequency in training data. diffusion models, image editing, image reconstruction, fidelity enhancement, score matching
2312.15681 Report Partial Fine-Tuning: A Successor to Full Fine-Tuning for Vision Transformers Peng Ye, Yongqi Huang, Chongjun Tu, Minglei Li, Tao Chen, Tong He, Wanli Ouyang Fine-tuning pre-trained foundation models has gained significant popularity in various research fields. Existing methods for fine-tuning can be roughly divided into two categories, namely Parameter-Efficient Fine-Tuning and High-Performance Fine-Tuning. The former aims at improving efficiency, while the latter focuses on enhancing performance. Beyond these methods, we demonstrate that Partial Fine-Tuning can be an innovative and promising direction capable of concurrently enhancing both efficiency and accuracy. We first validate eight manually-defined partial fine-tuning strategies across kinds of datasets and vision transformer architectures, and find that some partial fine-tuning strategies (e.g., ffn only or attention only) can achieve better performance with fewer tuned parameters than full fine-tuning, and selecting appropriate layers is critical to partial fine-tuning. Thus, we propose a novel fine-tuned angle metric to guide the selection of appropriate layers for partial fine-tuning, making it flexible to be adapted to various scenarios for more practicable partial fine-tuning. Additionally, we show that partial fine-tuning can serve as a new dimension for Model Soups, improving both the model performance and generalization with fewer tuned parameters. Comprehensive experiments on a wide range of datasets and models validate the great potential of partial fine-tuning. This paper explores the potential of partial fine-tuning for improving both the performance and parameter efficiency of pre-trained models, introducing a novel approach called Fine-tuned Angle guided Partial Fine-Tuning (FAPFT). Fine-tuning large pre-trained models is computationally expensive. This paper explores how to improve efficiency and achieve better performance than full fine-tuning by selectively fine-tuning parts of the models. The paper explores the effectiveness of manually defined partial fine-tuning strategies and then proposes FAPFT, which uses a fine-tuned angle metric to quantify the impact of training on different model layers. FAPFT selects layers with large (challenging datasets) or small (easy datasets) fine-tuned angles for fine-tuning. Partial fine-tuning, particularly of specific functional layers (e.g., attention or FFN), can achieve comparable or even better performance than full fine-tuning with fewer parameters. The position of the fine-tuned layers significantly impacts performance. FAPFT, guided by the fine-tuned angle metric, outperforms other methods on various datasets (CIFAR-100, ImageNet-1K, FGVC) and architectures (ViT, Swin, ConvNeXt, AS-MLP), demonstrating both high accuracy and parameter efficiency. The current FAPFT requires fully fine-tuning the model for several epochs to compute the fine-tuned angle, incurring additional computational costs. The paper mainly focuses on image classification tasks. partial fine-tuning, fine-tuned angle metric, parameter efficiency, model soups, vision transformers
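The fine-tuned angle described above can be illustrated as the angle between a layer's flattened weights before and after a brief full fine-tuning run; layers are then picked by largest or smallest angle depending on dataset difficulty. The following is a minimal sketch under those assumptions; the paper's exact definition and selection rule may differ.

```python
# A minimal sketch, assuming the "fine-tuned angle" of a layer is the angle
# between its flattened pre-trained and briefly fine-tuned weight vectors.
import math
import torch

def finetuned_angles(pretrained_state, finetuned_state):
    angles = {}
    for name, w0 in pretrained_state.items():
        if not torch.is_floating_point(w0):
            continue
        w1 = finetuned_state[name]
        cos = torch.nn.functional.cosine_similarity(
            w0.flatten(), w1.flatten(), dim=0).clamp(-1.0, 1.0)
        angles[name] = math.degrees(math.acos(cos.item()))
    return angles

def select_layers(angles, k, challenging_dataset=True):
    # Large angles for challenging datasets, small angles for easy ones.
    ranked = sorted(angles, key=angles.get, reverse=challenging_dataset)
    return ranked[:k]

# toy usage with two differently initialized copies of a small model
m0 = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 2))
m1 = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 2))
a = finetuned_angles(m0.state_dict(), m1.state_dict())
print(select_layers(a, k=2))
```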
2312.15430 Report Make-A-Character: High Quality Text-to-3D Character Generation within Minutes Jianqiang Ren, Chao He, Lin Liu, Jiahao Chen, Yutong Wang, Yafei Song, Jianfang Li, Tangli Xue, Siqi Hu, Tao Chen, Kunkun Zheng, Jianjing Xiang, Liefeng Bo There is a growing demand for customized and expressive 3D characters with the emergence of AI agents and Metaverse, but creating 3D characters using traditional computer graphics tools is a complex and time-consuming task. To address these challenges, we propose a user-friendly framework named Make-A-Character (Mach) to create lifelike 3D avatars from text descriptions. The framework leverages the power of large language and vision models for textual intention understanding and intermediate image generation, followed by a series of human-oriented visual perception and 3D generation modules. Our system offers an intuitive approach for users to craft controllable, realistic, fully-realized 3D characters that meet their expectations within 2 minutes, while also enabling easy integration with existing CG pipeline for dynamic expressiveness. For more information, please visit the project page at https://human3daigc.github.io/MACH/. The paper introduces Mach, a novel text-to-3D character generation framework that leverages LLMs and diffusion models to create realistic, controllable, and animatable 3D avatars from text descriptions. The demand for personalized 3D characters is increasing with the rise of the Metaverse and AI agents. However, traditional 3D creation tools are complex and time-consuming. Mach aims to democratize 3D character creation by enabling users to easily generate high-quality avatars using simple text prompts. Mach utilizes an LLM (Qwen-14B) to extract facial attributes from the text prompt and generate visual clues. These clues guide Stable Diffusion with ControlNet to create a reference portrait image. Dense landmark detection, triplane-based geometry generation, differentiable rendering, and neural delighting techniques are used to create the final 3D avatar. Mach generates high-quality 3D avatars from text descriptions within 2 minutes. The generated avatars are fully rigged and animatable, supporting various facial expressions. The framework utilizes an explicit 3D representation, ensuring compatibility with existing CG pipelines. The current version primarily focuses on Asian ethnicities due to the training data of the SD model. The generation of clothes, expressions, and motion from text prompts is still under development. text-to-3d, 3d avatar generation, large language models, diffusion models, character animation
2312.15289 Report Wavelet Packet Power Spectrum Kullback-Leibler Divergence: A New Metric for Image Synthesis Lokesh Veeramacheneni, Moritz Wolter, Juergen Gall Current metrics for generative neural networks are biased towards low frequencies, specific generators, objects from the ImageNet dataset, and value texture more than shape. Many current quality metrics do not measure frequency information directly. In response, we propose a new frequency band-based quality metric, which opens a door into the frequency domain yet, at the same time, preserves spatial aspects of the data. Our metric works well even if the distributions we compare are far from ImageNet or have been produced by differing generator architectures. We verify the quality of our metric by sampling a broad selection of generative networks on a wide variety of data sets. A user study ensures our metric aligns with human perception. Furthermore, we show that frequency band guidance can improve the frequency domain fidelity of a current generative network. This paper introduces Wavelet Packet Power Spectrum Kullback-Leibler Divergence (WPSKL), a new metric for assessing the quality of image synthesis in generative models. Existing metrics like FID and SSIM are biased towards specific datasets, sensitive to irrelevant details, and don't reliably reflect human perception, particularly in the frequency domain. The metric leverages the Wavelet Packet Transform (WPT) to capture spatial and frequency information. It computes the KL divergence between normalized wavelet power spectra of real and generated images. WPSKL shows better alignment with human perception compared to FID and SSIM in a user study. Analysis reveals that generative models often struggle to accurately capture high-frequency details, particularly in image backgrounds. Introducing a wavelet-based loss function during training can improve a model's fidelity in representing frequency information. The choice of wavelet function and decomposition level for WPT can impact the metric's results. Further research is needed to explore WPSKL's applicability to other image generation tasks beyond unconditional synthesis. generative models, image synthesis, quality metrics, wavelet packet transform, frequency bias
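The core computation behind the proposed metric can be sketched as: decompose each image with a 2D wavelet packet transform, average the squared coefficients per frequency band, normalize the band powers into a distribution, and take the KL divergence between the real and generated distributions. The snippet below is a minimal illustration with an assumed Haar wavelet, 2-level decomposition, and simple normalization, not the paper's exact formulation.

```python
# Minimal sketch of a wavelet-packet power-spectrum KL divergence between two
# image sets, assuming: Haar wavelet, 2-level packet decomposition, power
# averaged per frequency band, and a KL divergence over band-normalized spectra.
import numpy as np
import pywt

def band_power_spectrum(images, wavelet="haar", level=2):
    """images: (N, H, W) grayscale array in [0, 1]. Returns a normalized
    power distribution over wavelet packet bands."""
    powers = None
    for img in images:
        wp = pywt.WaveletPacket2D(data=img, wavelet=wavelet, maxlevel=level)
        nodes = wp.get_level(level, order="natural")
        p = np.array([np.mean(np.square(n.data)) for n in nodes])
        powers = p if powers is None else powers + p
    powers /= len(images)
    return powers / powers.sum()

def wpskl(real_images, fake_images, eps=1e-12):
    p = band_power_spectrum(real_images) + eps
    q = band_power_spectrum(fake_images) + eps
    return float(np.sum(p * np.log(p / q)))

# toy usage on random data
rng = np.random.default_rng(0)
real = rng.random((4, 64, 64))
fake = rng.random((4, 64, 64))
print(wpskl(real, fake))
```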
2312.15238 Report NoPose-NeuS: Jointly Optimizing Camera Poses with Neural Implicit Surfaces for Multi-view Reconstruction Mohamed Shawky Sabae, Hoda Anis Baraka, Mayada Mansour Hadhoud Learning neural implicit surfaces from volume rendering has become popular for multi-view reconstruction. Neural surface reconstruction approaches can recover complex 3D geometry that are difficult for classical Multi-view Stereo (MVS) approaches, such as non-Lambertian surfaces and thin structures. However, one key assumption for these methods is knowing accurate camera parameters for the input multi-view images, which are not always available. In this paper, we present NoPose-NeuS, a neural implicit surface reconstruction method that extends NeuS to jointly optimize camera poses with the geometry and color networks. We encode the camera poses as a multi-layer perceptron (MLP) and introduce two additional losses, which are multi-view feature consistency and rendered depth losses, to constrain the learned geometry for better estimated camera poses and scene surfaces. Extensive experiments on the DTU dataset show that the proposed method can estimate relatively accurate camera poses, while maintaining a high surface reconstruction quality with 0.89 mean Chamfer distance. NoPose-NeuS, a novel neural implicit surface reconstruction method extending NeuS to jointly optimize camera poses with geometry and color networks, enhancing 3D reconstruction from multi-view images without assuming accurate camera parameters. Estimating accurate camera parameters is crucial but challenging for neural implicit surface reconstruction. Existing methods often assume known camera parameters, limiting their practicality. This work addresses this by enabling camera pose optimization directly within the reconstruction pipeline. The method utilizes an MLP for camera pose prediction from camera indices. It introduces multi-view feature consistency and rendered depth losses to refine pose estimation and improve surface reconstruction quality. Achieves high surface reconstruction quality comparable to state-of-the-art methods relying on known camera parameters. Estimates camera poses with high relative accuracy, comparable to classical MVS pipelines. Demonstrates robustness in handling complex geometries and achieves superior reconstruction quality compared to classical MVS methods. The method's performance is sensitive to camera pose initialization. It assumes a bounded scene, limiting its applicability to unbounded scenarios. Future work could explore relaxing this assumption. neural implicit surface reconstruction, camera pose estimation, multi-view stereo, volume rendering, deep learning
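The joint pose optimization can be illustrated with a small module that maps a learned camera-index embedding to an axis-angle rotation and a translation, which then receive gradients from the same rendering losses that train the geometry and color networks. The parameterization below is an assumption for illustration, not necessarily the authors' exact design.

```python
# A minimal sketch (not the authors' code) of predicting per-view camera poses
# with an MLP from a learned camera-index embedding, outputting an axis-angle
# rotation and a translation that are optimized jointly with the NeRF.
import torch
import torch.nn as nn

def axis_angle_to_matrix(v):                 # v: (B, 3), Rodrigues' formula
    theta = v.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    k = v / theta
    K = torch.zeros(v.shape[0], 3, 3, device=v.device)
    K[:, 0, 1], K[:, 0, 2] = -k[:, 2], k[:, 1]
    K[:, 1, 0], K[:, 1, 2] = k[:, 2], -k[:, 0]
    K[:, 2, 0], K[:, 2, 1] = -k[:, 1], k[:, 0]
    I = torch.eye(3, device=v.device).expand_as(K)
    s, c = torch.sin(theta)[..., None], torch.cos(theta)[..., None]
    return I + s * K + (1 - c) * (K @ K)

class PoseMLP(nn.Module):
    def __init__(self, num_views, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(num_views, hidden)
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 6))   # 3 rotation + 3 translation
    def forward(self, view_idx):
        out = self.mlp(self.embed(view_idx))
        R = axis_angle_to_matrix(out[:, :3])
        t = out[:, 3:]
        return R, t                                      # camera-to-world pose

poses = PoseMLP(num_views=49)                            # e.g., one DTU scan
R, t = poses(torch.arange(4))
print(R.shape, t.shape)                                  # (4, 3, 3), (4, 3)
```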
2312.15162 Report Cycle-Consistency Learning for Captioning and Grounding Ning Wang, Jiajun Deng, Mingbo Jia We show that visual grounding and image captioning, which are two mutually inverse processes, can be bridged together for collaborative training through careful design. By consolidating this idea, we introduce CyCo, a cyclic-consistent learning framework to ameliorate the independent training pipelines of visual grounding and image captioning. The proposed framework (1) allows the semi-weakly supervised training of visual grounding; (2) improves the performance of fully supervised visual grounding; (3) yields a general captioning model that can describe arbitrary image regions. Extensive experiments show that our fully supervised grounding model achieves state-of-the-art performance, and the semi-weakly supervised one also exhibits competitive performance compared to the fully supervised counterparts. Our image captioning model can freely describe image regions while showing impressive performance on prevalent captioning benchmarks. This paper presents CyCo, a novel cyclic-consistent learning framework that bridges visual grounding and image captioning for collaborative training. This framework addresses limitations of current visual grounding and image captioning techniques by enabling semi-weakly supervised training, improving fully supervised performance, and allowing for region-specific image descriptions. CyCo utilizes a shared visual encoder and distinct Transformer blocks for each task. It employs two cyclic learning processes: grounding-to-captioning (box consistency) and captioning-to-grounding (caption consistency) to enforce mutual supervision. CyCo achieves state-of-the-art performance for fully supervised visual grounding. The semi-weakly supervised CyCo shows competitive results compared to fully supervised counterparts. CyCo enables region-specific image descriptions, surpassing traditional global captioning models. The work primarily uses ViT-B; exploring stronger backbones is left for future work. Incorporating larger-scale, weakly-labeled datasets can further enhance performance. visual grounding, image captioning, cycle-consistency learning, vision-language pre-training, semi-weakly supervised learning
2312.15043 Report GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection Haozhan Shen, Tiancheng Zhao, Mingwei Zhu, Jianwei Yin Visual grounding, a crucial vision-language task involving the understanding of the visual context based on the query expression, necessitates the model to capture the interactions between objects, as well as various spatial and attribute information. However, the annotation data of visual grounding task is limited due to its time-consuming and labor-intensive annotation process, resulting in the trained models being constrained from generalizing its capability to a broader domain. To address this challenge, we propose GroundVLP, a simple yet effective zero-shot method that harnesses visual grounding ability from the existing models trained from image-text pairs and pure object detection data, both of which are more conveniently obtainable and offer a broader domain compared to visual grounding annotation data. GroundVLP proposes a fusion mechanism that combines the heatmap from GradCAM and the object proposals of open-vocabulary detectors. We demonstrate that the proposed method significantly outperforms other zero-shot methods on RefCOCO/+/g datasets, surpassing prior zero-shot state-of-the-art by approximately 28\% on the test split of RefCOCO and RefCOCO+. Furthermore, GroundVLP performs comparably to or even better than some non-VLP-based supervised models on the Flickr30k entities dataset. Our code is available at https://github.com/om-ai-lab/GroundVLP. This paper proposes GroundVLP, a novel zero-shot method for visual grounding tasks (both Referring Expression Comprehension (REC) and phrase grounding) by leveraging the semantic understanding of Vision-Language Pre-training (VLP) models and the object detection capabilities of Open-Vocabulary object Detectors (OVD). Visual grounding datasets are limited due to their complex annotation process. GroundVLP addresses this challenge by leveraging readily available image-text pair and object detection data, eliminating the need for task-specific visual grounding annotations. GroundVLP utilizes GradCAM on a VLP model to generate a heatmap highlighting image regions relevant to the given expression. It then employs an OVD to detect candidate objects belonging to a predetermined category (either ground-truth or predicted). Finally, a weighted grade fusion mechanism combines the heatmap and object proposals to pinpoint the target object. GroundVLP significantly outperforms existing zero-shot methods on RefCOCO/+/g datasets for REC, surpassing previous state-of-the-art by approximately 28%. It achieves comparable or even better performance than some non-VLP-based supervised models on the Flickr30k entities dataset for phrase grounding. Ablation studies validate the effectiveness of each component in GroundVLP, highlighting the importance of the weighted grade fusion and visual word attention aggregation. The performance of GroundVLP can be affected by the inherent biases and noise present in the datasets used, especially when relying on predicted object categories. GroundVLP may inherit potential biases from the foundational VLP and OVD models. visual grounding, zero-shot learning, vision-language pre-training, open-vocabulary object detection, gradcam
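The fusion step can be sketched as scoring every open-vocabulary detection box by the GradCAM heatmap mass it covers, blended with the detector confidence, and keeping the best box. The weighting and area normalization below are assumptions for illustration; GroundVLP's weighted grade fusion may differ in detail.

```python
# Minimal sketch of the fusion idea: score each open-vocabulary detection box
# by the GradCAM heatmap mass it covers (area-normalized) blended with the
# detector confidence, then pick the highest-scoring box.
import numpy as np

def fuse_heatmap_and_boxes(heatmap, boxes, det_scores, alpha=0.7):
    """heatmap: (H, W) GradCAM map in [0, 1]; boxes: (N, 4) as x1, y1, x2, y2
    in pixel coords; det_scores: (N,) detector confidences."""
    fused = []
    for (x1, y1, x2, y2), s in zip(boxes.astype(int), det_scores):
        region = heatmap[y1:y2, x1:x2]
        area = max(region.size, 1)
        heat_score = region.sum() / np.sqrt(area)   # mild area normalization
        fused.append(alpha * heat_score + (1 - alpha) * s)
    fused = np.asarray(fused)
    return int(fused.argmax()), fused

# toy usage: the first box covers the activated region and should win
H = W = 64
heat = np.zeros((H, W)); heat[10:30, 10:30] = 1.0
boxes = np.array([[8, 8, 32, 32], [40, 40, 60, 60]], dtype=float)
scores = np.array([0.6, 0.9])
best, fused = fuse_heatmap_and_boxes(heat, boxes, scores)
print(best, fused)
```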
2312.14988 Report Emage: Non-Autoregressive Text-to-Image Generation Zhangyin Feng, Runyi Hu, Liangxin Liu, Fan Zhang, Duyu Tang, Yong Dai, Xiaocheng Feng, Jiwei Li, Bing Qin, Shuming Shi Autoregressive and diffusion models drive the recent breakthroughs in text-to-image generation. Despite their huge success in generating highly realistic images, a common shortcoming of these models is their high inference latency - autoregressive models run more than a thousand times successively to produce image tokens and diffusion models convert Gaussian noise into images with many hundreds of denoising steps. In this work, we explore non-autoregressive text-to-image models that efficiently generate hundreds of image tokens in parallel. We develop many model variations with different learning and inference strategies, initialized text encoders, etc. Compared with autoregressive baselines that need to run one thousand times, our model only runs 16 times to generate images of competitive quality with an order of magnitude lower inference latency. Our non-autoregressive model with 346M parameters generates an image of 256×256 in about one second on one V100 GPU. This paper presents Emage, a non-autoregressive model for text-to-image generation that significantly reduces inference latency compared to autoregressive and diffusion models. Existing text-to-image generation models, while producing high-quality images, suffer from high inference latency due to their autoregressive or iterative nature. Emage addresses this by generating image tokens in parallel. The authors explore several non-autoregressive model variations, including fully parallel and iterative approaches. They utilize techniques like mask prediction, iterative refinement, and a CLIP-initialized text encoder to generate image tokens efficiently. Fully non-autoregressive models struggle to converge during training due to the long sequence length of image tokens. Iterative non-autoregressive models, particularly one that revises previous predictions and predicts new tokens simultaneously, achieve competitive image quality with significantly lower latency. Emage (346M parameters) generates images in about one second on a V100 GPU, achieving an order of magnitude speedup compared to autoregressive baselines. The performance gap between CLIP and larger text encoders needs further investigation. Generating high-quality human faces remains challenging and requires further model scaling and data improvements. text-to-image generation, non-autoregressive models, image generation, clip, vqgan
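The iterative non-autoregressive decoding that Emage relies on can be sketched as a mask-predict loop: start from fully masked image tokens, predict all positions in parallel at each of the 16 steps, keep the most confident predictions, and re-mask the rest on a decaying schedule. The `model` callable and cosine schedule below are illustrative assumptions, not the paper's exact inference procedure.

```python
# Minimal sketch of iterative non-autoregressive decoding over image tokens:
# start fully masked, at each of T steps predict all tokens in parallel, keep
# the most confident ones, and re-mask the rest on a cosine schedule.
# `model(tokens, text_emb)` returning (L, V) logits is a hypothetical callable.
import math
import torch

def mask_predict_decode(model, text_emb, seq_len, mask_id, steps=16):
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)
    confidence = torch.zeros(seq_len)
    for step in range(steps):
        logits = model(tokens, text_emb)                   # (L, V)
        probs = logits.softmax(dim=-1)
        pred_conf, pred_tok = probs.max(dim=-1)
        still_masked = tokens == mask_id
        tokens = torch.where(still_masked, pred_tok, tokens)
        confidence = torch.where(still_masked, pred_conf, confidence)
        # cosine schedule: fraction of tokens to keep masked after this step
        ratio = math.cos(math.pi / 2 * (step + 1) / steps)
        num_mask = int(seq_len * ratio)
        if num_mask > 0:
            remask = confidence.argsort()[:num_mask]       # least confident tokens
            tokens[remask] = mask_id
    return tokens

# toy usage with a random "model"
V, L, MASK = 1024, 256, 1024
dummy = lambda toks, txt: torch.randn(L, V)
out = mask_predict_decode(dummy, text_emb=None, seq_len=L, mask_id=MASK)
print(out.shape, (out == MASK).sum().item())               # (256,), 0
```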
2312.14985 Report UniHuman: A Unified Model for Editing Human Images in the Wild Nannan Li, Qing Liu, Krishna Kumar Singh, Yilin Wang, Jianming Zhang, Bryan A. Plummer, Zhe Lin Human image editing includes tasks like changing a person's pose, their clothing, or editing the image according to a text prompt. However, prior work often tackles these tasks separately, overlooking the benefit of mutual reinforcement from learning them jointly. In this paper, we propose UniHuman, a unified model that addresses multiple facets of human image editing in real-world settings. To enhance the model's generation quality and generalization capacity, we leverage guidance from human visual encoders and introduce a lightweight pose-warping module that can exploit different pose representations, accommodating unseen textures and patterns. Furthermore, to bridge the disparity between existing human editing benchmarks with real-world data, we curated 400K high-quality human image-text pairs for training and collected 2K human images for out-of-domain testing, both encompassing diverse clothing styles, backgrounds, and age groups. Experiments on both in-domain and out-of-domain test sets demonstrate that UniHuman outperforms task-specific models by a significant margin. In user studies, UniHuman is preferred by the users in an average of 77% of cases. Our project is available at https://github.com/NannanLi999/UniHuman. This paper proposes UniHuman, a unified model addressing multiple human image editing tasks in real-world settings, such as reposing, virtual try-on, and text-guided manipulation, by leveraging synergies between these tasks. Existing methods often tackle these tasks in isolation, overlooking the benefits of learning them jointly and neglecting the adaptability to unseen human-in-the-wild cases. The UniHuman model employs human visual encoders for texture and style guidance and introduces a novel pose-warping module to ensure texture consistency across different tasks. It leverages both dense and sparse pose representations, making it robust to unseen textures. The authors also curated a large-scale dataset (LH-400K) with diverse human images to improve generalization. UniHuman significantly outperforms task-specific models on both in-domain and out-of-domain datasets, demonstrating its strong generalization capability. The model effectively transfers textures and preserves clothing identities, even for complex patterns and challenging poses. User studies confirm UniHuman's superiority, with users preferring its results in an average of 77% of cases. The performance depends on the accuracy of pose detectors and parsing models, which can be challenging for complex poses. Future work will explore incorporating 3D information, such as depth and surface normal, to enhance accuracy and address limitations of existing methods. human image editing, virtual try-on, reposing, text-guided manipulation, pose warping
2312.14923 Report Fast-NTK: Parameter-Efficient Unlearning for Large-Scale Models Guihong Li, Hsiang Hsu, Chun-Fu Chen, Radu Marculescu The rapid growth of machine learning has spurred legislative initiatives such as ``the Right to be Forgotten,'' allowing users to request data removal. In response, ``machine unlearning'' proposes the selective removal of unwanted data without the need for retraining from scratch. While the Neural-Tangent-Kernel-based (NTK-based) unlearning method excels in performance, it suffers from significant computational complexity, especially for large-scale models and datasets. Our work introduces ``Fast-NTK,'' a novel NTK-based unlearning algorithm that significantly reduces the computational complexity by incorporating parameter-efficient fine-tuning methods, such as fine-tuning batch normalization layers in a CNN or visual prompts in a vision transformer. Our experimental results demonstrate scalability to much larger neural networks and datasets (e.g., 88M parameters; 5k images), surpassing the limitations of previous full-model NTK-based approaches designed for smaller cases (e.g., 8M parameters; 500 images). Notably, our approach maintains a performance comparable to the traditional method of retraining on the retain set alone. Fast-NTK can thus enable for practical and scalable NTK-based unlearning in deep neural networks. This paper introduces "Fast-NTK," a novel algorithm for machine unlearning in large-scale models that combines parameter-efficient fine-tuning methods with Neural-Tangent-Kernel-based unlearning. Existing NTK-based unlearning methods, while effective, struggle with high computational complexity, limiting their application to small-scale models and datasets. Fast-NTK addresses this limitation by significantly reducing the number of parameters involved in the unlearning process. Fast-NTK selectively fine-tunes and applies NTK-based unlearning to only a subset of crucial model parameters. For CNNs, it focuses on batch normalization layers, while for ViTs, it utilizes prompts. Fast-NTK exhibits performance comparable to retraining from scratch on the retain set, effectively removing the influence of forget samples. The method significantly reduces the number of parameters involved in fine-tuning and unlearning, enabling its application to larger models and datasets. Fast-NTK scales to vision transformers and larger datasets, unlike previous NTK-based approaches that were limited to smaller networks and datasets. The current implementation relies on exact NTK matrix computations, limiting its efficiency. Exploring approximate computation methods could further improve scalability. The reliance on pre-trained models introduces risks, as these models may possess prior knowledge of classes to be unlearned, necessitating further investigation into the relationship between pre-training and unlearning. machine unlearning, neural tangent kernel, parameter-efficient fine-tuning, deep neural networks, privacy
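The parameter-efficient side of Fast-NTK can be illustrated by freezing a CNN and exposing only its batch-normalization affine parameters, so that both fine-tuning and the subsequent NTK computation operate on a tiny parameter subset. The sketch below shows that selection for a ResNet-18; the NTK-based unlearning update itself is omitted.

```python
# Minimal sketch of the parameter-efficient setup: freeze a CNN and expose only
# its BatchNorm affine parameters for fine-tuning (and for building the NTK),
# which shrinks the Jacobian from ~all weights to a tiny subset.
import torch
import torch.nn as nn
from torchvision.models import resnet18

def batchnorm_parameters(model):
    params = []
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d)):
            params += list(module.parameters(recurse=False))
    return params

model = resnet18(weights=None)
for p in model.parameters():
    p.requires_grad_(False)
bn_params = batchnorm_parameters(model)
for p in bn_params:
    p.requires_grad_(True)

trainable = sum(p.numel() for p in bn_params)
total = sum(p.numel() for p in model.parameters())
print(f"trainable BN params: {trainable} / {total} ({100 * trainable / total:.2f}%)")
optimizer = torch.optim.SGD(bn_params, lr=1e-2)   # fine-tune only these
```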
2312.14871 Report BrainVis: Exploring the Bridge between Brain and Visual Signals via Image Reconstruction Honghao Fu, Zhiqi Shen, Jing Jih Chin, Hao Wang Analyzing and reconstructing visual stimuli from brain signals effectively advances understanding of the human visual system. However, EEG signals are complex and contain a substantial amount of noise. This leads to substantial limitations in existing work on reconstructing visual stimuli from EEG, such as difficulties in aligning EEG embeddings with the fine-grained semantic information and a heavy reliance on additional large self-collected datasets for training. To address these challenges, we propose a novel approach called BrainVis. Firstly, we divide the EEG signals into various units and apply a self-supervised approach on them to obtain EEG time-domain features, in an attempt to ease the training difficulty. Additionally, we propose to utilize the frequency-domain features to enhance the EEG representations. Then, we simultaneously align EEG time-frequency embeddings with the interpolation of the coarse and fine-grained semantics in the CLIP space, to highlight the primary visual components and reduce the cross-modal alignment difficulty. Finally, we adopt cascaded diffusion models to reconstruct images. Our proposed BrainVis outperforms the state of the art in both semantic fidelity reconstruction and generation quality. Notably, we reduce the training data scale to 10% of that used in previous work. BrainVis, a novel pipeline for reconstructing images from EEG signals, utilizing self-supervised learning for time-domain features, LSTM for frequency-domain features, and cascaded diffusion models for image generation. Analyzing visual stimuli reconstruction from brain signals advances the understanding of the human visual system, but existing EEG-based methods have limitations in aligning EEG embeddings with semantic information and rely on large datasets. EEG signals are divided into units for self-supervised time-domain feature extraction, LSTM is used for frequency-domain feature extraction, and a cross-modal alignment network aligns EEG features with interpolated CLIP embeddings for image reconstruction using cascaded diffusion models. BrainVis outperforms state-of-the-art methods in semantic reconstruction and generation quality. The method eliminates reliance on additional large-scale datasets. Analysis suggests that visual information in EEG might prioritize fundamental object properties over specific categories. The study primarily focuses on semantic level reconstruction and not pixel-level accuracy. Further research can explore decoding individual visual properties like color and shape from EEG. eeg, image reconstruction, brain-computer interface, deep learning, clip
2312.14867 Report VIEScore: Towards Explainable Metrics for Conditional Image Synthesis Evaluation Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, Wenhu Chen In the rapidly advancing field of conditional image generation research, challenges such as limited explainability lie in effectively evaluating the performance and capabilities of various models. This paper introduces VIESCORE, a Visual Instruction-guided Explainable metric for evaluating any conditional image generation tasks. VIESCORE leverages general knowledge from Multimodal Large Language Models (MLLMs) as the backbone and does not require training or fine-tuning. We evaluate VIESCORE on seven prominent tasks in conditional image tasks and found: (1) VIESCORE (GPT4-v) achieves a high Spearman correlation of 0.3 with human evaluations, while the human-to-human correlation is 0.45. (2) VIESCORE (with open-source MLLM) is significantly weaker than GPT-4v in evaluating synthetic images. (3) VIESCORE achieves a correlation on par with human ratings in the generation tasks but struggles in editing tasks. With these results, we believe VIESCORE shows its great potential to replace human judges in evaluating image synthesis tasks. This paper introduces VIEScore, a Visual Instruction-guided Explainable metric for evaluating conditional image generation tasks using Multimodal Large Language Models (MLLMs) without training. Evaluating AI-synthesized images is challenging due to limitations of existing metrics and the subjectivity and scalability issues of human evaluation. VIEScore aims to address these gaps. VIEScore leverages MLLMs to evaluate images based on instructions and provide rationale for their scores. The authors tested VIEScore across seven image synthesis tasks using ImagenHub benchmark and compared its performance with human evaluations and existing automatic metrics. VIEScore (GPT-4v) achieves high correlation with human evaluations, outperforming other MLLMs and automatic metrics in most tasks. Open-source MLLMs perform significantly weaker than GPT-4v in evaluating synthetic images. MLLMs struggle to capture nuances in edited images, highlighting a challenge in evaluating image editing tasks. OpenAI's security and privacy policy limits evaluation of images resembling real persons. Future work focuses on investigating distillation models to replicate human-like evaluation performance. image generation, image evaluation, multimodal large language models, explainable ai, viescore
2312.14828 Report Plan, Posture and Go: Towards Open-World Text-to-Motion Generation Jinpeng Liu, Wenxun Dai, Chunyu Wang, Yiji Cheng, Yansong Tang, Xin Tong Conventional text-to-motion generation methods are usually trained on limited text-motion pairs, making them hard to generalize to open-world scenarios. Some works use the CLIP model to align the motion space and the text space, aiming to enable motion generation from natural language motion descriptions. However, they are still constrained to generate limited and unrealistic in-place motions. To address these issues, we present a divide-and-conquer framework named PRO-Motion, which consists of three modules: a motion planner, a posture-diffuser, and a go-diffuser. The motion planner instructs Large Language Models (LLMs) to generate a sequence of scripts describing the key postures in the target motion. Unlike natural language, the scripts can describe all possible postures following very simple text templates. This significantly reduces the complexity of the posture-diffuser, which transforms a script into a posture, paving the way for open-world generation. Finally, the go-diffuser, implemented as another diffusion model, estimates whole-body translations and rotations for all postures, resulting in realistic motions. Experimental results have shown the superiority of our method over other counterparts, and demonstrated its capability of generating diverse and realistic motions from complex open-world prompts such as "Experiencing a profound sense of joy". The project page is available at https://moonsliu.github.io/Pro-Motion. This paper presents PRO-Motion, a novel framework for open-world text-to-motion generation, addressing the limitations of conventional methods that struggle to generalize beyond limited text-motion paired datasets. This work is important because it allows for the generation of diverse and realistic motions from open-world text prompts, a task previously challenging due to the limitations of existing datasets and models. The PRO-Motion framework utilizes a divide-and-conquer approach, consisting of three modules: 1) a motion planner that leverages Large Language Models (LLMs) to translate complex text descriptions into a sequence of posture scripts; 2) a posture-diffuser, a diffusion-based model, that generates key poses aligning with the scripts; and 3) a go-diffuser, another diffusion model, that predicts whole-body translations and rotations for smooth and realistic motion generation. PRO-Motion demonstrates superior performance compared to state-of-the-art methods in open-world text-to-motion generation, as evidenced by quantitative metrics such as R-precision, FID, and Multimodal Distance. The posture-diffuser module effectively generates precise poses from localized body part descriptions, surpassing existing methods in preserving textual information and handling diverse motion descriptions. The go-diffuser module successfully predicts spatial information (translation and rotation) from local pose sequences, outperforming baseline methods and achieving state-of-the-art results in Average Positional Error and Average Variance Error. The reliance on LLMs for motion planning introduces a dependency on their capabilities and potential biases, which might require further investigation. The current implementation primarily focuses on generating motions with a fixed number of frames. Future work could explore extending it to handle variable-length motion sequences. text-to-motion, open-world generation, diffusion models, large language models, pose generation
2312.14733 Report Harnessing Diffusion Models for Visual Perception with Meta Prompts Qiang Wan, Zilong Huang, Bingyi Kang, Jiashi Feng, Li Zhang The issue of generative pretraining for vision models has persisted as a long-standing conundrum. At present, the text-to-image (T2I) diffusion model demonstrates remarkable proficiency in generating high-definition images matching textual inputs, a feat made possible through its pre-training on large-scale image-text pairs. This leads to a natural inquiry: can diffusion models be utilized to tackle visual perception tasks? In this paper, we propose a simple yet effective scheme to harness a diffusion model for visual perception tasks. Our key insight is to introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception. The effect of meta prompts is two-fold. First, as a direct replacement of the text embeddings in the T2I models, they can activate task-relevant features during feature extraction. Second, they are used to re-arrange the extracted features, ensuring that the model focuses on the most pertinent features for the task at hand. Additionally, we design a recurrent refinement training strategy that fully leverages the property of diffusion models, thereby yielding stronger visual features. Extensive experiments across various benchmarks validate the effectiveness of our approach. Our approach achieves new performance records in depth estimation tasks on NYU depth V2 and KITTI, and in the semantic segmentation task on CityScapes. Concurrently, the proposed method attains results comparable to the current state-of-the-art in semantic segmentation on ADE20K and pose estimation on COCO datasets, further exemplifying its robustness and versatility. This paper presents a novel adaptation method for employing text-to-image diffusion models in visual perception tasks by introducing learnable embeddings, termed 'meta prompts,' for enhanced feature extraction and recurrent refinement training. Adapting powerful generative diffusion models for perception tasks is a promising direction but existing methods struggle with complex prompt interfaces. This paper introduces a streamlined approach using meta prompts, eliminating the need for external text inputs or pre-trained text encoders. The method uses a pre-trained text-to-image diffusion model. An input image is encoded into latent space and fed to the model. Learnable meta prompts, instead of text embeddings, are used to activate task-relevant features through cross-attention. These prompts further rearrange multi-scale features. A recurrent refinement strategy with modulated timestep embeddings allows for iterative feature enhancement. Finally, a task-specific decoder generates the prediction. Achieves state-of-the-art depth estimation performance on NYU Depth V2 and KITTI datasets. Sets a new benchmark for semantic segmentation on the CityScapes dataset. Achieves competitive results in semantic segmentation on ADE20K and pose estimation on COCO, showing robustness and versatility. The number of meta prompts needs to be optimized for each specific task. Further research on extending the method to other visual perception tasks beyond those tested is warranted. diffusion models, visual perception, meta prompts, recurrent refinement, feature extraction
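The central idea, replacing text embeddings with learnable meta prompts as the cross-attention context of a frozen text-to-image backbone, can be sketched as below. The dimensions, prompt count, and residual layout are illustrative assumptions rather than the paper's exact architecture.

```python
# Minimal sketch of the core idea: a cross-attention block whose context is a
# set of learnable "meta prompt" embeddings instead of text embeddings, so the
# T2I backbone can be steered toward task-relevant features.
import torch
import torch.nn as nn

class MetaPromptCrossAttention(nn.Module):
    def __init__(self, dim=320, num_prompts=64, num_heads=8):
        super().__init__()
        self.meta_prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        """x: (B, L, dim) flattened spatial features from a UNet block."""
        ctx = self.meta_prompts.unsqueeze(0).expand(x.shape[0], -1, -1)
        out, _ = self.attn(query=self.norm(x), key=ctx, value=ctx)
        return x + out                     # residual, as in standard T2I blocks

feat = torch.randn(2, 64 * 64, 320)        # e.g., a 64x64 latent feature map
block = MetaPromptCrossAttention()
print(block(feat).shape)                   # torch.Size([2, 4096, 320])
```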
2312.14611 Report Tuning-Free Inversion-Enhanced Control for Consistent Image Editing Xiaoyue Duan, Shuhao Cui, Guoliang Kang, Baochang Zhang, Zhengcong Fei, Mingyuan Fan, Junshi Huang Consistent editing of real images is a challenging task, as it requires performing non-rigid edits (e.g., changing postures) to the main objects in the input image without changing their identity or attributes. To guarantee consistent attributes, some existing methods fine-tune the entire model or the textual embedding for structural consistency, but they are time-consuming and fail to perform non-rigid edits. Other works are tuning-free, but their performances are weakened by the quality of Denoising Diffusion Implicit Model (DDIM) reconstruction, which often fails in real-world scenarios. In this paper, we present a novel approach called Tuning-free Inversion-enhanced Control (TIC), which directly correlates features from the inversion process with those from the sampling process to mitigate the inconsistency in DDIM reconstruction. Specifically, our method effectively obtains inversion features from the key and value features in the self-attention layers, and enhances the sampling process by these inversion features, thus achieving accurate reconstruction and content-consistent editing. To extend the applicability of our method to general editing scenarios, we also propose a mask-guided attention concatenation strategy that combines contents from both the inversion and the naive DDIM editing processes. Experiments show that the proposed method outperforms previous works in reconstruction and consistent editing, and produces impressive results in various settings. Presents Tuning-free Inversion-enhanced Control (TIC) for consistent editing of real images by mitigating inconsistencies in DDIM reconstruction. Consistent editing in real images while changing non-rigid attributes is challenging due to limitations in existing methods like DDIM reconstruction quality. Analyzes reconstruction error in DDIM, introduces TIC which correlates features from inversion and sampling processes. Employs mask-guided attention concatenation to balance fidelity and editability, and integrates with controllable diffusion models. TIC achieves superior reconstruction quality compared to baselines, approaching VAE's upper bound. Performs non-rigid edits (e.g., posture, expressions) while preserving content consistency in complex scenarios with multiple objects. Integration with controllable diffusion models and mask-guided attention concatenation extends TIC to general editing, balancing fidelity and new content generation. TIC's enhancement strategy is applied from a specific timestep and layer, which might not be optimal for all cases. Exploration of its application in image and video generation. consistent image editing, ddim reconstruction, text-guided image editing, diffusion models, controllable image synthesis
2312.14579 Report Environment-Specific People Mirela Ostrek, Soubhik Sanyal, Carol O'Sullivan, Michael J. Black, Justus Thies Despite significant progress in generative image synthesis and full-body generation in particular, state-of-the-art methods are either context-independent, overly reliant on text prompts, or bound to curated training datasets, such as fashion images with monotonous backgrounds. Here, our goal is to generate people in clothing that is semantically appropriate for a given scene. To this end, we present ESP, a novel method for context-aware full-body generation that enables photo-realistic inpainting of people into existing "in-the-wild" photographs. ESP is conditioned on a 2D pose and contextual cues that are extracted from the environment photograph and integrated into the generation process. Our models are trained on a dataset containing a set of in-the-wild photographs of people covering a wide range of different environments. The method is analyzed quantitatively and qualitatively, and we show that ESP outperforms the state of the art on the task of contextual full-body generation. This paper presents ESP, a context-aware full-body generation method that inpaints people wearing semantically appropriate clothing into existing photographs. Existing methods are context-independent, rely heavily on text prompts, or are limited by curated datasets lacking realistic environment-clothing correlations. ESP addresses these limitations by enabling photorealistic inpainting of people whose attire matches the scene. ESP leverages a VAE to extract contextual cues from the environment, feeds these into a StyleGAN-based HPM generator to predict clothing semantics, and uses a HPM translation module to guide a pre-trained Stable Diffusion model for seamless inpainting. ESP successfully generates environment-specific people whose clothing aligns with the input photograph's context. Quantitative analysis shows that ESP outperforms state-of-the-art methods in terms of contextual appropriateness. The HPM translation module effectively bridges the semantic gap between binary masks and complex human bodies, enabling high-quality inpainting. The current training dataset exhibits biases that need to be addressed through diversification. Further research can explore finer-grained control over the generated clothing style, potentially incorporating textual prompts. image generation, inpainting, context-aware, full-body generation, human parsing maps
2312.14494 Report Revisiting Few-Shot Object Detection with Vision-Language Models Anish Madan, Neehar Peri, Shu Kong, Deva Ramanan Few-shot object detection (FSOD) benchmarks have advanced techniques for detecting new categories with limited annotations. Existing benchmarks repurpose well-established datasets like COCO by partitioning categories into base and novel classes for pre-training and fine-tuning respectively. However, these benchmarks do not reflect how FSOD is deployed in practice. Rather than only pre-training on a small number of base categories, we argue that it is more practical to fine-tune a foundation model (e.g., a vision-language model (VLM) pre-trained on web-scale data) for a target domain. Surprisingly, we find that zero-shot inference from VLMs like GroundingDINO significantly outperforms the state-of-the-art (48.3 vs. 33.1 AP) on COCO. However, such zero-shot models can still be misaligned to target concepts of interest. For example, trailers on the web may be different from trailers in the context of autonomous vehicles. In this work, we propose Foundational FSOD, a new benchmark protocol that evaluates detectors pre-trained on any external datasets and fine-tuned on K-shots per target class. Further, we note that current FSOD benchmarks are actually federated datasets containing exhaustive annotations for each category on a subset of the data. We leverage this insight to propose simple strategies for fine-tuning VLMs with federated losses. We demonstrate the effectiveness of our approach on LVIS and nuImages, improving over prior work by 5.9 AP. Our code is available at https://github.com/anishmadan23/foundational_fsod This paper proposes Foundational FSOD, a new benchmark for few-shot object detection using vision-language foundation models pre-trained on large-scale datasets. Existing FSOD benchmarks are unrealistic because they partition datasets into base and novel classes and do not reflect the use of foundation models in practice. The authors leverage the observation that FSOD benchmarks are actually federated datasets and propose simple fine-tuning strategies for VLMs using federated losses and pseudo-negative labels. Zero-shot inference with VLMs outperforms state-of-the-art FSOD methods on COCO. Fine-tuning VLMs with federated losses and pseudo-negatives further improves performance on LVIS and nuImages. Fine-tuning with pseudo-negatives approaches the oracle performance of using ground-truth negatives. Performance on rare categories is significantly lower than common categories, suggesting VLMs are pre-trained on imbalanced data. The approach only uses class names as text features, and future work could explore richer textual descriptions for multi-modal alignment. few-shot object detection, vision-language models, federated datasets, concept alignment, pseudo-negative labels
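The federated fine-tuning idea can be sketched as a per-detection sigmoid classification loss that is evaluated only over categories verified for a given image, its exhaustively annotated positives plus sampled (pseudo-)negative categories, while all other categories are ignored. The snippet below is an illustration under those assumptions; the paper's exact loss may differ.

```python
# Minimal sketch of a federated-style classification loss: for each image, apply
# the sigmoid loss only over categories that are "verified" for that image --
# its exhaustively annotated positive classes plus a sampled set of
# (pseudo-)negative classes -- and ignore all other categories.
import torch
import torch.nn.functional as F

def federated_bce(cls_logits, target_classes, pos_classes, neg_classes):
    """cls_logits: (N, C) per-detection logits; target_classes: (N,) assigned
    class per detection; pos/neg_classes: category ids verified for this image."""
    N, C = cls_logits.shape
    targets = torch.zeros(N, C)
    targets[torch.arange(N), target_classes] = 1.0
    verified = torch.zeros(C, dtype=torch.bool)
    verified[list(pos_classes) + list(neg_classes)] = True
    loss = F.binary_cross_entropy_with_logits(cls_logits, targets, reduction="none")
    return (loss * verified).sum() / (N * verified.sum().clamp(min=1))

# toy usage: 3 detections, 10 categories, classes {2, 5} annotated, {0, 7} pseudo-negatives
logits = torch.randn(3, 10)
print(federated_bce(logits, torch.tensor([2, 2, 5]), pos_classes=[2, 5], neg_classes=[0, 7]))
```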
2312.14385 Report Generative AI Beyond LLMs: System Implications of Multi-Modal Generation Alicia Golden, Samuel Hsia, Fei Sun, Bilge Acun, Basil Hosmer, Yejin Lee, Zachary DeVito, Jeff Johnson, Gu-Yeon Wei, David Brooks, Carole-Jean Wu As the development of large-scale Generative AI models evolve beyond text (1D) generation to include image (2D) and video (3D) generation, processing spatial and temporal information presents unique challenges to quality, performance, and efficiency. We present the first work towards understanding this new system design space for multi-modal text-to-image (TTI) and text-to-video (TTV) generation models. Current model architecture designs are bifurcated into 2 categories: Diffusion- and Transformer-based models. Our systematic performance characterization on a suite of eight representative TTI/TTV models shows that after state-of-the-art optimization techniques such as Flash Attention are applied, Convolution accounts for up to 44% of execution time for Diffusion-based TTI models, while Linear layers consume up to 49% of execution time for Transformer-based models. We additionally observe that Diffusion-based TTI models resemble the Prefill stage of LLM inference, and benefit from 1.1-2.5x greater speedup from Flash Attention than Transformer-based TTI models that resemble the Decode phase. Since optimizations designed for LLMs do not map directly onto TTI/TTV models, we must conduct a thorough characterization of these workloads to gain insights for new optimization opportunities. In doing so, we define sequence length in the context of TTI/TTV models and observe sequence length can vary up to 4x in Diffusion model inference. We additionally observe temporal aspects of TTV workloads pose unique system bottlenecks, with Temporal Attention accounting for over 60% of total Attention time. Overall, our in-depth system performance characterization is a critical first step towards designing efficient and deployable systems for emerging TTI/TTV workloads. This paper provides the first in-depth system characterization of multi-modal text-to-image (TTI) and text-to-video (TTV) generation models, highlighting their unique system properties and performance bottlenecks compared to traditional LLMs. As Generative AI evolves beyond text generation towards higher-dimensional data like images and videos, understanding the system implications of TTI/TTV models is crucial for designing efficient and deployable systems for these emerging workloads. This is especially important given their growing usage in industry-scale datacenters. The authors systematically characterized the performance of eight representative TTI/TTV models, including diffusion and transformer-based architectures, on NVIDIA A100 GPUs, using tools like PyTorch Profiler and NVIDIA Nsight Compute. They analyzed operator breakdowns, sequence length variations, and the impact of scaling image size and temporal dimensions on system performance. After applying Flash Attention, Convolution emerges as the main bottleneck for Diffusion-based TTI models, consuming up to 44% of execution time. Sequence length in Diffusion models varies significantly during inference, unlike LLMs, and scales quadratically with image size, impacting memory requirements (O(L^4)). Temporal Attention in TTV models poses a unique bottleneck, consuming 2x the execution time of Spatial Attention despite requiring 9x fewer FLOPs, suggesting optimization opportunities. The analysis focuses on a limited set of open-source TTI/TTV models. Future work can explore optimization strategies tailored to the identified bottlenecks, such as efficient Convolution algorithms for Diffusion models and memory-efficient Temporal Attention mechanisms for TTV models. generative ai, multi-modal, diffusion model, transformer, sequence length, attention
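To make the sequence-length observation concrete, a back-of-envelope helper (not from the paper) shows why attention memory grows roughly quartically with image side length: latent tokens scale with the square of the resolution, and attention scores with the square of the token count.

```python
def attention_footprint(image_side: int, patch: int = 8):
    """Return (sequence length, attention-matrix entries) for a square latent grid."""
    seq_len = (image_side // patch) ** 2          # tokens scale as L^2
    return seq_len, seq_len * seq_len             # score storage scales as L^4

for side in (256, 512, 1024):
    seq, scores = attention_footprint(side)
    print(f"{side:>4}px -> seq_len={seq:>6}, score_entries={scores:,}")
```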
2312.14239 Report PlatoNeRF: 3D Reconstruction in Plato's Cave via Single-View Two-Bounce Lidar Tzofi Klinghoffer, Xiaoyu Xiang, Siddharth Somasundaram, Yuchen Fan, Christian Richardt, Ramesh Raskar, Rakesh Ranjan 3D reconstruction from a single-view is challenging because of the ambiguity from monocular cues and lack of information about occluded regions. Neural radiance fields (NeRF), while popular for view synthesis and 3D reconstruction, are typically reliant on multi-view images. Existing methods for single-view 3D reconstruction with NeRF rely on either data priors to hallucinate views of occluded regions, which may not be physically accurate, or shadows observed by RGB cameras, which are difficult to detect in ambient light and low albedo backgrounds. We propose using time-of-flight data captured by a single-photon avalanche diode to overcome these limitations. Our method models two-bounce optical paths with NeRF, using lidar transient data for supervision. By leveraging the advantages of both NeRF and two-bounce light measured by lidar, we demonstrate that we can reconstruct visible and occluded geometry without data priors or reliance on controlled ambient lighting or scene albedo. In addition, we demonstrate improved generalization under practical constraints on sensor spatial- and temporal-resolution. We believe our method is a promising direction as single-photon lidars become ubiquitous on consumer devices, such as phones, tablets, and headsets. PlatoNeRF reconstructs 3D scenes from a single viewpoint using time-of-flight data from a single-photon lidar, exploiting two-bounce light to infer geometry of both visible and occluded regions. Single-view 3D reconstruction with NeRF typically relies on data priors for hallucination or shadows from RGB images, both limited in accuracy. PlatoNeRF leverages physically-accurate lidar measurements for enhanced reconstruction. The method models two-bounce optical paths with NeRF, supervised by lidar transients. Primary rays determine depth and secondary rays determine shadowing. A combined distance and shadow loss function optimizes the NeRF model. PlatoNeRF outperforms baseline lidar and RGB-based methods in depth reconstruction accuracy on simulated scenes. It demonstrates robustness to low spatial- and temporal-resolution, ambient light, and low albedo backgrounds, advantageous for real-world applications. The method generalizes well to real-world lidar data, achieving competitive results with fewer artifacts than non-learning based methods. Current implementation only considers Lambertian reflectance, limiting its applicability to certain materials. Reliance on vanilla NeRF architecture can lead to occasional floaters in reconstructed geometry. single-view 3d reconstruction, neural radiance fields (nerf), single-photon lidar, time-of-flight imaging, two-bounce light
2312.14238 Report InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, Jifeng Dai The exponential growth of large language models (LLMs) has opened up numerous possibilities for multimodal AGI systems. However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the LLM, using web-scale image-text data from various sources. This model can be broadly applied to and achieve state-of-the-art performance on 32 generic visual-linguistic benchmarks including visual perception tasks such as image-level or pixel-level recognition, vision-language tasks such as zero-shot image/video classification, zero-shot image/video-text retrieval, and link with LLMs to create multi-modal dialogue systems. It has powerful visual capabilities and can be a good alternative to the ViT-22B. We hope that our research could contribute to the development of multi-modal large models. Code and models are available at https://github.com/OpenGVLab/InternVL. InternVL, a large-scale vision-language foundation model, aligns a scaled-up vision encoder (6 billion parameters) with a large language model (LLM), achieving state-of-the-art results on 32 visual and visual-linguistic tasks. Existing vision-language models suffer from disparity in parameter scales between vision and language components, inconsistent representations, and inefficient connection methods, hindering their effectiveness in tasks requiring both vision and language understanding. InternVL utilizes a three-stage progressive image-text alignment strategy: 1) contrastive learning on large-scale noisy image-text data, 2) generative learning on fine-grained data, and 3) supervised fine-tuning on instruction data for multi-modal dialogue. InternVL achieves state-of-the-art performance in zero-shot image classification across various ImageNet variants and ObjectNet, demonstrating robust generalization across different domains. It exhibits strong multilingual capabilities, outperforming previous methods on multilingual ImageNet-1K and image-text retrieval tasks. InternVL seamlessly integrates with existing LLMs, enabling effective multi-modal dialogue capabilities with superior performance on benchmarks like MME and POPE. The study primarily focuses on public data sources, future work could explore the impact of incorporating private datasets. While InternVL excels in many tasks, there's room for further investigation into more specialized visual-linguistic tasks requiring fine-grained understanding. vision-language foundation model, multi-modal dialogue, progressive alignment, zero-shot learning, large language models
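The first alignment stage is described as standard image-text contrastive learning on noisy web data; a generic CLIP-style InfoNCE sketch of that objective family is shown below (illustrative only, not InternVL's implementation).

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: matched image/text pairs sit on the batch diagonal."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature                # (B, B) similarity matrix
    targets = torch.arange(img.shape[0])                # diagonal entries are positives
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

print(contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)))
```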
2312.14233 Report VCoder: Versatile Vision Encoders for Multimodal Large Language Models Jitesh Jain, Jianwei Yang, Humphrey Shi Humans possess the remarkable skill of Visual Perception, the ability to see and understand the seen, helping them make sense of the visual world and, in turn, reason. Multimodal Large Language Models (MLLM) have recently achieved impressive performance on vision-language tasks ranging from visual question-answering and image captioning to visual reasoning and image generation. However, when prompted to identify or count (perceive) the entities in a given image, existing MLLM systems fail. Working towards developing an accurate MLLM system for perception and reasoning, we propose using Versatile vision enCoders (VCoder) as perception eyes for Multimodal LLMs. We feed the VCoder with perception modalities such as segmentation or depth maps, improving the MLLM's perception abilities. Secondly, we leverage the images from COCO and outputs from off-the-shelf vision perception models to create our COCO Segmentation Text (COST) dataset for training and evaluating MLLMs on the object perception task. Thirdly, we introduce metrics to assess the object perception abilities in MLLMs on our COST dataset. Lastly, we provide extensive experimental evidence proving the VCoder's improved object-level perception skills over existing Multimodal LLMs, including GPT-4V. We open-source our dataset, code, and models to promote research. We open-source our code at https://github.com/SHI-Labs/VCoder This paper introduces Versatile vision enCoders (VCoder) that enhance object perception abilities in Multimodal Large Language Models (MLLMs) by incorporating control inputs like segmentation and depth maps. Existing MLLMs, while proficient in complex visual reasoning, often struggle with basic object perception tasks such as accurate object identification and counting. This work aims to bridge this gap by improving the foundational object-level perception skills of MLLMs. The authors propose (1) a new dataset named COCO Segmentation Text (COST) designed to train and evaluate MLLMs on object identification and counting tasks; (2) VCoder, an adapter module that processes additional perception modalities as control inputs and integrates them into the MLLM framework; (3) novel metrics - Count Score (CS), Hallucination Score (HS), and Depth Score (DS) - to quantitatively assess the object perception performance of MLLMs. VCoder-adapted LLaVA-1.5 outperforms existing open-source MLLMs and GPT-4V on the COST dataset, demonstrating significant improvement in object identification and counting. Incorporating segmentation maps as control inputs considerably improves the MLLM's ability to perceive both salient and background objects. VCoder's ability to leverage depth maps as control input leads to substantial enhancement in predicting the order of objects in an image. The COST dataset, while a significant contribution, is limited by the object categories present in the COCO dataset and would benefit from expansion to include a wider variety of objects with varying granularity. The current evaluation metrics rely on one-to-one word matching for calculating scores, requiring manual mapping of synonyms. Exploring methods to overcome this limitation would be beneficial. multimodal large language models, object perception, vision-language models, object counting, hallucination
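The exact metric definitions are not reproduced here; the sketch below shows one plausible way to compute count- and hallucination-style scores from predicted versus ground-truth object counts. Both formulas are assumptions for illustration, not necessarily the paper's definitions.

```python
from collections import Counter

def count_score(pred, gt):
    """Fraction of ground-truth object instances the model counted correctly."""
    matched = sum(min(pred.get(k, 0), v) for k, v in gt.items())
    return matched / max(sum(gt.values()), 1)

def hallucination_score(pred, gt):
    """Fraction of predicted instances that do not exist in the ground truth."""
    extra = sum(max(c - gt.get(k, 0), 0) for k, c in pred.items())
    return extra / max(sum(pred.values()), 1)

gt = Counter({"person": 2, "dog": 1})
pred = Counter({"person": 2, "dog": 2, "cat": 1})
print(count_score(pred, gt), hallucination_score(pred, gt))  # 1.0 0.4
```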
2312.14232 Report Parrot Captions Teach CLIP to Spot Text Yiqi Lin, Conghui He, Alex Jinpeng Wang, Bin Wang, Weijia Li, Mike Zheng Shou Despite CLIP being the foundation model in numerous vision-language applications, the CLIP suffers from a severe text spotting bias. Such bias causes CLIP models to `Parrot' the visual text embedded within images while disregarding the authentic visual semantics. We uncover that in the most popular image-text dataset LAION-2B, the captions also densely parrot (spell) the text embedded in images. Our analysis shows that around 50% of images are embedded with visual text content, and around 30% of captions words are in these embedded visual content. Based on such observation, we thoroughly inspect the different released versions of CLIP models and verify that the visual text is the dominant factor in measuring the LAION-style image-text similarity for these models. To examine whether these parrot captions shape the text spotting bias, we train a series of CLIP models with LAION subsets curated by different parrot-caption-oriented criteria. We show that training with parrot captions easily shapes such bias but harms the expected visual-language representation learning in CLIP models. This suggests that it is urgent to revisit either the design of CLIP-like models or the existing image-text dataset curation pipeline built on CLIP score filtering. This paper reveals a significant text-spotting bias in CLIP models, where they heavily rely on recognizing text within images instead of understanding true visual semantics. This bias is linked to the prevalence of "parrot captions" in datasets like LAION-2B, which simply describe the text present in the images. This finding is crucial as CLIP, being a foundational model in many vision-language applications, might be exhibiting skewed behaviors. This bias can lead to inaccurate interpretations of visual content and hinder the development of robust and fair vision-language models. The authors analyze LAION-2B, finding a high correlation between image text and captions. They then conduct experiments by training CLIP models on different subsets of LAION-2B, curated based on the presence and extent of "parrot captions". Additionally, they assess the impact of removing text from images on CLIP's performance. Over 50% of images in LAION-2B contain embedded text, with around 30% of caption words directly parroting this text. Released CLIP models show a strong preference for image-text pairs containing parrot captions, achieving higher similarity scores for such pairs. Training CLIP models on datasets with a high proportion of parrot captions results in a strong text-spotting bias, negatively impacting their performance on downstream tasks. The text spotting model used for analysis might not be perfect, potentially leading to inaccurate estimations of text presence and correlation. Further investigation is needed to develop more robust and scalable data curation pipelines to mitigate the impact of parrot captions. clip, text spotting bias, parrot captions, vision-language models, dataset bias
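A minimal sketch of the kind of "parrot caption" statistic described, measuring what fraction of caption words also appear in text OCR'd from the image; the OCR output is assumed to come from any off-the-shelf text spotter.

```python
import re

def parrot_fraction(caption: str, ocr_text: str) -> float:
    """Share of caption words that also appear in the spotted image text."""
    cap_words = re.findall(r"[a-z0-9]+", caption.lower())
    ocr_words = set(re.findall(r"[a-z0-9]+", ocr_text.lower()))
    if not cap_words:
        return 0.0
    return sum(w in ocr_words for w in cap_words) / len(cap_words)

print(parrot_fraction("Vintage 'Best Dad Ever' coffee mug", "BEST DAD EVER"))  # 0.5
```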
2312.14216 Report DreamDistribution: Prompt Distribution Learning for Text-to-Image Diffusion Models Brian Nlong Zhao, Yuhang Xiao, Jiashu Xu, Xinyang Jiang, Yifan Yang, Dongsheng Li, Laurent Itti, Vibhav Vineet, Yunhao Ge The popularization of Text-to-Image (T2I) diffusion models enables the generation of high-quality images from text descriptions. However, generating diverse customized images with reference visual attributes remains challenging. This work focuses on personalizing T2I diffusion models at a more abstract concept or category level, adapting commonalities from a set of reference images while creating new instances with sufficient variations. We introduce a solution that allows a pretrained T2I diffusion model to learn a set of soft prompts, enabling the generation of novel images by sampling prompts from the learned distribution. These prompts offer text-guided editing capabilities and additional flexibility in controlling variation and mixing between multiple distributions. We also show the adaptability of the learned prompt distribution to other tasks, such as text-to-3D. Finally we demonstrate effectiveness of our approach through quantitative analysis including automatic evaluation and human assessment. Project website: https://briannlongzhao.github.io/DreamDistribution This paper proposes a method to personalize text-to-image generation using a set of user-provided reference images by learning a distribution of prompts. The proposed method enables diverse and personalized image generation while maintaining the text editability of pre-trained text-to-image diffusion models. The method learns a distribution of text embedding vectors from multiple learnable text prompts. These prompts are optimized to reconstruct user-provided reference images using a pre-trained diffusion model. The learned prompt distribution enables diverse image generation by sampling different prompts from it. The method allows users to control the generation diversity by scaling the standard deviation of the learned prompt distribution. Generated images using this method achieve high classification accuracy when used as synthetic training data, outperforming images generated from class names only. The number of learnable prompts is a hyperparameter that needs to be manually tuned. Training the prompt distribution requires a significant amount of time and resources. text-to-image generation, personalization, diffusion models, prompt learning, synthetic data
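A hedged sketch of the core sampling idea: fit a per-token mean and standard deviation over a set of learned soft prompts, then draw reparameterized samples with a user-controlled scale on the standard deviation to trade diversity against fidelity. Shapes and names are illustrative, not the authors' code.

```python
import torch

K, L, D = 32, 8, 768            # number of learnable prompts, tokens per prompt, embed dim
prompts = torch.randn(K, L, D)  # stand-in for the K optimized soft prompt embeddings

mu = prompts.mean(dim=0)                      # (L, D) per-token mean
sigma = prompts.std(dim=0, unbiased=False)    # (L, D) per-token std

def sample_prompt(scale: float = 1.0) -> torch.Tensor:
    """Reparameterized sample from the learned prompt distribution."""
    eps = torch.randn_like(mu)
    return mu + scale * sigma * eps            # (L, D) soft prompt to feed the T2I model

diverse = sample_prompt(scale=1.5)    # more variation across generations
faithful = sample_prompt(scale=0.5)   # closer to the mean prompt
print(diverse.shape, faithful.shape)
```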
2312.14198 Report ZeroShape: Regression-based Zero-shot Shape Reconstruction Zixuan Huang, Stefan Stojanov, Anh Thai, Varun Jampani, James M. Rehg We study the problem of single-image zero-shot 3D shape reconstruction. Recent works learn zero-shot shape reconstruction through generative modeling of 3D assets, but these models are computationally expensive at train and inference time. In contrast, the traditional approach to this problem is regression-based, where deterministic models are trained to directly regress the object shape. Such regression methods possess much higher computational efficiency than generative methods. This raises a natural question: is generative modeling necessary for high performance, or conversely, are regression-based approaches still competitive? To answer this, we design a strong regression-based model, called ZeroShape, based on the converging findings in this field and a novel insight. We also curate a large real-world evaluation benchmark, with objects from three different real-world 3D datasets. This evaluation benchmark is more diverse and an order of magnitude larger than what prior works use to quantitatively evaluate their models, aiming at reducing the evaluation variance in our field. We show that ZeroShape not only achieves superior performance over state-of-the-art methods, but also demonstrates significantly higher computational and data efficiency. This paper presents ZeroShape, a regression-based method for zero-shot 3D shape reconstruction from single images, achieving state-of-the-art performance while being computationally efficient. Zero-shot 3D shape reconstruction is crucial for various applications like AR and robotics, and current generative approaches, while impressive, suffer from high computational costs. This work explores the effectiveness of a more efficient regression-based approach. ZeroShape leverages a novel architecture with three modules: a depth and camera estimator, a geometric unprojection unit, and a projection-guided shape reconstructor. It is trained on a large synthetic dataset with diverse camera poses and lighting conditions. Additionally, a large-scale real-world evaluation benchmark is created to rigorously assess the model's performance. ZeroShape achieves state-of-the-art zero-shot performance on the proposed benchmark, outperforming existing methods including generative approaches. The model demonstrates significantly higher computational efficiency compared to generative counterparts, making it more suitable for real-world applications. Jointly learning depth and camera intrinsics for 3D visible surface estimation is crucial for achieving high accuracy. The model's performance with the full Objaverse dataset is yet to be explored due to computational constraints. The current work does not consider object texture modeling, which could be a promising future direction. zero-shot learning, 3d shape reconstruction, single image reconstruction, generative models, computer vision
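A minimal sketch of the geometric unprojection step described: an estimated depth map plus camera intrinsics is lifted to a camera-frame point map of the visible surface. Purely illustrative.

```python
import numpy as np

def unproject(depth: np.ndarray, fx, fy, cx, cy) -> np.ndarray:
    """Lift a depth map to camera-frame 3D points using pinhole intrinsics."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)   # (H, W, 3) visible-surface point map

depth = np.full((4, 4), 2.0, dtype=np.float32)
points = unproject(depth, fx=4.0, fy=4.0, cx=2.0, cy=2.0)
print(points.shape)  # (4, 4, 3)
```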
2312.14140 Report HeadCraft: Modeling High-Detail Shape Variations for Animated 3DMMs Artem Sevastopolsky, Philip-William Grassal, Simon Giebenhain, ShahRukh Athar, Luisa Verdoliva, Matthias Niessner Current advances in human head modeling allow to generate plausible-looking 3D head models via neural representations. Nevertheless, constructing complete high-fidelity head models with explicitly controlled animation remains an issue. Furthermore, completing the head geometry based on a partial observation, e.g. coming from a depth sensor, while preserving details is often problematic for the existing methods. We introduce a generative model for detailed 3D head meshes on top of an articulated 3DMM which allows explicit animation and high-detail preservation at the same time. Our method is trained in two stages. First, we register a parametric head model with vertex displacements to each mesh of the recently introduced NPHM dataset of accurate 3D head scans. The estimated displacements are baked into a hand-crafted UV layout. Second, we train a StyleGAN model in order to generalize over the UV maps of displacements. The decomposition of the parametric model and high-quality vertex displacements allows us to animate the model and modify it semantically. We demonstrate the results of unconditional generation and fitting to the full or partial observation. The project page is available at https://seva100.github.io/headcraft. A generative model for highly-detailed 3D human head meshes, built on top of an articulated 3D Morphable Model (3DMM) for explicit animation and detail preservation. Existing methods struggle to create high-fidelity, animatable head models, especially when reconstructing from partial observations like depth maps. This work combines the strengths of explicit parametric models and neural representations for detail. Two-stage registration of a subdivided FLAME template with vertex displacements to a dataset of 3D head scans. Displacements are baked into UV maps and used to train a StyleGAN2 model for generalization. The model generates diverse and highly-detailed heads, surpassing baselines in visual fidelity and quantitative metrics like FID, KID, IS, MMD, JSD, and COV. The model can be effectively fitted to full or partial point clouds, enabling reconstruction from depth maps. The use of an underlying 3DMM allows for realistic animation and semantic editing, such as hair transfer. The current model lacks an appearance component for color and relighting. Future work could incorporate a physics-based hair movement model for more realistic animation. 3d head modeling, generative models, 3d morphable models, stylegan, animation
2312.14132 Report DUSt3R: Geometric 3D Vision Made Easy Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, Jerome Revaud Multi-view stereo reconstruction (MVS) in the wild requires to first estimate the camera parameters e.g. intrinsic and extrinsic parameters. These are usually tedious and cumbersome to obtain, yet they are mandatory to triangulate corresponding pixels in 3D space, which is the core of all best performing MVS algorithms. In this work, we take an opposite stance and introduce DUSt3R, a radically novel paradigm for Dense and Unconstrained Stereo 3D Reconstruction of arbitrary image collections, i.e. operating without prior information about camera calibration nor viewpoint poses. We cast the pairwise reconstruction problem as a regression of pointmaps, relaxing the hard constraints of usual projective camera models. We show that this formulation smoothly unifies the monocular and binocular reconstruction cases. In the case where more than two images are provided, we further propose a simple yet effective global alignment strategy that expresses all pairwise pointmaps in a common reference frame. We base our network architecture on standard Transformer encoders and decoders, allowing us to leverage powerful pretrained models. Our formulation directly provides a 3D model of the scene as well as depth information, but interestingly, we can seamlessly recover from it, pixel matches, relative and absolute camera. Exhaustive experiments on all these tasks showcase that the proposed DUSt3R can unify various 3D vision tasks and set new SoTAs on monocular/multi-view depth estimation as well as relative pose estimation. In summary, DUSt3R makes many geometric 3D vision tasks easy. DUSt3R is a novel end-to-end deep learning approach for dense and unconstrained 3D reconstruction from uncalibrated and unposed image collections. It unifies monocular and multi-view stereo by regressing 3D pointmaps, simplifying the traditional reconstruction pipeline. Existing 3D reconstruction methods rely on complex pipelines with multiple independent steps, leading to error accumulation. They also struggle with uncalibrated images and often fail in challenging conditions like low scene views or non-Lambertian surfaces. This work aims to simplify the process and improve robustness. The core of DUSt3R is a transformer-based network trained to regress dense pointmaps from image pairs. These pointmaps encode scene geometry, pixel-to-point mapping, and viewpoint relations. A global alignment strategy extends the method to multiple views, aligning pairwise predictions in a common reference frame. DUSt3R achieves state-of-the-art results on monocular and multi-view depth estimation benchmarks without requiring ground-truth camera parameters. It demonstrates superior performance on multi-view camera pose estimation compared to existing learning-based and structure-based methods. The method produces accurate and consistent dense 3D reconstructions, even in challenging scenarios with uncalibrated images. While DUSt3R shows promising results for visual localization with known intrinsics, it faces limitations when intrinsics are unknown, particularly in outdoor scenes with sparse ground-truth. The regression-based nature of DUSt3R might limit its accuracy in 3D reconstruction compared to methods leveraging explicit camera parameters and sub-pixel triangulation. 3d reconstruction, uncalibrated images, pointmap regression, transformer networks, multi-view stereo
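A hedged sketch of a confidence-weighted pointmap regression objective in the spirit of the training described: the network predicts a 3D point and a confidence per pixel, with low-confidence pixels down-weighted but penalized through a log term. Names and the weighting constant are assumptions.

```python
import torch

def conf_pointmap_loss(pred_pts, gt_pts, conf, alpha=0.2):
    """pred_pts, gt_pts: (N, 3) per-pixel 3D points; conf: (N,) positive confidences."""
    err = (pred_pts - gt_pts).norm(dim=-1)             # per-pixel Euclidean error
    return (conf * err - alpha * torch.log(conf)).mean()  # log term discourages conf -> 0

pred = torch.randn(100, 3)
gt = pred + 0.01 * torch.randn(100, 3)
conf = torch.rand(100) + 0.5
print(conf_pointmap_loss(pred, gt, conf))
```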
2312.14091 Report HD-Painter: High-Resolution and Prompt-Faithful Text-Guided Image Inpainting with Diffusion Models Hayk Manukyan, Andranik Sargsyan, Barsegh Atanyan, Zhangyang Wang, Shant Navasardyan, Humphrey Shi Recent progress in text-guided image inpainting, based on the unprecedented success of text-to-image diffusion models, has led to exceptionally realistic and visually plausible results. However, there is still significant potential for improvement in current text-to-image inpainting models, particularly in better aligning the inpainted area with user prompts and performing high-resolution inpainting. Therefore, we introduce HD-Painter, a training free approach that accurately follows prompts and coherently scales to high resolution image inpainting. To this end, we design the Prompt-Aware Introverted Attention (PAIntA) layer enhancing self-attention scores by prompt information resulting in better text aligned generations. To further improve the prompt coherence we introduce the Reweighting Attention Score Guidance (RASG) mechanism seamlessly integrating a post-hoc sampling strategy into the general form of DDIM to prevent out-of-distribution latent shifts. Moreover, HD-Painter allows extension to larger scales by introducing a specialized super-resolution technique customized for inpainting, enabling the completion of missing regions in images of up to 2K resolution. Our experiments demonstrate that HD-Painter surpasses existing state-of-the-art approaches quantitatively and qualitatively across multiple metrics and a user study. Code is publicly available at: https://github.com/Picsart-AI-Research/HD-Painter Introduces HD-Painter, a training-free approach for text-guided image inpainting that excels in prompt alignment and high-resolution generation. Addresses the limitations of existing methods in aligning inpainted content with user prompts, particularly at high resolutions. Combines two novel components: Prompt-Aware Introverted Attention (PAIntA) to enhance prompt relevance in self-attention and Reweighting Attention Score Guidance (RASG) for domain-preserving post-hoc guidance. Employs a specialized super-resolution technique for upscaling. Outperforms state-of-the-art methods quantitatively across CLIP score, accuracy, aesthetic score, and PickScore. Demonstrates superior qualitative results, effectively addressing background and nearby object dominance issues. Enables high-resolution (up to 2048x2048) inpainting with seamless integration of known region details. Inherits some quality limitations from the backbone inpainting model, occasionally leading to illogical appearances. Future work can explore alternative upscaling techniques for further quality improvements in high-resolution inpainting. image inpainting, text-guided synthesis, diffusion models, prompt alignment, high-resolution
2312.13980 Report Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL Finetuning Desai Xie, Jiahao Li, Hao Tan, Xin Sun, Zhixin Shu, Yi Zhou, Sai Bi, Sören Pirk, Arie E. Kaufman Multi-view diffusion models, obtained by applying Supervised Finetuning (SFT) to text-to-image diffusion models, have driven recent breakthroughs in text-to-3D research. However, due to the limited size and quality of existing 3D datasets, they still suffer from multi-view inconsistencies and Neural Radiance Field (NeRF) reconstruction artifacts. We argue that multi-view diffusion models can benefit from further Reinforcement Learning Finetuning (RLFT), which allows models to learn from the data generated by themselves and improve beyond their dataset limitations during SFT. To this end, we introduce Carve3D, an improved RLFT algorithm coupled with a novel Multi-view Reconstruction Consistency (MRC) metric, to enhance the consistency of multi-view diffusion models. To measure the MRC metric on a set of multi-view images, we compare them with their corresponding NeRF renderings at the same camera viewpoints. The resulting model, which we denote as Carve3DM, demonstrates superior multi-view consistency and NeRF reconstruction quality than existing models. Our results suggest that pairing SFT with Carve3D's RLFT is essential for developing multi-view-consistent diffusion models, mirroring the standard Large Language Model (LLM) alignment pipeline. Our code, training and testing data, and video results are available at: https://desaixie.github.io/carve-3d. Introduces Carve3D, an improved RLFT algorithm paired with a novel Multi-view Reconstruction Consistency (MRC) metric to enhance the consistency of multi-view diffusion models for text-to-3D generation. Existing multi-view diffusion models, primarily trained with SFT, suffer from inconsistencies across generated views, leading to artifacts in 3D reconstructions. RLFT offers a way to improve consistency without being limited by the size and quality of existing 3D datasets. Develops MRC metric that compares generated multi-view images with renderings from a NeRF reconstructed from those images, using LPIPS for image similarity and bounding box normalization. Employs an improved on-policy DDPO algorithm for RLFT with KL divergence regularization to maintain proximity to the base model. Carve3DM, trained with Carve3D, achieves superior multi-view consistency and NeRF reconstruction quality compared to baselines like Instant3D and MVDream. Carve3DM preserves prompt alignment, diversity, and realistic details of the base model, avoiding the degradation seen in models with prolonged SFT. User study confirms that Carve3DM generates significantly more 3D-consistent results while maintaining comparable prompt alignment. Reconstruction quality is limited by the accuracy of the Sparse View LRM used. High computational cost of SDXL and DDPO limits further scaling up of data and batch size. text-to-3d, multi-view consistency, diffusion models, reinforcement learning finetuning, neural radiance fields
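A sketch of an MRC-style consistency score: re-render the NeRF fitted to the generated views at the same cameras and average an image distance between each view and its re-rendering. `render_nerf_at` and `image_distance` are stand-in hooks (the paper uses a sparse-view reconstructor and LPIPS), not real APIs.

```python
import numpy as np

def mrc_metric(generated_views, cameras, render_nerf_at, image_distance):
    """Average distance between each generated view and the NeRF's re-rendering
    at the same camera; lower means more multi-view consistent."""
    dists = [image_distance(img, render_nerf_at(cam))
             for img, cam in zip(generated_views, cameras)]
    return float(np.mean(dists))

# toy usage with dummy hooks standing in for the reconstructor and perceptual metric
views = [np.random.rand(32, 32, 3) for _ in range(4)]
score = mrc_metric(
    generated_views=views,
    cameras=list(range(4)),
    render_nerf_at=lambda cam: views[cam] + 0.01,        # pretend re-rendering
    image_distance=lambda a, b: float(np.abs(a - b).mean()),
)
print(score)
```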
2312.13964 Report PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models Yiming Zhang, Zhening Xing, Yanhong Zeng, Youqing Fang, Kai Chen Recent advancements in personalized text-to-image (T2I) models have revolutionized content creation, empowering non-experts to generate stunning images with unique styles. While promising, adding realistic motions into these personalized images by text poses significant challenges in preserving distinct styles, high-fidelity details, and achieving motion controllability by text. In this paper, we present PIA, a Personalized Image Animator that excels in aligning with condition images, achieving motion controllability by text, and the compatibility with various personalized T2I models without specific tuning. To achieve these goals, PIA builds upon a base T2I model with well-trained temporal alignment layers, allowing for the seamless transformation of any personalized T2I model into an image animation model. A key component of PIA is the introduction of the condition module, which utilizes the condition frame and inter-frame affinity as input to transfer appearance information guided by the affinity hint for individual frame synthesis in the latent space. This design mitigates the challenges of appearance-related image alignment and allows for a stronger focus on aligning with motion-related guidance. PIA, a personalized image animator that turns any personalized text-to-image model into an image animation model, allowing animation of stylized images while preserving their unique features. Existing methods struggle to animate personalized images while preserving their distinct styles, high-fidelity details, and achieving motion controllability via text prompts. PIA leverages a base T2I model, temporal alignment layers, and a novel condition module. The condition module takes the condition frame and inter-frame affinity as inputs, transferring appearance information to individual frames, thus improving alignment and allowing for better motion control. PIA demonstrates superior image alignment and motion controllability compared to state-of-the-art methods on the introduced AnimateBench benchmark. PIA allows users to control motion magnitude by adjusting the inter-frame affinity. PIA can even achieve style transfer effects when using models and input images from different domains. PIA may exhibit color discrepancies when applied to images with significantly different styles from the training data. Color inconsistencies can occur if trigger words used in personalized image generation are absent from animation prompts. image animation, personalized text-to-image, text-to-video synthesis, motion controllability, style transfer
2312.13913 Report Paint3D: Paint Anything 3D with Lighting-Less Texture Diffusion Models Xianfang Zeng, Xin Chen, Zhongqi Qi, Wen Liu, Zibo Zhao, Zhibin Wang, Bin Fu, Yong Liu, Gang Yu This paper presents Paint3D, a novel coarse-to-fine generative framework that is capable of producing high-resolution, lighting-less, and diverse 2K UV texture maps for untextured 3D meshes conditioned on text or image inputs. The key challenge addressed is generating high-quality textures without embedded illumination information, which allows the textures to be re-lighted or re-edited within modern graphics pipelines. To achieve this, our method first leverages a pre-trained depth-aware 2D diffusion model to generate view-conditional images and perform multi-view texture fusion, producing an initial coarse texture map. However, as 2D models cannot fully represent 3D shapes and disable lighting effects, the coarse texture map exhibits incomplete areas and illumination artifacts. To resolve this, we train separate UV Inpainting and UVHD diffusion models specialized for the shape-aware refinement of incomplete areas and the removal of illumination artifacts. Through this coarse-to-fine process, Paint3D can produce high-quality 2K UV textures that maintain semantic consistency while being lighting-less, significantly advancing the state-of-the-art in texturing 3D objects. The paper introduces Paint3D, a coarse-to-fine framework for generating high-quality, lighting-less 2K UV texture maps for 3D meshes using text or image prompts. Existing methods struggle to generate textures that are both high-quality and free from pre-illumination artifacts, limiting their compatibility with traditional rendering pipelines. The method utilizes a pre-trained 2D image diffusion model for initial multi-view texture generation and then refines the texture in UV space with specialized diffusion models for inpainting and high-definition enhancement. Paint3D outperforms state-of-the-art methods in text-to-texture and image-to-texture generation tasks on both qualitative and quantitative metrics. The coarse-to-fine strategy effectively combines the strengths of large-scale image generation priors and lighting-less texture refinement. The use of position maps in UV space guides the diffusion process to produce semantically consistent and visually appealing textures. The method can suffer from multi-face issues due to inconsistencies in multi-view images from the pre-trained 2D diffusion model. Paint3D currently does not generate material maps and cannot manipulate 3D geometry. texture synthesis, 3d model, diffusion models, coarse-to-fine, uv mapping
2312.13834 Report Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis Bichen Wu, Ching-Yao Chuang, Xiaoyan Wang, Yichen Jia, Kapil Krishnakumar, Tong Xiao, Feng Liang, Licheng Yu, Peter Vajda In this paper, we introduce Fairy, a minimalist yet robust adaptation of image-editing diffusion models, enhancing them for video editing applications. Our approach centers on the concept of anchor-based cross-frame attention, a mechanism that implicitly propagates diffusion features across frames, ensuring superior temporal coherence and high-fidelity synthesis. Fairy not only addresses limitations of previous models, including memory and processing speed. It also improves temporal consistency through a unique data augmentation strategy. This strategy renders the model equivariant to affine transformations in both source and target images. Remarkably efficient, Fairy generates 120-frame 512x384 videos (4-second duration at 30 FPS) in just 14 seconds, outpacing prior works by at least 44x. A comprehensive user study, involving 1000 generated samples, confirms that our approach delivers superior quality, decisively outperforming established methods. Fairy is a fast and robust video editing framework adapted from image diffusion models. It leverages anchor-based cross-frame attention for feature propagation, ensuring temporal consistency. Existing video editing methods struggle with temporal consistency, especially for complex videos with large motions. This work addresses these limitations, enabling high-quality, efficient video editing. The method uses a set of anchor frames to extract diffusion features. Cross-frame attention with these anchor features is applied to subsequent frames, ensuring consistency. The model is fine-tuned using an equivariant strategy with affine transformations for further consistency enhancement. Human evaluation of 1000 generated videos confirms superior quality over existing methods like Rerender, Tokenflow, and Gen-1. Quantitative metrics demonstrate improved temporal consistency and frame-wise editing accuracy compared to baselines. Fairy achieves significant speedup, being 53x faster than TokenFlow and 44x faster than Rerender when utilizing 8 GPUs. The model inherits limitations from the underlying image-editing model, such as difficulties with dynamic visual effects like lightning or flames. Instructions involving camera motion, like zooming in or out, are not handled effectively. video editing, diffusion models, temporal consistency, cross-frame attention, equivariant finetuning
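A minimal sketch of anchor-based cross-frame attention: every frame's queries attend to keys and values taken from a small set of anchor frames, so diffusion features are shared across the whole clip. Shapes are illustrative; this is not Fairy's code.

```python
import torch

def anchor_cross_frame_attention(q, anchor_k, anchor_v):
    """q: (T, N, D) queries for T frames with N tokens each.
    anchor_k, anchor_v: (A, N, D) keys/values extracted from A anchor frames."""
    T, N, D = q.shape
    k = anchor_k.reshape(1, -1, D).expand(T, -1, -1)   # every frame sees all anchor tokens
    v = anchor_v.reshape(1, -1, D).expand(T, -1, -1)
    attn = torch.softmax(q @ k.transpose(1, 2) / D**0.5, dim=-1)   # (T, N, A*N)
    return attn @ v                                                 # (T, N, D)

q = torch.randn(8, 16, 64)
ak, av = torch.randn(2, 16, 64), torch.randn(2, 16, 64)
print(anchor_cross_frame_attention(q, ak, av).shape)  # torch.Size([8, 16, 64])
```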
2312.13789 Report TinySAM: Pushing the Envelope for Efficient Segment Anything Model Han Shu, Wenshuo Li, Yehui Tang, Yiman Zhang, Yihao Chen, Houqiang Li, Yunhe Wang, Xinghao Chen Recently segment anything model (SAM) has shown powerful segmentation capability and has drawn great attention in computer vision fields. Massive following works have developed various applications based on the pretrained SAM and achieved impressive performance on downstream vision tasks. However, SAM consists of heavy architectures and requires massive computational capacity, which hinders the further application of SAM on computation constrained edge devices. To this end, in this paper we propose a framework to obtain a tiny segment anything model (TinySAM) while maintaining the strong zero-shot performance. We first propose a full-stage knowledge distillation method with hard prompt sampling and hard mask weighting strategy to distill a lightweight student model. We also adapt the post-training quantization to the promptable segmentation task and further reduce the computational cost. Moreover, a hierarchical segmenting everything strategy is proposed to accelerate the everything inference by 2x with almost no performance degradation. With all these proposed methods, our TinySAM leads to orders of magnitude computational reduction and pushes the envelope for efficient segment anything task. Extensive experiments on various zero-shot transfer tasks demonstrate the significantly advantageous performance of our TinySAM against counterpart methods. Pre-trained models and codes are available at https://github.com/xinghaochen/TinySAM and https://gitee.com/mindspore/models/tree/master/research/cv/TinySAM. This paper presents TinySAM, a highly efficient framework for segmenting anything, which significantly reduces computational cost while maintaining strong zero-shot segmentation capabilities. The existing Segment Anything Model (SAM), though powerful, has a heavy architecture and high computational demands, hindering its deployment on resource-constrained devices. The TinySAM framework employs three key techniques: 1) Hard Mining Full-Stage Knowledge Distillation to train a lightweight image encoder with guidance from the original SAM. 2) Post-Training Quantization adapted for promptable segmentation to further reduce computational complexity. 3) Hierarchical Segmenting Everything strategy to accelerate inference by reducing redundant computations. TinySAM achieves superior performance compared to other efficient SAM variants, exhibiting a 4% AP improvement over FastSAM with only 12.2% FLOPs and 25% latency. The proposed model outperforms MobileSAM in zero-shot instance segmentation tasks on COCO and LVIS datasets, demonstrating higher accuracy with the same computational cost. The hierarchical everything inference strategy reduces inference time by approximately 50% while maintaining comparable results to the original points grid strategy. The performance of TinySAM with quantization, while significantly more efficient, still lags behind the full-precision model. The hierarchical everything inference strategy relies on pre-defined thresholds and may require adjustments for different datasets or applications. segment anything model, knowledge distillation, model quantization, efficient inference, zero-shot segmentation
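A hedged sketch of a hard-mask-weighted distillation term, roughly in the spirit of the hard mask weighting described: pixels where teacher and student masks disagree most receive larger weight. The weighting form and `gamma` are assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def hard_weighted_mask_kd(student_logits, teacher_logits, gamma=2.0):
    """Distill teacher mask logits into the student, emphasizing hard pixels."""
    with torch.no_grad():
        disagreement = (torch.sigmoid(student_logits) - torch.sigmoid(teacher_logits)).abs()
        weights = 1.0 + gamma * disagreement        # larger weight where masks disagree
    per_pixel = F.mse_loss(student_logits, teacher_logits, reduction="none")
    return (weights * per_pixel).mean()

s = torch.randn(1, 1, 64, 64)   # student mask logits
t = torch.randn(1, 1, 64, 64)   # teacher (SAM) mask logits
print(hard_weighted_mask_kd(s, t))
```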
2312.13770 Report 3D Points Splatting for Real-Time Dynamic Hand Reconstruction Zheheng Jiang, Hossein Rahmani, Sue Black, Bryan M. Williams We present 3D Points Splatting Hand Reconstruction (3D-PSHR), a real-time and photo-realistic hand reconstruction approach. We propose a self-adaptive canonical points upsampling strategy to achieve high-resolution hand geometry representation. This is followed by a self-adaptive deformation that deforms the hand from the canonical space to the target pose, adapting to the dynamic changing of canonical points which, in contrast to the common practice of subdividing the MANO model, offers greater flexibility and results in improved geometry fitting. To model texture, we disentangle the appearance color into the intrinsic albedo and pose-aware shading, which are learned through a Context-Attention module. Moreover, our approach allows the geometric and the appearance models to be trained simultaneously in an end-to-end manner. We demonstrate that our method is capable of producing animatable, photorealistic and relightable hand reconstructions using multiple datasets, including monocular videos captured with handheld smartphones and large-scale multi-view videos featuring various hand poses. We also demonstrate that our approach achieves real-time rendering speeds while simultaneously maintaining superior performance compared to existing state-of-the-art methods. This supplementary material provides further details on the 3D points splatting method for real-time dynamic hand reconstruction presented in the main paper, including context-attention modules, ablation studies, training algorithm, and additional comparisons with state-of-the-art methods. This approach addresses limitations in existing hand reconstruction methods by introducing a novel 3D point splatting technique that enables efficient and accurate reconstruction of dynamic hand poses from monocular images. The method utilizes a differentiable renderer and a learned canonical representation of the hand. It optimizes a set of canonical points, their corresponding colors, and shading parameters to reconstruct the hand's 3D shape and appearance. The proposed method outperforms state-of-the-art methods in terms of both geometry and appearance reconstruction quality on the Hand Appearance Dataset. Ablation studies demonstrate the effectiveness of different components of the proposed method, such as the context-attention modules, loss functions, and training algorithm. The method achieves real-time performance while maintaining high reconstruction accuracy. The method relies on a pre-defined hand template (MANO model), which may limit its ability to generalize to hands with significant shape variations. Future work could explore incorporating hand shape parameters into the learning process to improve generalization. hand reconstruction, 3d point splatting, differentiable rendering, canonical representation, real-time
2312.13735 Report DECO: Query-Based End-to-End Object Detection with ConvNets Xinghao Chen, Siwei Li, Yijing Yang, Yunhe Wang Detection Transformer (DETR) and its variants have shown great potential for accurate object detection in recent years. The mechanism of object query enables DETR family to directly obtain a fixed number of object predictions and streamlines the detection pipeline. Meanwhile, recent studies also reveal that with proper architecture design, convolution networks (ConvNets) also achieve competitive performance with transformers, e.g., ConvNeXt. To this end, in this paper we explore whether we could build a query-based end-to-end object detection framework with ConvNets instead of sophisticated transformer architecture. The proposed framework, i.e., Detection ConvNet (DECO), is composed of a backbone and convolutional encoder-decoder architecture. We carefully design the DECO encoder and propose a novel mechanism for our DECO decoder to perform interaction between object queries and image features via convolutional layers. We compare the proposed DECO against prior detectors on the challenging COCO benchmark. Despite its simplicity, our DECO achieves competitive performance in terms of detection accuracy and running speed. Specifically, with the ResNet-50 and ConvNeXt-Tiny backbone, DECO obtains 38.6% and 40.8% AP on COCO val set with 35 and 28 FPS respectively and outperforms the DETR model. Incorporated with advanced multi-scale feature module, our DECO+ achieves 47.8% AP with 34 FPS. We hope the proposed DECO brings another perspective for designing object detection framework. This paper introduces DECO, a novel end-to-end object detection framework built solely on convolutional neural networks (CNNs) while adopting the query-based prediction mechanism of DETR (Detection Transformer). The motivation stems from recent studies demonstrating the competitiveness of well-designed ConvNets against transformers in various vision tasks. This, coupled with the potential benefits of query-based detection, like eliminating the need for Non-Maximum Suppression (NMS), led to the exploration of a CNN-based alternative to DETR. DECO comprises a CNN backbone, an encoder built upon ConvNeXt blocks, and a novel decoder designed for object query and image feature interaction. The decoder leverages depthwise and 1x1 convolutions, along with upsampling and pooling operations, to facilitate this interaction, deviating from the attention-based mechanism employed in DETR. DECO achieves competitive performance on the COCO benchmark in terms of accuracy and speed, surpassing DETR in both aspects. With ResNet-50 and ConvNeXt-Tiny backbones, DECO attains 38.6% and 40.8% AP on COCO validation set at 35 and 28 FPS, respectively. The enhanced version, DECO+, incorporating multi-scale features, further boosts the performance to 47.8% AP with 34 FPS. One limitation is the absence of specialized techniques like deformable attention or denoising training tailored for CNN-based architectures. Future work could focus on incorporating these strategies into DECO to potentially further enhance its performance. object detection, convolutional neural networks, end-to-end detection, query-based detection, detr
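A schematic sketch of the convolution-only query-feature interaction described: queries are laid out on a small 2D grid, upsampled to the feature-map size, fused by depthwise and 1x1 convolutions, then pooled back to per-query vectors. This illustrates the mechanism, not DECO's actual decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvQueryInteraction(nn.Module):
    """Toy conv-based interaction between object queries and image features."""
    def __init__(self, dim=256, grid=10):
        super().__init__()
        self.grid = grid
        self.depthwise = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.pointwise = nn.Conv2d(dim, dim, 1)

    def forward(self, queries, feat):
        # queries: (B, grid*grid, dim); feat: (B, dim, H, W)
        B, _, D = queries.shape
        H, W = feat.shape[-2:]
        q = queries.transpose(1, 2).reshape(B, D, self.grid, self.grid)
        q = F.interpolate(q, size=(H, W), mode="nearest")     # upsample the query map
        fused = self.pointwise(self.depthwise(q + feat))      # conv-based fusion
        out = F.adaptive_max_pool2d(fused, self.grid)         # pool back to the query grid
        return out.flatten(2).transpose(1, 2)                 # (B, grid*grid, dim)

m = ConvQueryInteraction()
print(m(torch.randn(2, 100, 256), torch.randn(2, 256, 32, 32)).shape)  # (2, 100, 256)
```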
2312.13729 Report Gaussian Splatting with NeRF-based Color and Opacity Dawid Malarz, Weronika Smolak, Jacek Tabor, Sławomir Tadeja, Przemysław Spurek Neural Radiance Fields (NeRFs) have demonstrated the remarkable potential of neural networks to capture the intricacies of 3D objects. By encoding the shape and color information within neural network weights, NeRFs excel at producing strikingly sharp novel views of 3D objects. Recently, numerous generalizations of NeRFs utilizing generative models have emerged, expanding its versatility. In contrast, Gaussian Splatting (GS) offers a similar render quality with faster training and inference as it does not need neural networks to work. We encode information about the 3D objects in the set of Gaussian distributions that can be rendered in 3D similarly to classical meshes. Unfortunately, GS are difficult to condition since they usually require circa hundred thousand Gaussian components. To mitigate the caveats of both models, we propose a hybrid model Viewing Direction Gaussian Splatting (VDGS) that uses GS representation of the 3D object's shape and NeRF-based encoding of color and opacity. Our model uses Gaussian distributions with trainable positions (i.e. means of Gaussian), shape (i.e. covariance of Gaussian), color and opacity, and neural network, which takes parameters of Gaussian and viewing direction to produce changes in color and opacity. Consequently, our model better describes shadows, light reflections, and transparency of 3D objects. This paper proposes Viewing Direction Gaussian Splatting (VDGS), a hybrid neural rendering method that combines Gaussian Splatting (GS) with NeRF-based color and opacity encoding. VDGS aims to combine the speed of GS with the view-dependent effects of NeRF, leading to faster training and inference while better modeling shadows, reflections, and transparency in 3D objects. VDGS utilizes GS to represent the 3D object's shape and a NeRF-based neural network to predict view-dependent changes in the color and opacity of the Gaussian components. VDGS achieves better quantitative results than both GS and neural rendering methods on the NeRF Synthetic dataset. VDGS effectively models light reflections and shadows on the Tanks and Temples dataset, outperforming both GS and NeRF in most cases. VDGS demonstrates comparable performance to NeRF and GS on the Shiny Blender dataset, showcasing its ability to handle reflective surfaces. The paper acknowledges a slightly longer training and inference time compared to pure GS due to the added neural network. Future work could explore alternative ways to combine color and opacity updates or investigate the impact of pre-training the GS component. neural rendering, gaussian splatting, nerf, 3d object representation, view synthesis
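A minimal sketch of the hybrid idea: keep per-Gaussian base color and opacity from Gaussian Splatting, and let a small MLP conditioned on Gaussian parameters and viewing direction predict view-dependent offsets. Input dimensionality and the bounding choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ViewDependentOffsets(nn.Module):
    """MLP mapping (Gaussian parameters, view direction) -> color/opacity offsets."""
    def __init__(self, gauss_dim=14, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(gauss_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),              # 3 color offsets + 1 opacity offset
        )

    def forward(self, gauss_params, view_dir):
        d = view_dir / view_dir.norm(dim=-1, keepdim=True)
        x = torch.cat([gauss_params, d.expand(gauss_params.shape[0], 3)], dim=-1)
        out = self.net(x)
        return torch.tanh(out[:, :3]), torch.tanh(out[:, 3:])   # bounded offsets

n = 1000
base_color, base_opacity = torch.rand(n, 3), torch.rand(n, 1)
dc, do = ViewDependentOffsets()(torch.randn(n, 14), torch.tensor([0.0, 0.0, 1.0]))
color = (base_color + dc).clamp(0, 1)       # view-dependent color fed to the rasterizer
opacity = (base_opacity + do).clamp(0, 1)
print(color.shape, opacity.shape)
```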
2312.13691 Report DreamTuner: Single Image is Enough for Subject-Driven Generation Miao Hua, Jiawei Liu, Fei Ding, Wei Liu, Jie Wu, Qian He Diffusion-based models have demonstrated impressive capabilities for text-to-image generation and are expected for personalized applications of subject-driven generation, which require the generation of customized concepts with one or a few reference images. However, existing methods based on fine-tuning fail to balance the trade-off between subject learning and the maintenance of the generation capabilities of pretrained models. Moreover, other methods that utilize additional image encoders tend to lose important details of the subject due to encoding compression. To address these challenges, we propose DreamTurner, a novel method that injects reference information from coarse to fine to achieve subject-driven image generation more effectively. DreamTurner introduces a subject-encoder for coarse subject identity preservation, where the compressed general subject features are introduced through an attention layer before visual-text cross-attention. We then modify the self-attention layers within pretrained text-to-image models to self-subject-attention layers to refine the details of the target subject. The generated image queries detailed features from both the reference image and itself in self-subject-attention. It is worth emphasizing that self-subject-attention is an effective, elegant, and training-free method for maintaining the detailed features of customized subjects and can serve as a plug-and-play solution during inference. Finally, with additional subject-driven fine-tuning, DreamTurner achieves remarkable performance in subject-driven image generation, which can be controlled by a text or other conditions such as pose. For further details, please visit the project page at https://dreamtuner-diffusion.github.io/. Proposes DreamTuner, a subject-driven image generation method that uses a single reference image to generate new images of a specific subject in different scenes guided by text or pose. Personalized image generation with customized subjects is in high demand for various applications, but existing methods struggle to balance subject identity preservation and model controllability. Combines a subject-encoder for coarse identity preservation, self-subject-attention for fine identity details, and a subject-driven fine-tuning stage to optimize the model for a specific subject. Achieves high-fidelity subject-driven image generation with a single reference image. Outperforms existing methods in terms of subject fidelity and prompt consistency. Demonstrates strong capability in generating images with detailed subject features while adapting to different text prompts and poses. Training the subject-encoder and fine-tuning the model require additional computational resources. Further exploration on extending the method to handle multiple subjects in a single image. image generation, diffusion models, subject-driven generation, self-attention, text-to-image
2312.13663 Report Free-Editor: Zero-shot Text-driven 3D Scene Editing Nazmul Karim, Umar Khalid, Hasan Iqbal, Jing Hua, Chen Chen Text-to-Image (T2I) diffusion models have gained popularity recently due to their multipurpose and easy-to-use nature, e.g. image and video generation as well as editing. However, training a diffusion model specifically for 3D scene editing is not straightforward due to the lack of large-scale datasets. To date, editing 3D scenes requires either re-training the model to adapt to various 3D edited scenes or design-specific methods for each special editing type. Furthermore, state-of-the-art (SOTA) methods require multiple synchronized edited images from the same scene to facilitate the scene editing. Due to the current limitations of T2I models, it is very challenging to apply consistent editing effects to multiple images, i.e. multi-view inconsistency in editing. This in turn compromises the desired 3D scene editing performance if these images are used. In our work, we propose a novel training-free 3D scene editing technique, Free-Editor, which allows users to edit 3D scenes without further re-training the model during test time. Our proposed method successfully avoids the multi-view style inconsistency issue in SOTA methods with the help of a "single-view editing" scheme. Specifically, we show that editing a particular 3D scene can be performed by only modifying a single view. To this end, we introduce an Edit Transformer that enforces intra-view consistency and inter-view style transfer by utilizing self- and cross-attention, respectively. Since it is no longer required to re-train the model and edit every view in a scene, the editing time, as well as memory resources, are reduced significantly, e.g., the runtime being ~20x faster than SOTA. We have conducted extensive experiments on a wide range of benchmark datasets and achieve diverse editing capabilities with our proposed technique. Proposes Free-Editor, a zero-shot text-guided 3D scene editing technique that synthesizes novel views based on a text description while maintaining 3D consistency without retraining. Existing 3D scene editing methods using T2I diffusion models require retraining for each scene or editing type, leading to computational overhead and limitations in practical applications. Leverages a generalized NeRF model and introduces an Edit Transformer to transfer style information from a single edited starting view to target views using cross-attention. Employs multi-view consistency and self-view robust losses for spatial smoothness and color consistency. Achieves state-of-the-art performance in text-driven 3D scene editing, accurately reflecting text descriptions in novel views. Demonstrates superior efficiency compared to previous methods, achieving significantly faster editing time and constant space complexity due to the zero-shot nature. Maintains 3D consistency and preserves details of the original scene while effectively implementing edits. Relies on successful 2D pre-editing of the starting view, where failures can adversely affect 3D editing outcomes. Addressing multi-view inconsistency in edited images requires careful consideration, potentially through trial and error or a view-filtering system. 3d scene editing, text-guided image synthesis, neural radiance fields (nerf), diffusion models, zero-shot learning
2312.13578 Report DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for Single Image Talking Face Generation Chenxu Zhang, Chao Wang, Jianfeng Zhang, Hongyi Xu, Guoxian Song, You Xie, Linjie Luo, Yapeng Tian, Xiaohu Guo, Jiashi Feng The generation of emotional talking faces from a single portrait image remains a significant challenge. The simultaneous achievement of expressive emotional talking and accurate lip-sync is particularly difficult, as expressiveness is often compromised for the accuracy of lip-sync. As widely adopted by many prior works, the LSTM network often fails to capture the subtleties and variations of emotional expressions. To address these challenges, we introduce DREAM-Talk, a two-stage diffusion-based audio-driven framework, tailored for generating diverse expressions and accurate lip-sync concurrently. In the first stage, we propose EmoDiff, a novel diffusion module that generates diverse highly dynamic emotional expressions and head poses in accordance with the audio and the referenced emotion style. Given the strong correlation between lip motion and audio, we then refine the dynamics with enhanced lip-sync accuracy using audio features and emotion style. To this end, we deploy a video-to-video rendering module to transfer the expressions and lip motions from our proxy 3D avatar to an arbitrary portrait. Both quantitatively and qualitatively, DREAM-Talk outperforms state-of-the-art methods in terms of expressiveness, lip-sync accuracy and perceptual quality. DREAM-Talk, a novel two-stage diffusion-based framework, generates photorealistic, lip-synchronized talking face videos with high-quality emotional expressions from a single portrait image, audio, and emotion style example. Existing methods struggle to simultaneously achieve expressive emotional talking and accurate lip-sync, often compromising expressiveness for lip-sync accuracy. DREAM-Talk uses a two-stage pipeline: 1) **EmoDiff Module**: An emotion-conditioned diffusion model generates dynamic emotional expressions and head poses from audio and emotion style. 2) **Lip Refinement**: A lip-sync refinement network enhances lip-sync accuracy using audio and emotion style while preserving emotional expressiveness. Finally, a video-to-video rendering module transfers expressions and lip motions to an arbitrary portrait. Outperforms state-of-the-art methods in expressiveness, lip-sync accuracy, and perceptual quality. Effectively captures high-frequency facial details and subtle variations in emotional expressions. Demonstrates superior performance in both quantitative metrics and subjective user studies. Relies on the accuracy of pre-trained emotion recognition models. Limited generalization ability to unseen emotional expressions or speaking styles. talking face generation, emotional expression, lip sync, diffusion models, deep learning
2312.13528 Report DyBluRF: Dynamic Deblurring Neural Radiance Fields for Blurry Monocular Video Minh-Quan Viet Bui, Jongmin Park, Jihyong Oh, Munchurl Kim Neural Radiance Fields (NeRF), initially developed for static scenes, have inspired many video novel view synthesis techniques. However, the challenge for video view synthesis arises from motion blur, a consequence of object or camera movement during exposure, which hinders the precise synthesis of sharp spatio-temporal views. In response, we propose a novel dynamic deblurring NeRF framework for blurry monocular video, called DyBluRF, consisting of a Base Ray Initialization (BRI) stage and a Motion Decomposition-based Deblurring (MDD) stage. Our DyBluRF is the first that handles the novel view synthesis for blurry monocular video with a novel two-stage framework. In the BRI stage, we coarsely reconstruct dynamic 3D scenes and jointly initialize the base ray, which is further used to predict latent sharp rays, using the inaccurate camera pose information from the given blurry frames. In the MDD stage, we introduce a novel Incremental Latent Sharp-rays Prediction (ILSP) approach for the blurry monocular video frames by decomposing the latent sharp rays into global camera motion and local object motion components. We further propose two loss functions for effective geometry regularization and decomposition of static and dynamic scene components without any mask supervision. Experiments show that DyBluRF outperforms qualitatively and quantitatively the SOTA methods. DyBluRF, a novel dynamic deblurring Neural Radiance Field (NeRF) framework, synthesizes sharp, novel spatio-temporal views from blurry monocular videos. Existing video view synthesis methods struggle with motion blur in casually captured monocular videos, hindering the generation of sharp, temporally consistent novel views. DyBluRF employs a two-stage approach: (1) Base Ray Initialization (BRI) coarsely reconstructs the 3D scene and initializes base rays from imprecise camera poses; (2) Motion Decomposition-based Deblurring (MDD) refines the rays by considering global camera and local object motion, simulating the blur process during training. DyBluRF significantly outperforms state-of-the-art methods in novel view synthesis from blurry monocular videos, both qualitatively and quantitatively. The proposed Unsupervised Staticness Maximization and Local Geometry Variance Distillation losses enable robust decomposition of static and dynamic scene components and accurate geometry reconstruction, respectively. DyBluRF demonstrates robustness against varying degrees of blurriness, maintaining consistent performance across different dataset capture qualities. DyBluRF's performance is limited by the diversity of training and validation views in the dataset, leading to overfitting in scenarios with significantly different lighting conditions. Future work includes integrating Gaussian Splatting-based shading networks for improved training and rendering efficiency. deblurring nerf, dynamic nerf, video view synthesis, motion blur, monocular video
2312.13324 Report ShowRoom3D: Text to High-Quality 3D Room Generation Using 3D Priors Weijia Mao, Yan-Pei Cao, Jia-Wei Liu, Zhongcong Xu, Mike Zheng Shou We introduce ShowRoom3D, a three-stage approach for generating high-quality 3D room-scale scenes from texts. Previous methods using 2D diffusion priors to optimize neural radiance fields for generating room-scale scenes have shown unsatisfactory quality. This is primarily attributed to the limitations of 2D priors lacking 3D awareness and constraints in the training methodology. In this paper, we utilize a 3D diffusion prior, MVDiffusion, to optimize the 3D room-scale scene. Our contributions are in two aspects. Firstly, we propose a progressive view selection process to optimize NeRF. This involves dividing the training process into three stages, gradually expanding the camera sampling scope. Secondly, we propose a pose transformation method in the second stage, which ensures that MVDiffusion provides accurate view guidance. As a result, ShowRoom3D enables the generation of rooms with improved structural integrity, enhanced clarity from any view, reduced content repetition, and higher consistency across different perspectives. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches by a large margin in user studies. Introduces ShowRoom3D, a novel three-stage pipeline utilizing a 3D diffusion prior (MVDiffusion) to optimize NeRF for generating high-quality 3D room-scale scenes from text prompts. Generating high-quality 3D room-scale scenes is crucial for various industries, including VR/AR and the Metaverse. Existing methods face challenges such as the Janus problem, unreasonable room structures, and style inconsistencies. Combines MVDiffusion and NeRF using a progressive view selection approach: (1) Generates a panoramic view to determine room structure and geometry. (2) Improves geometry and layout by adding training views from various positions facing outward. Introduces pose transformation for accurate view guidance. (3) Freely positions the camera and applies rotations to fine-tune the NeRF model for rendering from any position and rotation. Generates rooms with improved structural integrity and clarity from any view. Reduces content repetition and enhances consistency across different perspectives. Significantly outperforms state-of-the-art approaches in user studies, demonstrating superior overall quality, text alignment, and consistency. Generated results exhibit oversaturation despite employing techniques to mitigate the issue. The three-stage training process is time-consuming. 3d scene generation, text-to-3d, diffusion models, neural radiance fields (nerf), score distillation sampling (sds)
2312.13314 Report Unlocking Pre-trained Image Backbones for Semantic Image Synthesis Tariq Berrada, Jakob Verbeek, Camille Couprie, Karteek Alahari Semantic image synthesis, i.e., generating images from user-provided semantic label maps, is an important conditional image generation task as it allows to control both the content as well as the spatial layout of generated images. Although diffusion models have pushed the state of the art in generative image modeling, the iterative nature of their inference process makes them computationally demanding. Other approaches such as GANs are more efficient as they only need a single feed-forward pass for generation, but the image quality tends to suffer on large and diverse datasets. In this work, we propose a new class of GAN discriminators for semantic image synthesis that generates highly realistic images by exploiting feature backbone networks pre-trained for tasks such as image classification. We also introduce a new generator architecture with better context modeling and using cross-attention to inject noise into latent variables, leading to more diverse generated images. Our model, which we dub DP-SIMS, achieves state-of-the-art results in terms of image quality and consistency with the input label maps on ADE-20K, COCO-Stuff, and Cityscapes, surpassing recent diffusion models while requiring two orders of magnitude less compute for inference. This paper introduces a novel GAN-based semantic image synthesis model that leverages pre-trained image backbones as encoders in the discriminator, enhancing image quality and consistency with input segmentation masks. Semantic image synthesis is crucial for tasks demanding precise control over object location and boundaries, such as photo editing and data augmentation, where existing methods struggle with quality, consistency, or speed. The proposed method uses a UNet-like discriminator with a pre-trained backbone encoder and a trainable decoder. It introduces a novel generator architecture with cross-attention for noise injection and employs a contrastive loss and diversity constraint during training. Achieves state-of-the-art performance in FID and mIoU across COCO, ADE20k, and Cityscapes datasets. Outperforms recent diffusion models in quality and consistency while being significantly faster in inference (two orders of magnitude). Demonstrates the effectiveness of pre-trained backbones, feature conditioning, and novel architectural modifications through extensive ablations. Transformer-based backbones, like Swin, exhibit instability during training and require further investigation. Exploration of larger encoder architectures for handling more complex datasets could be beneficial. semantic image synthesis, generative adversarial networks (gans), pre-trained backbones, contrastive learning, image generation
2312.13308 Report SWAGS: Sampling Windows Adaptively for Dynamic 3D Gaussian Splatting Richard Shaw, Jifei Song, Arthur Moreau, Michal Nazarczuk, Sibi Catley-Chandar, Helisa Dhamo, Eduardo Perez-Pellitero Novel view synthesis has shown rapid progress recently, with methods capable of producing evermore photo-realistic results. 3D Gaussian Splatting has emerged as a particularly promising method, producing high-quality renderings of static scenes and enabling interactive viewing at real-time frame rates. However, it is currently limited to static scenes only. In this work, we extend 3D Gaussian Splatting to reconstruct dynamic scenes. We model the dynamics of a scene using a tunable MLP, which learns the deformation field from a canonical space to a set of 3D Gaussians per frame. To disentangle the static and dynamic parts of the scene, we learn a tuneable parameter for each Gaussian, which weighs the respective MLP parameters to focus attention on the dynamic parts. This improves the model's ability to capture dynamics in scenes with an imbalance of static to dynamic regions. To handle scenes of arbitrary length whilst maintaining high rendering quality, we introduce an adaptive window sampling strategy to partition the sequence into windows based on the amount of movement in the sequence. We train a separate dynamic Gaussian Splatting model for each window, allowing the canonical representation to change, thus enabling the reconstruction of scenes with significant geometric or topological changes. Temporal consistency is enforced using a fine-tuning step with self-supervising consistency loss on randomly sampled novel views. As a result, our method produces high-quality renderings of general dynamic scenes with competitive quantitative performance, which can be viewed in real-time with our dynamic interactive viewer. This paper introduces a novel method for high-quality, real-time novel view synthesis of dynamic scenes using an extension of 3D Gaussian Splatting. Existing methods struggle with long sequences, complex motions, and often lack temporal consistency. This work addresses these issues to achieve realistic and efficient dynamic scene rendering. The method employs adaptive window sampling based on motion, learns per-window canonical representations and deformation fields using tuneable MLPs, and ensures temporal consistency through a fine-tuning step with a self-supervised consistency loss. The method achieves state-of-the-art PSNR and SSIM performance on the Neural 3D Video dataset. It enables real-time interactive viewing of dynamic scenes. The adaptive window sampling and temporal consistency fine-tuning effectively handle complex motions and long sequences without distracting flickering. The method relies on pre-computed camera parameters and may be sensitive to their accuracy. Future work could explore joint optimization of camera poses and scene representation. novel view synthesis, dynamic scene reconstruction, 3d gaussian splatting, temporal consistency, neural rendering
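A minimal sketch of the adaptive window sampling described in this entry: frames are greedily grouped into a window until the accumulated motion exceeds a budget, so highly dynamic parts of the sequence get shorter windows (each window would then be fitted with its own dynamic Gaussian Splatting model). The per-frame motion measure, budget, and minimum length are illustrative placeholders, not the paper's exact rule.

```python
# Sketch: partition a sequence into windows based on accumulated per-frame motion.
def adaptive_windows(per_frame_motion, motion_budget=5.0, min_len=4):
    windows, start, acc = [], 0, 0.0
    for i, m in enumerate(per_frame_motion):
        acc += m
        if acc > motion_budget and (i - start + 1) >= min_len:
            windows.append((start, i))      # inclusive frame range for one per-window model
            start, acc = i + 1, 0.0
    if start < len(per_frame_motion):
        windows.append((start, len(per_frame_motion) - 1))
    return windows

# toy usage: a mostly static sequence with a burst of motion in the middle
motion = [0.1] * 30 + [2.0] * 10 + [0.1] * 30
print(adaptive_windows(motion))   # shorter windows where motion is high
```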
2312.13307 Report Not All Steps are Equal: Efficient Generation with Progressive Diffusion Models Wenhao Li, Xiu Su, Shan You, Tao Huang, Fei Wang, Chen Qian, Chang Xu Diffusion models have demonstrated remarkable efficacy in various generative tasks with the predictive prowess of the denoising model. Currently, these models employ a uniform denoising approach across all timesteps. However, the inherent variations in noisy latents at each timestep lead to conflicts during training, constraining the potential of diffusion models. To address this challenge, we propose a novel two-stage training strategy termed Step-Adaptive Training. In the initial stage, a base denoising model is trained to encompass all timesteps. Subsequently, we partition the timesteps into distinct groups, fine-tuning the model within each group to achieve specialized denoising capabilities. Recognizing that the difficulties of predicting noise at different timesteps vary, we introduce diverse model size requirements. We dynamically adjust the model size for each timestep by estimating task difficulty based on its signal-to-noise ratio before fine-tuning. This adjustment is facilitated by a proxy-based structural importance assessment mechanism, enabling precise and efficient pruning of the base denoising model. Our experiments validate the effectiveness of the proposed training strategy, demonstrating an improvement in the FID score on CIFAR10 by over 0.3 while utilizing only 80% of the computational resources. This innovative approach not only enhances model performance but also significantly reduces computational costs, opening new avenues for the development and application of diffusion models. This paper introduces a novel two-stage training strategy called Step-Adaptive Training for diffusion models, aiming to address the limitations of uniform denoising across timesteps. The conventional uniform denoising approach in diffusion models leads to training conflicts and inefficient resource allocation due to varying noise levels across timesteps, limiting their efficiency and performance. The approach involves initially training a base denoising model across all timesteps. Subsequently, timesteps are partitioned into groups, and the model is fine-tuned within each group with a specific FLOPs budget determined by the signal-to-noise ratio. This process is facilitated by a GPT-4 proxy-based pruning method for efficient model size adjustment. The Step-Adaptive Training Strategy improves the FID score on CIFAR10 by over 0.3 while using only 80% of the computational resources. The two-stage training approach, in comparison to single-stage training, demonstrates significant improvement in convergence speed and performance. The proposed GPT-4 proxy-based pruning method outperforms other pruning algorithms for diffusion models, leading to smaller models with competitive performance. The paper mainly focuses on image generation tasks, and further exploration is needed for other applications of diffusion models. While the GPT-4 proxy shows promising results, investigating alternative pruning methods specifically tailored for diffusion models could be beneficial. Future work includes exploring the application of Step-Adaptive Training to diverse diffusion model architectures and datasets. diffusion models, denoising, model pruning, step-adaptive training, gpt-4 proxy
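A rough sketch of the signal-to-noise-ratio bookkeeping behind this entry: partition the diffusion timesteps into groups and derive a relative compute budget for each group from its SNR. The linear beta schedule, the grouping into equal chunks, the mapping from mean log-SNR to a FLOPs fraction, and even the direction in which the budget should scale with SNR are all assumptions made for illustration, not the paper's recipe.

```python
# Sketch: group timesteps and map each group's mean log-SNR to a relative FLOPs budget.
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)
snr = alphas_bar / (1.0 - alphas_bar)          # SNR(t) = alpha_bar_t / (1 - alpha_bar_t)

num_groups = 4
groups = np.array_split(np.arange(T), num_groups)
log_snr_per_group = [np.log(snr[g]).mean() for g in groups]

# Map mean log-SNR to a FLOPs fraction in [0.5, 1.0]; here higher-SNR (cleaner) groups
# are assumed to get the larger sub-network, which is itself an illustrative choice.
lo, hi = min(log_snr_per_group), max(log_snr_per_group)
budgets = [0.5 + 0.5 * (s - lo) / (hi - lo) for s in log_snr_per_group]

for g, b in zip(groups, budgets):
    print(f"timesteps {g[0]:4d}-{g[-1]:4d}: FLOPs budget ~{b:.2f} of the base model")
```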
2312.13299 Report Compact 3D Scene Representation via Self-Organizing Gaussian Grids Wieland Morgenstern, Florian Barthel, Anna Hilsmann, Peter Eisert 3D Gaussian Splatting has recently emerged as a highly promising technique for modeling of static 3D scenes. In contrast to Neural Radiance Fields, it utilizes efficient rasterization allowing for very fast rendering at high-quality. However, the storage size is significantly higher, which hinders practical deployment, e.g. on resource constrained devices. In this paper, we introduce a compact scene representation organizing the parameters of 3D Gaussian Splatting (3DGS) into a 2D grid with local homogeneity, ensuring a drastic reduction in storage requirements without compromising visual quality during rendering. Central to our idea is the explicit exploitation of perceptual redundancies present in natural scenes. In essence, the inherent nature of a scene allows for numerous permutations of Gaussian parameters to equivalently represent it. To this end, we propose a novel highly parallel algorithm that regularly arranges the high-dimensional Gaussian parameters into a 2D grid while preserving their neighborhood structure. During training, we further enforce local smoothness between the sorted parameters in the grid. The uncompressed Gaussians use the same structure as 3DGS, ensuring a seamless integration with established renderers. Our method achieves a reduction factor of 17x to 42x in size for complex scenes with no increase in training time, marking a substantial leap forward in the domain of 3D scene distribution and consumption. Additional information can be found on our project page: https://fraunhoferhhi.github.io/Self-Organizing-Gaussians/ This paper introduces a novel method for compact 3D scene representation using self-organizing Gaussian grids, significantly reducing the storage requirements of 3D Gaussian Splatting (3DGS) without compromising rendering quality. 3DGS offers high-quality rendering at fast speeds, but its large storage size hinders practical deployment on devices with limited resources. This work addresses this limitation, making 3DGS more practical for various applications. The method employs a novel parallel sorting algorithm (PLAS) to arrange 3DGS parameters into a 2D grid, ensuring local homogeneity. A smoothness loss is integrated into the training process to encourage compressible parameter arrangements, and off-the-shelf compression methods are used for storage. Achieves a 17x to 42x reduction in storage size compared to vanilla 3DGS. Maintains high visual quality comparable to original 3DGS rendering. Sorting and compression do not increase training time compared to 3DGS. Current implementation relies on high-dimensional spherical harmonics, which could be improved for further compression. Future work includes extending the method to 4D scenes with temporal dependencies. 3d scene representation, 3d gaussian splatting, compression, self-organizing maps, neural rendering
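A minimal sketch of the grid-smoothness idea in this entry: once per-Gaussian parameters are arranged into a 2D grid, a simple neighbour-difference (total-variation style) penalty encourages locally homogeneous values that compress well. The squared-difference penalty and the 14-channel parameter layout are illustrative stand-ins for the paper's exact smoothness term.

```python
# Sketch: smoothness loss over a 2D grid of Gaussian parameters.
import torch

def grid_smoothness_loss(param_grid):
    """param_grid: (H, W, C) tensor, one C-dim Gaussian parameter vector per grid cell."""
    dy = param_grid[1:, :, :] - param_grid[:-1, :, :]   # vertical neighbour differences
    dx = param_grid[:, 1:, :] - param_grid[:, :-1, :]   # horizontal neighbour differences
    return (dy ** 2).mean() + (dx ** 2).mean()

# toy usage: a smoothly varying grid scores lower than a randomly shuffled one
H, W, C = 64, 64, 14                      # e.g. position, scale, rotation, opacity, colour
smooth = torch.linspace(0, 1, H * W).reshape(H, W, 1).repeat(1, 1, C)
shuffled = smooth.reshape(-1, C)[torch.randperm(H * W)].reshape(H, W, C)
print(grid_smoothness_loss(smooth), grid_smoothness_loss(shuffled))
```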
2312.13286 Report Generative Multimodal Models are In-Context Learners Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, Xinlong Wang The human ability to easily solve multimodal tasks in context (i.e., with only a few demonstrations or simple instructions) is what current multimodal systems have largely struggled to imitate. In this work, we demonstrate that the task-agnostic in-context learning capabilities of large multimodal models can be significantly enhanced by effective scaling-up. We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences with a unified autoregressive objective. Emu2 exhibits strong multimodal in-context learning abilities, even emerging to solve tasks that require on-the-fly reasoning, such as visual prompting and object-grounded generation. The model sets a new record on multiple multimodal understanding tasks in few-shot settings. When instruction-tuned to follow specific instructions, Emu2 further achieves new state-of-the-art on challenging tasks such as question answering benchmarks for large multimodal models and open-ended subject-driven generation. These achievements demonstrate that Emu2 can serve as a base model and general-purpose interface for a wide range of multimodal tasks. Code and models are publicly available to facilitate future research. This paper introduces Emu2, a 37B parameter generative multimodal model demonstrating significant advancements in in-context learning for multimodal tasks. The work is crucial as it addresses the limitations of current multimodal systems in replicating human-like in-context learning abilities for diverse and complex tasks. Emu2 is trained on a massive dataset of multimodal sequences (text, image-text, video-text) with a unified autoregressive objective to predict the next multimodal element. The model architecture includes a visual encoder, multimodal modeling, and visual decoder for image and video generation. Emu2 achieves state-of-the-art performance on various few-shot multimodal understanding tasks, including visual question answering. It demonstrates strong in-context learning capabilities, excelling in tasks like visual prompting and object-grounded generation. With instruction tuning, Emu2 excels in controllable visual generation, accepting text, location, and image inputs for context-aware image synthesis. The in-context learning capability of Emu2 can be limited in complex situations, such as counting objects in crowded scenes. There is a performance gap between Emu2 and specialized multimodal systems, particularly in question-answering tasks. multimodal learning, in-context learning, generative models, large language models, visual generation
2312.13285 Report UniSDF: Unifying Neural Representations for High-Fidelity 3D Reconstruction of Complex Scenes with Reflections Fangjinhua Wang, Marie-Julie Rakotosaona, Michael Niemeyer, Richard Szeliski, Marc Pollefeys, Federico Tombari Neural 3D scene representations have shown great potential for 3D reconstruction from 2D images. However, reconstructing real-world captures of complex scenes still remains a challenge. Existing generic 3D reconstruction methods often struggle to represent fine geometric details and do not adequately model reflective surfaces of large-scale scenes. Techniques that explicitly focus on reflective surfaces can model complex and detailed reflections by exploiting better reflection parameterizations. However, we observe that these methods are often not robust in real unbounded scenarios where non-reflective as well as reflective components are present. In this work, we propose UniSDF, a general purpose 3D reconstruction method that can reconstruct large complex scenes with reflections. We investigate both view-based as well as reflection-based color prediction parameterization techniques and find that explicitly blending these representations in 3D space enables reconstruction of surfaces that are more geometrically accurate, especially for reflective surfaces. We further combine this representation with a multi-resolution grid backbone that is trained in a coarse-to-fine manner, enabling faster reconstructions than prior methods. Extensive experiments on object-level datasets DTU, Shiny Blender as well as unbounded datasets Mip-NeRF 360 and Ref-NeRF real demonstrate that our method is able to robustly reconstruct complex large-scale scenes with fine details and reflective surfaces. Please see our project page at https://fangjinhuawang.github.io/UniSDF. UniSDF, a novel algorithm that combines camera view and reflected view radiance fields, enabling robust and accurate 3D reconstruction of complex scenes with reflections. Existing methods struggle to balance accurate geometry and reflections, especially in complex real-world scenes. UniSDF addresses this by leveraging the strengths of different radiance field parameterizations. The method uses a hash grid backbone for fast training and combines two radiance fields: one parameterized by camera view direction and the other by reflected view direction. A learned weight field blends these in 3D space. A coarse-to-fine training strategy is employed to enhance reconstruction quality. Achieves state-of-the-art reconstruction quality on DTU, outperforming Neuralangelo and PermutoSDF. Demonstrates high-fidelity reconstruction on Shiny Blender dataset, surpassing BakedSDF in capturing reflective surfaces accurately. Reconstructs complex unbounded scenes with fine details and reflections, outperforming baselines like BakedSDF on Mip-NeRF 360 and Ref-NeRF real datasets. The method's performance on highly specular and less specular reflections is not explicitly analyzed. Further exploration of the optimization challenges related to the diffuse component in the reflected view radiance field, particularly with high-frequency iNGP representations, is warranted. 3d reconstruction, neural radiance fields, reflections, hash grids, signed distance functions
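A minimal sketch of the blending idea in this entry: compute the reflected view direction from the surface normal, query a camera-view colour branch and a reflected-view colour branch, and mix them with a learned per-point weight. The tiny MLP branches, the weight head, and the reflection convention are placeholders for the paper's actual field parameterisation.

```python
# Sketch: learned blend of camera-view and reflected-view radiance branches.
import torch
import torch.nn as nn
import torch.nn.functional as F

def reflect(view_dir, normal):
    """Reflect the (unit) view direction about the (unit) surface normal."""
    return 2.0 * (view_dir * normal).sum(-1, keepdim=True) * normal - view_dir

class BlendedRadiance(nn.Module):
    def __init__(self, feat_dim=32):
        super().__init__()
        self.cam_branch = nn.Sequential(nn.Linear(feat_dim + 3, 64), nn.ReLU(), nn.Linear(64, 3))
        self.ref_branch = nn.Sequential(nn.Linear(feat_dim + 3, 64), nn.ReLU(), nn.Linear(64, 3))
        self.weight_head = nn.Sequential(nn.Linear(feat_dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, feat, view_dir, normal):
        ref_dir = reflect(view_dir, normal)
        c_cam = torch.sigmoid(self.cam_branch(torch.cat([feat, view_dir], -1)))
        c_ref = torch.sigmoid(self.ref_branch(torch.cat([feat, ref_dir], -1)))
        w = torch.sigmoid(self.weight_head(feat))            # learned blend weight in [0, 1]
        return w * c_cam + (1.0 - w) * c_ref

# toy usage
model = BlendedRadiance()
feat = torch.randn(1024, 32)
d = F.normalize(torch.randn(1024, 3), dim=-1)
n = F.normalize(torch.randn(1024, 3), dim=-1)
print(model(feat, d, n).shape)  # torch.Size([1024, 3])
```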
2312.13271 Report Repaint123: Fast and High-quality One Image to 3D Generation with Progressive Controllable 2D Repainting Junwu Zhang, Zhenyu Tang, Yatian Pang, Xinhua Cheng, Peng Jin, Yida Wei, Munan Ning, Li Yuan Recent one image to 3D generation methods commonly adopt Score Distillation Sampling (SDS). Despite the impressive results, there are multiple deficiencies including multi-view inconsistency, over-saturated and over-smoothed textures, as well as the slow generation speed. To address these deficiencies, we present Repaint123 to alleviate multi-view bias as well as texture degradation and speed up the generation process. The core idea is to combine the powerful image generation capability of the 2D diffusion model and the texture alignment ability of the repainting strategy for generating high-quality multi-view images with consistency. We further propose visibility-aware adaptive repainting strength for overlap regions to enhance the generated image quality in the repainting process. The generated high-quality and multi-view consistent images enable the use of simple Mean Square Error (MSE) loss for fast 3D content generation. We conduct extensive experiments and show that our method has a superior ability to generate high-quality 3D content with multi-view consistency and fine textures in 2 minutes from scratch. Our project page is available at https://pku-yuangroup.github.io/repaint123/. Presents Repaint123, a novel method that generates high-quality, multi-view consistent 3D content from a single image in approximately 2 minutes. Addresses limitations of existing single image to 3D generation methods, which suffer from multi-view inconsistency, over-saturated and smoothed textures, and slow generation speed. Employs a two-stage optimization strategy: a coarse stage using Gaussian Splatting and a refining stage that leverages a 2D controllable diffusion model with a progressive, controllable repainting scheme for texture refinement. Achieves superior multi-view consistency compared to existing methods, as evidenced by CLIP-similarity and contextual distance metrics. Generates high-quality textures, addressing the over-smoothing issue common in other methods. Significantly faster (around 2 minutes) than NeRF-based methods while maintaining high quality. Reliance on Gaussian Splatting, which is still under development and may exhibit geometry artifacts. Potential for further improvement in reference-view reconstruction quality compared to some NeRF-based approaches. 3d generation, image-to-3d, gaussian splatting, diffusion models, controllable image synthesis
2312.13253 Report Conditional Image Generation with Pretrained Generative Model Rajesh Shrestha, Bowen Xie In recent years, diffusion models have gained popularity for their ability to generate higher-quality images in comparison to GAN models. However, like other large generative models, these models require a huge amount of data, computational resources, and meticulous tuning for successful training. This poses a significant challenge, rendering it infeasible for most individuals. As a result, the research community has devised methods to leverage pre-trained unconditional diffusion models with additional guidance for the purpose of conditional image generation. These methods enable conditional image generation from diverse inputs and, most importantly, circumvent the need for training the diffusion model. In this paper, our objective is to reduce the time and computational overhead introduced by the addition of guidance in diffusion models, while maintaining comparable image quality. We propose a set of methods based on our empirical analysis, demonstrating a reduction in computation time by approximately threefold. This paper proposes methods to reduce the computational overhead of guided image generation using pretrained diffusion models, specifically focusing on the Universal Guidance method. Guided diffusion models allow for controlled image generation but introduce significant computational overhead, making them time-consuming. Reducing this overhead is crucial for broader application. The paper analyzes the impact of hyperparameters (self-recurrence steps and backward guidance steps) and the necessity of guidance at different diffusion steps. It also explores a model-based approach to approximate the guidance process. Reducing self-recurrence steps to 5 and backward guidance steps to 10 significantly reduces computation time without significant quality loss. Guidance is more critical in the initial stages of the diffusion process, allowing for its deactivation in later stages without substantial impact. A model-based approach to replace the iterative guidance process shows potential but requires further refinement. The experiments were conducted on a limited dataset, requiring further validation on a larger scale. The model-based approximation needs further development and a larger, more diverse dataset for training. diffusion models, image generation, guided diffusion, computational efficiency, clip
2312.13150 Report Splatter Image: Ultra-Fast Single-View 3D Reconstruction Stanislaw Szymanowicz, Christian Rupprecht, Andrea Vedaldi We introduce the Splatter Image, an ultra-efficient approach for monocular 3D object reconstruction. Splatter Image is based on Gaussian Splatting, which allows fast and high-quality reconstruction of 3D scenes from multiple images. We apply Gaussian Splatting to monocular reconstruction by learning a neural network that, at test time, performs reconstruction in a feed-forward manner, at 38 FPS. Our main innovation is the surprisingly straightforward design of this network, which, using 2D operators, maps the input image to one 3D Gaussian per pixel. The resulting set of Gaussians thus has the form of an image, the Splatter Image. We further extend the method to take several images as input via cross-view attention. Owing to the speed of the renderer (588 FPS), we use a single GPU for training while generating entire images at each iteration to optimize perceptual metrics like LPIPS. On several synthetic, real, multi-category and large-scale benchmark datasets, we achieve better results in terms of PSNR, LPIPS, and other metrics while training and evaluating much faster than prior works. Code, models, demo and more results are available at https://szymanowiczs.github.io/splatter-image. Introduces Splatter Image, an ultra-efficient method for single- and few-view 3D object reconstruction using Gaussian Splatting. Addresses limitations of existing methods in terms of speed, efficiency, and reconstruction quality, particularly for single-view reconstruction. Predicts a 'Splatter Image' where each pixel represents parameters of a 3D Gaussian, using a U-Net architecture. Employs Gaussian Splatting for fast and high-quality rendering. Extends to multi-view by registering and fusing Gaussian mixtures from different views. Achieves state-of-the-art reconstruction quality on ShapeNet, CO3D, and Google Scanned Objects datasets, outperforming or being comparable to much slower methods. Significantly faster than previous methods in both training and inference, enabling single-GPU training for standard benchmarks and competing with methods 50x more expensive in training. Demonstrates the ability to reconstruct full 360° 3D objects from single views. Assumes a fixed number of views during training, potentially limiting generalisation to arbitrary viewpoints. Relies on relative camera poses, which may be challenging to obtain accurately in some real-world scenarios. 3d reconstruction, gaussian splatting, single-view reconstruction, few-shot learning, computer vision
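A minimal sketch of the "one 3D Gaussian per pixel" idea in this entry: a purely 2D network maps the input image to a parameter image, here split into depth, a 3D offset, a rotation quaternion, scale, opacity, and colour (15 channels). The tiny conv stack stands in for the paper's U-Net, and no Gaussian Splatting renderer is included.

```python
# Sketch: map an image to one 3D Gaussian per pixel ("splatter image").
import torch
import torch.nn as nn

PARAMS_PER_GAUSSIAN = 1 + 3 + 4 + 3 + 1 + 3   # depth, offset, quaternion, scale, opacity, rgb

class SplatterHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, PARAMS_PER_GAUSSIAN, 1),
        )

    def forward(self, image):
        B, _, H, W = image.shape
        params = self.net(image)                       # (B, 15, H, W): the parameter image
        gaussians = params.permute(0, 2, 3, 1).reshape(B, H * W, PARAMS_PER_GAUSSIAN)
        return gaussians                               # one Gaussian per input pixel

# toy usage: a 128x128 image yields 16384 Gaussians
print(SplatterHead()(torch.randn(1, 3, 128, 128)).shape)  # torch.Size([1, 16384, 15])
```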
2312.12661 Report Misalign, Contrast then Distill: Rethinking Misalignments in Language-Image Pretraining Bumsoo Kim, Yeonsik Jo, Jinhyung Kim, Seung Hwan Kim Contrastive Language-Image Pretraining has emerged as a prominent approach for training vision and text encoders with uncurated image-text pairs from the web. To enhance data-efficiency, recent efforts have introduced additional supervision terms that involve random-augmented views of the image. However, since the image augmentation process is unaware of its text counterpart, this procedure could cause various degrees of image-text misalignments during training. Prior methods either disregarded this discrepancy or introduced external models to mitigate the impact of misalignments during training. In contrast, we propose a novel metric learning approach that capitalizes on these misalignments as an additional training source, which we term "Misalign, Contrast then Distill (MCD)". Unlike previous methods that treat augmented images and their text counterparts as simple positive pairs, MCD predicts the continuous scales of misalignment caused by the augmentation. Our extensive experimental results show that our proposed MCD achieves state-of-the-art transferability in multiple classification and retrieval downstream datasets. This paper introduces MCD (Misalign, Contrast then Distill), a novel training framework that leverages the various levels of misalignments between random augmented images and its text description during training for Contrastive Language-Image Pretraining. Random image augmentations in Contrastive Language-Image Pretraining can cause misalignments between the image and its corresponding text description, leading to performance degradation if not addressed properly. MCD utilizes a teacher-student network where the student learns from the continuous distance between the image--text and augmented image--text of the teacher model with a log-ratio loss. MCD achieves state-of-the-art transferability in multiple classification and retrieval downstream datasets. MCD outperforms previous methods without relying on external models or additional parameters for inference. The proposed distillation strategies, addressing misalignments in positive pairs, negative pairs, and noisy pairs, all contribute positively to the final performance. The paper mainly focuses on image augmentations as the source of misalignments and could further explore other sources. Future work could extend MCD frameworks to other modalities beyond vision and language. contrastive learning, language-image pretraining, multi-modal learning, knowledge distillation, misalignment
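A heavily hedged sketch of the log-ratio distillation idea in this entry: the teacher measures how far the original image-text pair and the augmented image-text pair are apart, and the student is asked to reproduce the ratio of those distances rather than treating both pairs as equally positive. The cosine-distance choice and the exact pairing are illustrative assumptions, not the paper's precise formulation.

```python
# Sketch: teacher-student log-ratio distillation over image-text distances.
import torch
import torch.nn.functional as F

def cosine_dist(a, b):
    return 1.0 - F.cosine_similarity(a, b, dim=-1)

def log_ratio_distill_loss(s_img, s_aug, s_txt, t_img, t_aug, t_txt, eps=1e-6):
    """s_* are student embeddings, t_* teacher embeddings; all (B, D)."""
    student_ratio = (cosine_dist(s_aug, s_txt) + eps) / (cosine_dist(s_img, s_txt) + eps)
    with torch.no_grad():
        teacher_ratio = (cosine_dist(t_aug, t_txt) + eps) / (cosine_dist(t_img, t_txt) + eps)
    return ((student_ratio.log() - teacher_ratio.log()) ** 2).mean()

# toy usage
B, D = 8, 256
embs = [torch.randn(B, D) for _ in range(6)]
print(log_ratio_distill_loss(*embs))
```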
2312.12540 Report Fixed-point Inversion for Text-to-image diffusion models Barak Meiri, Dvir Samuel, Nir Darshan, Gal Chechik, Shai Avidan, Rami Ben-Ari Text-guided diffusion models offer powerful new ways to generate and manipulate images. Several applications of these models, including image editing, interpolation, and semantic augmentation, require diffusion inversion. This is the process of finding a noise seed that can be used to generate a given image. Current techniques for inverting a given image can be slow or inaccurate. The technical challenge for inverting the diffusion process arises from an implicit equation over the latent that cannot be solved in closed form. Previous approaches proposed to solve this issue by approximation or various learning schemes. Here, we formulate the inversion as a fixed-point equation and solve it using fixed-point iterations, a well-studied approach in numerical analysis. We further identify a source of inconsistency that significantly hurts the inversion of real images encoded to the latent space. We show how to correct it by applying a prompt-aware adjustment of the encoding. Our solution, Fixed-point inversion, is much faster than previous techniques like EDICT and Null-text, with similar inversion quality. It can be combined with any pretrained diffusion model and requires no model training, prompt tuning, or additional parameters. In a series of experiments, we find that Fixed-point inversion shows improved results in several downstream tasks: image editing, image interpolation, and generation of rare objects. This paper introduces Fixed-Point Inversion (FPI), a novel, fast, and accurate method for inverting real images in text-guided diffusion models. Diffusion inversion is crucial for many applications, including image editing and rare concept generation. Existing methods are either slow or inaccurate. FPI leverages fixed-point iterations to efficiently solve the implicit equation governing the diffusion process. It also introduces a prompt-aware adjustment to improve consistency. FPI achieves comparable or better reconstruction quality than state-of-the-art methods while being significantly faster. FPI demonstrates superior performance in image editing, preserving image structure and adhering to target prompts. FPI improves the generation of rare concepts by providing more accurate and efficient seed initialization for methods like SeedSelect. Theoretical convergence guarantees for FPI are not fully established. Exploring more sophisticated implicit function solvers may further enhance FPI's performance. diffusion models, image inversion, image editing, rare concept generation, fixed-point iteration
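A minimal sketch of fixed-point DDIM inversion as described in this entry: one inversion step from x_{t-1} to x_t is an implicit equation because the noise prediction must be evaluated at the unknown x_t, and it is solved here by a few fixed-point iterations. The `eps_model` lambda and the linear schedule are stand-ins for the prompt-conditioned diffusion U-Net and its schedule; the prompt-aware encoding adjustment is not shown.

```python
# Sketch: one DDIM inversion step solved by fixed-point iteration.
import torch

def ddim_invert_step(x_prev, t, t_prev, alphas_bar, eps_model, num_iters=3):
    a_t, a_prev = alphas_bar[t], alphas_bar[t_prev]
    coef_x = (a_t / a_prev).sqrt()
    coef_eps = (1 - a_t).sqrt() - (a_t * (1 - a_prev) / a_prev).sqrt()

    x_t = x_prev.clone()                          # initial guess
    for _ in range(num_iters):                    # fixed-point iterations on the implicit eq.
        x_t = coef_x * x_prev + coef_eps * eps_model(x_t, t)
    return x_t

# toy usage with a linear schedule and a placeholder noise predictor
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)
eps_model = lambda x, t: 0.1 * x                  # placeholder for the real U-Net
x = torch.randn(1, 4, 64, 64)                     # e.g. a VAE latent of the image
x = ddim_invert_step(x, t=500, t_prev=480, alphas_bar=alphas_bar, eps_model=eps_model)
print(x.shape)
```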
2312.12491 Report StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation Akio Kodaira, Chenfeng Xu, Toshiki Hazama, Takanori Yoshimoto, Kohei Ohno, Shogo Mitsuhori, Soichi Sugano, Hanying Cho, Zhijian Liu, Kurt Keutzer We introduce StreamDiffusion, a real-time diffusion pipeline designed for interactive image generation. Existing diffusion models are adept at creating images from text or image prompts, yet they often fall short in real-time interaction. This limitation becomes particularly evident in scenarios involving continuous input, such as Metaverse, live video streaming, and broadcasting, where high throughput is imperative. To address this, we present a novel approach that transforms the original sequential denoising into a batched denoising process. Stream Batch eliminates the conventional wait-and-interact approach and enables fluid and high throughput streams. To handle the frequency disparity between data input and model throughput, we design a novel input-output queue for parallelizing the streaming process. Moreover, the existing diffusion pipeline uses classifier-free guidance (CFG), which requires additional U-Net computation. To mitigate the redundant computations, we propose a novel residual classifier-free guidance (RCFG) algorithm that reduces the number of negative conditional denoising steps to only one or even zero. In addition, we introduce a stochastic similarity filter (SSF) to optimize power consumption. Our Stream Batch achieves around 1.5x speedup compared to the sequential denoising method at different denoising levels. The proposed RCFG leads to speeds up to 2.05x higher than the conventional CFG. Combining the proposed strategies and existing mature acceleration tools enables image-to-image generation at up to 91.07 fps on one RTX 4090, improving throughput over the AutoPipeline developed by Diffusers by up to 59.56x. Furthermore, our proposed StreamDiffusion also significantly reduces the energy consumption by 2.39x on one RTX 3060 and by 1.99x on one RTX 4090, respectively. Introduces StreamDiffusion, a real-time diffusion pipeline for interactive image generation, prioritizing high throughput and energy efficiency. Existing diffusion models lack real-time interactivity needed for applications like Metaverse, video games, and live streaming. Employs stream batch denoising, residual classifier-free guidance (RCFG), an input-output queue, stochastic similarity filtering, pre-computation, and a tiny autoencoder. Achieves up to 91.07 fps on an RTX 4090 GPU, outperforming the Diffusers AutoPipeline by up to 59.6x. RCFG achieves up to 2.05x speedup compared to conventional classifier-free guidance. Stochastic similarity filtering significantly reduces GPU power usage (up to 2.39x on an RTX 3060 GPU). Fixed input dimensions and batch sizes limit flexibility, requiring new engine builds for different configurations. Further exploration of more sophisticated similarity metrics for the stochastic similarity filter. diffusion models, real-time image generation, interactive ai, high throughput, energy efficiency
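A minimal sketch of the Stream Batch idea in this entry: instead of fully denoising one frame before accepting the next, a rolling batch holds several frames, each at a different denoising step, and a single model call advances all of them by one step. The denoiser here is a placeholder; the real pipeline uses the diffusion U-Net plus the input-output queue, RCFG, and similarity filter described above.

```python
# Sketch: rolling "stream batch" where every model call advances all in-flight frames.
from collections import deque
import torch

NUM_STEPS = 4                                       # denoising steps per frame

def denoise_one_step(latents, steps):
    """Placeholder for one batched U-Net denoising call (steps would condition the U-Net)."""
    return latents * 0.9

def stream_batch(frame_latents):
    batch = deque()                                 # [latent, remaining_steps] per slot
    outputs = []
    for new_frame in frame_latents:
        batch.append([new_frame, NUM_STEPS])        # newest frame enters at the noisiest step
        latents = torch.stack([slot[0] for slot in batch])
        steps = [slot[1] for slot in batch]
        latents = denoise_one_step(latents, steps)  # one call advances every slot
        for slot, lat in zip(batch, latents):
            slot[0], slot[1] = lat, slot[1] - 1
        if batch[0][1] == 0:                        # oldest frame is fully denoised
            outputs.append(batch.popleft()[0])
    return outputs

frames = [torch.randn(4, 64, 64) for _ in range(8)]
print(len(stream_batch(frames)))                    # 5 frames finished, 3 still in flight
```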
2312.12490 Report InstructVideo: Instructing Video Diffusion Models with Human Feedback Hangjie Yuan, Shiwei Zhang, Xiang Wang, Yujie Wei, Tao Feng, Yining Pan, Yingya Zhang, Ziwei Liu, Samuel Albanie, Dong Ni Diffusion models have emerged as the de facto paradigm for video generation. However, their reliance on web-scale data of varied quality often yields results that are visually unappealing and misaligned with the textual prompts. To tackle this problem, we propose InstructVideo to instruct text-to-video diffusion models with human feedback by reward fine-tuning. InstructVideo has two key ingredients: 1) To ameliorate the cost of reward fine-tuning induced by generating through the full DDIM sampling chain, we recast reward fine-tuning as editing. By leveraging the diffusion process to corrupt a sampled video, InstructVideo requires only partial inference of the DDIM sampling chain, reducing fine-tuning cost while improving fine-tuning efficiency. 2) To mitigate the absence of a dedicated video reward model for human preferences, we repurpose established image reward models, e.g., HPSv2. To this end, we propose Segmental Video Reward, a mechanism to provide reward signals based on segmental sparse sampling, and Temporally Attenuated Reward, a method that mitigates temporal modeling degradation during fine-tuning. Extensive experiments, both qualitative and quantitative, validate the practicality and efficacy of using image reward models in InstructVideo, significantly enhancing the visual quality of generated videos without compromising generalization capabilities. Code and models will be made publicly available. This paper introduces InstructVideo, a novel approach to enhance text-to-video diffusion models by leveraging human feedback through reward fine-tuning. Existing text-to-video diffusion models often generate videos with subpar visual quality and misalignment with the textual prompts, primarily due to reliance on large-scale web data of inconsistent quality. Aligning model outputs with human preferences is crucial for improving video quality and prompt adherence. The proposed InstructVideo recasts the reward fine-tuning process as an editing task, reducing computational burden by utilizing partial inference of the DDIM sampling chain. It further introduces Segmental Video Reward (SegVR) and Temporally Attenuated Reward (TAR) to effectively utilize image reward models for evaluating video quality. InstructVideo significantly improves the visual quality of generated videos compared to the base model, demonstrating clearer structures, more appealing colors, finer details, and improved text-to-video alignment. InstructVideo outperforms other reward fine-tuning methods in terms of both efficiency and effectiveness while exhibiting strong generalization ability to unseen text prompts. The study validates that the quality of fine-tuning data does not limit the potential quality of the fine-tuned results, suggesting InstructVideo can generate videos exceeding the quality of the data it was trained on. While the use of image reward models proves effective, developing specialized video reward models could further enhance performance by capturing human preferences in a more holistic manner. Future work could explore strategies to mitigate the risk of over-optimization, a common challenge in reward fine-tuning. text-to-video generation, diffusion models, reward fine-tuning, human preferences, video quality
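A heavily hedged sketch of the segmental, temporally attenuated reward described in this entry: a few frames are sparsely sampled from the generated clip, each is scored with an image reward model (a placeholder here for something like HPSv2), and the per-frame scores are combined with decaying weights. The sampling stride and the geometric decay schedule are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch: sparse per-segment image rewards combined with attenuating weights.
import torch

def image_reward(frame, prompt):
    """Placeholder for a pretrained image reward model such as HPSv2."""
    return frame.mean()

def segmental_attenuated_reward(video, prompt, num_segments=4, decay=0.8):
    """video: (T, C, H, W) generated frames."""
    T = video.shape[0]
    idx = torch.linspace(0, T - 1, num_segments).long()     # sparse segment sampling
    weights = torch.tensor([decay ** i for i in range(num_segments)])
    weights = weights / weights.sum()                        # temporally attenuated weights
    rewards = torch.stack([image_reward(video[i], prompt) for i in idx])
    return (weights * rewards).sum()

video = torch.rand(16, 3, 64, 64)
print(segmental_attenuated_reward(video, "a corgi surfing a wave"))
```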
2312.12487 Report Adaptive Guidance: Training-free Acceleration of Conditional Diffusion Models Angela Castillo, Jonas Kohler, Juan C. Pérez, Juan Pablo Pérez, Albert Pumarola, Bernard Ghanem, Pablo Arbeláez, Ali Thabet This paper presents a comprehensive study on the role of Classifier-Free Guidance (CFG) in text-conditioned diffusion models from the perspective of inference efficiency. In particular, we relax the default choice of applying CFG in all diffusion steps and instead search for efficient guidance policies. We formulate the discovery of such policies in the differentiable Neural Architecture Search framework. Our findings suggest that the denoising steps proposed by CFG become increasingly aligned with simple conditional steps, which renders the extra neural network evaluation of CFG redundant, especially in the second half of the denoising process. Building upon this insight, we propose "Adaptive Guidance" (AG), an efficient variant of CFG, that adaptively omits network evaluations when the denoising process displays convergence. Our experiments demonstrate that AG preserves CFG's image quality while reducing computation by 25%. Thus, AG constitutes a plug-and-play alternative to Guidance Distillation, achieving 50% of the speed-ups of the latter while being training-free and retaining the capacity to handle negative prompts. Finally, we uncover further redundancies of CFG in the first half of the diffusion process, showing that entire neural function evaluations can be replaced by simple affine transformations of past score estimates. This method, termed LinearAG, offers even cheaper inference at the cost of deviating from the baseline model. Our findings provide insights into the efficiency of the conditional denoising process that contribute to more practical and swift deployment of text-conditioned diffusion models. This paper introduces "Adaptive Guidance" (AG), an efficient variant of Classifier-Free Guidance (CFG) for text-conditioned diffusion models that reduces computational cost without sacrificing image quality. CFG, while effective for enhancing sample quality in text-to-image generation, doubles the number of function evaluations (NFEs), making it computationally expensive. The authors use Neural Architecture Search (NAS) to discover efficient guidance policies for diffusion models. They identify that CFG's denoising steps become redundant in the later stages as conditional and unconditional steps converge. AG leverages this finding by adaptively switching from CFG to cheaper conditional updates when the similarity between these steps is high. AG preserves CFG's image quality while reducing computation by 25%. AG is a plug-and-play, training-free alternative to Guidance Distillation, achieving 50% of its speed-ups while handling negative prompts. The study uncovers further CFG redundancies, showing potential for replacing NFEs with affine transformations of past score estimates. The linear approximation method (LR-based AG), while promising for further runtime reduction, requires extensive evaluation as it deviates from replicating the baseline model. Future work could explore extending the approach to multimodal conditioning beyond text and image. diffusion models, text-to-image generation, classifier-free guidance, neural architecture search, efficient inference
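A minimal sketch of Adaptive Guidance as described in this entry: run full classifier-free guidance while the guided update still differs from the plain conditional update, then drop the extra unconditional evaluation once the two have converged. The dummy score functions, the cosine-similarity test, and the threshold value are illustrative, not the paper's searched policy.

```python
# Sketch: skip the unconditional branch once CFG and conditional updates converge.
import torch
import torch.nn.functional as F

def adaptive_guidance_step(x, t, eps_cond, eps_uncond, w, state, tau=0.99):
    e_c = eps_cond(x, t)
    if state["skip_cfg"]:
        return e_c                                           # conditional-only update
    e_u = eps_uncond(x, t)
    e_cfg = e_u + w * (e_c - e_u)                            # standard CFG update
    sim = F.cosine_similarity(e_cfg.flatten(), e_c.flatten(), dim=0)
    if sim > tau:                                            # converged: stop paying for CFG
        state["skip_cfg"] = True
    return e_cfg

# toy usage over a short sampling loop with placeholder score networks
eps_cond = lambda x, t: x * 0.5
eps_uncond = lambda x, t: x * 0.5 + 0.1 * (t / 50)           # branches converge at small t
x, state = torch.randn(1, 4, 64, 64), {"skip_cfg": False}
for t in range(50, 0, -1):
    eps = adaptive_guidance_step(x, t, eps_cond, eps_uncond, w=7.5, state=state)
    x = x - 0.02 * eps                                        # placeholder update rule
print("dropped the unconditional pass at some step:", state["skip_cfg"])
```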
2312.12483 Report SCoTTi: Save Computation at Training Time with an adaptive framework Ziyu Lin, Enzo Tartaglione, Van-Tam Nguyen On-device training is an emerging approach in machine learning where models are trained on edge devices, aiming to enhance privacy protection and real-time performance. However, edge devices typically possess restricted computational power and resources, making it challenging to perform computationally intensive model training tasks. Consequently, reducing resource consumption during training has become a pressing concern in this field. To this end, we propose SCoTTi (Save Computation at Training Time), an adaptive framework that addresses the aforementioned challenge. It leverages an optimizable threshold parameter to effectively reduce the number of neuron updates during training which corresponds to a decrease in memory and computation footprint. Our proposed approach demonstrates superior performance compared to the state-of-the-art methods regarding computational resource savings on various commonly employed benchmarks and popular architectures, including ResNets, MobileNet, and Swin-T. SCoTTi, an adaptive framework that reduces the computational cost of on-device training by selectively updating neurons based on their learning progress. On-device training is challenging due to limited computational resources of edge devices, making resource efficiency crucial. SCoTTi combines an ultimate optimizer for dynamic learning rate adjustment and the concept of neuron velocity to identify neurons requiring updates. It introduces a learnable threshold to determine neuron equilibrium and dynamically adjusts it during training. SCoTTi achieves significant FLOPs reduction across various datasets and architectures. The method maintains or even slightly improves accuracy compared to traditional training methods. SCoTTi effectively mitigates overfitting by gradually increasing the neuron update threshold during training. The performance of SCoTTi can degrade if the average FLOPs value falls below a certain threshold. Further research is needed to explore the adaptability of SCoTTi to other domains and hardware platforms. on-device training, resource efficiency, adaptive training, neuron velocity, flops reduction
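A minimal sketch of the velocity-based update gating described in this entry: each output neuron's "velocity" is measured as how much its weight row has changed since the last check, and neurons below a threshold have their gradients masked so they are not updated. The velocity definition and the fixed threshold are simplified stand-ins for the paper's learnable threshold and optimizer coupling.

```python
# Sketch: freeze neurons whose weights have stopped moving ("equilibrium" neurons).
import torch
import torch.nn as nn

def mask_slow_neurons(layer, prev_weight, threshold=1e-3):
    with torch.no_grad():
        velocity = (layer.weight - prev_weight).norm(dim=1) / (prev_weight.norm(dim=1) + 1e-12)
        frozen = velocity < threshold                     # neurons considered at equilibrium
        if layer.weight.grad is not None:
            layer.weight.grad[frozen] = 0.0               # skip their update this step
            if layer.bias is not None and layer.bias.grad is not None:
                layer.bias.grad[frozen] = 0.0
    return frozen

# toy usage
layer = nn.Linear(32, 16)
prev = layer.weight.detach().clone()
out = layer(torch.randn(8, 32)).sum()
out.backward()
# weights have not moved since `prev` was recorded, so every neuron is frozen here
print(mask_slow_neurons(layer, prev).sum().item(), "of 16 neurons frozen")
```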
2312.12468 Report MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers Haoyu Ma, Shahin Mahdizadehaghdam, Bichen Wu, Zhipeng Fan, Yuchao Gu, Wenliang Zhao, Lior Shapira, Xiaohui Xie Recent advances in generative AI have significantly enhanced image and video editing, particularly in the context of text prompt control. State-of-the-art approaches predominantly rely on diffusion models to accomplish these tasks. However, the computational demands of diffusion-based methods are substantial, often necessitating large-scale paired datasets for training, and therefore making deployment in real applications challenging. To address these issues, this paper breaks down the text-based video editing task into two stages. First, we leverage a pre-trained text-to-image diffusion model to simultaneously edit a few keyframes in a zero-shot manner. Second, we introduce an efficient model called MaskINT, which is built on non-autoregressive masked generative transformers and specializes in frame interpolation between the edited keyframes, using the structural guidance from intermediate frames. Experimental results suggest that our MaskINT achieves comparable performance with diffusion-based methodologies, while significantly improving inference time. This research offers a practical solution for text-based video editing and showcases the potential of non-autoregressive masked generative transformers in this domain. Introduces MaskINT, a two-stage text-based video editing framework that combines keyframe editing with structure-aware frame interpolation using non-autoregressive masked generative transformers. Addresses limitations of diffusion-based methods for video editing, such as high computational cost and the need for large paired text-video datasets. Uses a pre-trained text-to-image model to edit keyframes and a novel structure-aware frame interpolation module based on non-autoregressive transformers to generate intermediate frames. Achieves comparable quality to diffusion-based methods in terms of temporal consistency and adherence to text prompts. Significantly faster than diffusion-based methods, with a 5-7 times improvement in inference time. Demonstrates the potential of non-autoregressive masked generative transformers for efficient video editing. Limited to structure-preserving edits and struggles with new objects appearing in intermediate frames. Performance relies heavily on the accuracy of the keyframe editing model and structure detector. video editing, text-to-video, generative transformers, frame interpolation, non-autoregressive generation
2312.12433 Report TAO-Amodal: A Benchmark for Tracking Any Object Amodally Cheng-Yen Hsieh, Kaihua Chen, Achal Dave, Tarasha Khurana, Deva Ramanan Amodal perception, the ability to comprehend complete object structures from partial visibility, is a fundamental skill, even for infants. Its significance extends to applications like autonomous driving, where a clear understanding of heavily occluded objects is essential. However, modern detection and tracking algorithms often overlook this critical capability, perhaps due to the prevalence of modal annotations in most benchmarks. To address the scarcity of amodal benchmarks, we introduce TAO-Amodal, featuring 833 diverse categories in thousands of video sequences. Our dataset includes amodal and modal bounding boxes for visible and partially or fully occluded objects, including those that are partially out of the camera frame. We investigate the current lay of the land in both amodal tracking and detection by benchmarking state-of-the-art modal trackers and amodal segmentation methods. We find that existing methods, even when adapted for amodal tracking, struggle to detect and track objects under heavy occlusion. To mitigate this, we explore simple finetuning schemes that can increase the amodal tracking and detection metrics of occluded objects by 2.1% and 3.3%. Introduces TAO-Amodal, a large-scale benchmark for amodal tracking of diverse objects in videos, featuring 833 categories and amodal bounding box annotations for occluded objects. Amodal perception, the ability to perceive the full extent of objects even when occluded, is crucial for applications like autonomous driving but overlooked in current benchmarks that focus on modal perception. Annotates 17,000 object tracks with amodal bounding boxes, leveraging the existing TAO dataset for modal annotations, and proposes evaluation metrics for amodal tracking. Existing trackers and amodal segmentation methods struggle with heavy occlusion and out-of-frame scenarios. Fine-tuning modal trackers on TAO-Amodal improves amodal tracking performance. Amodal expander, a lightweight module for predicting amodal boxes, shows promising results, especially when combined with data augmentation. Limited size of the amodal training set. Exploiting temporal information for amodal tracking requires further exploration. amodal perception, object tracking, benchmarking, occlusion reasoning, computer vision
2312.12423 Report Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model Shraman Pramanick, Guangxing Han, Rui Hou, Sayan Nag, Ser-Nam Lim, Nicolas Ballas, Qifan Wang, Rama Chellappa, Amjad Almahairi The ability of large language models (LLMs) to process visual inputs has given rise to general-purpose vision systems, unifying various vision-language (VL) tasks by instruction tuning. However, due to the enormous diversity in input-output formats in the vision domain, existing general-purpose models fail to successfully integrate segmentation and multi-image inputs with coarse-level tasks into a single framework. In this work, we introduce VistaLLM, a powerful visual system that addresses coarse- and fine-grained VL tasks over single and multiple input images using a unified framework. VistaLLM utilizes an instruction-guided image tokenizer that filters global embeddings using task descriptions to extract compressed and refined features from numerous images. Moreover, VistaLLM employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences, significantly improving over previously used uniform sampling. To bolster the desired capability of VistaLLM, we curate CoinIt, a comprehensive coarse-to-fine instruction tuning dataset with 6.8M samples. We also address the lack of multi-image grounding datasets by introducing a novel task, AttCoSeg (Attribute-level Co-Segmentation), which boosts the model's reasoning and grounding capability over multiple input images. Extensive experiments on a wide range of V- and VL tasks demonstrate the effectiveness of VistaLLM by achieving consistent state-of-the-art performance over strong baselines across all downstream tasks. Our project page can be found at https://shramanpramanick.github.io/VistaLLM/. VistaLLM, a general-purpose vision model, seamlessly integrates coarse- and fine-grained vision-language reasoning and grounding tasks over single and multiple input images, including segmentation tasks which previous general-purpose models could not handle. Unifying diverse vision-language tasks into a single framework reduces computational overhead associated with task-specific fine-tuning and improves performance by sharing feature representations. VistaLLM uses an instruction-guided image tokenizer to refine and compress global image embeddings, a gradient-aware adaptive sampling technique to represent segmentation masks as sequences, and a Vicuna LLM decoder to process image and language features and generate outputs. The model is trained on a large-scale coarse-to-fine instruction-tuning dataset (CoinIt) containing 6.8M samples and a new multi-image grounding dataset (AttCoSeg). VistaLLM achieves state-of-the-art performance across 15 vision-language benchmarks, surpassing specialist systems in many tasks. The proposed adaptive sampling technique for segmentation masks improves mIoU scores by 3-4 points compared to uniform sampling. The instruction-guided image tokenizer significantly enhances performance in tasks involving multiple images, such as NLVR and CoSeg. VistaLLM struggles to accurately ground tiny or obscured objects in cluttered environments, requiring further improvement in image feature robustness. VistaLLM may generate harmful or unsafe outputs similar to other LLMs, necessitating active research in mitigating such risks. vision-language, general-purpose vision model, instruction tuning, segmentation, multi-image reasoning
2312.12419 Report Scene-Conditional 3D Object Stylization and Composition Jinghao Zhou, Tomas Jakab, Philip Torr, Christian Rupprecht Recently, 3D generative models have made impressive progress, enabling the generation of almost arbitrary 3D assets from text or image inputs. However, these approaches generate objects in isolation without any consideration for the scene where they will eventually be placed. In this paper, we propose a framework that allows for the stylization of an existing 3D asset to fit into a given 2D scene, and additionally produce a photorealistic composition as if the asset was placed within the environment. This not only opens up a new level of control for object stylization (for example, the same asset can be stylized to reflect changes in the environment, such as summer to winter or fantasy versus futuristic settings), but also makes the object-scene composition more controllable. We achieve this by jointly modeling and optimizing the object's texture and environmental lighting through differentiable ray tracing, combined with image priors from pre-trained text-to-image diffusion models. We demonstrate that our method is applicable to a wide variety of indoor and outdoor scenes and arbitrary objects. This paper introduces a novel framework that adapts a 3D object's appearance to a given 2D scene, enabling photorealistic composition. This addresses the challenge of seamlessly integrating 3D objects into existing scenes, a task crucial for applications like video games and media production. The framework leverages differentiable ray tracing and image priors from pre-trained text-to-image diffusion models to optimize object texture and environmental lighting. The method achieves realistic adaptation of object appearance to diverse environments, including lighting and shadow effects. The framework effectively preserves the object's original identity and structural details during the adaptation process. The proposed light capturing apparatus, inspired by real-world techniques, enables accurate estimation of scene lighting from a single image. The reliance on differentiable rendering can be computationally demanding. The current method assumes a single dominant light source for outdoor scenes, which might not always hold. 3d object stylization, scene composition, diffusion models, differentiable rendering, light estimation
2312.12416 Report Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models Shweta Mahajan, Tanzila Rahman, Kwang Moo Yi, Leonid Sigal The quality of the prompts provided to text-to-image diffusion models determines how faithful the generated content is to the user's intent, often requiring `prompt engineering'. To harness visual concepts from target images without prompt engineering, current approaches largely rely on embedding inversion by optimizing and then mapping them to pseudo-tokens. However, working with such high-dimensional vector representations is challenging because they lack semantics and interpretability, and only allow simple vector operations when using them. Instead, this work focuses on inverting the diffusion model to obtain interpretable language prompts directly. The challenge of doing this lies in the fact that the resulting optimization problem is fundamentally discrete and the space of prompts is exponentially large; this makes using standard optimization techniques, such as stochastic gradient descent, difficult. To this end, we utilize a delayed projection scheme to optimize for prompts representative of the vocabulary space in the model. Further, we leverage the findings that different timesteps of the diffusion process cater to different levels of detail in an image. The later, noisy, timesteps of the forward diffusion process correspond to the semantic information, and therefore, prompt inversion in this range provides tokens representative of the image semantics. We show that our approach can identify semantically interpretable and meaningful prompts for a target image which can be used to synthesize diverse images with similar content. We further illustrate the application of the optimized prompts in evolutionary image generation and concept removal. This paper presents PH2P, a novel method for inverting text-to-image diffusion models to generate interpretable language prompts directly from images, surpassing the limitations of embedding inversion techniques. Existing methods for generating image prompts rely on prompt engineering or embedding inversion, which lack interpretability and limit prompt manipulation. Direct prompt inversion enables semantic understanding and flexible image editing. PH2P utilizes a delayed projection scheme and leverages the sensitivity of later diffusion timesteps to semantic information. This enables optimization for discrete text tokens within the model's vocabulary using the L-BFGS algorithm. PH2P generates semantically meaningful prompts for accurate and diverse image synthesis, outperforming baselines in CLIP similarity and LPIPS metrics. The generated prompts exhibit high contextual similarity to ground-truth captions, as measured by BertScore, indicating human-interpretable prompt generation. The method enables applications like evolutionary multi-concept image synthesis and concept removal through negative image prompting. The current study primarily focuses on single-image prompt inversion. Future work will investigate efficient strategies for prompt optimization in multi-image settings. diffusion models, prompt engineering, image generation, text-to-image synthesis, prompt inversion
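A toy sketch of the delayed projection idea, assuming a text-encoder embedding table `vocab_embed` and a `diffusion_loss` callable that scores a soft prompt against the target image; the summary above notes the paper optimizes with L-BFGS over later (noisier) timesteps, while this sketch uses Adam for brevity.

```python
import torch

def invert_prompt(diffusion_loss, vocab_embed, num_tokens=8, iters=500, project_every=50):
    """Toy sketch of delayed projection for hard prompt inversion.

    vocab_embed: (V, D) embedding table of the text encoder (assumed given).
    diffusion_loss: callable mapping a (num_tokens, D) soft prompt to a scalar
                    denoising loss on the target image (assumed given)."""
    V, D = vocab_embed.shape
    # start from random vocabulary entries
    soft = vocab_embed[torch.randint(0, V, (num_tokens,))].detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([soft], lr=1e-2)   # the paper reports L-BFGS; Adam keeps the sketch short
    for it in range(1, iters + 1):
        opt.zero_grad()
        loss = diffusion_loss(soft)
        loss.backward()
        opt.step()
        if it % project_every == 0:           # delayed projection onto the discrete vocabulary
            with torch.no_grad():
                dists = torch.cdist(soft, vocab_embed)          # (num_tokens, V)
                soft.copy_(vocab_embed[dists.argmin(dim=-1)])
    token_ids = torch.cdist(soft.detach(), vocab_embed).argmin(dim=-1)
    return token_ids
```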
2312.12359 Report CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation Monika Wysoczańska, Oriane Siméoni, Michaël Ramamonjisoa, Andrei Bursuc, Tomasz Trzciński, Patrick Pérez The popular CLIP model displays impressive zero-shot capabilities thanks to its seamless interaction with arbitrary text prompts. However, its lack of spatial awareness makes it unsuitable for dense computer vision tasks, e.g., semantic segmentation, without an additional fine-tuning step that often uses annotations and can potentially suppress its original open-vocabulary properties. Meanwhile, self-supervised representation methods have demonstrated good localization properties without human-made annotations nor explicit supervision. In this work, we take the best of both worlds and propose an open-vocabulary semantic segmentation method, which does not require any annotations. We propose to locally improve dense MaskCLIP features, which are computed with a simple modification of CLIP's last pooling layer, by integrating localization priors extracted from self-supervised features. By doing so, we greatly improve the performance of MaskCLIP and produce smooth outputs. Moreover, we show that the used self-supervised feature properties can directly be learnt from CLIP features. Our method CLIP-DINOiser needs only a single forward pass of CLIP and two light convolutional layers at inference, no extra supervision nor extra memory and reaches state-of-the-art results on challenging and fine-grained benchmarks such as COCO, Pascal Context, Cityscapes and ADE20k. The code to reproduce our results is available at https://github.com/wysoczanska/clip_dinoiser. This paper introduces a novel open-vocabulary semantic segmentation method that enhances the dense features of MaskCLIP using localization cues derived from self-supervised learning (SSL) models, achieving state-of-the-art performance without requiring annotations or retraining CLIP. CLIP, despite its zero-shot capabilities, lacks spatial awareness for dense tasks like segmentation. Existing solutions often compromise CLIP's open-vocabulary nature. This work addresses this gap by integrating the localization strengths of SSL with the open-vocabulary nature of CLIP. The method refines MaskCLIP features by leveraging patch correlations from self-supervised DINO features. This is achieved via a lightweight convolutional layer trained to predict DINO-like correlations directly from CLIP features. It further incorporates a background filtering mechanism by learning objectness information, also inspired by DINO, to refine background predictions. The method surpasses previous state-of-the-art techniques on benchmarks like COCO, Pascal Context, Cityscapes, and ADE20k. It demonstrates CLIP's inherent capacity for localization, effectively learned using simple convolutional layers. The approach operates efficiently with a single forward pass of CLIP and minimal additional computation. The method's performance is inherently limited by CLIP's ability to differentiate between classes. Future work may explore adaptive feature correlation granularity and address ambiguities in textual queries for further improvement. open-vocabulary semantic segmentation, clip, self-supervised learning, dino, localization
2312.12337 Report pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction David Charatan, Sizhe Li, Andrea Tagliasacchi, Vincent Sitzmann We introduce pixelSplat, a feed-forward model that learns to reconstruct 3D radiance fields parameterized by 3D Gaussian primitives from pairs of images. Our model features real-time and memory-efficient rendering for scalable training as well as fast 3D reconstruction at inference time. To overcome local minima inherent to sparse and locally supported representations, we predict a dense probability distribution over 3D and sample Gaussian means from that probability distribution. We make this sampling operation differentiable via a reparameterization trick, allowing us to back-propagate gradients through the Gaussian splatting representation. We benchmark our method on wide-baseline novel view synthesis on the real-world RealEstate10k and ACID datasets, where we outperform state-of-the-art light field transformers and accelerate rendering by 2.5 orders of magnitude while reconstructing an interpretable and editable 3D radiance field. pixelSplat is a novel view synthesis model that reconstructs a 3D radiance field represented by 3D Gaussian primitives from image pairs, enabling real-time rendering and scalable training. Existing differentiable rendering methods are computationally expensive, while light field transformers lack interpretable 3D structure. pixelSplat addresses these limitations by combining the efficiency of Gaussian splatting with generalizable view synthesis. pixelSplat utilizes a two-view image encoder with epipolar attention to resolve scale ambiguity in real-world datasets. It predicts a dense probability distribution over 3D space and samples Gaussian means from it, using a reparameterization trick to maintain differentiability. Outperforms state-of-the-art light field transformers on RealEstate10k and ACID datasets. Achieves 2.5 orders of magnitude faster rendering compared to baselines. Produces an interpretable and editable 3D radiance field. Gaussian fusion and de-duplication from different views is not addressed. Limited to in-distribution view synthesis and doesn't model unseen regions. novel view synthesis, 3d gaussian splatting, epipolar transformer, differentiable rendering, scale ambiguity
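A hedged sketch of the differentiable depth sampling idea: predict a categorical distribution over depth buckets per pixel and sample depths so that gradients flow back through the distribution. The straight-through Gumbel-softmax used here is a stand-in illustration; pixelSplat's actual reparameterization differs in its details.

```python
import torch
import torch.nn.functional as F

def sample_gaussian_depths(depth_logits, near, far, tau=0.5):
    """depth_logits: (B, H, W, K) scores over K depth buckets per pixel.
    Returns per-pixel depths sampled differentiably from the predicted distribution."""
    B, H, W, K = depth_logits.shape
    bucket_depths = torch.linspace(near, far, K, device=depth_logits.device)  # (K,)
    # straight-through Gumbel-softmax: discrete sample forward, soft gradient backward
    one_hot = F.gumbel_softmax(depth_logits, tau=tau, hard=True)              # (B, H, W, K)
    depths = (one_hot * bucket_depths).sum(dim=-1)                            # (B, H, W)
    return depths

# usage sketch: turn sampled depths into Gaussian means along camera rays
# means = ray_origins + ray_dirs * depths.unsqueeze(-1)   # ray_origins / ray_dirs assumed given
```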
2312.12198 Report Mask Grounding for Referring Image Segmentation Yong Xien Chng, Henry Zheng, Yizeng Han, Xuchong Qiu, Gao Huang Referring Image Segmentation (RIS) is a challenging task that requires an algorithm to segment objects referred by free-form language expressions. Despite significant progress in recent years, most state-of-the-art (SOTA) methods still suffer from considerable language-image modality gap at the pixel and word level. These methods generally 1) rely on sentence-level language features for language-image alignment and 2) lack explicit training supervision for fine-grained visual grounding. Consequently, they exhibit weak object-level correspondence between visual and language features. Without well-grounded features, prior methods struggle to understand complex expressions that require strong reasoning over relationships among multiple objects, especially when dealing with rarely used or ambiguous clauses. To tackle this challenge, we introduce a novel Mask Grounding auxiliary task that significantly improves visual grounding within language features, by explicitly teaching the model to learn fine-grained correspondence between masked textual tokens and their matching visual objects. Mask Grounding can be directly used on prior RIS methods and consistently bring improvements. Furthermore, to holistically address the modality gap, we also design a cross-modal alignment loss and an accompanying alignment module. These additions work synergistically with Mask Grounding. With all these techniques, our comprehensive approach culminates in MagNet (Mask-grounded Network), an architecture that significantly outperforms prior arts on three key benchmarks (RefCOCO, RefCOCO+ and G-Ref), demonstrating our method's effectiveness in addressing current limitations of RIS algorithms. Our code and pre-trained weights will be released. This paper introduces Mask Grounding, a novel auxiliary task that enhances Referring Image Segmentation (RIS) by improving fine-grained visual grounding in language features. Current RIS methods often struggle with complex referring expressions requiring detailed visual grounding due to the language-image modality gap. Mask Grounding trains the model to predict masked textual tokens using visual, linguistic, and segmentation information, encouraging fine-grained visual-textual correspondence. Additionally, a cross-modal alignment module and loss are introduced to further bridge the modality gap. MagNet, incorporating Mask Grounding, achieves state-of-the-art performance on RefCOCO, RefCOCO+, and G-Ref benchmarks. Mask Grounding significantly improves language-image alignment compared to baseline methods. The universality of Mask Grounding is demonstrated by its successful integration into other RIS methods like LAVT, ReLA, and CRIS, consistently boosting their performance. The impact of different masking strategies on the effectiveness of Mask Grounding warrants further investigation. Future work will explore the application of Mask Grounding to other multimodal dense prediction tasks beyond RIS. referring image segmentation, visual grounding, masked language modeling, multimodal learning, computer vision
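A sketch of what a Mask Grounding style auxiliary objective could look like: mask a fraction of word tokens, predict them from fused visual-linguistic features, and add the cross-entropy term alongside the segmentation loss. The `fuse` and `predict_head` modules and the zeroed-out mask embedding are assumptions.

```python
import torch
import torch.nn.functional as F

def mask_grounding_loss(text_tokens, text_feats, visual_feats, fuse, predict_head,
                        mask_ratio=0.15):
    """Auxiliary masked-token prediction sketch.
    text_tokens: (B, L) word ids; text_feats: (B, L, D); visual_feats: (B, N, D).
    fuse and predict_head are assumed modules: fuse -> (B, L, D), predict_head -> (B, L, vocab)."""
    B, L = text_tokens.shape
    mask = torch.rand(B, L, device=text_tokens.device) < mask_ratio
    if not mask.any():                       # make sure at least one word is masked
        mask[:, 0] = True
    masked_feats = text_feats.clone()
    masked_feats[mask] = 0.0                 # stand-in for a learnable [MASK] embedding
    fused = fuse(masked_feats, visual_feats) # cross-modal fusion (assumed interface)
    logits = predict_head(fused)             # (B, L, vocab)
    return F.cross_entropy(logits[mask], text_tokens[mask])
```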
2312.12030 Report Towards Accurate Guided Diffusion Sampling through Symplectic Adjoint Method Jiachun Pan, Hanshu Yan, Jun Hao Liew, Jiashi Feng, Vincent Y. F. Tan Training-free guided sampling in diffusion models leverages off-the-shelf pre-trained networks, such as an aesthetic evaluation model, to guide the generation process. Current training-free guided sampling algorithms obtain the guidance energy function based on a one-step estimate of the clean image. However, since the off-the-shelf pre-trained networks are trained on clean images, the one-step estimation procedure of the clean image may be inaccurate, especially in the early stages of the generation process in diffusion models. This causes the guidance in the early time steps to be inaccurate. To overcome this problem, we propose Symplectic Adjoint Guidance (SAG), which calculates the gradient guidance in two inner stages. Firstly, SAG estimates the clean image via n function calls, where n serves as a flexible hyperparameter that can be tailored to meet specific image quality requirements. Secondly, SAG uses the symplectic adjoint method to obtain the gradients accurately and efficiently in terms of the memory requirements. Extensive experiments demonstrate that SAG generates images with higher quality compared to the baselines in both guided image and video generation tasks. This paper proposes Symplectic Adjoint Guidance (SAG), a training-free method for guided diffusion models that improves generation quality by using a multiple-step estimate of the clean image and a memory-efficient symplectic adjoint method for gradient backpropagation. Existing training-free guided sampling methods for diffusion models often produce inaccurate guidance due to the misalignment between the final generated image and its one-step denoised approximation, especially in early sampling stages. This leads to lower quality in generated images. SAG calculates gradient guidance in two stages: 1) it estimates the clean image using n denoising steps for higher accuracy, and 2) it employs the symplectic adjoint method to accurately and efficiently backpropagate gradients through these n steps. SAG generates images with higher quality compared to baseline methods like FreeDOM and Universal Guidance in style-guided image generation, as measured by style loss and CLIP score. SAG achieves superior aesthetic improvement compared to Stable Diffusion, FreeDOM, and DOODL, as evidenced by higher aesthetic scores from LAION, PickScore, and HPSv2. In personalized image generation, SAG outperforms DreamBooth, FreeDOM, and DOODL in object guidance, achieving higher CLIP image similarity, and demonstrates better face-ID matching and lower FID scores compared to FreeDOM. There is a trade-off between computation cost and generation quality depending on the number of estimation steps n. The paper primarily focuses on image generation and video stylization, leaving exploration of other guidance tasks for future work. guided diffusion models, training-free guidance, symplectic adjoint method, image generation, video stylization
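A simplified sketch of training-free guidance with an n-step clean-image estimate. It backpropagates through the short estimation chain directly, which is exactly the memory cost the symplectic adjoint method is designed to avoid, so treat it as an illustration of the estimation idea only; `denoise_step` and `energy` are assumed callables.

```python
import torch

def guided_update(x_t, t, n, denoise_step, energy, guidance_scale):
    """One guidance step sketch: estimate the clean image with n denoising calls,
    evaluate an off-the-shelf energy (e.g. an aesthetic model) on the estimate, and
    nudge x_t along the negative energy gradient.

    denoise_step(x, s) is assumed to return the sample at the next (less noisy) timestep;
    energy(x0) is assumed to return a scalar."""
    x = x_t.detach().requires_grad_(True)
    x0_est = x
    # sub-schedule of n steps from t down toward 0 (assumed integer timesteps)
    sub_ts = torch.linspace(t, 0, n + 1).round().long()[:-1]
    for s in sub_ts:
        x0_est = denoise_step(x0_est, int(s))
    g = torch.autograd.grad(energy(x0_est), x)[0]
    return x_t - guidance_scale * g
```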
2312.11894 Report 3D-LFM: Lifting Foundation Model Mosam Dabhi, Laszlo A. Jeni, Simon Lucey The lifting of 3D structure and camera from 2D landmarks is a cornerstone of the entire discipline of computer vision. Traditional methods have been confined to specific rigid objects, such as those in Perspective-n-Point (PnP) problems, but deep learning has expanded our capability to reconstruct a wide range of object classes (e.g. C3DPO and PAUL) with resilience to noise, occlusions, and perspective distortions. All these techniques, however, have been limited by the fundamental need to establish correspondences across the 3D training data, significantly limiting their utility to applications where one has an abundance of "in-correspondence" 3D data. Our approach harnesses the inherent permutation equivariance of transformers to manage a varying number of points per 3D data instance, withstands occlusions, and generalizes to unseen categories. We demonstrate state-of-the-art performance across 2D-3D lifting task benchmarks. Since our approach can be trained across such a broad class of structures, we refer to it simply as a 3D Lifting Foundation Model (3D-LFM), the first of its kind. This paper presents a novel 3D Lifting Foundation Model (3D-LFM) capable of lifting 2D landmarks to 3D structures across 30+ categories using a single unified model, trained without object-specific information. Existing 2D-3D lifting methods are limited by the need for correspondences across 3D training data and object-specific knowledge, hindering their generalizability. 3D-LFM addresses these limitations by leveraging permutation equivariance in transformers and Procrustean alignment, enabling unified learning across diverse object categories. 3D-LFM utilizes a graph-based transformer architecture with tokenized positional encoding (TPE) to handle varying numbers of landmarks. It employs Procrustean alignment to focus on deformable aspects within a canonical frame and a hybrid attention mechanism for efficient feature aggregation. 3D-LFM achieves state-of-the-art performance on benchmarks like H3WB, outperforming specialized methods for human body, face, and hand categories. It exhibits strong generalization by handling unseen object categories and rig configurations, evidenced by successful reconstructions on Acinoset, PASCAL3D+, and Panoptic Studio datasets. Ablation studies confirm the efficacy of TPE in handling data imbalance and rig transfer, while Procrustean alignment and hybrid attention enhance performance and convergence speed. The model can face challenges when extreme perspective distortions cause misinterpretations of 2D keypoint configurations. Future work involves incorporating visual features and temporal dynamics to enhance depth perception and object category differentiation under challenging real-world scenarios. Exploring cross-category knowledge transfer is another direction for future work. 3d lifting, foundation models, transformers, permutation equivariance, procrustean alignment
2312.11841 Report MixRT: Mixed Neural Representations For Real-Time NeRF Rendering Chaojian Li, Bichen Wu, Peter Vajda, Yingyan Lin Neural Radiance Field (NeRF) has emerged as a leading technique for novel view synthesis, owing to its impressive photorealistic reconstruction and rendering capability. Nevertheless, achieving real-time NeRF rendering in large-scale scenes has presented challenges, often leading to the adoption of either intricate baked mesh representations with a substantial number of triangles or resource-intensive ray marching in baked representations. We challenge these conventions, observing that high-quality geometry, represented by meshes with substantial triangles, is not necessary for achieving photorealistic rendering quality. Consequently, we propose MixRT, a novel NeRF representation that includes a low-quality mesh, a view-dependent displacement map, and a compressed NeRF model. This design effectively harnesses the capabilities of existing graphics hardware, thus enabling real-time NeRF rendering on edge devices. Leveraging a highly-optimized WebGL-based rendering framework, our proposed MixRT attains real-time rendering speeds on edge devices (over 30 FPS at a resolution of 1280 x 720 on a MacBook M1 Pro laptop), better rendering quality (0.2 PSNR higher in indoor scenes of the Unbounded-360 datasets), and a smaller storage size (less than 80% compared to state-of-the-art methods). MixRT, a novel NeRF representation for real-time rendering on edge devices, combining a low-quality mesh, a view-dependent displacement map, and a compressed NeRF model (Instant-NGP). Real-time NeRF rendering in large-scale scenes is challenging, with existing methods relying on intricate baked meshes or resource-intensive ray marching. Leverages a low-quality mesh for coarse geometry, a view-dependent displacement map for refined intersection points, and a compressed Instant-NGP model for color. Employs a highly-optimized WebGL-based rendering framework. Achieves real-time rendering speeds on edge devices (over 30 FPS at 1280x720 on MacBook M1 Pro). Delivers high rendering quality, outperforming state-of-the-art methods (0.2 PSNR higher in Unbounded-360 indoor scenes). Reduces storage size compared to state-of-the-art methods (less than 80% of existing methods). Rendering quality in complex outdoor scenes can be further improved. Limited to rasterization-based rendering methods, potentially impacting quality in specific scenarios. neural radiance field, nerf, real-time rendering, webgl, edge devices
2312.11774 Report Text-Image Conditioned Diffusion for Consistent Text-to-3D Generation Yuze He, Yushi Bai, Matthieu Lin, Jenny Sheng, Yubin Hu, Qi Wang, Yu-Hui Wen, Yong-Jin Liu By lifting the pre-trained 2D diffusion models into Neural Radiance Fields (NeRFs), text-to-3D generation methods have made great progress. Many state-of-the-art approaches usually apply score distillation sampling (SDS) to optimize the NeRF representations, which supervises the NeRF optimization with pre-trained text-conditioned 2D diffusion models such as Imagen. However, the supervision signal provided by such pre-trained diffusion models only depends on text prompts and does not constrain the multi-view consistency. To inject the cross-view consistency into diffusion priors, some recent works finetune the 2D diffusion model with multi-view data, but still lack fine-grained view coherence. To tackle this challenge, we incorporate multi-view image conditions into the supervision signal of NeRF optimization, which explicitly enforces fine-grained view consistency. With such stronger supervision, our proposed text-to-3D method effectively mitigates the generation of floaters (due to excessive densities) and completely empty spaces (due to insufficient densities). Our quantitative evaluations on the T^3Bench dataset demonstrate that our method achieves state-of-the-art performance over existing text-to-3D methods. We will make the code publicly available. This paper proposes Text-Image Conditioned Diffusion (TICD), a novel text-to-3D generation method that leverages both text-conditioned and image-conditioned diffusion models to improve view consistency and geometric fidelity in generated 3D models. Existing text-to-3D methods often produce inconsistent multi-view images and struggle with generating accurate object densities, leading to artifacts like floaters or empty spaces. This method aims to address these limitations by incorporating fine-grained view consistency during the generation process. TICD uses two diffusion models during NeRF optimization: a text-conditioned multi-view model for coarse consistency and an image-conditioned novel view model for fine-grained view consistency. The method first renders reference views from sampled camera poses and uses them as conditions for the image-guided diffusion model. Both models contribute to score distillation, guiding the NeRF to generate consistent and accurate 3D models. TICD achieves state-of-the-art performance on the T^3Bench dataset, outperforming existing text-to-3D methods in terms of quality and text alignment. The inclusion of the image-conditioned diffusion module significantly improves the generation quality and reduces artifacts like density collapse and color inconsistency. Quantitative and qualitative results demonstrate that TICD generates 3D content with higher fidelity, clearer geometry, and improved consistency compared to previous approaches. The method relies on two separate diffusion models, which increases the number of parameters and computational cost. Future work could explore designing a single diffusion model capable of handling both text-conditioned multi-view and image-conditioned novel view generation. text-to-3d generation, aigc, diffusion models, neural radiance fields (nerfs), multi-view consistency
2312.11595 Report TIP: Text-Driven Image Processing with Semantic and Restoration Instructions Chenyang Qi, Zhengzhong Tu, Keren Ye, Mauricio Delbracio, Peyman Milanfar, Qifeng Chen, Hossein Talebi Text-driven diffusion models have become increasingly popular for various image editing tasks, including inpainting, stylization, and object replacement. However, it still remains an open research problem to adopt this language-vision paradigm for more fine-level image processing tasks, such as denoising, super-resolution, deblurring, and compression artifact removal. In this paper, we develop TIP, a Text-driven Image Processing framework that leverages natural language as a user-friendly interface to control the image restoration process. We consider the capacity of text information in two dimensions. First, we use content-related prompts to enhance the semantic alignment, effectively alleviating identity ambiguity in the restoration outcomes. Second, our approach is the first framework that supports fine-level instruction through language-based quantitative specification of the restoration strength, without the need for explicit task-specific design. In addition, we introduce a novel fusion mechanism that augments the existing ControlNet architecture by learning to rescale the generative prior, thereby achieving better restoration fidelity. Our extensive experiments demonstrate the superior restoration performance of TIP compared to the state of the arts, alongside offering the flexibility of text-based control over the restoration effects. TIP, a text-driven image processing framework, uses natural language instructions for semantic and quantitative control over image restoration. Existing restoration methods struggle with semantic ambiguities in degraded images and lack flexible control over restoration strength. TIP decouples semantic and restoration prompts, leveraging a ControlNet adaptor trained on a synthetic dataset with paired text instructions. It introduces a modulation fusion layer for adaptive feature alignment. TIP outperforms existing image restoration methods both quantitatively and qualitatively. Semantic prompts in TIP allow controlling the identity of objects in restored images. Restoration prompts enable users to adjust the type and strength of restoration effects using natural language. The current implementation primarily focuses on four common degradation types. Future work includes exploring more complex compositions of degradations. image restoration, text-guided image editing, diffusion models, controlnet, semantic image processing
2312.11535 Report Customize-It-3D: High-Quality 3D Creation from A Single Image Using Subject-Specific Knowledge Prior Nan Huang, Ting Zhang, Yuhui Yuan, Dong Chen, Shanghang Zhang In this paper, we present a novel two-stage approach that fully utilizes the information provided by the reference image to establish a customized knowledge prior for image-to-3D generation. While previous approaches primarily rely on a general diffusion prior, which struggles to yield consistent results with the reference image, we propose a subject-specific and multi-modal diffusion model. This model not only aids NeRF optimization by considering the shading mode for improved geometry but also enhances texture from the coarse results to achieve superior refinement. Both aspects contribute to faithfully aligning the 3D content with the subject. Extensive experiments showcase the superiority of our method, Customize-It-3D, outperforming previous works by a substantial margin. It produces faithful 360-degree reconstructions with impressive visual quality, making it well-suited for various applications, including text-to-3D creation. Presents Customize-It-3D, a novel two-stage approach for image-to-3D generation that utilizes a subject-specific and multi-modal diffusion model to enhance the personalization of 3D content creation. Existing image-to-3D generation methods often produce inconsistent results with the reference image, lacking fidelity and consistency in reconstructing high-fidelity 3D objects. The method uses a two-stage coarse-to-fine framework. The coarse stage optimizes a NeRF using a subject-specific diffusion model for novel view synthesis and shading-aware guidance. The refine stage transforms the NeRF into a point cloud, enhancing texture realism through a subject-specific T2I model and a deferred rendering scheme. Significantly outperforms previous state-of-the-art methods in image-to-3D generation. Produces faithful 360-degree reconstructions with impressive visual quality and 3D consistency. Demonstrates versatility in handling general objects and enables applications like text-to-3D creation. Reliance on pretrained models for depth and normal estimation can impact overall generation quality. Inherent geometry ambiguity from using generative priors can lead to issues like the Janus problem or over-flat geometry. image-to-3d generation, neural radiance fields (nerf), diffusion models, subject-specific knowledge prior, multi-modal learning
2312.11473 Report Synthetic Shifts to Initial Seed Vector Exposes the Brittle Nature of Latent-Based Diffusion Models Mao Po-Yuan, Shashank Kotyan, Tham Yik Foong, Danilo Vasconcellos Vargas Recent advances in Conditional Diffusion Models have led to substantial capabilities in various domains. However, understanding the impact of variations in the initial seed vector remains an underexplored area of concern. Particularly, latent-based diffusion models display inconsistencies in image generation under standard conditions when initialized with suboptimal initial seed vectors. To understand the impact of the initial seed vector on generated samples, we propose a reliability evaluation framework that evaluates the generated samples of a diffusion model when the initial seed vector is subjected to various synthetic shifts. Our results indicate that slight manipulations to the initial seed vector of the state-of-the-art Stable Diffusion (Rombach et al., 2022) can lead to significant disturbances in the generated samples, consequently creating images without the effect of conditioning variables. In contrast, GLIDE (Nichol et al., 2022) stands out in generating reliable samples even when the initial seed vector is transformed. Thus, our study sheds light on the importance of the selection and the impact of the initial seed vector in the latent-based diffusion model. This paper introduces a framework for systematically evaluating the robustness of diffusion models, focusing on their ability to handle variations in initial noise. Understanding the robustness of diffusion models is crucial as they are increasingly used for content creation and synthetic data generation. This study investigates why models like Stable Diffusion might exhibit inconsistent performance under slight variations in initial input noise, impacting their reliability. The authors apply five different noise perturbation techniques (uniform mean shift, random mean shift, standard deviation shift, mixed shift, and pixel arrangement shift) to the initial noise vector of different diffusion models (Stable Diffusion versions, Glide). They then evaluate the effect of these perturbations on image generation quality using metrics like top-1 and top-5 accuracy on ImageNet-100, as well as CLIP score. Stable Diffusion models show significant performance degradation with increasing noise perturbation, highlighting their sensitivity to the initial noise vector. Glide demonstrates significantly higher robustness to noise perturbations compared to Stable Diffusion models, maintaining consistent performance across different noise levels. The paper identifies the fixed variance in Stable Diffusion's denoising process and its tendency to amplify prediction errors as potential reasons for its reduced robustness compared to Glide. The study primarily focuses on image generation and might not directly generalize to other diffusion model applications. Future work could explore the development of more robust diffusion models by incorporating the findings of this study, such as investigating alternative denoising processes and training strategies. diffusion models, robustness, stable diffusion, glide, image generation
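The synthetic shifts themselves are easy to state concretely; a sketch follows, with the shift magnitudes as illustrative values (the paper's mixed shift composes the mean and standard-deviation variants).

```python
import torch

def shift_seed(z, kind="uniform_mean", magnitude=0.5):
    """Apply a synthetic shift to an initial seed vector z ~ N(0, I) before sampling."""
    if kind == "uniform_mean":                 # add the same offset everywhere
        return z + magnitude
    if kind == "random_mean":                  # add a random per-element offset
        return z + magnitude * torch.randn_like(z)
    if kind == "std":                          # rescale the standard deviation
        return z * (1.0 + magnitude)
    if kind == "pixel_rearrange":              # permute spatial positions of the noise
        flat = z.reshape(*z.shape[:-2], -1)    # merge trailing (H, W), assumed layout
        perm = torch.randperm(flat.shape[-1], device=z.device)
        return flat[..., perm].reshape(z.shape)
    raise ValueError(f"unknown shift kind: {kind}")
```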
2312.11461 Report GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning Ye Yuan, Xueting Li, Yangyi Huang, Shalini De Mello, Koki Nagano, Jan Kautz, Umar Iqbal Gaussian splatting has emerged as a powerful 3D representation that harnesses the advantages of both explicit (mesh) and implicit (NeRF) 3D representations. In this paper, we seek to leverage Gaussian splatting to generate realistic animatable avatars from textual descriptions, addressing the limitations (e.g., flexibility and efficiency) imposed by mesh or NeRF-based representations. However, a naive application of Gaussian splatting cannot generate high-quality animatable avatars and suffers from learning instability; it also cannot capture fine avatar geometries and often leads to degenerate body parts. To tackle these problems, we first propose a primitive-based 3D Gaussian representation where Gaussians are defined inside pose-driven primitives to facilitate animation. Second, to stabilize and amortize the learning of millions of Gaussians, we propose to use neural implicit fields to predict the Gaussian attributes (e.g., colors). Finally, to capture fine avatar geometries and extract detailed meshes, we propose a novel SDF-based implicit mesh learning approach for 3D Gaussians that regularizes the underlying geometries and extracts highly detailed textured meshes. Our proposed method, GAvatar, enables the large-scale generation of diverse animatable avatars using only text prompts. GAvatar significantly surpasses existing methods in terms of both appearance and geometry quality, and achieves extremely fast rendering (100 fps) at 1K resolution. GAvatar: a novel approach for generating animatable avatars from text using a novel primitive-based implicit Gaussian representation and a new SDF-based implicit mesh learning approach for 3D Gaussians. Existing methods for text-to-3D avatar generation struggle to balance fine-grained geometry, efficient rendering, and animation capabilities. GAvatar addresses these limitations. GAvatar represents avatars with pose-driven primitives, each containing 3D Gaussians. Neural implicit fields predict Gaussian attributes (color, opacity, etc.) for stable training with SDS loss. An SDF-based approach regularizes geometry and enables mesh extraction. Generates high-quality, animatable avatars with fine geometry details, surpassing existing methods. Achieves fast rendering speed (100 fps at 1K resolution) due to the use of 3D Gaussians. Enables high-quality textured mesh extraction from the learned 3D Gaussian avatar. Occasional color oversaturation in generated avatars, similar to other SDS-based methods. Potential misalignment between geometry and appearance, requiring further exploration of consistent supervision techniques. text-to-3d, avatar generation, gaussian splatting, implicit mesh learning, animatable avatars
2312.11459 Report VolumeDiffusion: Flexible Text-to-3D Generation with Efficient Volumetric Encoder Zhicong Tang, Shuyang Gu, Chunyu Wang, Ting Zhang, Jianmin Bao, Dong Chen, Baining Guo This paper introduces a pioneering 3D volumetric encoder designed for text-to-3D generation. To scale up the training data for the diffusion model, a lightweight network is developed to efficiently acquire feature volumes from multi-view images. The 3D volumes are then trained on a diffusion model for text-to-3D generation using a 3D U-Net. This research further addresses the challenges of inaccurate object captions and high-dimensional feature volumes. The proposed model, trained on the public Objaverse dataset, demonstrates promising outcomes in producing diverse and recognizable samples from text prompts. Notably, it empowers finer control over object part characteristics through textual cues, fostering model creativity by seamlessly combining multiple concepts within a single object. This research significantly contributes to the progress of 3D generation by introducing an efficient, flexible, and scalable representation methodology. Code is available at https://github.com/checkcrab/VolumeDiffusion. This paper introduces VolumeDiffusion, a novel text-to-3D generation method using a novel 3D volumetric representation and a lightweight encoder for efficient feature volume acquisition from multi-view images. Scaling up training data for text-to-3D generation is crucial, and this method addresses limitations of previous representations by being efficient, flexible, and enabling fine-grained text control. The method uses a two-stage approach: 1) a lightweight encoder converts multi-view images to feature volumes, and 2) a 3D U-Net diffusion model learns the distribution of these volumes conditioned on text prompts. The lightweight encoder efficiently generates high-quality 3D volumes, processing 30 objects per second on a single GPU. The diffusion model, trained on a subset of the Objaverse dataset, generates diverse and recognizable 3D objects from text prompts. Compared to methods like Shap·E, VolumeDiffusion exhibits superior control over object part characteristics through textual cues. The model exhibits a bias towards generating white objects due to the prevalence of texture-less objects in the training dataset. Generated objects often have over-smooth surfaces, potentially limited by the spatial resolution of feature volumes. text-to-3d generation, 3d volumetric representation, diffusion model, multi-view images, feature volume
2312.11458 Report GauFRe: Gaussian Deformation Fields for Real-time Dynamic Novel View Synthesis Yiqing Liang, Numair Khan, Zhengqin Li, Thu Nguyen-Phuoc, Douglas Lanman, James Tompkin, Lei Xiao We propose a method for dynamic scene reconstruction using deformable 3D Gaussians that is tailored for monocular video. Building upon the efficiency of Gaussian splatting, our approach extends the representation to accommodate dynamic elements via a deformable set of Gaussians residing in a canonical space, and a time-dependent deformation field defined by a multi-layer perceptron (MLP). Moreover, under the assumption that most natural scenes have large regions that remain static, we allow the MLP to focus its representational power by additionally including a static Gaussian point cloud. The concatenated dynamic and static point clouds form the input for the Gaussian Splatting rasterizer, enabling real-time rendering. The differentiable pipeline is optimized end-to-end with a self-supervised rendering loss. Our method achieves results that are comparable to state-of-the-art dynamic neural radiance field methods while allowing much faster optimization and rendering. Project website: https://lynl7130.github.io/gaufre/index.html This paper introduces GauFRe, a novel method for dynamic scene reconstruction from monocular videos using deformable 3D Gaussians and Gaussian splatting. Existing methods struggle to balance high-quality reconstruction with fast optimization and rendering, especially for dynamic scenes in monocular videos. GauFRe uses a deformation field parameterized by an MLP to deform canonical Gaussians, representing dynamic scene parts. A separate set of static Gaussians captures quasi-static regions. The model is optimized end-to-end with a self-supervised rendering loss. GauFRe achieves comparable or superior reconstruction quality to state-of-the-art dynamic neural radiance field methods. The method allows for much faster optimization (around 20 minutes) compared to hours for some methods. GauFRe enables real-time rendering for novel view synthesis. Modeling scenes with large or irregular motions is challenging due to the single MLP used for the entire deformation field. The dynamic/static separation is sensitive to the quality of the initial structure-from-motion point cloud for real-world scenes. dynamic scene reconstruction, monocular video, deformable gaussians, gaussian splatting, novel view synthesis
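A sketch of the dynamic branch, assuming a plain MLP without positional encoding and a simple dictionary layout for the Gaussian attributes: canonical positions and a time value map to offsets, and static Gaussians bypass the MLP before both sets are concatenated for the rasterizer.

```python
import torch
import torch.nn as nn

class DeformationField(nn.Module):
    """Map a canonical Gaussian position and a time value to offsets for
    position, rotation (quaternion) and scale."""
    def __init__(self, hidden=128, t_dim=16):
        super().__init__()
        self.t_embed = nn.Linear(1, t_dim)
        self.mlp = nn.Sequential(
            nn.Linear(3 + t_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + 4 + 3),
        )

    def forward(self, xyz, t):
        t_col = torch.full((xyz.shape[0], 1), float(t), device=xyz.device)
        d = self.mlp(torch.cat([xyz, self.t_embed(t_col)], dim=-1))
        return d[:, :3], d[:, 3:7], d[:, 7:]

def compose_scene(static_g, dyn_g, deform, t):
    """static_g / dyn_g: dicts with 'xyz', 'rot', 'scale' tensors (assumed layout)."""
    d_xyz, d_rot, d_scale = deform(dyn_g["xyz"], t)
    deformed = {
        "xyz": dyn_g["xyz"] + d_xyz,
        "rot": dyn_g["rot"] + d_rot,       # re-normalized before rasterization in practice
        "scale": dyn_g["scale"] + d_scale,
    }
    # static and deformed dynamic Gaussians are concatenated for the splatting rasterizer
    return {k: torch.cat([static_g[k], deformed[k]], dim=0) for k in deformed}
```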
2312.11417 Report PolyDiff: Generating 3D Polygonal Meshes with Diffusion Models Antonio Alliegro, Yawar Siddiqui, Tatiana Tommasi, Matthias Nießner We introduce PolyDiff, the first diffusion-based approach capable of directly generating realistic and diverse 3D polygonal meshes. In contrast to methods that use alternate 3D shape representations (e.g. implicit representations), our approach is a discrete denoising diffusion probabilistic model that operates natively on the polygonal mesh data structure. This enables learning of both the geometric properties of vertices and the topological characteristics of faces. Specifically, we treat meshes as quantized triangle soups, progressively corrupted with categorical noise in the forward diffusion phase. In the reverse diffusion phase, a transformer-based denoising network is trained to revert the noising process, restoring the original mesh structure. At inference, new meshes can be generated by applying this denoising network iteratively, starting with a completely noisy triangle soup. Consequently, our model is capable of producing high-quality 3D polygonal meshes, ready for integration into downstream 3D workflows. Our extensive experimental analysis shows that PolyDiff achieves a significant advantage (avg. FID and JSD improvement of 18.2 and 5.8 respectively) over current state-of-the-art methods. PolyDiff is the first diffusion-based generative model that operates directly on polygonal meshes, representing them as quantized triangle soups to learn both geometric and topological characteristics. Generating high-fidelity 3D shapes, often as polygonal meshes, is crucial for various applications, but existing methods struggle to capture mesh characteristics due to their reliance on alternate 3D representations. PolyDiff employs a discrete denoising diffusion model that corrupts quantized triangle soups with categorical noise and then learns to reverse this process using a transformer-based denoising network. PolyDiff outperforms state-of-the-art methods in unconditional mesh generation, achieving significant gains in FID and JSD metrics. The method generates more coherent and cohesive 3D shapes compared to techniques relying on autoencoders or autoregressive models. Analysis confirms that the discrete diffusion approach is better suited for mesh generation than continuous Gaussian noise. Extending PolyDiff to generate scene-level meshes instead of single objects is a potential future direction. The sampling speed of PolyDiff could be improved by exploring better sampling techniques and diffusion model formulations. 3d mesh generation, diffusion models, deep learning, generative models, computer vision
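A sketch of the forward (noising) side under stated assumptions: vertex coordinates in a triangle soup are quantized to a fixed number of bins, and at timestep t each token is independently replaced by a uniformly random bin with probability t/T. The bin count and linear schedule are illustrative, not the paper's exact choices.

```python
import torch

def quantize_triangle_soup(vertices, faces, num_bins=128):
    """vertices: (V, 3) coordinates in [-1, 1]; faces: (F, 3) vertex indices.
    Returns (F, 9) integer tokens: three quantized xyz triplets per triangle."""
    tri = vertices[faces]                                              # (F, 3, 3)
    tokens = ((tri + 1.0) / 2.0 * (num_bins - 1)).round().long().clamp(0, num_bins - 1)
    return tokens.reshape(faces.shape[0], 9)

def corrupt(tokens, t, T, num_bins=128):
    """Uniform categorical corruption: replace each token with a random bin with prob t/T."""
    replace = torch.rand(tokens.shape, device=tokens.device) < (t / T)
    random_tokens = torch.randint(0, num_bins, tokens.shape, device=tokens.device)
    return torch.where(replace, random_tokens, tokens)
```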
2312.11396 Report MAG-Edit: Localized Image Editing in Complex Scenarios via Mask-Based Attention-Adjusted Guidance Qi Mao, Lan Chen, Yuchao Gu, Zhen Fang, Mike Zheng Shou Recent diffusion-based image editing approaches have exhibited impressive editing capabilities in images with simple compositions. However, localized editing in complex scenarios has not been well-studied in the literature, despite its growing real-world demands. Existing mask-based inpainting methods fall short of retaining the underlying structure within the edit region. Meanwhile, mask-free attention-based methods often exhibit editing leakage and misalignment in more complex compositions. In this work, we develop MAG-Edit, a training-free, inference-stage optimization method, which enables localized image editing in complex scenarios. In particular, MAG-Edit optimizes the noise latent feature in diffusion models by maximizing two mask-based cross-attention constraints of the edit token, which in turn gradually enhances the local alignment with the desired prompt. Extensive quantitative and qualitative experiments demonstrate the effectiveness of our method in achieving both text alignment and structure preservation for localized editing within complex scenarios. MAG-Edit, a training-free method for localized image editing in complex scenes with multiple objects, by optimizing noise latent features in diffusion models. Existing mask-based methods struggle to maintain structural integrity within edited regions, while mask-free methods suffer from editing leakage and misalignment. MAG-Edit optimizes noise latent features by maximizing two mask-based cross-attention constraints of the edit token, enhancing local alignment with the desired prompt. MAG-Edit effectively balances editing efficiency and structure preservation in complex scenes. Quantitative evaluations demonstrate significant improvements in text alignment within localized regions. User studies confirm the superiority of MAG-Edit in text alignment, structure preservation, and overall editing quality. Long inference time due to the optimization process. Limitations in editing scenarios requiring significant pose changes due to reliance on maintaining structure through cross-attention maps. image editing, diffusion models, cross-attention, localized editing, complex scenes
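One of the two mask-based cross-attention constraints can be illustrated as a ratio loss that is minimized by gradient steps on the noise latent; the attention extraction is abstracted into a `get_edit_attention` callable, and the step count and learning rate are assumptions.

```python
import torch

def attention_ratio_loss(attn_map, mask):
    """attn_map: (H*W,) cross-attention of the edit token over spatial positions;
    mask: (H*W,) binary edit-region mask. Loss is low when attention concentrates inside."""
    inside = (attn_map * mask).sum()
    return 1.0 - inside / (attn_map.sum() + 1e-8)

def refine_latent(z_t, get_edit_attention, mask, steps=5, lr=0.1):
    """Gradient steps on the noise latent z_t; get_edit_attention(z) is assumed to run the
    denoising UNet at the current timestep and return the edit token's attention map."""
    z = z_t.detach().clone().requires_grad_(True)
    for _ in range(steps):
        loss = attention_ratio_loss(get_edit_attention(z), mask)
        grad = torch.autograd.grad(loss, z)[0]
        z = (z - lr * grad).detach().requires_grad_(True)
    return z.detach()
```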
2312.11392 Report SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing Zeyinzi Jiang, Chaojie Mao, Yulin Pan, Zhen Han, Jingfeng Zhang Image diffusion models have been utilized in various tasks, such as text-to-image generation and controllable image synthesis. Recent research has introduced tuning methods that make subtle adjustments to the original models, yielding promising results in specific adaptations of foundational generative diffusion models. Rather than modifying the main backbone of the diffusion model, we delve into the role of skip connection in U-Net and reveal that hierarchical features aggregating long-distance information across encoder and decoder make a significant impact on the content and quality of image generation. Based on the observation, we propose an efficient generative tuning framework, dubbed SCEdit, which integrates and edits Skip Connection using a lightweight tuning module named SC-Tuner. Furthermore, the proposed framework allows for straightforward extension to controllable image synthesis by injecting different conditions with Controllable SC-Tuner, simplifying and unifying the network design for multi-condition inputs. Our SCEdit substantially reduces training parameters, memory usage, and computational expense due to its lightweight tuners, with backward propagation only passing to the decoder blocks. Extensive experiments conducted on text-to-image generation and controllable image synthesis tasks demonstrate the superiority of our method in terms of efficiency and performance. Project page: \url{https://scedit.github.io/} This paper proposes SCEdit, an efficient and controllable image diffusion generation framework for efficient fine-tuning and controllable image synthesis. Fine-tuning large diffusion models is resource-intensive. This work introduces an efficient alternative, SCEdit, that achieves comparable results with reduced computational cost and improved controllability. SCEdit introduces lightweight tuning modules (SC-Tuner & CSC-Tuner) that edit the latent features within skip connections of a pre-trained U-Net, allowing for efficient adaptation without modifying the main backbone. SCEdit outperforms existing text-to-image tuning methods on COCO2017 in FID score and visual quality while using significantly fewer parameters and memory. For controllable synthesis, SCEdit achieves strong results with various conditions (edges, depth, segmentation, etc.) using only 7.9% of ControlNet's parameters and 30% less memory. SCEdit supports composable generation by combining multiple conditions and demonstrates generalization ability, enabling tasks like sketch-to-image and controlled outpainting. The performance depends on the pre-trained model due to the frozen backbone. Potential misuse of high-risk data during training could lead to harmful outputs. image generation, diffusion models, efficient tuning, controllable synthesis, skip connections
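A sketch of the SC-Tuner idea: a small residual module placed on each U-Net skip connection, initialized to the identity so tuning starts from the frozen model's behavior; an optional `condition` input stands in for the CSC-Tuner variant. The module shape and bottleneck width are assumptions.

```python
import torch
import torch.nn as nn

class SCTuner(nn.Module):
    """Lightweight residual tuner applied to one skip-connection feature map."""
    def __init__(self, channels, bottleneck=64):
        super().__init__()
        self.down = nn.Conv2d(channels, bottleneck, kernel_size=1)
        self.up = nn.Conv2d(bottleneck, channels, kernel_size=1)
        nn.init.zeros_(self.up.weight)           # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, skip_feat, condition=None):
        h = skip_feat if condition is None else skip_feat + condition   # CSC-Tuner style injection
        return skip_feat + self.up(torch.relu(self.down(h)))

# usage sketch: wrap each skip tensor before it is concatenated into the decoder;
# only the tuners are trained, the U-Net backbone stays frozen
# edited_skip = tuners[i](skip_feats[i])
```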
2312.11370 Report G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, Lingpeng Kong Large language models (LLMs) have shown remarkable proficiency in human-level reasoning and generation capabilities, which encourages extensive research on their application in mathematical problem solving. However, current work has been largely focused on text-based mathematical problems, with limited investigation in problems involving geometric information. Addressing this gap, we aim to enable LLMs to solve geometric problems by understanding image input. We first analyze the limitations of current Multimodal Large Language Models (MLLMs) in this area: they struggle to accurately comprehending basic geometric elements and their relationships. To overcome these challenges, we take advantage of the unique characteristics of geometric problems (such as unique geometric logical form, and geometric scalability) and the capacity of the textual LLMs to build an enriched multimodal geometry dataset based on existing data. The augmented dataset, Geo170K, contains more than 170K geometric image-caption and question-answer pairs. Utilizing our constructed Geo170K dataset, we develop G-LLaVA, which demonstrates exceptional performance in solving geometric problems, significantly outperforming GPT-4-V on the MathVista benchmark with only 7B parameters. This paper introduces a novel method for enhancing Multimodal Large Language Models (MLLMs) to solve geometric problems, addressing the limitations of current models in comprehending and reasoning about geometric information. Existing MLLMs often struggle to understand geometric elements and their relationships, hindering their ability to solve geometric problems effectively. This paper aims to bridge this gap by improving the models' geometric reasoning capabilities. The authors propose a two-phase approach: (1) **Geometric Cross-Modal Alignment**: Using existing datasets, they generate image captions and contrastive question-answer pairs, focusing on basic geometric elements. (2) **Geometric Instruction Tuning**: Utilizing text-only LLMs like ChatGPT, they enrich existing datasets by generating new problem variations, such as equation solving, value scaling, re-formulating conditions, and sentence paraphrasing. The resulting model, G-LLaVA, significantly outperforms existing MLLMs on the MathVista benchmark, even surpassing GPT-4-V with only 7B parameters. G-LLaVA also demonstrates superior performance compared to traditional in-domain models on the GeoQA benchmark. The effectiveness of the proposed cross-modal alignment and instruction tuning strategies is validated through ablation studies. The study primarily focuses on geometric problems, and further research is needed to assess its generalizability to other domains of mathematical reasoning. Future work can explore incorporating more sophisticated geometric reasoning techniques and expanding the dataset to encompass a wider range of geometric concepts and problem types. multimodal large language models, geometric reasoning, data augmentation, instruction tuning, mathematical problem solving
2312.11360 Report Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering Kim Youwang, Tae-Hyun Oh, Gerard Pons-Moll We present Paint-it, a text-driven high-fidelity texture map synthesis method for 3D meshes via neural re-parameterized texture optimization. Paint-it synthesizes texture maps from a text description by synthesis-through-optimization, exploiting the Score-Distillation Sampling (SDS). We observe that directly applying SDS yields undesirable texture quality due to its noisy gradients. We reveal the importance of texture parameterization when using SDS. Specifically, we propose Deep Convolutional Physically-Based Rendering (DC-PBR) parameterization, which re-parameterizes the physically-based rendering (PBR) texture maps with randomly initialized convolution-based neural kernels, instead of a standard pixel-based parameterization. We show that DC-PBR inherently schedules the optimization curriculum according to texture frequency and naturally filters out the noisy signals from SDS. In experiments, Paint-it obtains remarkable quality PBR texture maps within 15 min., given only a text description. We demonstrate the generalizability and practicality of Paint-it by synthesizing high-quality texture maps for large-scale mesh datasets and showing test-time applications such as relighting and material control using a popular graphics engine. Project page: https://kim-youwang.github.io/paint-it Paint-it: Text-driven high-fidelity PBR texture map synthesis for 3D meshes via neural re-parameterized texture optimization. Existing methods for generating textured 3D assets from text often produce low-quality results or rely on computationally expensive techniques. This paper addresses these limitations by directly synthesizing high-fidelity, physically-based texture maps on existing 3D models, facilitating practical use in graphics engines and pipelines. The method uses Score-Distillation Sampling (SDS) to guide the optimization of a Deep Convolutional Physically-Based Rendering (DC-PBR) model, which represents texture maps as randomly initialized U-Net convolutional kernels. This approach inherently schedules the optimization curriculum according to texture frequency and filters out noisy signals from SDS, leading to higher-quality results. Paint-it generates high-quality PBR texture maps for various 3D meshes, including objects, humans, and animals, demonstrating its generalizability. The method produces superior texture maps compared to existing methods, as evidenced by qualitative comparisons and quantitative metrics like FID and user study scores. The synthesized PBR texture maps are compatible with popular graphics engines and enable practical applications like relighting and material control. The optimization process can be time-consuming, taking 15-30 minutes per mesh. Future work includes exploring faster optimization techniques and building a large-scale PBR texture map dataset for training feed-forward generative models. text-driven synthesis, pbr texture maps, 3d mesh texturing, score-distillation sampling, deep convolutional re-parameterization
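The sketch below illustrates the deep-prior idea behind DC-PBR under our own simplifying assumptions (layer sizes, a five-channel output for albedo/roughness/metallic, the class name DCPBRTexture); the paper's actual U-Net parameterization differs in detail.

```python
import torch
import torch.nn as nn

class DCPBRTexture(nn.Module):
    """Deep-prior style texture parameterization: the PBR maps are the OUTPUT of a
    small randomly initialized conv net over a fixed noise input, so SDS gradients
    update the convolution weights rather than raw texture pixels."""
    def __init__(self, res: int = 512):
        super().__init__()
        self.register_buffer("z", torch.randn(1, 16, res // 4, res // 4))  # fixed input
        self.net = nn.Sequential(
            nn.Conv2d(16, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear"),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear"),
            nn.Conv2d(64, 5, 3, padding=1),   # 3 albedo + 1 roughness + 1 metallic
        )

    def forward(self):
        out = torch.sigmoid(self.net(self.z))
        albedo, roughness, metallic = out[:, :3], out[:, 3:4], out[:, 4:5]
        return albedo, roughness, metallic

# The maps would be fed to a differentiable PBR renderer and the rendered views
# scored with SDS; only self.net's weights receive gradients.
albedo, rough, metal = DCPBRTexture()()
print(albedo.shape, rough.shape, metal.shape)
```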
2312.11232 Report Self-Supervised Learning for Image Super-Resolution and Deblurring Jérémy Scanvic, Mike Davies, Patrice Abry, Julián Tachella Self-supervised methods have recently proved to be nearly as effective as supervised methods in various imaging inverse problems, paving the way for learning-based methods in scientific and medical imaging applications where ground truth data is hard or expensive to obtain. This is the case in magnetic resonance imaging and computed tomography. These methods critically rely on invariance to translations and/or rotations of the image distribution to learn from incomplete measurement data alone. However, existing approaches fail to obtain competitive performances in the problems of image super-resolution and deblurring, which play a key role in most imaging systems. In this work, we show that invariance to translations and rotations is insufficient to learn from measurements that only contain low-frequency information. Instead, we propose a new self-supervised approach that leverages the fact that many image distributions are approximately scale-invariant, and that enables recovering high-frequency information lost in the measurement process. We demonstrate throughout a series of experiments on real datasets that the proposed method outperforms other self-supervised approaches, and obtains performances on par with fully supervised learning. This paper introduces a novel self-supervised learning approach for image super-resolution and deblurring leveraging the approximate scale-invariance of many image distributions. Super-resolution and deblurring are crucial for various imaging systems, but existing self-supervised methods struggle when high-frequency information is lost in the measurement process. The method trains a deep neural network using a new loss function combining the SURE loss and a novel equivariant loss based on downscaling transformations. A key aspect is stopping the gradient during downscaling to enhance performance. The approach outperforms other self-supervised methods for both image deblurring and super-resolution. It achieves performance on par with fully supervised learning methods. Stopping the gradient during downscaling is shown to significantly improve reconstruction performance. Theoretical analysis of necessary and sufficient conditions for learning from measurements alone with scaling transformations (semi-group) is missing. Exploring different downscaling implementations and training strategies could further improve performance. image deblurring, image super-resolution, self-supervised learning, scale-invariance, equivariant imaging
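A minimal sketch of the scale-equivariance idea with the stop-gradient, assuming a generic reconstruction network and a known downsampling operator; the paper's full objective also includes a SURE data-fidelity term, and its exact transformations and weighting may differ.

```python
import torch
import torch.nn.functional as F

def scale_equivariant_loss(model, y, downsample, scale=0.75):
    """Scale-equivariance term for self-supervised SR/deblurring (a sketch only).

    model:      reconstruction network, x_hat = model(y)
    y:          measurement (low-res / blurred image), shape (B, C, H, W)
    downsample: the known measurement operator (e.g. blur + decimation)
    """
    x_hat = model(y)
    # Rescale the reconstruction and STOP the gradient through it, treating it as a
    # pseudo ground truth at a new scale (the stop-gradient is reported to matter).
    x_scaled = F.interpolate(x_hat, scale_factor=scale,
                             mode="bicubic", align_corners=False).detach()
    y2 = downsample(x_scaled)          # re-measure the rescaled image
    x_hat2 = model(y2)                 # reconstruct again
    return F.mse_loss(x_hat2, x_scaled)

# Toy usage with stand-in operators; a real setup uses the trained network and the
# true degradation model.
up = lambda y: F.interpolate(y, scale_factor=2, mode="bicubic", align_corners=False)
down = lambda x: F.interpolate(x, scale_factor=0.5, mode="bicubic", align_corners=False)
print(scale_equivariant_loss(up, torch.randn(1, 3, 64, 64), down))
```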
2312.10998 Report ID-Blau: Image Deblurring by Implicit Diffusion-based reBLurring AUgmentation Jia-Hao Wu, Fu-Jen Tsai, Yan-Tsung Peng, Chung-Chi Tsai, Chia-Wen Lin, Yen-Yu Lin Image deblurring aims to remove undesired blurs from an image captured in a dynamic scene. Much research has been dedicated to improving deblurring performance through model architectural designs. However, there is little work on data augmentation for image deblurring. Since continuous motion causes blurred artifacts during image exposure, we aspire to develop a groundbreaking blur augmentation method to generate diverse blurred images by simulating motion trajectories in a continuous space. This paper proposes Implicit Diffusion-based reBLurring AUgmentation (ID-Blau), utilizing a sharp image paired with a controllable blur condition map to produce a corresponding blurred image. We parameterize the blur patterns of a blurred image with their orientations and magnitudes as a pixel-wise blur condition map to simulate motion trajectories and implicitly represent them in a continuous space. By sampling diverse blur conditions, ID-Blau can generate various blurred images unseen in the training set. Experimental results demonstrate that ID-Blau can produce realistic blurred images for training and thus significantly improve performance for state-of-the-art deblurring models. The source code is available at https://github.com/plusgood-steven/ID-Blau. This paper proposes Implicit Diffusion-based reBLurring AUgmentation (ID-Blau), a novel data augmentation strategy for image deblurring. ID-Blau generates realistic blurred images by simulating motion trajectories in a continuous space, using a sharp image and a controllable blur condition map. Effective data augmentation is crucial for improving the performance of image deblurring models, but existing methods are limited in their ability to generate diverse and controllable blurred images. The authors model blur conditions in a continuous space, representing blur orientations and magnitudes. They then use a diffusion model conditioned on sharp images and blur condition maps to generate realistic blurred images. ID-Blau significantly improves the performance of four state-of-the-art deblurring models (MIMO-UNet+, Restormer, Stripformer, and FFTformer) on GoPro, HIDE, and RealBlur datasets. ID-Blau generates more realistic blurred images compared to the GoPro dataset by simulating continuous motion trajectories. The use of a diffusion process in ID-Blau leads to additional performance gains compared to directly training a reblurring model without diffusion. The authors mainly evaluate ID-Blau on synthetic datasets (GoPro and HIDE) and one real-world dataset (RealBlur). Further evaluation on more diverse real-world blurry images is needed. ID-Blau requires additional training complexity compared to not using augmentation. Exploring more efficient ways to generate augmented data is an interesting direction. image deblurring, data augmentation, diffusion models, blur simulation, continuous blur condition
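One plausible way to build the pixel-wise blur condition map from orientations and magnitudes is sketched below; the channel encoding and normalization are our assumptions rather than ID-Blau's exact definition.

```python
import torch

def blur_condition_map(theta: torch.Tensor, magnitude: torch.Tensor) -> torch.Tensor:
    """Encode per-pixel blur orientation/magnitude as a continuous condition map.

    theta:     (H, W) blur orientation in radians
    magnitude: (H, W) blur magnitude, e.g. normalized to [0, 1]
    Returns a (3, H, W) map; the exact channel layout used by ID-Blau may differ.
    """
    return torch.stack([magnitude * torch.cos(theta),
                        magnitude * torch.sin(theta),
                        magnitude], dim=0)

# Sampling new conditions (here a single global motion direction with random
# per-pixel strength) lets the diffusion reblurrer produce blurred variants that
# never appear in the training set.
H, W = 256, 256
theta = torch.full((H, W), 0.3)
magnitude = torch.rand(H, W)
print(blur_condition_map(theta, magnitude).shape)   # torch.Size([3, 256, 256])
```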
2312.10945 Report LaViP:Language-Grounded Visual Prompts Nilakshan Kunananthaseelan, Jing Zhang, Mehrtash Harandi We introduce a language-grounded visual prompting method to adapt the visual encoder of vision-language models for downstream tasks. By capitalizing on language integration, we devise a parameter-efficient strategy to adjust the input of the visual encoder, eliminating the need to modify or add to the model's parameters. Due to this design choice, our algorithm can operate even in black-box scenarios, showcasing adaptability in situations where access to the model's parameters is constrained. We will empirically demonstrate that, compared to prior art, grounding visual prompts with language enhances both the accuracy and speed of adaptation. Moreover, our algorithm excels in base-to-novel class generalization, overcoming limitations of visual prompting and exhibiting the capacity to generalize beyond seen classes. We thoroughly assess and evaluate our method across a variety of image recognition datasets, such as EuroSAT, UCF101, DTD, and CLEVR, spanning different learning situations, including few-shot learning, base-to-novel class generalization, and transfer learning. This paper introduces LaViP, a language-grounded visual prompting method to adapt the visual encoder of vision-language models for downstream tasks. Existing visual prompting techniques suffer from limitations such as unimodality in learning prompts and an inability to generalize beyond seen classes. LaViP addresses these limitations by leveraging the multimodal nature of vision-language models. LaViP generates input-dependent visual prompts through low-rank matrix decomposition, incorporating both language and image information. It uses a Kronecker product to efficiently embed novel class knowledge for base-to-novel generalization. LaViP outperforms previous methods in few-shot learning by a significant margin, achieving up to 11.84% improvement over CLIP Zero-Shot. In base-to-novel generalization, LaViP shows competitive performance, achieving an absolute gain of 2.64% compared to CoOp and CoCoOp. LaViP demonstrates strong performance across diverse datasets and consistently outperforms existing visual prompting methods. LaViP's performance is sensitive to the chosen prompt template and may struggle with low-resolution images or datasets with limited semantic variation. Future work could explore learning multimodal prompts with mutual synergy between visual and textual information. visual prompting, vision-language models, model reprogramming, few-shot learning, base-to-novel generalization
2312.10899 Report MagicScroll: Nontypical Aspect-Ratio Image Generation for Visual Storytelling via Multi-Layered Semantic-Aware Denoising Bingyuan Wang, Hengyu Meng, Zeyu Cai, Lanjiong Li, Yue Ma, Qifeng Chen, Zeyu Wang Visual storytelling often uses nontypical aspect-ratio images like scroll paintings, comic strips, and panoramas to create an expressive and compelling narrative. While generative AI has achieved great success and shown the potential to reshape the creative industry, it remains a challenge to generate coherent and engaging content with arbitrary size and controllable style, concept, and layout, all of which are essential for visual storytelling. To overcome the shortcomings of previous methods including repetitive content, style inconsistency, and lack of controllability, we propose MagicScroll, a multi-layered, progressive diffusion-based image generation framework with a novel semantic-aware denoising process. The model enables fine-grained control over the generated image on object, scene, and background levels with text, image, and layout conditions. We also establish the first benchmark for nontypical aspect-ratio image generation for visual storytelling including mediums like paintings, comics, and cinematic panoramas, with customized metrics for systematic evaluation. Through comparative and ablation studies, MagicScroll showcases promising results in aligning with the narrative text, improving visual coherence, and engaging the audience. We plan to release the code and benchmark in the hope of a better collaboration between AI researchers and creative practitioners involving visual storytelling. MagicScroll, a multi-layered, progressive diffusion-based image generation framework for creating coherent and engaging nontypical aspect-ratio images for visual storytelling, featuring semantic-aware denoising and multi-level control over style, content, and layout. Existing methods struggle to generate coherent and engaging visual content for storytelling, especially with arbitrary size and controllable style, concept, and layout, which are crucial for conveying narrative and emotion in mediums like scroll paintings, comic strips, and panoramas. The framework leverages GPT-based layout prediction, semantic-aware denoising, and text/image-based style control modules. It utilizes predicted object/scene masks, reference images, and style concepts to guide the generation process, ensuring coherence and alignment with the narrative text. Outperforms existing methods in generating nontypical aspect-ratio images with higher content richness and fidelity to input text. Demonstrates superior performance in visual coherence and user engagement based on both quantitative metrics and subjective user ratings. Provides fine-grained control over the generated images at object, scene, and background levels, enabling diverse visual storytelling scenarios including painting, comic, and panorama styles. Exploration of tokenizers and encoders specifically designed for ultra-long texts to improve story processing. Integration of additional conditional controls at various stages of the generation process and incorporation of pre-trained modules for enhanced controllability over visual effects. visual storytelling, image generation, diffusion models, layout control, semantic-aware denoising
2312.10835 Report Your Student is Better Than Expected: Adaptive Teacher-Student Collaboration for Text-Conditional Diffusion Models Nikita Starodubcev, Artem Fedorov, Artem Babenko, Dmitry Baranchuk Knowledge distillation methods have recently shown to be a promising direction to speedup the synthesis of large-scale diffusion models by requiring only a few inference steps. While several powerful distillation methods were recently proposed, the overall quality of student samples is typically lower compared to the teacher ones, which hinders their practical usage. In this work, we investigate the relative quality of samples produced by the teacher text-to-image diffusion model and its distilled student version. As our main empirical finding, we discover that a noticeable portion of student samples exhibit superior fidelity compared to the teacher ones, despite the "approximate" nature of the student. Based on this finding, we propose an adaptive collaboration between student and teacher diffusion models for effective text-to-image synthesis. Specifically, the distilled model produces the initial sample, and then an oracle decides whether it needs further improvements with a slow teacher model. Extensive experiments demonstrate that the designed pipeline surpasses state-of-the-art text-to-image alternatives for various inference budgets in terms of human preference. Furthermore, the proposed approach can be naturally used in popular applications such as text-guided image editing and controllable generation. This paper finds that distilled text-to-image models can outperform teacher models on a significant number of samples, and proposes an adaptive collaboration method between student and teacher diffusion models for cost-effective and high-quality text-to-image synthesis. Large-scale diffusion models excel in text-conditional image generation but suffer from high inference costs. Distillation methods offer faster inference but often at a quality loss. This paper explores a new direction of student-teacher collaboration to leverage the advantages of both. The paper analyzes the performance of distilled models, revealing their strengths. It then proposes a three-step adaptive approach: 1) student generates an initial image, 2) an oracle (ImageReward estimator with a cut-off threshold) determines if improvement is needed, 3) if so, the teacher either refines the student sample or regenerates a new one. Distilled text-to-image models can generate superior samples compared to teacher models for a noticeable portion of prompts, especially on challenging cases. The proposed adaptive collaborative approach surpasses baselines (including teacher models and other distillation methods) in terms of both human preference and automatic metrics (FID, CLIP score, ImageReward). The method effectively improves the quality and efficiency of text-guided image editing and controllable generation tasks. The performance of the approach relies on the accuracy of the sample quality estimator and the effectiveness of the no-reference decision-making procedure. The paper primarily focuses on consistency distillation and evaluates limited distilled models. Exploring other distillation techniques and student models is left for future work. text-to-image generation, diffusion models, knowledge distillation, adaptive collaboration, image quality assessment
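The overall decision loop of the adaptive collaboration can be sketched as below; all callables (student, teacher_refine, teacher_generate, reward_model) and the threshold value are hypothetical stand-ins rather than the paper's code.

```python
def adaptive_generate(prompt, student, teacher_refine, teacher_generate,
                      reward_model, threshold=0.5, refine_strength=0.6):
    """Sketch of the adaptive student/teacher pipeline. `student` and
    `teacher_generate` map a prompt to an image, `teacher_refine` performs an
    img2img-style refinement, and `reward_model` scores (prompt, image) pairs."""
    # 1) Cheap pass: the distilled student produces a sample in a few steps.
    image = student(prompt)

    # 2) Oracle: accept the fast sample if the reward clears the cut-off threshold.
    if reward_model(prompt, image) >= threshold:
        return image

    # 3) Otherwise the slow teacher improves it, either by refining the student
    #    output (partial re-noising) or by regenerating from scratch.
    refined = teacher_refine(prompt, image, strength=refine_strength)
    if reward_model(prompt, refined) >= reward_model(prompt, image):
        return refined
    return teacher_generate(prompt)
```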
2312.10763 Report M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts Mingsheng Li, Xin Chen, Chi Zhang, Sijin Chen, Hongyuan Zhu, Fukun Yin, Gang Yu, Tao Chen Recently, 3D understanding has become popular to facilitate autonomous agents to perform further decisionmaking. However, existing 3D datasets and methods are often limited to specific tasks. On the other hand, recent progress in Large Language Models (LLMs) and Multimodal Language Models (MLMs) have demonstrated exceptional general language and imagery tasking performance. Therefore, it is interesting to unlock MLM's potential to be 3D generalist for wider tasks. However, current MLMs' research has been less focused on 3D tasks due to a lack of large-scale 3D instruction-following datasets. In this work, we introduce a comprehensive 3D instructionfollowing dataset called M3DBench, which possesses the following characteristics: 1) It supports general multimodal instructions interleaved with text, images, 3D objects, and other visual prompts. 2) It unifies diverse 3D tasks at both region and scene levels, covering a variety of fundamental abilities in real-world 3D environments. 3) It is a large-scale 3D instruction-following dataset with over 320k instruction-response pairs. Furthermore, we establish a new benchmark for assessing the performance of large models in understanding multi-modal 3D prompts. Extensive experiments demonstrate the effectiveness of our dataset and baseline, supporting general 3D-centric tasks, which can inspire future research. This paper introduces M3DBench, a large-scale multi-modal 3D instruction-following dataset for developing general-purpose assistants in 3D environments. Existing 3D datasets often focus on specific tasks, limiting the development of general-purpose 3D assistants. This dataset aims to bridge this gap and unlock the potential of Multi-modal Language Models (MLMs) in the 3D domain. M3DBench leverages existing 3D datasets and uses LLMs to generate multi-modal instructions interleaved with text, coordinates, images, and 3D objects. The dataset covers diverse 3D tasks, including object detection, visual grounding, dense captioning, question answering, dialogue, planning, and navigation. M3DBench contains over 320k instruction-response pairs, including over 138k multi-modal instructions. A simple baseline model trained on M3DBench demonstrates the effectiveness of the dataset in enabling MLMs to understand 3D scenes and follow instructions. The authors establish a benchmark for evaluating the performance of MLMs on various 3D tasks, including scene understanding, reasoning, and planning. The performance of baseline models on certain tasks, such as detailed description and object localization, is suboptimal, indicating room for improvement in 3D MLM development. Future work can explore more sophisticated model architectures and training strategies to further enhance the capabilities of 3D MLMs. 3d vision, multi-modal learning, instruction following, large language models, dataset
2312.10665 Report Silkie: Preference Distillation for Large Visual Language Models Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, Lingpeng Kong This paper explores preference distillation for large vision language models (LVLMs), improving their ability to generate helpful and faithful responses anchoring the visual context. We first build a vision-language feedback (VLFeedback) dataset utilizing AI annotation. Specifically, responses are generated by models sampled from 12 LVLMs, conditioned on multi-modal instructions sourced from various datasets. We adopt GPT-4V to assess the generated outputs regarding helpfulness, visual faithfulness, and ethical considerations. Furthermore, the preference supervision is distilled into Qwen-VL-Chat through the direct preference optimization (DPO) method. The resulting model Silkie, achieves 6.9% and 9.5% relative improvement on the MME benchmark regarding the perception and cognition capabilities, respectively. Silkie also demonstrates reduced hallucination by setting a new state-of-the-art score of 3.02 on the MMHal-Bench benchmark. Further analysis shows that DPO with our VLFeedback dataset mainly boosts the fine-grained perception and complex cognition abilities of LVLMs, leading to more comprehensive improvements compared to human-annotated preference datasets. This paper introduces Silkie, a large vision language model (LVLM) enhanced with preference distillation to generate more helpful and faithful responses grounded in visual context. Open-sourced LVLMs often exhibit misalignment issues, generating misleading content or biased responses. This work aims to improve LVLMs' reliability by aligning them with human preferences. The authors construct VLFeedback, a large-scale multi-modal preference dataset annotated by GPT-4V. This dataset covers 80k multi-modal instructions and responses from 12 LVLMs, evaluated on helpfulness, visual faithfulness, and ethical considerations. Silkie is then trained using direct preference optimization (DPO) on this dataset, distilling the preferences into the model. Silkie achieves significant improvements on the MME benchmark, demonstrating 6.9% and 9.5% relative gains in perception and cognition tasks, respectively. The model shows reduced hallucination, achieving a new state-of-the-art score of 3.02 on the MMHal-Bench benchmark. Analysis reveals that VLFeedback particularly benefits fine-grained perception tasks (e.g., OCR) and complex cognition tasks (e.g., code reasoning). The VLFeedback dataset currently lacks sufficient supervision for safety alignment, potentially requiring the incorporation of red-teaming techniques in future work. The work focuses on a limited range of LVLMs and instruction datasets. Future iterations could incorporate newer models and diverse datasets for broader evaluation. vision language models, preference distillation, ai feedback, hallucination reduction, multi-modal alignment
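The DPO objective used for the preference distillation has the standard pairwise form; a minimal sketch with our own variable names:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Pairwise DPO objective. Each argument is the summed log-likelihood of a
    response under the policy being tuned (logp_*) or the frozen reference
    model (ref_logp_*), for a batch of (chosen, rejected) preference pairs."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage with random per-pair log-likelihoods (batch of 4 preference pairs).
lp_c, lp_r, ref_c, ref_r = (torch.randn(4) for _ in range(4))
print(dpo_loss(lp_c, lp_r, ref_c, ref_r))
```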
2312.10656 Report VidToMe: Video Token Merging for Zero-Shot Video Editing Xirui Li, Chao Ma, Xiaokang Yang, Ming-Hsuan Yang Diffusion models have made significant advances in generating high-quality images, but their application to video generation has remained challenging due to the complexity of temporal motion. Zero-shot video editing offers a solution by utilizing pre-trained image diffusion models to translate source videos into new ones. Nevertheless, existing methods struggle to maintain strict temporal consistency and efficient memory consumption. In this work, we propose a novel approach to enhance temporal consistency in generated videos by merging self-attention tokens across frames. By aligning and compressing temporally redundant tokens across frames, our method improves temporal coherence and reduces memory consumption in self-attention computations. The merging strategy matches and aligns tokens according to the temporal correspondence between frames, facilitating natural temporal consistency in generated video frames. To manage the complexity of video processing, we divide videos into chunks and develop intra-chunk local token merging and inter-chunk global token merging, ensuring both short-term video continuity and long-term content consistency. Our video editing approach seamlessly extends the advancements in image editing to video editing, rendering favorable results in temporal consistency over state-of-the-art methods. This paper presents VidToMe, a novel approach for zero-shot video editing that enhances temporal consistency by merging self-attention tokens across video frames. Existing video editing methods struggle to maintain strict temporal consistency and efficient memory consumption due to the complexity of temporal motion in videos. VidToMe merges similar tokens across video frames in the self-attention module of a pre-trained text-to-image diffusion model. It employs local token merging within short chunks and global token merging across chunks to ensure both short-term and long-term video consistency. This method reduces redundant computations and enforces consistent feature extraction across frames. VidToMe significantly improves temporal consistency in generated videos compared to state-of-the-art methods, as demonstrated by qualitative and quantitative evaluations. The method reduces memory consumption in self-attention computations, making it more efficient. VidToMe seamlessly integrates with existing image editing techniques, allowing for versatile video editing applications. The editing quality heavily relies on the performance of the chosen image editing method. The similarity-based token matching, while generally effective, has room for improvement to prevent the incorrect merging of visually similar objects. video editing, diffusion models, temporal consistency, self-attention, token merging
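A heavily simplified sketch of similarity-based cross-frame token merging, assuming plain (N, C) token matrices and a fixed merge ratio; VidToMe's local/global merging and subsequent un-merging logic is more involved than this.

```python
import torch
import torch.nn.functional as F

def merge_tokens_across_frames(src, dst, merge_ratio=0.5):
    """Greatly simplified cross-frame token merging (in the spirit of ToMe).

    src: (N, C) self-attention tokens of the current frame
    dst: (M, C) tokens of a reference frame (e.g. the previous frame / chunk)
    Returns the reduced src tokens and the updated dst tokens.
    """
    sim = F.normalize(src, dim=-1) @ F.normalize(dst, dim=-1).T   # (N, M) cosine sims
    best_sim, best_dst = sim.max(dim=-1)                          # best match per src token
    n_merge = int(merge_ratio * src.shape[0])
    merge_idx = best_sim.topk(n_merge).indices                    # most redundant tokens

    dst = dst.clone()
    keep = torch.ones(src.shape[0], dtype=torch.bool)
    for i in merge_idx.tolist():                                  # average into the match
        j = best_dst[i]
        dst[j] = 0.5 * (dst[j] + src[i])
        keep[i] = False
    return src[keep], dst

src, dst = torch.randn(196, 64), torch.randn(196, 64)
src_kept, dst_new = merge_tokens_across_frames(src, dst)
print(src_kept.shape)                                             # torch.Size([98, 64])
```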
2312.10457 Report Semantic-Aware Autoregressive Image Modeling for Visual Representation Learning Kaiyou Song, Shan Zhang, Tong Wang The development of autoregressive modeling (AM) in computer vision lags behind natural language processing (NLP) in self-supervised pre-training. This is mainly caused by the challenge that images are not sequential signals and lack a natural order when applying autoregressive modeling. In this study, inspired by human beings' way of grasping an image, i.e., focusing on the main object first, we present a semantic-aware autoregressive image modeling (SemAIM) method to tackle this challenge. The key insight of SemAIM is to autoregressive model images from the semantic patches to the less semantic patches. To this end, we first calculate a semantic-aware permutation of patches according to their feature similarities and then perform the autoregression procedure based on the permutation. In addition, considering that the raw pixels of patches are low-level signals and are not ideal prediction targets for learning high-level semantic representation, we also explore utilizing the patch features as the prediction targets. Extensive experiments are conducted on a broad range of downstream tasks, including image classification, object detection, and instance/semantic segmentation, to evaluate the performance of SemAIM. The results demonstrate SemAIM achieves state-of-the-art performance compared with other self-supervised methods. Specifically, with ViT-B, SemAIM achieves 84.1% top-1 accuracy for fine-tuning on ImageNet, 51.3% AP and 45.4% AP for object detection and instance segmentation on COCO, which outperforms the vanilla MAE by 0.5%, 1.0%, and 0.5%, respectively. This paper introduces SemAIM, a semantic-aware autoregressive image modeling method that predicts image patches in a semantically meaningful order (from most to least semantic) derived from patch feature similarities, aiming to mimic human visual understanding. Autoregressive modeling in vision lags behind NLP due to the lack of a natural order for images. This paper addresses this by incorporating semantic understanding into the prediction order, making it more consistent with human perception and improving representation learning. SemAIM calculates a semantic-aware permutation of image patches based on their feature similarities. A parallel encoder-decoder architecture then performs autoregressive modeling, predicting patches in the determined order. The method also explores using pre-trained features as prediction targets instead of raw pixels for learning richer semantic representations. SemAIM significantly outperforms autoregressive methods using raster or stochastic orders, highlighting the importance of semantic-aware prediction. Using pre-trained features (DINO, CLIP) as prediction targets leads to better performance than predicting raw RGB values, indicating the benefit of learning from high-level representations. SemAIM achieves state-of-the-art results on ImageNet classification, COCO object detection/segmentation, and ADE20k semantic segmentation, showcasing its strong representation learning capabilities. The current implementation considers only one 'center' patch for permutation generation, which may not be optimal for images with multiple salient objects. Future work can explore calculating multiple center patches or developing more sophisticated strategies for handling multi-object scenes. autoregressive image modeling, self-supervised learning, vision transformer, semantic representation learning, computer vision
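One way such a semantic-aware permutation could be computed is sketched below (center patch chosen by mean similarity, remaining patches ordered by similarity to it); the paper's exact similarity measure and center selection may differ.

```python
import torch
import torch.nn.functional as F

def semantic_permutation(patch_features: torch.Tensor) -> torch.Tensor:
    """Order patches from most to least 'semantic' using feature similarities.

    patch_features: (N, C) per-patch features. Returns an index permutation of length N.
    """
    f = F.normalize(patch_features, dim=-1)
    sim = f @ f.T                               # (N, N) cosine similarities
    center = sim.mean(dim=-1).argmax()          # patch most similar to all others
    return sim[center].argsort(descending=True) # predict near-center patches first

perm = semantic_permutation(torch.randn(196, 768))   # 14x14 ViT-B patch tokens
print(perm[:5])                                       # first patches to be predicted
```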
2312.10240 Report Rich Human Feedback for Text-to-Image Generation Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, Jiao Sun, Jordi Pont-Tuset, Sarah Young, Feng Yang, Junjie Ke, Krishnamurthy Dj Dvijotham, Katie Collins, Yiwen Luo, Yang Li, Kai J Kohlhoff, Deepak Ramachandran, Vidhya Navalpakkam Recent Text-to-Image (T2I) generation models such as Stable Diffusion and Imagen have made significant progress in generating high-resolution images based on text descriptions. However, many generated images still suffer from issues such as artifacts/implausibility, misalignment with text descriptions, and low aesthetic quality. Inspired by the success of Reinforcement Learning with Human Feedback (RLHF) for large language models, prior works collected human-provided scores as feedback on generated images and trained a reward model to improve the T2I generation. In this paper, we enrich the feedback signal by (i) marking image regions that are implausible or misaligned with the text, and (ii) annotating which words in the text prompt are misrepresented or missing on the image. We collect such rich human feedback on 18K generated images (RichHF-18K) and train a multimodal transformer to predict the rich feedback automatically. We show that the predicted rich human feedback can be leveraged to improve image generation, for example, by selecting high-quality training data to finetune and improve the generative models, or by creating masks with predicted heatmaps to inpaint the problematic regions. Notably, the improvements generalize to models (Muse) beyond those used to generate the images on which human feedback data were collected (Stable Diffusion variants). The RichHF-18K data set will be released in our GitHub repository: https://github.com/google-research/google-research/tree/master/richhf_18k. This paper introduces RichHF-18K, the first rich human feedback dataset for image generation, containing fine-grained scores, implausibility/misalignment regions, and misaligned keywords on 18K generated images. Current T2I evaluation metrics lack interpretability and actionable insights. This work aims to provide a more detailed and explainable understanding of image quality beyond single-score metrics. They collect rich human feedback (scores, marked regions, misaligned keywords) on 18K images. Then, they train a multimodal transformer model, RAHF, to automatically predict this rich feedback. RAHF effectively predicts human annotations for scores, implausibility/misalignment regions, and keywords. Using RAHF scores for finetuning or as guidance improves image generation quality in terms of plausibility and aesthetics. RAHF generalizes well to different generative models (e.g., improving Muse model trained on Stable Diffusion data). Misalignment heatmap prediction is less accurate than implausibility heatmaps, possibly due to annotation noise. Future work includes collecting more diverse data beyond Stable Diffusion and exploring more ways to leverage rich feedback for T2I model improvement. text-to-image generation, human feedback, image quality assessment, multimodal learning, explainable ai
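As one example of the inpainting use case mentioned above, a predicted implausibility heatmap can be thresholded and dilated into a binary mask before being handed to an inpainting model; the threshold and dilation size below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def heatmap_to_inpaint_mask(heatmap: torch.Tensor, threshold: float = 0.5,
                            dilate: int = 7) -> torch.Tensor:
    """Turn a predicted implausibility heatmap into a binary inpainting mask.

    heatmap: (H, W) values in [0, 1] from the feedback-prediction model.
    """
    mask = (heatmap > threshold).float()
    # Slight dilation gives the inpainting model room to blend the repaired region.
    mask = F.max_pool2d(mask[None, None], kernel_size=dilate,
                        stride=1, padding=dilate // 2)[0, 0]
    return mask

mask = heatmap_to_inpaint_mask(torch.rand(512, 512))
print(mask.shape, mask.sum().item())
```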
2312.10144 Report Data-Efficient Multimodal Fusion on a Single GPU Noël Vouitsis, Zhaoyan Liu, Satya Krishna Gorti, Valentin Villecroze, Jesse C. Cresswell, Guangwei Yu, Gabriel Loaiza-Ganem, Maksims Volkovs The goal of multimodal alignment is to learn a single latent space that is shared between multimodal inputs. The most powerful models in this space have been trained using massive datasets of paired inputs and large-scale computational resources, making them prohibitively expensive to train in many practical scenarios. We surmise that existing unimodal encoders pre-trained on large amounts of unimodal data should provide an effective bootstrap to create multimodal models from unimodal ones at much lower costs. We therefore propose FuseMix, a multimodal augmentation scheme that operates on the latent spaces of arbitrary pre-trained unimodal encoders. Using FuseMix for multimodal alignment, we achieve competitive performance -- and in certain cases outperform state-of-the art methods -- in both image-text and audio-text retrieval, with orders of magnitude less compute and data: for example, we outperform CLIP on the Flickr30K text-to-image retrieval task with $\sim \! 600\times$ fewer GPU days and $\sim \! 80\times$ fewer image-text pairs. Additionally, we show how our method can be applied to convert pre-trained text-to-image generative models into audio-to-image ones. Code is available at: https://github.com/layer6ai-labs/fusemix. FuseMix, a computationally and data-efficient multimodal augmentation scheme for aligning latent spaces of pre-trained unimodal encoders. Existing multimodal alignment models are computationally expensive and require massive paired datasets, limiting their practical application. FuseMix leverages pre-trained unimodal encoders and performs mixup on their latent spaces with a shared mixing coefficient, followed by training lightweight adapters to align the augmented latents. FuseMix achieves competitive multimodal alignment, outperforming some state-of-the-art methods in image-text and audio-text retrieval tasks. FuseMix requires significantly less compute and data compared to methods like CLIP. Dataset quality and diversity are crucial for good performance, especially in low-data regimes. Limited by the semantic information learned by the pre-trained unimodal encoders. Future work could explore fine-tuning unimodal encoders during fusion. multimodal fusion, multimodal alignment, data augmentation, contrastive learning, mixup
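The core augmentation is plain mixup applied in the two latent spaces with a shared coefficient and pairing; a minimal sketch, with latent dimensions chosen arbitrarily:

```python
import torch

def fusemix_batch(z_img: torch.Tensor, z_txt: torch.Tensor, alpha: float = 1.0):
    """Mixup applied in the latent spaces of frozen unimodal encoders.

    z_img: (B, D_img) image latents; z_txt: (B, D_txt) paired text latents.
    The SAME mixing coefficient and the SAME permutation are shared across both
    modalities so mixed pairs stay semantically aligned.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(z_img.shape[0])
    z_img_mix = lam * z_img + (1 - lam) * z_img[perm]
    z_txt_mix = lam * z_txt + (1 - lam) * z_txt[perm]
    return z_img_mix, z_txt_mix      # fed to lightweight adapters + contrastive loss

zi_mix, zt_mix = fusemix_batch(torch.randn(32, 1024), torch.randn(32, 768))
print(zi_mix.shape, zt_mix.shape)
```

Because the encoders stay frozen and their latents can be precomputed, only the small adapters see gradients, which is what keeps the whole pipeline within a single GPU.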
2312.10136 Report Gradient-based Parameter Selection for Efficient Fine-Tuning Zhi Zhang, Qizhe Zhang, Zijun Gao, Renrui Zhang, Ekaterina Shutova, Shiji Zhou, Shanghang Zhang With the growing size of pre-trained models, full fine-tuning and storing all the parameters for various downstream tasks is costly and infeasible. In this paper, we propose a new parameter-efficient fine-tuning method, Gradient-based Parameter Selection (GPS), demonstrating that only tuning a few selected parameters from the pre-trained model while keeping the remainder of the model frozen can generate similar or better performance compared with the full model fine-tuning method. Different from the existing popular and state-of-the-art parameter-efficient fine-tuning approaches, our method does not introduce any additional parameters and computational costs during both the training and inference stages. Another advantage is the model-agnostic and non-destructive property, which eliminates the need for any other design specific to a particular model. Compared with the full fine-tuning, GPS achieves 3.33% (91.78% vs. 88.45%, FGVC) and 9.61% (73.1% vs. 65.57%, VTAB) improvement of the accuracy with tuning only 0.36% parameters of the pre-trained model on average over 24 image classification tasks; it also demonstrates a significant improvement of 17% and 16.8% in mDice and mIoU, respectively, on medical image segmentation task. Moreover, GPS achieves state-of-the-art performance compared with existing PEFT methods. This paper introduces GPS, a novel parameter-efficient fine-tuning method that selects and tunes a small subset of parameters from a pre-trained model based on gradient values, achieving comparable or superior performance to full fine-tuning. Full fine-tuning large pre-trained models for various downstream tasks is computationally expensive and infeasible. PEFT methods aim to address this by tuning only a minimal set of parameters while maintaining or improving performance. GPS calculates the gradient of a loss function (SCL) with respect to the model parameters and selects the top-K connections with the highest gradient value for each neuron. During fine-tuning, only the selected parameters are updated using a binary mask. GPS outperforms previous PEFT methods and full fine-tuning on FGVC and VTAB benchmarks, using only 0.36% of parameters on average. The method is model-agnostic, achieving consistent improvements across ViT, Swin Transformer, and ConvNeXt architectures. GPS demonstrates strong data efficiency, achieving good performance even with limited training data (few-shot learning). The method doesn't fully exploit potential parameter sharing across similar downstream tasks. Reliance on a pre-trained model raises concerns about potential biases if the upstream model was trained on biased or harmful data. parameter-efficient fine-tuning, gradient-based parameter selection, sub-network training, vision transformer, few-shot learning
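A minimal sketch of per-neuron top-k gradient selection; the loss used here is a stand-in for the paper's selection loss (SCL), and masking gradients in place is one simple way to keep the unselected weights frozen.

```python
import torch

def topk_gradient_mask(grad: torch.Tensor, k: int = 1) -> torch.Tensor:
    """Keep the k incoming connections with the largest |gradient| per neuron.

    grad: gradient of the selection loss w.r.t. a weight matrix of shape
          (out_features, in_features). Returns a binary mask of the same shape.
    """
    idx = grad.abs().topk(k, dim=1).indices
    mask = torch.zeros_like(grad)
    mask.scatter_(1, idx, 1.0)
    return mask

# During fine-tuning only the selected entries are updated: after backward(),
# zero out the gradients of unselected weights before the optimizer step.
layer = torch.nn.Linear(768, 768)
loss = layer(torch.randn(8, 768)).pow(2).mean()   # stand-in for the selection loss
loss.backward()
mask = topk_gradient_mask(layer.weight.grad, k=2)
layer.weight.grad.mul_(mask)                      # everything else stays frozen
print(int(mask.sum()))                            # 768 neurons * 2 connections = 1536
```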
2312.10120 Report MVHuman: Tailoring 2D Diffusion with Multi-view Sampling For Realistic 3D Human Generation Suyi Jiang, Haimin Luo, Haoran Jiang, Ziyu Wang, Jingyi Yu, Lan Xu Recent months have witnessed rapid progress in 3D generation based on diffusion models. Most advances require fine-tuning existing 2D Stable Diffusions into multi-view settings or tedious distilling operations and hence fall short of 3D human generation due to the lack of diverse 3D human datasets. We present an alternative scheme named MVHuman to generate human radiance fields from text guidance, with consistent multi-view images directly sampled from pre-trained Stable Diffusions without any fine-tuning or distilling. Our core is a multi-view sampling strategy to tailor the denoising processes of the pre-trained network for generating consistent multi-view images. It encompasses view-consistent conditioning, replacing the original noises with ``consistency-guided noises'', optimizing latent codes, as well as utilizing cross-view attention layers. With the multi-view images through the sampling process, we adopt geometry refinement and 3D radiance field generation followed by a subsequent neural blending scheme for free-view rendering. Extensive experiments demonstrate the efficacy of our method, as well as its superiority to state-of-the-art 3D human generation methods. Presents MVHuman, a novel scheme for generating human radiance fields from text guidance using pre-trained 2D diffusion models without fine-tuning or distillation. Addresses limitations of existing 3D human generation methods, such as the reliance on scarce 3D datasets, inefficient optimization, and the presence of artifacts. Employs a multi-view sampling strategy with a pre-trained Stable Diffusion model, including view-consistent conditioning, consistency-guided noise for denoising, optimization of latent codes, and cross-view attention. Generates high-quality human assets with consistent multi-view images directly from text prompts. Outperforms state-of-the-art 3D human generation methods in qualitative and user study evaluations. Enables seamless integration of text-based editing and style transfer with LoRA models from 2D to 3D. Relies on the accuracy of initial mesh and SMPL-X alignment. Limited ability to articulate specific details solely from textual descriptions. 3d human generation, text-to-3d, diffusion models, multi-view consistency, neural radiance fields
2312.10113 Report Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation Qin Guo, Tianwei Lin Recently, diffusion-based methods, like InstructPix2Pix (IP2P), have achieved effective instruction-based image editing, requiring only natural language instructions from the user. However, these methods often inadvertently alter unintended areas and struggle with multi-instruction editing, resulting in compromised outcomes. To address these issues, we introduce the Focus on Your Instruction (FoI), a method designed to ensure precise and harmonious editing across multiple instructions without extra training or test-time optimization. In the FoI, we primarily emphasize two aspects: (1) precisely extracting regions of interest for each instruction and (2) guiding the denoising process to concentrate within these regions of interest. For the first objective, we identify the implicit grounding capability of IP2P from the cross-attention between instruction and image, then develop an effective mask extraction method. For the second objective, we introduce a cross attention modulation module for rough isolation of target editing regions and unrelated regions. Additionally, we introduce a mask-guided disentangle sampling strategy to further ensure clear region isolation. Experimental results demonstrate that FoI surpasses existing methods in both quantitative and qualitative evaluations, especially excelling in multi-instruction editing task. Introduces FoI, a method leveraging the implicit grounding ability of InstructPix2Pix for precise and harmonious multi-instruction image editing without extra training or test-time optimization. Addresses limitations of existing text-guided image editing methods in accurately targeting editing areas, especially for multi-instruction edits, to achieve desired results without unintended modifications. Utilizes IP2P's grounding ability to extract masks for areas of interest, introduces cross-condition attention modulation to focus instructions within their masks, and proposes a mask-guided disentangle sampling strategy to isolate editing regions. Outperforms state-of-the-art methods in qualitative and quantitative evaluations, particularly in multi-instruction editing tasks. Achieves superior results in CLIP image similarity, Dinov2 image similarity, and PickScore, demonstrating fidelity to both original and edited images. Demonstrates robustness in balancing image preservation and instruction execution without requiring precise tuning of guidance scales. Limited ultra-fine editing ability due to the resolution of cross-attention maps. Effectiveness is dependent on the capabilities of the pretrained IP2P model. image editing, diffusion models, text-guided image manipulation, multi-instruction editing, attention mechanisms
2312.10111 Report Plasticine3D: Non-rigid 3D editting with text guidance Yige Chen, Ang Chen, Siyuan Chen, Ran Yi With the help of Score Distillation Sampling(SDS) and the rapid development of various trainable 3D representations, Text-to-Image(T2I) diffusion models have been applied to 3D generation tasks and achieved considerable results. There are also some attempts toward the task of editing 3D objects leveraging this Text-to-3D pipeline. However, most methods currently focus on adding additional geometries, overwriting textures or both. But few of them can perform non-rigid transformation of 3D objects. For those who can perform non-rigid editing, on the other hand, suffer from low-resolution, lack of fidelity and poor flexibility. In order to address these issues, we present: Plasticine3D, a general, high-fidelity, photo-realistic and controllable non-rigid editing pipeline. Firstly, our work divides the editing process into a geometry editing stage and a texture editing stage to achieve more detailed and photo-realistic results ; Secondly, in order to perform non-rigid transformation with controllable results while maintain the fidelity towards original 3D models in the same time, we propose a multi-view-embedding(MVE) optimization strategy to ensure that the diffusion model learns the overall features of the original object and an embedding-fusion(EF) to control the degree of editing by adjusting the value of the fusing rate. We also design a geometry processing step before optimizing on the base geometry to cope with different needs of various editing tasks. Further more, to fully leverage the geometric prior from the original 3D object, we provide an optional replacement of score distillation sampling named score projection sampling(SPS) which enables us to directly perform optimization from the origin 3D mesh in most common median non-rigid editing scenarios. We demonstrate the effectiveness of our method on both the non-rigid 3D editing task and general 3D editing task. Plasticine3D, a novel semantic-driven, photo-realistic, controllable non-rigid 3D editing pipeline that divides the editing process into geometry and appearance stages for detailed results. Addresses limitations in existing 3D editing methods that struggle with non-rigid transformations, especially in preserving original details and offering control over the degree of editing. Utilizes a two-stage geometry-appearance pipeline with multi-view embedding optimization (MVE), embedding fusion (EF), geometry processing, and score projection sampling (SPS) to achieve controllable and high-fidelity non-rigid transformations. Embedding fusion enables control over the degree of editing by interpolating between optimized and target embeddings. Score projection sampling (SPS) enhances median-scale non-rigid editing by leveraging the original geometry as a starting point and guiding the transformation towards the target prompt. Outperforms baseline methods in qualitative and quantitative comparisons, demonstrating superior performance in non-rigid editing tasks while preserving details. Janus problem (e.g., two-headed horse) occurs in some global and median-scale transformations. Fine-tuning the diffusion model is computationally expensive and time-consuming. 3d editing, non-rigid transformation, diffusion models, score distillation sampling, semantic-driven editing
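The embedding-fusion control described above amounts to a convex interpolation between two embeddings; a one-line sketch (tensor shapes are illustrative):

```python
import torch

def fuse_embeddings(e_source: torch.Tensor, e_target: torch.Tensor,
                    fuse_rate: float) -> torch.Tensor:
    """Embedding fusion: interpolate between the embedding optimized on the original
    object and the target (edit) embedding. fuse_rate in [0, 1] controls the degree
    of editing: 0 keeps the original identity, 1 follows the target prompt fully."""
    return (1.0 - fuse_rate) * e_source + fuse_rate * e_target

e_src, e_tgt = torch.randn(77, 768), torch.randn(77, 768)   # hypothetical text embeddings
print(fuse_embeddings(e_src, e_tgt, fuse_rate=0.4).shape)
```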
2312.10103 Report GSVA: Generalized Segmentation via Multimodal Large Language Models Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, Gao Huang Generalized Referring Expression Segmentation (GRES) extends the scope of classic RES to refer to multiple objects in one expression or identify the empty targets absent in the image. GRES poses challenges in modeling the complex spatial relationships of the instances in the image and identifying non-existing referents. Multimodal Large Language Models (MLLMs) have recently shown tremendous progress in these complicated vision-language tasks. Connecting Large Language Models (LLMs) and vision models, MLLMs are proficient in understanding contexts with visual inputs. Among them, LISA, as a representative, adopts a special [SEG] token to prompt a segmentation mask decoder, e.g., SAM, to enable MLLMs in the RES task. However, existing solutions to GRES remain unsatisfactory since current segmentation MLLMs cannot correctly handle the cases where users might reference multiple subjects in a singular prompt or provide descriptions incongruent with any image target. In this paper, we propose Generalized Segmentation Vision Assistant (GSVA) to address this gap. Specifically, GSVA reuses the [SEG] token to prompt the segmentation model towards supporting multiple mask references simultaneously and innovatively learns to generate a [REJ] token to reject the null targets explicitly. Experiments validate GSVA's efficacy in resolving the GRES issue, marking a notable enhancement and setting a new record on the GRES benchmark gRefCOCO dataset. GSVA also proves effective across various classic referring segmentation and comprehension tasks. This paper introduces GSVA, a multimodal large language model that enhances referring expression segmentation by addressing multiple-target and empty-target scenarios in GRES. Existing referring expression segmentation models struggle to segment multiple objects from a single instruction or handle descriptions that don't match any object in the image. This limits their practicality in real-world applications like embodied AI. GSVA leverages the power of MLLMs and introduces two key designs: (1) learning to predict multiple [SEG] tokens to segment multiple targets, and (2) employing [REJ] tokens to reject descriptions of objects absent in the image. GSVA achieves state-of-the-art performance on the GRES benchmark gRefCOCO dataset. GSVA demonstrates strong performance on classic referring segmentation tasks (RefCOCO, RefCOCO+, RefCOCOg) and comprehension tasks. Ablation studies confirm the importance of multiple [SEG] tokens and the [REJ] token for GSVA's performance. The model might misperceive small or unclear objects, leading to incorrect [REJ] predictions. Using higher-resolution vision encoders could further enhance the model's accuracy. referring expression segmentation, generalized referring expression segmentation, multimodal large language models, empty target rejection, multiple target segmentation
2312.10034 Report SlimmeRF: Slimmable Radiance Fields Shiran Yuan, Hao Zhao Neural Radiance Field (NeRF) and its variants have recently emerged as successful methods for novel view synthesis and 3D scene reconstruction. However, most current NeRF models either achieve high accuracy using large model sizes, or achieve high memory-efficiency by trading off accuracy. This limits the applicable scope of any single model, since high-accuracy models might not fit in low-memory devices, and memory-efficient models might not satisfy high-quality requirements. To this end, we present SlimmeRF, a model that allows for instant test-time trade-offs between model size and accuracy through slimming, thus making the model simultaneously suitable for scenarios with different computing budgets. We achieve this through a newly proposed algorithm named Tensorial Rank Incrementation (TRaIn) which increases the rank of the model's tensorial representation gradually during training. We also observe that our model allows for more effective trade-offs in sparse-view scenarios, at times even achieving higher accuracy after being slimmed. We credit this to the fact that erroneous information such as floaters tend to be stored in components corresponding to higher ranks. Our implementation is available at https://github.com/Shiran-Yuan/SlimmeRF. This paper proposes SlimmeRF, a novel method to reduce the number of parameters in neural radiance fields (NeRFs) using a slimmable tensorial representation. Reducing the size of NeRF models is crucial for their application in resource-constrained environments. Existing compression methods often compromise model performance. This paper addresses the need for compact and accurate NeRFs. SlimmeRF utilizes a tensorial representation for the appearance grid in NeRF and introduces a Training in Rank order with Initial Control (TRaIn) algorithm. This algorithm trains components of different tensor ranks sequentially, prioritizing lower ranks, to improve slimmability. SlimmeRF significantly reduces the number of parameters in NeRFs while maintaining comparable or even exceeding the performance of baselines. The method demonstrates strong performance on benchmark datasets, including Synthetic NeRF, Tanks & Temples, and LLFF. The paper provides a theoretical analysis to explain the slimmability achieved by the TRaIn algorithm. The paper notes limitations in controlling the degree of slimming for specific applications. Future work could explore extending the TRaIn algorithm to other components of NeRF, such as the density grid. neural radiance fields, nerf, model compression, tensorial representation, view synthesis
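A toy CP-factorized field makes the slimming knob concrete: dropping higher-rank components shrinks the model at query time. SlimmeRF actually builds on a vector-matrix (TensoRF-style) decomposition, and it is the TRaIn schedule (training low-rank components first) that makes such truncation degrade gracefully; the sketch below only illustrates the truncation idea.

```python
import torch

class CPField(torch.nn.Module):
    """Toy CP-factorized 3D field: value(x,y,z) = sum_r vx_r(x) * vy_r(y) * vz_r(z)."""
    def __init__(self, resolution: int = 128, rank: int = 16):
        super().__init__()
        self.vx = torch.nn.Parameter(torch.randn(rank, resolution) * 0.1)
        self.vy = torch.nn.Parameter(torch.randn(rank, resolution) * 0.1)
        self.vz = torch.nn.Parameter(torch.randn(rank, resolution) * 0.1)

    def query(self, ix, iy, iz, keep_rank=None):
        """Query at integer grid indices, optionally using only the first `keep_rank`
        components -- this is the test-time slimming knob."""
        r = keep_rank or self.vx.shape[0]
        return (self.vx[:r, ix] * self.vy[:r, iy] * self.vz[:r, iz]).sum(dim=0)

field = CPField()
ix = iy = iz = torch.randint(0, 128, (1024,))
full = field.query(ix, iy, iz)                 # full-rank output
slim = field.query(ix, iy, iz, keep_rank=4)    # only 4/16 of the components kept
print(full.shape, slim.shape)
```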
2312.10032 Report Osprey: Pixel Understanding with Visual Instruction Tuning Yuqian Yuan, Wentong Li, Jian Liu, Dongqi Tang, Xinjie Luo, Chi Qin, Lei Zhang, Jianke Zhu Multimodal large language models (MLLMs) have recently achieved impressive general-purpose vision-language capabilities through visual instruction tuning. However, current MLLMs primarily focus on image-level or box-level understanding, falling short in achieving fine-grained vision-language alignment at pixel level. Besides, the lack of mask-based instruction data limits their advancements. In this paper, we propose Osprey, a mask-text instruction tuning approach, to extend MLLMs by incorporating fine-grained mask regions into language instruction, aiming at achieving pixel-wise visual understanding. To achieve this goal, we first meticulously curate a mask-based region-text dataset with 724K samples, and then design a vision-language model by injecting pixel-level representation into LLM. Specifically, Osprey adopts a convolutional CLIP backbone as the vision encoder and employs a mask-aware visual extractor to extract precise visual mask features from high resolution input. Experimental results demonstrate Osprey's superiority in various region understanding tasks, showcasing its new capability for pixel-level instruction tuning. In particular, Osprey can be integrated with Segment Anything Model (SAM) seamlessly to obtain multi-granularity semantics. The source code, dataset and demo can be found at https://github.com/CircleRadon/Osprey. Presents Osprey, a novel approach that integrates pixel-level mask region references into language instructions, enhancing Multimodal Large Language Models (MLLMs) for fine-grained visual understanding. Existing MLLMs struggle with fine-grained visual understanding tasks due to reliance on image-level or box-level understanding, lacking pixel-level alignment between vision and language. Introduces a mask-aware visual extractor to capture precise mask features, employs a convolutional CLIP backbone for high-resolution input, and curates a large-scale mask-based region-text dataset (Osprey-724K) for instruction tuning. Osprey significantly outperforms previous methods on open-vocabulary segmentation, achieving 50.64% PQ, 29.17% AP, and 49.78% mIoU on Cityscapes. Achieves state-of-the-art results on referring object classification, obtaining 65.24% SS and 38.19% S-IoU on LVIS, and 73.06% SS and 52.72% S-IoU on PACO. Demonstrates superior performance in referring description and reasoning tasks on Ferret-Bench and exhibits strong performance on object hallucination benchmark POPE. Computational cost increases significantly with larger input image sizes. Further research on effectively incorporating multi-modal information from various sources is needed. multimodal large language models, fine-grained visual understanding, mask-based instruction tuning, region-based image understanding, open-vocabulary segmentation
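The heart of a mask-aware visual extractor can be approximated by masked average pooling over encoder features; Osprey additionally uses multi-level features and spatial cues, so the sketch below is a simplification with our own names and shapes.

```python
import torch
import torch.nn.functional as F

def mask_pooled_token(feat_map: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Masked average pooling: turn a binary region mask into a single visual token.

    feat_map: (C, H, W) feature map from the vision encoder.
    mask:     (h, w) binary mask for the referred region (any resolution).
    """
    C, H, W = feat_map.shape
    m = F.interpolate(mask[None, None].float(), size=(H, W), mode="nearest")[0, 0]
    m = m / m.sum().clamp(min=1.0)                  # normalize over the region
    return (feat_map * m).flatten(1).sum(dim=1)     # (C,) region token

feat = torch.randn(256, 64, 64)
mask = torch.rand(512, 512) > 0.9                   # hypothetical SAM-style mask
print(mask_pooled_token(feat, mask).shape)          # torch.Size([256])
```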
2312.09767 Report DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models Yifeng Ma, Shiwei Zhang, Jiayu Wang, Xiang Wang, Yingya Zhang, Zhidong Deng Diffusion models have shown remarkable success in a variety of downstream generative tasks, yet remain under-explored in the important and challenging expressive talking head generation. In this work, we propose a DreamTalk framework to fulfill this gap, which employs meticulous design to unlock the potential of diffusion models in generating expressive talking heads. Specifically, DreamTalk consists of three crucial components: a denoising network, a style-aware lip expert, and a style predictor. The diffusion-based denoising network is able to consistently synthesize high-quality audio-driven face motions across diverse expressions. To enhance the expressiveness and accuracy of lip motions, we introduce a style-aware lip expert that can guide lip-sync while being mindful of the speaking styles. To eliminate the need for expression reference video or text, an extra diffusion-based style predictor is utilized to predict the target expression directly from the audio. By this means, DreamTalk can harness powerful diffusion models to generate expressive faces effectively and reduce the reliance on expensive style references. Experimental results demonstrate that DreamTalk is capable of generating photo-realistic talking faces with diverse speaking styles and achieving accurate lip motions, surpassing existing state-of-the-art counterparts. DreamTalk, a novel framework leveraging diffusion models for generating expressive talking heads with diverse speaking styles and minimal reliance on style references. Existing methods struggle to generate high-quality talking heads with diverse and accurate expressions, often relying on laborious style references like videos or text. DreamTalk comprises a denoising network, a style-aware lip expert, and a style predictor. The denoising network generates facial motions conditioned on audio and a style reference. The lip expert ensures lip-sync accuracy across styles. The style predictor infers speaking styles directly from audio and the portrait, eliminating the need for reference videos. Outperforms state-of-the-art methods in quantitative and qualitative evaluations on datasets like MEAD, HDTF, and Voxceleb2, demonstrating superior lip-sync accuracy, visual quality, and style consistency. Exhibits strong generalization capabilities, effectively handling out-of-domain portraits, multilingual speech, noisy audio, and songs. Enables versatile speaking style manipulation through techniques like classifier-free guidance scaling and style code interpolation. Occasional artifacts, particularly around the mouth area during intense expressions, requiring further refinement of teeth generation and exploration of emotion-specific renderers. Lacks temporal awareness of speaking style variations, potentially leading to unnatural expressions at speech boundaries. Future work could focus on dynamically predicting style evolution. talking head generation, diffusion models, expressive synthesis, lip sync, style prediction
2312.09641 Report Ins-HOI: Instance Aware Human-Object Interactions Recovery Jiajun Zhang, Yuxiang Zhang, Hongwen Zhang, Xiao Zhou, Boyao Zhou, Ruizhi Shao, Zonghai Hu, Yebin Liu Accurately modeling detailed interactions between human/hand and object is an appealing yet challenging task. Current multi-view capture systems are only capable of reconstructing multiple subjects into a single, unified mesh, which fails to model the states of each instance individually during interactions. To address this, previous methods use template-based representations to track human/hand and object. However, the quality of the reconstructions is limited by the descriptive capabilities of the templates, so that these methods inherently struggle with geometry details, pressing deformations and invisible contact surfaces. In this work, we propose an end-to-end Instance-aware Human-Object Interactions recovery (Ins-HOI) framework by introducing an instance-level occupancy field representation. However, the real-captured data is presented as a holistic mesh, unable to provide instance-level supervision. To address this, we further propose a complementary training strategy that leverages synthetic data to introduce instance-level shape priors, enabling the disentanglement of occupancy fields for different instances. Specifically, synthetic data, created by randomly combining individual scans of humans/hands and objects, guides the network to learn a coarse prior of instances. Meanwhile, real-captured data helps in learning the overall geometry and restricting interpenetration in contact areas. As demonstrated in experiments, our method Ins-HOI supports instance-level reconstruction and provides reasonable and realistic invisible contact surfaces even in cases of extremely close interaction. To facilitate the research of this task, we collect a large-scale, high-fidelity 3D scan dataset, including 5.2k high-quality scans with real-world human-chair and hand-object interactions. The code and data will be public for research purposes. This paper proposes Ins-HOI, an end-to-end framework for instance-level reconstruction of human/hand-object interactions from sparse-view RGB inputs, modeling intricate geometry and invisible contact surfaces using implicit surface representations. Existing methods for HOI reconstruction often rely on template-based representations, limiting their ability to capture fine-grained geometry and soft deformations caused by contact. This work aims to overcome these limitations by leveraging implicit surface representations and introducing a novel complementary training strategy. Ins-HOI utilizes an instance-level occupancy field to represent human/hand and object separately. It leverages both real-scanned data and synthetic data with instance-level ground truth for complementary training, ensuring both individual shape completeness and overall reconstruction reasonableness. The intersection between predicted instances is penalized during training to ensure plausible contact surfaces. Ins-HOI achieves comparable or superior performance to state-of-the-art methods like PIFu and NeuS2 on holistic reconstruction while uniquely supporting instance-level reconstruction. The method effectively reconstructs invisible contact surfaces with plausible soft deformations, even for challenging cases of close interaction, as demonstrated by low intersection volumes between reconstructed instances. Experiments on unseen object types show that Ins-HOI can generalize well with a small amount of synthetic data fine-tuning. While Ins-HOI produces reasonable contact surface reconstructions, capturing the precise deformations remains a challenge. The method currently requires fine-tuning for novel object types. human-object interaction, hand-object interaction, 3d reconstruction, implicit surface representation, complementary training
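The interpenetration penalty mentioned in the Ins-HOI methodology can be sketched as a loss over query points that both instance occupancy fields claim as inside. The formulation below is a common choice and only an assumed reading of the paper, not its exact loss.

```python
import torch

def interpenetration_penalty(occ_a: torch.Tensor, occ_b: torch.Tensor) -> torch.Tensor:
    """Penalize query points that both instance occupancy fields claim as inside.

    occ_a, occ_b: (N,) occupancy probabilities in [0, 1] for the same N query
    points, predicted for two instances (e.g. human and chair). The penalty is
    positive only where the two "inside" probabilities jointly exceed 1, i.e.
    where the instances overlap; the actual Ins-HOI loss may differ.
    """
    overlap = torch.relu(occ_a + occ_b - 1.0)
    return overlap.mean()

occ_human = torch.rand(1024)
occ_chair = torch.rand(1024)
loss_pen = interpenetration_penalty(occ_human, occ_chair)
```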
2312.09608 Report Faster Diffusion: Rethinking the Role of UNet Encoder in Diffusion Models Senmao Li, Taihang Hu, Fahad Shahbaz Khan, Linxuan Li, Shiqi Yang, Yaxing Wang, Ming-Ming Cheng, Jian Yang One of the key components within diffusion models is the UNet for noise prediction. While several works have explored basic properties of the UNet decoder, its encoder largely remains unexplored. In this work, we conduct the first comprehensive study of the UNet encoder. We empirically analyze the encoder features and provide insights into how they change during the inference process. In particular, we find that encoder features change gently, whereas the decoder features exhibit substantial variations across different time-steps. This finding inspired us to omit the encoder at certain adjacent time-steps and cyclically reuse the encoder features from previous time-steps for the decoder. Based on this observation, we introduce a simple yet effective encoder propagation scheme to accelerate diffusion sampling for a diverse set of tasks. Benefiting from our propagation scheme, we are able to run the decoder in parallel at certain adjacent time-steps. Additionally, we introduce a prior noise injection method to improve the texture details in the generated image. Besides the standard text-to-image task, we also validate our approach on other tasks: text-to-video, personalized generation and reference-guided generation. Without utilizing any knowledge distillation technique, our approach accelerates sampling for both the Stable Diffusion (SD) and DeepFloyd-IF models by 41% and 24% respectively, while maintaining high-quality generation performance. Our code is available at https://github.com/hutaiHang/Faster-Diffusion. This paper presents encoder propagation, a novel method for accelerating diffusion model sampling without knowledge distillation. Diffusion model sampling is computationally expensive due to iterative denoising. This work addresses this issue by reusing encoder features to improve efficiency. The paper empirically analyzes UNet features and finds that encoder features change minimally across time-steps. This observation leads to the proposed encoder propagation scheme, which reuses encoder features from previous time-steps, enabling parallel decoding and significant speedup. Encoder propagation accelerates Stable Diffusion sampling by 41% and DeepFloyd-IF by 24% while maintaining high generation quality. The method is compatible with existing acceleration techniques like DPM-Solver and ToMe. Qualitative and quantitative evaluations on tasks like text-to-video generation, personalized image generation, and reference-guided generation demonstrate the effectiveness of the proposed approach. The method faces challenges in maintaining quality when using very few sampling steps (e.g., 5). Future work could explore adapting the technique for even faster generation with limited sampling steps. diffusion models, image generation, sampling acceleration, encoder propagation, parallel decoding
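A minimal sketch of the encoder-propagation idea: recompute the UNet encoder only at selected key timesteps and reuse the cached features at the steps in between, so those steps run the decoder only. The modules and the sampler update here are toy placeholders, not the released FasterDiffusion code.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Placeholder encoder/decoder pair standing in for a diffusion UNet."""
    def __init__(self, dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, dim, 3, padding=1), nn.SiLU())
        self.decoder = nn.Conv2d(dim, 3, 3, padding=1)

    def encode(self, x):
        return self.encoder(x)

    def decode(self, h):
        return self.decoder(h)

@torch.no_grad()
def sample_with_encoder_propagation(net, x, timesteps, key_steps):
    """Toy denoising loop that recomputes the encoder only at key timesteps.

    At non-key steps the cached encoder features from the most recent key step
    are reused, so only the decoder runs -- the core idea behind the reported
    speed-up (the real method additionally parallelizes the decoder-only steps).
    """
    cached = None
    for t in timesteps:
        if t in key_steps or cached is None:
            cached = net.encode(x)          # full encoder pass at key steps only
        eps = net.decode(cached)            # decoder always runs
        x = x - 0.1 * eps                   # stand-in for a real sampler update
    return x

net = TinyUNet()
x = torch.randn(1, 3, 32, 32)
out = sample_with_encoder_propagation(net, x, timesteps=list(range(50, 0, -1)),
                                      key_steps=set(range(50, 0, -5)))
```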
2312.09579 Report MobileSAMv2: Faster Segment Anything to Everything Chaoning Zhang, Dongshen Han, Sheng Zheng, Jinwoo Choi, Tae-Ho Kim, Choong Seon Hong Segment anything model (SAM) addresses two practical yet challenging segmentation tasks: segment anything (SegAny), which utilizes a certain point to predict the mask for a single object of interest, and segment everything (SegEvery), which predicts the masks for all objects on the image. What makes SegAny slow for SAM is its heavyweight image encoder, which has been addressed by MobileSAM via decoupled knowledge distillation. The efficiency bottleneck of SegEvery with SAM, however, lies in its mask decoder because it needs to first generate numerous masks with redundant grid-search prompts and then perform filtering to obtain the final valid masks. We propose to improve its efficiency by directly generating the final masks with only valid prompts, which can be obtained through object discovery. Our proposed approach not only helps reduce the total time on the mask decoder by at least 16 times but also achieves superior performance. Specifically, our approach yields an average performance boost of 3.6% (42.5% vs. 38.9%) for zero-shot object proposal on the LVIS dataset with the mask AR@K metric. Qualitative results show that our approach generates fine-grained masks while avoiding over-segmenting things. This project targeting faster SegEvery than the original SAM is termed MobileSAMv2 to differentiate from MobileSAM which targets faster SegAny. Moreover, we demonstrate that our new prompt sampling is also compatible with the distilled image encoders in MobileSAM, contributing to a unified framework for efficient SegAny and SegEvery. The code is available at the same link as the MobileSAM Project: https://github.com/ChaoningZhang/MobileSAM. This paper introduces MobileSAMv2, an efficient approach for segmenting everything (SegEvery) in an image, addressing the efficiency bottleneck of the original SAM's mask decoder for this task. The original SAM's SegEvery, while effective, is computationally expensive, particularly in the mask decoding stage, hindering its practical use. MobileSAMv2 replaces SAM's grid-search prompt sampling with an object-aware prompt sampling strategy using YOLOv8 for object detection. This reduces the number of prompts, thereby speeding up mask decoding. MobileSAMv2 significantly improves SegEvery efficiency by at least 16 times compared to SAM. It achieves comparable and even superior performance to SAM on the LVIS dataset for zero-shot object proposal. MobileSAMv2 effectively addresses the over-segmentation issue observed in SAM due to its object-aware prompt sampling. The current implementation relies on object discovery for prompt sampling, which could be further optimized for efficiency. Exploring more powerful distilled image encoders to further reduce computation time without significantly sacrificing performance. image segmentation, segment anything model (sam), object detection, prompt engineering, efficiency
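Object-aware prompt sampling can be sketched as running an off-the-shelf detector and feeding its boxes as prompts to a SAM-style mask decoder instead of a dense grid. The snippet assumes the ultralytics YOLOv8 package and a local image file; the mask-decoding call is left as a placeholder since the exact MobileSAMv2 interface is not reproduced here.

```python
# Object-aware prompt sampling: use YOLOv8 detections as box prompts instead of
# a dense grid (assumes the `ultralytics` package; the SAM mask-decoding call is
# a placeholder rather than the actual MobileSAMv2 API).
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")                     # any YOLOv8 checkpoint
results = detector("image.jpg", conf=0.25)[0]     # run detection on one image

box_prompts = results.boxes.xyxy.cpu().numpy()    # (N, 4) boxes in xyxy format
scores = results.boxes.conf.cpu().numpy()

# Each box becomes one prompt for a SAM-style mask decoder, so the decoder runs
# N times instead of once per grid point (e.g. 32x32 grid prompts):
# masks = [sam_decode(image_embedding, box=b) for b in box_prompts]   # placeholder
```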
2312.09305 Report Stable Score Distillation for High-Quality 3D Generation Boshi Tang, Jianan Wang, Zhiyong Wu, Lei Zhang Although Score Distillation Sampling (SDS) has exhibited remarkable performance in conditional 3D content generation, a comprehensive understanding of its formulation is still lacking, hindering the development of 3D generation. In this work, we decompose SDS as a combination of three functional components, namely mode-seeking, mode-disengaging and variance-reducing terms, analyzing the properties of each. We show that problems such as over-smoothness and implausibility result from the intrinsic deficiency of the first two terms and propose a more advanced variance-reducing term than that introduced by SDS. Based on the analysis, we propose a simple yet effective approach named Stable Score Distillation (SSD) which strategically orchestrates each term for high-quality 3D generation and can be readily incorporated to various 3D generation frameworks and 3D representations. Extensive experiments validate the efficacy of our approach, demonstrating its ability to generate high-fidelity 3D content without succumbing to issues such as over-smoothness. This paper presents Stable Score Distillation (SSD), a novel method for high-quality 3D content generation that leverages a comprehensive understanding of Score Distillation Sampling (SDS). The core contribution is decomposing the SDS estimator into three functional components: mode-disengaging, mode-seeking, and variance-reducing terms, and proposing a strategy to orchestrate them for improved 3D generation. Despite the success of SDS in conditional 3D content generation, a thorough understanding of its formulation was lacking, hindering further development in the field. This paper addresses this gap by dissecting and analyzing SDS, paving the way for more effective and stable 3D generation techniques. The paper analyzes the mathematical and numerical properties of each SDS component, identifying their limitations and strengths under different timestep regimes. It then proposes SSD, which strategically combines these components, leveraging the variance-reduced mode-seeking term for plausibility at low timesteps and the mode-disengaging term for trap escaping at high timesteps. SSD successfully mitigates over-smoothness and implausibility issues prevalent in SDS-based 3D generation. The paper provides theoretical explanations for common observations and practices in 3D generation, such as the use of large CFG scales. Extensive experiments demonstrate SSD's efficacy and compatibility with various 3D generation frameworks and representations, achieving superior results compared to state-of-the-art methods. The paper primarily focuses on single-object generation. Extending the analysis and method to more complex scenes with multiple objects presents an interesting future direction. While SSD effectively reduces over-smoothing, further investigation into alternative strategies for transient mode avoidance could lead to additional improvements. 3d generation, score distillation sampling, diffusion models, text-to-3d, generative modeling
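For context, the decomposition analyzed in this paper starts from the standard classifier-free-guided SDS gradient. The identity below (an exact algebraic split of the residual into three additive terms) is standard; mapping the terms onto the paper's mode-seeking / mode-disengaging / variance-reducing naming is a hedged reading on my part, not a quotation of its definitions.

```latex
% Standard SDS gradient (DreamFusion-style), with rendered image x = g(\theta),
% noised sample x_t, and CFG scale s:
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,\big(\hat\epsilon_\phi(x_t; y, t) - \epsilon\big)\,
    \frac{\partial x}{\partial \theta} \right],
\qquad
\hat\epsilon_\phi = \epsilon_\phi(x_t; t) + s\,\big(\epsilon_\phi(x_t; y, t) - \epsilon_\phi(x_t; t)\big).

% Exact split of the residual into three additive terms:
\hat\epsilon_\phi - \epsilon
  = \underbrace{s\,\big(\epsilon_\phi(x_t; y, t) - \epsilon_\phi(x_t; t)\big)}_{\text{classifier direction}}
  \;+\; \underbrace{\epsilon_\phi(x_t; t)}_{\text{unconditional prediction}}
  \;-\; \underbrace{\epsilon}_{\text{injected noise}}.
```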
2312.09256 Report LIME: Localized Image Editing via Attention Regularization in Diffusion Models Enis Simsar, Alessio Tonioni, Yongqin Xian, Thomas Hofmann, Federico Tombari Diffusion models (DMs) have gained prominence due to their ability to generate high-quality, varied images, with recent advancements in text-to-image generation. The research focus is now shifting towards the controllability of DMs. A significant challenge within this domain is localized editing, where specific areas of an image are modified without affecting the rest of the content. This paper introduces LIME for localized image editing in diffusion models that do not require user-specified regions of interest (RoI) or additional text input. Our method employs features from pre-trained methods and a simple clustering technique to obtain precise semantic segmentation maps. Then, by leveraging cross-attention maps, it refines these segments for localized edits. Finally, we propose a novel cross-attention regularization technique that penalizes unrelated cross-attention scores in the RoI during the denoising steps, ensuring localized edits. Our approach, without re-training and fine-tuning, consistently improves the performance of existing methods in various editing benchmarks. This paper introduces LIME, a localized image editing technique for diffusion models that leverages pre-trained InstructPix2Pix without requiring user-specified regions of interest or additional text input. Localized image editing in diffusion models is a significant challenge due to the intertwined nature of image representations, where changes intended for one area can unintentionally affect others. Existing methods often rely on additional user input, such as masking the target area or providing extra text information, which adds complexity and doesn't guarantee seamless editing. LIME uses features from pre-trained InstructPix2Pix and a simple clustering technique to obtain precise semantic segmentation maps. It then leverages cross-attention maps to refine these segments for localized edits. Finally, it employs a novel cross-attention regularization technique that penalizes unrelated cross-attention scores in the region of interest during denoising steps, ensuring localized edits. LIME consistently improves the performance of existing methods in various editing benchmarks. LIME effectively implements localized edits while preserving the overall scene context, outperforming state-of-the-art models, including their fine-tuned versions on manually annotated datasets. LIME achieves significant improvements on metrics measuring structure and background preservation, indicating precise edits according to instructions while avoiding unintended changes to unaffected regions. LIME may alter the scene's style, particularly in color, due to base model entanglement, though it still significantly improves edits compared to InstructPix2Pix. Prompt content can impact edit quality, as all tokens except start-of-text, stop words, and padding affect the region of interest during editing, leading to feature mixing. image editing, diffusion models, localized editing, attention regularization, semantic segmentation
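The cross-attention regularization can be sketched as suppressing, inside the detected RoI, the pre-softmax attention scores of prompt tokens unrelated to the edit. This is a simplified reading with assumed shapes and names, not LIME's exact implementation.

```python
import torch

def regularize_cross_attention(scores, roi_mask, edit_token_ids, neg=-1e4):
    """Down-weight cross-attention scores of tokens unrelated to the edit
    inside the region of interest (a simplified reading of LIME's idea).

    scores:         (heads, H*W, T) pre-softmax cross-attention logits.
    roi_mask:       (H*W,) boolean mask of pixels inside the RoI.
    edit_token_ids: indices of prompt tokens describing the edit.
    """
    heads, hw, T = scores.shape
    unrelated = torch.ones(T, dtype=torch.bool)
    unrelated[edit_token_ids] = False
    penalty = roi_mask.view(1, hw, 1) & unrelated.view(1, 1, T)   # (1, hw, T)
    scores = scores.masked_fill(penalty, neg)    # suppress unrelated tokens in the RoI
    return scores.softmax(dim=-1)

attn = torch.randn(8, 64 * 64, 77)
roi = torch.zeros(64 * 64, dtype=torch.bool)
roi[:1000] = True
probs = regularize_cross_attention(attn, roi, edit_token_ids=[4, 5])
```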
2312.09252 Report FineControlNet: Fine-level Text Control for Image Generation with Spatially Aligned Text Control Injection Hongsuk Choi, Isaac Kasahara, Selim Engin, Moritz Graule, Nikhil Chavan-Dafle, Volkan Isler Recently introduced ControlNet has the ability to steer the text-driven image generation process with geometric input such as human 2D pose, or edge features. While ControlNet provides control over the geometric form of the instances in the generated image, it lacks the capability to dictate the visual appearance of each instance. We present FineControlNet to provide fine control over each instance's appearance while maintaining the precise pose control capability. Specifically, we develop and demonstrate FineControlNet with geometric control via human pose images and appearance control via instance-level text prompts. The spatial alignment of instance-specific text prompts and 2D poses in latent space enables the fine control capabilities of FineControlNet. We evaluate the performance of FineControlNet with rigorous comparison against state-of-the-art pose-conditioned text-to-image diffusion models. FineControlNet achieves superior performance in generating images that follow the user-provided instance-specific text prompts and poses compared with existing methods. Project webpage: https://samsunglabs.github.io/FineControlNet-project-page FineControlNet allows users to control the appearance and pose of individual instances in a scene, enhancing text-to-image generation with fine-grained control over multiple objects. Existing methods often struggle to generate images with distinct appearances for different instances, leading to visual feature blending or ignoring specific descriptions. FineControlNet addresses this limitation by providing fine-grained control over each instance's appearance while maintaining accurate pose control. FineControlNet spatially aligns instance-level text prompts with corresponding 2D poses in the latent space during the reverse diffusion process. It separates and composes different conditions, leveraging pretrained Stable Diffusion and ControlNet, to generate images conditioned on both text and poses. FineControlNet demonstrates superior performance in generating images that accurately reflect user-provided instance-specific text prompts and poses. Quantitative analysis shows FineControlNet achieves competitive image quality (FID) and pose control accuracy (AP) compared to state-of-the-art baselines. FineControlNet excels in CLIP Identity Observance (CIO) metrics, indicating a higher degree of text-image consistency and distinct identity generation for each instance. FineControlNet may exhibit limitations in handling challenging poses, generating realistic human faces, and ensuring physically plausible scene compositions. Future work could explore enhancing generalization capabilities for extreme variations in instance count, scale, and proximity. text-to-image generation, fine-grained control, instance-level conditioning, diffusion models, controlnet
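The spatial alignment of instance-level prompts can be sketched as compositing per-instance noise predictions with instance masks in latent space, falling back to a global prediction for the background. A simplified sketch with assumed shapes, not FineControlNet's released code.

```python
import torch

def compose_instance_latents(eps_per_instance, instance_masks, eps_global):
    """Spatially compose per-instance noise predictions using instance masks.

    eps_per_instance: (K, C, h, w) noise predictions, one per instance-specific
                      text prompt + pose condition.
    instance_masks:   (K, 1, h, w) soft masks (e.g. dilated pose regions) that
                      sum to <= 1 per pixel.
    eps_global:       (C, h, w) prediction from the global prompt, used where
                      no instance mask is active (background).
    """
    fg = (eps_per_instance * instance_masks).sum(dim=0)                # (C, h, w)
    bg_weight = (1.0 - instance_masks.sum(dim=0)).clamp(min=0.0)       # (1, h, w)
    return fg + bg_weight * eps_global

eps_inst = torch.randn(2, 4, 64, 64)
masks = torch.zeros(2, 1, 64, 64)
masks[0, :, :, :32] = 1.0
masks[1, :, :, 32:] = 1.0
eps_bg = torch.randn(4, 64, 64)
eps = compose_instance_latents(eps_inst, masks, eps_bg)                # (4, 64, 64)
```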
2312.09249 Report ZeroRF: Fast Sparse View 360° Reconstruction with Zero Pretraining Ruoxi Shi, Xinyue Wei, Cheng Wang, Hao Su We present ZeroRF, a novel per-scene optimization method addressing the challenge of sparse view 360° reconstruction in neural field representations. Current breakthroughs like Neural Radiance Fields (NeRF) have demonstrated high-fidelity image synthesis but struggle with sparse input views. Existing methods, such as Generalizable NeRFs and per-scene optimization approaches, face limitations in data dependency, computational cost, and generalization across diverse scenarios. To overcome these challenges, we propose ZeroRF, whose key idea is to integrate a tailored Deep Image Prior into a factorized NeRF representation. Unlike traditional methods, ZeroRF parametrizes feature grids with a neural network generator, enabling efficient sparse view 360° reconstruction without any pretraining or additional regularization. Extensive experiments showcase ZeroRF's versatility and superiority in terms of both quality and speed, achieving state-of-the-art results on benchmark datasets. ZeroRF's significance extends to applications in 3D content generation and editing. Project page: https://sarahweiii.github.io/zerorf/ ZeroRF is a novel per-scene optimization method that integrates a tailored Deep Image Prior into a factorized NeRF representation for fast and high-quality sparse view 360° reconstruction. Existing methods for sparse-view 360° reconstruction struggle with data dependency, high computational cost, and limited generalization across diverse scenarios. They often fail to produce accurate and visually pleasing results due to noisy and distorted features obtained from limited input views. ZeroRF parametrizes feature grids of factorized NeRF representations with a randomly-initialized deep neural network generator. It employs a plain MSE rendering loss and does not require any pretraining or additional regularization. The method leverages the deep prior captured within the generator network's architecture to produce clean and well-structured features even with sparse input views. ZeroRF achieves state-of-the-art results on sparse view benchmarks, outperforming previous methods in terms of PSNR, SSIM, and LPIPS. ZeroRF is significantly faster than existing per-scene optimization approaches, reconstructing objects in as little as 30 seconds for common resolutions in 3D generation tasks. ZeroRF is robust and generalizes well across diverse scenarios, demonstrated by its high-quality reconstructions from both synthetic and real-world datasets. ZeroRF might magnify the limitations of underlying factorized NeRF representations, such as axis-aligned artifacts in TensoRF. Extending ZeroRF to unbounded scenes requires further investigation, as the non-linear contraction of space in grid representations for unbounded scenes leads to distorted features. neural radiance fields, nerf, sparse view reconstruction, deep image prior, 3d reconstruction
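The core trick is to parametrize each feature plane of a factorized radiance field with a randomly initialized generator driven by a fixed noise code, so the deep image prior comes from the architecture alone. A toy sketch follows; sizes and layer choices are assumptions.

```python
import torch
import torch.nn as nn

class PlaneGenerator(nn.Module):
    """Randomly initialized generator mapping a fixed noise code to a 2D feature
    plane (one factor of a factorized radiance field). The deep image prior
    lives in the generator's architecture; no pretraining is used."""
    def __init__(self, z_ch=8, feat_ch=16, size=64):
        super().__init__()
        self.z = nn.Parameter(torch.randn(1, z_ch, size // 4, size // 4),
                              requires_grad=False)          # fixed noise input
        self.net = nn.Sequential(
            nn.Conv2d(z_ch, 64, 3, padding=1), nn.SiLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 64, 3, padding=1), nn.SiLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(64, feat_ch, 3, padding=1),
        )

    def forward(self):
        return self.net(self.z)       # (1, feat_ch, size, size)

gen = PlaneGenerator()
plane = gen()                          # optimized per scene with a plain MSE rendering loss
```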
2312.09246 Report SHAP-EDITOR: Instruction-guided Latent 3D Editing in Seconds Minghao Chen, Junyu Xie, Iro Laina, Andrea Vedaldi We propose a novel feed-forward 3D editing framework called Shap-Editor. Prior research on editing 3D objects primarily concentrated on editing individual objects by leveraging off-the-shelf 2D image editing networks. This is achieved via a process called distillation, which transfers knowledge from the 2D network to 3D assets. Distillation necessitates at least tens of minutes per asset to attain satisfactory editing results, and is thus not very practical. In contrast, we ask whether 3D editing can be carried out directly by a feed-forward network, eschewing test-time optimisation. In particular, we hypothesise that editing can be greatly simplified by first encoding 3D objects in a suitable latent space. We validate this hypothesis by building upon the latent space of Shap-E. We demonstrate that direct 3D editing in this space is possible and efficient by building a feed-forward editor network that only requires approximately one second per edit. Our experiments show that Shap-Editor generalises well to both in-distribution and out-of-distribution 3D assets with different prompts, exhibiting comparable performance with methods that carry out test-time optimisation for each edited instance. This paper introduces Shap-Editor, a novel feed-forward 3D editing framework that performs semantic edits on 3D objects in latent space based on natural language instructions. Existing 3D editing methods rely on time-consuming test-time optimisation, making them impractical for interactive applications. Shap-Editor addresses this limitation by enabling near-instantaneous editing within a learned latent space. Shap-Editor leverages the latent space of a pre-trained 3D auto-encoder (Shap-E) and distills knowledge from multiple 2D image editors using a score distillation sampling loss. It learns a latent editor function that maps a source 3D object's latent code to an edited latent code based on the input instruction. Shap-Editor achieves superior editing results compared to state-of-the-art optimisation-based methods while reducing inference time from minutes to seconds. The learned latent editor exhibits good generalisation capabilities, effectively editing unseen 3D objects and handling compositions of multiple edits. The latent space demonstrates partial linearity, enabling control over the strength of the applied edit through simple arithmetic operations. The quality of Shap-Editor is limited by the expressiveness of the underlying pre-trained 3D auto-encoder and 2D image editors. Although Shap-Editor can learn from multiple instructions, achieving a fully open-ended 3D editor remains an open challenge. 3d editing, latent space, score distillation sampling, text-guided editing, feed-forward network
2312.09242 Report Text2Immersion: Generative Immersive Scene with 3D Gaussians Hao Ouyang, Kathryn Heal, Stephen Lombardi, Tiancheng Sun We introduce Text2Immersion, an elegant method for producing high-quality 3D immersive scenes from text prompts. Our proposed pipeline initiates by progressively generating a Gaussian cloud using pre-trained 2D diffusion and depth estimation models. This is followed by a refining stage on the Gaussian cloud, interpolating and refining it to enhance the details of the generated scene. Distinct from prevalent methods that focus on single object or indoor scenes, or employ zoom-out trajectories, our approach generates diverse scenes with various objects, even extending to the creation of imaginary scenes. Consequently, Text2Immersion can have wide-ranging implications for various applications such as virtual reality, game development, and automated content creation. Extensive evaluations demonstrate that our system surpasses other methods in rendering quality and diversity, further progressing towards text-driven 3D scene generation. We will make the source code publicly accessible at the project page. Presents Text2Immersion, a method for generating high-quality 3D immersive scenes from text prompts using 3D Gaussians. Addresses limitations in existing text-to-3D methods that struggle with scene generation, limited diversity, and slow rendering speeds. Two-stage pipeline: 1) Initialization of 3D Gaussian cloud from anchor views using diffusion models and depth estimation. 2) Refinement of the Gaussian cloud via inpainting and super-resolution using additional generated views. Generates high-fidelity, diverse, and immersive 3D scenes from text prompts. Outperforms existing methods in terms of rendering quality, diversity, and alignment with text prompts. Achieves real-time rendering speeds (180 FPS on a 3070 laptop GPU). Reliance on monocular depth estimation can lead to visual artifacts if estimations are inaccurate. Inpainting new objects during refinement may cause ghosting effects. text-to-3d, 3d scene generation, 3d gaussian splatting, diffusion models, immersive environments
2312.09237 Report Pixel Aligned Language Models Jiarui Xu, Xingyi Zhou, Shen Yan, Xiuye Gu, Anurag Arnab, Chen Sun, Xiaolong Wang, Cordelia Schmid Large language models have achieved great success in recent years, so as their variants in vision. Existing vision-language models can describe images in natural languages, answer visual-related questions, or perform complex reasoning about the image. However, it is yet unclear how localization tasks, such as word grounding or referring localization, can be performed using large language models. In this work, we aim to develop a vision-language model that can take locations, for example, a set of points or boxes, as either inputs or outputs. When taking locations as inputs, the model performs location-conditioned captioning, which generates captions for the indicated object or region. When generating locations as outputs, our model regresses pixel coordinates for each output word generated by the language model, and thus performs dense word grounding. Our model is pre-trained on the Localized Narrative dataset, which contains pixel-word-aligned captioning from human attention. We show our model can be applied to various location-aware vision-language tasks, including referring localization, location-conditioned captioning, and dense object captioning, archiving state-of-the-art performance on RefCOCO and Visual Genome. Project page: https://jerryxu.net/PixelLLM . Introduces PixelLLM, a vision-language model that generates captions and aligns each word to a pixel location, enabling localization capabilities within LLMs. Addresses the lack of fine-grained localization abilities in existing vision-language models, allowing for spatial understanding and reasoning in LLMs. Leverages a novel architecture with a prompt feature extractor to condition image features on location prompts and adds a parallel MLP layer to the language model for per-token location regression. Trains on the Localized Narrative dataset with synchronized caption-location annotations. Achieves state-of-the-art performance on RefCOCO referring localization and segmentation. Outperforms previous methods on dense object captioning and location-conditioned captioning tasks. Demonstrates the effectiveness of the per-token localization formulation, especially with dense supervision from the Localized Narrative dataset. Limited evaluation on the less explored task of controlled trace generation. Reliance on the quality and noise levels within the Localized Narrative dataset's mouse trace annotations. vision-language models, localization, referring expression comprehension, dense captioning, large language models
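The per-token localization can be sketched as a small MLP applied in parallel to the language model's hidden states, regressing a normalized (x, y) location for every generated word. Hidden size and layer choices below are assumptions, not PixelLLM's exact head.

```python
import torch
import torch.nn as nn

class PerTokenLocationHead(nn.Module):
    """Small MLP applied to each language-model hidden state to regress a
    normalized (x, y) pixel location for that token, alongside the usual
    vocabulary head (a simplified stand-in for PixelLLM's parallel head)."""
    def __init__(self, hidden_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, 2), nn.Sigmoid(),   # (x, y) in [0, 1]
        )

    def forward(self, hidden_states):                 # (B, T, hidden_dim)
        return self.mlp(hidden_states)                # (B, T, 2)

head = PerTokenLocationHead()
h = torch.randn(2, 12, 768)                           # hidden states for 12 tokens
xy = head(h)                                          # per-word locations; trainable with e.g. an L1 loss
```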
2312.09228 Report 3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting Zhiyin Qian, Shaofei Wang, Marko Mihajlovic, Andreas Geiger, Siyu Tang We introduce an approach that creates animatable human avatars from monocular videos using 3D Gaussian Splatting (3DGS). Existing methods based on neural radiance fields (NeRFs) achieve high-quality novel-view/novel-pose image synthesis but often require days of training, and are extremely slow at inference time. Recently, the community has explored fast grid structures for efficient training of clothed avatars. Albeit being extremely fast at training, these methods can barely achieve an interactive rendering frame rate with around 15 FPS. In this paper, we use 3D Gaussian Splatting and learn a non-rigid deformation network to reconstruct animatable clothed human avatars that can be trained within 30 minutes and rendered at real-time frame rates (50+ FPS). Given the explicit nature of our representation, we further introduce as-isometric-as-possible regularizations on both the Gaussian mean vectors and the covariance matrices, enhancing the generalization of our model on highly articulated unseen poses. Experimental results show that our method achieves comparable and even better performance compared to state-of-the-art approaches on animatable avatar creation from a monocular input, while being 400x and 250x faster in training and inference, respectively. This paper presents 3DGS-Avatar, a novel method for creating animatable human avatars from monocular videos using 3D Gaussian Splatting (3DGS). Existing NeRF-based methods for avatar creation are computationally expensive and slow in training and inference, making them impractical for real-time applications. This paper aims to address this limitation by leveraging the efficiency of 3DGS. The proposed method leverages 3DGS and learns a non-rigid deformation network to reconstruct animatable clothed human avatars. It decomposes human deformation into non-rigid (pose-dependent cloth deformation) and rigid (skeleton-controlled) components. The approach uses a small MLP for color decoding, accounting for local deformations and dynamic lighting. As-isometric-as-possible regularizations are applied to Gaussian mean vectors and covariance matrices to enhance generalization to unseen poses. The method achieves comparable or better performance than state-of-the-art approaches in animatable avatar creation from monocular inputs. It achieves significantly faster training (400x) and inference (250x) speeds compared to the most competitive baseline (HumanNeRF). The approach effectively generalizes to unseen poses and preserves finer details compared to other methods. The training time, while significantly improved, still doesn't match the fastest grid-based methods. The method may produce blurry results in areas with high-frequency textures. 3d gaussian splatting, animatable avatars, monocular reconstruction, neural rendering, real-time rendering
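The as-isometric-as-possible regularization on Gaussian means can be sketched as asking the non-rigid deformation to preserve distances between neighboring Gaussians; the covariance counterpart mentioned in the summary is omitted here, and shapes and the neighbor set are assumptions.

```python
import torch

def isometry_loss(means_canonical, means_deformed, knn_idx):
    """As-isometric-as-possible regularization on Gaussian means: distances
    between neighboring Gaussians should be preserved by the deformation.

    means_canonical: (N, 3) Gaussian centers in canonical space.
    means_deformed:  (N, 3) centers after the non-rigid deformation network.
    knn_idx:         (N, K) indices of each Gaussian's K nearest neighbors.
    """
    d_can = (means_canonical.unsqueeze(1) - means_canonical[knn_idx]).norm(dim=-1)  # (N, K)
    d_def = (means_deformed.unsqueeze(1) - means_deformed[knn_idx]).norm(dim=-1)    # (N, K)
    return (d_can - d_def).abs().mean()

mc = torch.randn(1000, 3)
md = mc + 0.01 * torch.randn(1000, 3)
knn = torch.randint(0, 1000, (1000, 5))
loss_iso = isometry_loss(mc, md, knn)
```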
2312.09222 Report Mosaic-SDF for 3D Generative Models Lior Yariv, Omri Puny, Natalia Neverova, Oran Gafni, Yaron Lipman Current diffusion or flow-based generative models for 3D shapes divide to two: distilling pre-trained 2D image diffusion models, and training directly on 3D shapes. When training a diffusion or flow models on 3D shapes a crucial design choice is the shape representation. An effective shape representation needs to adhere three design principles: it should allow an efficient conversion of large 3D datasets to the representation form; it should provide a good tradeoff of approximation power versus number of parameters; and it should have a simple tensorial form that is compatible with existing powerful neural architectures. While standard 3D shape representations such as volumetric grids and point clouds do not adhere to all these principles simultaneously, we advocate in this paper a new representation that does. We introduce Mosaic-SDF (M-SDF): a simple 3D shape representation that approximates the Signed Distance Function (SDF) of a given shape by using a set of local grids spread near the shape's boundary. The M-SDF representation is fast to compute for each shape individually making it readily parallelizable; it is parameter efficient as it only covers the space around the shape's boundary; and it has a simple matrix form, compatible with Transformer-based architectures. We demonstrate the efficacy of the M-SDF representation by using it to train a 3D generative flow model including class-conditioned generation with the 3D Warehouse dataset, and text-to-3D generation using a dataset of about 600k caption-shape pairs. This paper introduces Mosaic-SDF (M-SDF), a novel 3D shape representation for training generative models, which approximates the Signed Distance Function (SDF) using a set of local grids near the shape's boundary. An effective 3D shape representation for generative models should be efficiently computable for large datasets, parameter efficient, and compatible with modern neural architectures. Existing representations often lack one or more of these properties. M-SDF represents a shape as a set of local grids, each with a center, scale, and grid values sampled from the shape's SDF. This representation is trained using a permutation-equivariant Flow Matching model on two datasets: ShapeNetCore-V2 and a dataset of shapes with text descriptions. M-SDF achieves superior surface approximation per parameter budget compared to Implicit Neural Representations (INRs) while requiring significantly less computation time. M-SDF outperforms or achieves comparable results to state-of-the-art methods in class-conditional 3D shape generation on ShapeNetCore-V2, as measured by various metrics including Fréchet PointNet++ Distance (FPD), Coverage (COV), and 1-Nearest Neighbor Accuracy (1-NNA). Qualitative results demonstrate that M-SDF generates high-fidelity shapes with sharper details compared to baselines, which often produce overly smooth results. The current M-SDF representation only encodes the SDF and lacks texture or color information. The simple linear layer connecting local grids to the transformer could be replaced with more sophisticated architectures like convolutional layers or autoencoders. 3d shape representation, generative models, signed distance function, flow matching, mosaic-sdf
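The M-SDF layout can be sketched as a matrix with one row per local grid, concatenating its center, scale, and flattened SDF samples. The toy below uses an analytic sphere SDF and a naive way of placing grid centers on the surface, purely for illustration.

```python
import numpy as np

def sphere_sdf(p, radius=0.5):
    """Analytic SDF of a sphere, standing in for a real shape's SDF."""
    return np.linalg.norm(p, axis=-1) - radius

def build_mosaic_sdf(sdf_fn, n_grids=64, grid_res=7, scale=0.15):
    """Assemble a toy Mosaic-SDF matrix: one row per local grid, holding its
    center (3), scale (1), and grid_res^3 SDF samples taken around the center."""
    centers = np.random.uniform(-1.0, 1.0, size=(n_grids, 3))
    # Project the random centers onto the zero level set (exact for the sphere,
    # using its analytic normal p/|p|); a real pipeline would spread the centers
    # over the shape boundary more carefully.
    d = sdf_fn(centers)[:, None]
    centers = centers - d * centers / np.linalg.norm(centers, axis=-1, keepdims=True)

    lin = np.linspace(-1.0, 1.0, grid_res)
    offsets = np.stack(np.meshgrid(lin, lin, lin, indexing="ij"), axis=-1).reshape(-1, 3)
    grids = np.stack([sdf_fn(c + scale * offsets) for c in centers])        # (n_grids, grid_res^3)
    return np.concatenate([centers, np.full((n_grids, 1), scale), grids], axis=1)

msdf = build_mosaic_sdf(sphere_sdf)   # shape (64, 3 + 1 + 343): a simple matrix form for set/transformer models
```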
2312.09158 Report General Object Foundation Model for Images and Videos at Scale Junfeng Wu, Yi Jiang, Qihao Liu, Zehuan Yuan, Xiang Bai, Song Bai We present GLEE in this work, an object-level foundation model for locating and identifying objects in images and videos. Through a unified framework, GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in the open world scenario for various object perception tasks. Adopting a cohesive learning strategy, GLEE acquires knowledge from diverse data sources with varying supervision levels to formulate general object representations, excelling in zero-shot transfer to new data and tasks. Specifically, we employ an image encoder, text encoder, and visual prompter to handle multi-modal inputs, enabling it to simultaneously solve various object-centric downstream tasks while maintaining state-of-the-art performance. Demonstrated through extensive training on over five million images from diverse benchmarks, GLEE exhibits remarkable versatility and improved generalization performance, efficiently tackling downstream tasks without the need for task-specific adaptation. By integrating large volumes of automatically labeled data, we further enhance its zero-shot generalization capabilities. Additionally, GLEE is capable of being integrated into Large Language Models, serving as a foundational model to provide universal object-level information for multi-modal tasks. We hope that the versatility and universality of our method will mark a significant step in the development of efficient visual foundation models for AGI systems. The model and code will be released at https://glee-vision.github.io. GLEE is a novel object-level foundation model for locating and identifying objects in images and videos, achieving detection, segmentation, tracking, grounding, and identification in an open-world setting. Existing visual foundation models often focus on global image-level understanding, lacking the crucial ability to locate and identify individual objects. GLEE addresses this limitation by providing general and accurate object-level information. GLEE utilizes a unified framework with an image encoder, text encoder, visual prompter, and object decoder. This enables multi-modal input handling and simultaneous solving of various object-centric tasks. Trained on over five million images with multi-granularity joint supervision, it excels in zero-shot transfer to new data and tasks. GLEE achieves state-of-the-art performance on various object-level image tasks, including detection, referring expression comprehension, and open-world detection. It exhibits remarkable zero-shot generalization capabilities in large-vocabulary open-world video tracking tasks, surpassing existing models. Integrating automatically annotated data (SA1B, GRIT) enhances GLEE's zero-shot generalization and allows scaling to 10 million training images. While GLEE excels in zero-shot transfer, it might benefit from fine-tuning for tasks heavily reliant on temporal consistency, like OVIS. Further improvements can be achieved by incorporating a larger and more diverse set of captioned data for enhanced text comprehension. foundation models, object detection, instance segmentation, object tracking, zero-shot learning
2312.09147 Report Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers Zi-Xin Zou, Zhipeng Yu, Yuan-Chen Guo, Yangguang Li, Ding Liang, Yan-Pei Cao, Song-Hai Zhang Recent advancements in 3D reconstruction from single images have been driven by the evolution of generative models. Prominent among these are methods based on Score Distillation Sampling (SDS) and the adaptation of diffusion models in the 3D domain. Despite their progress, these techniques often face limitations due to slow optimization or rendering processes, leading to extensive training and optimization times. In this paper, we introduce a novel approach for single-view reconstruction that efficiently generates a 3D model from a single image via feed-forward inference. Our method utilizes two transformer-based networks, namely a point decoder and a triplane decoder, to reconstruct 3D objects using a hybrid Triplane-Gaussian intermediate representation. This hybrid representation strikes a balance, achieving a faster rendering speed compared to implicit representations while simultaneously delivering superior rendering quality than explicit representations. The point decoder is designed for generating point clouds from single images, offering an explicit representation which is then utilized by the triplane decoder to query Gaussian features for each point. This design choice addresses the challenges associated with directly regressing explicit 3D Gaussian attributes characterized by their non-structural nature. Subsequently, the 3D Gaussians are decoded by an MLP to enable rapid rendering through splatting. Both decoders are built upon a scalable, transformer-based architecture and have been efficiently trained on large-scale 3D datasets. The evaluations conducted on both synthetic datasets and real-world images demonstrate that our method not only achieves higher quality but also ensures a faster runtime in comparison to previous state-of-the-art techniques. Please see our project page at https://zouzx.github.io/TriplaneGaussian/. Introduces TriplaneGaussian, a novel approach for fast 3D object reconstruction from single-view images using a hybrid Triplane-Gaussian representation. Addresses limitations of existing methods that suffer from slow optimization or rendering processes, hindering fast 3D content creation. Employs two transformer-based networks: a point decoder to generate a coarse point cloud and a triplane decoder to output implicit triplane features. Leverages projection-aware conditioning and geometry-aware encoding for improved reconstruction and novel view synthesis. Achieves higher quality geometry reconstruction than Point-E, Shap-E, and One-2-3-45 on the GSO dataset. Outperforms Zero-1-2-3 and One-2-3-45 in novel view synthesis, demonstrating higher consistency and detail. Significantly faster in both reconstruction and rendering compared to baseline methods due to its feed-forward architecture and efficient rasterization. Rendering quality is dependent on the accuracy of the initial point cloud prediction. Backside rendering tends to be blurry due to the non-probabilistic nature of the model. 3d reconstruction, single-view reconstruction, gaussian splatting, transformer, novel view synthesis
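The triplane query used to decorate each predicted point with Gaussian features is the standard one: project the point onto the XY/XZ/YZ planes, bilinearly sample each plane, and aggregate. A sketch with assumed channel counts; the subsequent MLP decoding of Gaussian attributes (opacity, scale, rotation, colors) is not shown.

```python
import torch
import torch.nn.functional as F

def query_triplane(planes, points):
    """Sample triplane features at 3D point locations.

    planes: (3, C, H, W) feature planes for the XY, XZ and YZ planes.
    points: (N, 3) coordinates in [-1, 1]^3.
    Returns (N, C) features by summing the three bilinear samples -- the
    standard triplane query used before decoding per-point Gaussian attributes.
    """
    coords = torch.stack([points[:, [0, 1]],       # XY plane
                          points[:, [0, 2]],       # XZ plane
                          points[:, [1, 2]]])      # YZ plane -> (3, N, 2)
    grid = coords.unsqueeze(2)                     # (3, N, 1, 2) for grid_sample
    feats = F.grid_sample(planes, grid, mode="bilinear", align_corners=True)  # (3, C, N, 1)
    return feats.squeeze(-1).sum(dim=0).transpose(0, 1)                       # (N, C)

planes = torch.randn(3, 32, 128, 128)
pts = torch.rand(4096, 3) * 2 - 1
point_feats = query_triplane(planes, pts)          # (4096, 32)
```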
2312.09138 Report Living Scenes: Multi-object Relocalization and Reconstruction in Changing 3D Environments Liyuan Zhu, Shengyu Huang, Konrad Schindler, Iro Armeni Research into dynamic 3D scene understanding has primarily focused on short-term change tracking from dense observations, while little attention has been paid to long-term changes with sparse observations. We address this gap with MoRE, a novel approach for multi-object relocalization and reconstruction in evolving environments. We view these environments as "living scenes" and consider the problem of transforming scans taken at different points in time into a 3D reconstruction of the object instances, whose accuracy and completeness increase over time. At the core of our method lies an SE(3)-equivariant representation in a single encoder-decoder network, trained on synthetic data. This representation enables us to seamlessly tackle instance matching, registration, and reconstruction. We also introduce a joint optimization algorithm that facilitates the accumulation of point clouds originating from the same instance across multiple scans taken at different points in time. We validate our method on synthetic and real-world data and demonstrate state-of-the-art performance in both end-to-end performance and individual subtasks. Introduces MORE, a novel method for multi-object relocalization and reconstruction in evolving 3D environments over long time spans and from sparse observations. Addresses the gap in research focusing on long-term changes with sparse observations in dynamic 3D scene understanding. This is important for applications that benefit from an integrated understanding of the environment accumulated over time. Uses a single encoder-decoder network with an SE(3)-equivariant representation to tackle instance matching, registration, and reconstruction. Introduces a joint optimization algorithm to refine registration and reconstruction, accumulating point clouds from different scans for improved accuracy and completeness. Achieves state-of-the-art performance in multi-object relocalization and reconstruction on synthetic and real-world datasets. Demonstrates the effectiveness of the joint optimization algorithm in improving geometric accuracy and completeness over time. Shows robustness to noisy and incomplete instance segmentation masks. Test-time optimizations prevent real-time end-to-end execution. Faces challenges with multiple identical, similar, or symmetric shapes in the scene. 3d scene understanding, multi-object relocalization, point cloud registration, 3d reconstruction, se(3)-equivariant networks
2312.09128 Report Tokenize Anything via Prompting Ting Pan, Lulu Tang, Xinlong Wang, Shiguang Shan We present a unified, promptable model capable of simultaneously segmenting, recognizing, and captioning anything. Unlike SAM, we aim to build a versatile region representation in the wild via visual prompting. To achieve this, we train a generalizable model with massive segmentation masks, e.g., SA-1B masks, and semantic priors from a pre-trained CLIP model with 5 billion parameters. Specifically, we construct a promptable image decoder by adding a semantic token to each mask token. The semantic token is responsible for learning the semantic priors in a predefined concept space. Through joint optimization of segmentation on mask tokens and concept prediction on semantic tokens, our model exhibits strong regional recognition and localization capabilities. For example, an additional 38M-parameter causal text decoder trained from scratch sets a new record with a CIDEr score of 150.7 on the Visual Genome region captioning task. We believe this model can be a versatile region-level image tokenizer, capable of encoding general-purpose region context for a broad range of perception tasks. Code and models are available at https://github.com/baaivision/tokenize-anything. This paper introduces TAP, a unified and promptable model that simultaneously performs segmentation, recognition, and captioning of any given region in an image. This is important because it moves towards a single, versatile vision model capable of diverse perception tasks with strong zero-shot generalization. The authors achieve this by combining the segmentation capabilities of SAM with the semantic understanding of CLIP. They pre-train TAP on a new dataset, SemanticSA-1B, which integrates web-scale semantics from LAION-2B into the segmentation masks of SA-1B. This allows TAP to learn both pixel-level localization and region-level semantic understanding. TAP exhibits strong zero-shot instance classification performance, achieving 59.0 AP on LVIS. TAP achieves competitive zero-shot segmentation performance compared to SAM, indicating that the added semantic understanding does not compromise its segmentation abilities. TAP sets a new record on the Visual Genome region captioning task with a CIDEr score of 150.7, demonstrating its capability to understand and generate language descriptions of visual regions. The model is currently limited by the human-curated label space used during training, falling short of true open-world learning. The text decoder is fine-tuned on a limited region captioning dataset, potentially restricting its scalability and capacity for complex visual-language understanding. vision foundation model, promptable segmentation, region recognition, image captioning, zero-shot learning
2312.09109 Report VideoLCM: Video Latent Consistency Model Xiang Wang, Shiwei Zhang, Han Zhang, Yu Liu, Yingya Zhang, Changxin Gao, Nong Sang Consistency models have demonstrated powerful capability in efficient image generation and allowed synthesis within a few sampling steps, alleviating the high computational cost in diffusion models. However, the consistency model in the more challenging and resource-consuming video generation is still less explored. In this report, we present the VideoLCM framework to fill this gap, which leverages the concept of consistency models from image generation to efficiently synthesize videos with minimal steps while maintaining high quality. VideoLCM builds upon existing latent video diffusion models and incorporates consistency distillation techniques for training the latent consistency model. Experimental results reveal the effectiveness of our VideoLCM in terms of computational efficiency, fidelity and temporal consistency. Notably, VideoLCM achieves high-fidelity and smooth video synthesis with only four sampling steps, showcasing the potential for real-time synthesis. We hope that VideoLCM can serve as a simple yet effective baseline for subsequent research. The source code and models will be publicly available. Introduces VideoLCM, a framework extending latent consistency models to video generation for efficient, high-quality synthesis with minimal steps. Addresses the high computational cost and numerous sampling steps required by diffusion models for video generation, hindering real-time applications. Leverages pretrained latent video diffusion models and consistency distillation to train a video latent consistency model. Employs DDIM as the ODE solver and incorporates classifier-free guidance during distillation. Achieves high-fidelity video synthesis with only 4-6 sampling steps, significantly reducing computational cost compared to ~50 steps in previous methods. Demonstrates effectiveness for both text-to-video generation and compositional video synthesis tasks (e.g., depth-to-video, sketch-to-video). Exhibits improved time efficiency, particularly for high-resolution videos, compared to baseline diffusion models. Relies on a strong teacher diffusion model for distillation, potentially limiting performance when training data is unavailable or from different domains. While significantly reducing steps, real-time video generation remains unachieved, motivating further research on faster algorithms without sacrificing quality. video generation, consistency model, diffusion model, text-to-video, compositional video synthesis
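VideoLCM builds on consistency distillation; a simplified single training step is sketched below, with toy networks and a placeholder ODE step standing in for the frozen, CFG-augmented video diffusion teacher and the DDIM solver.

```python
import copy
import torch
import torch.nn as nn

def consistency_distillation_step(student, ema_student, teacher_ode_step,
                                  x_tn1, t_n1, t_n, cond):
    """One consistency-distillation update (simplified).

    student(x, t, cond) predicts the clean sample; teacher_ode_step performs a
    single DDIM/ODE step of the frozen teacher from t_{n+1} to t_n. The target
    is the EMA student's prediction at the teacher's output, so the student
    learns to map any point on the ODE trajectory to the same clean sample.
    """
    with torch.no_grad():
        x_tn = teacher_ode_step(x_tn1, t_n1, t_n, cond)     # frozen teacher step
        target = ema_student(x_tn, t_n, cond)               # self-consistency target
    pred = student(x_tn1, t_n1, cond)
    return nn.functional.mse_loss(pred, target)

# Toy stand-ins so the snippet runs end to end.
class ToyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(8, 8)

    def forward(self, x, t, cond):
        return self.lin(x)

student = ToyNet()
ema_student = copy.deepcopy(student)
teacher = lambda x, t1, t0, c: x * 0.98                     # placeholder ODE step
loss = consistency_distillation_step(student, ema_student, teacher,
                                     torch.randn(4, 8), 0.9, 0.8, cond=None)
```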
2312.09069 Report PI3D: Efficient Text-to-3D Generation with Pseudo-Image Diffusion Ying-Tian Liu, Yuan-Chen Guo, Guan Luo, Heyi Sun, Wei Yin, Song-Hai Zhang Diffusion models trained on large-scale text-image datasets have demonstrated a strong capability of controllable high-quality image generation from arbitrary text prompts. However, the generation quality and generalization ability of 3D diffusion models is hindered by the scarcity of high-quality and large-scale 3D datasets. In this paper, we present PI3D, a framework that fully leverages the pre-trained text-to-image diffusion models' ability to generate high-quality 3D shapes from text prompts in minutes. The core idea is to connect the 2D and 3D domains by representing a 3D shape as a set of Pseudo RGB Images. We fine-tune an existing text-to-image diffusion model to produce such pseudo-images using a small number of text-3D pairs. Surprisingly, we find that it can already generate meaningful and consistent 3D shapes given complex text descriptions. We further take the generated shapes as the starting point for a lightweight iterative refinement using score distillation sampling to achieve high-quality generation under a low budget. PI3D generates a single 3D shape from text in only 3 minutes and the quality is validated to outperform existing 3D generative models by a large margin. PI3D, a framework that leverages pre-trained text-to-image diffusion models for high-quality 3D shape generation from text prompts. Addresses the challenge of limited availability of high-quality, large-scale 3D datasets, which hinders the development of robust 3D generative models. Represents 3D shapes as sets of pseudo RGB images derived from triplane representations. Fine-tunes a pre-trained text-to-image diffusion model on these pseudo-images and real images for enhanced generalization. Utilizes score distillation sampling for lightweight refinement of generated 3D shapes. Generates 3D shapes from text prompts within minutes. Exhibits superior visual quality, 3D consistency, and generation speed compared to existing text-to-3D methods. Demonstrates improved generalization ability by leveraging knowledge from both 2D and 3D data. Triplane fitting during training incurs linear cost with dataset size. Representational capacity limited by feature dimensions, posing challenges for detailed 3D generation. text-to-3d, diffusion models, triplane representation, score distillation sampling, 3d generative models
2312.08892 Report VaLID: Variable-Length Input Diffusion for Novel View Synthesis Shijie Li, Farhad G. Zanjani, Haitam Ben Yahia, Yuki M. Asano, Juergen Gall, Amirhossein Habibian Novel View Synthesis (NVS), which tries to produce a realistic image at the target view given source view images and their corresponding poses, is a fundamental problem in 3D Vision. As this task is heavily under-constrained, some recent work, like Zero123, tries to solve this problem with generative modeling, specifically using pre-trained diffusion models. Although this strategy generalizes well to new scenes, compared to neural radiance field-based methods, it offers low levels of flexibility. For example, it can only accept a single-view image as input, despite realistic applications often offering multiple input images. This is because the source-view images and corresponding poses are processed separately and injected into the model at different stages. Thus it is not trivial to generalize the model to multi-view source images once they are available. To solve this issue, we process each pose-image pair separately and then fuse them into a unified visual representation that is injected into the model to guide image synthesis at the target views. However, inconsistency and computation costs increase as the number of input source-view images increases. To solve these issues, the Multi-view Cross Former module is proposed, which maps variable-length input data to fixed-size output data. A two-stage training strategy is introduced to further improve the efficiency during training time. Qualitative and quantitative evaluation over multiple datasets demonstrates the effectiveness of the proposed method against previous approaches. The code will be released upon acceptance. VaLID, a diffusion-based novel view synthesis model, is proposed, enabling variable-length multi-view image fusion in both training and inference. Existing diffusion-based NVS methods are limited to single-view input, hindering flexibility in real-world applications where multiple source images are often available. VaLID employs an appearance-pose-entanglement conditioning strategy with a Multi-view Cross Former module to process variable-length input tokens into a fixed-size representation for efficient and consistent novel view generation. A two-stage training strategy enhances efficiency. VaLID surpasses previous state-of-the-art methods quantitatively and qualitatively on GSO and RTMV datasets, even with single-view input. Performance improves with increasing input source images, showcasing the model's ability to leverage multi-view information. The token sampling strategy in training and inference improves efficiency without significant performance loss. The fixed number of learnable tokens in Multi-view Cross Former may limit information collection when input tokens are excessive. Future work could explore incorporating geometric priors or text prompts for enhanced performance. novel view synthesis, diffusion models, multi-view fusion, vision transformer, cross attention
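The Multi-view Cross Former's variable-length-to-fixed-size mapping can be sketched as cross-attention from a fixed set of learnable query tokens to however many per-view tokens are available (a Perceiver-style resampler); the real module's details are not reproduced here.

```python
import torch
import torch.nn as nn

class MultiViewCrossFormer(nn.Module):
    """Map a variable-length set of per-view tokens to a fixed-size set of
    output tokens via cross-attention with learnable queries (a Perceiver-style
    resampler, used here as a simplified stand-in for VaLID's module)."""
    def __init__(self, dim=256, n_out_tokens=64, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, n_out_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, view_tokens):                    # (B, N_total, dim), N_total varies
        q = self.queries.expand(view_tokens.shape[0], -1, -1)
        fused, _ = self.attn(q, view_tokens, view_tokens)
        return self.norm(fused)                        # (B, n_out_tokens, dim), fixed size

former = MultiViewCrossFormer()
tokens_2_views = torch.randn(1, 2 * 196, 256)          # e.g. 2 source views x 196 tokens each
tokens_5_views = torch.randn(1, 5 * 196, 256)
out_a = former(tokens_2_views)                         # both produce (1, 64, 256)
out_b = former(tokens_5_views)
```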
2312.08889 Report SEEAvatar: Photorealistic Text-to-3D Avatar Generation with Constrained Geometry and Appearance Yuanyou Xu, Zongxin Yang, Yi Yang Powered by large-scale text-to-image generation models, text-to-3D avatar generation has made promising progress. However, most methods fail to produce photorealistic results, limited by imprecise geometry and low-quality appearance. Towards more practical avatar generation, we present SEEAvatar, a method for generating photorealistic 3D avatars from text with SElf-Evolving constraints for decoupled geometry and appearance. For geometry, we propose to constrain the optimized avatar in a decent global shape with a template avatar. The template avatar is initialized with a human prior and can be updated by the optimized avatar periodically as an evolving template, which enables more flexible shape generation. Besides, the geometry is also constrained by the static human prior in local parts like the face and hands to maintain the delicate structures. For appearance generation, we use a diffusion model enhanced by prompt engineering to guide a physically based rendering pipeline to generate realistic textures. A lightness constraint is applied to the albedo texture to suppress incorrect lighting effects. Experiments show that our method outperforms previous methods on both global and local geometry and appearance quality by a large margin. Since our method can produce high-quality meshes and textures, such assets can be directly applied in a classic graphics pipeline for realistic rendering under any lighting condition. Project page at: https://yoxu515.github.io/SEEAvatar/. Presents SEEAvatar, a method for generating photorealistic 3D avatars from text descriptions using self-evolving constraints for decoupled geometry and appearance. Existing methods struggle to create photorealistic 3D avatars from text due to limitations in generating precise geometry and high-quality appearance, hindering applications in VR, gaming, and film. Leverages DMTet for shape representation, guided by a 2D diffusion model with self-evolving constraints from a template avatar. Employs a physically based rendering pipeline with diffusion model guidance and lightness constraints for realistic texture generation. Outperforms previous methods in generating high-quality avatars with accurate global shapes, fine local structures, and detailed textures. Generates decoupled meshes and textures, enabling integration with classic graphics pipelines for rendering and editing. Demonstrates flexibility in editing avatar geometry and appearance through text prompts. Struggles to represent highly detailed structures like hair strands, loose clothing, and complex accessories. Appearance generation, while improved, still exhibits limitations in accurate roughness values and occasional lighting artifacts. text-to-3d, avatar generation, photorealistic rendering, diffusion models, self-evolving constraints
2312.08887 Report SpeedUpNet: A Plug-and-Play Hyper-Network for Accelerating Text-to-Image Diffusion Models Weilong Chai, DanDan Zheng, Jiajiong Cao, Zhiquan Chen, Changbao Wang, Chenguang Ma Text-to-image diffusion models (SD) exhibit significant advancements while requiring extensive computational resources. Though many acceleration methods have been proposed, they suffer from generation quality degradation or extra training costs when generalizing to new fine-tuned models. To address these limitations, we propose a novel and universal Stable-Diffusion (SD) acceleration module called SpeedUpNet (SUN). SUN can be directly plugged into various fine-tuned SD models without extra training. This technique utilizes cross-attention layers to learn the relative offsets in generated images between negative and positive prompts, achieving classifier-free guidance distillation with controllable negative prompts, and introduces a Multi-Step Consistency (MSC) loss to ensure a harmonious balance between reducing inference steps and maintaining consistency in the generated output. Consequently, SUN significantly reduces the number of inference steps to just 4 steps and eliminates the need for classifier-free guidance. It leads to an overall speedup of more than 10 times for SD models compared to the state-of-the-art 25-step DPM-solver++, and offers two extra advantages: (1) classifier-free guidance distillation with controllable negative prompts and (2) seamless integration into various fine-tuned Stable-Diffusion models without training. The effectiveness of SUN has been verified through extensive experimentation. Project Page: https://williechai.github.io/speedup-plugin-for-stable-diffusions.github.io The paper proposes SpeedUpNet (SUN), a universal Stable-Diffusion acceleration module that reduces inference steps to 4 while maintaining quality and diversity in generated images. Existing diffusion model acceleration techniques often degrade generation quality or require retraining for new models, limiting their practical use. SUN aims to address these limitations. SUN utilizes a teacher-student distillation framework with a SUN adapter containing cross-attention modules. It learns relative offsets between negative and positive prompts and introduces a Multi-Step Consistency (MSC) loss to maintain output consistency. SUN achieves a 10x speedup compared to the 25-step DPM-solver++. SUN seamlessly integrates into various fine-tuned SD models without retraining. SUN demonstrates controllable classifier-free guidance distillation by learning from various negative prompts. The paper primarily focuses on Stable-Diffusion models and may require adaptation for other architectures. Further research could explore extending SUN's efficiency and applicability to a wider range of generative tasks. diffusion models, text-to-image generation, model acceleration, classifier-free guidance, knowledge distillation
2312.08885 Report SceneWiz3D: Towards Text-guided 3D Scene Composition Qihang Zhang, Chaoyang Wang, Aliaksandr Siarohin, Peiye Zhuang, Yinghao Xu, Ceyuan Yang, Dahua Lin, Bolei Zhou, Sergey Tulyakov, Hsin-Ying Lee We are witnessing significant breakthroughs in the technology for generating 3D objects from text. Existing approaches either leverage large text-to-image models to optimize a 3D representation or train 3D generators on object-centric datasets. Generating entire scenes, however, remains very challenging as a scene contains multiple 3D objects, diverse and scattered. In this work, we introduce SceneWiz3D, a novel approach to synthesize high-fidelity 3D scenes from text. We marry the locality of objects with the globality of scenes by introducing a hybrid 3D representation: explicit for objects and implicit for scenes. Remarkably, an object, being represented explicitly, can be either generated from text using conventional text-to-3D approaches, or provided by users. To configure the layout of the scene and automatically place objects, we apply the Particle Swarm Optimization technique during the optimization process. Furthermore, it is difficult for certain parts of the scene (e.g., corners, occluded regions) to receive multi-view supervision, leading to inferior geometry. We incorporate an RGBD panorama diffusion model to mitigate it, resulting in high-quality geometry. Extensive evaluation supports that our approach achieves superior quality over previous approaches, enabling the generation of detailed and view-consistent 3D scenes. Introduces SceneWiz3D, a novel approach to synthesize high-fidelity 3D scenes from text using a hybrid explicit-implicit representation and Particle Swarm Optimization. Generating entire 3D scenes from text is crucial for immersive experiences but challenging due to the complexity of multiple objects and layouts. Combines DMTets (explicit) for objects of interest and NeRF (implicit) for the environment; uses PSO to optimize object placement based on CLIP similarity; leverages RGBD panorama diffusion for enhanced geometry. Achieves state-of-the-art performance in text-to-3D scene generation. Successfully synthesizes detailed and view-consistent scenes across various styles and layouts. Effectively mitigates the multi-face (Janus) problem often found in scene generation. Shares limitations with SDS-based methods like long optimization time and potential color saturation. Scene configuration optimization is limited by CLIP's capabilities for fine-grained manipulation. text-to-3d, scene synthesis, hybrid representation, particle swarm optimization, rgbd panorama diffusion
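The layout search step can be illustrated with a generic Particle Swarm Optimization loop that scores candidate object placements by a CLIP image-text similarity. Here `render_and_clip_score` is a stand-in for the paper's render-and-score step, and all hyperparameters are illustrative.

```python
# Minimal PSO sketch for scene-layout search, assuming a scoring function
# that renders a candidate layout and returns CLIP(image, prompt) similarity.
import numpy as np

def render_and_clip_score(layout):
    # Placeholder: the real system would place objects according to `layout`
    # (e.g., per-object translation/rotation/scale), render views, and return
    # the CLIP similarity to the text prompt.
    return -np.sum((layout - 0.5) ** 2)  # toy objective with optimum at 0.5

def pso(dim, n_particles=32, iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=(n_particles, dim))   # positions
    v = np.zeros_like(x)                                  # velocities
    pbest = x.copy()
    pbest_val = np.array([render_and_clip_score(p) for p in x])
    gbest = pbest[np.argmax(pbest_val)].copy()
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, 0.0, 1.0)
        vals = np.array([render_and_clip_score(p) for p in x])
        improved = vals > pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        gbest = pbest[np.argmax(pbest_val)].copy()
    return gbest

best_layout = pso(dim=6)   # e.g., 2 objects x (x, y, yaw) in normalized coords
print(best_layout)
```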
2312.08883 Report EditGuard: Versatile Image Watermarking for Tamper Localization and Copyright Protection Xuanyu Zhang, Runyi Li, Jiwen Yu, Youmin Xu, Weiqi Li, Jian Zhang In the era where AI-generated content (AIGC) models can produce stunning and lifelike images, the lingering shadow of unauthorized reproductions and malicious tampering poses imminent threats to copyright integrity and information security. Current image watermarking methods, while widely accepted for safeguarding visual content, can only protect copyright and ensure traceability. They fall short in localizing increasingly realistic image tampering, potentially leading to trust crises, privacy violations, and legal disputes. To solve this challenge, we propose an innovative proactive forensics framework EditGuard, to unify copyright protection and tamper-agnostic localization, especially for AIGC-based editing methods. It can offer a meticulous embedding of imperceptible watermarks and precise decoding of tampered areas and copyright information. Leveraging our observed fragility and locality of image-into-image steganography, the realization of EditGuard can be converted into a united image-bit steganography issue, thus completely decoupling the training process from the tampering types. Extensive experiments demonstrate that our EditGuard balances the tamper localization accuracy, copyright recovery precision, and generalizability to various AIGC-based tampering methods, especially for image forgery that is difficult for the naked eye to detect. The project page is available at https://xuanyuzhang21.github.io/project/editguard/. EditGuard, a proactive forensics framework, is presented to unify copyright protection and tamper-agnostic localization, particularly for AIGC-based editing methods. The rise of AIGC models necessitates robust tools to combat unauthorized reproduction and malicious tampering, protecting copyright integrity and information security. EditGuard embeds dual invisible watermarks (localization and copyright) into images. Leveraging the fragility and locality of I2I steganography, tamper localization is converted into a united image-bit steganography issue, decoupling training from specific tampering types. EditGuard achieves over 95% localization precision and nearly 100% copyright accuracy, outperforming state-of-the-art methods. It demonstrates superior generalizability to various AIGC-based tampering methods, including those producing visually imperceptible forgeries. The framework shows robustness to common image degradations like noise and compression. Future work includes optimizing localization watermark selection and extending EditGuard to other modalities like video and 3D scenes. Exploring end-to-end optimization for learning optimal localization watermarks and reducing information capacity for enhanced robustness are key areas of interest. proactive forensics, tamper localization, copyright protection, image watermarking, ai-generated content (aigc)
2312.08882 Report Neural Video Fields Editing Shuzhou Yang, Chong Mou, Jiwen Yu, Yuhan Wang, Xiandong Meng, Jian Zhang Diffusion models have revolutionized text-driven video editing. However, applying these methods to real-world editing encounters two significant challenges: (1) the rapid increase in GPU memory demand as the number of frames grows, and (2) the inter-frame inconsistency in edited videos. To this end, we propose NVEdit, a novel text-driven video editing framework designed to mitigate memory overhead and improve consistent editing for real-world long videos. Specifically, we construct a neural video field, powered by tri-plane and sparse grid, to enable encoding long videos with hundreds of frames in a memory-efficient manner. Next, we update the video field through off-the-shelf Text-to-Image (T2I) models to impart text-driven editing effects. A progressive optimization strategy is developed to preserve original temporal priors. Importantly, both the neural video field and T2I model are adaptable and replaceable, thus inspiring future research. Experiments demonstrate the ability of our approach to edit hundreds of frames with impressive inter-frame consistency. Our project is available at: https://nvedit.github.io/. Presents NVEdit, a memory-efficient video editing framework that leverages neural video fields and off-the-shelf image processing techniques like Instruct-Pix2Pix+ (an enhanced version of Instruct-Pix2Pix). Addresses challenges in existing text-driven video editing methods related to GPU memory constraints and inter-frame inconsistency, particularly for long videos. Employs a two-stage process: 1) **Video Fitting:** Constructs a Neural Video Field (NVF) to capture temporal and content priors of a video efficiently. 2) **Field Editing:** Updates the NVF using a T2I model (primarily IP2P+) to impart text-driven edits while preserving temporal consistency through progressive optimization. Achieves state-of-the-art temporal consistency and editing accuracy compared to existing methods. Demonstrates efficient memory usage, enabling editing of long videos with hundreds of frames. Showcases versatility by supporting various editing tasks like shape modification, scene changes, style transfer, and frame interpolation. Temporal priors might still be affected during the field editing stage. Editing long videos can be time-consuming due to the increased number of frames requiring iterative optimization. video editing, neural video field, text-to-image, instruct-pix2pix, temporal consistency
2312.08880 Report GenDet: Towards Good Generalizations for AI-Generated Image Detection Mingjian Zhu, Hanting Chen, Mouxiao Huang, Wei Li, Hailin Hu, Jie Hu, Yunhe Wang The misuse of AI imagery can have harmful societal effects, prompting the creation of detectors to combat issues like the spread of fake news. Existing methods can effectively detect images generated by seen generators, but it is challenging to detect those generated by unseen generators. They do not concentrate on amplifying the output discrepancy when detectors process real versus fake images. This results in a close output distribution of real and fake samples, increasing classification difficulty in detecting unseen generators. This paper addresses the unseen-generator detection problem by considering this task from the perspective of anomaly detection and proposes an adversarial teacher-student discrepancy-aware framework. Our method encourages smaller output discrepancies between the student and the teacher models for real images while aiming for larger discrepancies for fake images. We employ adversarial learning to train a feature augmenter, which promotes smaller discrepancies between teacher and student networks when the inputs are fake images. Our method has achieved state-of-the-art on public benchmarks, and the visualization results show that a large output discrepancy is maintained when faced with various types of generators. This paper proposes GenDet, an adversarial teacher-student discrepancy-aware framework for detecting AI-generated images, particularly those from unseen generators. Detecting AI-generated images is crucial to combat misinformation and harmful societal effects, especially as these images become increasingly realistic. GenDet uses a teacher-student framework with discrepancy-aware learning to differentiate real and fake images. It also employs a feature augmenter trained via adversarial learning to enhance generalization to unseen generators. GenDet outperforms state-of-the-art methods on UniversalFakeDetect and GenImage datasets, showing significant improvements in average accuracy and mAP. The framework effectively handles degraded image classification tasks like low resolution and compression. Cross-dataset evaluation demonstrates GenDet's strong generalization ability even with large domain gaps. The method relies on a pre-trained feature extractor (CLIP), potentially limiting its applicability to domains not well-represented in CLIP's training data. Further research can explore alternative feature augmentation techniques and architectures to enhance robustness. ai-generated image detection, anomaly detection, teacher-student learning, adversarial learning, generalization
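One way to picture the discrepancy-aware objective is a frozen teacher and a trainable student whose features are pulled together on real images and pushed apart (up to a margin) on fakes. The margin form, head architectures, and dimensions below are assumptions, not the paper's exact loss.

```python
# Sketch of a discrepancy-aware teacher-student objective: pull student
# features toward the (frozen) teacher on real images, push them apart on
# fake images. The hinge/margin formulation is an illustrative assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128)).eval()
student = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))
for p in teacher.parameters():
    p.requires_grad_(False)

def discrepancy_loss(feats, is_fake, margin=1.0):
    # feats: (B, 512) features from a frozen extractor (e.g., a CLIP image encoder)
    with torch.no_grad():
        t = teacher(feats)
    s = student(feats)
    d = F.mse_loss(s, t, reduction="none").mean(dim=1)   # per-sample discrepancy
    real_term = d[~is_fake].mean() if (~is_fake).any() else feats.new_zeros(())
    # Hinge: only penalize fakes whose discrepancy is still below the margin.
    fake_term = F.relu(margin - d[is_fake]).mean() if is_fake.any() else feats.new_zeros(())
    return real_term + fake_term

feats = torch.randn(8, 512)
is_fake = torch.tensor([0, 1, 0, 1, 1, 0, 0, 1], dtype=torch.bool)
loss = discrepancy_loss(feats, is_fake)
loss.backward()   # gradients flow only into the student
```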
2312.08874 Report Agent Attention: On the Integration of Softmax and Linear Attention Dongchen Han, Tianzhu Ye, Yizeng Han, Zhuofan Xia, Shiji Song, Gao Huang The attention module is the key component in Transformers. While the global attention mechanism offers high expressiveness, its excessive computational cost restricts its applicability in various scenarios. In this paper, we propose a novel attention paradigm, Agent Attention, to strike a favorable balance between computational efficiency and representation power. Specifically, the Agent Attention, denoted as a quadruple $(Q, A, K, V)$, introduces an additional set of agent tokens $A$ into the conventional attention module. The agent tokens first act as the agent for the query tokens $Q$ to aggregate information from $K$ and $V$, and then broadcast the information back to $Q$. Given that the number of agent tokens can be designed to be much smaller than the number of query tokens, the agent attention is significantly more efficient than the widely adopted Softmax attention, while preserving global context modelling capability. Interestingly, we show that the proposed agent attention is equivalent to a generalized form of linear attention. Therefore, agent attention seamlessly integrates the powerful Softmax attention and the highly efficient linear attention. Extensive experiments demonstrate the effectiveness of agent attention with various vision Transformers and across diverse vision tasks, including image classification, object detection, semantic segmentation and image generation. Notably, agent attention has shown remarkable performance in high-resolution scenarios, owing to its linear attention nature. For instance, when applied to Stable Diffusion, our agent attention accelerates generation and substantially enhances image generation quality without any additional training. Code is available at https://github.com/LeapLabTHU/Agent-Attention. This paper proposes Agent Attention, a novel attention mechanism for Vision Transformers that balances computational efficiency and representation power by introducing agent tokens to aggregate and broadcast global information. The widely used Softmax attention in Transformers incurs high computational cost, limiting its applicability in vision tasks, while existing efficient attention mechanisms often compromise long-range modeling capabilities. Agent Attention introduces a set of agent tokens to the conventional attention triplet (Q, K, V), forming a quadruplet (Q, A, K, V). Agent tokens first aggregate information from keys and values and then broadcast it to query tokens, effectively integrating Softmax and linear attention. Agent Attention significantly improves performance across various vision tasks, including image classification, object detection, and semantic segmentation. The method excels in high-resolution scenarios, demonstrating the advantage of its global receptive field. Applied to Stable Diffusion, Agent Attention accelerates generation and enhances image quality without requiring additional training. The paper primarily explores pooling for obtaining agent tokens, leaving room for investigating more advanced techniques. Future work includes exploring the application of Agent Attention to video modeling and multi-modal foundation models. vision transformer, attention mechanism, agent attention, linear attention, image generation
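The two-step aggregate-then-broadcast computation is easy to sketch for a single head, with agent tokens obtained by pooling the queries (the simple default the paper also explores). Shapes and the number of agents are illustrative.

```python
# Sketch of agent attention for a single head: agent tokens first aggregate
# from (K, V), then broadcast back to the queries, giving linear complexity
# in the sequence length. Agent tokens here come from adaptive pooling of Q.
import math
import torch
import torch.nn.functional as F

def agent_attention(q, k, v, num_agents=49):
    # q, k, v: (B, N, d) single-head tensors
    b, n, d = q.shape
    # Pool queries down to a small set of agent tokens: (B, num_agents, d)
    a = F.adaptive_avg_pool1d(q.transpose(1, 2), num_agents).transpose(1, 2)
    scale = 1.0 / math.sqrt(d)
    # 1) Agent aggregation: agents attend to all keys/values (num_agents x N).
    agent_v = torch.softmax(a @ k.transpose(1, 2) * scale, dim=-1) @ v
    # 2) Agent broadcast: queries attend to the agents only (N x num_agents).
    out = torch.softmax(q @ a.transpose(1, 2) * scale, dim=-1) @ agent_v
    return out   # (B, N, d), computed in O(N * num_agents * d)

q = torch.randn(2, 196, 64)
k = torch.randn(2, 196, 64)
v = torch.randn(2, 196, 64)
print(agent_attention(q, k, v).shape)   # torch.Size([2, 196, 64])
```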
2312.08873 Report Diffusion Cocktail: Fused Generation from Diffusion Models Haoming Liu, Yuanhe Guo, Shengjie Wang, Hongyi Wen Diffusion models excel at generating high-quality images and are easy to extend, making them extremely popular among active users who have created an extensive collection of diffusion models with various styles by fine-tuning base models such as Stable Diffusion. Recent work has focused on uncovering semantic and visual information encoded in various components of a diffusion model, enabling better generation quality and more fine-grained control. However, those methods target improving a single model and overlook the vastly available collection of fine-tuned diffusion models. In this work, we study the combinations of diffusion models. We propose Diffusion Cocktail (Ditail), a training-free method that can accurately transfer content information between two diffusion models. This allows us to perform diverse generations using a set of diffusion models, resulting in novel images that are unlikely to be obtained by a single model alone. We also explore utilizing Ditail for style transfer, with the target style set by a diffusion model instead of an image. Ditail offers a more detailed manipulation of the diffusion generation, thereby enabling the vast community to integrate various styles and contents seamlessly and generate any content of any style. Presents Diffusion Cocktail (Ditail), a training-free method for transferring content information between two diffusion models (DMs) for novel image generation and style transfer. Addresses the limitation of existing methods that focus on improving single DMs and overlooks the vast collection of fine-tuned DMs, enabling diverse image generation by leveraging existing DM resources. Injects latent representations from a source DM into specific layers of a target DM during the diffusion process, enabling style transfer with a target style defined by a DM. Achieves high-quality style transfer between DMs, generating novel images by combining content and style information from different models. Enables style transfer of real images with the target style defined by a DM, allowing users to leverage diverse styles from fine-tuned DMs. Offers fine-grained control over the generation process through parameters like guidance strength and regional injection masks. The effect of the negative prompt guidance parameter (beta) is case-sensitive and may not always yield significant results. Editing prompts that change the number of objects often lead to unsatisfactory results due to strong structure emphasis. diffusion models, style transfer, image generation, content injection, deep learning
2312.08872 Report The Lottery Ticket Hypothesis in Denoising: Towards Semantic-Driven Initialization Jiafeng Mao, Xueting Wang, Kiyoharu Aizawa Text-to-image diffusion models allow users control over the content of generated images. Still, text-to-image generation occasionally leads to generation failure, requiring users to generate dozens of images under the same text prompt before they obtain a satisfying result. We formulate the lottery ticket hypothesis in denoising: randomly initialized Gaussian noise images contain special pixel blocks (winning tickets) that naturally tend to be denoised into specific content independently. The generation failure in standard text-to-image synthesis is caused by the gap between the optimal and actual spatial distribution of winning tickets in initial noisy images. To this end, we implement semantic-driven initial image construction, creating initial noise from known winning tickets for each concept mentioned in the prompt. We conduct a series of experiments that verify the properties of winning tickets and demonstrate their generalizability across images and prompts. Our results show that aggregating winning tickets into the initial noise image effectively induces the model to generate the specified object at the corresponding location. This paper discovers and verifies the "Lottery Ticket Hypothesis in Denoising," revealing that specific pixel blocks within the initial noise images of diffusion models have inherent predispositions towards generating specific content, and introduces a semantic-driven initial image construction method leveraging these "winning tickets." This discovery provides new insights into the denoising process in text-to-image diffusion models and offers a novel approach to enhance control over generated imagery, addressing the limitation of existing methods that primarily focus on refining the generation process rather than manipulating the initial noise. The authors leverage the cross-attention mechanism in diffusion models to identify "winning tickets" - pixel blocks exhibiting high cross-attention values for specific concepts. They construct a collection of these winning tickets and utilize them to create semantically-driven initial images, guiding the model to generate specific content at desired locations. Diffusion models demonstrate tolerance for non-Gaussian initial images constructed using winning tickets, producing high-quality images with effective content control. Winning tickets exhibit versatility across different prompts and images, maintaining their generative tendencies even when transferred between them. Combining semantic-driven initialization with existing layout-to-image synthesis methods significantly enhances their control effectiveness. The winning ticket selection method based solely on category names may lead to unintended generation biases (e.g., color). The constructed initial images may deviate significantly from the normal distribution, potentially compromising the quality of generated images, especially when controlling large-sized objects. diffusion models, text-to-image generation, lottery ticket hypothesis, semantic-driven initialization, cross-attention mechanism
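Semantic-driven initialization amounts to pasting stored winning-ticket noise blocks into a fresh Gaussian latent at user-chosen positions before sampling. The ticket bank and layout below are hypothetical placeholders used only to show the mechanics.

```python
# Sketch of semantic-driven initial-noise construction: paste stored
# "winning ticket" noise blocks (latent patches known to denoise into a
# given concept) into a fresh Gaussian latent at user-chosen locations.
import torch

latent = torch.randn(1, 4, 64, 64)           # standard SD-style initial latent

# Hypothetical bank of winning tickets, e.g. harvested by thresholding
# cross-attention maps of earlier generations: concept -> (4, h, w) block.
ticket_bank = {
    "dog": torch.randn(4, 24, 24),
    "ball": torch.randn(4, 16, 16),
}
# Desired layout in latent coordinates: concept -> (top, left).
layout = {"dog": (8, 6), "ball": (34, 40)}

for concept, (top, left) in layout.items():
    block = ticket_bank[concept]
    _, h, w = block.shape
    latent[0, :, top:top + h, left:left + w] = block   # place the ticket

# `latent` would now be passed to the sampler in place of pure Gaussian noise,
# nudging the model to generate each concept at its assigned location.
print(latent.shape)
```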
2312.08870 Report Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens Fan Ma, Xiaojie Jin, Heng Wang, Yuchen Xian, Jiashi Feng, Yi Yang Recent advances in large video-language models have displayed promising outcomes in video comprehension. Current approaches straightforwardly convert video into language tokens and employ large language models for multi-modal tasks. However, this method often leads to the generation of irrelevant content, commonly known as "hallucination", as the length of the text increases and the impact of the video diminishes. To address this problem, we propose Vista-LLaMA, a novel framework that maintains a consistent distance between all visual tokens and any language tokens, irrespective of the generated text length. Vista-LLaMA omits relative position encoding when determining attention weights between visual and text tokens, while retaining the position encoding between text tokens. This amplifies the effect of visual tokens on text generation, especially when the relative distance between visual and text tokens is long. The proposed attention mechanism significantly reduces the chance of producing text irrelevant to the video content. Furthermore, we present a sequential visual projector that projects the current video frame into tokens of the language space with the assistance of the previous frame. This approach not only captures the temporal relationship within the video, but also allows fewer visual tokens to encompass the entire video. Our approach significantly outperforms various previous methods (e.g., Video-ChatGPT, MovieChat) on four challenging open-ended video question answering benchmarks. We reach an accuracy of 60.7 on the zero-shot NExT-QA and 60.5 on the zero-shot MSRVTT-QA, setting a new state-of-the-art performance. This project is available at https://jinxxian.github.io/Vista-LLaMA. This paper introduces Vista-LLaMA, a novel video-language model that enhances video understanding and temporal modeling within large language models (LLMs) for reliable video narration. Existing methods for video comprehension often generate irrelevant content ('hallucination') as text length increases due to diminishing visual impact and lack of explicit temporal modeling. Vista-LLaMA leverages two key innovations: 1) Equal Distance to Visual Tokens (EDVT) attention to maintain consistent influence of visual information on text generation, and 2) a sequential visual projector to encode temporal relationships between video frames. Vista-LLaMA outperforms previous state-of-the-art methods on four challenging open-ended video question answering benchmarks, including achieving new state-of-the-art performance on zero-shot NExT-QA and MSRVTT-QA. EDVT attention significantly improves accuracy across various question types, demonstrating its ability to enhance multi-modal understanding in LLMs. The sequential visual projector effectively encodes temporal context, leading to improved performance compared to other visual projectors. The evaluation relies on GPT-3.5 for assessment, which may introduce inaccuracies compared to the more expensive GPT-4. The study focuses on fine-tuning rather than pre-training, potentially limiting the full exploration of EDVT-Attention's capabilities for video comprehension and other multi-modal tasks. video understanding, large language models, video question answering, multi-modal learning, hallucination reduction
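A minimal sketch of the EDVT idea: positional information is applied only between text tokens, so the score a text token assigns to visual tokens does not depend on how far into the generation it sits. The additive-bias form of positional encoding below is an assumption for illustration (the actual model uses the LLM's own rotary encoding).

```python
# Sketch of "equal distance to visual tokens" attention for one head: a
# relative-position bias is applied only on the text-to-text block, while
# attention from text to visual tokens carries no distance-dependent term,
# so visual influence does not decay as the generated text grows.
import math
import torch

def edvt_attention(q, k, v, num_visual, text_rel_bias):
    # q, k, v: (B, T, d); the first `num_visual` positions are visual tokens,
    # the rest are text tokens. text_rel_bias: (T_text, T_text) positional bias.
    b, t, d = q.shape
    t_text = t - num_visual
    scores = q @ k.transpose(1, 2) / math.sqrt(d)          # (B, T, T)
    # Positional term only on the text-to-text block of the score matrix.
    scores[:, num_visual:, num_visual:] += text_rel_bias[:t_text, :t_text]
    # Standard causal mask; visual tokens come first, so every text token
    # still attends to all visual tokens at a position-free score.
    causal = torch.ones(t, t, dtype=torch.bool).tril()
    scores = scores.masked_fill(~causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v               # (B, T, d)

q = k = v = torch.randn(1, 40, 64)
bias = torch.zeros(32, 32)                                  # 40 tokens, 8 visual
print(edvt_attention(q, k, v, num_visual=8, text_rel_bias=bias).shape)
```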
2312.08825 Report Guided Diffusion from Self-Supervised Diffusion Features Vincent Tao Hu, Yunlu Chen, Mathilde Caron, Yuki M. Asano, Cees G. M. Snoek, Bjorn Ommer Guidance serves as a key concept in diffusion models, yet its effectiveness is often limited by the need for extra data annotation or classifier pretraining. That is why guidance has been harnessed from self-supervised learning backbones, like DINO. However, recent studies have revealed that the feature representation derived from the diffusion model itself is discriminative for numerous downstream tasks as well, which prompts us to propose a framework to extract guidance from, and specifically for, diffusion models. Our research has yielded several significant contributions. Firstly, the guidance signals from diffusion models are on par with those from class-conditioned diffusion models. Secondly, feature regularization, when based on the Sinkhorn-Knopp algorithm, can further enhance feature discriminability in comparison to unconditional diffusion models. Thirdly, we have constructed an online training approach that can concurrently derive guidance from diffusion models for diffusion models. Lastly, we have extended the application of diffusion models along the constant-velocity path of an ODE to achieve a more favorable balance between sampling steps and fidelity. The performance of our methods has been outstanding, outperforming related baselines on high-resolution datasets, such as ImageNet256, ImageNet256-100 and LSUN-Churches. Our code will be released. This paper proposes a novel framework to extract guidance signals directly from diffusion models themselves, eliminating the need for external data annotations or self-supervised learning backbones. Current methods for guiding diffusion models towards higher fidelity outputs rely heavily on labor-intensive data annotation or the use of external pretrained models, which limits their practical applicability. The authors introduce two approaches: 1) offline guidance extracts guidance from pretrained diffusion model features using k-means clustering, 2) online guidance utilizes an online optimal-transport-based algorithm (Sinkhorn-Knopp) to jointly learn the diffusion model and the guidance signals. Guidance signals derived directly from diffusion models are on par with those from class-conditioned diffusion models. Online Sinkhorn-Knopp clustering significantly enhances feature discriminability compared to unconditional diffusion models. The proposed framework achieves a favorable balance between sampling speed and fidelity by leveraging the constant velocity path of ODEs. Performance gap still exists compared to class-conditioned diffusion models which use extensive data annotations. Exploring other backbone architectures like DiT for potential improvements in future work. diffusion models, self-guidance, image generation, sinkhorn-knopp algorithm, optimal transport
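The online guidance variant relies on Sinkhorn-Knopp to turn feature-prototype similarities into balanced soft cluster assignments that act as pseudo-conditioning. Below is a generic SwAV-style sketch with illustrative hyperparameters, not the authors' training code.

```python
# Sketch of Sinkhorn-Knopp cluster assignment for online self-labeling:
# turn feature-to-prototype scores into a balanced soft assignment that
# can serve as a pseudo class-conditioning (guidance) signal.
import torch

@torch.no_grad()
def sinkhorn(scores, eps=0.05, n_iters=3):
    # scores: (B, K) similarity between features and K prototypes
    q = torch.exp(scores / eps).t()          # (K, B)
    q /= q.sum()
    k, b = q.shape
    for _ in range(n_iters):
        q /= q.sum(dim=1, keepdim=True)      # rows: equal mass per cluster
        q /= k
        q /= q.sum(dim=0, keepdim=True)      # columns: one unit per sample
        q /= b
    return (q * b).t()                        # (B, K), rows sum to ~1

features = torch.nn.functional.normalize(torch.randn(16, 128), dim=1)
prototypes = torch.nn.functional.normalize(torch.randn(10, 128), dim=1)
assignments = sinkhorn(features @ prototypes.t())
pseudo_labels = assignments.argmax(dim=1)     # used in place of class labels
print(assignments.shape, pseudo_labels)
```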
2312.08768 Report Local Conditional Controlling for Text-to-Image Diffusion Models Yibo Zhao, Liang Peng, Yang Yang, Zekai Luo, Hengjia Li, Yao Chen, Wei Zhao, qinglin lu, Boxi Wu, Wei Liu Diffusion models have exhibited impressive prowess in the text-to-image task. Recent methods add image-level controls, e.g., edge and depth maps, to manipulate the generation process together with text prompts to obtain desired images. This controlling process is globally operated on the entire image, which limits the flexibility of control regions. In this paper, we introduce a new simple yet practical task setting: local control. It focuses on controlling specific local areas according to user-defined image conditions, while the remaining areas are conditioned only on the original text prompt. This manner allows the users to flexibly control the image generation in a fine-grained way. However, it is non-trivial to achieve this goal. The naive manner of directly adding local conditions may lead to the local control dominance problem. To mitigate this problem, we propose a training-free method that leverages the updates of noised latents and parameters in the cross-attention map during the denoising process to promote concept generation in non-control areas. Moreover, we use feature mask constraints to mitigate the degradation of synthesized image quality caused by information differences inside and outside the local control area. Extensive experiments demonstrate that our method can synthesize high-quality images aligned with the prompt under local control conditions. Code is available at https://github.com/YibooZhao/Local-Control. This paper introduces 'local control', a new paradigm for controllable image synthesis using diffusion models, where users can control specific regions of an image using image conditions while the remaining image adheres to a text prompt. Existing controllable image generation methods mainly focus on global, image-level control, lacking the flexibility for fine-grained local manipulations desired by users. The paper proposes a training-free method that integrates with existing control models like ControlNet. It leverages the cross-attention maps during the denoising process to: 1) identify and regenerate objects ignored due to local control dominance, 2) focus token responses to refine object distinction, and 3) apply feature mask constraints to mitigate image degradation caused by information discrepancies. The proposed method successfully synthesizes high-quality images that align with both local image conditions and text prompts. Extensive experiments on COCO and a custom dataset demonstrate superior performance over existing controllable methods like ControlNet and T2I-Adapter, achieving better FID, CLIP Score, and CLIP T2T similarity. Ablation studies validate the contribution of each proposed component: object regeneration, focused token response, and feature mask constraint, showcasing their effectiveness in handling local control dominance and improving image quality. The method might encounter challenges in maintaining semantic consistency and object scaling between the locally controlled region and the rest of the image. Future work could explore establishing a more comprehensive connection between these regions to enhance visual coherence. image synthesis, diffusion models, controllable generation, local control, cross-attention
2312.08754 Report UniDream: Unifying Diffusion Priors for Relightable Text-to-3D Generation Zexiang Liu, Yangguang Li, Youtian Lin, Xin Yu, Sida Peng, Yan-Pei Cao, Xiaojuan Qi, Xiaoshui Huang, Ding Liang, Wanli Ouyang Recent advancements in text-to-3D generation technology have significantly advanced the conversion of textual descriptions into imaginative, geometrically well-formed and finely textured 3D objects. Despite these developments, a prevalent limitation arises from the use of RGB data in diffusion or reconstruction models, which often results in models with inherent lighting and shadow effects that detract from their realism, thereby limiting their usability in applications that demand accurate relighting capabilities. To bridge this gap, we present UniDream, a text-to-3D generation framework that incorporates unified diffusion priors. Our approach consists of three main components: (1) a dual-phase training process to obtain albedo-normal aligned multi-view diffusion and reconstruction models, (2) a progressive generation procedure for geometry and albedo textures based on Score Distillation Sampling (SDS) using the trained reconstruction and diffusion models, and (3) an innovative application of SDS for finalizing PBR generation while keeping a fixed albedo based on the Stable Diffusion model. Extensive evaluations demonstrate that UniDream surpasses existing methods in generating 3D objects with clearer albedo textures, smoother surfaces, enhanced realism, and superior relighting capabilities. UniDream, a novel text-to-3D generation framework that generates relightable 3D objects from text descriptions by incorporating unified diffusion priors, disentangling illumination from textures. Existing text-to-3D methods lack relighting capabilities due to inherent lighting and shadows baked into object textures, limiting realism and usability in applications demanding accurate lighting control. UniDream utilizes a three-stage pipeline: 1) An albedo-normal aligned multi-view diffusion model (AN-MVM) generates consistent multi-view images. 2) A transformer-based reconstruction model (TRM) provides a 3D coarse model from albedo images. 3) Score Distillation Sampling (SDS) refines the model, and Stable Diffusion generates PBR materials while keeping albedo fixed. Realistic Materials: UniDream accurately generates PBR materials that approximate real-world textures and can be relit in various lighting conditions. Complete Geometry: UniDream excels at generating comprehensive geometric details, leading to more complete 3D objects. Stable Generation: UniDream demonstrates greater effectiveness in generating 3D objects due to the 3D prior and normal supervision. Limited semantic and material generalization due to training data size. The rendering pipeline can be upgraded to incorporate path tracing for enhanced realism. text-to-3d generation, relightable 3d objects, diffusion models, physically-based rendering, score distillation sampling
2312.08746 Report DreamDrone Hanyang Kong, Dongze Lian, Michael Bi Mi, Xinchao Wang We introduce DreamDrone, an innovative method for generating unbounded flythrough scenes from textual prompts. Central to our method is a novel feature-correspondence-guidance diffusion process, which utilizes the strong correspondence of intermediate features in the diffusion model. Leveraging this guidance strategy, we further propose an advanced technique for editing the intermediate latent code, enabling the generation of subsequent novel views with geometric consistency. Extensive experiments reveal that DreamDrone significantly surpasses existing methods, delivering highly authentic scene generation with exceptional visual quality. This approach marks a significant step in zero-shot perpetual view generation from textual prompts, enabling the creation of diverse scenes, including natural landscapes like oases and caves, as well as complex urban settings such as Lego-style street views. Our code is publicly available. Introduces DreamDrone, a zero-shot, training-free method for generating unbounded flythrough scenes from textual prompts. Addresses limitations of existing perpetual view generation methods that struggle with forward camera movement, outdoor scenes, and text-based scene creation. Leverages pre-trained text-to-image diffusion models and depth estimation models with a novel feature-correspondence-guidance diffusion process and latent code editing for geometry-consistent novel view generation. Significantly outperforms existing methods in terms of visual quality and CLIP score, indicating strong text-scene alignment. Demonstrates versatility in generating diverse scenes, including natural landscapes, imaginative scenarios, and complex urban settings. Maintains high fidelity and detail in generated scenes, even over extended sequences of frames, unlike competing approaches. Correspondence of high-frequency details between adjacent frames can be further improved. Reliance on accurate depth estimation can impact performance in scenes with unique styles. perpetual view generation, text-to-scene synthesis, diffusion models, zero-shot learning, ai-generated content
2312.08744 Report GOEnFusion: Gradient Origin Encodings for 3D Forward Diffusion Models Animesh Karnewar, Andrea Vedaldi, Niloy J. Mitra, David Novotny The recently introduced Forward-Diffusion method allows training a 3D diffusion model using only 2D images for supervision. However, it does not easily generalise to different 3D representations and requires a computationally expensive auto-regressive sampling process to generate the underlying 3D scenes. In this paper, we propose GOEn: Gradient Origin Encoding (pronounced "gone"). GOEn can encode input images into any type of 3D representation without the need to use a pre-trained image feature extractor. It can also handle single, multiple or no source view(s) alike, by design, and tries to maximise the information transfer from the views to the encodings. Our proposed GOEnFusion model pairs GOEn encodings with a realisation of the Forward-Diffusion model which addresses the limitations of the vanilla Forward-Diffusion realisation. We evaluate how much information the GOEn mechanism transfers to the encoded representations, and how well it captures the prior distribution over the underlying 3D scenes, through the lens of a partial AutoEncoder. Lastly, the efficacy of the GOEnFusion model is evaluated on the recently proposed OmniObject3D dataset, comparing it to state-of-the-art Forward and non-Forward-Diffusion models and other 3D generative models. This paper proposes GOEn (Gradient Origin Encoding), a novel encoding mechanism to encode source views into arbitrary 3D representations, and GOEnFusion, an improved realization of the Forward-Diffusion model for 3D generation and reconstruction. The paper aims to address the limitations of existing 3D generative models, particularly in handling different 3D representations and requiring computationally expensive sampling processes. GOEn encodes information from source views into 3D representations by computing the gradient of the log-likelihood of the observations under a differentiable forward operation. GOEnFusion integrates GOEn with a denoising network, enabling efficient generation and reconstruction of 3D scenes. GOEn effectively transfers information from source views to different 3D representations, achieving promising results in partial autoencoding experiments. GOEnFusion outperforms the vanilla Forward-Diffusion model in 3D generation on the OmniObject3D dataset, demonstrating improved quality and efficiency. The GOEn mechanism shows strong potential for 3D reconstruction, achieving competitive results in a regression-based setting. The application of GOEnFusion to various 3D representations is limited by their compatibility with existing denoising network architectures. Exploring the use of GOEn in other stochastic inverse problems beyond 3D vision is a potential area for future research. 3d generation, 3d reconstruction, forward-diffusion models, gradient origin networks, neural radiance fields
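The core GOEn operation, encoding observations as the gradient of their log-likelihood taken at a zero-initialized ("origin") representation under a differentiable forward operator, can be sketched with a toy linear projection standing in for volume rendering. All shapes and the Gaussian likelihood are illustrative assumptions.

```python
# Sketch of a Gradient Origin Encoding: start the 3D representation at the
# origin (zeros), render it with a differentiable forward operator, and use
# the gradient of the observation log-likelihood w.r.t. the representation
# as the encoding. The "renderer" here is a toy linear stand-in.
import torch

def differentiable_forward(rep, proj):
    # Stand-in for volume rendering: project the representation to view space.
    return rep.flatten() @ proj               # (num_pixels,)

def goen_encode(observations, proj):
    rep = torch.zeros(8, 8, 8, requires_grad=True)        # origin representation
    rendered = differentiable_forward(rep, proj)
    # Gaussian log-likelihood of the observed views given the rendering
    # (up to a constant), i.e. negative squared error.
    log_lik = -0.5 * ((rendered - observations) ** 2).sum()
    (grad,) = torch.autograd.grad(log_lik, rep)
    return grad                                            # same shape as rep

proj = torch.randn(8 * 8 * 8, 100)            # toy forward operator
obs = torch.randn(100)                         # pixels of the source view(s)
encoding = goen_encode(obs, proj)              # fed to the denoiser as conditioning
print(encoding.shape)                          # torch.Size([8, 8, 8])
```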
2312.08568 Report NViST: In the Wild New View Synthesis from a Single Image with Transformers Wonbong Jang, Lourdes Agapito We propose NViST, a transformer-based model for efficient and generalizable novel-view synthesis from a single image for real-world scenes. In contrast to many methods that are trained on synthetic data, object-centred scenarios, or in a category-specific manner, NViST is trained on MVImgNet, a large-scale dataset of casually-captured real-world videos of hundreds of object categories with diverse backgrounds. NViST transforms image inputs directly into a radiance field, conditioned on camera parameters via adaptive layer normalisation. In practice, NViST exploits fine-tuned masked autoencoder (MAE) features and translates them to 3D output tokens via cross-attention, while addressing occlusions with self-attention. To move away from object-centred datasets and enable full scene synthesis, NViST adopts a 6-DOF camera pose model and only requires relative pose, dropping the need for canonicalization of the training data, which removes a substantial barrier to it being used on casually captured datasets. We show results on unseen objects and categories from MVImgNet and even generalization to casual phone captures. We conduct qualitative and quantitative evaluations on MVImgNet and ShapeNet to show that our model represents a step forward towards enabling true in-the-wild generalizable novel-view synthesis from a single image. Project webpage: https://wbjang.github.io/nvist_webpage. The paper introduces NViST, a transformer-based model for novel view synthesis from single in-the-wild images, trained on the large-scale MVImgNet dataset. Generalizing NeRF-based models to real-world scenes is challenging due to scale ambiguities, scene misalignments, and diverse backgrounds. This work aims to address these challenges by leveraging a large-scale, diverse dataset and a novel transformer architecture. NViST uses a fine-tuned MAE as an encoder and a novel transformer decoder that maps features to a vector-matrix radiance field representation. It uses cross-attention for feature mapping, self-attention for occlusion reasoning, and adaptive layer normalization for conditioning on camera parameters. Notably, it only requires relative camera poses, allowing it to learn from casually captured datasets. NViST demonstrates high-quality novel view synthesis on challenging real-world scenes from MVImgNet. The model generalizes well to unseen object categories and out-of-distribution phone-captured scenes. Quantitative comparisons on MVImgNet and ShapeNet-SRN show competitive performance against state-of-the-art methods. Limited training resources led to downsampling images and potentially impacted sharpness. The absence of GAN losses or SDS might have contributed to some loss of detail. novel view synthesis, transformers, neural radiance fields, single image 3d reconstruction, real-world scenes
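Conditioning the decoder on camera parameters via adaptive layer normalisation can be sketched as a small MLP that maps the pose to a per-channel scale and shift applied after LayerNorm. Dimensions and the camera parameterization below are assumptions, not the paper's exact configuration.

```python
# Sketch of adaptive layer normalization (adaLN) conditioned on camera
# parameters: an MLP predicts scale and shift for the normalized tokens.
import torch
import torch.nn as nn

class CameraAdaLN(nn.Module):
    def __init__(self, dim=768, cam_dim=16):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Sequential(
            nn.Linear(cam_dim, dim), nn.SiLU(), nn.Linear(dim, 2 * dim)
        )

    def forward(self, tokens, cam):
        # tokens: (B, N, dim); cam: (B, cam_dim) flattened relative pose + intrinsics
        scale, shift = self.to_scale_shift(cam).chunk(2, dim=-1)
        return self.norm(tokens) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

layer = CameraAdaLN()
tokens = torch.randn(2, 256, 768)
cam = torch.randn(2, 16)   # e.g., flattened 3x4 relative pose + normalized focal
print(layer(tokens, cam).shape)   # torch.Size([2, 256, 768])
```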
2312.08563 Report Efficient-NeRF2NeRF: Streamlining Text-Driven 3D Editing with Multiview Correspondence-Enhanced Diffusion Models Liangchen Song, Liangliang Cao, Jiatao Gu, Yifan Jiang, Junsong Yuan, Hao Tang The advancement of text-driven 3D content editing has been blessed by the progress from 2D generative diffusion models. However, a major obstacle hindering the widespread adoption of 3D content editing is its time-intensive processing. This challenge arises from the iterative and refining steps required to achieve consistent 3D outputs from 2D image-based generative models. Recent state-of-the-art methods typically require optimization time ranging from tens of minutes to several hours to edit a 3D scene using a single GPU. In this work, we propose that by incorporating correspondence regularization into diffusion models, the process of 3D editing can be significantly accelerated. This approach is inspired by the notion that the estimated samples during diffusion should be multiview-consistent during the diffusion generation process. By leveraging this multiview consistency, we can edit 3D content at a much faster speed. In most scenarios, our proposed technique brings a 10$\times$ speed-up compared to the baseline method and completes the editing of a 3D scene in 2 minutes with comparable quality. This paper introduces a novel framework for efficiently editing NeRF models using text-based instructions, achieving a 10x speedup compared to previous methods by leveraging multiview consistency. Existing text-driven 3D content editing techniques are computationally expensive and time-consuming, limiting their practical applications. This work addresses this challenge by significantly accelerating the editing process. The proposed method regularizes the diffusion denoising process to maintain multiview consistency across generated images. This eliminates the need for iterative dataset updates and enables direct editing of the NeRF representation using a style matching loss. The approach achieves a 10x speedup compared to the baseline Instruct-NeRF2NeRF, editing a 3D scene in just 2 minutes. Experiments demonstrate faster convergence and comparable editing quality to state-of-the-art methods. The method can be integrated with Instruct-NeRF2NeRF to further enhance editing quality and speed. The final editing quality, while comparable, may not always surpass existing methods in all scenarios. Future work could explore alternative regularization techniques and loss functions to further improve editing fidelity and generalization. nerf, 3d content editing, diffusion models, multiview consistency, text-driven editing
2312.08372 Report SAM-guided Graph Cut for 3D Instance Segmentation Haoyu Guo, He Zhu, Sida Peng, Yuang Wang, Yujun Shen, Ruizhen Hu, Xiaowei Zhou This paper addresses the challenge of 3D instance segmentation by simultaneously leveraging 3D geometric and multi-view image information. Many previous works have applied deep learning techniques to 3D point clouds for instance segmentation. However, these methods often failed to generalize to various types of scenes due to the scarcity and low diversity of labeled 3D point cloud data. Some recent works have attempted to lift 2D instance segmentations to 3D within a bottom-up framework. The inconsistency in 2D instance segmentations among views can substantially degrade the performance of 3D segmentation. In this work, we introduce a novel 3D-to-2D query framework to effectively exploit 2D segmentation models for 3D instance segmentation. Specifically, we pre-segment the scene into several superpoints in 3D, formulating the task into a graph cut problem. The superpoint graph is constructed based on 2D segmentation models, where node features are obtained from multi-view image features and edge weights are computed based on multi-view segmentation results, enabling better generalization ability. To process the graph, we train a graph neural network using pseudo 3D labels from 2D segmentation models. Experimental results on the ScanNet, ScanNet++ and KITTI-360 datasets demonstrate that our method achieves robust segmentation performance and can generalize across different types of scenes. Our project page is available at https://zju3dv.github.io/sam_graph. This paper proposes a novel 3D instance segmentation method that leverages 2D segmentation cues from the Segment Anything Model (SAM) within a 3D-to-2D query framework. Existing 3D instance segmentation methods suffer from the scarcity of labeled 3D data, while 2D-to-3D lifting methods often fail due to inconsistencies in multi-view 2D segmentation. The method pre-segments the 3D scene into superpoints and constructs a graph. SAM is then used to annotate graph edges with affinity scores and nodes with aggregated image features. Finally, a graph neural network trained with pseudo labels from 2D segmentation refines the graph for 3D instance segmentation. The method achieves state-of-the-art results on the ScanNet dataset. It exhibits excellent generalization ability, effectively segmenting scenes from ScanNet++ and KITTI-360 datasets without fine-tuning. Ablation studies demonstrate the effectiveness of SAM guidance, pseudo label training, and the graph neural network. The method's reliance on both 3D geometry and multi-view images limits its application scenarios. Segmentation accuracy is constrained by the initial superpoint pre-segmentation, which could be improved by incorporating semantic information. 3d instance segmentation, segment anything model (sam), 3d-to-2d query, graph neural network, pseudo labels
2312.08366 Report See, Say, and Segment: Teaching LMMs to Overcome False Premises Tsung-Han Wu, Giscard Biamby, David Chan, Lisa Dunlap, Ritwik Gupta, Xudong Wang, Joseph E. Gonzalez, Trevor Darrell Current open-source Large Multimodal Models (LMMs) excel at tasks such as open-vocabulary language grounding and segmentation but can suffer under false premises when queries imply the existence of something that is not actually present in the image. We observe that existing methods that fine-tune an LMM to segment images significantly degrade their ability to reliably determine ("see") if an object is present and to interact naturally with humans ("say"), a form of catastrophic forgetting. In this work, we propose a cascading and joint training approach for LMMs to solve this task, avoiding catastrophic forgetting of previous skills. Our resulting model can "see" by detecting whether objects are present in an image, "say" by telling the user if they are not, proposing alternative queries or correcting semantic errors in the query, and finally "segment" by outputting the mask of the desired objects if they exist. Additionally, we introduce a novel False Premise Correction benchmark dataset, an extension of existing RefCOCO(+/g) referring segmentation datasets (which we call FP-RefCOCO(+/g)). The results show that our method not only detects false premises up to 55% better than existing approaches, but under false premise conditions produces relative cIOU improvements of more than 31% over baselines, and produces natural language feedback judged helpful up to 67% of the time. This paper introduces a novel False Premise Correction task for Large Multimodal Models (LMMs) and a new dataset, FP-RefCOCO(+/g), to address the issue of LMMs hallucinating segmentations for non-existent objects. Existing LMMs for referring segmentation often fail to handle queries involving non-existent objects, hindering their ability to interact naturally with humans in real-world scenarios. The authors propose two methods: a cascading approach combining separate LMMs for object detection and segmentation, and SESAME, a unified LMM jointly trained on a combined dataset to perform 'see', 'say', and 'segment' functions. SESAME outperforms baselines in detecting false premise queries by up to 55%. SESAME provides helpful natural language feedback in response to false premise queries, judged helpful up to 67% of the time. SESAME achieves superior segmentation accuracy (cIoU) compared to baselines, with relative improvements of up to 31% under false premise conditions. The model's ability to detect false premises still has room for improvement. The model sometimes generates hallucinated corrected premises that are still factually incorrect. large multimodal models, referring segmentation, false premise detection, reasoning segmentation, human-computer interaction
2312.08338 Report Global Latent Neural Rendering Thomas Tanay, Matteo Maggioni A recent trend among generalizable novel view synthesis methods is to learn a rendering operator acting over single camera rays. This approach is promising because it removes the need for explicit volumetric rendering, but it effectively treats target images as collections of independent pixels. Here, we propose to learn a global rendering operator acting over all camera rays jointly. We show that the right representation to enable such rendering is a 5-dimensional plane sweep volume consisting of the projection of the input images on a set of planes facing the target camera. Based on this understanding, we introduce our Convolutional Global Latent Renderer (ConvGLR), an efficient convolutional architecture that performs the rendering operation globally in a low-resolution latent space. Experiments on various datasets under sparse and generalizable setups show that our approach consistently outperforms existing methods by significant margins. This paper introduces global latent neural rendering, a novel view synthesis approach that learns a generalizable light field model directly from plane sweep volumes. This method addresses limitations of previous generalizable novel view synthesis approaches that rely on single-ray rendering, enabling more efficient and accurate rendering by processing all camera rays jointly. The method utilizes a convolutional neural network called ConvGLR (Convolutional Global Latent Renderer). ConvGLR operates on plane sweep volumes (PSVs), exploiting their inherent encoding of epipolar geometry to perform global rendering in a low-resolution latent space. ConvGLR consistently outperforms existing methods in sparse and generalizable novel view synthesis tasks across DTU, Real-Forward Facing, and Spaces datasets. The method exhibits significant improvements in rendering quality, particularly in challenging scenarios with limited input views. ConvGLR surpasses the performance of the winning entry and organizing team's baseline in the recent ICCV 2023 view synthesis challenge on the ILSH dataset. Further optimization of ConvGLR's architecture and training strategies may yield additional performance gains. Exploration of scene-adaptive depth plane sampling could improve rendering accuracy in complex scenes. novel view synthesis, plane sweep volume, global latent rendering, convolutional neural network, epipolar geometry
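A plane sweep volume of the kind ConvGLR consumes can be built by homography-warping each source image onto fronto-parallel planes in front of the target camera and stacking the warps along a depth axis. The pose convention (source camera <- target camera) and shapes in this sketch are assumptions, and only a single source view is shown.

```python
# Sketch of plane sweep volume construction: warp a source image onto a set
# of fronto-parallel planes in the target camera frame via per-plane
# homographies, then stack the warps along a depth dimension.
import torch
import torch.nn.functional as F

def plane_sweep_volume(src_img, K_src, K_tgt, R, t, depths, out_hw):
    # src_img: (1, 3, Hs, Ws); R, t map target-camera coords to source-camera
    # coords; depths: list of plane depths in the target frame.
    h, w = out_hw
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).float()   # (H, W, 3)
    n = torch.tensor([0.0, 0.0, 1.0]).view(1, 3)                       # plane normal
    planes = []
    for d in depths:
        # Homography taking target pixels to source pixels for a plane at depth d.
        H_mat = K_src @ (R - t.view(3, 1) @ n / d) @ torch.inverse(K_tgt)
        src_pix = pix @ H_mat.t()                                       # (H, W, 3)
        src_xy = src_pix[..., :2] / src_pix[..., 2:3].clamp(min=1e-6)
        hs, ws = src_img.shape[-2:]
        # Normalize pixel coordinates to [-1, 1] for grid_sample.
        grid = torch.stack(
            [2 * src_xy[..., 0] / (ws - 1) - 1, 2 * src_xy[..., 1] / (hs - 1) - 1],
            dim=-1,
        ).unsqueeze(0)                                                  # (1, H, W, 2)
        planes.append(F.grid_sample(src_img, grid, align_corners=True))
    return torch.stack(planes, dim=2)    # (1, 3, D, H, W) plane sweep volume

src = torch.rand(1, 3, 128, 128)
K = torch.tensor([[100.0, 0, 64], [0, 100.0, 64], [0, 0, 1]])
R, t = torch.eye(3), torch.tensor([0.1, 0.0, 0.0])
psv = plane_sweep_volume(src, K, K, R, t, depths=[1.0, 2.0, 4.0, 8.0], out_hw=(128, 128))
print(psv.shape)   # torch.Size([1, 3, 4, 128, 128])
```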
2312.08168 Report Chat-3D v2: Bridging 3D Scene and Large Language Models with Object Identifiers Haifeng Huang, Zehan Wang, Rongjie Huang, Luping Liu, Xize Cheng, Yang Zhao, Tao Jin, Zhou Zhao Recent research has demonstrated the significant potential of Large Language Models (LLMs) in handling challenging tasks within 3D scenes. However, current models are constrained to addressing object-centric tasks, where each question-answer pair focuses solely on an individual object. In real-world applications, users may pose queries involving multiple objects or expect answers that precisely reference various objects. We introduce the use of object identifiers to freely reference objects during a conversation. While this solution appears straightforward, it presents two main challenges: 1) How to establish a reliable one-to-one correspondence between each object and its identifier? 2) How to incorporate complex spatial relationships among dozens of objects into the embedding space of the LLM? To address these challenges, we propose a two-stage alignment method, which involves learning an attribute-aware token and a relation-aware token for each object. These tokens capture the object's attributes and spatial relationships with surrounding objects in the 3D scene. Once the alignment is established, we can fine-tune our model on various downstream tasks using instruction tuning. Experiments conducted on traditional datasets like ScanQA, ScanRefer, and Nr3D/Sr3D showcase the effectiveness of our proposed method. Additionally, we create a 3D scene captioning dataset annotated with rich object identifiers, with the assistance of GPT-4. This dataset aims to further explore the capability of object identifiers in effective object referencing and precise scene understanding. This paper introduces a novel approach to 3D scene understanding using large language models (LLMs) by incorporating unique object identifiers for explicit object referencing. Existing 3D scene understanding models are limited to object-centric tasks, struggling with complex queries involving multiple objects and precise referencing. This work enables LLMs to understand and reference objects within a 3D scene more effectively. The authors propose a two-stage alignment method: object-level alignment learns attribute-aware tokens by mapping 3D object features to the LLM's embedding space, and scene-level alignment incorporates spatial relationships using a relation module to generate relation-aware tokens. The model outperforms previous 3D LLMs and achieves comparable results to supervised baselines on 3D question answering and visual grounding tasks. The introduction of object identifiers enables the model to reference specific objects unambiguously, improving performance and user experience. The authors create an identifier-rich scene captioning dataset with GPT-4 assistance, further demonstrating the model's capability for comprehensive scene understanding. The limited availability of 3D-language data poses challenges for optimal alignment between 3D and language spaces, impacting the model's ability to understand less frequent object classes. Future work can explore more data-efficient architectures, training schemes, and data scaling techniques to further enhance the model's 3D scene understanding capabilities. 3d scene understanding, large language models, object identifiers, multi-modal learning, 3d visual grounding
2312.08128 Report Clockwork Diffusion: Efficient Generation With Model-Step Distillation Amirhossein Habibian, Amir Ghodrati, Noor Fathima, Guillaume Sautiere, Risheek Garrepalli, Fatih Porikli, Jens Petersen This work aims to improve the efficiency of text-to-image diffusion models. While diffusion models use computationally expensive UNet-based denoising operations in every generation step, we identify that not all operations are equally relevant for the final output quality. In particular, we observe that UNet layers operating on high-res feature maps are relatively sensitive to small perturbations. In contrast, low-res feature maps influence the semantic layout of the final image and can often be perturbed with no noticeable change in the output. Based on this observation, we propose Clockwork Diffusion, a method that periodically reuses computation from preceding denoising steps to approximate low-res feature maps at one or more subsequent steps. For multiple baselines, and for both text-to-image generation and image editing, we demonstrate that Clockwork leads to comparable or improved perceptual scores with drastically reduced computational complexity. As an example, for Stable Diffusion v1.5 with 8 DPM++ steps we save 32% of FLOPs with negligible FID and CLIP change. Clockwork Diffusion accelerates text-to-image diffusion models by reusing low-resolution feature maps from preceding denoising steps. Diffusion models are computationally expensive, and this work makes them faster by identifying and exploiting redundancy in the generation process. The authors replace lower-resolution parts of the diffusion UNet with lightweight adaptors, conditioned on previous features and other inputs. They alternate between approximated and full UNet passes during sampling. Clockwork Diffusion reduces FLOPs by up to 38% while maintaining comparable image quality on MS-COCO. The method is complementary to other optimization techniques and can be applied on top of distilled models. Clockwork Diffusion is also effective for text-guided image editing, leading to significant speedups for methods like Plug-and-Play. Clockwork Diffusion is currently trained for a fixed operating point and scheduler. The method's effectiveness on non-UNet architectures is unknown. diffusion models, text-to-image generation, image editing, model distillation, efficient inference
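The periodic reuse of low-resolution UNet computation described above can be sketched as a sampling loop. `unet_split`, `adaptor`, and the scheduler interface are hypothetical stand-ins for a UNet split into high- and low-resolution branches, not the authors' actual APIs.

```python
# Clockwork-style sampling loop: a full UNet pass only every `clock` steps; in
# between, the expensive low-resolution branch is approximated from cached
# features by a lightweight adaptor.
def clockwork_sampling(latent, text_emb, scheduler, unet_split, adaptor, clock=2):
    cached_lowres = None
    for i, t in enumerate(scheduler.timesteps):
        high_feats = unet_split.encode_high_res(latent, t, text_emb)
        if cached_lowres is None or i % clock == 0:
            # Full pass: compute and cache the low-resolution branch.
            cached_lowres = unet_split.run_low_res(high_feats, t, text_emb)
        else:
            # Cheap pass: approximate it from the previous step's cache.
            cached_lowres = adaptor(cached_lowres, high_feats, t)
        noise_pred = unet_split.decode_high_res(high_feats, cached_lowres, t, text_emb)
        latent = scheduler.step(noise_pred, t, latent).prev_sample
    return latent
```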
2312.08071 Report Novel View Synthesis with View-Dependent Effects from a Single Image Juan Luis Gonzalez Bello, Munchurl Kim In this paper, we are the first to consider view-dependent effects in single image-based novel view synthesis (NVS). For this, we propose to exploit the camera motion priors in NVS to model view-dependent appearance or effects (VDE) as the negative disparity in the scene. By recognizing that specularities "follow" the camera motion, we infuse VDEs into the input images by aggregating input pixel colors along the negative depth region of the epipolar lines. Also, we propose a 'relaxed volumetric rendering' approximation that allows computing the densities in a single pass, improving efficiency for NVS from single images. Our method can learn single-image NVS from image sequences only, which is a completely self-supervised learning method, for the first time requiring neither depth nor camera pose annotations. We present extensive experimental results and show that our proposed method can learn NVS with VDEs, outperforming the SOTA single-view NVS methods on the RealEstate10k and MannequinChallenge datasets. This paper presents NVSVDE-Net, the first single-view novel view synthesis (NVS) method that models view-dependent effects (VDEs) like reflections from a single image. Existing single-view NVS methods struggle to model VDEs, limiting their realism. NVSVDE-Net addresses this gap by leveraging camera motion priors to synthesize VDEs. The method introduces a 'relaxed volumetric rendering' approximation for efficient novel view generation and a novel approach to synthesize VDEs by leveraging negative disparities in the scene induced by target camera motion. Additionally, it utilizes a self-supervised training scheme based on image sequences, eliminating the need for depth or pose annotations. NVSVDE-Net outperforms state-of-the-art single-view NVS methods on RealEstate10k and MannequinChallenge datasets by a large margin in terms of PSNR and other quality metrics. The method successfully generates plausible VDEs and depth maps from single images, enhancing realism. The 'relaxed volumetric rendering' approximation, coupled with a sampler module, allows for fast and efficient rendering of high-quality novel views. The current architecture is limited in rendering high-frequency VDEs, focusing primarily on glossy reflections. Rendering novel views with large baselines (significantly exceeding training disparities) remains a challenge due to large occlusions and limited context for the sampler module. novel view synthesis, view-dependent effects, single-image rendering, relaxed volumetric rendering, self-supervised learning
2312.08048 Report Compositional Inversion for Stable Diffusion Models Xulu Zhang, Xiao-Yong Wei, Jinlin Wu, Tianyi Zhang, Zhaoxiang Zhang, Zhen Lei, Qing Li Inversion methods, such as Textual Inversion, generate personalized images by incorporating concepts of interest provided by user images. However, existing methods often suffer from overfitting issues, where the dominant presence of inverted concepts leads to the absence of other desired concepts. It stems from the fact that during inversion, the irrelevant semantics in the user images are also encoded, forcing the inverted concepts to occupy locations far from the core distribution in the embedding space. To address this issue, we propose a method that guides the inversion process towards the core distribution for compositional embeddings. Additionally, we introduce a spatial regularization approach to balance the attention on the concepts being composed. Our method is designed as a post-training approach and can be seamlessly integrated with other inversion methods. Experimental results demonstrate the effectiveness of our proposed approach in mitigating the overfitting problem and generating more diverse and balanced compositions of concepts in the synthesized images. The source code is available at https://github.com/zhangxulu1996/Compositional-Inversion. This paper proposes a novel compositional inversion approach for text-to-image synthesis, aiming to address the overfitting issue in existing inversion methods and enable more balanced compositions of concepts in generated images. Existing inversion methods often lead to the dominance of inverted concepts in generated images, suppressing the presence of other desired concepts. This limits the diversity and controllability of image synthesis, particularly when composing user-specific concepts with general ones. The proposed approach consists of two components: (1) Semantic Inversion: Guides the embedding search towards the core distribution of concepts by utilizing anchor concepts as attractors, improving coherence with other concepts. (2) Spatial Inversion: Employs an MLP to recover coherent locations of composed concepts and regularizes attention maps to avoid the dominance of inverted concepts during image generation. The proposed method achieves significant improvements over state-of-the-art methods in terms of text-alignment and concept likelihood, indicating enhanced compositionality and presence of desired concepts in generated images. The augmented Textual Inversion with the proposed method achieves comparable performance to fine-tuning based methods (Custom Diffusion, DreamBooth) without modifying network parameters. User study confirms the effectiveness of the method in generating high-quality compositions, while revealing that rigid objects are generally easier to compose than non-rigid objects. The semantic inversion, while improving semantic completeness, may sometimes lead to the generation of low-probability scenes. Future work includes exploring the integration of visual features in location recovery and investigating the potential of the proposed method for multi-concept compositions. text-to-image synthesis, textual inversion, compositionality, concept overfitting, spatial regularization
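The "semantic inversion" idea of pulling the learned token toward the core embedding distribution can be sketched as an extra regularizer on top of a standard textual-inversion reconstruction loss. The anchor set, the softmax-weighted attractor, and the weight `lam` are illustrative assumptions, not the paper's exact formulation.

```python
# Regularized ("semantic") inversion sketch: the learned token embedding is
# pulled toward a softmax-weighted combination of anchor concept embeddings so
# it stays near the core of the text-embedding distribution.
import torch
import torch.nn.functional as F

def semantic_inversion_loss(recon_loss, token_emb, anchor_embs, lam=0.01):
    # token_emb: [D] learnable embedding; anchor_embs: [K, D] frozen anchors.
    weights = torch.softmax(anchor_embs @ token_emb, dim=0)   # similarity-based weights
    attractor = weights @ anchor_embs                         # point inside the anchor span
    return recon_loss + lam * F.mse_loss(token_emb, attractor.detach())
```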
2312.08019 Report AdapEdit: Spatio-Temporal Guided Adaptive Editing Algorithm for Text-Based Continuity-Sensitive Image Editing Zhiyuan Ma, Guoli Jia, Bowen Zhou With the great success of text-conditioned diffusion models in creative text-to-image generation, various text-driven image editing approaches have attracted the attention of many researchers. However, previous works mainly focus on discreteness-sensitive instructions such as adding, removing or replacing specific objects, background elements or global styles (i.e., hard editing), while generally ignoring subject-binding but semantically fine-changing continuity-sensitive instructions such as actions, poses or adjectives, and so on (i.e., soft editing), which hampers generative AI from generating user-customized visual content. To mitigate this predicament, we propose a spatio-temporal guided adaptive editing algorithm AdapEdit, which realizes adaptive image editing by introducing a soft-attention strategy to dynamically vary the guiding degree from the editing conditions to visual pixels from both temporal and spatial perspectives. Note that our approach has a significant advantage in preserving model priors and does not require model training, fine-tuning, extra data, or optimization. We present our results over a wide variety of raw images and editing instructions, demonstrating competitive performance and showing it significantly outperforms the previous approaches. This paper proposes AdapEdit, a spatio-temporal guided adaptive editing algorithm for complex continuity-sensitive image editing tasks, enhancing soft editing capabilities in text-guided image editing. Existing text-based image editing methods struggle with complex, subject-binding instructions (soft editing) like actions or adjectives, limiting user customization in image generation. AdapEdit uses a soft-attention strategy with two modules: 1) Flexible Word-Level Temporal (FWT) adjustment assigns guidance scales to words for temporal editing. 2) Dynamic Pixel-Level Spatial (DPS) weighting integrates edited features into the original image for spatial editing. AdapEdit effectively performs soft editing tasks (e.g., changing postures, adjusting object counts) while preserving original image details. Quantitative evaluation shows AdapEdit achieves higher CLIP score and CLIP directional similarity compared to baselines, indicating better semantic consistency. Ablation studies confirm the effectiveness of FWT and DPS modules in achieving continuity-sensitive editing. The performance of AdapEdit is sensitive to hyperparameter selection, requiring careful tuning for optimal results. Future work includes exploring more advanced attention mechanisms and extending AdapEdit to other generative models. image editing, diffusion models, soft editing, text-guided image generation, spatio-temporal attention
2312.07661 Report CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor Shuyang Sun, Runjia Li, Philip Torr, Xiuye Gu, Siyang Li Existing open-vocabulary image segmentation methods require a fine-tuning step on mask labels and/or image-text datasets. Mask labels are labor-intensive, which limits the number of categories in segmentation datasets. Consequently, the vocabulary capacity of pre-trained VLMs is severely reduced after fine-tuning. However, without fine-tuning, VLMs trained under weak image-text supervision tend to make suboptimal mask predictions. To alleviate these issues, we introduce a novel recurrent framework that progressively filters out irrelevant texts and enhances mask quality without training efforts. The recurrent unit is a two-stage segmenter built upon a frozen VLM. Thus, our model retains the VLM's broad vocabulary space and equips it with segmentation ability. Experiments show that our method outperforms not only the training-free counterparts, but also those fine-tuned with millions of data samples, and sets the new state-of-the-art records for both zero-shot semantic and referring segmentation. Concretely, we improve the current record by 28.8, 16.0, and 6.9 mIoU on Pascal VOC, COCO Object, and Pascal Context. This paper proposes CLIP-as-RNN (CAR), a novel recurrent framework for open-vocabulary image segmentation that leverages a frozen pre-trained vision-language model (VLM) without requiring fine-tuning. Existing open-vocabulary segmentation methods are limited by their reliance on fine-tuning, leading to reduced vocabulary capacity or suboptimal mask predictions. CAR addresses these limitations by preserving the VLM's broad vocabulary and enhancing mask quality without training. CAR employs a recurrent architecture with a two-stage segmenter. The segmenter iteratively refines mask proposals and filters irrelevant text queries by assessing the alignment between visual and textual representations. This process continues until a stable state is achieved. CAR significantly outperforms previous zero-shot open-vocabulary semantic segmentation methods, achieving state-of-the-art results on Pascal VOC, COCO Object, and Pascal Context. The method demonstrates strong performance in referring image segmentation, surpassing previous state-of-the-art on RefCOCO, RefCOCO+, and RefCOCOg. CAR establishes a strong baseline for zero-shot referring video segmentation on Ref-DAVIS 2017. The performance of CAR is limited by the capabilities of the pre-trained VLM. Future work includes incorporating additional trainable modules and exploring integration with other VLMs. open-vocabulary segmentation, vision-language models, zero-shot learning, referring segmentation, recurrent neural networks
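The recurrent filtering loop at the core of CLIP-as-RNN can be sketched as follows; `propose_mask` and `clip_score` are hypothetical wrappers around the frozen VLM's two-stage segmenter and its region-text alignment score.

```python
# Recurrent filtering sketch: propose a mask per remaining text query with a
# frozen VLM, drop queries whose masked region aligns poorly with their text,
# and repeat until the query set reaches a stable state.
def recurrent_segment(image, queries, propose_mask, clip_score,
                      thresh=0.5, max_iters=10):
    active = list(queries)
    masks = {}
    for _ in range(max_iters):
        masks = {q: propose_mask(image, q, context=active) for q in active}
        kept = [q for q in active if clip_score(image, masks[q], q) >= thresh]
        if kept == active:            # fixed point: no query was filtered out
            break
        active = kept
    return {q: masks[q] for q in active}
```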
2312.07541 Report SMERF: Streamable Memory Efficient Radiance Fields for Real-Time Large-Scene Exploration Daniel Duckworth, Peter Hedman, Christian Reiser, Peter Zhizhin, Jean-François Thibert, Mario Lučić, Richard Szeliski, Jonathan T. Barron Recent techniques for real-time view synthesis have rapidly advanced in fidelity and speed, and modern methods are capable of rendering near-photorealistic scenes at interactive frame rates. At the same time, a tension has arisen between explicit scene representations amenable to rasterization and neural fields built on ray marching, with state-of-the-art instances of the latter surpassing the former in quality while being prohibitively expensive for real-time applications. In this work, we introduce SMERF, a view synthesis approach that achieves state-of-the-art accuracy among real-time methods on large scenes with footprints up to 300 m² at a volumetric resolution of 3.5 mm³. Our method is built upon two primary contributions: a hierarchical model partitioning scheme, which increases model capacity while constraining compute and memory consumption, and a distillation training strategy that simultaneously yields high fidelity and internal consistency. Our approach enables full six degrees of freedom (6DOF) navigation within a web browser and renders in real-time on commodity smartphones and laptops. Extensive experiments show that our method exceeds the current state-of-the-art in real-time novel view synthesis by 0.78 dB on standard benchmarks and 1.78 dB on large scenes, renders frames three orders of magnitude faster than state-of-the-art radiance field models, and achieves real-time performance across a wide variety of commodity devices, including smartphones. We encourage readers to explore these models interactively at our project website: https://smerf-3d.github.io. SMERF, a streamable, memory-efficient radiance field representation for real-time view synthesis of large scenes, is introduced. The method renders in real-time on a variety of devices, including smartphones, while exceeding the quality of existing real-time methods. Existing real-time view synthesis techniques struggle to balance quality, speed, and representation size. This work aims to achieve high-fidelity rendering of large scenes in real-time on commodity hardware. A hierarchical model architecture composed of MERF-like submodels is built, leveraging coordinate space partitioning, deferred appearance network partitioning, and feature gating. The model is trained via a novel distillation strategy using a high-fidelity ZipNeRF teacher. SMERF achieves state-of-the-art accuracy among real-time methods, surpassing the previous best by 0.78 dB on standard benchmarks and 1.78 dB on large scenes. The method renders frames three orders of magnitude faster than state-of-the-art radiance field models like ZipNeRF. SMERF achieves real-time performance across a wide variety of commodity devices, including smartphones. The model has high storage cost leading to increased loading times and network usage. Training costs are high, requiring significant GPU resources and time. neural radiance fields, volumetric representation, image synthesis, real-time rendering, distillation
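As a toy illustration of the coordinate-space partitioning above, the helper below maps a camera position to the grid tile whose submodel should be streamed and rendered; the regular x/y tiling and the `(ix, iy)` key are assumptions, not SMERF's storage format.

```python
# Toy tile lookup for coordinate-space partitioning: the scene footprint is
# split into a regular x/y grid and the camera position selects a submodel.
import math

def submodel_index(cam_xyz, scene_min, tile_size, grid_dims):
    idx = []
    for c, lo, dim in zip(cam_xyz[:2], scene_min[:2], grid_dims):
        i = int(math.floor((c - lo) / tile_size))
        idx.append(min(max(i, 0), dim - 1))          # clamp to the grid
    return tuple(idx)                                # key into a dict of submodels
```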
2312.07539 Report HeadArtist: Text-conditioned 3D Head Generation with Self Score Distillation Hongyu Liu, Xuan Wang, Ziyu Wan, Yujun Shen, Yibing Song, Jing Liao, Qifeng Chen This work presents HeadArtist for 3D head generation from text descriptions. With a landmark-guided ControlNet serving as the generative prior, we come up with an efficient pipeline that optimizes a parameterized 3D head model under the supervision of the prior distillation itself. We call such a process self score distillation (SSD). In detail, given a sampled camera pose, we first render an image and its corresponding landmarks from the head model, and add some particular level of noise onto the image. The noisy image, landmarks, and text condition are then fed into the frozen ControlNet twice for noise prediction. Two different classifier-free guidance (CFG) weights are applied during these two predictions, and the prediction difference offers a direction on how the rendered image can better match the text of interest. Experimental results suggest that our approach delivers high-quality 3D head sculptures with adequate geometry and photorealistic appearance, significantly outperforming state-of-the-art methods. We also show that the same pipeline well supports editing the generated heads, including both geometry deformation and appearance change. HeadArtist: a novel pipeline for generating and editing 3D heads from text descriptions, leveraging self-score distillation (SSD) within a landmark-guided ControlNet framework. 3D head avatars are crucial for various applications, but existing methods struggle with limitations like over-saturation, over-smoothing, and multi-face Janus artifacts. HeadArtist disentangles geometry and texture generation. It employs a landmark-guided ControlNet with SSD to optimize a parameterized 3D head model, minimizing the score difference between predicted noise distributions representing generated and target heads. Generates high-quality 3D heads with intricate geometry and photorealistic textures, outperforming state-of-the-art methods. Effectively addresses issues like multi-face Janus artifacts and over-saturation common in previous methods. Enables 3D head editing, manipulating geometry and texture while preserving character identity. Currently cannot achieve photorealism on par with 3D reconstruction or GAN-based methods. Struggles with generating complex characters, particularly those from Japanese animation, due to limitations of the FLAME initialization and the diffusion model. 3d head generation, text-guided synthesis, self score distillation, controlnet, 3d head editing
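The self-score-distillation update can be sketched as two noise predictions from the frozen landmark-guided ControlNet under different CFG weights, whose difference drives the update of the rendered head; `controlnet_eps` and the guidance weights are illustrative placeholders, not the paper's exact settings.

```python
# Self-score-distillation sketch: predict noise twice for the same noisy render
# under two CFG weights and use the gap as the update direction.
import torch

def ssd_grad(rendered, landmarks, text_emb, t, alphas_cumprod, controlnet_eps,
             w_strong=7.5, w_weak=1.0):
    noise = torch.randn_like(rendered)
    a_t = alphas_cumprod[t]
    noisy = a_t.sqrt() * rendered + (1 - a_t).sqrt() * noise
    eps_strong = controlnet_eps(noisy, t, landmarks, text_emb, cfg=w_strong)
    eps_weak = controlnet_eps(noisy, t, landmarks, text_emb, cfg=w_weak)
    return eps_strong - eps_weak      # back-propagated to the 3D head parameters
```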
2312.07537 Report FreeInit: Bridging Initialization Gap in Video Diffusion Models Tianxing Wu, Chenyang Si, Yuming Jiang, Ziqi Huang, Ziwei Liu Though diffusion-based video generation has witnessed rapid progress, the inference results of existing models still exhibit unsatisfactory temporal consistency and unnatural dynamics. In this paper, we delve deep into the noise initialization of video diffusion models, and discover an implicit training-inference gap that attributes to the unsatisfactory inference quality. Our key findings are: 1) the spatial-temporal frequency distribution of the initial latent at inference is intrinsically different from that for training, and 2) the denoising process is significantly influenced by the low-frequency components of the initial noise. Motivated by these observations, we propose a concise yet effective inference sampling strategy, FreeInit, which significantly improves temporal consistency of videos generated by diffusion models. Through iteratively refining the spatial-temporal low-frequency components of the initial latent during inference, FreeInit is able to compensate the initialization gap between training and inference, thus effectively improving the subject appearance and temporal consistency of generation results. Extensive experiments demonstrate that FreeInit consistently enhances the generation results of various text-to-video generation models without additional training. This paper identifies an implicit training-inference gap in video diffusion models' noise initialization and proposes FreeInit, an iterative method refining the initial latent's low-frequency component during inference to enhance temporal consistency in generated videos. Existing video diffusion models suffer from poor temporal consistency and unnatural dynamics in generated videos due to a discrepancy between training and inference noise initialization. FreeInit iteratively refines initial noise by combining low-frequency components of generated noisy latents with high-frequency components of random Gaussian noise, bridging the gap between training and inference. FreeInit significantly improves temporal consistency across various text-to-video models as measured by DINO metric. Qualitative analysis shows enhanced subject appearance and reduced temporal artifacts in generated videos. Ablation studies confirm the importance of noise reinitialization and appropriate filter selection for optimal performance. FreeInit increases inference time, potentially mitigated by coarse-to-fine sampling strategies. Small, fast-moving objects may be distorted due to emphasis on low-frequency consistency. video generation, diffusion models, temporal consistency, noise initialization, frequency domain analysis
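The noise reinitialization step can be sketched as mixing the low spatio-temporal frequencies of the previous round's diffused latent with the high frequencies of fresh Gaussian noise in Fourier space; the box-shaped low-pass filter and cutoff below are assumptions (the paper compares several filter choices).

```python
# FreeInit-style noise reinitialization sketch over a video latent [B, C, T, H, W].
import torch

def reinit_noise(noisy_latent, cutoff=0.25):
    dims = (-3, -2, -1)
    freq = torch.fft.fftshift(torch.fft.fftn(noisy_latent, dim=dims), dim=dims)
    T, H, W = noisy_latent.shape[-3:]
    dev = noisy_latent.device
    ft = torch.linspace(-1, 1, T, device=dev).view(T, 1, 1)
    fh = torch.linspace(-1, 1, H, device=dev).view(1, H, 1)
    fw = torch.linspace(-1, 1, W, device=dev).view(1, 1, W)
    lowpass = ((ft.abs() <= cutoff) & (fh.abs() <= cutoff) & (fw.abs() <= cutoff)).float()
    rand = torch.randn_like(noisy_latent)
    rand_freq = torch.fft.fftshift(torch.fft.fftn(rand, dim=dims), dim=dims)
    mixed = freq * lowpass + rand_freq * (1.0 - lowpass)   # low freq from latent, high from noise
    return torch.fft.ifftn(torch.fft.ifftshift(mixed, dim=dims), dim=dims).real
```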
2312.07536 Report FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition Sicheng Mo, Fangzhou Mu, Kuan Heng Lin, Yanli Liu, Bochen Guan, Yin Li, Bolei Zhou Recent approaches such as ControlNet offer users fine-grained spatial control over text-to-image (T2I) diffusion models. However, auxiliary modules have to be trained for each type of spatial condition, model architecture, and checkpoint, putting them at odds with the diverse intents and preferences a human designer would like to convey to the AI models during the content creation process. In this work, we present FreeControl, a training-free approach for controllable T2I generation that supports multiple conditions, architectures, and checkpoints simultaneously. FreeControl designs structure guidance to facilitate the structure alignment with a guidance image, and appearance guidance to enable the appearance sharing between images generated using the same seed. Extensive qualitative and quantitative experiments demonstrate the superior performance of FreeControl across a variety of pre-trained T2I models. In particular, FreeControl facilitates convenient training-free control over many different architectures and checkpoints, handles challenging input conditions on which most of the existing training-free methods fail, and achieves competitive synthesis quality with training-based approaches. FreeControl, a training-free method for controllable text-to-image (T2I) generation that supports multiple conditions, architectures, and checkpoints simultaneously. Existing methods for controlling pre-trained T2I diffusion models require training an auxiliary module for each type of spatial condition, model architecture, and checkpoint, leading to high training cost, poor scalability and limited control signals. FreeControl designs structure guidance to facilitate the structure alignment with a guidance image by modeling the subspace of features in T2I models, and appearance guidance to enable the appearance sharing between images generated using the same seed. FreeControl supports a wide array of control conditions including challenging ones like 2D projections of point clouds and meshes, model architectures such as SD 1.5, 2.1, SD-XL 1.0, and customized checkpoints. FreeControl demonstrates superior results compared to previous training-free methods and achieves competitive performance with prior training-based approaches. FreeControl can be readily adapted for text-guided image-to-image translation. FreeControl relies on DDIM inversion for feature extraction and gradient computation, leading to increased inference time. FreeControl relies on a low-resolution encoding of the guidance image, sometimes failing to recognize inputs with missing structure or accurately locate fine details. text-to-image generation, controllable generation, diffusion models, training-free methods, image-to-image translation
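The structure-guidance term can be sketched as a feature-subspace matching loss: intermediate UNet features of the current sample and of the inverted guidance image are projected onto a small PCA basis, and the guidance gradient penalizes their mismatch. The hooked features and precomputed basis are assumed inputs, and the appearance-guidance term is omitted.

```python
# Structure-guidance sketch: compare "semantic coordinates" of sample and
# guidance features in a precomputed PCA subspace.
import torch
import torch.nn.functional as F

def structure_guidance_loss(feat_gen, feat_guide, pca_basis):
    # feat_*: [N, D] flattened spatial features; pca_basis: [D, k]
    s_gen = feat_gen @ pca_basis
    s_guide = (feat_guide @ pca_basis).detach()
    return F.mse_loss(s_gen, s_guide)     # gradient steers the current latent
```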
2312.07533 Report VILA: On Pre-training for Visual Language Models Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, Song Han Visual language models (VLMs) rapidly progressed with the recent success of large language models. There have been growing efforts on visual instruction tuning to extend the LLM with visual inputs, but they lack an in-depth study of the visual language pre-training process, where the model learns to perform joint modeling on both modalities. In this work, we examine the design options for VLM pre-training by augmenting LLM towards VLM through step-by-step controllable comparisons. We introduce three main findings: (1) freezing LLMs during pre-training can achieve decent zero-shot performance, but lack in-context learning capability, which requires unfreezing the LLM; (2) interleaved pre-training data is beneficial whereas image-text pairs alone are not optimal; (3) re-blending text-only instruction data to image-text data during instruction fine-tuning not only remedies the degradation of text-only tasks, but also boosts VLM task accuracy. With an enhanced pre-training recipe we build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models, e.g., LLaVA-1.5, across main benchmarks without bells and whistles. Multi-modal pre-training also helps unveil appealing properties of VILA, including multi-image reasoning, enhanced in-context learning, and better world knowledge. This paper investigates and presents an enhanced pre-training recipe for auto-regressive Visual Language Models (VLMs), aiming to augment Large Language Models (LLMs) for improved visual understanding and reasoning. Existing VLM research primarily focuses on instruction tuning, neglecting the crucial visual language pre-training stage. This paper addresses this gap by exploring design choices for effective VLM pre-training, which is vital for modality alignment and inheriting beneficial LLM properties like in-context learning. The authors conduct controlled experiments, ablating design choices related to LLM training, visual language corpus selection (interleaved vs. image-text pairs), and data blending during pre-training and instruction tuning. They analyze the impact of these choices on downstream task performance (VQA, captioning, text-only tasks) and provide insights into embedding alignment. Updating LLMs during pre-training is crucial for enabling in-context learning capabilities in VLMs, leading to improved performance on few-shot tasks. Interleaved image-text corpora are superior to image-text pairs for pre-training, preserving text-only capabilities of LLMs and facilitating visual in-context learning. Joint instruction tuning with both visual and text data remedies the degradation of text-only tasks while boosting VLM task accuracy. The study is limited by computational resources, preventing exploration of billion-scale pre-training data. Future work includes scaling up the pre-training corpus, optimizing training throughput, and investigating token compression techniques for visual inputs. visual language model, vlm pre-training, multi-modal learning, in-context learning, large language models
2312.07532 Report Interfacing Foundation Models' Embeddings Xueyan Zou, Linjie Li, Jianfeng Wang, Jianwei Yang, Mingyu Ding, Zhengyuan Yang, Feng Li, Hao Zhang, Shilong Liu, Arul Aravinthan, Yong Jae Lee, Lijuan Wang We present FIND, a generalized interface for aligning foundation models' embeddings. As shown in the teaser figure, a lightweight transformer interface without tuning any foundation model weights is enough for a unified image (segmentation) and dataset-level (retrieval) understanding. The proposed interface has the following favorable attributes: (1) Generalizable. It applies to various tasks spanning retrieval, segmentation, etc., under the same architecture and weights. (2) Prototypable. Different tasks are able to be implemented through prototyping attention masks and embedding types. (3) Extendable. The proposed interface is adaptive to new tasks, and new models. (4) Interleavable. With the benefit of multi-task multi-modal training, the proposed interface creates an interleaved shared embedding space. In light of the interleaved embedding space, we introduce the FIND-Bench, which introduces new training and evaluation annotations to the COCO dataset for interleave segmentation and retrieval. Our approach achieves state-of-the-art performance on FIND-Bench and competitive performance on standard retrieval and segmentation settings. The training, evaluation, and demo code as well as the dataset have been released at https://github.com/UX-Decoder/FIND. This paper presents FIND, a generalized interface for aligning foundation model embeddings across modalities (vision and language) and granularities (pixel to image). Training individual foundation models is costly and their full potential is limited by fixed output modalities and task objectives. FIND offers a more efficient and flexible approach by interfacing existing models. FIND leverages a lightweight transformer interface with frozen pre-trained foundation models. It employs task-adaptive prototyping through configurable attention masks and embedding types to align vision and language embeddings. FIND achieves state-of-the-art performance on the proposed FIND-Bench for interleaved image retrieval and segmentation. It exhibits competitive performance on standard benchmarks for generic, interactive, and grounded segmentation, as well as image-text retrieval. FIND demonstrates strong generalization capability, effectively handling out-of-domain images and complex language descriptions. The current implementation requires training with a fixed resolution across all tasks, potentially limiting performance on certain tasks like image-text retrieval. Future work includes incorporating novel foundation models, exploring more cross-modal tasks, extending to longer contexts, and enabling more flexible object query granularities. foundation models, multi-modal learning, image segmentation, image retrieval, interleaved understanding
2312.07509 Report PEEKABOO: Interactive Video Generation via Masked-Diffusion Yash Jain, Anshul Nasery, Vibhav Vineet, Harkirat Behl Modern video generation models like Sora have achieved remarkable success in producing high-quality videos. However, a significant limitation is their inability to offer interactive control to users, a feature that promises to open up unprecedented applications and creativity. In this work, we introduce the first solution to equip diffusion-based video generation models with spatio-temporal control. We present Peekaboo, a novel masked attention module, which seamlessly integrates with current video generation models offering control without the need for additional training or inference overhead. To facilitate future research, we also introduce a comprehensive benchmark for interactive video generation. This benchmark offers a standardized framework for the community to assess the efficacy of emerging interactive video generation models. Our extensive qualitative and quantitative assessments reveal that Peekaboo achieves up to a 3.8x improvement in mIoU over baseline models, all while maintaining the same latency. Code and benchmark are available on the webpage. This paper presents Peekaboo, a training-free method to add spatio-temporal control to off-the-shelf diffusion-based video generation models, allowing users to control object size, location, and trajectory using masks. Interactive control in video generation is crucial for user creativity and various applications like education and entertainment, but existing models lack this feature or require expensive retraining. Peekaboo introduces a masked attention module integrated into existing video generation models, focusing spatial, cross, and temporal attention on local contexts defined by user-provided masks without retraining or significant inference overhead. Peekaboo achieves up to 3.8x improvement in mIoU over baseline models, demonstrating superior spatial control. It maintains high video generation quality, even surpassing baselines in some cases, as shown by FVD scores. The method is versatile, working with different T2V models (ZeroScope, ModelScope) and applicable to text-to-image models. The performance depends on the base model's capabilities and can inherit its biases. Mismatch between input masks and text prompts, like contradicting motion directions, can lead to failures. video generation, interactive control, diffusion models, spatio-temporal control, zero-training
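The masked-attention idea can be sketched as a bias on cross-attention scores that prevents latent positions outside the user's region from attending to the controlled object's text tokens; the shapes and the token/region pairing below are illustrative, not Peekaboo's exact module.

```python
# Masked cross-attention sketch: outside pixels get a -inf bias toward the
# controlled object's text tokens before the softmax.
import torch

def masked_cross_attention(q, k, v, region_mask, obj_token_ids):
    # q: [B, Nq, D] latent queries; k, v: [B, Nt, D] text tokens
    # region_mask: [B, Nq] bool, True inside the user-drawn region
    B, Nq, D = q.shape
    Nt = k.shape[1]
    scores = q @ k.transpose(1, 2) / D ** 0.5                 # [B, Nq, Nt]
    block = torch.zeros(B, Nq, Nt, dtype=torch.bool, device=q.device)
    block[:, :, obj_token_ids] = ~region_mask.unsqueeze(-1)   # outside pixels ...
    scores = scores.masked_fill(block, float("-inf"))         # ... ignore object tokens
    return torch.softmax(scores, dim=-1) @ v
```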
2312.07504 Report COLMAP-Free 3D Gaussian Splatting Yang Fu, Sifei Liu, Amey Kulkarni, Jan Kautz, Alexei A. Efros, Xiaolong Wang While neural rendering has led to impressive advances in scene reconstruction and novel view synthesis, it relies heavily on accurately pre-computed camera poses. To relax this constraint, multiple efforts have been made to train Neural Radiance Fields (NeRFs) without pre-processed camera poses. However, the implicit representations of NeRFs provide extra challenges to optimize the 3D structure and camera poses at the same time. On the other hand, the recently proposed 3D Gaussian Splatting provides new opportunities given its explicit point cloud representations. This paper leverages both the explicit geometric representation and the continuity of the input video stream to perform novel view synthesis without any SfM preprocessing. We process the input frames in a sequential manner and progressively grow the 3D Gaussians set by taking one input frame at a time, without the need to pre-compute the camera poses. Our method significantly improves over previous approaches in view synthesis and camera pose estimation under large motion changes. Our project page is https://oasisyang.github.io/colmap-free-3dgs This paper presents COLMAP-Free 3D Gaussian Splatting (CF-3DGS), a novel method for performing novel view synthesis without relying on pre-computed camera poses from SfM algorithms like COLMAP. Current neural rendering methods heavily depend on accurate camera poses, which are time-consuming to obtain and prone to errors. CF-3DGS addresses this limitation by jointly optimizing camera poses and scene reconstruction, enabling more flexible and robust view synthesis. CF-3DGS leverages the temporal continuity of videos and the explicit representation of 3D Gaussian Splatting. It processes frames sequentially, using a local 3DGS to estimate relative poses between nearby frames and a global 3DGS to progressively build and refine the scene representation. CF-3DGS achieves state-of-the-art novel view synthesis quality on Tanks and Temples and CO3D datasets, outperforming previous pose-unknown methods. It demonstrates robust camera pose estimation, especially for challenging scenes with large camera motion like 360° videos in CO3D. The method is efficient, achieving fast training and inference speeds thanks to the advantages of Gaussian Splatting. The sequential optimization limits its application to ordered image sequences. Future work could explore extensions for unordered image collections. novel view synthesis, 3d gaussian splatting, pose estimation, neural rendering, sfm-free
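The sequential, SfM-free optimization can be sketched as a loop over frames; `LocalGaussians`, `global_model`, and their methods are hypothetical wrappers around a 3D Gaussian Splatting implementation, with poses represented as 4x4 camera-to-world matrices.

```python
# Sequential SfM-free loop sketch: estimate each new frame's pose against its
# predecessor with a small local Gaussian model, then grow the global model.
import numpy as np

def colmap_free_3dgs(frames, global_model, LocalGaussians,
                     steps_local=300, steps_global=100):
    poses = [np.eye(4)]                                 # first frame fixes the world frame
    for i in range(1, len(frames)):
        local = LocalGaussians.from_frame(frames[i - 1], pose=poses[-1])
        rel = local.fit_relative_pose(frames[i], num_steps=steps_local)   # 4x4 relative pose
        poses.append(poses[-1] @ rel)
        global_model.add_frame(frames[i], poses[-1])
        global_model.optimize(num_steps=steps_global)   # refine Gaussians (and poses)
    return global_model, poses
```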
2312.07409 Report DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing Kaiwen Zhang, Yifan Zhou, Xudong Xu, Xingang Pan, Bo Dai Diffusion models have achieved remarkable image generation quality surpassing previous generative models. However, a notable limitation of diffusion models, in comparison to GANs, is their difficulty in smoothly interpolating between two image samples, due to their highly unstructured latent space. Such a smooth interpolation is intriguing as it naturally serves as a solution for the image morphing task with many applications. In this work, we present DiffMorpher, the first approach enabling smooth and natural image interpolation using diffusion models. Our key idea is to capture the semantics of the two images by fitting two LoRAs to them respectively, and interpolate between both the LoRA parameters and the latent noises to ensure a smooth semantic transition, where correspondence automatically emerges without the need for annotation. In addition, we propose an attention interpolation and injection technique and a new sampling schedule to further enhance the smoothness between consecutive images. Extensive experiments demonstrate that DiffMorpher achieves starkly better image morphing effects than previous methods across a variety of object categories, bridging a critical functional gap that distinguished diffusion models from GANs. This paper introduces DiffMorpher, a novel approach that enables smooth and natural image interpolation using pre-trained diffusion models, effectively bridging a key functional gap between diffusion models and GANs in image morphing. Diffusion models excel in image generation but struggle with smooth interpolation between images, a task where GANs have traditionally excelled. This work addresses this limitation, opening new possibilities for diffusion models in applications requiring smooth image transitions, such as animations and image editing. DiffMorpher leverages LoRAs to capture the semantics of two input images, interpolating between their LoRA parameters and latent noises. It also employs attention interpolation and replacement for smooth texture transitions, AdaIN adjustment for color and brightness consistency, and a new sampling schedule for uniform content transition speed. DiffMorpher significantly outperforms previous image morphing methods, including GAN-based techniques, in terms of image fidelity, semantic consistency, and transition smoothness. Quantitative evaluation on the newly introduced MorphBench dataset confirms the superiority of DiffMorpher in achieving smooth and natural image morphing. A user study further substantiates the effectiveness of DiffMorpher, showing a clear preference for its results over those from baseline methods. The need to train a LoRA for each input image adds computational overhead. DiffMorpher may struggle with morphing images that lack clear correspondence. image morphing, diffusion models, lora, attention control, image interpolation
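The two interpolations the method relies on can be sketched directly: linear interpolation of the two fitted LoRA weight dictionaries and spherical interpolation (slerp) of the two inverted latent noises. The flat state-dict layout is an assumption.

```python
# Interpolation sketch for morphing: lerp the LoRA weights, slerp the latents.
import torch

def lerp_lora(lora_a, lora_b, alpha):
    return {k: (1 - alpha) * lora_a[k] + alpha * lora_b[k] for k in lora_a}

def slerp(z_a, z_b, alpha, eps=1e-7):
    a, b = z_a.flatten(), z_b.flatten()
    cos = (a @ b / (a.norm() * b.norm() + eps)).clamp(-1 + eps, 1 - eps)
    omega = torch.acos(cos)
    so = torch.sin(omega)
    return (torch.sin((1 - alpha) * omega) / so) * z_a + (torch.sin(alpha * omega) / so) * z_b
```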
2312.07315 Report NVS-Adapter: Plug-and-Play Novel View Synthesis from a Single Image Yoonwoo Jeong, Jinwoo Lee, Chiheon Kim, Minsu Cho, Doyup Lee Transfer learning of large-scale Text-to-Image (T2I) models has recently shown impressive potential for Novel View Synthesis (NVS) of diverse objects from a single image. While previous methods typically train large models on multi-view datasets for NVS, fine-tuning all parameters of T2I models not only demands a high cost but also reduces the generalization capacity of T2I models in generating diverse images in a new domain. In this study, we propose an effective method, dubbed NVS-Adapter, which is a plug-and-play module for a T2I model, to synthesize novel multi-views of visual objects while fully exploiting the generalization capacity of T2I models. NVS-Adapter consists of two main components; view-consistency cross-attention learns the visual correspondences to align the local details of view features, and global semantic conditioning aligns the semantic structure of generated views with the reference view. Experimental results demonstrate that the NVS-Adapter can effectively synthesize geometrically consistent multi-views and also achieve high performance on benchmarks without full fine-tuning of T2I models. The code and data are publicly available at https://postech-cvlab.github.io/nvsadapter/. Proposes NVS-Adapter, a plug-and-play module for Text-to-Image (T2I) models, to synthesize novel multi-views of objects from a single image while preserving the T2I model's ability to generate diverse images. Fine-tuning large T2I models for Novel View Synthesis (NVS) is costly and can reduce their generalization ability in new domains. NVS-Adapter addresses this by adapting T2I models for NVS without full fine-tuning. NVS-Adapter, integrated into a pretrained T2I model, uses two main components: (1) View-consistency cross-attention: learns visual correspondences between views to align local details. (2) Global semantic conditioning: aligns the semantic structure of generated views with the reference view. Synthesizes geometrically consistent multi-views from a single image. Achieves competitive performance on Objaverse and Google Scanned Objects datasets without full fine-tuning. Demonstrates compatibility with other plug-and-play modules like ControlNets, enhancing NVS performance further. Limited capacity in handling a large number of target views simultaneously. Reliance on Score Distillation Sampling (SDS) for 3D reconstruction, which can be computationally expensive. novel view synthesis, text-to-image, transfer learning, diffusion models, cross-attention
2312.07231 Report Fast Training of Diffusion Transformer with Extreme Masking for 3D Point Clouds Generation Shentong Mo, Enze Xie, Yue Wu, Junsong Chen, Matthias Nießner, Zhenguo Li Diffusion Transformers have recently shown remarkable effectiveness in generating high-quality 3D point clouds. However, training voxel-based diffusion models for high-resolution 3D voxels remains prohibitively expensive due to the cubic complexity of attention operators, which arises from the additional dimension of voxels. Motivated by the inherent redundancy of 3D compared to 2D, we propose FastDiT-3D, a novel masked diffusion transformer tailored for efficient 3D point cloud generation, which greatly reduces training costs. Specifically, we draw inspiration from masked autoencoders to dynamically operate the denoising process on masked voxelized point clouds. We also propose a novel voxel-aware masking strategy to adaptively aggregate background/foreground information from voxelized point clouds. Our method achieves state-of-the-art performance with an extreme masking ratio of nearly 99%. Moreover, to improve multi-category 3D generation, we introduce Mixture-of-Expert (MoE) in 3D diffusion model. Each category can learn a distinct diffusion path with different experts, relieving gradient conflict. Experimental results on the ShapeNet dataset demonstrate that our method achieves state-of-the-art high-fidelity and diverse 3D point cloud generation performance. Our FastDiT-3D improves 1-Nearest Neighbor Accuracy and Coverage metrics when generating 128-resolution voxel point clouds, using only 6.5% of the original training cost. Presents FastDiT-3D, a fast diffusion transformer for efficient 3D point cloud generation that performs denoising on masked voxelized point clouds, achieving state-of-the-art performance at a significantly reduced training cost. Addresses the prohibitively expensive training of voxel-based diffusion models for high-resolution 3D point clouds due to the cubic complexity of attention operators. Employs a novel foreground-background aware masking strategy for efficient encoding and integrates Mixture of Expert (MoE) layers within Transformer blocks for multi-category adaptation. Achieves state-of-the-art performance in generating high-fidelity and diverse 3D point clouds across categories on the ShapeNet dataset. Significantly reduces training costs to 6.5% of the original cost for 128-resolution voxel point cloud generation. Demonstrates the effectiveness of voxel-aware masking, 3D window attention, and MoE for efficient and high-quality 3D point cloud generation. Exploration of explicit text control for 3D shape generation is left for future work. Scaling FastDiT-3D to large-scale text-3D datasets for text-to-3D generation is a potential future direction. 3d point cloud generation, diffusion models, transformers, masked modeling, mixture of experts
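The foreground/background-aware masking can be sketched as two different keep rates over voxel-patch tokens; the specific ratios and token layout below are illustrative, not the paper's configuration.

```python
# Voxel-aware masking sketch: keep occupied (foreground) tokens at a much
# higher rate than empty background tokens, so the diffusion transformer mostly
# sees informative tokens at an extreme overall masking ratio.
import torch

def voxel_aware_keep_mask(occupancy, keep_fg=0.10, keep_bg=0.01):
    # occupancy: [N_tokens] bool, True where the voxel patch contains points.
    rand = torch.rand_like(occupancy, dtype=torch.float32)
    return torch.where(occupancy, rand < keep_fg, rand < keep_bg)   # tokens to keep
```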
2312.07133 Report Text2AC-Zero: Consistent Synthesis of Animated Characters using 2D Diffusion Abdelrahman Eldesokey, Peter Wonka We propose a zero-shot approach for consistent Text-to-Animated-Characters synthesis based on pre-trained Text-to-Image (T2I) diffusion models. Existing Text-to-Video (T2V) methods are expensive to train and require large-scale video datasets to produce diverse characters and motions. At the same time, their zero-shot alternatives fail to produce temporally consistent videos. We strive to bridge this gap, and we introduce a zero-shot approach that produces temporally consistent videos of animated characters and requires no training or fine-tuning. We leverage existing text-based motion diffusion models to generate diverse motions that we utilize to guide a T2I model. To achieve temporal consistency, we introduce the Spatial Latent Alignment module that exploits cross-frame dense correspondences that we compute to align the latents of the video frames. Furthermore, we propose Pixel-Wise Guidance to steer the diffusion process in a direction that minimizes visual discrepancies. Our proposed approach generates temporally consistent videos with diverse motions and styles, outperforming existing zero-shot T2V approaches in terms of pixel-wise consistency and user preference. This paper introduces a zero-shot approach for generating temporally consistent videos of animated characters using pre-trained Text-to-Image (T2I) diffusion models and text-based motion diffusion models. Existing Text-to-Video (T2V) methods are computationally expensive to train, require large-scale video datasets, and their zero-shot alternatives fail to produce temporally consistent videos. The proposed approach leverages text-based motion diffusion models to generate motion sequences, which are then used to guide a pre-trained T2I model. They introduce a Spatial Latent Alignment module to align latent codes between video frames based on cross-frame dense correspondences and a Pixel-Wise Guidance strategy to refine details and further enhance temporal consistency. The proposed approach outperforms existing zero-shot T2V approaches in terms of pixel-wise consistency as measured by the introduced Human Mean Squared Error metric. User studies show a strong preference for videos generated by the proposed method compared to baselines. The approach allows for control over character motion and style, enabling the generation of videos for scenarios that trained T2V models struggle with. The method relies on the accuracy of ControlNet with depth conditioning and can inherit its limitations. The Pixel-Wise Guidance module, while effective, is computationally demanding in terms of GPU memory usage. text-to-video synthesis, diffusion models, zero-shot learning, temporal consistency, animated characters
2312.07063 Report Template Free Reconstruction of Human-object Interaction with Procedural Interaction Generation Xianghui Xie, Bharat Lal Bhatnagar, Jan Eric Lenssen, Gerard Pons-Moll Reconstructing human-object interaction in 3D from a single RGB image is a challenging task and existing data driven methods do not generalize beyond the objects present in the carefully curated 3D interaction datasets. Capturing large-scale real data to learn strong interaction and 3D shape priors is very expensive due to the combinatorial nature of human-object interactions. In this paper, we propose ProciGen (Procedural interaction Generation), a method to procedurally generate datasets with both, plausible interaction and diverse object variation. We generate 1M+ human-object interaction pairs in 3D and leverage this large-scale data to train our HDM (Hierarchical Diffusion Model), a novel method to reconstruct interacting human and unseen objects, without any templates. Our HDM is an image-conditioned diffusion model that learns both realistic interaction and highly accurate human and object shapes. Experiments show that our HDM trained with ProciGen significantly outperforms prior methods that requires template meshes and that our dataset allows training methods with strong generalization ability to unseen object instances. Our code and data are released. This paper proposes ProciGen, a procedural interaction generation method, and HDM, a hierarchical diffusion model, to reconstruct human-object interactions in 3D from a single RGB image without object templates. Existing data-driven methods struggle to generalize beyond curated datasets due to the vast number of possible object shapes and interaction variations. Capturing real data at scale is expensive, creating a need for scalable synthetic data generation. ProciGen establishes dense correspondences between objects of the same category to transfer contact points from captured interactions to new object instances. It then jointly optimizes human and object poses to ensure plausible interactions. HDM uses a two-stage diffusion process, first jointly reconstructing human and object point clouds with segmentation labels, then refining them with separate diffusion models incorporating cross-attention to preserve interaction context. ProciGen generates a dataset of over 1 million interaction images with 21k+ objects paired with 3D ground truth. HDM trained with ProciGen outperforms template-based methods like CHORE and template-free methods like PC2 on BEHAVE and InterCap datasets. Models trained on ProciGen demonstrate strong generalization to unseen objects, even generalizing to in-the-wild images from the COCO dataset. The diversity of interaction poses in ProciGen is limited by the seed poses from existing datasets. HDM struggles to reconstruct accurate human shapes when large portions of the body are occluded. human-object interaction, 3d reconstruction, diffusion models, synthetic data generation, template-free
2312.06971 Report CCM: Adding Conditional Controls to Text-to-Image Consistency Models Jie Xiao, Kai Zhu, Han Zhang, Zhiheng Liu, Yujun Shen, Yu Liu, Xueyang Fu, Zheng-Jun Zha Consistency Models (CMs) have shown promise in creating visual content efficiently and with high quality. However, the way to add new conditional controls to the pretrained CMs has not been explored. In this technical report, we consider alternative strategies for adding ControlNet-like conditional control to CMs and present three significant findings. 1) ControlNet trained for diffusion models (DMs) can be directly applied to CMs for high-level semantic controls but struggles with low-level detail and realism control. 2) CMs serve as an independent class of generative models, based on which ControlNet can be trained from scratch using Consistency Training proposed by Song et al. 3) A lightweight adapter can be jointly optimized under multiple conditions through Consistency Training, allowing for the swift transfer of DMs-based ControlNet to CMs. We study these three solutions across various conditional controls, including edge, depth, human pose, low-resolution image and masked image with text-to-image latent consistency models. This paper explores and compares different strategies for adding ControlNet-like conditional control to Consistency Models (CMs) for image generation. CMs are efficient for image generation, but how to effectively add new conditional controls to pretrained CMs remained unexplored. The paper investigates three solutions: 1) Directly applying ControlNet trained on diffusion models (DMs) to CMs. 2) Training ControlNet from scratch using consistency training on CMs. 3) Using consistency training to optimize a lightweight adapter for transferring DMs-based ControlNet to CMs. Directly applied DM ControlNet can transfer high-level semantic control to CMs but struggles with low-level details and realism. ControlNet can be successfully trained from scratch on CMs using consistency training, achieving better conditional generation. A lightweight adapter trained with consistency training can effectively bridge the gap between DMs and CMs, improving the transferability of ControlNet. The study primarily focuses on visual quality without quantitative comparisons. Future work could explore more sophisticated adapter architectures or training strategies for improved transfer learning. consistency models, controlnet, image generation, conditional image synthesis, transfer learning
2312.06947 Report MaTe3D: Mask-guided Text-based 3D-aware Portrait Editing Kangneng Zhou, Daiheng Gao, Xuan Wang, Jie Zhang, Peng Zhang, Xusen Sun, Longhao Zhang, Shiqi Yang, Bang Zhang, Liefeng Bo, Yaxing Wang, Ming-Ming Cheng 3D-aware portrait editing has a wide range of applications in multiple fields. However, current approaches are limited in that they can only perform mask-guided or text-based editing. Even by fusing the two procedures into a model, the editing quality and stability cannot be ensured. To address this limitation, we propose MaTe3D: mask-guided text-based 3D-aware portrait editing. In this framework, first, we introduce a new SDF-based 3D generator which learns local and global representations with proposed SDF and density consistency losses. This enhances mask-based editing in local areas; second, we present a novel distillation strategy: Conditional Distillation on Geometry and Texture (CDGT). Compared to existing distillation strategies, it mitigates visual ambiguity and avoids mismatch between texture and geometry, thereby producing stable texture and convincing geometry while editing. Additionally, we create the CatMask-HQ dataset, a large-scale, high-resolution cat face annotation dataset for exploring model generalization and expansion. We perform extensive experiments on both the FFHQ and CatMask-HQ datasets to demonstrate the editing quality and stability of the proposed method. Our method faithfully generates a 3D-aware edited face image based on a modified mask and a text prompt. Our code and models will be publicly released. Proposes MaTe3D, a novel framework for mask-guided text-based 3D-aware portrait editing, enabling high-quality and stable manipulation of portraits using both masks and text prompts. Addresses limitations of existing 3D portrait editing methods that struggle to effectively combine mask-guided and text-based manipulation in a single model, often resulting in unstable texture or unconvincing geometry. Introduces a new SDF-based 3D generator with SDF and density consistency losses for accurate local and global representation learning. Develops Conditional Distillation on Geometry and Texture (CDGT) to iteratively refine masks and combine gradients from images and normal maps, ensuring stable texture and convincing geometry during editing. Achieves high-fidelity 3D portrait editing with accurate masks and text-driven modifications, outperforming existing methods in qualitative and quantitative comparisons. Demonstrates superior geometry reconstruction quality compared to IDE-3D, as evidenced by significantly lower Chamfer-L1 distances and higher normal consistency scores. Enables applications like real portrait editing, out-of-domain editing (e.g., adding animal textures to faces), and face swapping with celebrities. Image quality may slightly deteriorate when prioritizing geometry learning through SDFs. Editing process is more time-consuming than baseline methods due to the iterative optimization strategy. 3d portrait editing, mask-guided editing, text-guided editing, diffusion models, score distillation sampling
2312.06742 Report Honeybee: Locality-enhanced Projector for Multimodal LLM Junbum Cha, Wooyoung Kang, Jonghwan Mun, Byungseok Roh In Multimodal Large Language Models (MLLMs), a visual projector plays a crucial role in bridging pre-trained vision encoders with LLMs, enabling profound visual understanding while harnessing the LLMs' robust capabilities. Despite the importance of the visual projector, it has been relatively less explored. In this study, we first identify two essential projector properties: (i) flexibility in managing the number of visual tokens, crucial for MLLMs' overall efficiency, and (ii) preservation of local context from visual features, vital for spatial understanding. Based on these findings, we propose a novel projector design that is both flexible and locality-enhanced, effectively satisfying the two desirable properties. Additionally, we present comprehensive strategies to effectively utilize multiple and multifaceted instruction datasets. Through extensive experiments, we examine the impact of individual design choices. Finally, our proposed MLLM, Honeybee, remarkably outperforms previous state-of-the-art methods across various benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench, achieving significantly higher efficiency. Code and models are available at https://github.com/kakaobrain/honeybee. This paper proposes Honeybee, a Multimodal Large Language Model (MLLM) that features a novel locality-enhanced projector. This projector aims to bridge the gap between pre-trained vision encoders and LLMs, enhancing visual understanding and efficiency. Existing MLLMs often struggle with balancing efficiency and the preservation of local visual context. This work addresses these limitations to improve performance in tasks like spatial understanding. The authors introduce two types of locality-enhanced projectors: Convolutional Abstractor (C-Abstractor) and Deformable attention-based Abstractor (D-Abstractor). They also perform extensive experiments to investigate optimal strategies for utilizing and combining diverse instruction datasets. Locality-enhanced projectors demonstrate superior performance in spatial understanding tasks compared to traditional linear projectors and abstractors. Honeybee achieves state-of-the-art results on several MLLM benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench. The study highlights the importance of dataset diversity, balanced training, and fine-grained template selection in visual instruction tuning. The impact of further architectural variations in projectors beyond the explored designs remains to be investigated. Exploring advanced applications of techniques like LoRA for more efficient LLM training could be beneficial. multimodal large language models, visual instruction tuning, locality-enhanced projector, spatial understanding, instruction dataset utilization
2312.06739 Report SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, Ying Shan Current instruction-based editing methods, such as InstructPix2Pix, often fail to produce satisfactory results in complex scenarios due to their dependence on the simple CLIP text encoder in diffusion models. To rectify this, this paper introduces SmartEdit, a novel approach to instruction-based image editing that leverages Multimodal Large Language Models (MLLMs) to enhance their understanding and reasoning capabilities. However, direct integration of these elements still faces challenges in situations requiring complex reasoning. To mitigate this, we propose a Bidirectional Interaction Module that enables comprehensive bidirectional information interactions between the input image and the MLLM output. During training, we initially incorporate perception data to boost the perception and understanding capabilities of diffusion models. Subsequently, we demonstrate that a small amount of complex instruction editing data can effectively stimulate SmartEdit's editing capabilities for more complex instructions. We further construct a new evaluation dataset, Reason-Edit, specifically tailored for complex instruction-based image editing. Both quantitative and qualitative results on this evaluation dataset indicate that our SmartEdit surpasses previous methods, paving the way for the practical application of complex instruction-based image editing. SmartEdit is an instruction-based image editing model that leverages Multimodal Large Language Models (MLLMs) to enhance understanding and reasoning in complex editing scenarios. Existing methods struggle with complex instructions that involve multiple objects, specific attributes, or require world knowledge. SmartEdit addresses this limitation to improve the practicality of instruction-based editing. SmartEdit integrates an MLLM (LLaVA) with a diffusion model, using a novel Bidirectional Interaction Module (BIM) for enhanced image-text feature interaction. It is trained on a dataset combining editing data, segmentation data, and synthesized complex editing pairs. SmartEdit outperforms previous methods in complex understanding and reasoning scenarios, as shown on the newly collected Reason-Edit dataset. The BIM module proves crucial for enabling effective bidirectional information interaction. Joint training with diverse datasets, including synthetic complex editing data, significantly improves performance. Evaluation metrics like CLIP Score and PSNR/SSIM/LPIPS may not perfectly align with human perception of editing quality. Data synthesis for complex scenarios can be challenging and might benefit from further exploration of automatic generation methods. image editing, instruction-based editing, multimodal large language models, diffusion models, reasoning
2312.06731 Report Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator Henry Hengyuan Zhao, Pan Zhou, Mike Zheng Shou Multimodal Large Language Models (MLLMs) demonstrate exceptional problem-solving capabilities, but there is limited research focusing on their ability to generate data by converting unlabeled images into visual instruction tuning data. To this end, this paper is the first to explore the potential of empowering MLLM to generate data rather than prompting GPT-4. We introduce Genixer, a holistic data generation pipeline consisting of four key steps: (i) instruction data collection, (ii) instruction template design, (iii) empowering MLLMs, and (iv) data generation and filtering. Additionally, we outline two modes of data generation: task-agnostic and task-specific, enabling controllable output. We demonstrate that a synthetic VQA-like dataset trained with LLaVA1.5 enhances performance on 10 out of 12 multimodal benchmarks. Additionally, the grounding MLLM Shikra, when trained with a REC-like synthetic dataset, shows improvements on 7 out of 8 REC datasets. Through experiments and synthetic data analysis, our findings are: (1) current MLLMs can serve as robust data generators without assistance from GPT-4V; (2) MLLMs trained with task-specific datasets can surpass GPT-4V in generating complex instruction tuning data; (3) synthetic datasets enhance performance across various multimodal benchmarks and help mitigate model hallucinations. The data, code, and models can be found at https://github.com/zhaohengyuan1/Genixer. This paper introduces Genixer, an automatic data generation pipeline to produce high-quality instruction tuning data from unlabeled images using Multimodal Large Language Models (MLLMs). Current methods for creating visual instruction data for MLLMs are limited by either image diversity or the cost and capabilities of prompting GPT-4. Genixer consists of: (i) instruction data collection from various VL tasks, (ii) two-level instruction template design for task-specific/agnostic generation, (iii) empowering MLLMs (LLaVA1.5 and Shikra) for data generation, and (iv) automatic data filtering pipelines (Fuyu/CLIP-driven). MLLMs trained with Genixer can generate high-quality visual instruction tuning data comparable to GPT-4V without extra cost. MLLMs trained with Genixer outperform GPT-4V in generating complex instruction data for tasks like REC. Synthetic datasets from Genixer improve MLLM performance on various benchmarks and mitigate model hallucinations. The study is limited by computational constraints for testing larger LLM scales (e.g., 13B or 34B). Evaluating complex and open-ended data types like Referential Dialogue remains a challenge. multimodal large language model, instruction tuning, data generation, synthetic data, visual question answering
2312.06725 Report EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion Zehuan Huang, Hao Wen, Junting Dong, Yaohui Wang, Yangguang Li, Xinyuan Chen, Yan-Pei Cao, Ding Liang, Yu Qiao, Bo Dai, Lu Sheng Generating multiview images from a single view facilitates the rapid generation of a 3D mesh conditioned on a single image. Recent methods that introduce 3D global representation into diffusion models have shown the potential to generate consistent multiviews, but they have reduced generation speed and face challenges in maintaining generalizability and quality. To address this issue, we propose EpiDiff, a localized interactive multiview diffusion model. At the core of the proposed approach is to insert a lightweight epipolar attention block into the frozen diffusion model, leveraging epipolar constraints to enable cross-view interaction among feature maps of neighboring views. The newly initialized 3D modeling module preserves the original feature distribution of the diffusion model, exhibiting compatibility with a variety of base diffusion models. Experiments show that EpiDiff generates 16 multiview images in just 12 seconds, and it surpasses previous methods in quality evaluation metrics, including PSNR, SSIM and LPIPS. Additionally, EpiDiff can generate a more diverse distribution of views, improving the reconstruction quality from generated multiviews. Please see our project page at https://huanngzh.github.io/EpiDiff/. EpiDiff, a localized interactive multiview diffusion model for efficiently generating multi-view consistent and high-quality images from a single view. Generating multiview images from a single view is crucial for rapid 3D mesh generation but existing methods are slow or struggle to maintain quality and generalizability. A lightweight epipolar attention block is inserted into a frozen diffusion model (Zero123) to enable cross-view interaction among neighboring views using epipolar constraints. Generates 16 multi-view images in 12 seconds. Outperforms previous methods in PSNR, SSIM and LPIPS. Generates more diverse views, leading to better 3D reconstructions. Less effective for views far from the input view due to base model limitations. Two-step process (synthesis then reconstruction) could be unified. multi-view synthesis, diffusion models, epipolar geometry, 3d reconstruction, single-view reconstruction
2312.06713 Report TeTriRF: Temporal Tri-Plane Radiance Fields for Efficient Free-Viewpoint Video Minye Wu, Zehao Wang, Georgios Kouros, Tinne Tuytelaars Neural Radiance Fields (NeRF) revolutionize the realm of visual media by providing photorealistic Free-Viewpoint Video (FVV) experiences, offering viewers unparalleled immersion and interactivity. However, the technology's significant storage requirements and the computational complexity involved in generation and rendering currently limit its broader application. To close this gap, this paper presents Temporal Tri-Plane Radiance Fields (TeTriRF), a novel technology that significantly reduces the storage size for Free-Viewpoint Video (FVV) while maintaining low-cost generation and rendering. TeTriRF introduces a hybrid representation with tri-planes and voxel grids to support scaling up to long-duration sequences and scenes with complex motions or rapid changes. We propose a group training scheme tailored to achieving high training efficiency and yielding temporally consistent, low-entropy scene representations. Leveraging these properties of the representations, we introduce a compression pipeline with off-the-shelf video codecs, achieving an order of magnitude less storage size compared to the state-of-the-art. Our experiments demonstrate that TeTriRF can achieve competitive quality with a higher compression rate. Presents TeTriRF, a novel FVV modeling approach using Temporal Tri-Plane Radiance Fields for efficient generation and rendering with compact storage. Addresses limitations of existing NeRF-based FVV techniques that suffer from large storage requirements and high computational complexity, hindering their application in long-duration sequences and complex scenes. Introduces a hybrid representation (tri-planes and voxel grids) and a grouped multi-frame training scheme with intra- and inter-group regularization for temporally consistent and low-entropy representations. It leverages off-the-shelf video codecs (HEVC) for efficient compression. Achieves competitive rendering quality with significantly reduced storage (10-100 KB/frame) compared to state-of-the-art methods. Demonstrates superior time efficiency in both training and rendering, enabling real-time playback. Successfully handles long sequence FVV, effectively capturing intricate details in dynamic scenes with complex motions. Slight quality drop observed in some cases due to limitations of the utilized dataset. Future work includes exploring alternative video codecs and optimizing rendering for real-time performance on diverse devices using GLSL shaders. neural radiance fields, free-viewpoint video, data compression, hybrid representation, video encoding
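To make TeTriRF's hybrid tri-plane/voxel representation above more concrete, here is a minimal sketch (not the paper's actual code) of how a tri-plane feature query typically works: each 3D sample point is projected onto the XY, XZ, and YZ planes, the plane feature maps are bilinearly sampled, and the sampled features are combined before being decoded into density and color. The plane resolution, channel count, and the use of summation to combine planes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, pts):
    """Query a tri-plane representation at 3D points.

    planes: dict with keys 'xy', 'xz', 'yz', each a (1, C, H, W) feature map.
    pts:    (N, 3) points assumed to lie in [-1, 1]^3.
    Returns (N, C) features, summed over the three planes.
    """
    coords = {
        "xy": pts[:, [0, 1]],
        "xz": pts[:, [0, 2]],
        "yz": pts[:, [1, 2]],
    }
    feat = 0.0
    for key, plane in planes.items():
        # grid_sample expects a (1, H_out, W_out, 2) grid with values in [-1, 1].
        grid = coords[key].view(1, -1, 1, 2)
        sampled = F.grid_sample(plane, grid, mode="bilinear", align_corners=True)
        feat = feat + sampled.view(plane.shape[1], -1).t()  # (N, C)
    return feat

# Toy usage: 16-channel planes at 64x64 resolution, 1000 random query points.
planes = {k: torch.randn(1, 16, 64, 64) for k in ("xy", "xz", "yz")}
pts = torch.rand(1000, 3) * 2 - 1
print(sample_triplane(planes, pts).shape)  # torch.Size([1000, 16])
```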
2312.06712 Report Separate-and-Enhance: Compositional Finetuning for Text2Image Diffusion Models Zhipeng Bao, Yijun Li, Krishna Kumar Singh, Yu-Xiong Wang, Martial Hebert Despite recent significant strides achieved by diffusion-based Text-to-Image (T2I) models, current systems are still less capable of ensuring decent compositional generation aligned with text prompts, particularly for the multi-object generation. This work illuminates the fundamental reasons for such misalignment, pinpointing issues related to low attention activation scores and mask overlaps. While previous research efforts have individually tackled these issues, we assert that a holistic approach is paramount. Thus, we propose two novel objectives, the Separate loss and the Enhance loss, that reduce object mask overlaps and maximize attention scores, respectively. Our method diverges from conventional test-time-adaptation techniques, focusing on finetuning critical parameters, which enhances scalability and generalizability. Comprehensive evaluations demonstrate the superior performance of our model in terms of image realism, text-image alignment, and adaptability, notably outperforming prominent baselines. Ultimately, this research paves the way for T2I diffusion models with enhanced compositional capacities and broader applicability. This paper introduces Separate-and-Enhance, a compositional finetuning strategy for diffusion-based Text-to-Image (T2I) models to address the issue of compositional misalignment in image generation. Existing T2I models struggle to generate images with multiple objects accurately, often exhibiting misalignment between the generated image and the text prompt. This work aims to enhance the compositional capacity of these models, improving their ability to generate images with multiple objects that accurately reflect the input text. The authors propose two novel objectives: 1) Separate loss, which minimizes the overlap between attention masks of different objects, and 2) Enhance loss, which maximizes the attention activation scores for each object. They selectively finetune specific parameters, primarily the Key mapping functions in the cross-attention modules of the diffusion model, to optimize these objectives. The proposed Separate-and-Enhance method achieves superior text-image alignment and image realism compared to existing state-of-the-art T2I models. The method demonstrates scalability and effectiveness when trained on a large collection of concepts. The finetuned model exhibits strong generalization ability, effectively generating images from prompts containing unseen concept combinations. The model exhibits limitations in discerning the meaning of polysemous words. Future work could explore incorporating a more robust language model and implementing a more diverse training process to address the polysemy challenge. text-to-image synthesis, diffusion models, compositional generation, attention mechanisms, fine-tuning
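The Separate and Enhance objectives are only described at a high level above (reduce overlap between per-object attention masks, maximize per-object attention activation), so the following is a minimal sketch consistent with that description rather than the authors' formulation; the per-map normalization and the pairwise averaging are assumptions.

```python
import torch

def separate_and_enhance_losses(attn_maps):
    """Illustrative losses in the spirit of the description.

    attn_maps: (K, H, W) cross-attention maps, one per object token.
    Returns (separate_loss, enhance_loss): the first penalizes spatial overlap
    between pairs of object maps, the second rewards a strong peak activation
    for every object.
    """
    K = attn_maps.shape[0]
    maps = attn_maps.flatten(1)                                  # (K, H*W)
    maps = maps / (maps.amax(dim=1, keepdim=True) + 1e-8)        # scale each map to [0, 1]

    # Separate: mean pairwise overlap (inner product) between different objects.
    overlap = maps @ maps.t()                                    # (K, K)
    off_diag = overlap.sum() - overlap.diagonal().sum()
    separate_loss = off_diag / (K * (K - 1) * maps.shape[1])

    # Enhance: push each object's attention peak toward 1.
    enhance_loss = (1.0 - maps.amax(dim=1)).mean()
    return separate_loss, enhance_loss

# Toy usage with two 16x16 attention maps.
attn = torch.rand(2, 16, 16)
sep, enh = separate_and_enhance_losses(attn)
print(float(sep), float(enh))
```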
2312.06709 Report AM-RADIO: Agglomerative Vision Foundation Model -- Reduce All Domains Into One Mike Ranzinger, Greg Heinrich, Jan Kautz, Pavlo Molchanov A handful of visual foundation models (VFMs) have recently emerged as the backbones for numerous downstream tasks. VFMs like CLIP, DINOv2, SAM are trained with distinct objectives, exhibiting unique characteristics for various downstream tasks. We find that despite their conceptual differences, these models can be effectively merged into a unified model through multi-teacher distillation. We name this approach AM-RADIO (Agglomerative Model -- Reduce All Domains Into One). This integrative approach not only surpasses the performance of individual teacher models but also amalgamates their distinctive features, such as zero-shot vision-language comprehension, detailed pixel-level understanding, and open vocabulary segmentation capabilities. In pursuit of the most hardware-efficient backbone, we evaluated numerous architectures in our multi-teacher distillation pipeline using the same training recipe. This led to the development of a novel architecture (E-RADIO) that exceeds the performance of its predecessors and is at least 7x faster than the teacher models. Our comprehensive benchmarking process covers downstream tasks including ImageNet classification, ADE20k semantic segmentation, COCO object detection and LLaVa-1.5 framework. Code: https://github.com/NVlabs/RADIO The paper introduces AM-RADIO, a multi-teacher distillation framework for training a single vision foundation model from scratch using multiple pretrained VFMs (CLIP, DINOv2, SAM) as teachers, resulting in a model that combines their strengths and often surpasses them. Existing VFMs excel in specific domains (e.g., zero-shot learning, dense tasks) but lack comprehensive capabilities. AM-RADIO addresses this by creating a unified model that inherits and surpasses the strengths of individual teacher models. AM-RADIO distills knowledge from multiple teacher VFMs by matching student and teacher feature representations using adaptor heads and a combination of cosine similarity and smooth L1 loss. It addresses challenges like input resolution mismatch and efficient training. AM-RADIO models outperform teacher models on various benchmarks, including ImageNet classification, semantic segmentation, and visual question answering. The framework allows for flexibility in student architecture, leading to the development of E-RADIO, a novel efficient architecture that achieves high throughput without sacrificing accuracy. The study highlights the importance of full feature distillation for dense tasks and the complementary strengths of different teacher models. The partitioned training scheme for different teacher objectives might lead to latent resolution-dependent modes in the student model. Future work includes exploring more sophisticated loss balancing techniques and student adaptor head architectures. knowledge distillation, multi-teacher distillation, vision foundation models, efficient architectures, visual question answering
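As a rough illustration of the multi-teacher distillation objective summarized above (per-teacher adaptor heads trained with a combination of cosine-similarity and smooth-L1 feature matching), consider the sketch below; the teacher names, feature dimensions, and the 50/50 loss weighting are placeholders, not the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTeacherDistiller(nn.Module):
    """Minimal sketch of multi-teacher feature distillation.

    One linear adaptor head per teacher maps the shared student feature into
    that teacher's embedding space; the loss combines cosine-similarity and
    smooth-L1 terms for each teacher.
    """

    def __init__(self, student_dim, teacher_dims):
        super().__init__()
        self.heads = nn.ModuleDict(
            {name: nn.Linear(student_dim, dim) for name, dim in teacher_dims.items()}
        )

    def forward(self, student_feat, teacher_feats):
        loss = 0.0
        for name, target in teacher_feats.items():
            pred = self.heads[name](student_feat)
            cos = 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
            l1 = F.smooth_l1_loss(pred, target)
            loss = loss + 0.5 * cos + 0.5 * l1
        return loss

# Toy usage: a 768-d student distilled toward three hypothetical teachers.
teacher_dims = {"clip": 1024, "dinov2": 1536, "sam": 1280}
distiller = MultiTeacherDistiller(768, teacher_dims)
student = torch.randn(4, 768)
teachers = {k: torch.randn(4, d) for k, d in teacher_dims.items()}
print(distiller(student, teachers).item())
```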
2312.06708 Report Neutral Editing Framework for Diffusion-based Video Editing Sunjae Yoon, Gwanhyeong Koo, Ji Woo Hong, Chang D. Yoo Text-conditioned image editing has succeeded in various types of editing based on a diffusion framework. Unfortunately, this success did not carry over to a video, which continues to be challenging. Existing video editing systems are still limited to rigid-type editing such as style transfer and object overlay. To this end, this paper proposes Neutral Editing (NeuEdit) framework to enable complex non-rigid editing by changing the motion of a person/object in a video, which has never been attempted before. NeuEdit introduces a concept of `neutralization' that enhances a tuning-editing process of diffusion-based editing systems in a model-agnostic manner by leveraging input video and text without any other auxiliary aids (e.g., visual masks, video captions). Extensive experiments on numerous videos demonstrate adaptability and effectiveness of the NeuEdit framework. The website of our work is available here: https://neuedit.github.io Presents Neutral Editing (NeuEdit), a framework for complex non-rigid video editing (e.g., changing object motion) using text prompts. Existing video editing methods struggle with non-rigid edits, often limited to rigid transformations like style transfer and object overlay. Introduces 'neutralization', reducing irrelevant content influence during model tuning and editing. Utilizes text and video analysis to identify and disentangle editing factors, generating 'neutral prompts' and 'neutral videos'. Significantly improves textual alignment with target prompts, enabling edits like changing a person's pose or an object's motion. Maintains higher fidelity to unedited regions compared to existing methods. Demonstrates consistent performance across different video editing models and datasets. Editing can be biased, unintentionally changing scene context related to the desired attribute. Editing moving objects to become still is challenging due to temporal consistency constraints in video diffusion models. video editing, diffusion models, text-guided synthesis, non-rigid transformation, content disentanglement
2312.06706 Report UNeR3D: Versatile and Scalable 3D RGB Point Cloud Generation from 2D Images in Unsupervised Reconstruction Hongbin Lin, Juangui Xu, Qingfeng Xu, Zhengyu Hu, Handing Xu, Yunzhi Chen, Yongjun Hu, Zhenguo Nie In the realm of 3D reconstruction from 2D images, a persisting challenge is to achieve high-precision reconstructions devoid of 3D Ground Truth data reliance. We present UNeR3D, a pioneering unsupervised methodology that sets a new standard for generating detailed 3D reconstructions solely from 2D views. Our model significantly cuts down the training costs tied to supervised approaches and introduces RGB coloration to 3D point clouds, enriching the visual experience. Employing an inverse distance weighting technique for color rendering, UNeR3D ensures seamless color transitions, enhancing visual fidelity. Our model's flexible architecture supports training with any number of views, and uniquely, it is not constrained by the number of views used during training when performing reconstructions. It can infer with an arbitrary count of views during inference, offering unparalleled versatility. Additionally, the model's continuous spatial input domain allows the generation of point clouds at any desired resolution, empowering the creation of high-resolution 3D RGB point clouds. We solidify the reconstruction process with a novel multi-view geometric loss and color loss, demonstrating that our model excels with single-view inputs and beyond, thus reshaping the paradigm of unsupervised learning in 3D vision. Our contributions signal a substantial leap forward in 3D vision, offering new horizons for content creation across diverse applications. Code is available at https://github.com/HongbinLin3589/UNeR3D. UNeR3D: An unsupervised learning methodology for generating detailed 3D reconstructions (RGB point clouds) solely from 2D views. Addresses the limitations of supervised 3D reconstruction methods that heavily rely on costly and time-consuming 3D Ground Truth data. Combines neural radiance fields with a knn-based inverse distance weighting scheme, employing ResNet34 for feature extraction and a specialized MLP for point processing. Leverages multi-view geometric and color losses for training, allowing for single or multi-view reconstruction. Achieves high-fidelity 3D reconstructions without 3D ground truth data during training. Introduces RGB color attributes to point clouds within the NeRF framework. Enables flexible reconstruction from a variable number of input views, including single-view reconstruction. Generated 2D views may exhibit artifacts due to the model's generalizability. Distinguishing foreground and background elements in complex scenes can be challenging. 3d reconstruction, unsupervised learning, neural radiance fields, point cloud generation, inverse distance weighting
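The inverse distance weighting used for color rendering in UNeR3D can be illustrated with a generic kNN-IDW interpolation like the sketch below; the brute-force neighbor search and the 1/d weighting are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def idw_color(query_pts, ref_pts, ref_colors, k=8, eps=1e-8):
    """Inverse-distance-weighted color interpolation over k nearest neighbors.

    query_pts:  (M, 3) points whose colors we want.
    ref_pts:    (N, 3) points carrying known RGB values.
    ref_colors: (N, 3) RGB in [0, 1].
    Returns (M, 3) interpolated colors.
    """
    # Pairwise distances (M, N); fine for small N, use a KD-tree at scale.
    d = np.linalg.norm(query_pts[:, None, :] - ref_pts[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]               # indices of k nearest neighbors
    knn_d = np.take_along_axis(d, idx, axis=1)       # (M, k)
    w = 1.0 / (knn_d + eps)                          # closer neighbors weigh more
    w = w / w.sum(axis=1, keepdims=True)             # normalize weights per query
    return (w[:, :, None] * ref_colors[idx]).sum(axis=1)

# Toy usage.
ref = np.random.rand(500, 3)
col = np.random.rand(500, 3)
q = np.random.rand(10, 3)
print(idw_color(q, ref, col).shape)  # (10, 3)
```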
2312.06704 Report SIFU: Side-view Conditioned Implicit Function for Real-world Usable Clothed Human Reconstruction Zechuan Zhang, Zongxin Yang, Yi Yang Creating high-quality 3D models of clothed humans from single images for real-world applications is crucial. Despite recent advancements, accurately reconstructing humans in complex poses or with loose clothing from in-the-wild images, along with predicting textures for unseen areas, remains a significant challenge. A key limitation of previous methods is their insufficient prior guidance in transitioning from 2D to 3D and in texture prediction. In response, we introduce SIFU (Side-view Conditioned Implicit Function for Real-world Usable Clothed Human Reconstruction), a novel approach combining a Side-view Decoupling Transformer with a 3D Consistent Texture Refinement pipeline. SIFU employs a cross-attention mechanism within the transformer, using SMPL-X normals as queries to effectively decouple side-view features in the process of mapping 2D features to 3D. This method not only improves the precision of the 3D models but also their robustness, especially when SMPL-X estimates are not perfect. Our texture refinement process leverages a text-to-image diffusion-based prior to generate realistic and consistent textures for invisible views. Through extensive experiments, SIFU surpasses SOTA methods in both geometry and texture reconstruction, showcasing enhanced robustness in complex scenarios and achieving an unprecedented Chamfer and P2S measurement. Our approach extends to practical applications such as 3D printing and scene building, demonstrating its broad utility in real-world scenarios. Project page: https://river-zhang.github.io/SIFU-projectpage/. This paper proposes SIFU, a novel method using a Side-view Conditioned Implicit Function with a 3D Consistent Texture Refinement pipeline for high-quality reconstruction of clothed humans from single images, suitable for real-world applications like 3D printing and scene creation. Creating realistic 3D human models from single images is crucial for various applications. However, existing methods struggle with complex poses, loose clothing, and texture prediction for unseen areas. SIFU uses a side-view decoupling transformer guided by SMPL-X normals to extract precise 3D features. A 3D Consistent Texture Refinement process then uses text-to-image diffusion models and consistent editing for detailed, consistent textures. SIFU outperforms SOTA methods in geometry and texture quality, achieving a Chamfer and P2S of 0.6 cm on THuman2.0. Shows improved robustness in geometry reconstruction, even with inaccurate SMPL-X estimations. Effectively handles complex poses and loose clothing, producing realistic and consistent textures. Reconstruction accuracy can be affected by inaccuracies in SMPL-X estimation. The method may struggle with clothing significantly separated from the body. Future work can explore diffusion models for both shape and texture, and refine reconstruction of specific body parts. 3d human reconstruction, implicit function, texture refinement, diffusion models, single-image reconstruction
2312.06703 Report OpenSD: Unified Open-Vocabulary Segmentation and Detection Shuai Li, Minghan Li, Pengfei Wang, Lei Zhang Recently, a few open-vocabulary methods have been proposed by employing a unified architecture to tackle generic segmentation and detection tasks. However, their performance still lags behind the task-specific models due to the conflict between different tasks, and their open-vocabulary capability is limited due to the inadequate use of CLIP. To address these challenges, we present a universal transformer-based framework, abbreviated as OpenSD, which utilizes the same architecture and network parameters to handle open-vocabulary segmentation and detection tasks. First, we introduce a decoder decoupled learning strategy to alleviate the semantic conflict between thing and staff categories so that each individual task can be learned more effectively under the same framework. Second, to better leverage CLIP for end-to-end segmentation and detection, we propose dual classifiers to handle the in-vocabulary domain and out-of-vocabulary domain, respectively. The text encoder is further trained to be region-aware for both thing and stuff categories through decoupled prompt learning, enabling them to filter out duplicated and low-quality predictions, which is important to end-to-end segmentation and detection. Extensive experiments are conducted on multiple datasets under various circumstances. The results demonstrate that OpenSD outperforms state-of-the-art open-vocabulary segmentation and detection methods in both closed- and open-vocabulary settings. Code is available at https://github.com/strongwolf/OpenSD This paper introduces OpenSD, a unified transformer-based framework for open-vocabulary segmentation and detection, employing the same architecture and parameters for both tasks. Existing open-vocabulary segmentation and detection methods often lag behind task-specific models and underutilize CLIP. OpenSD aims to overcome these limitations by offering a single potent framework for these tasks. OpenSD utilizes a two-stage pipeline. First, it generates object masks and boxes. Second, it predicts classifications using dual classifiers (for in-vocabulary and out-of-vocabulary domains) based on these outputs and leverages decoupled decoder learning and region-aware prompted dual classifiers to enhance performance. OpenSD outperforms state-of-the-art open-vocabulary segmentation and detection methods in both closed and open-vocabulary settings. The decoupled decoder learning strategy effectively mitigates conflicts between different tasks, leading to performance gains. The region-aware dual classifiers, particularly the out-of-vocabulary classifier leveraging CLIP, significantly enhance performance in open-vocabulary settings. The model currently relies on ensembling in-vocabulary and out-of-vocabulary classifiers, which could be streamlined. Future work can explore extending OpenSD to other vision tasks like image captioning or visual question answering. open-vocabulary, segmentation, detection, transformer, clip
2312.06680 Report Perceptual Similarity guidance and text guidance optimization for Editing Real Images using Guided Diffusion Models Ruichen Zhang When using a diffusion model for image editing, there are times when the modified image can differ greatly from the source. To address this, we apply a dual-guidance approach to maintain high fidelity to the original in areas that are not altered. First, we employ text-guided optimization, using text embeddings to direct latent space and classifier-free guidance. Second, we use perceptual similarity guidance, optimizing latent vectors with posterior sampling via Tweedie formula during the reverse process. This method ensures the realistic rendering of both the edited elements and the preservation of the unedited parts of the original image. This paper introduces a novel dual-guidance approach for enhancing real image editing using diffusion models, combining text guidance optimization and perceptual similarity guidance to maintain fidelity to the original image in unaltered areas. Existing image editing methods with diffusion models often struggle to balance incorporating edits suggested by the new text prompt while preserving the structure and details of the original image. This dual-guidance approach aims to address this limitation. The method leverages text embeddings for text-guided optimization, employing classifier-free guidance to steer latent space manipulation. Additionally, it utilizes perceptual similarity guidance with posterior sampling via Tweedie's formula during the reverse diffusion process, ensuring realistic rendering and preserving unedited parts of the image. The combined Perceptual Similarity and text optimization approach demonstrates superior CLIPScore, indicating improved alignment between edited images and the new text prompt. While all methods maintain PSNR scores around 20, suggesting visually perceptible differences, Perceptual Similarity + text optimization excels at preserving original image details, as corroborated by user evaluations. Although LPIPS values remain comparable across methods, indicating minor overall image differences, variations in detail preservation are evident. The current image domain guidance relies on whole-image comparisons, potentially leading to distortions in extensively edited images. Future work might involve more localized comparisons to address this. Limitations stemming from Stable Diffusion and Prompt-to-Prompt editing, particularly inaccurate text-image alignment, present further avenues for refinement. image editing, diffusion models, text-guided image editing, perceptual similarity, classifier-free guidance
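The Tweedie-formula step underlying the perceptual similarity guidance is standard: from a noisy sample x_t and a predicted noise eps, the posterior-mean estimate of the clean image is x0_hat = (x_t - sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_bar_t), and a differentiable similarity between x0_hat and the source image can then steer x_t during the reverse process. The sketch below treats the noise prediction as fixed within the guidance step (a common approximation) and uses a placeholder similarity function; the guidance scale is an assumption, not the paper's value.

```python
import torch

def tweedie_x0(x_t, eps_pred, alpha_bar_t):
    """Tweedie / DDPM posterior-mean estimate of the clean sample."""
    return (x_t - (1.0 - alpha_bar_t).sqrt() * eps_pred) / alpha_bar_t.sqrt()

def perceptual_guidance_step(x_t, eps_pred, alpha_bar_t, source, perc_loss, scale=1.0):
    """One guidance step: nudge x_t so the Tweedie estimate stays perceptually
    close to the source image. `perc_loss` is any differentiable similarity
    (e.g., LPIPS); eps_pred is treated as a fixed tensor here.
    """
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = tweedie_x0(x_t, eps_pred, alpha_bar_t)
    loss = perc_loss(x0_hat, source)
    grad = torch.autograd.grad(loss, x_t)[0]
    return (x_t - scale * grad).detach()

# Toy usage with a plain L2 stand-in for a perceptual metric.
x_t = torch.randn(1, 3, 64, 64)
eps = torch.randn_like(x_t)
src = torch.randn_like(x_t)
a_bar = torch.tensor(0.5)
out = perceptual_guidance_step(x_t, eps, a_bar, src, lambda a, b: ((a - b) ** 2).mean())
print(out.shape)
```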
2312.06663 Report CAD: Photorealistic 3D Generation via Adversarial Distillation Ziyu Wan, Despoina Paschalidou, Ian Huang, Hongyu Liu, Bokui Shen, Xiaoyu Xiang, Jing Liao, Leonidas Guibas The increased demand for 3D data in AR/VR, robotics and gaming applications, gave rise to powerful generative pipelines capable of synthesizing high-quality 3D objects. Most of these models rely on the Score Distillation Sampling (SDS) algorithm to optimize a 3D representation such that the rendered image maintains a high likelihood as evaluated by a pre-trained diffusion model. However, finding a correct mode in the high-dimensional distribution produced by the diffusion model is challenging and often leads to issues such as over-saturation, over-smoothing, and Janus-like artifacts. In this paper, we propose a novel learning paradigm for 3D synthesis that utilizes pre-trained diffusion models. Instead of focusing on mode-seeking, our method directly models the distribution discrepancy between multi-view renderings and diffusion priors in an adversarial manner, which unlocks the generation of high-fidelity and photorealistic 3D content, conditioned on a single image and prompt. Moreover, by harnessing the latent space of GANs and expressive diffusion model priors, our method facilitates a wide variety of 3D applications including single-view reconstruction, high diversity generation and continuous 3D interpolation in the open domain. The experiments demonstrate the superiority of our pipeline compared to previous works in terms of generation quality and diversity. This paper introduces Consistent Adversarial Distillation (CAD), a novel method for generating high-quality, photorealistic 3D objects from a single image and text prompt by leveraging pre-trained diffusion models. Existing 3D generation methods based on score distillation often suffer from limitations like over-saturation, over-smoothing, and limited diversity. CAD aims to overcome these issues by directly modeling the distribution of a pre-trained diffusion model. CAD employs a 3D-aware GAN to learn the conditional distribution of a pre-trained diffusion model. It introduces adversarial distillation to minimize the distribution gap between multi-view renderings and diffusion priors. It also proposes strategies for sampling diverse multi-view images from the diffusion model and refining them to enhance quality. CAD outperforms baseline methods in terms of photorealism and diversity, as demonstrated by quantitative evaluation using CLIP similarity scores and qualitative comparisons. The importance of pose pruning and distribution refinement for improving generation quality is highlighted through ablation studies. The paper shows that directly modeling the 3D distribution using a GAN leads to superior results compared to single-mode fitting methods. The optimization speed is limited by the computational cost of volumetric rendering, suggesting the exploration of more efficient rendering techniques as future work. The paper focuses on single-condition generation, and exploring joint training with multiple conditions could further enhance diversity. 3d generation, diffusion models, generative adversarial networks, single-view reconstruction, adversarial distillation
2312.06662 Report Photorealistic Video Generation with Diffusion Models Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, José Lezama We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. Our approach has two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for joint spatial and spatiotemporal generative modeling. Taken together these design decisions enable us to achieve state-of-the-art performance on established video (UCF-101 and Kinetics-600) and image (ImageNet) generation benchmarks without using classifier free guidance. Finally, we also train a cascade of three models for the task of text-to-video generation consisting of a base latent video diffusion model, and two video super-resolution diffusion models to generate videos of $512 \times 896$ resolution at $8$ frames per second. The paper introduces WALT (Window Attention Latent Transformer), a novel transformer-based framework for efficient latent video diffusion models. Current video diffusion models struggle with the high computational demands of video processing. WALT leverages a unified latent space for images and videos, enabling efficient training and generation by incorporating causal encoding and windowed attention in a transformer architecture. WALT has two main stages: (1) A causal 3D CNN encoder-decoder maps images and videos into a shared latent space. (2) A transformer model with alternating spatial and spatiotemporal window attention learns to generate images and videos in this space. The model utilizes AdaLN-LoRA for efficient conditioning, self-conditioning, and a cascaded approach for high-resolution video generation. WALT achieves state-of-the-art results on video generation benchmarks UCF-101 and Kinetics-600, and image generation benchmark ImageNet, without relying on classifier-free guidance. Joint training on image and video data is shown to be crucial for high-quality text-to-video generation. Ablation studies demonstrate the importance of smaller patch sizes, local window attention, self-conditioning, AdaLN-LoRA, and a zero terminal SNR noise schedule. The Inception Score for text-to-video generation, while competitive, is slightly lower than PYoCo, potentially due to the use of less powerful text embeddings. Further scaling of the model size beyond 3B parameters is expected to further improve performance. video generation, diffusion models, transformers, latent space, window attention
2312.06661 Report UpFusion: Novel View Diffusion from Unposed Sparse View Observations Bharath Raj Nagoor Kani, Hsin-Ying Lee, Sergey Tulyakov, Shubham Tulsiani We propose UpFusion, a system that can perform novel view synthesis and infer 3D representations for an object given a sparse set of reference images without corresponding pose information. Current sparse-view 3D inference methods typically rely on camera poses to geometrically aggregate information from input views, but are not robust in-the-wild when such information is unavailable/inaccurate. In contrast, UpFusion sidesteps this requirement by learning to implicitly leverage the available images as context in a conditional generative model for synthesizing novel views. We incorporate two complementary forms of conditioning into diffusion models for leveraging the input views: a) via inferring query-view aligned features using a scene-level transformer, b) via intermediate attentional layers that can directly observe the input image tokens. We show that this mechanism allows generating high-fidelity novel views while improving the synthesis quality given additional (unposed) images. We evaluate our approach on the Co3Dv2 and Google Scanned Objects datasets and demonstrate the benefits of our method over pose-reliant sparse-view methods as well as single-view methods that cannot leverage additional views. Finally, we also show that our learned model can generalize beyond the training categories and even allow reconstruction from self-captured images of generic objects in-the-wild. Presents UpFusion, a system for 3D inference and novel view synthesis from sparse, unposed images, leveraging a conditional diffusion model conditioned on UpSRT features. Addresses limitations of existing sparse-view 3D methods that rely on accurate camera poses, which are often unavailable in real-world scenarios. Combines UpSRT (unposed scene representation transformer) for query-view aligned features with a conditional diffusion model for novel view synthesis. It further optimizes a 3D representation using score-based distillation. Outperforms pose-dependent methods relying on predicted camera poses (e.g., SparseFusion with RelPose++). Achieves better novel view synthesis than UpSRT and single-view methods, especially when leveraging additional unposed images. Demonstrates generalization beyond training categories, including on self-captured images. Generated views may not always be precisely consistent with the input images. Scaling of performance with additional views is not as strong as in pose-aware methods. 3d reconstruction, novel view synthesis, diffusion models, unposed images, sparse view
2312.06660 Report EdgeSAM: Prompt-In-the-Loop Distillation for On-Device Deployment of SAM Chong Zhou, Xiangtai Li, Chen Change Loy, Bo Dai This paper presents EdgeSAM, an accelerated variant of the Segment Anything Model (SAM), optimized for efficient execution on edge devices with minimal compromise in performance. Our approach involves distilling the original ViT-based SAM image encoder into a purely CNN-based architecture, better suited for edge devices. We carefully benchmark various distillation strategies and demonstrate that task-agnostic encoder distillation fails to capture the full knowledge embodied in SAM. To overcome this bottleneck, we include both the prompt encoder and mask decoder in the distillation process, with box and point prompts in the loop, so that the distilled model can accurately capture the intricate dynamics between user input and mask generation. To mitigate dataset bias issues stemming from point prompt distillation, we incorporate a lightweight module within the encoder. EdgeSAM achieves a 40-fold speed increase compared to the original SAM, and it also outperforms MobileSAM, being 14 times as fast when deployed on edge devices while enhancing the mIoUs on COCO and LVIS by 2.3 and 3.2 respectively. It is also the first SAM variant that can run at over 30 FPS on an iPhone 14. Code and models are available at https://github.com/chongzhou96/EdgeSAM. This paper proposes EdgeSAM, an accelerated version of the Segment Anything Model (SAM), optimized for efficient execution on edge devices while retaining comparable performance. Deploying SAM on edge devices like smartphones is challenging due to its large computational requirements, hindering real-time interactive segmentation. The authors distill the knowledge from SAM's ViT-based encoder into a CNN-based architecture, employ a novel 'prompt-in-the-loop' distillation strategy to capture the interaction between user input and mask generation, and introduce a module to adapt to granularity priors of specific datasets. EdgeSAM achieves a 40-fold speed increase compared to the original SAM and is 1.6 times faster than MobileSAM on an NVIDIA 2080 Ti GPU. On an iPhone 14, EdgeSAM achieves an encoding speed of 14ms per image, making it 14 times faster than MobileSAM on the same platform. EdgeSAM maintains comparable accuracy to SAM in box-prompt performance and outperforms MobileSAM in point-prompt performance across various datasets. Limitations in model capacity and training exclusively with ground-truth boxes might lead to performance discrepancies. Further exploration of quantization, pruning, and on-device optimization for enhanced performance. interactive segmentation, edge computing, model compression, knowledge distillation, segment anything model (sam)
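A loose sketch of what the "prompt-in-the-loop" distillation above could look like is given below: beyond matching image-encoder features, the student is also supervised on the masks produced for the same sampled box/point prompts. The specific loss terms and weights are assumptions, and the actual SAM prompt encoder, mask decoder, and prompt sampling loop are omitted.

```python
import torch
import torch.nn.functional as F

def prompt_in_loop_distill_loss(student_feats, teacher_feats,
                                student_mask_logits, teacher_mask_logits,
                                w_feat=1.0, w_mask=1.0):
    """Illustrative distillation objective: align encoder features AND the
    masks produced for the same prompts.

    student_feats / teacher_feats: (B, C, H, W) image-encoder features.
    *_mask_logits: (B, P, H, W) mask logits for P prompts per image.
    """
    feat_loss = F.mse_loss(student_feats, teacher_feats)
    # Use the sigmoid of the teacher's mask logits as soft targets for the student.
    mask_loss = F.binary_cross_entropy_with_logits(
        student_mask_logits, torch.sigmoid(teacher_mask_logits)
    )
    return w_feat * feat_loss + w_mask * mask_loss

# Toy usage.
sf, tf = torch.randn(2, 256, 64, 64), torch.randn(2, 256, 64, 64)
sm, tm = torch.randn(2, 4, 256, 256), torch.randn(2, 4, 256, 256)
print(prompt_in_loop_distill_loss(sf, tf, sm, tm).item())
```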
2312.06655 Report Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior Fangfu Liu, Diankun Wu, Yi Wei, Yongming Rao, Yueqi Duan Recently, 3D content creation from text prompts has demonstrated remarkable progress by utilizing 2D and 3D diffusion models. While 3D diffusion models ensure great multi-view consistency, their ability to generate high-quality and diverse 3D assets is hindered by the limited 3D data. In contrast, 2D diffusion models find a distillation approach that achieves excellent generalization and rich details without any 3D data. However, 2D lifting methods suffer from inherent view-agnostic ambiguity thereby leading to serious multi-face Janus issues, where text prompts fail to provide sufficient guidance to learn coherent 3D results. Instead of retraining a costly viewpoint-aware model, we study how to fully exploit easily accessible coarse 3D knowledge to enhance the prompts and guide 2D lifting optimization for refinement. In this paper, we propose Sherpa3D, a new text-to-3D framework that achieves high-fidelity, generalizability, and geometric consistency simultaneously. Specifically, we design a pair of guiding strategies derived from the coarse 3D prior generated by the 3D diffusion model: a structural guidance for geometric fidelity and a semantic guidance for 3D coherence. Employing the two types of guidance, the 2D diffusion model enriches the 3D content with diversified and high-quality results. Extensive experiments show the superiority of our Sherpa3D over the state-of-the-art text-to-3D methods in terms of quality and 3D consistency. Presents Sherpa3D, a novel text-to-3D generation framework that leverages readily available 3D diffusion models to guide 2D lifting optimization, resulting in high-fidelity, diverse, and geometrically consistent 3D assets. Addresses limitations of existing text-to-3D methods that struggle to balance generalizability, high fidelity, and geometric consistency due to reliance on limited 3D data or costly viewpoint-aware models. Utilizes a coarse 3D prior generated by a 3D diffusion model to guide the optimization process of a 2D diffusion model. Introduces two guiding strategies: structural guidance, which leverages normal maps for geometric fidelity, and semantic guidance, which uses high-level features for 3D coherence. Employs a step annealing technique to balance the influence of 3D guidance during optimization. Generates high-fidelity 3D assets with compelling texture quality and multi-view consistency, outperforming existing methods. Exhibits strong generalization ability across diverse text prompts, effectively mitigating multi-face Janus problems. Achieves high efficiency, generating production-ready 3D models from text prompts within 25 minutes on a single GPU. Current generation quality is limited by the backbone of chosen 3D and 2D diffusion models, which can be addressed in future work by using larger, more sophisticated models like SDXL and DeepFloyd. Future work will explore extending the framework to more creative text-to-4D generation. text-to-3d generation, 3d diffusion models, score distillation sampling, multi-view consistency, geometric fidelity
2312.06644 Report AnyHome: Open-Vocabulary Generation of Structured and Textured 3D Homes Rao Fu, Zehao Wen, Zichen Liu, Srinath Sridhar Inspired by cognitive theories, we introduce AnyHome, a framework that translates any text into well-structured and textured indoor scenes at a house-scale. By prompting Large Language Models (LLMs) with designed templates, our approach converts provided textual narratives into amodal structured representations. These representations guarantee consistent and realistic spatial layouts by directing the synthesis of a geometry mesh within defined constraints. A Score Distillation Sampling process is then employed to refine the geometry, followed by an egocentric inpainting process that adds lifelike textures to it. AnyHome stands out with its editability, customizability, diversity, and realism. The structured representations for scenes allow for extensive editing at varying levels of granularity. Capable of interpreting texts ranging from simple labels to detailed narratives, AnyHome generates detailed geometries and textures that outperform existing methods in both quantitative and qualitative measures. Introduces AnyHome, a framework that converts text descriptions into structured 3D house models with realistic textures, leveraging LLMs and egocentric inpainting. Addresses limitations in existing text-to-3D methods that struggle with robust structure, open-vocabulary furniture/objects, realistic textures, and house-scale generation. 1. Uses LLMs to convert text into structured representations (floorplans, room layouts). 2. Employs graph-based representations and placement rules for coherent layouts. 3. Uses SDS for refining object placement. 4. Applies egocentric inpainting for realistic textures. Generates house-scale scenes with diverse structures and realistic textures from open-vocabulary text. Allows detailed scene editing through text at various levels (room type, layout, object appearance). Outperforms baselines in layout quality (OOB rate) and text-scene alignment (Caption-sim, CLIP-sim). LLMs' limited understanding of 3D space can lead to illogical object placement. Maintaining texture consistency in multi-view inpainting remains a challenge. text-to-3d, 3d scene generation, large language models, egocentric vision, score distillation sampling
2312.06642 Report CorresNeRF: Image Correspondence Priors for Neural Radiance Fields Yixing Lao, Xiaogang Xu, Zhipeng Cai, Xihui Liu, Hengshuang Zhao Neural Radiance Fields (NeRFs) have achieved impressive results in novel view synthesis and surface reconstruction tasks. However, their performance suffers under challenging scenarios with sparse input views. We present CorresNeRF, a novel method that leverages image correspondence priors computed by off-the-shelf methods to supervise NeRF training. We design adaptive processes for augmentation and filtering to generate dense and high-quality correspondences. The correspondences are then used to regularize NeRF training via the correspondence pixel reprojection and depth loss terms. We evaluate our methods on novel view synthesis and surface reconstruction tasks with density-based and SDF-based NeRF models on different datasets. Our method outperforms previous methods in both photometric and geometric metrics. We show that this simple yet effective technique of using correspondence priors can be applied as a plug-and-play module across different NeRF variants. The project page is at https://yxlao.github.io/corres-nerf. This paper presents CorresNeRF, a method that leverages image correspondence priors for improved neural radiance field training with sparse input views. Training NeRFs with sparse input views is challenging but important for real-world applications where dense view capture is costly. The method involves an automatic augmentation and outlier filtering process for dense, high-quality correspondence generation. These correspondences are then used to regularize NeRF training via pixel reprojection and depth loss terms. CorresNeRF outperforms previous methods in novel view synthesis, achieving significantly better photometric metrics and depth prediction on the LLFF dataset. It also excels in surface reconstruction, leading to more accurate surfaces and improved rendering quality compared to baseline SDF-based methods on the DTU dataset. Ablation studies confirm the efficacy of the correspondence augmentation, filtering, and loss terms, demonstrating robustness to noisy correspondences. Current correspondence generation methods may not be robust for extreme cases like unreasonable camera positions or specific textures. Future work will explore leveraging NeRF for correspondence learning to address these limitations. neural radiance fields, nerf, image correspondence, sparse view synthesis, surface reconstruction
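The correspondence reprojection term above can be illustrated with standard two-view geometry: back-project a matched pixel using the NeRF-rendered depth, transform it with the relative camera pose, reproject into the second view, and penalize the distance to its match. The sketch below computes this residual for a single correspondence; the paper's loss weighting and outlier handling are omitted, and the shared-intrinsics assumption is for illustration only.

```python
import numpy as np

def reprojection_error(u1, d1, u2, K, R, t):
    """Pixel reprojection residual for one correspondence.

    u1, u2: (2,) matched pixel coordinates in view 1 and view 2.
    d1:     rendered depth at u1 (along the camera-1 z-axis).
    K:      (3, 3) shared intrinsics; R, t: pose mapping cam-1 to cam-2 coords.
    """
    # Back-project u1 to a 3D point in camera-1 coordinates.
    x1 = d1 * np.linalg.inv(K) @ np.array([u1[0], u1[1], 1.0])
    # Transform into camera-2 coordinates and project with the intrinsics.
    x2 = R @ x1 + t
    p2 = K @ x2
    p2 = p2[:2] / p2[2]
    # A training loss would penalize this distance (plus a depth-consistency term).
    return np.linalg.norm(p2 - np.asarray(u2, dtype=float))

# Toy usage: identity relative pose, so the point reprojects onto itself.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
err = reprojection_error([100, 80], 2.0, [100, 80], K, np.eye(3), np.zeros(3))
print(err)  # ~0
```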
2312.06640 Report Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution Shangchen Zhou, Peiqing Yang, Jianyi Wang, Yihang Luo, Chen Change Loy Text-based diffusion models have exhibited remarkable success in generation and editing, showing great promise for enhancing visual content with their generative prior. However, applying these models to video super-resolution remains challenging due to the high demands for output fidelity and temporal consistency, which is complicated by the inherent randomness in diffusion models. Our study introduces Upscale-A-Video, a text-guided latent diffusion framework for video upscaling. This framework ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into U-Net and VAE-Decoder, maintaining consistency within short sequences; globally, without training, a flow-guided recurrent latent propagation module is introduced to enhance overall video stability by propagating and fusing latent across the entire sequences. Thanks to the diffusion paradigm, our model also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation, enabling a trade-off between fidelity and quality. Extensive experiments show that Upscale-A-Video surpasses existing methods in both synthetic and real-world benchmarks, as well as in AI-generated videos, showcasing impressive visual realism and temporal consistency. This paper introduces Upscale-A-Video, a text-guided latent diffusion framework for video upscaling that leverages pretrained image diffusion models to enhance the quality of low-quality videos. Existing VSR methods struggle to generate realistic textures and details, especially in real-world scenarios with complex and unknown degradations. This work explores the potential of diffusion models for VSR to produce temporally consistent videos with realistic details. Upscale-A-Video employs a local-global temporal consistency strategy. Locally, it integrates temporal layers into U-Net and VAE-Decoder for short-sequence consistency. Globally, a flow-guided recurrent latent propagation module enhances stability across long sequences. The model also incorporates text prompts for guiding texture creation and allows adjustable noise levels to balance restoration and generation. Upscale-A-Video outperforms existing methods in both synthetic and real-world benchmarks, as well as on AI-generated videos, showing improved detail generation and artifact removal. The model effectively leverages text prompts to guide texture creation, leading to more realistic and high-quality details. Adjustable noise levels allow users to control the trade-off between restoration fidelity and detail generation, enabling versatility in different scenarios. The model currently relies on pretrained image diffusion models, and exploring joint training on image and video data could further enhance performance. Investigating more efficient and accurate flow estimation methods for the latent propagation module could further improve temporal consistency. video super-resolution, diffusion models, temporal consistency, text-guided generation, real-world vsr
2312.06573 Report ControlNet-XS: Designing an Efficient and Effective Architecture for Controlling Text-to-Image Diffusion Models Denis Zavadski, Johann-Friedrich Feiden, Carsten Rother The field of image synthesis has made tremendous strides forward in the last years. Besides defining the desired output image with text-prompts, an intuitive approach is to additionally use spatial guidance in form of an image, such as a depth map. For this, a recent and highly popular approach is to use a controlling network, such as ControlNet, in combination with a pre-trained image generation model, such as Stable Diffusion. When evaluating the design of existing controlling networks, we observe that they all suffer from the same problem of a delay in information flowing between the generation and controlling process. This, in turn, means that the controlling network must have generative capabilities. In this work we propose a new controlling architecture, called ControlNet-XS, which does not suffer from this problem, and hence can focus on the given task of learning to control. In contrast to ControlNet, our model needs only a fraction of parameters, and hence is about twice as fast during inference and training time. Furthermore, the generated images are of higher quality and the control is of higher fidelity. All code and pre-trained models will be made publicly available. This paper introduces ControlNet-XS, an efficient and effective architecture for controlling text-to-image diffusion models, addressing the issue of delayed information flow in previous control models. Controlling large-scale text-to-image diffusion models with spatial guidance, like depth maps or sketches, is crucial for users to achieve their desired image output. The paper proposes a novel architecture that eliminates the delay in information flow between the generative and controlling processes by enabling direct communication between their encoders. This allows for a significantly smaller control network trained from scratch without inheriting weights from the generative model. ControlNet-XS, despite its smaller size, outperforms state-of-the-art methods like ControlNet and T2I-Adapter in terms of image quality and control fidelity. The architecture effectively controls large diffusion models, demonstrated by its application to Stable Diffusion XL with a control network significantly smaller in size. Analysis reveals that control effectiveness varies across different U-Net blocks, with encoder blocks being more critical, and large control models can introduce unwanted biases in the generated images. Controlling networks can introduce biases in the generative model, even with a smaller size, demanding further research to minimize these biases. Future work can explore a better understanding of the generative model's mechanisms to design even more effective and application-specific control tools. text-to-image synthesis, diffusion models, controllable image generation, spatial guidance, controlnet
2312.06439 Report DreamControl: Control-Based Text-to-3D Generation with 3D Self-Prior Tianyu Huang, Yihan Zeng, Zhilu Zhang, Wan Xu, Hang Xu, Songcen Xu, Rynson W. H. Lau, Wangmeng Zuo 3D generation has raised great attention in recent years. With the success of text-to-image diffusion models, the 2D-lifting technique becomes a promising route to controllable 3D generation. However, these methods tend to present inconsistent geometry, which is also known as the Janus problem. We observe that the problem is caused mainly by two aspects, i.e., viewpoint bias in 2D diffusion models and overfitting of the optimization objective. To address it, we propose a two-stage 2D-lifting framework, namely DreamControl, which optimizes coarse NeRF scenes as 3D self-prior and then generates fine-grained objects with control-based score distillation. Specifically, adaptive viewpoint sampling and boundary integrity metric are proposed to ensure the consistency of generated priors. The priors are then regarded as input conditions to maintain reasonable geometries, in which conditional LoRA and weighted score are further proposed to optimize detailed textures. DreamControl can generate high-quality 3D content in terms of both geometry consistency and texture fidelity. Moreover, our control-based optimization guidance is applicable to more downstream tasks, including user-guided generation and 3D animation. The project page is available at https://github.com/tyhuang0428/DreamControl. The paper proposes DreamControl, a two-stage 2D-lifting framework for text-to-3D generation that addresses the Janus problem (inconsistent geometry) by leveraging a coarse NeRF representation as a 3D self-prior. Existing 2D-lifting methods for 3D generation often produce inconsistent geometry due to viewpoint bias in 2D diffusion models and overfitting during optimization. DreamControl first generates a coarse NeRF shape as a 3D self-prior using adaptive viewpoint sampling and a boundary integrity metric to minimize inconsistencies. Then, it utilizes control-based score distillation with a conditional LoRA and weighted score to generate detailed textures while maintaining the prior's geometry. DreamControl generates high-quality 3D content with improved geometry consistency and texture fidelity compared to previous methods. The proposed control-based guidance is applicable to other tasks like user-guided generation and 3D animation. Quantitative results demonstrate DreamControl's superiority in generating consistent geometries and preserving texture details. The method may fail when 3D priors look similar from different viewpoints. Future work could explore ways to further enhance the diversity of generated content. text-to-3d generation, 3d self-prior, janus problem, control-based score distillation, nerf
2312.06285 Report Compensation Sampling for Improved Convergence in Diffusion Models Hui Lu, Albert Ali Salah, Ronald Poppe Diffusion models achieve remarkable quality in image generation, but at a cost. Iterative denoising requires many time steps to produce high fidelity images. We argue that the denoising process is crucially limited by an accumulation of the reconstruction error due to an initial inaccurate reconstruction of the target data. This leads to lower quality outputs, and slower convergence. To address this issue, we propose compensation sampling to guide the generation towards the target domain. We introduce a compensation term, implemented as a U-Net, which adds negligible computation overhead during training and, optionally, inference. Our approach is flexible and we demonstrate its application in unconditional generation, face inpainting, and face de-occlusion using benchmark datasets CIFAR-10, CelebA, CelebA-HQ, FFHQ-256, and FSG. Our approach consistently yields state-of-the-art results in terms of image quality, while accelerating the denoising process to converge during training by up to an order of magnitude. This paper proposes "compensation sampling" (CS), a novel sampling method to address the error accumulation issue during the training process of diffusion models, which leads to faster convergence and higher-quality image generation. Diffusion models, while achieving impressive results in image generation, often suffer from slow training and inference due to the iterative nature of the denoising process, resulting in the accumulation of reconstruction errors. This paper aims to alleviate this limitation and improve the efficiency of diffusion models. The proposed compensation sampling algorithm introduces a learned compensation term, implemented as a lightweight U-Net model, to guide the reconstruction towards the clean data distribution. This term counteracts the accumulation of errors during training. The approach is evaluated on various image generation tasks including unconditional generation, face inpainting, and face de-occlusion. Compensation sampling significantly accelerates the training convergence of diffusion models, up to an order of magnitude faster than traditional methods. The generated images using compensation sampling consistently exhibit higher quality, achieving state-of-the-art results on benchmark datasets like CIFAR-10, CelebA, and FFHQ, outperforming existing diffusion and GAN-based methods. The compensation term's computational overhead during training and inference is negligible. The study primarily focuses on image generation tasks. Further investigation is needed to explore the applicability of compensation sampling in other domains such as audio or video generation. While the compensation module is lightweight, its impact on memory footprint during training, particularly for high-resolution images, needs further analysis and optimization. diffusion models, image generation, compensation sampling, deep learning, computer vision
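To make the idea of a learned compensation term concrete, here is a minimal sketch of a DDPM-style training step in which a lightweight network corrects the predicted clean image before the reconstruction loss is applied. The parameterization, the weighting `lam`, and the callables `denoiser` and `compensator` are assumptions for illustration, not the paper's exact formulation.

```python
# Minimal sketch of a DDPM-style training step with a learned compensation
# term that corrects the predicted clean image before the reconstruction loss.
# `denoiser`, `compensator`, and the weighting `lam` are assumptions; the
# paper's exact formulation of the compensation term may differ.
import torch
import torch.nn.functional as F

def compensated_training_step(denoiser, compensator, x0, alphas_bar, lam=0.1):
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_bar), (b,), device=x0.device)
    a = alphas_bar[t].view(b, 1, 1, 1)
    noise = torch.randn_like(x0)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * noise            # forward diffusion
    eps_pred = denoiser(xt, t)                             # predicted noise
    x0_pred = (xt - (1 - a).sqrt() * eps_pred) / a.sqrt()  # implied clean image
    x0_comp = x0_pred + compensator(x0_pred, t)            # lightweight correction
    return F.mse_loss(eps_pred, noise) + lam * F.mse_loss(x0_comp, x0)
```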
2312.06205 Report The Journey, Not the Destination: How Data Guides Diffusion Models Kristian Georgiev, Joshua Vendrow, Hadi Salman, Sung Min Park, Aleksander Madry Diffusion models trained on large datasets can synthesize photo-realistic images of remarkable quality and diversity. However, attributing these images back to the training data, that is, identifying specific training examples which caused an image to be generated, remains a challenge. In this paper, we propose a framework that: (i) provides a formal notion of data attribution in the context of diffusion models, and (ii) allows us to counterfactually validate such attributions. Then, we provide a method for computing these attributions efficiently. Finally, we apply our method to find (and evaluate) such attributions for denoising diffusion probabilistic models trained on CIFAR-10 and latent diffusion models trained on MS COCO. We provide code at https://github.com/MadryLab/journey-TRAK. This paper introduces a framework for attributing images synthesized by diffusion models back to their training data by identifying influential training examples at each step of the diffusion process. Attributing generated images to training data is crucial for understanding model behavior, detecting memorization and bias, and addressing privacy concerns. This is particularly important for diffusion models, which are increasingly used in various machine learning applications. The authors propose a step-by-step attribution method that analyzes the evolution of the conditional distribution of generated images over the diffusion process. They utilize the TRAK method to efficiently estimate the influence of training examples on the model output at each step. The method identifies positively influential training examples that resemble the generated image throughout the diffusion process and negatively influential examples that differ in specific attributes. Attribution scores are shown to be counterfactually predictive, meaning they can accurately predict the impact of removing training examples on the generated images. The framework enables feature-level attribution by localizing the analysis to specific patches in the generated image. The current method relies on proxies for certain quantities in the diffusion process, and finding more accurate approximations could further improve the attributions. Scaling the framework to larger diffusion models and datasets, while feasible, presents a computational challenge. data attribution, diffusion models, generative models, counterfactual analysis, feature attribution
2312.06198 Report Optimized View and Geometry Distillation from Multi-view Diffuser Youjia Zhang, Zikai Song, Junqing Yu, Yawei Luo, Wei Yang Generating multi-view images from a single input view using image-conditioned diffusion models is a recent advancement and has shown considerable potential. However, issues such as the lack of consistency in synthesized views and over-smoothing in extracted geometry persist. Previous methods integrate multi-view consistency modules or impose additional supervisory to enhance view consistency while compromising on the flexibility of camera positioning and limiting the versatility of view synthesis. In this study, we consider the radiance field optimized during geometry extraction as a more rigid consistency prior, compared to volume and ray aggregation used in previous works. We further identify and rectify a critical bias in the traditional radiance field optimization process through score distillation from a multi-view diffuser. We introduce an Unbiased Score Distillation (USD) that utilizes unconditioned noises from a 2D diffusion model, greatly refining the radiance field fidelity. We leverage the rendered views from the optimized radiance field as the basis and develop a two-step specialization process of a 2D diffusion model, which is adept at conducting object-specific denoising and generating high-quality multi-view images. Finally, we recover faithful geometry and texture directly from the refined multi-view images. Empirical evaluations demonstrate that our optimized geometry and view distillation technique generates comparable results to the state-of-the-art models trained on extensive datasets, all while maintaining freedom in camera positioning. Please see our project page at https://youjiazhang.github.io/USD/. This paper introduces an optimized view and geometry distillation technique from a multi-view diffusion model, addressing issues like view inconsistency and geometry over-smoothing in previous methods. Generating consistent multi-view images and high-quality 3D models from a single image is a challenging task with broad applications in various fields. The proposed method uses an Unbiased Score Distillation (USD) to rectify bias in the multi-view diffuser and leverages the optimized radiance field as a consistency prior. A two-step DreamBooth specialization process further refines a 2D diffusion model for generating high-quality multi-view images. Finally, NeuS recovers the geometry and texture from the refined images. The USD method significantly improves the quality of the extracted radiance field compared to traditional SDS/SJC methods. The proposed approach generates multi-view images and geometries comparable to state-of-the-art models trained on large datasets, while maintaining flexibility in camera positioning. The method effectively addresses the limitations of previous approaches, particularly in terms of view consistency and geometry detail. The underlying causes of the bias issue in the Zero-1-to-3 model are not fully understood. Future work will focus on a theoretical analysis of the bias and explore applications of USD in other domains. multi-view diffusion model, 3d reconstruction, score distillation, dreambooth, nerf
2312.06158 Report Adaptive Feature Selection for No-Reference Image Quality Assessment using Contrastive Mitigating Semantic Noise Sensitivity Xudong Li, Timin Gao, Xiawu Zheng, Runze Hu, Jingyuan Zheng, Yunhang Shen, Ke Li, Yutao Liu, Pingyang Dai, Yan Zhang, Rongrong Ji The current state-of-the-art No-Reference Image Quality Assessment (NR-IQA) methods typically use feature extraction in upstream backbone networks, which assumes that all extracted features are relevant. However, we argue that not all features are beneficial, and some may even be harmful, necessitating careful selection. Empirically, we find that many image pairs with small feature spatial distances can have vastly different quality scores. To address this issue, we propose a Quality-Aware Feature Matching IQA metric (QFM-IQM) that employs contrastive learning to remove harmful features from the upstream task. Specifically, our approach enhances the semantic noise distinguishing capability of neural networks by comparing image pairs with similar semantic features but varying quality scores and adaptively adjusting the upstream task's features by introducing disturbance. Furthermore, we utilize a distillation framework to expand the dataset and improve the model's generalization ability. Our approach achieves superior performance to the state-of-the-art NR-IQA methods on 8 standard NR-IQA datasets, achieving PLCC values of 0.932 (vs. 0.908 in TID2013) and 0.913 (vs. 0.894 in LIVEC). The paper proposes QFM-IQM, a novel No-Reference Image Quality Assessment (NR-IQA) method using contrastive learning to filter irrelevant features and a distillation framework for better generalization. Existing NR-IQA methods struggle to differentiate between images with similar semantic content but different quality scores due to feature aliasing. QFM-IQM employs 1) Semantic Noise Matching (SNM) to pair images with similar semantics but different quality, 2) Quality Consistency Contrastive (QCC) module for robustness to semantic noise, and 3) Distilled Label Extension (DLE) to expand the dataset using pseudo-labels for improved generalization. QFM-IQM outperforms 15 state-of-the-art NR-IQA methods on 8 benchmark datasets. Cross-dataset validation experiments show QFM-IQM's superior generalization capability. Ablation study confirms the effectiveness of QCC and DLE in improving performance. The impact of varying the number of matched features (K) requires further investigation. Exploring different knowledge distillation strategies might further improve performance. image quality assessment, no-reference iqa, contrastive learning, knowledge distillation, feature selection
2312.06116 Report Stellar: Systematic Evaluation of Human-Centric Personalized Text-to-Image Methods Panos Achlioptas, Alexandros Benetatos, Iordanis Fostiropoulos, Dimitris Skourtis In this work, we systematically study the problem of personalized text-to-image generation, where the output image is expected to portray information about specific human subjects. E.g., generating images of oneself appearing at imaginative places, interacting with various items, or engaging in fictional activities. To this end, we focus on text-to-image systems that input a single image of an individual to ground the generation process along with text describing the desired visual context. Our first contribution is to fill the literature gap by curating high-quality, appropriate data for this task. Namely, we introduce a standardized dataset (Stellar) that contains personalized prompts coupled with images of individuals that is an order of magnitude larger than existing relevant datasets and where rich semantic ground-truth annotations are readily available. Having established Stellar to promote cross-systems fine-grained comparisons further, we introduce a rigorous ensemble of specialized metrics that highlight and disentangle fundamental properties such systems should obey. Besides being intuitive, our new metrics correlate significantly more strongly with human judgment than currently used metrics on this task. Last but not least, drawing inspiration from the recent works of ELITE and SDXL, we derive a simple yet efficient, personalized text-to-image baseline that does not require test-time fine-tuning for each subject and which sets quantitatively and in human trials a new SoTA. For more information, please visit our project's website: https://stellar-gen-ai.github.io/. This paper introduces Stellar, a large-scale dataset for personalized text-to-image generation, proposes novel evaluation metrics specifically designed for such systems, and presents StellarNet, a simple yet effective baseline model that sets a new state-of-the-art. Personalized text-to-image generation, while promising, lacks standardized data and specialized evaluation metrics, hindering progress in the field. The authors curate Stellar, a dataset of 20,000 imaginative prompts paired with 400 unique human identities. They introduce five new metrics focusing on identity preservation, attribute accuracy, stability across different images of the same subject, and object/relation faithfulness. They develop StellarNet, leveraging SDXL and dynamic textual inversion to generate personalized images. The proposed metrics demonstrate significantly stronger correlation with human judgment compared to existing metrics. StellarNet outperforms other state-of-the-art personalized text-to-image generation methods both quantitatively and in human evaluation. The authors highlight the potential for misuse of personalized image generation and urge responsible use and content moderation. StellarNet, while effective, inherits potential biases present in the underlying SDXL model. Future work could focus on mitigating biases and developing more robust content moderation techniques for personalized image generation. text-to-image generation, personalized ai, dataset, evaluation metrics, ethical considerations
2312.06109 Report Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, Xiangyu Zhang Modern Large Vision-Language Models (LVLMs) enjoy the same vision vocabulary -- CLIP, which can cover most common vision tasks. However, for some special vision tasks that need dense and fine-grained vision perception, e.g., document-level OCR or chart understanding, especially in non-English scenarios, the CLIP-style vocabulary may encounter low efficiency in tokenizing the vision knowledge and even suffer out-of-vocabulary problems. Accordingly, we propose Vary, an efficient and effective method to scale up the vision vocabulary of LVLMs. The procedures of Vary are naturally divided into two folds: the generation and integration of a new vision vocabulary. In the first phase, we devise a vocabulary network along with a tiny decoder-only transformer to produce the desired vocabulary via autoregression. In the next phase, we scale up the vanilla vision vocabulary by merging the new one with the original one (CLIP), enabling the LVLMs to quickly garner new features. Compared to the popular BLIP-2, MiniGPT4, and LLaVA, Vary can maintain its vanilla capabilities while enjoying more excellent fine-grained perception and understanding ability. Specifically, Vary is competent in new document parsing features (OCR or markdown conversion) while achieving 78.2% ANLS in DocVQA and 36.2% in MMVet. Our code will be publicly available on the homepage. Proposes Vary, a method for scaling up the vision vocabulary of Large Vision-Language Models (LVLMs) to improve performance on tasks requiring dense and fine-grained vision perception, such as document OCR and chart understanding. Existing LVLMs often rely on a CLIP-based vision vocabulary, which may not be efficient or effective for specialized vision tasks, particularly in non-English scenarios. Vary generates a new vision vocabulary using a vocabulary network and a tiny decoder-only transformer trained on document and chart images. It then integrates this new vocabulary with the original CLIP vocabulary in the LVLM. Vary-base achieves comparable performance to specialized document parsing models on English document OCR and outperforms them on markdown conversion. Vary-base demonstrates significant improvements on downstream VQA tasks, achieving strong results on DocVQA and ChartQA. Vary-base maintains competitive general performance compared to other LVLMs on the MMVet benchmark. The paper acknowledges that the current method for scaling up the visual vocabulary can be further improved. Future work will explore applying Vary to other fine-grained vision tasks beyond document and chart understanding. vision-language models, vocabulary expansion, document ocr, chart understanding, fine-grained vision perception
2312.06059 Report CONFORM: Contrast is All You Need For High-Fidelity Text-to-Image Diffusion Models Tuna Han Salih Meral, Enis Simsar, Federico Tombari, Pinar Yanardag Images produced by text-to-image diffusion models might not always faithfully represent the semantic intent of the provided text prompt, where the model might overlook or entirely fail to produce certain objects. Existing solutions often require customly tailored functions for each of these problems, leading to sub-optimal results, especially for complex prompts. Our work introduces a novel perspective by tackling this challenge in a contrastive context. Our approach intuitively promotes the segregation of objects in attention maps while also maintaining that pairs of related attributes are kept close to each other. We conduct extensive experiments across a wide variety of scenarios, each involving unique combinations of objects, attributes, and scenes. These experiments effectively showcase the versatility, efficiency, and flexibility of our method in working with both latent and pixel-based diffusion models, including Stable Diffusion and Imagen. Moreover, we publicly share our source code to facilitate further research. This paper introduces CONFORM, a training-free method using a contrastive objective and test-time optimization to improve the fidelity of pre-trained text-to-image diffusion models. Existing methods for improving the fidelity of text-to-image models often rely on tailored solutions for specific problems, leading to sub-optimal performance. CONFORM leverages attention maps of object and attribute tokens as features. It treats attributes of a specific object as positive pairs and contrasts them against other attributes and objects. The method utilizes InfoNCE loss with a contrastive objective applied to cross-attention maps from multiple timesteps during the diffusion process. CONFORM effectively addresses missing objects, attribute binding errors, and object miscounting in generated images. Quantitative evaluation using CLIP similarity, BLIP captioning similarity, and TIFA scores demonstrate CONFORM's superiority over existing methods. User studies confirm that CONFORM generates images that better align with text prompts compared to other state-of-the-art approaches. The method may struggle to generate successful images when the initial attention map significantly excludes key objects. In some cases, CONFORM's refinement process might lead to object separation in the generated image, although the text prompt accuracy is improved. text-to-image synthesis, diffusion models, contrastive learning, attention maps, image fidelity
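The contrastive objective can be pictured as an InfoNCE loss over flattened cross-attention maps, where an object token and its attribute tokens form positive pairs and all other tokens act as negatives. The sketch below follows that reading; the grouping interface, temperature `tau`, and normalization choices are assumptions rather than CONFORM's exact settings.

```python
# Hypothetical InfoNCE-style objective over flattened cross-attention maps,
# in the spirit of CONFORM: tokens in the same group (an object plus its
# attributes) attract, all other tokens repel. The grouping interface and the
# temperature `tau` are assumptions, not the paper's exact settings.
import torch
import torch.nn.functional as F

def contrastive_attention_loss(attn_maps, groups, tau=0.07):
    """attn_maps: dict token_idx -> (H, W) cross-attention map.
    groups: list of lists of token indices forming positive sets."""
    feats, labels = [], []
    for g, token_ids in enumerate(groups):
        for tid in token_ids:
            feats.append(F.normalize(attn_maps[tid].flatten(), dim=0))
            labels.append(g)
    feats = torch.stack(feats)                               # (N, H*W)
    labels = torch.tensor(labels, device=feats.device)
    n = feats.shape[0]
    sim = feats @ feats.T / tau                              # cosine similarities
    eye = torch.eye(n, dtype=torch.bool, device=feats.device)
    loss = feats.new_zeros(())
    for i in range(n):
        others = ~eye[i]                                     # exclude self-similarity
        pos = (labels == labels[i]) & others
        if pos.any():
            log_prob = sim[i][others] - torch.logsumexp(sim[i][others], dim=0)
            loss = loss - log_prob[pos[others]].mean()
    return loss / n
```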
2312.06038 Report Correcting Diffusion Generation through Resampling Yujian Liu, Yang Zhang, Tommi Jaakkola, Shiyu Chang Despite diffusion models' superior capabilities in modeling complex distributions, there are still non-trivial distributional discrepancies between generated and ground-truth images, which has resulted in several notable problems in image generation, including missing object errors in text-to-image generation and low image quality. Existing methods that attempt to address these problems mostly do not tend to address the fundamental cause behind these problems, which is the distributional discrepancies, and hence achieve sub-optimal results. In this paper, we propose a particle filtering framework that can effectively address both problems by explicitly reducing the distributional discrepancies. Specifically, our method relies on a set of external guidance, including a small set of real images and a pre-trained object detector, to gauge the distribution gap, and then design the resampling weight accordingly to correct the gap. Experiments show that our methods can effectively correct missing object errors and improve image quality in various image generation tasks. Notably, our method outperforms the existing strongest baseline by 5% in object occurrence and 1.0 in FID on MS-COCO. Our code is publicly available at https://github.com/UCSB-NLP-Chang/diffusion_resampling.git. This paper introduces a particle filtering framework for diffusion models, addressing missing object errors and low image quality by minimizing distributional discrepancies between generated and real images. Existing methods often fail to address the root cause of these errors: the distributional gap between generated and ground-truth images. The proposed framework uses external guidance (real images and/or an object detector) to compute resampling weights during the diffusion process, guiding generated samples closer to the desired distribution. Outperforms baselines in object occurrence and image quality on text-to-image generation benchmarks like MS-COCO. Achieves state-of-the-art FID scores on ImageNet-64 for class-conditioned generation. Demonstrates the effectiveness of particle filtering and resampling strategies in aligning generated distributions with ground-truth. The reliance on external guidance (object detectors, real images) can limit generalizability. Further exploration is needed to optimize computational efficiency and reduce dependence on large numbers of function evaluations. diffusion models, particle filtering, text-to-image generation, image quality, object detection
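A minimal sketch of the particle-filtering view: at a given denoising step, a population of candidate latents is redrawn in proportion to resampling weights derived from external guidance, for example detector confidence for objects named in the prompt. The weight definition and the detector interface below are assumptions for illustration, not the paper's exact weighting scheme.

```python
# Toy sketch of the resampling step in a particle-filtering view of diffusion
# sampling: candidate latents are redrawn in proportion to weights derived
# from external guidance. The detector interface and the weight definition
# below are assumptions for illustration.
import torch

def resample_particles(latents, log_weights):
    """latents: (P, C, H, W) candidates at the current step; log_weights: (P,)."""
    probs = torch.softmax(log_weights, dim=0)
    idx = torch.multinomial(probs, num_samples=latents.shape[0], replacement=True)
    return latents[idx]                                  # particles redrawn by weight

def detector_log_weights(decoded_images, detector, required_classes):
    """Example weight: sum of detector confidences for objects named in the prompt."""
    scores = []
    for img in decoded_images:
        dets = detector(img)                             # assumed: list of (cls, conf)
        scores.append(sum(conf for cls, conf in dets if cls in required_classes))
    return torch.tensor(scores)
```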
2312.05915 Report Diffusion for Natural Image Matting Yihan Hu, Yiheng Lin, Wei Wang, Yao Zhao, Yunchao Wei, Humphrey Shi We aim to leverage diffusion to address the challenging image matting task. However, the presence of high computational overhead and the inconsistency of noise sampling between the training and inference processes pose significant obstacles to achieving this goal. In this paper, we present DiffMatte, a solution designed to effectively overcome these challenges. First, DiffMatte decouples the decoder from the intricately coupled matting network design, involving only one lightweight decoder in the iterations of the diffusion process. With such a strategy, DiffMatte mitigates the growth of computational overhead as the number of samples increases. Second, we employ a self-aligned training strategy with uniform time intervals, ensuring a consistent noise sampling between training and inference across the entire time domain. Our DiffMatte is designed with flexibility in mind and can seamlessly integrate into various modern matting architectures. Extensive experimental results demonstrate that DiffMatte not only reaches the state-of-the-art level on the Composition-1k test set, surpassing the best methods in the past by 5% and 15% in the SAD metric and MSE metric respectively, but also show stronger generalization ability in other benchmarks. This paper proposes DiffMatte, a novel approach that leverages diffusion models for natural image matting tasks, achieving state-of-the-art performance and strong generalization ability. Existing deep learning-based matting solutions struggle to capture both high-level context and low-level texture information, leading to inaccuracies. Diffusion models, known for their ability to model complex data distributions and generate realistic textures, have not been effectively applied to matting due to high computational overhead and noise sampling inconsistencies. DiffMatte introduces two key innovations: 1) It decouples the image encoder and decoder, utilizing a lightweight decoder for iterative refinement during the diffusion process, significantly reducing computational overhead. 2) It employs a self-aligned training strategy with uniform time intervals, ensuring consistent noise sampling between training and inference, mitigating performance decay caused by data discrepancy. DiffMatte achieves state-of-the-art performance on the Composition-1k benchmark, surpassing previous best methods by a significant margin. The method demonstrates strong generalization ability, outperforming previous methods on Distinctions-646 and Semantic Image Matting test sets. DiffMatte's iterative refinement process allows for continuous improvement of predictions with increasing sampling steps, enhancing the quality of the generated alpha mattes. The model's performance reaches a plateau with increasing sample steps, limited by the accuracy of the matting model's predictions. Future work can explore incorporating artificial correction methods during inference to achieve interactive matting, expanding its applications in image editing. image matting, diffusion models, deep learning, computer vision, iterative refinement
2312.05889 Report SuperPrimitive: Scene Reconstruction at a Primitive Level Kirill Mazur, Gwangbin Bae, Andrew J. Davison Joint camera pose and dense geometry estimation from a set of images or a monocular video remains a challenging problem due to its computational complexity and inherent visual ambiguities. Most dense incremental reconstruction systems operate directly on image pixels and solve for their 3D positions using multi-view geometry cues. Such pixel-level approaches suffer from ambiguities or violations of multi-view consistency (e.g. caused by textureless or specular surfaces). We address this issue with a new image representation which we call a SuperPrimitive. SuperPrimitives are obtained by splitting images into semantically correlated local regions and enhancing them with estimated surface normal directions, both of which are predicted by state-of-the-art single image neural networks. This provides a local geometry estimate per SuperPrimitive, while their relative positions are adjusted based on multi-view observations. We demonstrate the versatility of our new representation by addressing three 3D reconstruction tasks: depth completion, few-view structure from motion, and monocular dense visual odometry. This paper introduces SuperPrimitives, a novel image representation for dense 3D reconstruction that combines local geometric priors from single-image neural networks with multi-view optimization. Dense incremental reconstruction from monocular images or videos is challenging due to computational complexity and visual ambiguities. Existing methods often struggle to balance reliable initial geometry estimates with multi-view consistency. SuperPrimitives are constructed by segmenting an image into semantically correlated regions and enhancing them with estimated surface normals. These primitives' scales are then optimized jointly with camera poses using multi-view photometric consistency. The method achieves state-of-the-art results in zero-shot depth completion on the VOID benchmark. It outperforms competitors in few-view structure from motion on the ScanNet dataset, even without global priors. The approach enables a simple yet effective monocular visual odometry system that surpasses previous methods on the challenging TUM RGB-D dataset. The method assumes geometric continuity within each predicted image segment, which may not always hold. The current implementation does not explicitly handle occlusions during primitive alignment, which could be improved in future work. 3d reconstruction, superprimitives, depth completion, structure from motion, visual odometry
2312.05849 Report InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models Jiun Tian Hoe, Xudong Jiang, Chee Seng Chan, Yap-Peng Tan, Weipeng Hu Large-scale text-to-image (T2I) diffusion models have showcased incredible capabilities in generating coherent images based on textual descriptions, enabling vast applications in content generation. While recent advancements have introduced control over factors such as object localization, posture, and image contours, a crucial gap remains in our ability to control the interactions between objects in the generated content. Well-controlling interactions in generated images could yield meaningful applications, such as creating realistic scenes with interacting characters. In this work, we study the problems of conditioning T2I diffusion models with Human-Object Interaction (HOI) information, consisting of a triplet label (person, action, object) and corresponding bounding boxes. We propose a pluggable interaction control model, called InteractDiffusion that extends existing pre-trained T2I diffusion models to enable them being better conditioned on interactions. Specifically, we tokenize the HOI information and learn their relationships via interaction embeddings. A conditioning self-attention layer is trained to map HOI tokens to visual tokens, thereby conditioning the visual tokens better in existing T2I diffusion models. Our model attains the ability to control the interaction and location on existing T2I diffusion models, which outperforms existing baselines by a large margin in HOI detection score, as well as fidelity in FID and KID. Project page: https://jiuntian.github.io/interactdiffusion. This work proposes InteractDiffusion, a pluggable interaction control model, which enhances pre-trained text-to-image diffusion models with Human-Object Interaction (HOI) control, improving the generation of images with specific interactions between objects. Current text-to-image diffusion models struggle to accurately depict interactions between objects, a crucial aspect of realistic image generation. Controlling interactions enables diverse applications in e-commerce, gaming, and interactive storytelling. The paper introduces InteractDiffusion, a module consisting of: (1) Interaction Tokenizer (InToken) to transform HOI information into meaningful tokens; (2) Interaction Embedding (InBedding) to capture intricate relationships between interacting objects; (3) Interaction Transformer (InFormer) to integrate HOI tokens into the visual tokens of the diffusion model. InteractDiffusion renders more accurate and coherent interactions between objects compared to existing methods, aligning better with provided instructions. The model maintains high image generation quality, even with added parameters for interaction control, showing comparable or even slightly better FID and KID scores. Quantitative evaluation using HOI Detection Score shows significant improvement over baselines, demonstrating effective control over interactions in generated images. Generated interactions, while improved, may still lack finer details compared to real images, as evidenced by performance differences with larger HOI detectors. Existing large pre-trained models lack comprehensive understanding of interactions, potentially hindering the full potential of interaction control. text-to-image generation, diffusion models, human-object interaction, interaction control, image synthesis
2312.05760 Report RepViT-SAM: Towards Real-Time Segmenting Anything Ao Wang, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding Segment Anything Model (SAM) has shown impressive zero-shot transfer performance for various computer vision tasks recently. However, its heavy computation costs remain daunting for practical applications. MobileSAM proposes to replace the heavyweight image encoder in SAM with TinyViT by employing distillation, which results in a significant reduction in computational requirements. However, its deployment on resource-constrained mobile devices still encounters challenges due to the substantial memory and computational overhead caused by self-attention mechanisms. Recently, RepViT achieves the state-of-the-art performance and latency trade-off on mobile devices by incorporating efficient architectural designs of ViTs into CNNs. Here, to achieve real-time segmenting anything on mobile devices, following MobileSAM, we replace the heavyweight image encoder in SAM with RepViT model, ending up with the RepViT-SAM model. Extensive experiments show that RepViT-SAM can enjoy significantly better zero-shot transfer capability than MobileSAM, along with nearly $10\times$ faster inference speed. The code and models are available at \url{https://github.com/THU-MIG/RepViT}. This paper introduces RepViT-SAM, a model that replaces the heavy image encoder in Segment Anything Model (SAM) with RepViT to enable real-time performance on mobile devices. SAM, despite its impressive zero-shot transfer capabilities for various computer vision tasks, suffers from high computational costs. This limits its practical applications, especially on resource-constrained devices. The authors replace the ViT-H image encoder in SAM with RepViT-M2.3. They train RepViT-SAM using a decoupled distillation strategy, directly distilling the image encoder from the ViT-H in the original SAM with an MSE loss. RepViT-SAM achieves significantly better zero-shot transfer capability than MobileSAM across various tasks like edge detection, instance segmentation, and video object segmentation. RepViT-SAM exhibits a near 10x faster inference speed compared to MobileSAM, making it suitable for real-time applications on mobile devices. RepViT-SAM demonstrates strong performance even surpassing the original SAM in specific downstream tasks like anomaly detection. The decoupled distillation strategy, while enabling efficient transfer of macro-level visual features, may limit the model's performance on tasks requiring fine-grained details. Further exploration of different RepViT variants and distillation strategies could potentially yield even better performance and efficiency trade-offs. segment anything model, repvit, mobile vision, zero-shot learning, model compression
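The decoupled distillation described above reduces to a simple training loop: freeze the SAM ViT-H image encoder, run the RepViT student on the same images, and minimize an MSE between their embeddings. The sketch below assumes matching embedding shapes and uses placeholder model objects; it is an illustration of the training recipe, not the released code.

```python
# Sketch of the decoupled encoder distillation: only the RepViT image encoder
# is trained, against frozen SAM ViT-H embeddings, with an MSE loss. Model
# objects and matching embedding shapes are placeholders/assumptions.
import torch
import torch.nn.functional as F

def distill_step(student_encoder, teacher_encoder, images, optimizer):
    with torch.no_grad():
        target = teacher_encoder(images)        # frozen SAM ViT-H image embeddings
    pred = student_encoder(images)              # RepViT-based image embeddings
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```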
2312.05695 Report The Counterattack of CNNs in Self-Supervised Learning: Larger Kernel Size might be All You Need Tianjin Huang, Tianlong Chen, Zhangyang Wang, Shiwei Liu Vision Transformers have been rapidly rising in computer vision thanks to their outstanding scaling trends, and gradually replacing convolutional neural networks (CNNs). Recent works on self-supervised learning (SSL) introduce siamese pre-training tasks, on which Transformer backbones continue to demonstrate ever stronger results than CNNs. People come to believe that Transformers or self-attention modules are inherently more suitable than CNNs in the context of SSL. However, it is noteworthy that most if not all prior arts of SSL with CNNs chose the standard ResNets as their backbones, whose architecture effectiveness is known to already lag behind advanced Vision Transformers. Therefore, it remains unclear whether the self-attention operation is crucial for the recent advances in SSL - or CNNs can deliver the same excellence with more advanced designs, too? Can we close the SSL performance gap between Transformers and CNNs? To answer these intriguing questions, we apply self-supervised pre-training to the recently proposed, stronger larger-kernel CNN architecture and conduct an apple-to-apple comparison with Transformers, in their SSL performance. Our results show that we are able to build pure CNN SSL architectures that perform on par with or better than the best SSL-trained Transformers, by just scaling up convolutional kernel sizes besides other small tweaks. Impressively, when transferring to the downstream tasks of MS COCO detection and segmentation, our SSL pre-trained CNN model (trained in 100 epochs) achieves the same good performance as the 300-epoch pre-trained Transformer counterpart. We hope this work can help to better understand what is essential (or not) for self-supervised learning backbones. This paper investigates whether the recent success of large-kernel CNNs in supervised learning can be translated to self-supervised learning (SSL), challenging the notion that Transformers are inherently superior for SSL. The dominance of Transformers in SSL has led to a belief that self-attention is crucial, neglecting the potential of advanced CNN architectures. The authors adapt the ConvNeXt architecture for SSL by adding BatchNorm layers after depthwise convolutions and scaling up kernel sizes. They compare this modified CNN, dubbed BC-SSL, with state-of-the-art SSL-trained Transformers (ViT and Swin) using the DINO framework. BC-SSL achieves comparable or better performance than Swin Transformers on ImageNet classification with linear probe and k-NN evaluation, while having faster inference throughput. BC-SSL exhibits significant performance gains on downstream tasks like object detection and segmentation on MS COCO, outperforming Swin Transformers trained for the same number of epochs. BC-SSL demonstrates increasing robustness to distribution shifts with larger kernel sizes, surpassing both ResNet and Swin Transformer in robustness benchmarks. The benefits of increasing kernel size in BC-SSL seem to saturate at 9x9; exploring larger kernels with techniques like structure re-parameterization or sparsity is left for future work. The study focuses on ConvNeXt; exploring other large-kernel CNN architectures like RepLKNet and SLaK in SSL could be beneficial. self-supervised learning, convolutional neural networks, vision transformers, large kernels, robustness
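As a sketch of the architectural tweak the summary describes, the block below enlarges the depthwise kernel of a ConvNeXt-style block and inserts a BatchNorm right after the depthwise convolution. The exact layer ordering, default kernel size, and normalization choices of BC-SSL may differ; this is an illustrative PyTorch module, not the authors' code.

```python
# Illustrative PyTorch module: a ConvNeXt-style block with an enlarged
# depthwise kernel and a BatchNorm inserted after the depthwise convolution,
# as the summary above describes. BC-SSL's exact layer ordering and
# normalization choices may differ.
import torch
import torch.nn as nn

class LargeKernelBlock(nn.Module):
    def __init__(self, dim, kernel_size=9, expansion=4):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.bn = nn.BatchNorm2d(dim)            # added after the depthwise conv
        self.pwconv1 = nn.Conv2d(dim, expansion * dim, 1)
        self.act = nn.GELU()
        self.pwconv2 = nn.Conv2d(expansion * dim, dim, 1)

    def forward(self, x):
        residual = x
        x = self.bn(self.dwconv(x))              # large-kernel spatial mixing
        x = self.pwconv2(self.act(self.pwconv1(x)))
        return x + residual
```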
2312.05664 Report CoGS: Controllable Gaussian Splatting Heng Yu, Joel Julin, Zoltán Á. Milacski, Koichiro Niinuma, László A. Jeni Capturing and re-animating the 3D structure of articulated objects present significant barriers. On one hand, methods requiring extensively calibrated multi-view setups are prohibitively complex and resource-intensive, limiting their practical applicability. On the other hand, while single-camera Neural Radiance Fields (NeRFs) offer a more streamlined approach, they have excessive training and rendering costs. 3D Gaussian Splatting would be a suitable alternative but for two reasons. Firstly, existing methods for 3D dynamic Gaussians require synchronized multi-view cameras, and secondly, the lack of controllability in dynamic scenarios. We present CoGS, a method for Controllable Gaussian Splatting, that enables the direct manipulation of scene elements, offering real-time control of dynamic scenes without the prerequisite of pre-computing control signals. We evaluated CoGS using both synthetic and real-world datasets that include dynamic objects that differ in degree of difficulty. In our evaluations, CoGS consistently outperformed existing dynamic and controllable neural representations in terms of visual fidelity. Presents CoGS, a method for Controllable Gaussian Splatting that allows direct manipulation of dynamic scenes captured by a monocular camera without pre-computed control signals. Addresses limitations of NeRFs (computational cost, implicit representation) by using explicit 3D Gaussian representations for efficient rendering and manipulation of dynamic scenes. Extends 3D Gaussian Splatting to dynamic scenes by learning deformation fields for Gaussian parameters and introduces control by: 1) generating 3D masks, 2) extracting control signals from Gaussian trajectories, and 3) re-aligning these signals for manipulation. Outperforms existing dynamic and controllable neural representations in visual fidelity on synthetic and real-world datasets. Successfully manipulates various dynamic scenes including faces, toy cars, and animated objects. Demonstrates real-time control capabilities without relying on pre-defined control signals. Faces challenges with highly reflective objects and large-scale non-rigid motion. Current control signal extraction using PCA may not generalize to highly complex movements. gaussian splatting, dynamic scene representation, controllable animation, monocular vision, 3d reconstruction
2312.05616 Report Iterative Token Evaluation and Refinement for Real-World Super-Resolution Chaofeng Chen, Shangchen Zhou, Liang Liao, Haoning Wu, Wenxiu Sun, Qiong Yan, Weisi Lin Real-world image super-resolution (RWSR) is a long-standing problem as low-quality (LQ) images often have complex and unidentified degradations. Existing methods such as Generative Adversarial Networks (GANs) or continuous diffusion models present their own issues including GANs being difficult to train while continuous diffusion models requiring numerous inference steps. In this paper, we propose an Iterative Token Evaluation and Refinement (ITER) framework for RWSR, which utilizes a discrete diffusion model operating in the discrete token representation space, i.e., indexes of features extracted from a VQGAN codebook pre-trained with high-quality (HQ) images. We show that ITER is easier to train than GANs and more efficient than continuous diffusion models. Specifically, we divide RWSR into two sub-tasks, i.e., distortion removal and texture generation. Distortion removal involves simple HQ token prediction with LQ images, while texture generation uses a discrete diffusion model to iteratively refine the distortion removal output with a token refinement network. In particular, we propose to include a token evaluation network in the discrete diffusion process. It learns to evaluate which tokens are good restorations and helps to improve the iterative refinement results. Moreover, the evaluation network can first check status of the distortion removal output and then adaptively select total refinement steps needed, thereby maintaining a good balance between distortion removal and texture generation. Extensive experimental results show that ITER is easy to train and performs well within just 8 iterative steps. Our codes will be available publicly. This paper proposes ITER, an Iterative Token Evaluation and Refinement framework for Real-World Super-Resolution (RWSR) that operates in the discrete token representation space. RWSR is challenging due to complex, unidentified degradations in low-quality images. Existing GAN-based methods are difficult to train, while diffusion models require many inference steps. ITER addresses these limitations by being easier to train than GANs and more efficient than diffusion models. The method divides RWSR into distortion removal and texture generation. It uses a pre-trained VQGAN codebook, a distortion removal encoder, and a conditioned discrete diffusion model with a novel token evaluation block for iterative refinement. ITER achieves state-of-the-art performance on real-world benchmarks, outperforming both GAN and diffusion-based methods. The iterative refinement with token evaluation effectively generates realistic textures and avoids local propagation problems. The adaptive inference strategy balances distortion removal and texture generation based on initial restoration quality and allows user control over texture strength through a threshold. The performance of ITER is limited by the reconstruction quality of the pre-trained VQGAN. Future work could explore faster architectures for the token evaluation and refinement networks to further improve inference speed. image super-resolution, real-world super-resolution, discrete diffusion model, token evaluation, vqgan
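The evaluate-then-refine loop can be sketched as follows: predict an initial set of HQ tokens, score each token with the evaluation network, keep the confident ones, let the refinement network re-propose the rest, and exit early once every token passes. The function names, signatures, and the thresholding rule below are assumptions for illustration.

```python
# Rough sketch of an evaluate-then-refine token loop in the spirit of ITER.
# `remove_distortion`, `refine`, `evaluate`, and `decoder` stand for the
# distortion-removal network, token refinement network, token evaluation
# network, and VQGAN decoder; names, signatures, and the thresholding rule
# are assumptions.
import torch

def iter_restore(lq_image, remove_distortion, refine, evaluate, decoder,
                 max_steps=8, keep_thresh=0.5):
    tokens = remove_distortion(lq_image)            # initial HQ-token prediction
    for step in range(max_steps):
        scores = evaluate(tokens, lq_image)         # per-token restoration quality
        keep = scores > keep_thresh
        if keep.all():                              # adaptive early exit
            break
        proposals = refine(tokens, lq_image, step)  # refined token proposals
        tokens = torch.where(keep, tokens, proposals)
    return decoder(tokens)
```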
2312.05541 Report DPoser: Diffusion Model as Robust 3D Human Pose Prior Junzhe Lu, Jing Lin, Hongkun Dou, Ailing Zeng, Yue Deng, Yulun Zhang, Haoqian Wang This work targets to construct a robust human pose prior. However, it remains a persistent challenge due to biomechanical constraints and diverse human movements. Traditional priors like VAEs and NDFs often exhibit shortcomings in realism and generalization, notably with unseen noisy poses. To address these issues, we introduce DPoser, a robust and versatile human pose prior built upon diffusion models. DPoser regards various pose-centric tasks as inverse problems and employs variational diffusion sampling for efficient solving. Accordingly, designed with optimization frameworks, DPoser seamlessly benefits human mesh recovery, pose generation, pose completion, and motion denoising tasks. Furthermore, due to the disparity between the articulated poses and structured images, we propose truncated timestep scheduling to enhance the effectiveness of DPoser. Our approach demonstrates considerable enhancements over common uniform scheduling used in image domains, boasting improvements of 5.4%, 17.2%, and 3.8% across human mesh recovery, pose completion, and motion denoising, respectively. Comprehensive experiments demonstrate the superiority of DPoser over existing state-of-the-art pose priors across multiple tasks. Presents DPoser, a novel human pose prior built upon diffusion models that achieves state-of-the-art performance across diverse pose-related tasks. Existing human pose priors struggle with realism and generalization, especially for unseen or noisy poses, limiting their effectiveness in real-world applications. DPoser leverages variational diffusion sampling to integrate diffusion prior within optimization frameworks for tasks like human mesh recovery, pose generation, pose completion, and motion denoising. It also introduces a truncated timestep scheduling strategy tailored for the characteristics of pose data. DPoser generates more realistic and diverse human poses compared to previous state-of-the-art methods. In human mesh recovery, DPoser outperforms existing priors even when fitting from scratch. For pose completion, DPoser consistently shows superior performance under various occlusion scenarios, effectively reconstructing plausible full 3D poses from partial observations. DPoser, relying on variational inference, might exhibit mode-seeking behavior, limiting solution diversity. Future work could explore particle-based variational inference or other advanced diffusion-based solvers to enhance solution diversity and handle more complex inverse problems with unknown parameters in the measurement operator. human pose prior, diffusion models, variational diffusion sampling, truncated timestep scheduling, human mesh recovery
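The truncated timestep scheduling can be illustrated by contrasting a uniform schedule over the full diffusion time axis with one restricted to a sub-range; which portion of the axis is kept and the `keep_ratio` value below are assumptions for illustration, not DPoser's exact schedule.

```python
# Minimal sketch contrasting a uniform timestep schedule with a truncated one
# for the test-time optimization. Which portion of the time axis is kept and
# the keep_ratio value are assumptions for illustration, not DPoser's exact
# schedule.
import torch

def uniform_schedule(num_steps, T):
    return torch.linspace(T - 1, 0, num_steps).long()

def truncated_schedule(num_steps, T, keep_ratio=0.25):
    t_max = int(T * keep_ratio)                  # restrict to a sub-range of timesteps
    return torch.linspace(t_max - 1, 0, num_steps).long()
```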
2312.05525 Report You Only Learn One Query: Learning Unified Human Query for Single-Stage Multi-Person Multi-Task Human-Centric Perception Sheng Jin, Shuhuai Li, Tong Li, Wentao Liu, Chen Qian, Ping Luo Human-centric perception (e.g., pedestrian detection, segmentation, pose estimation, and attribute analysis) is a long-standing problem for computer vision. This paper introduces a unified and versatile framework (HQNet) for single-stage multi-person multi-task human-centric perception (HCP). Our approach centers on learning a unified human query representation, denoted as Human Query, which captures intricate instance-level features for individual persons and disentangles complex multi-person scenarios. Although different HCP tasks have been well-studied individually, single-stage multi-task learning of HCP tasks has not been fully exploited in the literature due to the absence of a comprehensive benchmark dataset. To address this gap, we propose the COCO-UniHuman benchmark dataset to enable model development and comprehensive evaluation. Experimental results demonstrate the proposed method's state-of-the-art performance among multi-task HCP models and its competitive performance compared to task-specific HCP models. Moreover, our experiments underscore Human Query's adaptability to new HCP tasks, thus demonstrating its robust generalization capability. Codes and data will be publicly accessible. This paper introduces HQNet, a unified and versatile single-stage framework for multi-person multi-task human-centric perception (HCP) that centers on learning a unified human query representation (Human Query). Single-stage multi-task learning of HCP tasks has not been fully exploited due to the absence of a comprehensive benchmark dataset, hindering the development of algorithms that treat various HCP tasks as a unified problem. HQNet uses a backbone network, Transformer encoder and decoder, and task-specific heads. It leverages HumanQuery-Instance Matching and Gender-aided human Model Selection to exploit interactions between HCP tasks. HQNet achieves state-of-the-art performance on the COCO-UniHuman benchmark for various HCP tasks, including detection, segmentation, pose estimation, and attribute recognition. The learned Human Query exhibits strong transferability to novel HCP tasks such as face detection and multi-object tracking. Co-learning multiple HCP tasks within the unified framework leads to improved overall performance due to inter-task synergy. The framework is currently limited to RGB images and could be extended to video or multi-modal data. Future work can explore more comprehensive multi-task HCP scenarios. human-centric perception, unified vision model, multi-task learning, query-based learning, coco-unihuman dataset
2312.05482 Report BARET: Balanced Attention based Real image Editing driven by Target-text Inversion Yuming Qiao, Fanyi Wang, Jingwen Su, Yanhao Zhang, Yunjie Yu, Siyu Wu, Guo-Jun Qi Image editing approaches with diffusion models have been rapidly developed, yet their applicability is subject to requirements such as specific editing types (e.g., foreground or background object editing, style transfer), multiple conditions (e.g., mask, sketch, caption), and time-consuming fine-tuning of diffusion models. To alleviate these limitations and realize efficient real image editing, we propose a novel editing technique that only requires an input image and target text for various editing types, including non-rigid edits, without fine-tuning the diffusion model. Our method contains three novelties: (I) Target-text Inversion Schedule (TTIS) is designed to fine-tune the input target text embedding to achieve fast image reconstruction without an image caption and to accelerate convergence. (II) Progressive Transition Scheme applies progressive linear interpolation between the target text embedding and its fine-tuned version to generate a transition embedding for maintaining non-rigid editing capability. (III) Balanced Attention Module (BAM) balances the trade-off between textual description and image semantics. By combining the self-attention map from the reconstruction process and the cross-attention map from the transition process, the guidance of target text embeddings in the diffusion process is optimized. To demonstrate the editing capability, effectiveness, and efficiency of the proposed BARET, we conducted extensive qualitative and quantitative experiments. Moreover, results from a user study and an ablation study further demonstrate its superiority over other methods. This paper introduces BARET, a text-based real image editing technique that uses only an input image and target text for various edits, including non-rigid transformations, without requiring fine-tuning of the diffusion model. Existing methods have limitations like requiring specific editing types, multiple input conditions, or time-consuming fine-tuning. This work aims to overcome these limitations for efficient and versatile real image editing. BARET consists of three components: 1) Target-text Inversion Schedule (TTIS) for efficient image reconstruction by fine-tuning target text embeddings. 2) Progressive Transition Scheme to enhance non-rigid editing by progressively interpolating between target text and fine-tuned embeddings. 3) Balanced Attention Module (BAM) to balance original image features and non-rigid changes by leveraging self-attention and cross-attention maps. BARET outperforms existing methods in terms of text alignment, image fidelity, and efficiency, especially for complex non-rigid edits. A user study confirms BARET's superiority in visual quality, achieving higher scores compared to baseline methods. BARET demonstrates fast convergence (16s for reconstruction) compared to methods requiring diffusion model fine-tuning (10-20 minutes). The effectiveness of BARET for complex compositions or edits requiring high-level semantic understanding needs further investigation. Exploring automatic optimization of interpolation parameters for different editing tasks could enhance usability. image editing, diffusion models, text-guided editing, non-rigid transformation, attention mechanisms
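The Progressive Transition Scheme amounts to a schedule of conditioning embeddings that moves linearly from the fine-tuned target-text embedding (which reconstructs the input) toward the raw target-text embedding (which drives the edit). A minimal sketch follows, with the linear interpolation schedule as an assumption.

```python
# Minimal sketch of a progressive transition schedule: the conditioning moves
# linearly from the fine-tuned target-text embedding (faithful reconstruction)
# toward the raw target-text embedding (the desired edit). The linear schedule
# is an assumption for illustration.
import torch

def transition_embeddings(finetuned_emb, target_emb, num_steps):
    """Returns a list of blended embeddings, one per editing step."""
    out = []
    for i in range(num_steps):
        alpha = i / max(num_steps - 1, 1)        # 0 -> fine-tuned, 1 -> target
        out.append((1 - alpha) * finetuned_emb + alpha * target_emb)
    return out
```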
2312.05476 Report Exploring the Naturalness of AI-Generated Images Zijian Chen, Wei Sun, Haoning Wu, Zicheng Zhang, Jun Jia, Zhongpeng Ji, Fengyu Sun, Shangling Jui, Xiongkuo Min, Guangtao Zhai, Wenjun Zhang The proliferation of Artificial Intelligence-Generated Images (AGIs) has greatly expanded the Image Naturalness Assessment (INA) problem. Different from early definitions that mainly focus on tone-mapped images with limited distortions (e.g., exposure, contrast, and color reproduction), INA on AI-generated images is especially challenging as it has more diverse contents and could be affected by factors from multiple perspectives, including low-level technical distortions and high-level rationality distortions. In this paper, we take the first step to benchmark and assess the visual naturalness of AI-generated images. First, we construct the AI-Generated Image Naturalness (AGIN) database by conducting a large-scale subjective study to collect human opinions on the overall naturalness as well as perceptions from technical and rationality perspectives. AGIN verifies that naturalness is universally and disparately affected by technical and rationality distortions. Second, we propose the Joint Objective Image Naturalness evaluaTor (JOINT), to automatically predict the naturalness of AGIs that aligns human ratings. Specifically, JOINT imitates human reasoning in naturalness evaluation by jointly learning both technical and rationality features. We demonstrate that JOINT significantly outperforms baselines for providing more subjectively consistent results on naturalness assessment. This paper introduces AGIN, the first database for AI-generated image naturalness assessment, and proposes JOINT, an objective naturalness evaluator that jointly learns technical and rationality features. Evaluating the naturalness of AI-generated images is crucial as they become increasingly prevalent, and traditional IQA methods fall short in addressing the diverse contents and rationality factors involved. AGIN was constructed through a large-scale subjective study collecting human opinions on technical and rationality perspectives, and JOINT uses a two-branch architecture mimicking human naturalness reasoning. AGIN reveals that naturalness is disparately affected by both technical and rationality distortions. The impact of factors within these two perspectives on naturalness varies significantly. JOINT significantly outperforms baselines, demonstrating the effectiveness of joint learning in image naturalness evaluation. The database is limited to five generative tasks, and future work can expand to more tasks and modalities. The current objective model, while effective, can be further improved by exploring more sophisticated architectures and training strategies. ai-generated images, image naturalness assessment, database, subjective evaluation, deep learning
2312.05390 Report NoiseCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions in Diffusion Models Yusuf Dalva, Pinar Yanardag Generative models have been very popular in recent years for their image generation capabilities. GAN-based models are highly regarded for their disentangled latent space, which is a key feature contributing to their success in controlled image editing. On the other hand, diffusion models have emerged as powerful tools for generating high-quality images. However, the latent space of diffusion models is not as thoroughly explored or understood. Existing methods that aim to explore the latent space of diffusion models usually rely on text prompts to pinpoint specific semantics. However, this approach may be restrictive in areas such as art, fashion, or specialized fields like medicine, where suitable text prompts might not be available or easy to conceive, thus limiting the scope of existing work. In this paper, we propose an unsupervised method to discover latent semantics in text-to-image diffusion models without relying on text prompts. Our method takes a small set of unlabeled images from specific domains, such as faces or cats, and a pre-trained diffusion model, and discovers diverse semantics in an unsupervised fashion using a contrastive learning objective. Moreover, the learned directions can be applied simultaneously, either within the same domain (such as various types of facial edits) or across different domains (such as applying cat and face edits within the same image) without interfering with each other. Our extensive experiments show that our method achieves highly disentangled edits, outperforming existing diffusion-based and GAN-based latent space editing approaches. This paper introduces NoiseCLR, an unsupervised contrastive learning method for discovering interpretable semantic directions within the latent space of pre-trained text-to-image diffusion models like Stable Diffusion. This is important because it allows for disentangled image editing in diffusion models without relying on text prompts, which can be limiting in domains where suitable prompts are difficult to formulate. NoiseCLR leverages a contrastive learning objective to learn latent directions by encouraging similarity between edits made by the same direction while repelling edits from different directions. It operates directly on the noise estimations of the diffusion model. NoiseCLR successfully discovers a variety of disentangled directions across different domains (faces, cats, cars, artwork) using a single diffusion model. The method allows for intra-domain and cross-domain editing, enabling the combination of multiple edits within and across different semantic categories. Evaluations demonstrate that NoiseCLR outperforms existing diffusion-based methods and achieves results competitive with GAN-based image editing methods, both qualitatively and quantitatively. The manipulation capabilities of NoiseCLR are inherently limited by the biases present in the datasets used to train the underlying diffusion model and its associated language model. Similar to other image synthesis tools, there are ethical concerns regarding the potential misuse of NoiseCLR for malicious purposes, such as generating deepfakes. diffusion models, image editing, latent space exploration, contrastive learning, unsupervised learning
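A minimal sketch of the kind of InfoNCE-style objective the summary describes: "edit features" produced by the same learned latent direction are treated as positives and pulled together, while edits from other directions are repelled. How the edit features are computed (in the paper, from differences of the diffusion model's noise predictions) is abstracted away here; shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def direction_contrastive_loss(edit_feats, direction_ids, temperature=0.1):
    """InfoNCE-style loss: samples edited by the same direction are positives.

    edit_feats:    (N, D) features of edited noise predictions (illustrative).
    direction_ids: (N,) index of the latent direction used for each sample.
    """
    z = F.normalize(edit_feats, dim=-1)
    sim = z @ z.t() / temperature                        # (N, N) cosine similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool)
    sim = sim.masked_fill(self_mask, float("-inf"))      # exclude self-similarity
    pos = direction_ids.unsqueeze(0) == direction_ids.unsqueeze(1)
    pos = pos & ~self_mask
    # log-softmax over all other samples, averaged over positive pairs
    log_prob = F.log_softmax(sim, dim=1)
    return -(log_prob[pos]).mean()

feats = torch.randn(16, 128)
ids = torch.randint(0, 4, (16,))
print(direction_contrastive_loss(feats, ids).item())
```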
2312.05295 Report Disentangled Clothed Avatar Generation from Text Descriptions Jionghao Wang, Yuan Liu, Zhiyang Dou, Zhengming Yu, Yongqing Liang, Xin Li, Wenping Wang, Rong Xie, Li Song In this paper, we introduce a novel text-to-avatar generation method that separately generates the human body and the clothes and allows high-quality animation on the generated avatar. While recent advancements in text-to-avatar generation have yielded diverse human avatars from text prompts, these methods typically combine all elements (clothes, hair, and body) into a single 3D representation. Such an entangled approach poses challenges for downstream tasks like editing or animation. To overcome these limitations, we propose a novel disentangled 3D avatar representation named Sequentially Offset-SMPL (SO-SMPL), building upon the SMPL model. SO-SMPL represents the human body and clothes with two separate meshes, but associates them with offsets to ensure the physical alignment between the body and the clothes. Then, we design a Score Distillation Sampling (SDS)-based distillation framework to generate the proposed SO-SMPL representation from text prompts. In comparison with existing text-to-avatar methods, our approach not only achieves higher texture and geometry quality and better semantic alignment with text prompts, but also significantly improves the visual quality of character animation, virtual try-on, and avatar editing. Our project page is at https://shanemankiw.github.io/SO-SMPL/. This paper presents SO-SMPL, a novel method for generating disentangled 3D human avatars with clothes from text prompts. Disentangling clothes and body in 3D avatar generation allows for more realistic animation, easier editing, and virtual try-on applications. SO-SMPL represents human body and clothes as separate meshes with offsets to ensure alignment, leveraging score distillation sampling and a two-stage optimization process for generation. Achieves higher texture and geometry quality compared to previous text-to-avatar methods. Generates separate clothes meshes that can be fitted to different body shapes. Enables realistic animation by simulating clothes and body motions separately. Limited to clothing types compatible with the SMPL-X topology, excluding items like skirts and dresses. Generated clothes lack sewing patterns and physical properties. 3d avatar generation, text-to-3d, disentangled representation, score distillation sampling, virtual try-on
2312.05288 Report MotionCrafter: One-Shot Motion Customization of Diffusion Models Yuxin Zhang, Fan Tang, Nisha Huang, Haibin Huang, Chongyang Ma, Weiming Dong, Changsheng Xu The essence of a video lies in its dynamic motions, including character actions, object movements, and camera movements. While text-to-video generative diffusion models have recently advanced in creating diverse contents, controlling specific motions through text prompts remains a significant challenge. A primary issue is the coupling of appearance and motion, often leading to overfitting on appearance. To tackle this challenge, we introduce MotionCrafter, a novel one-shot instance-guided motion customization method. MotionCrafter employs a parallel spatial-temporal architecture that injects the reference motion into the temporal component of the base model, while the spatial module is independently adjusted for character or style control. To enhance the disentanglement of motion and appearance, we propose an innovative dual-branch motion disentanglement approach, comprising a motion disentanglement loss and an appearance prior enhancement strategy. During training, a frozen base model provides appearance normalization, effectively separating appearance from motion and thereby preserving diversity. Comprehensive quantitative and qualitative experiments, along with user preference tests, demonstrate that MotionCrafter can successfully integrate dynamic motions while preserving the coherence and quality of the base model with a wide range of appearance generation capabilities. Project page: https://zyxelsa.github.io/homepage-motioncrafter. Codes are available at https://github.com/zyxElsa/MotionCrafter. Introduces MotionCrafter, a one-shot instance-guided method for customizing dynamic motions in text-to-video generation. Addresses the challenge of controlling specific motions in generated videos, which current text-to-video models struggle with, particularly in decoupling motion from appearance. Employs a parallel spatial-temporal architecture to fine-tune pre-trained text-to-video models, separating appearance and motion learning. Introduces a dual-branch motion disentanglement approach using a frozen base model as an appearance prior and a motion disentanglement loss to separate motion from the reference video's appearance. Successfully integrates dynamic motions from reference videos into generated videos with different appearances based on text prompts. Outperforms state-of-the-art methods in qualitative and quantitative evaluations, demonstrating superior motion fidelity and appearance diversity. Receives higher user preference scores in a user study, particularly for motion accuracy and visual quality. Limited ability to maintain coherence for complex actions spanning many frames. Struggles to capture detailed dynamics in group actions due to inherent motion complexity and limitations in current text-to-video models. text-to-video generation, motion customization, video editing, diffusion models, motion disentanglement
2312.05284 Report SlimSAM: 0.1% Data Makes Segment Anything Slim Zigeng Chen, Gongfan Fang, Xinyin Ma, Xinchao Wang Current approaches for compressing the Segment Anything Model (SAM) yield commendable results, yet necessitate extensive data to train a new network from scratch. Employing conventional pruning techniques can remarkably reduce data requirements but would suffer from a degradation in performance. To address this challenging trade-off, we introduce SlimSAM, a novel data-efficient SAM compression method that achieves superior performance with substantially less training data. The essence of SlimSAM is encapsulated in the alternate slimming framework, which effectively enhances knowledge inheritance under severely limited training data availability and an exceptional pruning ratio. Diverging from prior techniques, our framework progressively compresses the model by alternately pruning and distilling distinct, decoupled sub-structures. Disturbed Taylor pruning is also proposed to address the misalignment between the pruning objective and training target, thereby boosting the post-distillation after pruning. SlimSAM yields significant performance improvements while demanding over 10 times less training data than any other existing compression method. Even when compared to the original SAM, SlimSAM approaches the original's performance while reducing parameter counts to merely 1.4% (9.1M), MACs to 0.8% (23G), and requiring only 0.1% (10k) of the SAM training data. The code is available at http://github.com/czg1225/SlimSAM. SlimSAM, a data-efficient compression method for the Segment Anything Model (SAM), which achieves high performance with minimal training data by reusing pre-trained weights and employing a novel modernized pruning-distillation procedure. SAM's large size and computational demands make it unsuitable for resource-constrained devices, hindering its wider application. Existing compression methods require extensive data to train from scratch or suffer performance degradation with conventional pruning. SlimSAM leverages an alternate slimming framework, alternately pruning and distilling decoupled embedding and bottleneck sub-structures. It also introduces disturbed Taylor pruning, a label-free importance estimation method aligning pruning objectives with distillation targets. SlimSAM achieves performance approaching that of the original SAM-H with 1.4% of the parameters and 0.8% of the MACs, using only 0.1% of the training data. It outperforms other SAM compression techniques in terms of performance, efficiency, and training data requirements. The method consistently surpasses other structural pruning methods, especially at high pruning ratios. While mitigating the need for large training datasets, more data can further enhance performance, particularly at higher pruning rates. The effectiveness of global pruning in bottleneck compression is highly dependent on the chosen importance normalization method. segment anything, model compression, data-efficient, model pruning, knowledge distillation
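SlimSAM's pruning builds on first-order Taylor importance, i.e. scoring a channel by how much the loss would change if it were removed, estimated from weight times gradient. The sketch below shows that generic estimate computed against a feature-distillation loss; the "disturbed" part of the paper's method (perturbing features so pruning stays aligned with the distillation target) is only hinted at in the comments, and all names are illustrative, not the released code.

```python
import torch
import torch.nn as nn

def taylor_channel_importance(layer: nn.Conv2d, loss: torch.Tensor):
    """First-order Taylor importance per output channel: |sum(weight * grad)|.

    `loss` should already reflect the objective you prune against; in SlimSAM this is a
    distillation-style loss on (slightly disturbed) features rather than a label loss,
    which keeps pruning aligned with the post-pruning distillation target.
    """
    loss.backward(retain_graph=True)
    w, g = layer.weight, layer.weight.grad               # (C_out, C_in, k, k)
    return (w * g).sum(dim=(1, 2, 3)).abs()              # one score per output channel

# Toy example with a hypothetical feature-distillation loss.
conv = nn.Conv2d(3, 8, 3, padding=1)
x = torch.randn(2, 3, 32, 32)
student_feat = conv(x)
teacher_feat = torch.randn_like(student_feat)            # stands in for frozen SAM features
loss = torch.nn.functional.mse_loss(student_feat, teacher_feat)
scores = taylor_channel_importance(conv, loss)
print("least important channels:", scores.argsort()[:3].tolist())
```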
2312.05283 Report Nuvo: Neural UV Mapping for Unruly 3D Representations Pratul P. Srinivasan, Stephan J. Garbin, Dor Verbin, Jonathan T. Barron, Ben Mildenhall Existing UV mapping algorithms are designed to operate on well-behaved meshes, instead of the geometry representations produced by state-of-the-art 3D reconstruction and generation techniques. As such, applying these methods to the volume densities recovered by neural radiance fields and related techniques (or meshes triangulated from such fields) results in texture atlases that are too fragmented to be useful for tasks such as view synthesis or appearance editing. We present a UV mapping method designed to operate on geometry produced by 3D reconstruction and generation techniques. Instead of computing a mapping defined on a mesh's vertices, our method Nuvo uses a neural field to represent a continuous UV mapping, and optimizes it to be a valid and well-behaved mapping for just the set of visible points, i.e. only points that affect the scene's appearance. We show that our model is robust to the challenges posed by ill-behaved geometry, and that it produces editable UV mappings that can represent detailed appearance. This paper presents a novel UV mapping method called Nuvo, specifically designed to handle the complex geometry produced by modern 3D reconstruction and generation techniques like NeRF. Existing UV mapping algorithms often generate highly fragmented texture atlases when applied to the non-smooth and intricate geometry generated by these techniques, making them unsuitable for tasks such as view synthesis and appearance editing. Nuvo utilizes neural fields to represent a continuous UV mapping. It optimizes this mapping by minimizing a set of losses, encouraging bijectivity, low distortion, and meaningful chart assignment solely for the visible points in the scene. Nuvo effectively represents detailed surface appearance, achieving comparable or superior view synthesis results compared to directly optimizing appearance on mesh vertices. Nuvo generates UV mappings that are competitive with state-of-the-art methods on standard meshes and significantly outperforms all baselines on challenging geometry extracted from NeRF reconstructions. The generated UV maps are suitable for integration into standard graphics pipelines, as demonstrated by baking the optimized coordinates onto meshes with minimal performance loss. While Nuvo's point sampling approach provides flexibility, it poses challenges in guaranteeing global bijectivity and distortion minimization. The current Nuvo lacks interactive features, limiting user control over aspects like cut placement and region-specific distortion minimization. uv mapping, neural radiance fields (nerf), 3d reconstruction, appearance editing, texture atlas
2312.05251 Report Reconstructing Hands in 3D with Transformers Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, Jitendra Malik We present an approach that can reconstruct hands in 3D from monocular input. Our approach for Hand Mesh Recovery, HaMeR, follows a fully transformer-based architecture and can analyze hands with significantly increased accuracy and robustness compared to previous work. The key to HaMeR's success lies in scaling up both the data used for training and the capacity of the deep network for hand reconstruction. For training data, we combine multiple datasets that contain 2D or 3D hand annotations. For the deep model, we use a large scale Vision Transformer architecture. Our final model consistently outperforms the previous baselines on popular 3D hand pose benchmarks. To further evaluate the effect of our design in non-controlled settings, we annotate existing in-the-wild datasets with 2D hand keypoint annotations. On this newly collected dataset of annotations, HInt, we demonstrate significant improvements over existing baselines. We make our code, data and models available on the project website: https://geopavlakos.github.io/hamer/. The paper proposes HaMeR, a fully transformer-based approach for 3D hand mesh recovery from monocular images or videos, achieving improved accuracy and robustness by leveraging a large vision transformer model and extensive training data. Accurately reconstructing 3D hand meshes from monocular input is crucial for various applications like robotics, action recognition, and sign language understanding. This paper addresses the need for more robust and accurate hand mesh recovery models, particularly in challenging in-the-wild scenarios. HaMeR utilizes a vision transformer (ViT) architecture pre-trained on large-scale image data and fine-tuned on a combination of existing datasets with 2D or 3D hand annotations, resulting in 2.7M training examples. The model regresses MANO hand model parameters and camera parameters, supervised by 2D and 3D losses, along with adversarial losses to promote natural hand poses. HaMeR achieves state-of-the-art results on standard 3D hand pose benchmarks (FreiHAND and HO3Dv2), outperforming previous methods in most metrics. Evaluation on the newly introduced 'Hand Interactions in the wild' (HInt) dataset, comprising challenging in-the-wild images annotated with 2D keypoints and occlusion labels, shows significant improvements over baselines (2-3x better PCK@0.05). Ablation studies confirm the importance of both large-scale training data and the high-capacity ViT architecture for HaMeR's performance. Limited evaluation on temporal aspects of hand motion, as HaMeR is a single-frame approach. Future work includes extending the approach to handle hand-object interaction more explicitly and exploring the use of temporal information for video-based reconstruction. 3d hand mesh reconstruction, monocular vision, vision transformer, hand pose estimation, in-the-wild datasets
2312.05210 Report IntrinsicAvatar: Physically Based Inverse Rendering of Dynamic Humans from Monocular Videos via Explicit Ray Tracing Shaofei Wang, Božidar Antić, Andreas Geiger, Siyu Tang We present IntrinsicAvatar, a novel approach to recovering the intrinsic properties of clothed human avatars including geometry, albedo, material, and environment lighting from only monocular videos. Recent advancements in human-based neural rendering have enabled high-quality geometry and appearance reconstruction of clothed humans from just monocular videos. However, these methods bake intrinsic properties such as albedo, material, and environment lighting into a single entangled neural representation. On the other hand, only a handful of works tackle the problem of estimating geometry and disentangled appearance properties of clothed humans from monocular videos. They usually achieve limited quality and disentanglement due to approximations of secondary shading effects via learned MLPs. In this work, we propose to model secondary shading effects explicitly via Monte-Carlo ray tracing. We model the rendering process of clothed humans as a volumetric scattering process, and combine ray tracing with body articulation. Our approach can recover high-quality geometry, albedo, material, and lighting properties of clothed humans from a single monocular video, without requiring supervised pre-training using ground truth materials. Furthermore, since we explicitly model the volumetric scattering process and ray tracing, our model naturally generalizes to novel poses, enabling animation of the reconstructed avatar in novel lighting conditions. IntrinsicAvatar, a novel approach to recover intrinsic properties of clothed human avatars (geometry, albedo, material, environment lighting) from monocular videos using volumetric scattering and Monte-Carlo ray tracing. Existing methods for reconstructing clothed humans from monocular videos entangle intrinsic properties in a single neural representation, limiting editing capabilities and relighting under novel conditions. This work aims to disentangle these properties. The method models clothed humans as articulated neural radiance fields, using iNGP with SDF for geometry and separate MLPs for radiance, albedo, and material. It employs volumetric scattering with Monte-Carlo ray tracing in canonical space for physically based inverse rendering, enabling relighting for unseen poses. Achieves high-quality reconstruction of clothed human avatars with disentangled intrinsic properties from monocular videos. Significantly outperforms the state-of-the-art method (Relighting 4D) both qualitatively and quantitatively. Enables realistic rendering of the learned avatars under novel lighting conditions and poses. Does not consider pose-dependent non-rigid motion, limiting applicability to more dynamic scenarios. Relatively slow inference time (around 20 seconds per image) inverse rendering, neural radiance fields, human reconstruction, volumetric scattering, monte-carlo ray tracing
2312.05208 Report ControlRoom3D: Room Generation using Semantic Proxy Rooms Jonas Schult, Sam Tsai, Lukas Höllein, Bichen Wu, Jialiang Wang, Chih-Yao Ma, Kunpeng Li, Xiaofang Wang, Felix Wimbauer, Zijian He, Peizhao Zhang, Bastian Leibe, Peter Vajda, Ji Hou Manually creating 3D environments for AR/VR applications is a complex process requiring expert knowledge in 3D modeling software. Pioneering works facilitate this process by generating room meshes conditioned on textual style descriptions. Yet, many of these automatically generated 3D meshes do not adhere to typical room layouts, compromising their plausibility, e.g., by placing several beds in one bedroom. To address these challenges, we present ControlRoom3D, a novel method to generate high-quality room meshes. Central to our approach is a user-defined 3D semantic proxy room that outlines a rough room layout based on semantic bounding boxes and a textual description of the overall room style. Our key insight is that when rendered to 2D, this 3D representation provides valuable geometric and semantic information to control powerful 2D models to generate 3D consistent textures and geometry that aligns well with the proxy room. Backed up by an extensive study including quantitative metrics and qualitative user evaluations, our method generates diverse and globally plausible 3D room meshes, thus empowering users to design 3D rooms effortlessly without specialized knowledge. ControlRoom3D is a novel method that generates diverse and globally plausible 3D room meshes from user-defined 3D semantic layouts and text prompts describing the desired room style. Manually creating 3D environments for AR/VR is difficult and requires expert knowledge. Existing methods often result in implausible layouts. ControlRoom3D addresses these limitations by leveraging user-defined layouts to guide the generation process, leading to higher quality and more plausible 3D rooms. ControlRoom3D utilizes a 3D semantic proxy room defined by bounding boxes and text prompts. It employs: (1) Guided Panorama Generation for style consistency, (2) Geometry Alignment to match predicted depth with the proxy room, (3) Mesh Cleaning to remove low-quality regions, and (4) Mesh Completion to fill in missing areas with new content. Significantly outperforms baseline methods in generating plausible and user-preferred 3D room layouts. Fine-tuning adapters on a dataset with rendered 3D bounding boxes significantly improves object generation within defined layouts. The Geometry Alignment module is crucial for ensuring generated objects align correctly with the proxy room. Limited to generating indoor room-scale environments. Relies on a pre-defined set of semantic classes for the proxy room. 3d scene generation, text-to-3d, semantic scene understanding, layout-aware generation, generative ai
2312.05133 Report GIR: 3D Gaussian Inverse Rendering for Relightable Scene Factorization Yahao Shi, Yanmin Wu, Chenming Wu, Xing Liu, Chen Zhao, Haocheng Feng, Jingtuo Liu, Liangjun Zhang, Jian Zhang, Bin Zhou, Errui Ding, Jingdong Wang This paper presents GIR, a 3D Gaussian Inverse Rendering method for relightable scene factorization. Compared to existing methods leveraging discrete meshes or neural implicit fields for inverse rendering, our method utilizes 3D Gaussians to estimate the material properties, illumination, and geometry of an object from multi-view images. Our study is motivated by the evidence showing that 3D Gaussian is a more promising backbone than neural fields in terms of performance, versatility, and efficiency. In this paper, we aim to answer the question: ``How can 3D Gaussian be applied to improve the performance of inverse rendering?'' To address the complexity of estimating normals based on discrete and often in-homogeneous distributed 3D Gaussian representations, we proposed an efficient self-regularization method that facilitates the modeling of surface normals without the need for additional supervision. To reconstruct indirect illumination, we propose an approach that simulates ray tracing. Extensive experiments demonstrate our proposed GIR's superior performance over existing methods across multiple tasks on a variety of widely used datasets in inverse rendering. This substantiates its efficacy and broad applicability, highlighting its potential as an influential tool in relighting and reconstruction. Project page: https://3dgir.github.io This paper introduces GIR, a novel inverse rendering framework based on 3D Gaussian Splatting (3DGS) that estimates material properties, geometry, and illumination from multi-view images in high fidelity. Inverse rendering is a fundamental computer vision problem with applications in various fields such as scene understanding, image manipulation, AR/VR, etc. The proposed method, GIR, leverages the strengths of 3DGS, a promising alternative to NeRFs, for high-performance inverse rendering. The paper proposes a novel inverse rendering framework leveraging 3D Gaussian Splatting. The method introduces a self-regularization method for accurate surface normal estimation and an approximate ray tracing approach for efficient indirect illumination reconstruction. The framework jointly optimizes for geometry, materials, and illumination using a combination of MAE, DSSIM, and smoothing losses. GIR achieves high-fidelity reconstruction of normal maps, specular and diffuse components, roughness and metallic properties, indirect illumination, and environmental maps. The proposed self-regularization method for normal estimation in 3DGS proves effective without needing additional supervision. Extensive experiments demonstrate GIR's superior performance over existing state-of-the-art methods on various benchmark datasets for relighting and novel view synthesis tasks. The indirect illumination reconstruction uses an approximate method, which can be further improved. Future work can explore extending GIR for dynamic scene modeling and content generation leveraging the versatility of 3DGS. inverse rendering, 3d gaussian splatting, relightable scene factorization, normal estimation, indirect illumination
2312.05107 Report DreaMoving: A Human Video Generation Framework based on Diffusion Models Mengyang Feng, Jinlin Liu, Kai Yu, Yuan Yao, Zheng Hui, Xiefan Guo, Xianhui Lin, Haolan Xue, Chen Shi, Xiaowen Li, Aojie Li, Xiaoyang Kang, Biwen Lei, Miaomiao Cui, Peiran Ren, Xuansong Xie In this paper, we present DreaMoving, a diffusion-based controllable video generation framework to produce high-quality customized human videos. Specifically, given target identity and posture sequences, DreaMoving can generate a video of the target identity moving or dancing anywhere driven by the posture sequences. To this end, we propose a Video ControlNet for motion-controlling and a Content Guider for identity preserving. The proposed model is easy to use and can be adapted to most stylized diffusion models to generate diverse results. The project page is available at https://dreamoving.github.io/dreamoving Presents DreaMoving, a diffusion-based controllable video generation framework that produces high-quality customized human videos based on target identity and posture sequences. Addresses challenges in human-centric video generation, particularly in character dance, where existing text-to-video models struggle with intraframe consistency, length, diversity, personalization, and controllability. Utilizes a Video ControlNet for motion control, a Content Guider for identity preservation, and incorporates motion blocks for temporal consistency. Employs a multi-stage training process including long-frame pretraining, Video ControlNet training, and expression fine-tuning. Generates high-quality, consistent videos with controlled motion based on input pose or depth sequences. Enables content control through text prompts for background and image prompts for precise human appearance guidance. Demonstrates generalization ability by generating videos in the style of unseen stylized images. Relies on accurate pose/depth estimation for optimal control. Further exploration of diverse motion control mechanisms beyond pose and depth. video generation, diffusion models, controllable generation, human-centric content, motion control
2312.05039 Report SmartMask: Context Aware High-Fidelity Mask Generation for Fine-grained Object Insertion and Layout Control Jaskirat Singh, Jianming Zhang, Qing Liu, Cameron Smith, Zhe Lin, Liang Zheng The field of generative image inpainting and object insertion has made significant progress with the recent advent of latent diffusion models. Utilizing a precise object mask can greatly enhance these applications. However, due to the challenges users encounter in creating high-fidelity masks, there is a tendency for these methods to rely on more coarse masks (e.g., bounding box) for these applications. This results in limited control and compromised background content preservation. To overcome these limitations, we introduce SmartMask, which allows any novice user to create detailed masks for precise object insertion. Combined with a ControlNet-Inpaint model, our experiments demonstrate that SmartMask achieves superior object insertion quality, preserving the background content more effectively than previous methods. Notably, unlike prior works the proposed approach can also be used even without user-mask guidance, which allows it to perform mask-free object insertion at diverse positions and scales. Furthermore, we find that when used iteratively with a novel instruction-tuning based planning model, SmartMask can be used to design detailed layouts from scratch. As compared with user-scribble based layout design, we observe that SmartMask allows for better quality outputs with layout-to-image generation methods. Project page is available at https://smartmask-gen.github.io Introduces SmartMask, a context-aware diffusion model for generating fine-grained object masks for precise insertion and layout control. Addresses limitations of coarse-mask inpainting methods which often modify background content and offer limited control over object placement and scale. Leverages semantic amodal segmentation data to train a diffusion model for predicting object masks, enabling mask-free or user-guided (bounding box, scribbles) object insertion and employs an instruction-tuning based planning model for iterative layout design. Achieves better background preservation compared to state-of-the-art inpainting methods. Allows for mask-free object insertion at diverse positions and scales. Facilitates fine-grained layout design from scratch, enabling higher quality layout-to-image generation. Reliance on semantic layouts for mask prediction may limit depth context. Training data size for SmartMask is smaller compared to typical inpainting models, potentially limiting generalizability to out-of-distribution objects. image inpainting, object insertion, layout generation, diffusion models, semantic segmentation
2312.05038 Report Prompt-In-Prompt Learning for Universal Image Restoration Zilong Li, Yiming Lei, Chenglong Ma, Junping Zhang, Hongming Shan Image restoration, which aims to retrieve and enhance degraded images, is fundamental across a wide range of applications. While conventional deep learning approaches have notably improved the image quality across various tasks, they still suffer from (i) the high storage cost needed for various task-specific models and (ii) the lack of interactivity and flexibility, hindering their wider application. Drawing inspiration from the pronounced success of prompts in both linguistic and visual domains, we propose novel Prompt-In-Prompt learning for universal image restoration, named PIP. First, we present two novel prompts, a degradation-aware prompt to encode high-level degradation knowledge and a basic restoration prompt to provide essential low-level information. Second, we devise a novel prompt-to-prompt interaction module to fuse these two prompts into a universal restoration prompt. Third, we introduce a selective prompt-to-feature interaction module to modulate the degradation-related feature. By doing so, the resultant PIP works as a plug-and-play module to enhance existing restoration models for universal image restoration. Extensive experimental results demonstrate the superior performance of PIP on multiple restoration tasks, including image denoising, deraining, dehazing, deblurring, and low-light enhancement. Remarkably, PIP is interpretable, flexible, efficient, and easy-to-use, showing promising potential for real-world applications. The code is available at https://github.com/longzilicart/pip_universal. Proposes Prompt-in-Prompt (PIP) learning, a novel plug-and-play module that enhances existing image restoration backbones for universal image restoration by integrating high-level and low-level degradation knowledge via prompts. Addresses the limitations of conventional deep learning approaches for image restoration, such as high storage cost for task-specific models and lack of interactivity, by enabling a single model to handle multiple degradation types effectively. Learns two types of prompts: degradation-aware prompts (high-level) and basic restoration prompts (low-level). These prompts are fused through a prompt-to-prompt interaction module. A selective prompt-to-feature interaction module then modulates degradation-related features based on the fused prompts. Outperforms state-of-the-art universal image restoration methods on benchmark datasets across denoising, deraining, dehazing, deblurring, and low-light enhancement. Demonstrates the effectiveness of decoupled degradation-aware prompts in improving restoration performance. Shows efficiency by achieving significant performance gains with only a slight increase in parameters and FLOPS compared to the baseline models. Slight computational overhead in training and inference compared to baseline models. Limited improvement in model generalization to unknown degradation types. image restoration, prompt learning, universal models, deep learning, computer vision
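A rough PyTorch sketch of the two-prompt idea summarized above: a degradation-aware prompt (selected softly from per-degradation embeddings) is fused with basic restoration prompts, and the fused prompt then modulates the backbone features. This is a generic illustration of prompt-to-prompt and prompt-to-feature interaction, not the released PIP module; all shapes, names, and the scale/shift modulation are assumptions.

```python
import torch
import torch.nn as nn

class PromptInPrompt(nn.Module):
    """Illustrative sketch of a two-prompt design (shapes and names are assumptions)."""
    def __init__(self, num_degradations=5, prompt_len=8, dim=64):
        super().__init__()
        # High-level degradation-aware prompts: one embedding per degradation type.
        self.degradation_prompts = nn.Parameter(torch.randn(num_degradations, dim))
        # Low-level basic restoration prompts shared across tasks.
        self.basic_prompts = nn.Parameter(torch.randn(prompt_len, dim))
        self.fuse = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.to_scale_shift = nn.Linear(dim, 2 * dim)

    def forward(self, feats, degradation_logits):
        # feats: (B, C, H, W) with C == prompt dim for this toy; degradation_logits: (B, num_degradations)
        b, c, h, w = feats.shape
        weights = degradation_logits.softmax(dim=-1)                 # soft degradation estimate
        deg = weights @ self.degradation_prompts                     # (B, dim)
        basic = self.basic_prompts.unsqueeze(0).expand(b, -1, -1)    # (B, L, dim)
        # Prompt-to-prompt interaction: the degradation prompt attends to the basic prompts.
        fused, _ = self.fuse(deg.unsqueeze(1), basic, basic)         # (B, 1, dim)
        # Prompt-to-feature interaction: simple scale/shift modulation of the features.
        scale, shift = self.to_scale_shift(fused.squeeze(1)).chunk(2, dim=-1)
        return feats * (1 + scale.view(b, c, 1, 1)) + shift.view(b, c, 1, 1)

x = torch.randn(2, 64, 32, 32)
logits = torch.randn(2, 5)
print(PromptInPrompt()(x, logits).shape)
```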
2312.04966 Report Customizing Motion in Text-to-Video Diffusion Models Joanna Materzynska, Josef Sivic, Eli Shechtman, Antonio Torralba, Richard Zhang, Bryan Russell We introduce an approach for augmenting text-to-video generation models with customized motions, extending their capabilities beyond the motions depicted in the original training data. By leveraging a few video samples demonstrating specific movements as input, our method learns and generalizes the input motion patterns for diverse, text-specified scenarios. Our contributions are threefold. First, to achieve our results, we finetune an existing text-to-video model to learn a novel mapping between the depicted motion in the input examples to a new unique token. To avoid overfitting to the new custom motion, we introduce an approach for regularization over videos. Second, by leveraging the motion priors in a pretrained model, our method can produce novel videos featuring multiple people doing the custom motion, and can invoke the motion in combination with other motions. Furthermore, our approach extends to the multimodal customization of motion and appearance of individualized subjects, enabling the generation of videos featuring unique characters and distinct motions. Third, to validate our method, we introduce an approach for quantitatively evaluating the learned custom motion and perform a systematic ablation study. We show that our method significantly outperforms prior appearance-based customization approaches when extended to the motion customization task. This paper presents a method for customizing text-to-video diffusion models by incorporating new motions from a small set of exemplar videos. Current text-to-video models are limited to motions present in their training data. This work enables these models to generate videos with user-defined motions, broadening their applicability. The method involves fine-tuning a pre-trained text-to-video model's spatial and temporal layers, using a novel video regularization technique and a sampling strategy that emphasizes motion patterns. This allows associating a unique text token with the new motion. The approach successfully customizes models with diverse motions like dancing, gestures, and camera movements. Quantitative evaluation shows significant improvement in motion accuracy compared to adapting image customization methods. The method generalizes well, enabling the generation of customized motions with new subjects, multiple people, varying timings, and in combination with other motions. The model occasionally overfits to the appearance of training videos, leading to memorization. The reliance on a pre-trained action recognition model for evaluation limits the scope to existing gesture datasets. Future work could explore alternative evaluation metrics. text-to-video generation, motion customization, diffusion models, video regularization, motion accuracy
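The fine-tuning recipe summarized above (bind the motion to a new token, train mainly the temporal layers, regularize with videos) can be sketched generically as below. This is not the authors' code: the toy backbone, the rule for which parameter names count as "temporal", and the token handling are all simplified assumptions.

```python
import torch
import torch.nn as nn

def setup_motion_finetuning(video_unet: nn.Module, text_embeddings: nn.Embedding):
    """Generic sketch of the customization setup (names are illustrative).

    Only temporal layers of the video model and one new token embedding are trained,
    so appearance layers keep the pretrained prior and the motion is bound to the token.
    """
    dim = text_embeddings.embedding_dim
    motion_token = nn.Parameter(torch.randn(dim) * 0.01)   # new learnable motion token embedding

    for p in video_unet.parameters():
        p.requires_grad = False
    trainable = [motion_token]
    for name, p in video_unet.named_parameters():
        if "temporal" in name:        # which layers are temporal depends on the actual backbone
            p.requires_grad = True
            trainable.append(p)
    return motion_token, trainable

# Toy backbone: one "spatial" and one "temporal" layer.
toy = nn.Sequential()
toy.add_module("spatial_attn", nn.Linear(8, 8))
toy.add_module("temporal_attn", nn.Linear(8, 8))
token, params = setup_motion_finetuning(toy, nn.Embedding(1000, 8))
print("trainable parameter count:", sum(p.numel() for p in params))
```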
2312.04965 Report Inversion-Free Image Editing with Natural Language Sihan Xu, Yidong Huang, Jiayi Pan, Ziqiao Ma, Joyce Chai Despite recent advances in inversion-based editing, text-guided image manipulation remains challenging for diffusion models. The primary bottlenecks include 1) the time-consuming nature of the inversion process; 2) the struggle to balance consistency with accuracy; 3) the lack of compatibility with efficient consistency sampling methods used in consistency models. To address the above issues, we start by asking ourselves if the inversion process can be eliminated for editing. We show that when the initial sample is known, a special variance schedule reduces the denoising step to the same form as the multi-step consistency sampling. We name this Denoising Diffusion Consistent Model (DDCM), and note that it implies a virtual inversion strategy without explicit inversion in sampling. We further unify the attention control mechanisms in a tuning-free framework for text-guided editing. Combining them, we present inversion-free editing (InfEdit), which allows for consistent and faithful editing for both rigid and non-rigid semantic changes, catering to intricate modifications without compromising on the image's integrity and explicit inversion. Through extensive experiments, InfEdit shows strong performance in various editing tasks and also maintains a seamless workflow (less than 3 seconds on one single A40), demonstrating the potential for real-time applications. Project Page: https://sled-group.github.io/InfEdit/ This paper proposes InfEdit, an inversion-free editing framework for consistent and faithful text-guided image manipulation in diffusion models. Text-guided image editing in diffusion models is challenging due to the limitations of inversion-based methods, including lengthy processes, trade-offs between consistency and accuracy, and incompatibility with efficient consistency sampling. The authors introduce the Denoising Diffusion Consistent Model (DDCM) that eliminates explicit inversion. They further propose Unified Attention Control (UAC), combining cross-attention and mutual self-attention for both rigid and non-rigid editing. InfEdit achieves competitive or superior performance to inversion-based methods while being significantly more efficient. Unified Attention Control (UAC) further improves InfEdit's performance in editing quality, consistency, and efficiency. InfEdit demonstrates compatibility with Latent Consistency Models (LCMs) for even faster and higher-quality image editing. The paper acknowledges potential ethical concerns regarding copyright infringement and deceptive misuse. Future work could explore mitigating inherent biases in pre-trained models used by InfEdit. image editing, diffusion models, attention mechanisms, consistency models, text-guided image manipulation
2312.04963 Report Text-to-3D Generation with Bidirectional Diffusion using both 2D and 3D priors Lihe Ding, Shaocong Dong, Zhanpeng Huang, Zibin Wang, Yiyuan Zhang, Kaixiong Gong, Dan Xu, Tianfan Xue Most 3D generation research focuses on up-projecting 2D foundation models into the 3D space, either by minimizing 2D Score Distillation Sampling (SDS) loss or fine-tuning on multi-view datasets. Without explicit 3D priors, these methods often lead to geometric anomalies and multi-view inconsistency. Recently, researchers have attempted to improve the genuineness of 3D objects by directly training on 3D datasets, albeit at the cost of low-quality texture generation due to the limited texture diversity in 3D datasets. To harness the advantages of both approaches, we propose Bidirectional Diffusion(BiDiff), a unified framework that incorporates both a 3D and a 2D diffusion process, to preserve both 3D fidelity and 2D texture richness, respectively. Moreover, as a simple combination may yield inconsistent generation results, we further bridge them with novel bidirectional guidance. In addition, our method can be used as an initialization of optimization-based models to further improve the quality of 3D model and efficiency of optimization, reducing the generation process from 3.4 hours to 20 minutes. Experimental results have shown that our model achieves high-quality, diverse, and scalable 3D generation. Project website: https://bidiff.github.io/. This paper proposes BiDiff, a novel bidirectional diffusion model for high-quality text-to-3D generation. It integrates pretrained 2D and 3D diffusion models within a unified framework with bidirectional guidance for joint 2D-3D feature learning. Existing text-to-3D methods struggle to achieve both high-quality texture (often present in 2D-based methods) and 3D consistency (often present in 3D-based methods). This paper aims to bridge this gap and enable efficient and controllable 3D generation. BiDiff utilizes a hybrid representation (SDF for 3D and multi-view images for 2D) with mutually transformable capabilities. It employs a 3D diffusion model (guided by denoised 2D images) and a 2D multi-view diffusion model (guided by rendered 3D images). Additionally, it uses outputs as initialization for optimization-based methods to further enhance quality. Achieves high-quality, diverse, and scalable 3D generation with separate control over geometry and texture. Generates more diverse and text-aligned 3D objects compared to pure optimization methods, while being significantly faster (40 seconds vs. hours). Acts as a strong initialization for optimization-based methods, improving both speed and quality, and reducing geometric errors. The resolution of the generated 3D model during the diffusion process is limited. The controllability of fine-grained geometry is limited and requires further exploration. text-to-3d generation, diffusion models, bidirectional guidance, 3d consistency, texture control
2312.04884 Report UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models Yiming Zhao, Zhouhui Lian Text-to-Image (T2I) generation methods based on diffusion models have garnered significant attention in the last few years. Although these image synthesis methods produce visually appealing results, they frequently exhibit spelling errors when rendering text within the generated images. Such errors manifest as missing, incorrect or extraneous characters, thereby severely constraining the performance of text image generation based on diffusion models. To address the aforementioned issue, this paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model (i.e., Stable Diffusion [27]). Our approach involves the design and training of a light-weight character-level text encoder, which replaces the original CLIP encoder and provides more robust text embeddings as conditional guidance. Then, we fine-tune the diffusion model using a large-scale dataset, incorporating local attention control under the supervision of character-level segmentation maps. Finally, by employing an inference stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images. Both qualitative and quantitative results demonstrate the superiority of our method to the state of the art. Furthermore, we showcase several potential applications of the proposed UDiffText, including text-centric image synthesis, scene text editing, etc. Code and model will be available at https://github.com/ZYM-PKU/UDiffText. This paper proposes UDiffText, a novel diffusion model-based method for synthesizing accurate and harmonious text within both synthetic and real-world images, addressing the text rendering challenges (e.g., spelling errors) faced by existing T2I models. Existing T2I generation models, while producing visually appealing results, often struggle with rendering accurate text within images, hindering their application in text-centric image synthesis and editing tasks. The proposed UDiffText leverages a light-weight character-level text encoder to derive robust text embeddings and fine-tunes a pre-trained diffusion model using a combination of denoising score matching, local attention loss based on character-level segmentation maps, and a scene text recognition loss. It further incorporates a refinement process during inference to enhance text accuracy. UDiffText achieves superior text rendering accuracy and visual context coherency compared to previous state-of-the-art methods, as demonstrated by both qualitative and quantitative evaluations. The use of a character-level text encoder and local attention control significantly improves the model's ability to attend to and accurately render individual characters. The proposed method exhibits promising potential in various applications, including text-centric image synthesis, scene text editing, and enhancing the text accuracy of existing T2I models. The model's reliance on visual context may limit its performance in backgrounds with minimal texture or patterns. Current implementation effectively handles text sequences up to 12 characters, requiring further development for longer texts (e.g., paragraph generation). Future work will explore improving controllability and diversity and extending the method to other text-related image synthesis tasks. text-to-image generation, diffusion models, scene text editing, character-level text encoder, local attention
2312.04875 Report MVDD: Multi-View Depth Diffusion Models Zhen Wang, Qiangeng Xu, Feitong Tan, Menglei Chai, Shichen Liu, Rohit Pandey, Sean Fanello, Achuta Kadambi, Yinda Zhang Denoising diffusion models have demonstrated outstanding results in 2D image generation, yet it remains a challenge to replicate its success in 3D shape generation. In this paper, we propose leveraging multi-view depth, which represents complex 3D shapes in a 2D data format that is easy to denoise. We pair this representation with a diffusion model, MVDD, that is capable of generating high-quality dense point clouds with 20K+ points with fine-grained details. To enforce 3D consistency in multi-view depth, we introduce an epipolar line segment attention that conditions the denoising step for a view on its neighboring views. Additionally, a depth fusion module is incorporated into diffusion steps to further ensure the alignment of depth maps. When augmented with surface reconstruction, MVDD can also produce high-quality 3D meshes. Furthermore, MVDD stands out in other tasks such as depth completion, and can serve as a 3D prior, significantly boosting many downstream tasks, such as GAN inversion. State-of-the-art results from extensive experiments demonstrate MVDD's excellent ability in 3D shape generation, depth completion, and its potential as a 3D prior for downstream tasks. This paper presents MVDD, a novel diffusion model for 3D shape generation that utilizes a multi-view depth representation. This approach addresses the limitations of existing 3D shape generation methods that struggle with scalability, fine-grained detail, and versatility by leveraging the strengths of diffusion models and the multi-view depth representation. MVDD enforces cross-view consistency using a novel epipolar "line segment" attention mechanism and a depth fusion module during the denoising process. This allows for the generation of high-resolution, consistent depth maps that can be fused into dense point clouds and further reconstructed into high-quality meshes. MVDD achieves state-of-the-art results on standard 3D shape generation benchmarks, outperforming existing methods in both quality and diversity of generated shapes. MVDD effectively performs depth completion, demonstrating its ability to leverage learned 3D information to complete missing data. MVDD serves as an effective 3D prior for downstream tasks like 3D GAN inversion, improving reconstruction quality and preventing geometric collapse. The current implementation of MVDD assumes a fixed number of views for the multi-view depth representation. Exploring alternative depth fusion techniques and their integration into the diffusion process could further enhance the model's performance. 3d shape generation, denoising diffusion models, multi-view depth, epipolar geometry, depth fusion
2312.04820 Report Learn to Optimize Denoising Scores for 3D Generation: A Unified and Improved Diffusion Prior on NeRF and 3D Gaussian Splatting Xiaofeng Yang, Yiwen Chen, Cheng Chen, Chi Zhang, Yi Xu, Xulei Yang, Fayao Liu, Guosheng Lin We propose a unified framework aimed at enhancing the diffusion priors for 3D generation tasks. Despite the critical importance of these tasks, existing methodologies often struggle to generate high-caliber results. We begin by examining the inherent limitations in previous diffusion priors. We identify a divergence between the diffusion priors and the training procedures of diffusion models that substantially impairs the quality of 3D generation. To address this issue, we propose a novel, unified framework that iteratively optimizes both the 3D model and the diffusion prior. Leveraging the different learnable parameters of the diffusion prior, our approach offers multiple configurations, affording various trade-offs between performance and implementation complexity. Notably, our experimental results demonstrate that our method markedly surpasses existing techniques, establishing new state-of-the-art in the realm of text-to-3D generation. Furthermore, our approach exhibits impressive performance on both NeRF and the newly introduced 3D Gaussian Splatting backbones. Additionally, our framework yields insightful contributions to the understanding of recent score distillation methods, such as the VSD and DDS loss. This paper introduces LODS, a unified framework to improve diffusion priors for 3D generation by iteratively optimizing the 3D model and diffusion prior, aligning them closer to the original diffusion model's score. Existing diffusion priors for 3D generation struggle to produce high-quality results due to a divergence between the priors and diffusion model training procedures. LODS extends the SDS loss with learnable parameters (null embedding or low-rank model parameters) and iteratively optimizes the 3D model and these parameters, bridging the gap between training and inference of diffusion models. LODS significantly improves 3D generation quality over previous methods, achieving state-of-the-art performance on the T3Bench benchmark. LODS successfully mitigates issues like over-saturation and 'floating objects' observed in previous methods. The method demonstrates strong performance on both NeRF and 3D Gaussian Splatting backbones, offering flexibility and efficiency. The current focus is primarily on improving texture details, with limitations in enhancing the geometry of generated 3D models. Future work can explore integrating LODS with geometry-aware diffusion models to further improve geometric quality. 3d generation, diffusion models, diffusion priors, text-to-3d, nerf
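Loosely following the summary above, a LODS-style loop alternates (1) an SDS-style update of the 3D representation using classifier-free guidance and (2) an update of the learnable prior parameters (here, the null/unconditional embedding) with the ordinary denoising objective on the rendered image, pulling the guided prior back toward the base diffusion model. Everything below is a toy sketch under those assumptions; the guidance scale, loss forms, and stand-in modules are illustrative, not the official implementation.

```python
import torch

def sds_grad(eps_guided, eps, weight=1.0):
    """Standard SDS gradient direction: w(t) * (eps_hat - eps), detached from the U-Net."""
    return weight * (eps_guided - eps).detach()

def lods_style_step(render_fn, unet_eps, null_embedding, text_embedding,
                    alpha_bar_t, opt_3d, opt_prior, cfg_scale=7.5):
    """One alternating update, loosely following the summary above (not the official code)."""
    x = render_fn()                                         # image rendered from NeRF / 3D Gaussians
    eps = torch.randn_like(x)
    x_t = alpha_bar_t.sqrt() * x + (1 - alpha_bar_t).sqrt() * eps

    # Step 1: update the 3D model with a classifier-free-guided SDS gradient.
    eps_cond = unet_eps(x_t, text_embedding)
    eps_uncond = unet_eps(x_t, null_embedding.detach())
    eps_guided = eps_uncond + cfg_scale * (eps_cond - eps_uncond)
    loss_3d = (sds_grad(eps_guided, eps) * x).sum()
    opt_3d.zero_grad()
    loss_3d.backward()
    opt_3d.step()

    # Step 2: update the learnable prior (null embedding) with the plain denoising objective,
    # which nudges the guided score back toward the base model's training behaviour.
    eps_prior = unet_eps(x_t.detach(), null_embedding)
    loss_prior = torch.nn.functional.mse_loss(eps_prior, eps)
    opt_prior.zero_grad()
    loss_prior.backward()
    opt_prior.step()

# Toy usage with stand-in modules (not a real diffusion U-Net).
img = torch.randn(1, 3, 8, 8, requires_grad=True)           # stands in for rendered pixels
null_emb = torch.zeros(1, 16, requires_grad=True)
toy_unet = lambda x_t, emb: x_t * 0.0 + emb.mean()           # placeholder epsilon predictor
lods_style_step(lambda: img, toy_unet, null_emb, torch.randn(1, 16),
                alpha_bar_t=torch.tensor(0.5),
                opt_3d=torch.optim.Adam([img], lr=1e-2),
                opt_prior=torch.optim.Adam([null_emb], lr=1e-3))
print(img.grad.abs().mean().item(), null_emb.grad.abs().mean().item())
```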
2312.04806 Report RL Dreams: Policy Gradient Optimization for Score Distillation based 3D Generation Aradhya N. Mathur, Phu Pham, Aniket Bera, Ojaswa Sharma 3D generation has rapidly accelerated in the past decade owing to the progress in the field of generative modeling. Score Distillation Sampling (SDS) based rendering has improved 3D asset generation to a great extent. Further, the recent work of Denoising Diffusion Policy Optimization (DDPO) demonstrates that the diffusion process is compatible with policy gradient methods and has been demonstrated to improve the 2D diffusion models using an aesthetic scoring function. We first show that this aesthetic scorer acts as a strong guide for a variety of SDS-based methods and demonstrates its effectiveness in text-to-3D synthesis. Further, we leverage the DDPO approach to improve the quality of the 3D rendering obtained from 2D diffusion models. Our approach, DDPO3D, employs the policy gradient method in tandem with aesthetic scoring. To the best of our knowledge, this is the first method that extends policy gradient methods to 3D score-based rendering and shows improvement across SDS-based methods such as DreamGaussian, which are currently driving research in text-to-3D synthesis. Our approach is compatible with score distillation-based methods, which would facilitate the integration of diverse reward functions into the generative process. Our project page can be accessed via https://ddpo3d.github.io. The paper introduces DDPO3D, a novel framework that integrates Denoising Diffusion Policy Optimization (DDPO) with score distillation sampling (SDS) methods for improved 3D generation, enhancing visual quality and allowing for non-differentiable reward functions. Existing text-to-3D generation methods often struggle with visual fidelity and lack the flexibility to incorporate diverse reward signals beyond traditional losses. This work addresses these limitations, pushing the boundaries of high-quality, reward-driven 3D generation. DDPO3D treats the 3D generation process as a Markov Decision Process (MDP) within the SDS framework. By leveraging a pre-trained 2D diffusion model and an aesthetic scoring function, it guides the optimization of 3D representations (NeRFs or Gaussian Splats) through policy gradients, maximizing both image quality and adherence to reward functions. DDPO3D demonstrably improves the quality of 3D objects generated from text prompts, evidenced by higher CLIP scores and visual fidelity improvements. Integrating an aesthetic scoring function with SDS methods significantly enhances the visual quality and details of the generated 3D assets. The framework proves compatible with various SDS-based 3D generation techniques, including DreamGaussian, DreamFusion, and GSGen, highlighting its adaptability. The inclusion of policy gradients introduces a trade-off between generation quality and runtime, demanding further exploration for optimization. The paper primarily relies on CLIP and aesthetic scores for evaluation due to the lack of standardized metrics for text-to-3D generation, suggesting a need for better evaluation strategies in the field. 3d generation, text-to-3d, diffusion models, score distillation sampling, reinforcement learning
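The policy-gradient piece described above can be reduced to a REINFORCE-style estimator: treat each denoising transition as an action, and weight the sum of its log-probabilities by a reward (here, an aesthetic score of rendered views) minus a baseline. The sketch below shows only that estimator with toy numbers; how the per-step log-probabilities and rewards are actually obtained in DDPO3D is not reproduced here.

```python
import torch

def policy_gradient_loss(log_probs, rewards):
    """REINFORCE-style objective: -E[(R - baseline) * sum_t log pi(x_{t-1} | x_t)].

    log_probs: (B, T) log-probabilities of each denoising transition (treated as actions).
    rewards:   (B,)  e.g. aesthetic scores of rendered views of the 3D asset.
    """
    advantage = rewards - rewards.mean()                 # simple baseline to reduce variance
    return -(advantage.unsqueeze(1) * log_probs).sum(dim=1).mean()

# Toy example: 4 rendered views, 10 denoising steps each.
log_probs = torch.randn(4, 10, requires_grad=True)
rewards = torch.tensor([0.2, 0.8, 0.5, 0.9])             # stand-in aesthetic scores
loss = policy_gradient_loss(log_probs, rewards)
loss.backward()
print(loss.item(), log_probs.grad.shape)
```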
2312.04655 Report ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations Maitreya Patel, Changhoon Kim, Sheng Cheng, Chitta Baral, Yezhou Yang Text-to-image (T2I) diffusion models, notably the unCLIP models (e.g., DALL-E-2), achieve state-of-the-art (SOTA) performance on various compositional T2I benchmarks, at the cost of significant computational resources. The unCLIP stack comprises T2I prior and diffusion image decoder. The T2I prior model alone adds a billion parameters compared to the Latent Diffusion Models, which increases the computational and high-quality data requirements. We introduce ECLIPSE, a novel contrastive learning method that is both parameter and data-efficient. ECLIPSE leverages pre-trained vision-language models (e.g., CLIP) to distill the knowledge into the prior model. We demonstrate that the ECLIPSE trained prior, with only 3.3% of the parameters and trained on a mere 2.8% of the data, surpasses the baseline T2I priors with an average of 71.6% preference score under resource-limited setting. It also attains performance on par with SOTA big models, achieving an average of 63.36% preference score in terms of the ability to follow the text compositions. Extensive experiments on two unCLIP diffusion image decoders, Karlo and Kandinsky, affirm that ECLIPSE priors consistently deliver high performance while significantly reducing resource dependency. This work introduces ECLIPSE, a novel contrastive learning method for training text-to-image priors in unCLIP models that is both parameter and data-efficient. Existing text-to-image (T2I) diffusion models, particularly unCLIP models, while achieving state-of-the-art performance, demand significant computational resources due to their large prior models and extensive training data requirements. ECLIPSE leverages pre-trained vision-language models like CLIP to distill knowledge into compact non-diffusion prior models using a contrastive learning objective function. ECLIPSE priors, with only 3.3% of parameters, outperform baseline priors and achieve comparable performance to SOTA models using only 2.8% of the training data. Empirical analysis shows that traditional diffusion priors, while benefiting from larger datasets, are resource-intensive and negatively impacted by increased prior steps and noise injection. The choice of pre-trained vision-language model significantly influences performance, with better models leading to superior results. The aesthetic quality of generated images can be further improved, potentially through refined training data selection. Future work could explore integrating ECLIPSE with existing knowledge distillation and model compression techniques for enhanced efficiency. text-to-image generation, diffusion models, contrastive learning, parameter efficiency, data efficiency
2312.04567 Report Scaling Laws of Synthetic Images for Model Training ... for Now Lijie Fan, Kaifeng Chen, Dilip Krishnan, Dina Katabi, Phillip Isola, Yonglong Tian Recent significant advances in text-to-image models unlock the possibility of training vision systems using synthetic images, potentially overcoming the difficulty of collecting curated data at scale. It is unclear, however, how these models behave at scale, as more synthetic data is added to the training set. In this paper we study the scaling laws of synthetic images generated by state of the art text-to-image models, for the training of supervised models: image classifiers with label supervision, and CLIP with language supervision. We identify several factors, including text prompts, classifier-free guidance scale, and types of text-to-image models, that significantly affect scaling behavior. After tuning these factors, we observe that synthetic images demonstrate a scaling trend similar to, but slightly less effective than, real images in CLIP training, while they significantly underperform in scaling when training supervised image classifiers. Our analysis indicates that the main reason for this underperformance is the inability of off-the-shelf text-to-image models to generate certain concepts, a limitation that significantly impairs the training of image classifiers. Our findings also suggest that scaling synthetic data can be particularly effective in scenarios such as: (1) when there is a limited supply of real images for a supervised problem (e.g., fewer than 0.5 million images in ImageNet), (2) when the evaluation dataset diverges significantly from the training data, indicating the out-of-distribution scenario, or (3) when synthetic data is used in conjunction with real images, as demonstrated in the training of CLIP models. This paper studies the scaling laws of synthetic data for training supervised vision models, particularly examining the impact of different factors like text prompts, classifier-free guidance scale, and text-to-image models. Understanding the scaling behavior of synthetic data is crucial for exploring the potential of using synthetic images to train vision models, potentially overcoming limitations of real data collection. The paper conducts empirical studies on supervised image classifiers and CLIP models, analyzing the scaling trends of synthetic images generated by Stable Diffusion, Imagen, and Muse in comparison to real images. They investigate the impact of different generation configurations on scaling ability, focusing on performance at 1.3M scale (ImageNet size) and scaling behavior. Synthetic data exhibits power-law scaling in both supervised and CLIP training, but with lower efficiency compared to real data. Tuning prompt design, classifier-free guidance, and text-to-image model choice significantly improves the scaling ability of synthetic data. While generally underperforming real data for supervised classifiers, synthetic data can be more effective in cases of limited real data, out-of-distribution scenarios, and in combination with real data for CLIP training. Scaling behavior beyond 4M images is limited by model capacity, requiring larger models. Current text-to-image models struggle to generate certain concepts accurately, limiting the scaling potential of synthetic data. synthetic data, scaling laws, text-to-image models, supervised learning, clip
2312.04565 Report Gen2Det: Generate to Detect Saksham Suri, Fanyi Xiao, Animesh Sinha, Sean Chang Culatana, Raghuraman Krishnamoorthi, Chenchen Zhu, Abhinav Shrivastava Recently, diffusion models have shown improvement in synthetic image quality as well as better control in generation. We motivate and present Gen2Det, a simple modular pipeline to create synthetic training data for object detection for free by leveraging state-of-the-art grounded image generation methods. Unlike existing works, which generate individual object instances and require identifying the foreground before pasting it onto other images, we simplify the pipeline by directly generating scene-centric images. In addition to the synthetic data, Gen2Det also proposes a suite of techniques to best utilize the generated data, including image-level filtering, instance-level filtering, and a better training recipe to account for imperfections in the generation. Using Gen2Det, we show healthy improvements on object detection and segmentation tasks under various settings and agnostic to detection methods. In the long-tailed detection setting on LVIS, Gen2Det improves the performance on rare categories by a large margin while also significantly improving the performance on other categories, e.g. we see an improvement of 2.13 Box AP and 1.84 Mask AP over just training on real data on LVIS with Mask R-CNN. In the low-data regime setting on COCO, Gen2Det consistently improves both Box and Mask AP by 2.27 and 1.85 points. In the most general detection setting, Gen2Det still demonstrates robust performance gains, e.g. it improves the Box and Mask AP on COCO by 0.45 and 0.32 points. Introduces Gen2Det, a modular pipeline for creating synthetic training data for object detection by leveraging grounded image generation methods. Aims to improve object detection and segmentation performance, particularly for rare categories and in low-data regimes, by generating more realistic and contextually relevant training data. Utilizes grounded inpainting diffusion models to generate scene-centric images with new object instances. Employs image-level and instance-level filtering to remove low-quality generations. Introduces a sampling strategy for mixing real and synthetic data and modifies the loss function to accommodate filtered instances. Achieves substantial improvements in object detection and segmentation on LVIS and COCO datasets, particularly for rare categories. Demonstrates consistent gains in low-data settings, highlighting the benefits of synthetic data augmentation in data-constrained scenarios. Improves both box and mask AP despite not using segmentation masks for synthetic data, indicating the generation of semantically rich synthetic images. Current generation models may still lack diversity, limiting the potential benefits of scaling up synthetic data. Exploration of improved filtering techniques or generation models could further enhance the quality of synthetic data and potentially lead to even larger performance gains. object detection, synthetic data, diffusion models, grounded inpainting, long-tailed learning
2312.04566 Report MuRF: Multi-Baseline Radiance Fields Haofei Xu, Anpei Chen, Yuedong Chen, Christos Sakaridis, Yulun Zhang, Marc Pollefeys, Andreas Geiger, Fisher Yu We present Multi-Baseline Radiance Fields (MuRF), a general feed-forward approach to solving sparse view synthesis under multiple different baseline settings (small and large baselines, and different numbers of input views). To render a target novel view, we discretize the 3D space into planes parallel to the target image plane, and accordingly construct a target view frustum volume. Such a target volume representation is spatially aligned with the target view, which effectively aggregates relevant information from the input views for high-quality rendering. It also facilitates subsequent radiance field regression with a convolutional network thanks to its axis-aligned nature. The 3D context modeled by the convolutional network enables our method to synthesize sharper scene structures than prior works. Our MuRF achieves state-of-the-art performance across multiple different baseline settings and diverse scenarios ranging from simple objects (DTU) to complex indoor and outdoor scenes (RealEstate10K and LLFF). We also show promising zero-shot generalization abilities on the Mip-NeRF 360 dataset, demonstrating the general applicability of MuRF. This paper introduces Multi-Baseline Radiance Fields (MuRF), a feed-forward neural radiance field model for novel view synthesis that effectively handles both small and large baseline camera settings. Existing methods for sparse view synthesis often specialize in either small or large baselines, limiting their general applicability. MuRF addresses this by providing a unified solution that excels in both scenarios. MuRF constructs a target view frustum volume, spatially aligned with the target view, to effectively aggregate multi-view information. This volume, along with multi-view features and their cosine similarities, is processed by a 3D context-aware convolutional decoder to reconstruct the radiance field. MuRF achieves state-of-the-art performance on DTU and RealEstate10K datasets, outperforming specialized small and large baseline methods, respectively. It exhibits promising zero-shot generalization abilities on the Mip-NeRF 360 dataset, indicating its robustness to unseen data. Ablation studies validate the importance of the target view frustum volume and the 3D context-aware decoder. The model currently assumes known camera parameters and static scenes. Performance could be further enhanced with larger and more diverse scene-level datasets. novel view synthesis, neural radiance fields, multi-baseline, sparse view synthesis, computer vision
2312.04564 Report EAGLES: Efficient Accelerated 3D Gaussians with Lightweight EncodingS Sharath Girish, Kamal Gupta, Abhinav Shrivastava Recently, 3D Gaussian splatting (3D-GS) has gained popularity in novel-view scene synthesis. It addresses the challenges of lengthy training times and slow rendering speeds associated with Neural Radiance Fields (NeRFs). Through rapid, differentiable rasterization of 3D Gaussians, 3D-GS achieves real-time rendering and accelerated training. It, however, demands substantial memory resources for both training and storage, as it requires millions of Gaussians in its point cloud representation for each scene. We present a technique utilizing quantized embeddings to significantly reduce per-point memory storage requirements and a coarse-to-fine training strategy for a faster and more stable optimization of the Gaussian point clouds. Our approach develops a pruning stage which results in scene representations with fewer Gaussians, leading to faster training times and rendering speeds for real-time rendering of high resolution scenes. We reduce storage memory by more than an order of magnitude, all while preserving the reconstruction quality. We validate the effectiveness of our approach on a variety of datasets and scenes, preserving the visual quality while consuming 10-20x less memory and achieving faster training/inference speed. Project page and code are available at https://efficientgaussian.github.io This paper introduces EAGLES, a novel approach to compress 3D Gaussian point cloud representations for novel view synthesis, leading to significant reductions in storage and runtime memory while maintaining high reconstruction quality. 3D Gaussian Splatting (3D-GS), while advantageous over NeRFs for novel view synthesis, suffers from high memory usage due to the millions of Gaussians needed to represent scenes. This work aims to address this limitation. The authors employ quantized embeddings to compress color and rotation attributes, quantize opacity for optimization improvement, use a coarse-to-fine training strategy, and implement influence pruning to eliminate redundant Gaussians. EAGLES achieves comparable or better reconstruction quality than 3D-GS while reducing storage size by 10-20 times. The method accelerates both training and rendering, achieving higher FPS and lower training times across datasets. EAGLES significantly reduces GPU memory consumption during both training and rendering compared to 3D-GS. Further exploration of more complex decoders and quantization techniques could yield additional compression. Investigating the integration of meta-learning approaches for compressing scenes from auxiliary datasets is a potential future direction. 3d gaussian splatting, novel view synthesis, point cloud compression, quantization, progressive training
2312.04561 Report GenDeF: Learning Generative Deformation Field for Video Generation Wen Wang, Kecheng Zheng, Qiuyu Wang, Hao Chen, Zifan Shi, Ceyuan Yang, Yujun Shen, Chunhua Shen We offer a new perspective on approaching the task of video generation. Instead of directly synthesizing a sequence of frames, we propose to render a video by warping one static image with a generative deformation field (GenDeF). Such a pipeline enjoys three appealing advantages. First, we can sufficiently reuse a well-trained image generator to synthesize the static image (also called canonical image), alleviating the difficulty in producing a video and thereby resulting in better visual quality. Second, we can easily convert a deformation field to optical flows, making it possible to apply explicit structural regularizations for motion modeling, leading to temporally consistent results. Third, the disentanglement between content and motion allows users to process a synthesized video through processing its corresponding static image without any tuning, facilitating many applications like video editing, keypoint tracking, and video segmentation. Both qualitative and quantitative results on three common video generation benchmarks demonstrate the superiority of our GenDeF method. This paper presents GenDeF, a novel video generation approach that decomposes videos into a content-rich canonical image and a motion-encoding deformation field, enabling high-quality video synthesis by warping the canonical image. This method addresses challenges in video generation related to high dimensionality, motion complexity, and temporal consistency, while also facilitating downstream video processing applications. GenDeF utilizes a GAN framework with a canonical image branch and a deformation field branch, both conditioned on input latent codes. The canonical image captures shared content, while the deformation field, conditioned on the canonical image features, encodes temporal motion. The model is trained with adversarial losses and a structural temporal smoothness constraint. GenDeF achieves state-of-the-art results on standard video generation benchmarks, demonstrating superior performance in temporal consistency and individual frame quality. The explicit content-motion decomposition allows for generating multiple plausible videos with varied motions from a single canonical image. The method facilitates downstream applications such as consistent video editing, point tracking, and video segmentation by leveraging the interpretable canonical image and deformation field. The fixed-resolution representation of the canonical image and deformation field limits the model's ability to handle arbitrary resolutions and extremely long videos. The current approach relies on generated data, limiting its direct applicability to real-world videos where canonical images and deformation fields are not readily available. video generation, generative adversarial networks, deformation fields, canonical image, video editing
2312.04560 Report NeRFiller: Completing Scenes via Generative 3D Inpainting Ethan Weber, Aleksander Hołyński, Varun Jampani, Saurabh Saxena, Noah Snavely, Abhishek Kar, Angjoo Kanazawa We propose NeRFiller, an approach that completes missing portions of a 3D capture via generative 3D inpainting using off-the-shelf 2D visual generative models. Often parts of a captured 3D scene or object are missing due to mesh reconstruction failures or a lack of observations (e.g., contact regions, such as the bottom of objects, or hard-to-reach areas). We approach this challenging 3D inpainting problem by leveraging a 2D inpainting diffusion model. We identify a surprising behavior of these models, where they generate more 3D consistent inpaints when images form a 2×2 grid, and show how to generalize this behavior to more than four images. We then present an iterative framework to distill these inpainted regions into a single consistent 3D scene. In contrast to related works, we focus on completing scenes rather than deleting foreground objects, and our approach does not require tight 2D object masks or text. We compare our approach to relevant baselines adapted to our setting on a variety of scenes, where NeRFiller creates the most 3D consistent and plausible scene completions. Our project page is at https://ethanweber.me/nerfiller. Presents NeRFiller, a method for completing missing parts of 3D scenes using off-the-shelf 2D inpainting diffusion models. Addresses the challenge of incomplete 3D captures by enabling 3D-aware and multi-view consistent scene completion. Identifies and leverages a 'Grid Prior' in diffusion models, generalizing it to multiple views with 'Joint Multi-View Inpainting' and iteratively distilling inpainted regions into a 3D scene representation. Joint Multi-View Inpainting improves 3D consistency compared to individual image inpainting. NeRFiller generates more plausible and consistent 3D completions than adapted object-removal baselines. The method allows for reference-guided inpainting for controlled scene completion. Limited resolution and potential blur in generated content. Challenges in applying the method to casually captured scenes with large masked regions and specific mask patterns. 3d inpainting, scene completion, diffusion models, neural radiance fields, multi-view consistency
2312.04558 Report MonoGaussianAvatar: Monocular Gaussian Point-based Head Avatar Yufan Chen, Lizhen Wang, Qijing Li, Hongjiang Xiao, Shengping Zhang, Hongxun Yao, Yebin Liu The ability to animate photo-realistic head avatars reconstructed from monocular portrait video sequences represents a crucial step in bridging the gap between the virtual and real worlds. Recent advancements in head avatar techniques, including explicit 3D morphable meshes (3DMM), point clouds, and neural implicit representations, have been exploited for this ongoing research. However, 3DMM-based methods are constrained by their fixed topologies, point-based approaches suffer from a heavy training burden due to the extensive quantity of points involved, and neural implicit representations suffer from limitations in deformation flexibility and rendering efficiency. In response to these challenges, we propose MonoGaussianAvatar (Monocular Gaussian Point-based Head Avatar), a novel approach that harnesses 3D Gaussian point representation coupled with a Gaussian deformation field to learn explicit head avatars from monocular portrait videos. We define our head avatars with Gaussian points characterized by adaptable shapes, enabling flexible topology. These points exhibit movement with a Gaussian deformation field in alignment with the target pose and expression of a person, facilitating efficient deformation. Additionally, the Gaussian points have controllable shape, size, color, and opacity combined with Gaussian splatting, allowing for efficient training and rendering. Experiments demonstrate the superior performance of our method, which achieves state-of-the-art results among previous methods. This paper introduces MonoGaussianAvatar, a novel approach for creating dynamic 3D head avatars from monocular portrait videos using 3D Gaussian points and a Gaussian deformation field. Existing methods for creating 3D head avatars suffer from limitations in topology, training burden, deformation flexibility, and rendering efficiency. This new method aims to address these challenges. The method uses 3D Gaussian points to represent facial features and learns a deformation field to animate these points according to target poses and expressions. It employs a two-stage initialization strategy for Gaussians and a novel point insertion/deletion approach for efficient training. MonoGaussianAvatar outperforms state-of-the-art methods in terms of structure similarity, image similarity, and Peak Signal-to-Noise Ratio. The method accurately captures fine details like teeth and hair, surpasses mesh-based approaches in modeling thin hair strands, and avoids holes during significant head movements. The introduced Gaussian deformation field proves crucial for preserving the structure of accessories and preventing blurring in novel poses. The current method lacks the ability to model reflections on eyeglass lenses, which presents an area for future research. The method's reliance on 3DMM priors limits its ability to handle extreme expressions that deviate from these priors. 3d head avatar, gaussian splatting, monocular reconstruction, facial animation, point-based representation
2312.04557 Report GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation Shoufa Chen, Mengmeng Xu, Jiawei Ren, Yuren Cong, Sen He, Yanping Xie, Animesh Sinha, Ping Luo, Tao Xiang, Juan-Manuel Perez-Rua In this study, we explore Transformer-based diffusion models for image and video generation. Despite the dominance of Transformer architectures in various fields due to their flexibility and scalability, the visual generative domain primarily utilizes CNN-based U-Net architectures, particularly in diffusion-based models. We introduce GenTron, a family of Generative models employing Transformer-based diffusion, to address this gap. Our initial step was to adapt Diffusion Transformers (DiTs) from class to text conditioning, a process involving thorough empirical exploration of the conditioning mechanism. We then scale GenTron from approximately 900M to over 3B parameters, observing significant improvements in visual quality. Furthermore, we extend GenTron to text-to-video generation, incorporating novel motion-free guidance to enhance video quality. In human evaluations against SDXL, GenTron achieves a 51.1% win rate in visual quality (with a 19.8% draw rate), and a 42.3% win rate in text alignment (with a 42.9% draw rate). GenTron also excels in the T2I-CompBench, underscoring its strengths in compositional generation. We believe this work will provide meaningful insights and serve as a valuable reference for future research. The paper introduces GenTron, a family of generative Transformer-based diffusion models for high-quality text-to-image and video generation. This work bridges the gap between the dominant architectures used in visual generation (CNNs) and those used in NLP and visual perception (Transformers). The authors adapt Diffusion Transformers (DiTs) from class-conditional to text-conditional image generation, explore different conditioning mechanisms, scale the model up to 3B parameters, and extend it to video generation using a novel motion-free guidance technique. Cross-attention outperforms adaptive layer norm (adaLN) for text conditioning in GenTron. Scaling GenTron to 3B parameters significantly improves visual quality. Motion-free guidance improves the quality of generated videos and allows for the integration of image-text data during training. The performance of video generation, while promising, still lags behind state-of-the-art image generation. Future work could focus on developing more efficient training methods for large-scale Transformer-based diffusion models. text-to-image generation, text-to-video generation, diffusion models, transformers, motion-free guidance
2312.04551 Report Free3D: Consistent Novel View Synthesis without 3D Representation Chuanxia Zheng, Andrea Vedaldi We introduce Free3D, a simple accurate method for monocular open-set novel view synthesis (NVS). Similar to Zero-1-to-3, we start from a pre-trained 2D image generator for generalization, and fine-tune it for NVS. Compared to other works that took a similar approach, we obtain significant improvements without resorting to an explicit 3D representation, which is slow and memory-consuming, and without training an additional network for 3D reconstruction. Our key contribution is to improve the way the target camera pose is encoded in the network, which we do by introducing a new ray conditioning normalization (RCN) layer. The latter injects pose information in the underlying 2D image generator by telling each pixel its viewing direction. We further improve multi-view consistency by using light-weight multi-view attention layers and by sharing generation noise between the different views. We train Free3D on the Objaverse dataset and demonstrate excellent generalization to new categories in new datasets, including OmniObject3D and GSO. The project page is available at https://chuanxiaz.com/free3d/. This paper presents Free3D, a novel approach for single-view, open-set novel view synthesis that improves pose accuracy and multi-view consistency without relying on explicit 3D representations. Existing open-set novel view synthesis methods often struggle with accurate camera pose control and consistent multi-view generation, especially without relying on computationally expensive 3D representations. The methodology leverages a pre-trained 2D image generator enhanced by a novel Ray Conditioning Normalization (RCN) layer for accurate pose encoding. Multi-view consistency is achieved through a pseudo-3D cross-view attention module and multi-view noise sharing during image generation. Free3D surpasses state-of-the-art models in pose accuracy and view consistency on the Objaverse dataset, even those trained on larger datasets or employing explicit 3D representations. The method demonstrates strong generalization ability by achieving superior results on unseen datasets like OmniObject3D and GSO, outperforming competing methods without fine-tuning. Ablation studies confirm the effectiveness of RCN, multi-view attention, and noise sharing in enhancing pose accuracy and multi-view consistency. The method relies on a pre-trained 2D generator, which, while enabling generalization, might limit its capacity to model complex 3D structures not well-represented in the training data. Future work could explore the integration of more advanced attention mechanisms or training strategies to further enhance multi-view consistency and detail preservation. novel view synthesis, open-set learning, generative models, ray conditioning, multi-view consistency
2312.04539 Report Auto-Vocabulary Semantic Segmentation Osman Ülger, Maksymilian Kulicki, Yuki Asano, Martin R. Oswald Open-ended image understanding tasks gained significant attention from the research community, particularly with the emergence of Vision-Language Models. Open-Vocabulary Segmentation (OVS) methods are capable of performing semantic segmentation without relying on a fixed vocabulary, and in some cases, they operate without the need for training or fine-tuning. However, OVS methods typically require users to specify the vocabulary based on the task or dataset at hand. In this paper, we introduce Auto-Vocabulary Semantic Segmentation (AVS), advancing open-ended image understanding by eliminating the necessity to predefine object categories for segmentation. Our approach presents a framework that autonomously identifies relevant class names using enhanced BLIP embeddings, which are utilized for segmentation afterwards. Given that open-ended object category predictions cannot be directly compared with a fixed ground truth, we develop a Large Language Model-based Auto-Vocabulary Evaluator (LAVE) to efficiently evaluate the automatically generated class names and their corresponding segments. Our method sets new benchmarks on datasets such as PASCAL VOC and Context, ADE20K, and Cityscapes for AVS and showcases competitive performance to OVS methods that require specified class names. This paper introduces Auto-Vocabulary Semantic Segmentation (AVS), a novel task aiming to segment images and assign classes without predefined categories, user input, or additional training data, unlike traditional or open-vocabulary methods. This work pushes the boundaries of open-ended image understanding by enabling a system to autonomously determine relevant object categories for segmentation, similar to human perception. The proposed AVS framework utilizes BLIP-Cluster-Caption (BCC) to generate class names by clustering BLIP embeddings, enhancing them for semantic accuracy, and captioning each cluster. These generated nouns then guide a pre-trained open-vocabulary segmentation model (X-Decoder) for pixel-level prediction. A novel evaluation metric, LAVE, leverages an LLM to map generated categories to dataset annotations for performance assessment. AVS framework shows competitive performance against open-vocabulary methods on PASCAL VOC, ADE20K, and Cityscapes, despite not having access to predefined categories. BCC effectively identifies and segments out-of-vocabulary classes, demonstrating comprehension beyond fixed datasets. The LLM-based evaluator, LAVE, successfully bridges the gap between open-ended predictions and fixed ground truth annotations. Performance on datasets with a high number of instances, like Cityscapes, is notably lower than some open-vocabulary methods, suggesting room for improvement in handling complex scenes. Occasional misclassifications occur due to the model struggling to differentiate between semantically similar classes (e.g., hyponyms and hypernyms), highlighting the need for further refinement in semantic reasoning. auto-vocabulary semantic segmentation, open-vocabulary semantic segmentation, semantic segmentation, vision-language models, image captioning
2312.04534 Report PICTURE: PhotorealistIC virtual Try-on from UnconstRained dEsigns Shuliang Ning, Duomin Wang, Yipeng Qin, Zirong Jin, Baoyuan Wang, Xiaoguang Han In this paper, we propose a novel virtual try-on from unconstrained designs (ucVTON) task to enable photorealistic synthesis of personalized composite clothing on input human images. Unlike prior arts constrained by specific input types, our method allows flexible specification of style (text or image) and texture (full garment, cropped sections, or texture patches) conditions. To address the entanglement challenge when using full garment images as conditions, we develop a two-stage pipeline with explicit disentanglement of style and texture. In the first stage, we generate a human parsing map reflecting the desired style conditioned on the input. In the second stage, we composite textures onto the parsing map areas based on the texture input. To represent complex and non-stationary textures that have never been achieved in previous fashion editing works, we first propose extracting hierarchical and balanced CLIP features and applying position encoding in VTON. Experiments demonstrate superior synthesis quality and personalization enabled by our method. The flexible control over style and texture mixing brings virtual try-on to a new level of user experience for online shopping and fashion design. This paper presents ucVTON, a novel virtual try-on method that allows users to synthesize personalized clothing with flexible style (text/image) and texture (full garment, cropped sections, or texture patches) conditions. Existing virtual try-on methods are limited in the types of inputs they allow, hindering users from mixing and matching style and texture elements from different garments. A two-stage pipeline is proposed, disentangling style and texture. Stage 1 generates a parsing map reflecting the desired style. Stage 2 composites textures onto the parsing map based on the texture input. The method also introduces hierarchical and balanced CLIP features with position encoding to handle complex, non-stationary textures. The method achieves significantly higher style prediction accuracy compared to prior arts. It outperforms state-of-the-art methods in texture quality, particularly when using full garment images as texture input. User studies confirm that ucVTON is preferred for its fidelity in style and texture, and overall image quality. The model currently lacks control over garment shape and fit. Future work will explore incorporating user control over these aspects to further enhance personalization. virtual try-on, fashion editing, diffusion models, style and texture disentanglement, clip features
2312.04524 Report RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models Ozgur Kara, Bariscan Kurtkaya, Hidir Yesiltepe, James M. Rehg, Pinar Yanardag Recent advancements in diffusion-based models have demonstrated significant success in generating images from text. However, video editing models have not yet reached the same level of visual quality and user control. To address this, we introduce RAVE, a zero-shot video editing method that leverages pre-trained text-to-image diffusion models without additional training. RAVE takes an input video and a text prompt to produce high-quality videos while preserving the original motion and semantic structure. It employs a novel noise shuffling strategy, leveraging spatio-temporal interactions between frames, to produce temporally consistent videos faster than existing methods. It is also efficient in terms of memory requirements, allowing it to handle longer videos. RAVE is capable of a wide range of edits, from local attribute modifications to shape transformations. In order to demonstrate the versatility of RAVE, we create a comprehensive video evaluation dataset ranging from object-focused scenes to complex human activities like dancing and typing, and dynamic scenes featuring swimming fish and boats. Our qualitative and quantitative experiments highlight the effectiveness of RAVE in diverse video editing scenarios compared to existing methods. Our code, dataset and videos can be found in https://rave-video.github.io. RAVE is a zero-shot video editing method that leverages pre-trained text-to-image diffusion models for style, attribute, and shape editing in videos, while preserving motion and structure. Existing video editing methods lack the visual quality, user control, and efficiency of their image editing counterparts. RAVE addresses this gap by enabling diverse video edits using pre-trained models without the need for extensive training. RAVE introduces a novel noise shuffling strategy within a grid-based video editing framework. This strategy leverages spatio-temporal interactions during the diffusion process to enhance temporal consistency across video frames, even for longer videos. RAVE demonstrates superior temporal consistency and textual alignment compared to baseline methods, as evidenced by both quantitative and qualitative evaluations. The noise shuffling strategy in RAVE proves effective in maintaining consistency across multiple grids, overcoming limitations of conventional attention mechanisms for longer videos. RAVE exhibits efficiency in terms of runtime, achieving edits approximately 25% faster than the closest competitor, making it suitable for real-time video editing applications. RAVE faces limitations in maintaining consistent shape transformations for extreme shape edits in longer videos. Fine details can exhibit flickering in the edited videos, especially in cases requiring high-frequency edits, due to the absence of explicit pixel-level deflickering techniques. video editing, diffusion models, text-guided editing, temporal consistency, zero-shot learning
2312.04483 Report Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation Zhiwu Qing, Shiwei Zhang, Jiayu Wang, Xiang Wang, Yujie Wei, Yingya Zhang, Changxin Gao, Nong Sang Despite diffusion models having shown powerful abilities to generate photorealistic images, generating videos that are realistic and diverse still remains in its infancy. One of the key reasons is that current methods intertwine spatial content and temporal dynamics together, leading to a notably increased complexity of text-to-video generation (T2V). In this work, we propose HiGen, a diffusion model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives, i.e., structure level and content level. At the structure level, we decompose the T2V task into two steps, including spatial reasoning and temporal reasoning, using a unified denoiser. Specifically, we generate spatially coherent priors using text during spatial reasoning and then generate temporally coherent motions from these priors during temporal reasoning. At the content level, we extract two subtle cues from the content of the input video that can express motion and appearance changes, respectively. These two cues then guide the model's training for generating videos, enabling flexible content variations and enhancing temporal stability. Through the decoupled paradigm, HiGen can effectively reduce the complexity of this task and generate realistic videos with semantics accuracy and motion stability. Extensive experiments demonstrate the superior performance of HiGen over the state-of-the-art T2V methods. Proposes HiGen-T2V, a diffusion model-based text-to-video generation method that decouples spatial and temporal factors to improve realism and diversity Current T2V methods struggle to jointly generate realistic spatial content and diverse temporal dynamics due to the complexity of video data Decouples video generation at two levels: 1) Structure level: separates text-to-video generation into spatial reasoning (generates spatially coherent priors from text) and temporal reasoning (generates temporally coherent motions from priors). 2) Content level: extracts motion and appearance cues from input videos to guide model training and enhance stability and diversity Achieves superior performance compared to state-of-the-art T2V methods on MSR-VTT dataset. Demonstrates improved spatial quality and temporal stability through ablation studies. Allows flexible control over generated videos by manipulating motion and appearance factors. Object detail generation lags behind image synthesis models due to computational and data limitations. Modeling human and animal actions realistically, especially with substantial motion, remains challenging. text-to-video generation, diffusion models, spatio-temporal decoupling, motion and appearance analysis, deep learning
2312.04461 Report PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming-Ming Cheng, Ying Shan Recent advances in text-to-image generation have made remarkable progress in synthesizing realistic human photos conditioned on given text prompts. However, existing personalized generation methods cannot simultaneously satisfy the requirements of high efficiency, promising identity (ID) fidelity, and flexible text controllability. In this work, we introduce PhotoMaker, an efficient personalized text-to-image generation method, which mainly encodes an arbitrary number of input ID images into a stack ID embedding for preserving ID information. Such an embedding, serving as a unified ID representation, can not only encapsulate the characteristics of the same input ID comprehensively, but also accommodate the characteristics of different IDs for subsequent integration. This paves the way for more intriguing and practically valuable applications. Besides, to drive the training of our PhotoMaker, we propose an ID-oriented data construction pipeline to assemble the training data. Under the nourishment of the dataset constructed through the proposed pipeline, our PhotoMaker demonstrates better ID preservation ability than test-time fine-tuning based methods, yet provides significant speed improvements, high-quality generation results, strong generalization capabilities, and a wide range of applications. Our project page is available at https://photo-maker.github.io/ This paper proposes PhotoMaker, an efficient personalized text-to-image generation method that encodes multiple input ID images into a stacked ID embedding to generate high-quality, customizable human photos with high ID fidelity. Existing personalized generation methods struggle to simultaneously achieve high efficiency, promising ID fidelity, and flexible text controllability. PhotoMaker aims to address these limitations. PhotoMaker uses a stacked ID embedding created by concatenating embeddings of multiple input ID images. This embedding is integrated with text embedding to guide a diffusion model (SDXL) for image generation. An ID-oriented data construction pipeline is also proposed to train PhotoMaker. PhotoMaker demonstrates high ID fidelity and generation quality comparable to DreamBooth while being significantly faster. The method offers flexibility in controlling ID attributes like age and gender by simply modifying class words in text prompts. PhotoMaker enables novel applications like identity mixing, bringing persons from artworks/old photos to reality, and stylization while preserving ID characteristics. PhotoMaker currently focuses on generating a single person and does not support multi-person ID control. The method is biased towards the training dataset (SDXL) and may inherit its limitations. text-to-image generation, personalized image synthesis, diffusion models, identity preservation, image editing
2312.04433 Report DreamVideo: Composing Your Dream Videos with Customized Subject and Motion Yujie Wei, Shiwei Zhang, Zhiwu Qing, Hangjie Yuan, Zhiheng Liu, Yu Liu, Yingya Zhang, Jingren Zhou, Hongming Shan Customized generation using diffusion models has made impressive progress in image generation, but remains unsatisfactory in the challenging video generation task, as it requires the controllability of both subjects and motions. To that end, we present DreamVideo, a novel approach to generating personalized videos from a few static images of the desired subject and a few videos of target motion. DreamVideo decouples this task into two stages, subject learning and motion learning, by leveraging a pre-trained video diffusion model. The subject learning aims to accurately capture the fine appearance of the subject from provided images, which is achieved by combining textual inversion and fine-tuning of our carefully designed identity adapter. In motion learning, we architect a motion adapter and fine-tune it on the given videos to effectively model the target motion pattern. Combining these two lightweight and efficient adapters allows for flexible customization of any subject with any motion. Extensive experimental results demonstrate the superior performance of our DreamVideo over the state-of-the-art methods for customized video generation. Our project page is at https://dreamvideo-t2v.github.io. DreamVideo, a novel approach to generate personalized videos by customizing both subject identity and motion patterns using a pre-trained video diffusion model and two lightweight adapters. Customized video generation is challenging as it requires controllability of both subjects and motions, which is not well addressed by existing methods. DreamVideo decouples the task into subject learning and motion learning. Subject learning captures appearance details from static images via textual inversion and a fine-tuned identity adapter. Motion learning models motion patterns from videos using a motion adapter with appearance guidance. DreamVideo outperforms state-of-the-art methods in qualitative and quantitative comparisons, including AnimateDiff, ModelScopeT2V, and LoRA fine-tuning. The method effectively combines customized subjects and motions under various contexts, preserving both identity and motion fidelity. DreamVideo exhibits strong performance in individual subject and motion customization, surpassing alternatives like Textual Inversion, Dreamix, and Tune-A-Video. DreamVideo currently doesn't support customizing multiple subjects with multiple motions. The approach may struggle with fine-grained single video motion, achieving similar patterns instead of frame-by-frame correspondence. video generation, diffusion models, customization, subject customization, motion customization
2312.04429 Report Approximate Caching for Efficiently Serving Diffusion Models Shubham Agarwal, Subrata Mitra, Sarthak Chakraborty, Srikrishna Karanam, Koyel Mukherjee, Shiv Saini Text-to-image generation using diffusion models has seen explosive popularity owing to their ability in producing high quality images adhering to text prompts. However, production-grade diffusion model serving is a resource intensive task that not only requires high-end GPUs which are expensive but also incurs considerable latency. In this paper, we introduce a technique called approximate-caching that can reduce such iterative denoising steps for an image generation based on a prompt by reusing intermediate noise states created during a prior image generation for similar prompts. Based on this idea, we present an end to end text-to-image system, Nirvana, that uses the approximate-caching with a novel cache management-policy Least Computationally Beneficial and Frequently Used (LCBFU) to provide % GPU compute savings, 19.8% end-to-end latency reduction and 19% dollar savings, on average, on two real production workloads. We further present an extensive characterization of real production text-to-image prompts from the perspective of caching, popularity and reuse of intermediate states in a large production environment. Introduces Nirvana, a system using approximate caching to reduce compute cost and latency in text-to-image generation with diffusion models by reusing intermediate noise states from prior prompts. Diffusion models for text-to-image generation, while producing high-quality images, are computationally expensive and have high latency, hindering interactive user experiences and increasing costs. Leverages approximate caching by storing intermediate noise states from previous image generations. Employs a novel cache management policy (LCBFU) to prioritize states offering the most compute savings. Uses a match predictor to avoid unnecessary cache searches, further reducing latency. Nirvana achieves up to 50% reduction in GPU usage and end-to-end latency while maintaining image quality comparable to vanilla diffusion models. Reduces overall cost, latency, and compute requirements by ~20% on average, improving system throughput by 27%. User study (N=60) shows 79% preference for Nirvana-generated images, significantly higher than retrieval-based methods and close to vanilla diffusion model quality. Image diversity might decrease over time if the system encounters a high volume of very similar prompts. Reliance on embedding similarity for prompt matching can be challenging for long, complex prompts. text-to-image generation, diffusion models, approximate caching, latency reduction, cost optimization
2312.04424 Report Cascade-Zero123: One Image to Highly Consistent 3D with Self-Prompted Nearby Views Yabo Chen, Jiemin Fang, Yuyang Huang, Taoran Yi, Xiaopeng Zhang, Lingxi Xie, Xinggang Wang, Wenrui Dai, Hongkai Xiong, Qi Tian Synthesizing multi-view 3D from one single image is a significant and challenging task. For this goal, Zero-1-to-3 methods aim to extend a 2D latent diffusion model to the 3D scope. These approaches generate the target-view image with a single-view source image and the camera pose as condition information. However, the one-to-one manner adopted in Zero-1-to-3 incurs challenges for building geometric and visual consistency across views, especially for complex objects. We propose a cascade generation framework constructed with two Zero-1-to-3 models, named Cascade-Zero123, to tackle this issue, which progressively extracts 3D information from the source image. Specifically, a self-prompting mechanism is designed to generate several nearby views at first. These views are then fed into the second-stage model along with the source image as generation conditions. With self-prompted multiple views as the supplementary information, our Cascade-Zero123 generates more highly consistent novel-view images than Zero-1-to-3. The promotion is significant for various complex and challenging scenes, involving insects, humans, transparent objects, and stacked multiple objects etc. The project page is at https://cascadezero123.github.io/. This paper introduces Cascade-Zero123, a novel cascade framework based on Zero-1-to-3 models, designed to improve the geometric and visual consistency of novel view synthesis from a single image. Existing single image to 3D methods, particularly Zero-1-to-3 approaches, struggle to maintain consistency across views, especially for complex objects and scenes with large pose variations. Cascade-Zero123 addresses this limitation by progressively extracting 3D information. Cascade-Zero123 utilizes two cascaded Zero-1-to-3 models. The first generates several nearby views from the input image. These, along with the input, are fed to the second model, which leverages cross-attention to synthesize the final target view with enhanced consistency. Significantly improves geometric and visual consistency in novel view synthesis, especially for complex scenes (e.g., stacked objects, insects) compared to Zero-1-to-3. Achieves better visual quality compared to methods like SyncDreamer while maintaining consistency. Demonstrates superior performance on Objaverse and RealFusion15 datasets based on metrics like PSNR, SSIM, LPIPS, and CLIP-score. Limited ability to handle cases with heavy occlusion due to reliance on 2D information. Performance can degrade with high elevation angles in input images due to Zero-1-to-3's sensitivity. novel view synthesis, single image to 3d, latent diffusion model, zero-1-to-3, view consistency
2312.04410 Report Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models Jiayi Guo, Xingqian Xu, Yifan Pu, Zanlin Ni, Chaofei Wang, Manushree Vasu, Shiji Song, Gao Huang, Humphrey Shi Recently, diffusion models have made remarkable progress in text-to-image (T2I) generation, synthesizing images with high fidelity and diverse contents. Despite this advancement, latent space smoothness within diffusion models remains largely unexplored. Smooth latent spaces ensure that a perturbation on an input latent corresponds to a steady change in the output image. This property proves beneficial in downstream tasks, including image interpolation, inversion, and editing. In this work, we expose the non-smoothness of diffusion latent spaces by observing noticeable visual fluctuations resulting from minor latent variations. To tackle this issue, we propose Smooth Diffusion, a new category of diffusion models that can be simultaneously high-performing and smooth. Specifically, we introduce Step-wise Variation Regularization to enforce the proportion between the variations of an arbitrary input latent and that of the output image is a constant at any diffusion training step. In addition, we devise an interpolation standard deviation (ISTD) metric to effectively assess the latent space smoothness of a diffusion model. Extensive quantitative and qualitative experiments demonstrate that Smooth Diffusion stands out as a more desirable solution not only in T2I generation but also across various downstream tasks. Smooth Diffusion is implemented as a plug-and-play Smooth-LoRA to work with various community models. Code is available at https://github.com/SHI-Labs/Smooth-Diffusion. The paper proposes "Smooth Diffusion," a novel category of diffusion models enhancing latent space smoothness without sacrificing performance. Many downstream tasks like image interpolation, inversion, and editing benefit from a smooth latent space where minor input changes correspond to steady output changes. The authors introduce "Step-wise Variation Regularization" to enforce a consistent ratio between input latent variations and output image variations during training. Smooth Diffusion significantly improves the continuity of transitions in image interpolation. It reduces reconstruction errors in image inversion compared to baselines like Stable Diffusion. Smooth Diffusion better preserves unedited image content during text-based and drag-based editing. Fully fine-tuned models are prone to collapse under the proposed regularization, suggesting a need for careful design. The impact of the regularization strength requires fine-tuning based on specific tasks and datasets. diffusion models, latent space, image interpolation, image inversion, image editing
2312.04302 Report Prompt Highlighter: Interactive Control for Multi-Modal LLMs Yuechen Zhang, Shengju Qian, Bohao Peng, Shu Liu, Jiaya Jia This study targets a critical aspect of multi-modal LLMs' (LLMs&VLMs) inference: explicit controllable text generation. Multi-modal LLMs empower multi-modality understanding with the capability of semantic generation yet bring less explainability and heavier reliance on prompt contents due to their autoregressive generative nature. While manipulating prompt formats could improve outputs, designing specific and precise prompts per task can be challenging and ineffective. To tackle this issue, we introduce a novel inference method, Prompt Highlighter, which enables users to highlight specific prompt spans to interactively control the focus during generation. Motivated by the classifier-free diffusion guidance, we form regular and unconditional context pairs based on highlighted tokens, demonstrating that the autoregressive generation in models can be guided in a classifier-free way. Notably, we find that, during inference, guiding the models with highlighted tokens through the attention weights leads to more desired outputs. Our approach is compatible with current LLMs and VLMs, achieving impressive customized generation results without training. Experiments confirm its effectiveness in focusing on input contexts and generating reliable content. Without tuning on LLaVA-v1.5, our method secured 70.7 in the MMBench test and 1552.5 in MME-perception. The code is available at: https://github.com/dvlab-research/Prompt-Highlighter/ This paper introduces Prompt Highlighter, a novel inference method for multi-modal LLMs that enables users to highlight specific prompt spans to interactively control the focus during text generation. Multi-modal LLMs are powerful but lack explainability and heavily rely on prompt engineering, which can be challenging and ineffective. Prompt Highlighter addresses this by offering more intuitive and fine-grained control over generation. Inspired by classifier-free diffusion guidance, the method constructs regular and unconditional context pairs based on highlighted tokens. It leverages attention mechanisms to guide the model's focus towards highlighted parts, enabling customized generation. Prompt Highlighter enables fine-grained control over generation, allowing users to highlight specific parts of text and images to influence output. The method is effective in mitigating hallucinations and improving the reliability of generated content, as evidenced by quantitative evaluations on benchmarks like MMBench and MME. User studies confirm that a significant majority of users find Prompt Highlighter beneficial and prefer its outputs over traditional inference methods. The approach introduces additional computational overhead due to the extra decoding branch, although the impact is marginal. The quality of generated content is contingent on the capabilities of the base model. Poorly trained models may exhibit limitations in accurately emphasizing or de-emphasizing highlighted sections. multi-modal llms, controllable text generation, prompt highlighting, user interaction, classifier-free guidance
2312.04086 Report MTVG : Multi-text Video Generation with Text-to-Video Models Gyeongrok Oh, Jaehwan Jeong, Sieun Kim, Wonmin Byeon, Jinkyu Kim, Sungwoong Kim, Hyeokmin Kwon, Sangpil Kim Recently, video generation has attracted massive attention and yielded noticeable outcomes. Concerning the characteristics of video, multi-text conditioning incorporating sequential events is necessary for next-step video generation. In this work, we propose a novel multi-text video generation (MTVG) by directly utilizing a pre-trained diffusion-based text-to-video (T2V) generation model without additional fine-tuning. To generate consecutive video segments, visual consistency generated by distinct prompts is necessary with diverse variations, such as motion and content-related transitions. Our proposed MTVG includes Dynamic Noise and Last Frame Aware Inversion which reinitialize the noise latent to preserve visual coherence between videos of different prompts and prevent repetitive motion or contents. Furthermore, we present Structure Guiding Sampling to maintain the global appearance across the frames in a single video clip, where we leverage iterative latent updates across the preceding frame. Additionally, our Prompt Generator allows for arbitrary format of text conditions consisting of diverse events. As a result, our extensive experiments, including diverse transitions of descriptions, demonstrate that our proposed methods show superior generated outputs in terms of semantically coherent and temporally seamless video. Video examples are available in our project page: https://kuai-lab.github.io/mtvg-page. This paper presents MTVG, a novel pipeline for generating videos from multiple text prompts, leveraging pre-trained text-to-video models without requiring further training. Existing text-to-video generation methods often struggle to create coherent and dynamic videos from a sequence of prompts, limiting their ability to portray complex narratives. MTVG employs two key techniques: (1) Last Frame-Aware Latent Initialization, which preserves visual consistency across transitions by incorporating elements of the preceding video clip, and (2) Structure-Guided Sampling, which enhances temporal coherence within each video segment. MTVG generates more semantically coherent and temporally seamless videos compared to existing zero-shot video generation methods. Quantitative results using CLIP-Text and CLIP-Image metrics demonstrate superior performance over baseline models. Human evaluation confirms that MTVG produces more natural and visually appealing videos, reflecting a strong alignment with given prompts. The quality of generated videos can be influenced by the inherent limitations of the pre-trained text-to-video model. Further exploration of prompt engineering and fine-tuning strategies could potentially enhance the overall performance. video generation, multi-text conditioning, diffusion models, zero-shot learning, temporal coherence
2312.04005 Report KOALA: Self-Attention Matters in Knowledge Distillation of Latent Diffusion Models for Memory-Efficient and Fast Image Synthesis Youngwan Lee, Kwanyong Park, Yoorhim Cho, Yong-Ju Lee, Sung Ju Hwang Stable diffusion is the mainstay of the text-to-image (T2I) synthesis in the community due to its generation performance and open-source nature. Recently, Stable Diffusion XL (SDXL), the successor of stable diffusion, has received a lot of attention due to its significant performance improvements with a higher resolution of 1024x1024 and a larger model. However, its increased computation cost and model size require higher-end hardware(e.g., bigger VRAM GPU) for end-users, incurring higher costs of operation. To address this problem, in this work, we propose an efficient latent diffusion model for text-to-image synthesis obtained by distilling the knowledge of SDXL. To this end, we first perform an in-depth analysis of the denoising U-Net in SDXL, which is the main bottleneck of the model, and then design a more efficient U-Net based on the analysis. Secondly, we explore how to effectively distill the generation capability of SDXL into an efficient U-Net and eventually identify four essential factors, the core of which is that self-attention is the most important part. With our efficient U-Net and self-attention-based knowledge distillation strategy, we build our efficient T2I models, called KOALA-1B & -700M, while reducing the model size up to 54% and 69% of the original SDXL model. In particular, the KOALA-700M is more than twice as fast as SDXL while still retaining a decent generation quality. We hope that due to its balanced speed-performance tradeoff, our KOALA models can serve as a cost-effective alternative to SDXL in resource-constrained environments. KOALA, an efficient text-to-image synthesis model distilled from SDXL, achieves a better speed-performance trade-off for resource-constrained environments. SDXL, while achieving state-of-the-art image generation quality, requires high-end hardware due to its large model size and computational cost, limiting accessibility. The authors design efficient U-Net architectures by analyzing SDXL's U-Net and propose a knowledge distillation strategy focusing on self-attention features. KOALA reduces the model size up to 69% and inference time by 60% compared to SDXL. KOALA consistently outperforms BK-SDM in both visual aesthetics (HPSv2) and image-text alignment (T2I-CompBench). KOALA-700M achieves better performance than SDM-v2.0 while having a similar model size and inference speed, and can operate on an 8GB GPU. KOALA shows limitations in rendering legible text and handling complex prompts with multiple attributes, potentially due to the training dataset. Future work includes exploring the integration of machine-generated detailed captions for improved text-alignment. text-to-image synthesis, stable diffusion, knowledge distillation, model compression, self-attention
2312.03913 Report Controllable Human-Object Interaction Synthesis Jiaman Li, Alexander Clegg, Roozbeh Mottaghi, Jiajun Wu, Xavier Puig, C. Karen Liu Synthesizing semantic-aware, long-horizon, human-object interaction is critical to simulate realistic human behaviors. In this work, we address the challenging problem of generating synchronized object motion and human motion guided by language descriptions in 3D scenes. We propose Controllable Human-Object Interaction Synthesis (CHOIS), an approach that generates object motion and human motion simultaneously using a conditional diffusion model given a language description, initial object and human states, and sparse object waypoints. While language descriptions inform style and intent, waypoints ground the motion in the scene and can be effectively extracted using high-level planning methods. Naively applying a diffusion model fails to predict object motion aligned with the input waypoints and cannot ensure the realism of interactions that require precise hand-object contact and appropriate contact grounded by the floor. To overcome these problems, we introduce an object geometry loss as additional supervision to improve the matching between generated object motion and input object waypoints. In addition, we design guidance terms to enforce contact constraints during the sampling process of the trained diffusion model. Presents CHOIS, a novel approach for synthesizing synchronized object and human motion in 3D scenes guided by language descriptions and sparse object waypoints using a conditional diffusion model. Synthesizing realistic and semantically aware human-object interactions is crucial for various applications like computer graphics and robotics. Previous methods struggled with larger, diverse objects and synthesizing both human and object motion from initial states. A conditional diffusion model generates synchronized object and human motion, conditioned on language, object geometry, initial states, and waypoints. An object geometry loss improves object motion accuracy. Guidance terms enforce contact constraints during sampling, enhancing realism. CHOIS successfully generates synchronized object and human motion aligning with language descriptions and object waypoints. The method generalizes to novel objects, demonstrating robustness beyond seen datasets. Human perceptual studies confirm that CHOIS outperforms baselines in terms of text consistency and interaction quality. The model does not explicitly handle articulated objects. Waypoint extraction currently relies on heuristics and could be improved with learned approaches. human-object interaction, motion synthesis, diffusion models, 3d scenes, language guidance
2312.03884 Report WonderJourney: Going from Anywhere to Everywhere Hong-Xing Yu, Haoyi Duan, Junhwa Hur, Kyle Sargent, Michael Rubinstein, William T. Freeman, Forrester Cole, Deqing Sun, Noah Snavely, Jiajun Wu, Charles Herrmann We introduce WonderJourney, a modularized framework for perpetual 3D scene generation. Unlike prior work on view generation that focuses on a single type of scenes, we start at any user-provided location (by a text description or an image) and generate a journey through a long sequence of diverse yet coherently connected 3D scenes. We leverage an LLM to generate textual descriptions of the scenes in this journey, a text-driven point cloud generation pipeline to make a compelling and coherent sequence of 3D scenes, and a large VLM to verify the generated scenes. We show compelling, diverse visual results across various scene types and styles, forming imaginary "wonderjourneys". Project website: https://kovenyu.com/WonderJourney/ Introduces WonderJourney, a modular framework for generating a sequence of diverse and coherent 3D scenes from a text description or an image, simulating a journey through an imaginary world. Addresses the limitations of prior perpetual view generation methods that focus on single scene types or domains, enabling more creative and varied visual storytelling. Leverages an LLM for scene description generation, a text-driven visual module for creating coherent 3D scenes from those descriptions, and a VLM for validating the generated scenes. Generates compelling and diverse visual results across various scene types and styles. Shows significant user preference over baseline methods in terms of diversity, visual quality, scene complexity, and overall interest. Demonstrates the ability to generate long and controlled journeys using user-provided descriptions like poems or story abstracts. Reliance on pretrained models may inherit their biases and limitations. The generation process can sometimes produce undesirable artifacts like photo borders or out-of-focus objects, requiring additional validation and regeneration. 3d scene generation, text-to-3d, perpetual view generation, large language models, vision-language models
2312.03869 Report Inpaint3D: 3D Scene Content Generation using 2D Inpainting Diffusion Kira Prabhu, Jane Wu, Lynn Tsai, Peter Hedman, Dan B Goldman, Ben Poole, Michael Broxton This paper presents a novel approach to inpainting 3D regions of a scene, given masked multi-view images, by distilling a 2D diffusion model into a learned 3D scene representation (e.g. a NeRF). Unlike 3D generative methods that explicitly condition the diffusion model on camera pose or multi-view information, our diffusion model is conditioned only on a single masked 2D image. Nevertheless, we show that this 2D diffusion model can still serve as a generative prior in a 3D multi-view reconstruction problem where we optimize a NeRF using a combination of score distillation sampling and NeRF reconstruction losses. Predicted depth is used as additional supervision to encourage accurate geometry. We compare our approach to 3D inpainting methods that focus on object removal. Because our method can generate content to fill any 3D masked region, we additionally demonstrate 3D object completion, 3D object replacement, and 3D scene completion. This paper introduces a novel method for 3D inpainting of scenes from multi-view images by leveraging a pre-trained 2D inpainting diffusion model as a generative prior for a learned 3D scene representation (NeRF). This approach addresses the limitations of existing 3D inpainting methods that either struggle with 3D consistency or require computationally expensive training of 3D-aware diffusion models. The method employs a joint optimization framework that combines score distillation sampling (SDS) with traditional NeRF reconstruction losses. This allows the model to leverage the 2D inpainting diffusion model for generating content in masked regions while maintaining consistency with the unmasked regions of the input images. The method generates realistic and 3D-consistent inpainted content for various mask types, including sphere masks, object masks, scribble masks, and outpainting masks. Quantitative evaluation on the SPIn-NeRF dataset shows that the proposed method outperforms SPIn-NeRF in terms of SSIM and LPIPS, demonstrating improved 3D consistency. The use of a patch-based depth regularizer significantly improves the overall depth map quality and 3D consistency of the inpainted results. The randomness inherent in SDS can lead to high variance in generated content, sometimes lacking high-frequency detail. Future work will explore alternative diffusion prior distillation methods and incorporate additional 3D priors to enhance the level of detail in inpainted results. 3d inpainting, diffusion models, nerf, score distillation sampling, multi-view reconstruction
2312.03816 Report AVID: Any-Length Video Inpainting with Diffusion Model Zhixing Zhang, Bichen Wu, Xiaoyan Wang, Yaqiao Luo, Luxin Zhang, Yinan Zhao, Peter Vajda, Dimitris Metaxas, Licheng Yu Recent advances in diffusion models have successfully enabled text-guided image inpainting. While it seems straightforward to extend such editing capability into the video domain, there have been fewer works regarding text-guided video inpainting. Given a video, a masked region at its initial frame, and an editing prompt, it requires a model to do infilling at each frame following the editing guidance while keeping the out-of-mask region intact. There are three main challenges in text-guided video inpainting: (i) temporal consistency of the edited video, (ii) supporting different inpainting types at different structural fidelity levels, and (iii) dealing with variable video length. To address these challenges, we introduce Any-Length Video Inpainting with Diffusion Model, dubbed as AVID. At its core, our model is equipped with effective motion modules and adjustable structure guidance, for fixed-length video inpainting. Building on top of that, we propose a novel Temporal MultiDiffusion sampling pipeline with a middle-frame attention guidance mechanism, facilitating the generation of videos with any desired duration. Our comprehensive experiments show our model can robustly deal with various inpainting types at different video duration ranges, with high quality. More visualization results are made publicly available at https://zhang-zx.github.io/AVID/. This paper introduces AVID, a novel framework for text-guided video inpainting that handles variable video lengths and diverse editing types while maintaining temporal consistency. Text-guided video inpainting is a challenging task due to the need for temporal consistency, support for various editing types and structural fidelity levels, and handling variable video lengths. This work addresses these challenges to enable flexible video editing with text. AVID integrates motion modules into a text-to-image inpainting diffusion model, incorporates a structure guidance module adaptable to different inpainting tasks, and employs a Temporal MultiDiffusion sampling pipeline with middle-frame attention guidance for variable video length handling. AVID effectively performs diverse inpainting tasks, including object swapping, re-texturing, and uncropping, while maintaining high visual quality and temporal consistency. The proposed Temporal MultiDiffusion pipeline enables seamless inpainting in videos longer than the model's training duration. Quantitative and qualitative comparisons demonstrate AVID's superiority over existing video editing methods in terms of background preservation, text-video alignment, and temporal consistency. The performance of AVID is limited by the capabilities of the underlying text-to-video model, particularly in handling complex actions. Future work includes exploring learnable structure guidance scales controlled by editing prompts and addressing discontinuity issues in videos with reappearing objects. video inpainting, diffusion models, text-guided editing, temporal consistency, motion modules
2312.03806 Report XCube (X^3): Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies Xuanchi Ren, Jiahui Huang, Xiaohui Zeng, Ken Museth, Sanja Fidler, Francis Williams We present X^3 (pronounced XCube), a novel generative model for high-resolution sparse 3D voxel grids with arbitrary attributes. Our model can generate millions of voxels with a finest effective resolution of up to 1024^3 in a feed-forward fashion without time-consuming test-time optimization. To achieve this, we employ a hierarchical voxel latent diffusion model which generates progressively higher resolution grids in a coarse-to-fine manner using a custom framework built on the highly efficient VDB data structure. Apart from generating high-resolution objects, we demonstrate the effectiveness of XCube on large outdoor scenes at scales of 100m×100m with a voxel size as small as 10cm. We observe clear qualitative and quantitative improvements over past approaches. In addition to unconditional generation, we show that our model can be used to solve a variety of tasks such as user-guided editing, scene completion from a single scan, and text-to-3D. More results and details can be found at https://research.nvidia.com/labs/toronto-ai/xcube/. Introduces XCube, a novel generative model for producing high-resolution sparse 3D voxel grids with attributes like signed distances, normals, and semantics. Addresses limitations of current 3D generative models in scaling to large outdoor scenes and high resolutions, aiming to unlock new possibilities for 3D content generation. Employs a hierarchical voxel latent diffusion model that generates increasingly detailed grids using a coarse-to-fine approach, facilitated by a custom VDB data structure based framework for efficiency. Achieves state-of-the-art results on object generation benchmarks like ShapeNet and Objaverse, outperforming methods using point clouds, triplanes, and dense voxels. Demonstrates scalability by generating high-quality outdoor scenes from Waymo and Karton City datasets at resolutions up to 1024^3 with fine details. Enables user-guided editing, scene completion from single scans, and text-to-3D generation, highlighting the model's versatility. Text-to-3D capability limited by the scale of existing 3D datasets compared to massive image datasets. Future work includes exploring image-conditioning and leveraging the learned 3D prior for downstream tasks like reconstruction and perception. 3d generation, voxel diffusion models, sparse representations, hierarchical modeling, large-scale scenes
2312.03795 Report AnimatableDreamer: Text-Guided Non-rigid 3D Model Generation and Reconstruction with Canonical Score Distillation Xinzhou Wang, Yikai Wang, Junliang Ye, Zhengyi Wang, Fuchun Sun, Pengkun Liu, Ling Wang, Kai Sun, Xintong Wang, Bin He Advances in 3D generation have facilitated sequential 3D model generation (a.k.a 4D generation), yet its application for animatable objects with large motion remains scarce. Our work proposes AnimatableDreamer, a text-to-4D generation framework capable of generating diverse categories of non-rigid objects on skeletons extracted from a monocular video. At its core, AnimatableDreamer is equipped with our novel optimization design dubbed Canonical Score Distillation (CSD), which lifts 2D diffusion for temporal consistent 4D generation. CSD, designed from a score gradient perspective, generates a canonical model with warp-robustness across different articulations. Notably, it also enhances the authenticity of bones and skinning by integrating inductive priors from a diffusion model. Furthermore, with multi-view distillation, CSD infers invisible regions, thereby improving the fidelity of monocular non-rigid reconstruction. Extensive experiments demonstrate the capability of our method in generating high-flexibility text-guided 3D models from the monocular video, while also showing improved reconstruction performance over existing non-rigid reconstruction methods. AnimatableDreamer, a novel framework, is presented that leverages text prompts and monocular videos to generate and reconstruct animatable 3D models of generic categories with non-rigid deformations. Existing methods for generating deformable 3D objects struggle with large motions and often lack diversity or rely heavily on multi-view data. AnimatableDreamer addresses these limitations by using a novel optimization design called Canonical Score Distillation (CSD). AnimatableDreamer operates in two stages: 1) Skeleton Extraction: Extracts skeletons, skinning, and motions from monocular videos using CSD to refine unseen regions. 2) Skeleton-Based Generation: Generates a new canonical model guided by the extracted skeleton, bones, and text prompt, ensuring time consistency and warping robustness through CSD. Generates high-quality, animatable 3D models with text prompts from a template video, demonstrating time consistency and morphological plausibility. CSD enhances the generation and reconstruction of non-rigid 3D models, ensuring morphological plausibility after warping and improving reconstruction quality in unseen regions. Outperforms existing methods in monocular non-rigid object reconstruction, especially with limited viewpoints and large motion, as shown by quantitative and qualitative comparisons. Requires large VRAM due to high-resolution rendering for CSD training. Simultaneous feeding of four images to MVDream poses a computational burden. 4d generation, diffusion model, non-rigid reconstruction, canonical score distillation, skeleton-based generation
2312.03793 Report AnimateZero: Video Diffusion Models are Zero-Shot Image Animators Jiwen Yu, Xiaodong Cun, Chenyang Qi, Yong Zhang, Xintao Wang, Ying Shan, Jian Zhang Large-scale text-to-video (T2V) diffusion models have great progress in recent years in terms of visual quality, motion and temporal consistency. However, the generation process is still a black box, where all attributes (e.g., appearance, motion) are learned and generated jointly without precise control ability other than rough text descriptions. Inspired by image animation which decouples the video as one specific appearance with the corresponding motion, we propose AnimateZero to unveil the pre-trained text-to-video diffusion model, i.e., AnimateDiff, and provide more precise appearance and motion control abilities for it. For appearance control, we borrow intermediate latents and their features from the text-to-image (T2I) generation for ensuring the generated first frame is equal to the given generated image. For temporal control, we replace the global temporal attention of the original T2V model with our proposed positional-corrected window attention to ensure other frames align with the first frame well. Empowered by the proposed methods, AnimateZero can successfully control the generating progress without further training. As a zero-shot image animator for given images, AnimateZero also enables multiple new applications, including interactive video generation and real image animation. The detailed experiments demonstrate the effectiveness of the proposed method in both T2V and related applications. This paper presents AnimateZero, a zero-shot method for controllable video generation and image animation, by modifying the architecture of pre-trained text-to-video diffusion models. Existing text-to-video diffusion models lack precise control over appearance and motion, limiting their ability for step-by-step video generation from specific images. AnimateZero decouples appearance and motion control. It inserts intermediate latents from text-to-image generation to control the first frame appearance and utilizes a positional-corrected window attention mechanism to ensure temporal consistency across frames. AnimateZero generates videos that better match the text prompt and the original text-to-image domain compared to baselines. It achieves comparable or superior quality to state-of-the-art image-to-video tools. AnimateZero demonstrates potential for various applications, including controllable video generation, image animation, frame interpolation and looped video generation. AnimateZero's motion generation is limited by the motion prior of the base video diffusion model. Domain gap issues can arise when animating real images due to style, resolution, and potential degradations. text-to-video generation, image animation, diffusion models, controllable generation, zero-shot learning
2312.03771 Report DreamInpainter: Text-Guided Subject-Driven Image Inpainting with Diffusion Models Shaoan Xie, Yang Zhao, Zhisheng Xiao, Kelvin C. K. Chan, Yandong Li, Yanwu Xu, Kun Zhang, Tingbo Hou This study introduces Text-Guided Subject-Driven Image Inpainting, a novel task that combines text and exemplar images for image inpainting. While both text and exemplar images have been used independently in previous efforts, their combined utilization remains unexplored. Simultaneously accommodating both conditions poses a significant challenge due to the inherent balance required between editability and subject fidelity. To tackle this challenge, we propose a two-step approach DreamInpainter. First, we compute dense subject features to ensure accurate subject replication. Then, we employ a discriminative token selection module to eliminate redundant subject details, preserving the subject's identity while allowing changes according to other conditions such as mask shape and text prompts. Additionally, we introduce a decoupling regularization technique to enhance text control in the presence of exemplar images. Our extensive experiments demonstrate the superior performance of our method in terms of visual quality, identity preservation, and text control, showcasing its effectiveness in the context of text-guided subject-driven image inpainting. This paper introduces the task of Text-Guided Subject-Driven Image Inpainting, aiming to combine the advantages of text-conditioned and exemplar-based inpainting for enhanced control and creativity. This task addresses the limitations of current inpainting techniques that struggle to balance identity preservation with editability guided by both text prompts and exemplar images. The authors propose DreamInpainter, a two-step approach. First, dense subject features are extracted from an exemplar image using the UNet encoder of a pre-trained diffusion model. Then, a discriminative token selection module filters these features, preserving key identity information while allowing for edits based on text prompts and mask shapes. A decoupling regularization technique is also introduced to enhance text control in the presence of exemplar images. DreamInpainter effectively preserves subject identity while allowing flexible text-guided edits like attribute changes, shape modifications, and style transfers. The method outperforms strong baselines in terms of identity preservation and text alignment, as shown by quantitative metrics like R-FID, F-CLIP, and F-DINO. The importance of both the token selection module and the decoupling regularization is demonstrated through ablation studies, highlighting their role in preventing copy-paste artifacts and enhancing text control. DreamInpainter may struggle to preserve intricate details when dealing with complex reference objects due to the fixed number of selected tokens. Future work could explore adaptive token selection based on object complexity to further enhance detail preservation without sacrificing editability. image inpainting, text-to-image generation, diffusion models, subject-driven generation, token selection
2312.03763 Report Gaussian3Diff: 3D Gaussian Diffusion for 3D Full Head Synthesis and Editing Yushi Lan, Feitong Tan, Di Qiu, Qiangeng Xu, Kyle Genova, Zeng Huang, Sean Fanello, Rohit Pandey, Thomas Funkhouser, Chen Change Loy, Yinda Zhang We present a novel framework for generating photorealistic 3D human heads and subsequently manipulating and reposing them with remarkable flexibility. The proposed approach leverages an implicit function representation of 3D human heads, employing 3D Gaussians anchored on a parametric face model. To enhance representational capabilities and encode spatial information, we embed a lightweight tri-plane payload within each Gaussian rather than directly storing color and opacity. Additionally, we parameterize the Gaussians in a 2D UV space via a 3DMM, enabling effective utilization of the diffusion model for 3D head avatar generation. Our method facilitates the creation of diverse and realistic 3D human heads with fine-grained editing over facial features and expressions. Extensive experiments demonstrate the effectiveness of our method. This paper proposes Gaussian3Diff, a novel framework for generating and manipulating photorealistic 3D human heads, enabling high-level and fine-grained control over facial shape, texture, and expression. Existing methods for 3D-aware portrait generation and editing often lack flexibility, particularly in local feature editing, and struggle with disentangling shape and texture. Gaussian3Diff addresses these limitations by introducing a novel representation and leveraging diffusion models. The method represents 3D heads using 3D Gaussians anchored to a 3D Morphable Model (3DMM), with each Gaussian containing a tri-plane payload to encode local appearance. It employs an analysis-by-synthesis approach, reconstructing a large dataset of 3D heads while learning a shared latent space via an auto-decoder. A 2D diffusion model is then trained on this latent space for generating and editing. The method achieves high-quality 3D reconstruction with intrinsic support for 3DMM-driven animation, outperforming existing methods on expression editing benchmarks. It demonstrates superior editing capabilities, including inter-subject attribute transfer, local region-based editing, and 3D in-painting, while maintaining high fidelity and view consistency. The use of a shared latent space and UV space parameterization enables disentanglement of shape and texture, facilitating smooth interpolation and manipulation of facial features. The model currently exhibits bias inherited from the training dataset, such as a tendency for generated females to smile more often. While the method excels at multi-view reconstruction, single-image inversion remains challenging and an area for future work. 3d head generation, diffusion models, 3d gaussian representation, facial editing, 3dmm
2312.03701 Report Return of Unconditional Generation: A Self-supervised Representation Generation Method Tianhong Li, Dina Katabi, Kaiming He Unconditional generation -- the problem of modeling data distribution without relying on human-annotated labels -- is a long-standing and fundamental challenge in generative models, creating a potential of learning from large-scale unlabeled data. In the literature, the generation quality of an unconditional method has been much worse than that of its conditional counterpart. This gap can be attributed to the lack of semantic information provided by labels. In this work, we show that one can close this gap by generating semantic representations in the representation space produced by a self-supervised encoder. These representations can be used to condition the image generator. This framework, called Representation-Conditioned Generation (RCG), provides an effective solution to the unconditional generation problem without using labels. Through comprehensive experiments, we observe that RCG significantly improves unconditional generation quality: e.g., it achieves a new state-of-the-art FID of 2.15 on ImageNet 256x256, largely reducing the previous best of 5.91 by a relative 64%. Our unconditional results are situated in the same tier as the leading class-conditional ones. We hope these encouraging observations will attract the community's attention to the fundamental problem of unconditional generation. Code is available at https://github.com/LTH14/rcg. This paper introduces Representation-Conditioned Generation (RCG), a novel framework for unconditional image generation that leverages self-supervised representations to improve generation quality. Unconditional generation, which aims to learn data distributions without human-annotated labels, often lags behind conditional methods. RCG bridges this gap by utilizing the rich semantic information embedded within self-supervised representations. RCG uses a pre-trained self-supervised encoder to map images to a representation space. A lightweight representation diffusion model is trained to generate representations. Finally, an image generator (e.g., ADM, DiT, MAGE) generates images conditioned on these representations. RCG significantly improves unconditional generation quality across different image generators (LDM, ADM, DiT, MAGE) and datasets (ImageNet, CIFAR-10, iNaturalist). On ImageNet 256x256, RCG achieves a state-of-the-art FID of 2.15 for unconditional generation, rivaling leading class-conditional methods. The representation space learned by RCG exhibits semantic smoothness, enabling controlled image manipulation via representation interpolation. RCG's performance depends on the quality of the pre-trained self-supervised encoder. Exploring the potential of pre-training RCG components on larger unlabeled datasets for improved generalization and downstream task adaptation. unconditional image generation, self-supervised representation learning, diffusion models, generative models, representation learning
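A structural sketch of the representation-conditioned pipeline described above: a diffusion model over the self-supervised representation space generates a representation, which then conditions an image generator. The tiny MLP denoiser, the heavily simplified sampler, and the linear generator stub are placeholders, not RCG's actual models.

```python
# Minimal structural sketch of representation-conditioned generation (RCG-style flow only).
import torch
import torch.nn as nn

class TinyRDM(nn.Module):
    """Toy denoiser over the (low-dimensional) self-supervised representation space."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 512), nn.SiLU(), nn.Linear(512, dim))
    def forward(self, z, t):
        return self.net(torch.cat([z, t[:, None].float()], dim=-1))  # predicts noise

@torch.no_grad()
def sample_representation(rdm, dim=256, steps=50):
    # Extremely simplified ancestral-sampling loop over representations (illustrative only).
    z = torch.randn(1, dim)
    for t in reversed(range(steps)):
        eps = rdm(z, torch.tensor([t]))
        z = z - eps / steps + 0.1 * torch.randn_like(z) * (t > 0)
    return z

class CondImageGenerator(nn.Module):
    """Stand-in for the conditioned pixel generator (ADM/DiT/MAGE in the paper)."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Linear(dim, 3 * 32 * 32)
    def forward(self, z):
        return self.net(z).view(-1, 3, 32, 32)

rdm, generator = TinyRDM(), CondImageGenerator()
z = sample_representation(rdm)   # stage 2: generate a representation, no labels involved
image = generator(z)             # stage 3: image generation conditioned on the representation
print(image.shape)               # torch.Size([1, 3, 32, 32])
```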
2312.03700 Report OneLLM: One Framework to Align All Modalities with Language Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, Xiangyu Yue Multimodal large language models (MLLMs) have gained significant attention due to their strong multimodal understanding capability. However, existing works rely heavily on modality-specific encoders, which usually differ in architecture and are limited to common modalities. In this paper, we present OneLLM, an MLLM that aligns eight modalities to language using a unified framework. We achieve this through a unified multimodal encoder and a progressive multimodal alignment pipeline. In detail, we first train an image projection module to connect a vision encoder with LLM. Then, we build a universal projection module (UPM) by mixing multiple image projection modules and dynamic routing. Finally, we progressively align more modalities to LLM with the UPM. To fully leverage the potential of OneLLM in following instructions, we also curated a comprehensive multimodal instruction dataset, including 2M items from image, audio, video, point cloud, depth/normal map, IMU and fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering and reasoning, where it delivers excellent performance. Code, data, model and online demo are available at https://github.com/csuhan/OneLLM Proposes OneLLM, an MLLM aligning 8 modalities to language using a unified framework with a universal encoder and progressive multimodal alignment. Existing MLLMs rely on modality-specific encoders, limiting scalability and expansion to diverse modalities. Trains a vision LLM for initialization, progressively aligns other modalities using a universal encoder (pretrained CLIP-ViT) and a universal projection module (mixture of experts). Fine-tunes on a curated multimodal instruction dataset. Outperforms existing MMLLMs and specialized models on 25 multimodal benchmarks, including captioning, question answering, and reasoning tasks. Demonstrates strong zero-shot capabilities on tasks like audio question answering and depth/normal map scene classification. Shows effectiveness of joint training for data-scarce modalities and benefits of image-text pretraining for multimodal alignment. Limited by the availability of large-scale, high-quality datasets for modalities beyond images. Future work includes collecting high-quality datasets and designing new encoders for fine-grained multimodal understanding. multimodal learning, large language models, vision-language models, multimodal alignment, unified framework
2312.03641 Report MotionCtrl: A Unified and Flexible Motion Controller for Video Generation Zhouxia Wang, Ziyang Yuan, Xintao Wang, Tianshui Chen, Menghan Xia, Ping Luo, Ying Shan Motions in a video primarily consist of camera motion, induced by camera movement, and object motion, resulting from object movement. Accurate control of both camera and object motion is essential for video generation. However, existing works either mainly focus on one type of motion or do not clearly distinguish between the two, limiting their control capabilities and diversity. Therefore, this paper presents MotionCtrl, a unified and flexible motion controller for video generation designed to effectively and independently control camera and object motion. The architecture and training strategy of MotionCtrl are carefully devised, taking into account the inherent properties of camera motion, object motion, and imperfect training data. Compared to previous methods, MotionCtrl offers three main advantages: 1) It effectively and independently controls camera motion and object motion, enabling more fine-grained motion control and facilitating flexible and diverse combinations of both types of motion. 2) Its motion conditions are determined by camera poses and trajectories, which are appearance-free and minimally impact the appearance or shape of objects in generated videos. 3) It is a relatively generalizable model that can adapt to a wide array of camera poses and trajectories once trained. Extensive qualitative and quantitative experiments have been conducted to demonstrate the superiority of MotionCtrl over existing methods. MotionCtrl: A Unified and Flexible Motion Controller for Video Generation Accurate control of both camera and object motion is crucial for video generation, but existing methods often lack independent control or clear distinction between the two. MotionCtrl introduces two modules: CMCM (Camera Motion Control Module) fusing camera poses with LVDM's temporal transformers for global motion, and OMCM (Object Motion Control Module) spatially incorporating object trajectories into LVDM's convolutional layers. It is trained using augmented datasets: Realestate10k with captions for CMCM and WebVid with synthesized object trajectories for OMCM. Independently controls camera and object motion, enabling fine-grained adjustments and diverse combinations. Uses camera poses and trajectories as motion conditions, avoiding unnatural appearance artifacts in generated videos. Generalizes to a wide range of camera movements and object trajectories without fine-tuning for each specific motion. Reliance on separate datasets for camera and object motion training due to the lack of a comprehensive dataset. Further improvements in object trajectory synthesis for more realistic and complex object motion control. video generation, motion control, camera motion, object motion, text-to-video
2312.03628 Report Boosting Segment Anything Model Towards Open-Vocabulary Learning Xumeng Han, Longhui Wei, Xuehui Yu, Zhiyang Dou, Xin He, Kuiran Wang, Zhenjun Han, Qi Tian The recent Segment Anything Model (SAM) has emerged as a new paradigmatic vision foundation model, showcasing potent zero-shot generalization and flexible prompting. Despite SAM finding applications and adaptations in various domains, its primary limitation lies in the inability to grasp object semantics. In this paper, we present Sambor to seamlessly integrate SAM with the open-vocabulary object detector in an end-to-end framework. While retaining all the remarkable capabilities inherent to SAM, we enhance it with the capacity to detect arbitrary objects based on human inputs like category names or reference expressions. To accomplish this, we introduce a novel SideFormer module that extracts SAM features to facilitate zero-shot object localization and inject comprehensive semantic information for open-vocabulary recognition. In addition, we devise an open-set region proposal network (Open-set RPN), enabling the detector to acquire the open-set proposals generated by SAM. Sambor demonstrates superior zero-shot performance across benchmarks, including COCO and LVIS, proving highly competitive against previous SoTA methods. We aspire for this work to serve as a meaningful endeavor in endowing SAM to recognize diverse object categories and advancing open-vocabulary learning with the support of vision foundation models. This paper proposes Sambor, an end-to-end open-vocabulary object detection framework that integrates the Segment Anything Model (SAM) and enables it to detect arbitrary objects based on human inputs such as category names or phrases. While SAM excels in zero-shot segmentation, it lacks the semantic understanding for object recognition. Sambor addresses this limitation, enhancing SAM's capabilities and advancing open-vocabulary learning. Sambor utilizes a novel SideFormer module to extract SAM features for zero-shot object localization and inject semantic information from CLIP for recognition. Additionally, it employs an Open-set RPN to generate region proposals from SAM's output. Sambor achieves state-of-the-art zero-shot performance on COCO and LVIS benchmarks. The proposed SideFormer module effectively combines SAM and CLIP features, enhancing both object localization and recognition. The Open-set RPN significantly improves proposal quality, further boosting detection performance. Sambor's performance can be further enhanced by scaling up training with larger image-text datasets and by incorporating few-shot learning capabilities. Exploring the integration of more interactive operations for gradual improvement is left for future work. open-vocabulary object detection, segment anything model (sam), vision foundation models, zero-shot learning, open-set recognition
2312.03626 Report TokenCompose: Grounding Diffusion with Token-level Supervision Zirui Wang, Zhizhou Sha, Zheng Ding, Yilin Wang, Zhuowen Tu We present TokenCompose, a Latent Diffusion Model for text-to-image generation that achieves enhanced consistency between user-specified text prompts and model-generated images. Despite its tremendous success, the standard denoising process in the Latent Diffusion Model takes text prompts as conditions only, absent explicit constraint for the consistency between the text prompts and the image contents, leading to unsatisfactory results for composing multiple object categories. TokenCompose aims to improve multi-category instance composition by introducing the token-wise consistency terms between the image content and object segmentation maps in the finetuning stage. TokenCompose can be applied directly to the existing training pipeline of text-conditioned diffusion models without extra human labeling information. By finetuning Stable Diffusion, the model exhibits significant improvements in multi-category instance composition and enhanced photorealism for its generated images. TokenCompose, a Latent Diffusion Model that enhances consistency between text prompts and generated images, particularly in composing multiple object categories, by incorporating token-wise consistency terms during fine-tuning. Standard Latent Diffusion Models lack explicit constraints for text-image consistency, leading to unsatisfactory compositions, especially for multiple object categories. Leverages pretrained vision models (Grounded SAM, Grounding DINO) to generate segmentation maps for noun tokens in training captions, then jointly optimizes the diffusion model with denoising and token-image grounding objectives. Achieves state-of-the-art performance on multi-category instance composition benchmarks (VISOR, MultiGen). Exhibits enhanced photorealism as measured by FID scores on COCO and Flickr30K Entities. Maintains efficient inference speed comparable to standard text-conditioned diffusion models. Currently focuses on noun tokens, leaving room for incorporating other parts of speech (adjectives, verbs) as training objectives. Exploration of different grounding objectives and architectures for further improvement. text-to-image generation, latent diffusion models, compositionality, image understanding, multi-category instance composition
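One plausible form of a token-wise grounding term like the one described above: push each noun token's cross-attention mass inside that token's segmentation mask. The tensor shapes and the particular loss form are assumptions for illustration; the paper's exact objectives may differ.

```python
# Hedged sketch of a token-wise grounding loss between cross-attention maps and masks.
import torch

def token_grounding_loss(attn, masks):
    """
    attn:  [B, HW, T] cross-attention probabilities from a denoising U-Net layer.
    masks: [B, HW, T] binary segmentation masks; masks[..., j] marks noun token j's region.
    Returns 1 minus the average fraction of each token's attention mass inside its mask.
    """
    attn = attn / (attn.sum(dim=1, keepdim=True) + 1e-8)   # normalize over spatial positions
    inside = (attn * masks).sum(dim=1)                     # [B, T] attention mass inside the mask
    return (1.0 - inside).mean()

# Toy usage with random tensors standing in for real attention maps and masks.
B, HW, T = 2, 16 * 16, 3
attn = torch.rand(B, HW, T)
masks = (torch.rand(B, HW, T) > 0.7).float()
loss = token_grounding_loss(attn, masks)
# In fine-tuning this term would be added to the standard denoising objective.
print(float(loss))
```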
2312.03611 Report DreamComposer: Controllable 3D Object Generation via Multi-View Conditions Yunhan Yang, Yukun Huang, Xiaoyang Wu, Yuan-Chen Guo, Song-Hai Zhang, Hengshuang Zhao, Tong He, Xihui Liu Utilizing pre-trained 2D large-scale generative models, recent works are capable of generating high-quality novel views from a single in-the-wild image. However, due to the lack of information from multiple views, these works encounter difficulties in generating controllable novel views. In this paper, we present DreamComposer, a flexible and scalable framework that can enhance existing view-aware diffusion models by injecting multi-view conditions. Specifically, DreamComposer first uses a view-aware 3D lifting module to obtain 3D representations of an object from multiple views. Then, it renders the latent features of the target view from 3D representations with the multi-view feature fusion module. Finally the target view features extracted from multi-view inputs are injected into a pre-trained diffusion model. Experiments show that DreamComposer is compatible with state-of-the-art diffusion models for zero-shot novel view synthesis, further enhancing them to generate high-fidelity novel view images with multi-view conditions, ready for controllable 3D object reconstruction and various other applications. DreamComposer is a flexible and scalable framework that enhances existing view-aware diffusion models for controllable novel view synthesis by injecting multi-view conditions. Existing methods for novel view synthesis struggle to generate controllable novel views due to the lack of information from multiple views. DreamComposer uses a three-stage approach: 1) target-aware 3D lifting to obtain 3D representations from multi-view inputs, 2) multi-view feature fusion to render and fuse 3D features into target-view 2D features, 3) target-view feature injection to incorporate the fused features into a pre-trained diffusion model. DreamComposer enables controllable novel view synthesis by conditioning on multiple input views. It improves the accuracy of unseen viewpoints compared to single-view methods. DreamComposer is compatible with existing state-of-the-art models like Zero-1-to-3 and SyncDreamer, enhancing their controllability and fidelity. Preserving fine-grained textures from non-main view input images remains challenging. Angular deviations between multi-view input images can affect generation quality. novel view synthesis, diffusion models, multi-view conditioning, 3d object generation, controllable image synthesis
2312.03594 Report A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting Junhao Zhuang, Yanhong Zeng, Wenran Liu, Chun Yuan, Kai Chen Achieving high-quality versatile image inpainting, where user-specified regions are filled with plausible content according to user intent, presents a significant challenge. Existing methods face difficulties in simultaneously addressing context-aware image inpainting and text-guided object inpainting due to the distinct optimal training strategies required. To overcome this challenge, we introduce PowerPaint, the first high-quality and versatile inpainting model that excels in both tasks. First, we introduce learnable task prompts along with tailored fine-tuning strategies to guide the model's focus on different inpainting targets explicitly. This enables PowerPaint to accomplish various inpainting tasks by utilizing different task prompts, resulting in state-of-the-art performance. Second, we demonstrate the versatility of the task prompt in PowerPaint by showcasing its effectiveness as a negative prompt for object removal. Additionally, we leverage prompt interpolation techniques to enable controllable shape-guided object inpainting. Finally, we extensively evaluate PowerPaint on various inpainting benchmarks to demonstrate its superior performance for versatile image inpainting. We release our codes and models on our project page: https://powerpaint.github.io/. This paper presents PowerPaint, a versatile image inpainting model that excels in both text-guided object inpainting and context-aware image inpainting through the use of learnable task prompts. Existing image inpainting methods struggle to effectively handle both context-aware and text-guided inpainting due to their conflicting optimal training strategies. PowerPaint addresses this challenge, offering a unified solution for versatile high-quality inpainting. PowerPaint introduces three learnable task prompts (P_obj, P_ctxt, P_shape) and fine-tunes a text-to-image model (Stable Diffusion) with different strategies for each task. It leverages classifier-free guidance sampling with task prompts and enables controllable shape-guided inpainting through prompt interpolation. PowerPaint achieves state-of-the-art performance on various inpainting benchmarks for both text-guided object inpainting and context-aware image inpainting. The learned task prompts effectively function as negative prompts, enhancing object removal capabilities in crowded scenes. Prompt interpolation facilitates controllable shape-guided object inpainting, balancing object shape and textual description adherence. The synthesis quality is limited by the underlying text-to-image model. Achieving precise shape control for small objects remains challenging due to sparse representation during training. image inpainting, text-guided synthesis, context-aware inpainting, task prompts, diffusion models
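A minimal sketch of the prompt-interpolation idea mentioned above, assuming the learned task prompts live in the text encoder's token-embedding space; the embedding dimensions and the specific pair being interpolated are illustrative assumptions.

```python
# Hedged sketch of task-prompt interpolation for shape-guided inpainting control.
import torch

def interpolate_task_prompts(e_shape: torch.Tensor, e_obj: torch.Tensor, alpha: float):
    """Linear interpolation between two learned task-prompt embeddings.
    alpha close to 1.0 -> follow the mask shape closely; alpha close to 0.0 -> freer object inpainting."""
    return alpha * e_shape + (1.0 - alpha) * e_obj

# Toy stand-ins for the learned embeddings (e.g. 77 x 768 text-encoder tokens).
e_shape = torch.randn(1, 77, 768)
e_obj = torch.randn(1, 77, 768)
blended = interpolate_task_prompts(e_shape, e_obj, alpha=0.6)
# The blended prompt embedding would then condition the inpainting U-Net alongside the caption.
print(blended.shape)
```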
2312.03587 Report Language-Informed Visual Concept Learning Sharon Lee, Yunzhi Zhang, Shangzhe Wu, Jiajun Wu Our understanding of the visual world is centered around various concept axes, characterizing different aspects of visual entities. While different concept axes can be easily specified by language, e.g. color, the exact visual nuances along each axis often exceed the limitations of linguistic articulations, e.g. a particular style of painting. In this work, our goal is to learn a language-informed visual concept representation, by simply distilling large pre-trained vision-language models. Specifically, we train a set of concept encoders to encode the information pertinent to a set of language-informed concept axes, with an objective of reproducing the input image through a pre-trained Text-to-Image (T2I) model. To encourage better disentanglement of different concept encoders, we anchor the concept embeddings to a set of text embeddings obtained from a pre-trained Visual Question Answering (VQA) model. At inference time, the model extracts concept embeddings along various axes from new test images, which can be remixed to generate images with novel compositions of visual concepts. With a lightweight test-time finetuning procedure, it can also generalize to novel concepts unseen at training. This paper proposes a framework for learning disentangled and compositional visual concepts grounded in language by distilling knowledge from pre-trained text-to-image and visual question answering models. Learning such representations is crucial for enabling flexible manipulation and generation of images with desired combinations of visual concepts. The framework trains a set of concept encoders to extract concept embeddings from images, guided by two objectives: 1) reconstructing the input image through a pre-trained T2I model given axis-informed text prompts and 2) aligning the concept embeddings with corresponding text embeddings from a pre-trained VQA model. The learned concept encoders can extract disentangled concept embeddings from images, enabling the generation of images with novel compositions of concepts via remixing. The framework allows for generalization to unseen concepts through a lightweight test-time finetuning procedure. Qualitative and quantitative experiments demonstrate the effectiveness of the proposed method in visual concept editing compared to text-based prompting baselines. The current model requires concept axes to be pre-defined, limiting the generality of the concept space it can capture. Training separate encoders for each concept axis does not fully exploit the potential hierarchical structure among them. visual concept learning, text-to-image generation, visual question answering, concept disentanglement, image generation
2312.03584 Report Context Diffusion: In-Context Aware Image Generation Ivona Najdenkoska, Animesh Sinha, Abhimanyu Dubey, Dhruv Mahajan, Vignesh Ramanathan, Filip Radenovic We propose Context Diffusion, a diffusion-based framework that enables image generation models to learn from visual examples presented in context. Recent work tackles such in-context learning for image generation, where a query image is provided alongside context examples and text prompts. However, the quality and fidelity of the generated images deteriorate when the prompt is not present, demonstrating that these models are unable to truly learn from the visual context. To address this, we propose a novel framework that separates the encoding of the visual context and preserving the structure of the query images. This results in the ability to learn from the visual context and text prompts, but also from either one of them. Furthermore, we enable our model to handle few-shot settings, to effectively address diverse in-context learning scenarios. Our experiments and user study demonstrate that Context Diffusion excels in both in-domain and out-of-domain tasks, resulting in an overall enhancement in image quality and fidelity compared to counterpart models. This paper introduces Context Diffusion, a diffusion-based image generation model that learns from visual context examples, alongside text prompts and query images, effectively separating structure preservation (query image) from style and detail infusion (context images). Existing in-context image generation models struggle to effectively utilize visual context without strong reliance on text prompts, limiting their flexibility and generalization to unseen tasks. The model encodes visual context separately from the query image, injecting it alongside text embeddings into the cross-attention layers of a diffusion model, enabling learning from either or both conditioning signals. Trained on diverse image-to-map and map-to-image tasks, it supports single and multiple context image inputs. Context Diffusion demonstrates superior fidelity to visual context even without text prompts, outperforming prior art in both in-domain and out-of-domain tasks. The model effectively generalizes to unseen tasks like sketch-to-image and image editing, showcasing true in-context learning capability. Using multiple context images further enhances image quality, particularly in the absence of text prompts, highlighting the benefit of few-shot learning. The current design assumes alignment between text prompts and context images; future work could explore complementary information between them. Generating images with fine-grained details, especially for local edits, remains challenging and presents an area for improvement. image generation, in-context learning, diffusion models, few-shot learning, controllable generation
2312.03517 Report FRDiff : Feature Reuse for Universal Training-free Acceleration of Diffusion Models Junhyuk So, Jungwon Lee, Eunhyeok Park The substantial computational costs of diffusion models, especially due to the repeated denoising steps necessary for high-quality image generation, present a major obstacle to their widespread adoption. While several studies have attempted to address this issue by reducing the number of score function evaluations (NFE) using advanced ODE solvers without fine-tuning, the decreased number of denoising iterations misses the opportunity to update fine details, resulting in noticeable quality degradation. In our work, we introduce an advanced acceleration technique that leverages the temporal redundancy inherent in diffusion models. Reusing feature maps with high temporal similarity opens up a new opportunity to save computation resources without compromising output quality. To realize the practical benefits of this intuition, we conduct an extensive analysis and propose a novel method, FRDiff. FRDiff is designed to harness the advantages of both reduced NFE and feature reuse, achieving a Pareto frontier that balances fidelity and latency trade-offs in various generative tasks. FRDiff, a novel zero-shot diffusion model acceleration technique leveraging feature reuse (FR) based on temporal redundancy in iterative generation, achieving up to 1.76x speedup without quality loss. Diffusion models, while powerful, suffer from high computational cost due to numerous denoising steps, hindering wider adoption. FRDiff addresses this by reducing redundant computations. FRDiff reuses similar feature maps from adjacent timesteps, combines reduced NFE with FR via score mixing for optimal quality-latency trade-off, and employs Auto-FR for automatic tuning. FRDiff achieves up to 1.76x acceleration without noticeable quality degradation compared to baseline DDIM. Quantitative analysis shows superior FID scores and speed compared to DDIM with reduced NFE, demonstrating better Pareto fronts. FRDiff is successfully applied to various tasks like super-resolution, image inpainting, and text-to-video generation, showcasing versatility. Applicability may be limited when score function evaluation time steps are not continuous (e.g., DPM-Solver++). Further investigation needed for methods with non-consecutive score function evaluations. diffusion models, model acceleration, feature reuse, zero-shot learning, image generation
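A minimal sketch of the feature-reuse intuition described above: recompute expensive blocks only at "keyframe" denoising steps and serve cached features in between. The toy block, the fixed reuse interval, and the omission of FRDiff's score-mixing and Auto-FR tuning are simplifications.

```python
# Hedged sketch of feature reuse across denoising steps (illustrative, not FRDiff's code).
import torch
import torch.nn as nn

class ExpensiveBlock(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
    def forward(self, x):
        return self.net(x)

class ReusableBlock(nn.Module):
    """Wraps a block; recomputes its output only when the current step is a keyframe."""
    def __init__(self, block, reuse_interval=5):
        super().__init__()
        self.block = block
        self.reuse_interval = reuse_interval
        self.cache = None
    def forward(self, x, step):
        if self.cache is None or step % self.reuse_interval == 0:
            self.cache = self.block(x)   # full computation at keyframe steps
        return self.cache                # temporally similar features are reused otherwise

block = ReusableBlock(ExpensiveBlock(), reuse_interval=5)
x = torch.randn(1, 64)
for step in range(50):                   # stand-in for the iterative denoising loop
    feats = block(x, step)
print(feats.shape)
```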
2312.03461 Report HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting Yuheng Jiang, Zhehao Shen, Penghao Wang, Zhuo Su, Yu Hong, Yingliang Zhang, Jingyi Yu, Lan Xu We have recently seen tremendous progress in photo-real human modeling and rendering. Yet, efficiently rendering realistic human performance and integrating it into the rasterization pipeline remains challenging. In this paper, we present HiFi4G, an explicit and compact Gaussian-based approach for high-fidelity human performance rendering from dense footage. Our core intuition is to marry the 3D Gaussian representation with non-rigid tracking, achieving a compact and compression-friendly representation. We first propose a dual-graph mechanism to obtain motion priors, with a coarse deformation graph for effective initialization and a fine-grained Gaussian graph to enforce subsequent constraints. Then, we utilize a 4D Gaussian optimization scheme with adaptive spatial-temporal regularizers to effectively balance the non-rigid prior and Gaussian updating. We also present a companion compression scheme with residual compensation for immersive experiences on various platforms. It achieves a substantial compression rate of approximately 25 times, with less than 2MB of storage per frame. Extensive experiments demonstrate the effectiveness of our approach, which significantly outperforms existing approaches in terms of optimization speed, rendering quality, and storage overhead. HiFi4G, an explicit and compact Gaussian-based approach for high-fidelity 4D human performance rendering from dense footage. Efficiently rendering realistic human performance and integrating it into the rasterization pipeline remains challenging. Existing methods suffer from limitations such as vulnerability to occlusions, lack of texture, blurriness, high storage costs, or the inability to handle large motions. HiFi4G leverages a dual-graph mechanism with a coarse deformation graph for motion priors and a fine-grained Gaussian graph for constraints. It employs a 4D Gaussian optimization scheme with adaptive spatial-temporal regularizers and introduces a compression scheme with residual compensation. HiFi4G outperforms existing methods in optimization speed, rendering quality, and storage overhead. The dual-graph mechanism and regularization designs effectively recover spatial-temporally consistent 4D Gaussians. The compression scheme achieves a 25x compression rate, requiring less than 2MB per frame. HiFi4G heavily relies on segmentation, which can be challenging in scenes with human-object interactions. The Gaussian optimization process, although efficient, still requires several minutes and presents a bottleneck for future acceleration. 4d human performance rendering, gaussian splatting, non-rigid tracking, compact representation, immersive experiences
2312.03459 Report F3-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis Sitong Su, Jianzhi Liu, Lianli Gao, Jingkuan Song Recently Text-to-Video (T2V) synthesis has undergone a breakthrough by training transformers or diffusion models on large-scale datasets. Nevertheless, inferring such large models incurs huge costs. Previous inference acceleration works either require costly retraining or are model-specific. To address this issue, instead of retraining we explore the inference process of two mainstream T2V models using transformers and diffusion models. The exploration reveals the redundancy in temporal attention modules of both models, which are commonly utilized to establish temporal relations among frames. Consequently, we propose a training-free and generalized pruning strategy called F3-Pruning to prune redundant temporal attention weights. Specifically, when aggregate temporal attention values are ranked below a certain ratio, corresponding weights will be pruned. Extensive experiments on three datasets using a classic transformer-based model CogVideo and a typical diffusion-based model Tune-A-Video verify the effectiveness of F3-Pruning in inference acceleration, quality assurance and broad applicability. This paper introduces F$^3$-Pruning, a training-free, generalized pruning strategy for accelerating text-to-video inference by pruning redundant temporal attention weights. Existing inference acceleration methods for text-to-video models are either computationally expensive (require retraining) or model-specific. This paper proposes a method that is both efficient and generalizable. The authors analyze the inference process of transformer and diffusion-based text-to-video models and identify redundancy in temporal attention modules. Based on this, they propose F$^3$-Pruning, which prunes temporal attention weights based on the aggregate attention score, effectively removing redundant connections. F$^3$-Pruning speeds up inference by up to 1.35x on the UCF-101 dataset using CogVideo. The method also improves video quality, as shown by a 22% improvement in FVD metric on UCF-101 using CogVideo. F$^3$-Pruning demonstrates generalization by effectively accelerating and improving quality on both transformer-based (CogVideo) and diffusion-based (Tune-A-Video) models. The paper primarily focuses on temporal attention pruning; exploring other modules for pruning could be a future direction. Investigating the impact of different pruning ratios on various text-to-video models can further enhance the method's adaptability. text-to-video synthesis, inference acceleration, pruning, temporal attention, generative models
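A rough sketch of the pruning rule described above (illustrative only; the notion of "aggregate temporal attention" is simplified here to per-head attention mass, which is an assumption rather than the paper's exact criterion): heads whose aggregate attention ranks in the bottom fraction are masked out, with no retraining.

```python
import torch

def f3_prune_mask(attn: torch.Tensor, prune_ratio: float = 0.3) -> torch.Tensor:
    """Given temporal attention maps of shape (heads, frames, frames), return a
    boolean keep-mask over heads, dropping those with the lowest aggregate
    attention mass. A simplified reading of the F3-Pruning rule."""
    # Aggregate attention value per head (sum over the frame-frame map).
    scores = attn.sum(dim=(1, 2))
    k = int(prune_ratio * attn.shape[0])
    if k == 0:
        return torch.ones(attn.shape[0], dtype=torch.bool)
    # Keep everything scoring above the k-th smallest aggregate value.
    threshold = scores.kthvalue(k).values
    return scores > threshold

attn = torch.rand(8, 16, 16)          # 8 heads, 16 frames
keep = f3_prune_mask(attn, 0.25)      # drops the 2 lowest-scoring heads
pruned = attn[keep]
```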
2312.03431 Report Gaussian-Flow: 4D Reconstruction with Dynamic 3D Gaussian Particle Youtian Lin, Zuozhuo Dai, Siyu Zhu, Yao Yao We introduce Gaussian-Flow, a novel point-based approach for fast dynamic scene reconstruction and real-time rendering from both multi-view and monocular videos. In contrast to the prevalent NeRF-based approaches hampered by slow training and rendering speeds, our approach harnesses recent advancements in point-based 3D Gaussian Splatting (3DGS). Specifically, a novel Dual-Domain Deformation Model (DDDM) is proposed to explicitly model attribute deformations of each Gaussian point, where the time-dependent residual of each attribute is captured by a polynomial fitting in the time domain, and a Fourier series fitting in the frequency domain. The proposed DDDM is capable of modeling complex scene deformations across long video footage, eliminating the need for training separate 3DGS for each frame or introducing an additional implicit neural field to model 3D dynamics. Moreover, the explicit deformation modeling for discretized Gaussian points ensures ultra-fast training and rendering of a 4D scene, which is comparable to the original 3DGS designed for static 3D reconstruction. Our proposed approach showcases a substantial efficiency improvement, achieving a $5\times$ faster training speed compared to the per-frame 3DGS modeling. In addition, quantitative results demonstrate that the proposed Gaussian-Flow significantly outperforms previous leading methods in novel view rendering quality. Project page: https://nju-3dv.github.io/projects/Gaussian-Flow Introduces Gaussian-Flow, a point-based differentiable rendering approach for dynamic 3D scene reconstruction using a novel Dual-Domain Deformation Model (DDDM) applied to 3D Gaussian Splatting. Achieves state-of-the-art training speed, rendering FPS, and novel view synthesis quality for 4D scene reconstruction by efficiently modeling deformations of each Gaussian point without relying on computationally expensive neural networks. Models a 4D scene as deformable 3D Gaussian points and uses DDDM to capture time-dependent attribute residuals (position, rotation, radiance) with polynomial fitting in the time domain and Fourier series fitting in the frequency domain. Employs adaptive timestamp scaling and regularizations for robust optimization. Achieves 5x faster training speed compared to per-frame 3DGS modeling. Significantly outperforms prior methods in novel view rendering quality on both multi-view and monocular datasets (HyperNeRF, Plenoptic Video). Demonstrates real-time rendering capabilities with high fidelity. Challenges remain in preserving high-fidelity thin structures. Future work could explore more refined deformation models and regularization techniques to enhance detail preservation. dynamic scene reconstruction, 4d scene representation, differentiable rendering, 3d gaussian splatting, real-time rendering
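The Dual-Domain Deformation Model can be summarized as fitting each Gaussian attribute's time-dependent residual with a polynomial (time domain) plus a truncated Fourier series (frequency domain). A minimal sketch for one attribute of one Gaussian point follows; the coefficient shapes and timestamp normalization are assumptions, not the paper's exact parameterization.

```python
import torch

def dddm_residual(t: torch.Tensor,
                  poly_coeffs: torch.Tensor,
                  fourier_coeffs: torch.Tensor) -> torch.Tensor:
    """Time-dependent residual for one attribute of one Gaussian point.

    t:              normalized timestamps in [0, 1], shape (T,)
    poly_coeffs:    shape (P,)   -> sum_p c_p * t^p                  (time domain)
    fourier_coeffs: shape (K, 2) -> sum_k a_k sin(2*pi*k*t) + b_k cos(2*pi*k*t)
    """
    P = poly_coeffs.shape[0]
    powers = torch.stack([t ** p for p in range(P)], dim=-1)         # (T, P)
    poly = powers @ poly_coeffs                                       # (T,)

    K = fourier_coeffs.shape[0]
    k = torch.arange(1, K + 1, dtype=t.dtype)
    phase = 2 * torch.pi * t[:, None] * k[None, :]                    # (T, K)
    fourier = (torch.sin(phase) @ fourier_coeffs[:, 0]
               + torch.cos(phase) @ fourier_coeffs[:, 1])             # (T,)
    return poly + fourier

t = torch.linspace(0, 1, 30)
res = dddm_residual(t, torch.randn(4), torch.randn(3, 2))  # residual per frame
```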
2312.03203 Report Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Zehao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, Achuta Kadambi 3D scene representations have gained immense popularity in recent years. Methods that use Neural Radiance fields are versatile for traditional tasks such as novel view synthesis. In recent times, some work has emerged that aims to extend the functionality of NeRF beyond view synthesis, for semantically aware tasks such as editing and segmentation using 3D feature field distillation from 2D foundation models. However, these methods have two major limitations: (a) they are limited by the rendering speed of NeRF pipelines, and (b) implicitly represented feature fields suffer from continuity artifacts reducing feature quality. Recently, 3D Gaussian Splatting has shown state-of-the-art performance on real-time radiance field rendering. In this work, we go one step further: in addition to radiance field rendering, we enable 3D Gaussian splatting on arbitrary-dimension semantic features via 2D foundation model distillation. This translation is not straightforward: naively incorporating feature fields in the 3DGS framework encounters significant challenges, notably the disparities in spatial resolution and channel consistency between RGB images and feature maps. We propose architectural and training changes to efficiently avert this problem. Our proposed method is general, and our experiments showcase novel view semantic segmentation, language-guided editing and segment anything through learning feature fields from state-of-the-art 2D foundation models such as SAM and CLIP-LSeg. Across experiments, our distillation method is able to provide comparable or better results, while being significantly faster to both train and render. Additionally, to the best of our knowledge, we are the first method to enable point and bounding-box prompting for radiance field manipulation, by leveraging the SAM model. Project website at: https://feature-3dgs.github.io/ This paper introduces Feature 3DGS, a novel method for distilling high-dimensional semantic features from 2D foundation models (like SAM and CLIP-LSeg) into 3D Gaussian Splatting, enabling tasks like semantic segmentation, language-guided editing, and promptable instance segmentation. Existing NeRF-based methods for 3D feature distillation are limited by slow rendering speeds and potential interference between radiance and feature fields. Feature 3DGS overcomes these limitations by leveraging the speed and explicit representation of 3D Gaussian Splatting. The method uses a parallel N-dimensional Gaussian rasterizer to render both RGB images and semantic feature maps. A lightweight convolutional decoder (speed-up module) upsamples low-dimensional features, improving efficiency. Promptable scene manipulation is achieved by querying the distilled 3D feature field. Feature 3DGS achieves up to 2.7x faster feature field distillation and rendering compared to NeRF-based methods. It shows up to 23% improvement in mIoU for semantic segmentation tasks on the Replica dataset. The method enables novel view semantic segmentation, language-guided editing, and promptable segmentation from any viewpoint. The performance of Feature 3DGS is limited by the quality of the teacher network and the student feature's access to ground truth features. The adaptation of the 3DGS pipeline can introduce noise and affect the optimal performance, particularly in complex scenes with tiny objects. 3d gaussian splatting, feature distillation, semantic segmentation, language-guided editing, promptable segmentation
2312.03160 Report HybridNeRF: Efficient Neural Rendering via Adaptive Volumetric Surfaces Haithem Turki, Vasu Agrawal, Samuel Rota Bulò, Lorenzo Porzi, Peter Kontschieder, Deva Ramanan, Michael Zollhöfer, Christian Richardt Neural radiance fields provide state-of-the-art view synthesis quality but tend to be slow to render. One reason is that they make use of volume rendering, thus requiring many samples (and model queries) per ray at render time. Although this representation is flexible and easy to optimize, most real-world objects can be modeled more efficiently with surfaces instead of volumes, requiring far fewer samples per ray. This observation has spurred considerable progress in surface representations such as signed distance functions, but these may struggle to model semi-opaque and thin structures. We propose a method, HybridNeRF, that leverages the strengths of both representations by rendering most objects as surfaces while modeling the (typically) small fraction of challenging regions volumetrically. We evaluate HybridNeRF against the challenging Eyeful Tower dataset along with other commonly used view synthesis datasets. When comparing to state-of-the-art baselines, including recent rasterization-based approaches, we improve error rates by 15-30% while achieving real-time framerates (at least 36 FPS) for virtual-reality resolutions (2Kx2K). This paper proposes HybridNeRF, a novel hybrid surface-volume neural rendering technique that integrates the strengths of surface and volumetric rendering to accelerate novel view synthesis for complex scenes. Achieving real-time rendering of high-fidelity scenes is crucial for immersive applications like AR and VR, but existing methods often struggle to balance speed and quality. HybridNeRF leverages a spatially adaptive surfaceness field to represent most of the scene efficiently as a surface, while selectively employing volumetric rendering for challenging regions like thin structures or transparent objects. The method also introduces a distance-adjusted Eikonal regularization to accurately model complex backgrounds without a separate background model, and implements render-time optimizations such as hardware texture interpolation and sphere tracing to further boost performance. HybridNeRF achieves state-of-the-art quality on the challenging Eyeful Tower dataset, surpassing baselines in fidelity while maintaining real-time frame rates (at least 36 FPS) at VR resolutions (2K×2K). The approach demonstrates comparable performance to the best real-time and offline methods on the MipNeRF-360 dataset. On ScanNet++, HybridNeRF outperforms other real-time techniques and achieves near-identical quality to a high-fidelity but computationally expensive baseline while rendering over 400 times faster. The use of dense 3D grids and triplanes in HybridNeRF leads to higher memory consumption compared to hash table-based approaches. Training time, although faster than the original NeRF, is comparatively slower than some recent methods like iNGP and 3D Gaussian splatting. neural rendering, novel view synthesis, surface rendering, volume rendering, real-time rendering
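The hybrid rendering idea in HybridNeRF (treat most of the scene as a surface needing few samples, fall back to volumetric rendering only in difficult regions) can be caricatured by letting a learned "surfaceness" value control how sharp the SDF-to-opacity conversion is. This is a loose illustration under assumed constants, not the paper's exact formulation.

```python
import torch

def sdf_to_alpha(sdf: torch.Tensor, surfaceness: torch.Tensor,
                 beta_surface: float = 0.001, beta_volume: float = 0.1) -> torch.Tensor:
    """Convert signed distances to per-sample opacity.

    High surfaceness uses a sharp, near step-function conversion (so one sample
    per ray suffices); low surfaceness falls back to a soft, volumetric
    conversion. Blending two betas like this is an illustrative assumption.
    """
    beta = surfaceness * beta_surface + (1.0 - surfaceness) * beta_volume
    # Sigmoid-of-SDF opacity: sharper as beta shrinks.
    return torch.sigmoid(-sdf / beta)

sdf = torch.linspace(-0.05, 0.05, 11)
alpha_sharp = sdf_to_alpha(sdf, surfaceness=torch.tensor(1.0))  # surface-like
alpha_soft = sdf_to_alpha(sdf, surfaceness=torch.tensor(0.2))   # volume-like
```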
2312.03079 Report LooseControl: Lifting ControlNet for Generalized Depth Conditioning Shariq Farooq Bhat, Niloy J. Mitra, Peter Wonka We present LooseControl to allow generalized depth conditioning for diffusion-based image generation. ControlNet, the SOTA for depth-conditioned image generation, produces remarkable results but relies on having access to detailed depth maps for guidance. Creating such exact depth maps, in many scenarios, is challenging. This paper introduces a generalized version of depth conditioning that enables many new content-creation workflows. Specifically, we allow (C1) scene boundary control for loosely specifying scenes with only boundary conditions, and (C2) 3D box control for specifying layout locations of the target objects rather than the exact shape and appearance of the objects. Using LooseControl, along with text guidance, users can create complex environments (e.g., rooms, street views, etc.) by specifying only scene boundaries and locations of primary objects. Further, we provide two editing mechanisms to refine the results: (E1) 3D box editing enables the user to refine images by changing, adding, or removing boxes while freezing the style of the image. This yields minimal changes apart from changes induced by the edited boxes. (E2) Attribute editing proposes possible editing directions to change one particular aspect of the scene, such as the overall object density or a particular object. Extensive tests and comparisons with baselines demonstrate the generality of our method. We believe that LooseControl can become an important design tool for easily creating complex environments and be extended to other forms of guidance channels. Code and more information are available at https://shariqfarooq123.github.io/loose-control/ . This paper introduces LooseControl, a novel framework that enables generalized depth conditioning for diffusion-based image generation, allowing for more flexible and creative control over the image generation process. Existing methods like ControlNet, while powerful, rely on precise depth maps for guidance, which can be challenging to create. LooseControl addresses this limitation by allowing for more abstract and user-friendly depth specifications, broadening the creative possibilities for users. The authors introduce two forms of generalized depth control: Scene boundary control, which uses scene boundaries as an upper depth limit, and 3D box control, which uses approximate 3D bounding boxes to guide object placement. They achieve this by training a modified ControlNet model on synthetically generated data that represents these generalized depth conditions. LooseControl generates more realistic and creative images compared to baseline methods, especially when using abstract depth guidance. The framework introduces two novel editing mechanisms: 3D box editing for manipulating object placement while preserving scene style, and attribute editing for exploring variations in object attributes. A user study showed a strong preference (over 95%) for LooseControl-generated images compared to those generated using traditional depth conditioning methods. While LooseControl effectively controls primary objects, achieving fine-grained control over secondary objects remains a challenge. Similar to ControlNet, providing too many constraints as input can limit the diversity of generated results. image generation, diffusion models, depth conditioning, controllable image synthesis, generative ai
2312.03048 Report DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control Yuru Jia, Lukas Hoyer, Shengyu Huang, Tianfu Wang, Luc Van Gool, Konrad Schindler, Anton Obukhov Large, pretrained latent diffusion models (LDMs) have demonstrated an extraordinary ability to generate creative content, specialize to user data through few-shot fine-tuning, and condition their output on other modalities, such as semantic maps. However, are they usable as large-scale data generators, e.g., to improve tasks in the perception stack, like semantic segmentation? We investigate this question in the context of autonomous driving, and answer it with a resounding "yes". We propose an efficient data generation pipeline termed DGInStyle. First, we examine the problem of specializing a pretrained LDM to semantically-controlled generation within a narrow domain. Second, we propose a Style Swap technique to endow the rich generative prior with the learned semantic control. Third, we design a Multi-resolution Latent Fusion technique to overcome the bias of LDMs towards dominant objects. Using DGInStyle, we generate a diverse dataset of street scenes, train a domain-agnostic semantic segmentation model on it, and evaluate the model on multiple popular autonomous driving datasets. Our approach consistently increases the performance of several domain generalization methods compared to the previous state-of-the-art methods. Source code and dataset are available at https://dginstyle.github.io. This paper introduces DGInStyle, a data generation pipeline for improving domain generalization in semantic segmentation using pretrained latent diffusion models (LDMs). Domain generalization is crucial for deploying deep learning models in real-world scenarios with domain shifts. This paper addresses this by leveraging the rich priors encoded in pretrained LDMs. DGInStyle combines three key techniques: (1) Style Swap for preserving style diversity by decoupling semantic control from the source domain style, (2) Style Prompting for enriching style variations with text prompts, and (3) Multi-resolution Latent Fusion (MRLF) for generating high-fidelity images with precise semantic layouts, especially for small objects. DGInStyle significantly improves the performance of various domain generalization methods across different network architectures (CNNs and Transformers). It leads to substantial improvements in class-wise IoU, particularly for small and challenging classes like poles, traffic lights, and traffic signs. The effectiveness of each component (Style Swap, Style Prompting, MRLF) is validated through ablation studies. The reliance on existing segmentation masks from the source domain limits the diversity of generated scenes. The computational cost of generating high-resolution images with MRLF can be a bottleneck for large-scale dataset generation. domain generalization, semantic segmentation, latent diffusion models, data augmentation, generative models
2312.03047 Report MagicStick: Controllable Video Editing via Control Handle Transformations Yue Ma, Xiaodong Cun, Yingqing He, Chenyang Qi, Xintao Wang, Ying Shan, Xiu Li, Qifeng Chen Text-based video editing has recently attracted considerable interest in changing the style or replacing the objects with a similar structure. Beyond this, we demonstrate that properties such as shape, size, location, motion, etc., can also be edited in videos. Our key insight is that the keyframe transformations of the specific internal feature (e.g., edge maps of objects or human pose), can easily propagate to other frames to provide generation guidance. We thus propose MagicStick, a controllable video editing method that edits the video properties by utilizing the transformation on the extracted internal control signals. In detail, to keep the appearance, we inflate both the pretrained image diffusion model and ControlNet to the temporal dimension and train low-rank adaptions (LORA) layers to fit the specific scenes. Then, in editing, we perform an inversion and editing framework. Differently, finetuned ControlNet is introduced in both inversion and generation for attention guidance with the proposed attention remix between the spatial attention maps of inversion and editing. Yet succinct, our method is the first method to show the ability of video property editing from the pre-trained text-to-image model. We present experiments on numerous examples within our unified framework. We also compare with shape-aware text-based editing and handcrafted motion video generation, demonstrating our superior temporal consistency and editing capability than previous works. The code and models will be made publicly available. This paper proposes MagicStick, a novel framework for controllable video editing that modifies video properties (e.g., shape, size, location, motion) by leveraging keyframe transformations on extracted internal control signals (like object edges or human pose). Many straightforward video edits, like resizing objects or changing their position over time, remain challenging for existing methods. This work addresses this gap by enabling controllable video editing of various properties while maintaining temporal consistency and appearance fidelity. The method uses a pre-trained image diffusion model and ControlNet, adapting them to the temporal dimension. It employs a controllable video customization step to maintain appearance consistency. During editing, it uses an inversion and editing framework with a novel attention remix module guided by transformed control signals. MagicStick successfully edits object size, position, and human motion in videos while preserving appearance and temporal consistency. The proposed method outperforms baselines like Shape-aware Video Editing and VideoComposer in terms of temporal consistency and editing quality, as shown qualitatively and quantitatively. Ablation studies confirm the importance of individual components like LoRA tuning, token embedding, temporal modules, and the Attention ReMix module. The method struggles to edit object motion along trajectories significantly different from the source video. Future work could explore the application of this framework to more powerful pre-trained video diffusion models. video editing, controllable generation, diffusion models, controlnet, attention mechanisms
2312.03045 Report Customization Assistant for Text-to-image Generation Yufan Zhou, Ruiyi Zhang, Jiuxiang Gu, Tong Sun Customizing pre-trained text-to-image generation model has attracted massive research interest recently, due to its huge potential in real-world applications. Although existing methods are able to generate creative content for a novel concept contained in single user-input image, their capability are still far from perfection. Specifically, most existing methods require fine-tuning the generative model on testing images. Some existing methods do not require fine-tuning, while their performance are unsatisfactory. Furthermore, the interaction between users and models are still limited to directive and descriptive prompts such as instructions and captions. In this work, we build a customization assistant based on pre-trained large language model and diffusion model, which can not only perform customized generation in a tuning-free manner, but also enable more user-friendly interactions: users can chat with the assistant and input either ambiguous text or clear instruction. Specifically, we propose a new framework consists of a new model design and a novel training strategy. The resulting assistant can perform customized generation in 2-5 seconds without any test time fine-tuning. Extensive experiments are conducted, competitive results have been obtained across different domains, illustrating the effectiveness of the proposed method. This paper introduces CAFE, a Customization Assistant For text-to-imagE generation that utilizes large language models (LLMs) to enable tuning-free, user-friendly image customization. Existing methods for customizing pre-trained text-to-image models are either inefficient, require fine-tuning, or lack user-friendliness. CAFE addresses these limitations by offering fast, tuning-free customization and handling ambiguous user input. CAFE leverages a multi-modal large language model (MLLM) to infer user intent from text and image input. It generates tailored image embeddings and textual explanations. A novel self-improvement via distillation (SID) strategy trains the model on automatically generated high-quality data, eliminating costly human filtering. CAFE generates customized images in 2-5 seconds without test-time fine-tuning. It handles both declarative and interrogative sentences, enabling more natural user interactions. Quantitative evaluations demonstrate competitive performance against state-of-the-art methods in both object and human image domains. The model's performance relies heavily on the quality and diversity of training data. Future work could explore incorporating user feedback to further improve the model's customization ability. text-to-image generation, customization, large language models, tuning-free, image editing
2312.03029 Report Gaussian Head Avatar: Ultra High-fidelity Head Avatar via Dynamic Gaussians Yuelang Xu, Benwang Chen, Zhe Li, Hongwen Zhang, Lizhen Wang, Zerong Zheng, Yebin Liu Creating high-fidelity 3D head avatars has always been a research hotspot, but there remains a great challenge under lightweight sparse view setups. In this paper, we propose Gaussian Head Avatar represented by controllable 3D Gaussians for high-fidelity head avatar modeling. We optimize the neutral 3D Gaussians and a fully learned MLP-based deformation field to capture complex expressions. The two parts benefit each other, thereby our method can model fine-grained dynamic details while ensuring expression accuracy. Furthermore, we devise a well-designed geometry-guided initialization strategy based on implicit SDF and Deep Marching Tetrahedra for the stability and convergence of the training procedure. Experiments show our approach outperforms other state-of-the-art sparse-view methods, achieving ultra high-fidelity rendering quality at 2K resolution even under exaggerated expressions. This paper proposes Gaussian Head Avatar, a novel representation for reconstructing high-fidelity 3D head avatars from sparse views using controllable 3D Gaussians. Existing methods struggle to synthesize high-fidelity images with pixel-level details, especially at 2K resolution and under exaggerated expressions. This method aims to overcome these limitations. The method employs a fully learned deformation field on 3D Gaussians to model complex expressions and introduces a geometry-guided initialization strategy using SDF and DMTet for robust convergence. Achieves superior image quality with fine-grained dynamic details at 2K resolution, outperforming state-of-the-art methods on self-reenactment tasks. Demonstrates accurate expression transfer and can effectively model exaggerated expressions not well-captured by traditional methods. Shows strong 3D consistency, enabling high-quality novel view synthesis from limited input views. Limitations: Experiences blurring for areas lacking robust tracking, like the inside of the mouth or long hair. Future work: Address the limitations by integrating advanced tracking techniques for challenging regions. 3d head avatar, gaussian splatting, deformation field, sparse view reconstruction, high-fidelity rendering
2312.03026 Report Uni3DL: Unified Model for 3D and Language Understanding Xiang Li, Jian Ding, Zhaoyang Chen, Mohamed Elhoseiny In this work, we present Uni3DL, a unified model for 3D and Language understanding. Distinct from existing unified vision-language models in 3D which are limited in task variety and predominantly dependent on projected multi-view images, Uni3DL operates directly on point clouds. This approach significantly expands the range of supported tasks in 3D, encompassing both vision and vision-language tasks in 3D. At the core of Uni3DL, a query transformer is designed to learn task-agnostic semantic and mask outputs by attending to 3D visual features, and a task router is employed to selectively generate task-specific outputs required for diverse tasks. With a unified architecture, our Uni3DL model enjoys seamless task decomposition and substantial parameter sharing across tasks. Uni3DL has been rigorously evaluated across diverse 3D vision-language understanding tasks, including semantic segmentation, object detection, instance segmentation, visual grounding, 3D captioning, and text-3D cross-modal retrieval. It demonstrates performance on par with or surpassing state-of-the-art (SOTA) task-specific models. We hope our benchmark and Uni3DL model will serve as a solid step to ease future research in unified models in the realm of 3D and language understanding. Project page: https://uni3dl.github.io. This paper introduces Uni3DL, a unified model for 3D and language understanding that operates directly on raw point clouds, departing from traditional multi-view image projection methods. Existing 3D vision-language models rely heavily on projected 2D images, limiting their ability to process 3D geometric information effectively. Uni3DL aims to address this by directly learning from raw point cloud data. Uni3DL employs a transformer-based architecture with a novel cross-modal attention mechanism to learn joint representations of 3D point clouds and text. It's pre-trained on large-scale 3D-language datasets (ScanNet, ScanRefer, Cap3D Objaverse) and fine-tuned for various downstream tasks like segmentation, captioning, and retrieval. Uni3DL achieves state-of-the-art results on ScanNet for 3D instance segmentation, even surpassing methods using additional segment labels. It demonstrates competitive performance in zero-shot 3D classification on ModelNet40 and ModelNet10, particularly excelling in top-5 accuracy. The model exhibits strong capabilities in text-guided 3D segmentation and cross-modal retrieval tasks. Uni3DL currently doesn't leverage the strengths of pre-trained 2D foundation models like CLIP, which limits its ability to benefit from rich 2D image representations. Future work will explore a hybrid approach, combining point-based learning with insights and features from 2D foundation models to further enhance 3D language understanding. 3d vision, language understanding, point cloud processing, cross-modal learning, transformers
2312.03015 Report PartSLIP++: Enhancing Low-Shot 3D Part Segmentation via Multi-View Instance Segmentation and Maximum Likelihood Estimation Yuchen Zhou, Jiayuan Gu, Xuanlin Li, Minghua Liu, Yunhao Fang, Hao Su Open-world 3D part segmentation is pivotal in diverse applications such as robotics and AR/VR. Traditional supervised methods often grapple with limited 3D data availability and struggle to generalize to unseen object categories. PartSLIP, a recent advancement, has made significant strides in zero- and few-shot 3D part segmentation. This is achieved by harnessing the capabilities of the 2D open-vocabulary detection module, GLIP, and introducing a heuristic method for converting and lifting multi-view 2D bounding box predictions into 3D segmentation masks. In this paper, we introduce PartSLIP++, an enhanced version designed to overcome the limitations of its predecessor. Our approach incorporates two major improvements. First, we utilize a pre-trained 2D segmentation model, SAM, to produce pixel-wise 2D segmentations, yielding more precise and accurate annotations than the 2D bounding boxes used in PartSLIP. Second, PartSLIP++ replaces the heuristic 3D conversion process with an innovative modified Expectation-Maximization algorithm. This algorithm conceptualizes 3D instance segmentation as unobserved latent variables, and then iteratively refines them through an alternating process of 2D-3D matching and optimization with gradient descent. Through extensive evaluations, we show that PartSLIP++ demonstrates better performance over PartSLIP in both low-shot 3D semantic and instance-based object part segmentation tasks. Code released at https://github.com/zyc00/PartSLIP2. PartSLIP++ improves upon PartSLIP for few-shot 3D part segmentation by using SAM for pixel-wise 2D segmentation and a modified EM algorithm for lifting 2D to 3D. Open-world 3D part segmentation is crucial for applications like robotics and AR/VR, but supervised methods suffer from limited 3D data and struggle to generalize. PartSLIP++ uses SAM to refine 2D bounding boxes from GLIP into segmentation masks. It then employs a modified EM algorithm to iteratively match and optimize 3D instance labels with projected 2D masks. PartSLIP++ outperforms PartSLIP in low-shot 3D semantic and instance segmentation on the PartNetE dataset. Ablation studies confirm the effectiveness of using SAM, the EM algorithm, and post-processing. PartSLIP++ enables applications like semi-automatic part annotation and 3D instance proposal generation. The reliance on pre-trained 2D models might limit performance on highly specialized or unseen objects. Further exploration of different 2D-3D matching and optimization techniques within the EM algorithm is possible. 3d part segmentation, few-shot learning, open-world segmentation, segment anything model (sam), expectation-maximization (em)
2312.03011 Report InstructBooth: Instruction-following Personalized Text-to-Image Generation Daewon Chae, Nokyung Park, Jinkyu Kim, Kimin Lee Personalizing text-to-image models using a limited set of images for a specific object has been explored in subject-specific image generation. However, existing methods often face challenges in aligning with text prompts due to overfitting to the limited training images. In this work, we introduce InstructBooth, a novel method designed to enhance image-text alignment in personalized text-to-image models without sacrificing the personalization ability. Our approach first personalizes text-to-image models with a small number of subject-specific images using a unique identifier. After personalization, we fine-tune personalized text-to-image models using reinforcement learning to maximize a reward that quantifies image-text alignment. Additionally, we propose complementary techniques to increase the synergy between these two processes. Our method demonstrates superior image-text alignment compared to existing baselines, while maintaining high personalization ability. In human evaluations, InstructBooth outperforms them when considering all comprehensive factors. Our project page is at https://sites.google.com/view/instructbooth. This paper introduces InstructBooth, a novel method for personalized text-to-image generation that enhances image-text alignment without sacrificing personalization ability. Existing personalized text-to-image generation methods often struggle to balance subject fidelity with the ability to accurately reflect new contexts and actions from text prompts. InstructBooth first personalizes a text-to-image model using a unique identifier and reference images. Then, it leverages reinforcement learning to fine-tune the model, maximizing a reward based on image-text alignment. InstructBooth generates personalized images with high text fidelity, outperforming existing methods in aligning generated images with given prompts. The method maintains high subject fidelity, ensuring generated images resemble the user-provided subject. Human evaluations demonstrate a strong preference for InstructBooth outputs over existing methods, highlighting its ability to generate personalized images that are both accurate and visually appealing. The current subject fidelity metric used in evaluation primarily focuses on appearance and might not be ideal for evaluating personalized images with diverse poses and actions. The research highlights the need for improved metrics and techniques to evaluate subject fidelity more comprehensively in personalized image generation. text-to-image generation, personalization, reinforcement learning, image-text alignment, subject fidelity
2312.02981 Report ReconFusion: 3D Reconstruction with Diffusion Priors Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P. Srinivasan, Dor Verbin, Jonathan T. Barron, Ben Poole, Aleksander Holynski 3D reconstruction methods such as Neural Radiance Fields (NeRFs) excel at rendering photorealistic novel views of complex scenes. However, recovering a high-quality NeRF typically requires tens to hundreds of input images, resulting in a time-consuming capture process. We present ReconFusion to reconstruct real-world scenes using only a few photos. Our approach leverages a diffusion prior for novel view synthesis, trained on synthetic and multiview datasets, which regularizes a NeRF-based 3D reconstruction pipeline at novel camera poses beyond those captured by the set of input images. Our method synthesizes realistic geometry and texture in underconstrained regions while preserving the appearance of observed regions. We perform an extensive evaluation across various real-world datasets, including forward-facing and 360-degree scenes, demonstrating significant performance improvements over previous few-view NeRF reconstruction approaches. This paper proposes a novel method to enhance 3D scene reconstruction from a limited number of posed images, leveraging a diffusion model trained for novel view synthesis as a prior to regularize a NeRF-based 3D reconstruction pipeline. Reconstructing high-quality 3D scenes typically demands dense image captures (tens to hundreds), which is time-consuming and limits accessibility. This method addresses this challenge by significantly reducing the number of input images required. The approach involves training a diffusion model on a mixture of real and synthetic multiview datasets to generate plausible novel views. This model, conditioned on input images and poses, is integrated into a NeRF reconstruction pipeline, guiding it to produce realistic renderings even from sparsely sampled viewpoints. The method outperforms existing few-view NeRF reconstruction approaches, demonstrating significant quality improvements in both geometry and appearance, particularly in under-observed regions. It effectively reduces artifacts common in few-view reconstructions, such as "floaters" and inaccurate geometry. The diffusion prior proves to be a robust regularizer, enhancing reconstruction quality across a range of capture settings, including both forward-facing and 360-degree scenes. The reliance on the heavyweight diffusion model introduces computational costs, slowing down the reconstruction process. The current method shows limited 3D outpainting capabilities compared to the 2D hallucinations possible with the image model. 3d reconstruction, neural radiance fields (nerf), few-shot learning, diffusion models, novel view synthesis
2312.02980 Report GPT4Point: A Unified Framework for Point-Language Understanding and Generation Zhangyang Qi, Ye Fang, Zeyi Sun, Xiaoyang Wu, Tong Wu, Jiaqi Wang, Dahua Lin, Hengshuang Zhao Multimodal Large Language Models (MLLMs) have excelled in 2D image-text comprehension and image generation, but their understanding of the 3D world is notably deficient, limiting progress in 3D language understanding and generation. To solve this problem, we introduce GPT4Point, an innovative groundbreaking point-language multimodal model designed specifically for unified 3D object understanding and generation within the MLLM framework. GPT4Point as a powerful 3D MLLM seamlessly can execute a variety of point-text reference tasks such as point-cloud captioning and Q&A. Additionally, GPT4Point is equipped with advanced capabilities for controllable 3D generation, it can get high-quality results through a low-quality point-text feature maintaining the geometric shapes and colors. To support the expansive needs of 3D object-text pairs, we develop Pyramid-XL, a point-language dataset annotation engine. It constructs a large-scale database over 1M objects of varied text granularity levels from the Objaverse-XL dataset, essential for training GPT4Point. A comprehensive benchmark has been proposed to evaluate 3D point-language understanding capabilities. In extensive evaluations, GPT4Point has demonstrated superior performance in understanding and generation. GPT4Point, a unified framework for 3D object understanding and generation using point clouds and language. Addresses limitations of existing MLLMs in understanding and generating 3D objects, aiming for comprehensive 3D world interpretation. Two-stage approach: (1) Point-text feature alignment using Bert-based Point-QFormer. (2) LLM branch for text inference and Diffusion branch for controlled 3D generation conditioned on point-text features. Outperforms VLMs and PointLLM in 3D object recognition tasks like zero-shot classification and point-text retrieval. Achieves superior performance in 3D object text inference tasks, including captioning and question answering. Enables controllable text-to-3D generation by leveraging low-quality point cloud features and text descriptions, enhancing generation quality and controllability. Limited exploration of multi-object scene understanding and interaction. Reliance on Point-E for generation, potentially limiting generation quality and diversity. 3d vision, multimodal learning, large language models, point cloud processing, text-to-3d generation
2312.02974 Report Describing Differences in Image Sets with Natural Language Lisa Dunlap, Yuhui Zhang, Xiaohan Wang, Ruiqi Zhong, Trevor Darrell, Jacob Steinhardt, Joseph E. Gonzalez, Serena Yeung-Levy How do two sets of images differ? Discerning set-level differences is crucial for understanding model behaviors and analyzing datasets, yet manually sifting through thousands of images is impractical. To aid in this discovery process, we explore the task of automatically describing the differences between two $\textbf{sets}$ of images, which we term Set Difference Captioning. This task takes in image sets $D_A$ and $D_B$, and outputs a description that is more often true on $D_A$ than $D_B$. We outline a two-stage approach that first proposes candidate difference descriptions from image sets and then re-ranks the candidates by checking how well they can differentiate the two sets. We introduce VisDiff, which first captions the images and prompts a language model to propose candidate descriptions, then re-ranks these descriptions using CLIP. To evaluate VisDiff, we collect VisDiffBench, a dataset with 187 paired image sets with ground truth difference descriptions. We apply VisDiff to various domains, such as comparing datasets (e.g., ImageNet vs. ImageNetV2), comparing classification models (e.g., zero-shot CLIP vs. supervised ResNet), summarizing model failure modes (supervised ResNet), characterizing differences between generative models (e.g., StableDiffusionV1 and V2), and discovering what makes images memorable. Using VisDiff, we are able to find interesting and previously unknown differences in datasets and models, demonstrating its utility in revealing nuanced insights. This paper explores **Set Difference Captioning (SDC)**, a task where the goal is to generate natural language descriptions that capture the salient differences between two sets of images. SDC is important for understanding model behaviors, analyzing datasets (especially for distribution shifts), and gaining insights into human cognition, all in a scalable and interpretable way. The paper proposes a two-stage **proposer-ranker** framework. The proposer generates candidate difference descriptions based on small subsets of images. The ranker then evaluates and ranks these descriptions by checking their validity across the full image sets. A novel SDC benchmark, **VisDiffBench**, is created with 187 paired image sets and ground truth difference descriptions. The best approach, **VisDiff**, leverages a caption-based proposer with GPT-4 and a feature-based ranker with CLIP, achieving high accuracy on VisDiffBench. VisDiff reveals interesting and sometimes previously unknown insights when applied to comparing datasets (ImageNet vs. ImageNetV2), model behaviors (CLIP vs ResNet), and analyzing human memory (LaMem dataset). The current method relies heavily on large pre-trained models, inheriting their potential biases and limitations. VisDiffBench, while extensive, could be expanded to include more diverse and subtle differences beyond objects and styles. set difference captioning, image understanding, dataset analysis, model interpretation, vision and language
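The ranker stage of the proposer-ranker framework, scoring how well a candidate difference description separates set A from set B, can be sketched with precomputed CLIP embeddings. The AUROC-style separation score below is an illustrative choice and the variable names are hypothetical; it is not necessarily the paper's exact ranking metric.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def separation_score(text_emb: torch.Tensor,
                     imgs_a: torch.Tensor,
                     imgs_b: torch.Tensor) -> float:
    """Score a candidate description by how much more it matches set A than B,
    using cosine similarity of pre-normalized CLIP embeddings.
    text_emb: (D,)   imgs_a: (Na, D)   imgs_b: (Nb, D)
    """
    sim_a = imgs_a @ text_emb   # (Na,)
    sim_b = imgs_b @ text_emb   # (Nb,)
    # Probability that a random A image scores higher than a random B image.
    return (sim_a[:, None] > sim_b[None, :]).float().mean().item()

def rank_candidates(cand_embs, imgs_a, imgs_b):
    scores = [separation_score(t, imgs_a, imgs_b) for t in cand_embs]
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return order, scores

# Toy usage with random stand-ins for CLIP text/image embeddings.
txts = F.normalize(torch.randn(5, 512), dim=-1)
a = F.normalize(torch.randn(20, 512), dim=-1)
b = F.normalize(torch.randn(20, 512), dim=-1)
order, scores = rank_candidates(txts, a, b)
```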
2312.02970 Report Alchemist: Parametric Control of Material Properties with Diffusion Models Prafull Sharma, Varun Jampani, Yuanzhen Li, Xuhui Jia, Dmitry Lagun, Fredo Durand, William T. Freeman, Mark Matthews We propose a method to control material attributes of objects like roughness, metallic, albedo, and transparency in real images. Our method capitalizes on the generative prior of text-to-image models known for photorealism, employing a scalar value and instructions to alter low-level material properties. Addressing the lack of datasets with controlled material attributes, we generated an object-centric synthetic dataset with physically-based materials. Fine-tuning a modified pre-trained text-to-image model on this synthetic dataset enables us to edit material properties in real-world images while preserving all other attributes. We show the potential application of our model to material edited NeRFs. This paper introduces a method leveraging pre-trained text-to-image diffusion models for parametric control of material properties (roughness, metallic, albedo, transparency) in real images. Achieving fine-grained control over object material properties in images has broad applications in image editing, advertising, and forensics. The authors generate a synthetic dataset with controlled material attributes and fine-tune a pre-trained text-to-image diffusion model using relative attribute strength as an input. The model generalizes to real images despite training on synthetic data. It allows for smooth edits of material properties controlled by a single scalar value. The method can be extended to material editing in neural radiance fields. The model may produce minimal perceptual changes for certain attributes (roughness, metallic). Occasionally, physically unrealistic transparency edits may occur. material editing, diffusion models, image editing, generative models, synthetic data
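Conditioning a diffusion-based edit on a single scalar attribute strength (e.g., how much to increase roughness) can be sketched as embedding the scalar and adding it to the model's usual timestep/text conditioning. The Fourier-feature embedding and the class name below are assumptions made for illustration, not the paper's stated design.

```python
import torch
import torch.nn as nn

class ScalarStrengthEmbedding(nn.Module):
    """Maps a relative attribute strength s in [-1, 1] to a conditioning vector
    that can be added to timestep or text embeddings of a diffusion model."""

    def __init__(self, dim: int, n_freqs: int = 8):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(n_freqs))
        self.mlp = nn.Sequential(nn.Linear(2 * n_freqs, dim), nn.SiLU(),
                                 nn.Linear(dim, dim))

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # s: (B,) -> Fourier features -> (B, dim)
        ang = s[:, None] * self.freqs[None, :] * torch.pi
        feats = torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)
        return self.mlp(feats)

emb = ScalarStrengthEmbedding(320)(torch.tensor([0.7, -0.3]))  # two edit strengths
```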
2312.02963 Report MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures Zhangyang Xiong, Chenghong Li, Kenkun Liu, Hongjie Liao, Jianqiao Hu, Junyi Zhu, Shuliang Ning, Lingteng Qiu, Chongjie Wang, Shijie Wang, Shuguang Cui, Xiaoguang Han In this era, the success of large language models and text-to-image models can be attributed to the driving force of large-scale datasets. However, in the realm of 3D vision, while remarkable progress has been made with models trained on large-scale synthetic and real-captured object data like Objaverse and MVImgNet, a similar level of progress has not been observed in the domain of human-centric tasks partially due to the lack of a large-scale human dataset. Existing datasets of high-fidelity 3D human capture continue to be mid-sized due to the significant challenges in acquiring large-scale high-quality 3D human data. To bridge this gap, we present MVHumanNet, a dataset that comprises multi-view human action sequences of 4,500 human identities. The primary focus of our work is on collecting human data that features a large number of diverse identities and everyday clothing using a multi-view human capture system, which facilitates easily scalable data collection. Our dataset contains 9,000 daily outfits, 60,000 motion sequences and 645 million frames with extensive annotations, including human masks, camera parameters, 2D and 3D keypoints, SMPL/SMPLX parameters, and corresponding textual descriptions. To explore the potential of MVHumanNet in various 2D and 3D visual tasks, we conducted pilot studies on view-consistent action recognition, human NeRF reconstruction, text-driven view-unconstrained human image generation, as well as 2D view-unconstrained human image and 3D avatar generation. Extensive experiments demonstrate the performance improvements and effective applications enabled by the scale provided by MVHumanNet. As the current largest-scale 3D human dataset, we hope that the release of MVHumanNet data with annotations will foster further innovations in the domain of 3D human-centric tasks at scale. This paper introduces MVHumanNet, the largest multi-view human capture dataset to date, containing 4,500 identities, 9,000 outfits, and 645 million frames with annotations. A large-scale, diverse human dataset is crucial for advancing 3D human-centric tasks in computer vision, similar to the impact of large datasets on language and 2D image models. The authors built a multi-view capture system and collected data from 4,500 individuals performing various actions in everyday clothing. They annotated the data with action labels, camera parameters, masks, skeletons, and SMPL parameters. View-consistent action recognition accuracy improves significantly with more viewpoints. NeRF reconstruction for humans shows enhanced generalization ability when trained on larger scales of MVHumanNet data. MVHumanNet enables the development of high-quality text-driven human image generation and 3D human avatar generative models. Current experiments used only a subset (62%) of the full dataset due to hardware limitations. Existing generalizable NeRF methods, designed for limited data, could be redesigned to better leverage the full potential of MVHumanNet. 3d human capture, multi-view dataset, nerf reconstruction, text-driven generation, human generative model
2312.02949 Report LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models Hao Zhang, Hongyang Li, Feng Li, Tianhe Ren, Xueyan Zou, Shilong Liu, Shijia Huang, Jianfeng Gao, Lei Zhang, Chunyuan Li, Jianwei Yang With the recent significant advancements in large multi-modal models (LMMs), the importance of their grounding capability in visual chat is increasingly recognized. Despite recent efforts to enable LMMs to support grounding, their capabilities for grounding and chat are usually separate, and their chat performance drops dramatically when asked to ground. The problem is the lack of a dataset for grounded visual chat (GVC). Existing grounding datasets only contain short captions. To address this issue, we have created GVC data that allows for the combination of grounding and chat capabilities. To better evaluate the GVC capabilities, we have introduced a benchmark called Grounding-Bench. Additionally, we have proposed a model design that can support GVC and various types of visual prompts by connecting segmentation models with language models. Experimental results demonstrate that our model outperforms other LMMs on Grounding-Bench. Furthermore, our model achieves competitive performance on classic grounding benchmarks like RefCOCO/+/g and Flickr30K Entities. Our code will be released at https://github.com/UX-Decoder/LLaVA-Grounding . This paper introduces LLaVA-Grounding, an AI assistant capable of both visual chat and grounding, by creating a new grounded visual chat dataset, proposing a new model architecture, and establishing Grounding-Bench as a benchmark for evaluating grounded visual chat performance. Existing large multimodal models (LMMs) struggle to effectively perform grounded visual chat due to the scarcity of grounded visual chat data and suboptimal model designs. This work aims to address these challenges and advance the development of grounded visual chat for LMMs. The authors create a high-quality Grounded Visual Chat (GVC) dataset using human-labeled object detection data and GPT-4 for matching noun phrases to instances. They propose LLaVA-Grounding, an end-to-end model that connects an LMM with a grounding model to handle grounding tasks. They also introduce Grounding-Bench, a benchmark for evaluating grounded visual chat performance, including chat and grounding aspects. LLaVA-Grounding outperforms other open-source LMMs in both chat and grounding tasks on Grounding-Bench. LLaVA-Grounding achieves competitive results on classic grounding benchmarks like RefCOCO/+/g and Flickr30K Entities. LLaVA-Grounding effectively supports various types of visual prompts, including marks, clicks, and boxes. LLaVA-Grounding has limitations in terms of semantic scope, as the training data is limited. Future work could focus on extending the dataset and data labeling methods to open-vocabulary settings. visual grounding, visual chat, large multimodal models, benchmarking, visual prompts
2312.02936 Report Drag-A-Video: Non-rigid Video Editing with Point-based Interaction Yao Teng, Enze Xie, Yue Wu, Haoyu Han, Zhenguo Li, Xihui Liu Video editing is a challenging task that requires manipulating videos on both the spatial and temporal dimensions. Existing methods for video editing mainly focus on changing the appearance or style of the objects in the video, while keeping their structures unchanged. However, there is no existing method that allows users to interactively ``drag'' any points of instances on the first frame to precisely reach the target points with other frames consistently deformed. In this paper, we propose a new diffusion-based method for interactive point-based video manipulation, called Drag-A-Video. Our method allows users to click pairs of handle points and target points as well as masks on the first frame of an input video. Then, our method transforms the inputs into point sets and propagates these sets across frames. To precisely modify the contents of the video, we employ a new video-level motion supervision to update the features of the video and introduce the latent offsets to achieve this update at multiple denoising timesteps. We propose a temporal-consistent point tracking module to coordinate the movement of the points in the handle point sets. We demonstrate the effectiveness and flexibility of our method on various videos. The website of our work is available here: https://drag-a-video.github.io/. Introduces Drag-A-Video, the first point-based interactive non-rigid video editing system allowing users to drag points on the first frame to deform subsequent frames consistently. Existing video editing methods struggle with precise and fine-grained control over object structure and motion, particularly for non-rigid deformations. Employs a three-step process: 1) point set propagation of handle points, target points, and masks across frames, 2) latent optimization with video-level motion supervision to update diffusion latents across multiple timesteps, and 3) temporal-consistent point tracking to update handle point locations. Drag-A-Video enables dragging video content by manipulating handle points toward target points, effectively deforming object structures. User study confirms Drag-A-Video surpasses the baseline in frame quality, temporal consistency, and handle point movement accuracy. Ablation studies validate the importance of point sets, multi-timestep manipulation, and temporal consistency modules for robust and coherent video editing. 2D point propagation can be impacted by occlusion and lacks depth information, limiting its effectiveness in complex scenes. The framework's sensitivity to user input, particularly mask coordination with handle points, requires further investigation to enhance usability. video editing, diffusion models, point-based manipulation, non-rigid deformation, temporal consistency
2312.02928 Report LivePhoto: Real Image Animation with Text-guided Motion Control Xi Chen, Zhiheng Liu, Mengting Chen, Yutong Feng, Yu Liu, Yujun Shen, Hengshuang Zhao Despite the recent progress in text-to-video generation, existing studies usually overlook the issue that only spatial contents but not temporal motions in synthesized videos are under the control of text. Towards such a challenge, this work presents a practical system, named LivePhoto, which allows users to animate an image of their interest with text descriptions. We first establish a strong baseline that helps a well-learned text-to-image generator (i.e., Stable Diffusion) take an image as a further input. We then equip the improved generator with a motion module for temporal modeling and propose a carefully designed training pipeline to better link texts and motions. In particular, considering the facts that (1) text can only describe motions roughly (e.g., regardless of the moving speed) and (2) text may include both content and motion descriptions, we introduce a motion intensity estimation module as well as a text re-weighting module to reduce the ambiguity of text-to-motion mapping. Empirical evidence suggests that our approach is capable of well decoding motion-related textual instructions into videos, such as actions, camera movements, or even conjuring new contents from thin air (e.g., pouring water into an empty glass). Interestingly, thanks to the proposed intensity learning mechanism, our system offers users an additional control signal (i.e., the motion intensity) besides text for video customization. This paper introduces LivePhoto, a novel text-driven image animation framework enabling users to animate real images using text descriptions, controlling actions, camera movements, and even generating new content. This work addresses the limitation of existing text-to-video generation methods that lack control over temporal motions, aiming to allow for flexible and user-friendly video customization through textual instructions. The authors build upon Stable Diffusion, enhancing it with (1) image content guidance for identity preservation, (2) motion intensity estimation for controlling motion speed and range, and (3) text re-weighting for prioritizing motion descriptions over potentially conflicting content descriptions. LivePhoto effectively animates real images from diverse domains, demonstrating strong adherence to textual instructions for motion control. The introduction of motion intensity as a parameter allows users to fine-tune the speed and range of generated motions. Text re-weighting successfully mitigates the influence of content descriptions within text prompts, preventing conflicts with the reference image and enhancing motion control. The current implementation is limited by the resolution of SD 1.5 (256x256). Future work can explore higher resolutions and more powerful models like SD-XL to further improve performance. image animation, text-to-video generation, motion control, stable diffusion, content guidance
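The two disambiguation modules described for LivePhoto are easy to picture in code. The sketch below is a rough PyTorch illustration, not the authors' implementation: a per-token re-weighting head that damps content words in the text conditioning, and a learned embedding for a discrete motion-intensity level appended to the token sequence. All class and variable names are ours, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class TextReweighter(nn.Module):
    """Sketch of LivePhoto-style text re-weighting: predict a scalar weight per
    token so motion-related words dominate the conditioning (names are ours)."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, dim // 4), nn.SiLU(), nn.Linear(dim // 4, 1))

    def forward(self, token_emb: torch.Tensor) -> torch.Tensor:
        # token_emb: (B, L, D) from a frozen text encoder (e.g., CLIP text).
        w = torch.sigmoid(self.scorer(token_emb))        # (B, L, 1) in [0, 1]
        return token_emb * w                             # down-weight content words


class MotionIntensityEmbed(nn.Module):
    """Map a discrete motion-intensity level to an embedding appended to the
    text tokens, giving the video model an extra speed/range control signal."""
    def __init__(self, num_levels: int = 10, dim: int = 768):
        super().__init__()
        self.table = nn.Embedding(num_levels, dim)

    def forward(self, token_emb: torch.Tensor, level: torch.Tensor) -> torch.Tensor:
        extra = self.table(level).unsqueeze(1)           # (B, 1, D)
        return torch.cat([token_emb, extra], dim=1)      # (B, L + 1, D)


if __name__ == "__main__":
    tokens = torch.randn(2, 77, 768)
    cond = MotionIntensityEmbed()(TextReweighter()(tokens), torch.tensor([3, 7]))
    print(cond.shape)  # torch.Size([2, 78, 768])
```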
2312.02919 Report Fine-grained Controllable Video Generation via Object Appearance and Context Hsin-Ping Huang, Yu-Chuan Su, Deqing Sun, Lu Jiang, Xuhui Jia, Yukun Zhu, Ming-Hsuan Yang Text-to-video generation has shown promising results. However, by taking only natural languages as input, users often face difficulties in providing detailed information to precisely control the model's output. In this work, we propose fine-grained controllable video generation (FACTOR) to achieve detailed control. Specifically, FACTOR aims to control objects' appearances and context, including their location and category, in conjunction with the text prompt. To achieve detailed control, we propose a unified framework to jointly inject control signals into the existing text-to-video model. Our model consists of a joint encoder and adaptive cross-attention layers. By optimizing the encoder and the inserted layer, we adapt the model to generate videos that are aligned with both text prompts and fine-grained control. Compared to existing methods relying on dense control signals such as edge maps, we provide a more intuitive and user-friendly interface to allow object-level fine-grained control. Our method achieves controllability of object appearances without finetuning, which reduces the per-subject optimization efforts for the users. Extensive experiments on standard benchmark datasets and user-provided inputs validate that our model obtains a 70% improvement in controllability metrics over competitive baselines. This paper presents FACTOR, a framework for fine-grained controllable video generation that allows users to control object appearance and context (location, category) using intuitive inputs like hand-drawn trajectories and reference images. Current text-to-video generation models lack detailed controllability, often requiring dense control signals or per-subject finetuning. This work provides a more user-friendly and efficient approach for customized video generation. FACTOR adapts a pretrained text-to-video model by incorporating a joint encoder for text prompts and control signals, and adaptive cross-attention layers to inject fine-grained control into the generation process. The model is trained by freezing the pretrained weights and updating only the newly added layers. FACTOR achieves a 70% improvement in controllability metrics over baselines, demonstrating effective control over object trajectories and appearances. The model exhibits the ability to generate complex videos with object-object and subject-object interactions, despite not being explicitly trained for this purpose. User studies confirm the model's superior performance in visual quality, text alignment, and adherence to user-specified trajectories and appearances. The current implementation uses a single reference image for appearance control, potentially limiting the range of motion for live subjects. Exploring data augmentation techniques could alleviate this limitation. The model may underperform when text prompts and control signals are misaligned. Future work could investigate strategies for better handling such inconsistencies. video generation, controllable generation, text-to-video, object appearance, trajectory control
2312.02918 Report Multimodal Prompt Perceiver: Empower Adaptiveness, Generalizability and Fidelity for All-in-One Image Restoration Yuang Ai, Huaibo Huang, Xiaoqiang Zhou, Jiexiang Wang, Ran He Despite substantial progress, all-in-one image restoration (IR) grapples with persistent challenges in handling intricate real-world degradations. This paper introduces MPerceiver: a novel multimodal prompt learning approach that harnesses Stable Diffusion (SD) priors to enhance adaptiveness, generalizability and fidelity for all-in-one image restoration. Specifically, we develop a dual-branch module to master two types of SD prompts: textual for holistic representation and visual for multiscale detail representation. Both prompts are dynamically adjusted by degradation predictions from the CLIP image encoder, enabling adaptive responses to diverse unknown degradations. Moreover, a plug-in detail refinement module improves restoration fidelity via direct encoder-to-decoder information transformation. To assess our method, MPerceiver is trained on 9 tasks for all-in-one IR and outperforms state-of-the-art task-specific methods across most tasks. Post multitask pre-training, MPerceiver attains a generalized representation in low-level vision, exhibiting remarkable zero-shot and few-shot capabilities in unseen tasks. Extensive experiments on 16 IR tasks underscore the superiority of MPerceiver in terms of adaptiveness, generalizability and fidelity. This paper introduces MPerceiver, a multimodal prompt learning approach leveraging Stable Diffusion priors for enhanced adaptiveness, generalizability, and fidelity in all-in-one image restoration. Despite substantial progress in all-in-one image restoration, handling intricate real-world degradations remains a challenge, highlighting the need for more adaptive and generalizable solutions. MPerceiver uses a dual-branch module to learn textual and visual prompts dynamically adjusted by degradation predictions. It also utilizes a detail refinement module for enhanced fidelity. MPerceiver outperforms state-of-the-art task-specific methods on most tasks. MPerceiver effectively handles challenging mixed degradations, common in real-world scenarios. Pre-trained MPerceiver exhibits remarkable zero-shot and few-shot capabilities in unseen tasks, demonstrating strong generalization. MPerceiver currently focuses on single-image restoration tasks. Future work will explore the potential of the proposed approach in video restoration image restoration, stable diffusion, prompt learning, multimodal learning, low-level vision
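A rough sketch of the prompt-mixing mechanism described for MPerceiver: a bank of learnable prompts is combined according to soft degradation predictions from an image encoder. The linear probe below stands in for the paper's CLIP-based degradation predictor, and all sizes and names are illustrative assumptions, not the released code.

```python
import torch
import torch.nn as nn

class DegradationAwarePrompts(nn.Module):
    """Sketch of the MPerceiver idea: learnable prompts mixed by soft
    degradation predictions from a global image feature (assumed shapes)."""
    def __init__(self, num_degradations=9, prompt_len=8, dim=768, feat_dim=512):
        super().__init__()
        self.prompt_bank = nn.Parameter(torch.randn(num_degradations, prompt_len, dim) * 0.02)
        self.degradation_probe = nn.Linear(feat_dim, num_degradations)

    def forward(self, image_feat: torch.Tensor) -> torch.Tensor:
        # image_feat: (B, feat_dim) global feature of the degraded input image.
        probs = self.degradation_probe(image_feat).softmax(dim=-1)      # (B, K)
        # Weighted sum over the K degradation-specific prompt sets.
        prompts = torch.einsum("bk,kld->bld", probs, self.prompt_bank)  # (B, prompt_len, dim)
        return prompts


if __name__ == "__main__":
    feats = torch.randn(4, 512)
    print(DegradationAwarePrompts()(feats).shape)  # torch.Size([4, 8, 768])
```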
2312.02902 Report HeadGaS: Real-Time Animatable Head Avatars via 3D Gaussian Splatting Helisa Dhamo, Yinyu Nie, Arthur Moreau, Jifei Song, Richard Shaw, Yiren Zhou, Eduardo Pérez-Pellitero 3D head animation has seen major quality and runtime improvements over the last few years, particularly empowered by the advances in differentiable rendering and neural radiance fields. Real-time rendering is a highly desirable goal for real-world applications. We propose HeadGaS, the first model to use 3D Gaussian Splats (3DGS) for 3D head reconstruction and animation. In this paper we introduce a hybrid model that extends the explicit representation from 3DGS with a base of learnable latent features, which can be linearly blended with low-dimensional parameters from parametric head models to obtain expression-dependent final color and opacity values. We demonstrate that HeadGaS delivers state-of-the-art results in real-time inference frame rates, which surpasses baselines by up to ~2dB, while accelerating rendering speed by over x10. This paper proposes HeadGaS, the first model to use 3D Gaussian Splats (3DGS) for real-time 3D head reconstruction and animation. Real-time rendering of animatable 3D heads is essential for various applications like AR/VR and teleconferencing. Existing methods struggle to achieve both high realism and real-time performance. The method enhances 3DGS with a base of learnable latent features within each Gaussian. These features are blended using expression parameters from parametric head models to obtain expression-dependent color and opacity values. The model is trained on monocular videos with tracked head poses and expression weights. HeadGaS achieves state-of-the-art results on public datasets, outperforming baselines in visual quality by up to 2dB (PSNR). It significantly surpasses baselines in rendering speed, achieving real-time performance of over 100fps. The method enables realistic novel view synthesis and cross-subject expression transfer. Performance depends on the accuracy of pre-computed head poses and expression weights. Limited generalization to unseen expressions or viewpoints that are significantly different from the training data. 3d head animation, 3d gaussian splatting, real-time rendering, differentiable rendering, neural radiance fields
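The core HeadGaS idea, a per-Gaussian feature basis blended linearly with expression coefficients and then decoded to expression-dependent color and opacity, can be sketched compactly. The snippet below is an illustrative PyTorch toy under assumed shapes (a real system would plug this into a 3DGS rasterizer); names and the decoder architecture are ours.

```python
import torch
import torch.nn as nn

class ExpressionBlendedGaussians(nn.Module):
    """Sketch of the HeadGaS-style hybrid representation: each Gaussian carries a
    small basis of latent features that is linearly blended with parametric-head
    expression coefficients, then decoded to color and opacity."""
    def __init__(self, num_gaussians: int, num_expr: int = 50, feat_dim: int = 32):
        super().__init__()
        # One latent feature vector per Gaussian and per expression basis element.
        self.feature_basis = nn.Parameter(torch.randn(num_gaussians, num_expr, feat_dim) * 0.01)
        self.decoder = nn.Sequential(nn.Linear(feat_dim, 64), nn.SiLU(), nn.Linear(64, 4))

    def forward(self, expr: torch.Tensor):
        # expr: (num_expr,) expression weights from e.g. a 3DMM tracker.
        blended = torch.einsum("gef,e->gf", self.feature_basis, expr)  # (G, feat_dim)
        out = self.decoder(blended)                                    # (G, 4)
        rgb = torch.sigmoid(out[:, :3])      # per-Gaussian, expression-dependent color
        opacity = torch.sigmoid(out[:, 3:])  # per-Gaussian, expression-dependent opacity
        return rgb, opacity


if __name__ == "__main__":
    model = ExpressionBlendedGaussians(num_gaussians=1000)
    rgb, alpha = model(torch.randn(50))
    print(rgb.shape, alpha.shape)  # torch.Size([1000, 3]) torch.Size([1000, 1])
```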
2312.02896 Report BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models Rizhao Cai, Zirui Song, Dayan Guan, Zhenhao Chen, Xing Luo, Chenyu Yi, Alex Kot Large Multimodal Models (LMMs) such as GPT-4V and LLaVA have shown remarkable capabilities in visual reasoning with common image styles. However, their robustness against diverse style shifts, crucial for practical applications, remains largely unexplored. In this paper, we propose a new benchmark, BenchLMM, to assess the robustness of LMMs against three different styles: artistic image style, imaging sensor style, and application style, where each style has five sub-styles. Utilizing BenchLMM, we comprehensively evaluate state-of-the-art LMMs and reveal: 1) LMMs generally suffer performance degradation when working with other styles; 2) An LMM performs better than another model in common style does not guarantee its superior performance in other styles; 3) LMMs' reasoning capability can be enhanced by prompting LMMs to predict the style first, based on which we propose a versatile and training-free method for improving LMMs; 4) An intelligent LMM is expected to interpret the causes of its errors when facing stylistic variations. We hope that our benchmark and analysis can shed new light on developing more intelligent and versatile LMMs. This paper introduces BenchLMM, a benchmark designed to evaluate the robustness of Large Multimodal Models (LMMs) against various visual style shifts. Existing LMM benchmarks primarily use common image styles, limiting the understanding of LMM performance across diverse artistic, sensor, and application-specific styles, crucial for real-world applications. BenchLMM leverages existing datasets with re-labeling for VQA across three style categories: artistic (Cartoon, Sketch, etc.), sensor (Infrared, X-ray, etc.), and application-specific (remote sensing, autonomous driving, etc.). The authors evaluate several state-of-the-art LMMs, including GPT-4V, on BenchLMM and propose a Style Prompt Enhancement (SPE) method. LMMs exhibit significant performance degradation when presented with images outside common styles. Superior performance on common-style images doesn't guarantee similar performance on other styles, highlighting the need for comprehensive evaluation. The proposed SPE method, prompting LMMs to predict image style before answering questions, shows consistent improvement across styles without fine-tuning. The study primarily focuses on accuracy, neglecting other aspects like computational efficiency and bias detection in LMMs. Future work could explore fine-tuning LMMs on diverse styles and incorporate human feedback for error analysis and improvement. large multimodal models, visual reasoning, benchmarking, style transfer, domain adaptation
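The Style Prompt Enhancement trick from BenchLMM is training-free and amounts to a two-step prompt. The sketch below assumes a generic `lmm(image, prompt) -> str` callable rather than any specific model API; the exact prompt wording in the paper may differ.

```python
def style_prompt_enhancement(lmm, image, question):
    """Training-free SPE as described for BenchLMM: first ask the model what
    style the image is in, then prepend that answer to the actual question.
    `lmm(image, prompt) -> str` is a placeholder for any chat-capable LMM."""
    style = lmm(image, "What is the artistic or sensor style of this image? Answer briefly.")
    enhanced = (
        f"The image style is: {style.strip()}. "
        f"Taking this style into account, answer the question: {question}"
    )
    return lmm(image, enhanced)
```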
2312.02772 Report FG-MDM: Towards Zero-Shot Human Motion Generation via Fine-Grained Descriptions Xu Shi, Wei Yao, Chuanchen Luo, Junran Peng, Hongwen Zhang, Yunlian Sun Recently, significant progress has been made in text-based motion generation, enabling the generation of diverse and high-quality human motions that conform to textual descriptions. However, generating motions beyond the distribution of original datasets remains challenging, i.e., zero-shot generation. By adopting a divide-and-conquer strategy, we propose a new framework named Fine-Grained Human Motion Diffusion Model (FG-MDM) for zero-shot human motion generation. Specifically, we first parse previous vague textual annotations into fine-grained descriptions of different body parts by leveraging a large language model. We then use these fine-grained descriptions to guide a transformer-based diffusion model, which further adopts a design of part tokens. FG-MDM can generate human motions beyond the scope of original datasets owing to descriptions that are closer to motion essence. Our experimental results demonstrate the superiority of FG-MDM over previous methods in zero-shot settings. We will release our fine-grained textual annotations for HumanML3D and KIT. The paper introduces FG-MDM, a novel framework for zero-shot human motion generation that leverages fine-grained descriptions of body parts to guide a diffusion model. Generating human motions beyond the distribution of existing datasets is challenging due to limited dataset size and diversity. Existing methods struggle to generalize to unseen motions. The authors use ChatGPT to paraphrase vague textual descriptions into detailed descriptions of individual body parts. These fine-grained descriptions, along with global text embeddings, guide a transformer-based diffusion model that uses part tokens for each body part. FG-MDM outperforms state-of-the-art methods in zero-shot motion generation on HuMMan and Kungfu datasets. Qualitative results demonstrate FG-MDM's ability to generate motions consistent with fine-grained textual descriptions, including unseen and stylized motions. A user study confirms the superior quality and text-matching capabilities of motions generated by FG-MDM. The quality of fine-grained text annotations can be further improved. Exploring better methods for incorporating fine-grained information into the diffusion model. human motion generation, zero-shot learning, diffusion models, large language models, fine-grained descriptions
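The data side of FG-MDM is essentially a prompt-engineering step: an LLM rewrites a vague motion caption into per-body-part descriptions. The template below is our own illustration of that step, not the wording used by the authors.

```python
FINE_GRAINED_PROMPT = """You will rewrite a short human-motion caption into
fine-grained descriptions of individual body parts.
Caption: "{caption}"
Return one short sentence for each of: head, torso, left arm, right arm,
left leg, right leg, describing how that part moves."""

def build_part_prompt(caption: str) -> str:
    """Sketch of an LLM prompt that decomposes a vague motion caption into
    per-body-part descriptions (illustrative; the paper's exact prompt may differ)."""
    return FINE_GRAINED_PROMPT.format(caption=caption)


if __name__ == "__main__":
    print(build_part_prompt("a person waves goodbye"))
```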
2312.02703 Report MyPortrait: Morphable Prior-Guided Personalized Portrait Generation Bo Ding, Zhenfeng Fan, Shuang Yang, Shihong Xia Generating realistic talking faces is an interesting and long-standing topic in the field of computer vision. Although significant progress has been made, it is still challenging to generate high-quality dynamic faces with personalized details. This is mainly due to the inability of the general model to represent personalized details and the generalization problem to unseen controllable parameters. In this work, we propose Myportrait, a simple, general, and flexible framework for neural portrait generation. We incorporate personalized prior in a monocular video and morphable prior in 3D face morphable space for generating personalized details under novel controllable parameters. Our proposed framework supports both video-driven and audio-driven face animation given a monocular video of a single person. Distinguished by whether the test data is sent to training or not, our method provides a real-time online version and a high-quality offline version. Comprehensive experiments in various metrics demonstrate the superior performance of our method over the state-of-the-art methods. The code will be publicly available. Presents Myportrait, a novel prior-guided framework for neural portrait generation that leverages personalized prior from a monocular video and morphable prior from 3D face morphable space to generate high-quality dynamic faces with personalized details. Addresses the challenge of generating high-quality dynamic faces with personalized details due to limitations in representing these details and generalizing to unseen controllable parameters. Employs a two-stage training strategy: (1) Reconstruction Training on a monocular video to learn personalized prior. (2) Scalable Training incorporating morphable prior from auxiliary data to extend the face parameter space and improve generalization to novel parameters. Achieves superior performance in self-reenactment experiments, evidenced by lower L1 distance, LPIPS, and FID compared to state-of-the-art methods. Outperforms existing methods in cross-reenactment experiments, demonstrating improved CSIM and FID, particularly in the offline version where driven data is included in training. Shows promising results in audio-driven reenactment, indicating the validity of the morphable prior in enhancing results for this application. Limited to monocular videos with fixed backgrounds due to the reduction of 3D to 2D scenes, potentially addressed by incorporating face segmentation methods. Performance reliant on the accuracy of face parameters extracted by face trackers, with potential for improvement as face tracking technology advances. neural portrait generation, talking face generation, personalized prior, morphable prior, 3d face morphable model
2312.02663 Report FaceStudio: Put Your Face Everywhere in Seconds Yuxuan Yan, Chi Zhang, Rui Wang, Yichao Zhou, Gege Zhang, Pei Cheng, Gang Yu, Bin Fu This study investigates identity-preserving image synthesis, an intriguing task in image generation that seeks to maintain a subject's identity while adding a personalized, stylistic touch. Traditional methods, such as Textual Inversion and DreamBooth, have made strides in custom image creation, but they come with significant drawbacks. These include the need for extensive resources and time for fine-tuning, as well as the requirement for multiple reference images. To overcome these challenges, our research introduces a novel approach to identity-preserving synthesis, with a particular focus on human images. Our model leverages a direct feed-forward mechanism, circumventing the need for intensive fine-tuning, thereby facilitating quick and efficient image generation. Central to our innovation is a hybrid guidance framework, which combines stylized images, facial images, and textual prompts to guide the image generation process. This unique combination enables our model to produce a variety of applications, such as artistic portraits and identity-blended images. Our experimental results, including both qualitative and quantitative evaluations, demonstrate the superiority of our method over existing baseline models and previous works, particularly in its remarkable efficiency and ability to preserve the subject's identity with high fidelity. This paper introduces a novel, tuning-free framework for identity-preserving image synthesis, focusing on human images. This framework leverages a hybrid guidance module that integrates stylized images, facial images, and textual prompts to guide image generation while preserving identity. Existing text-to-image diffusion models face challenges in capturing nuanced details, like human facial features, relying solely on textual descriptions. Existing methods for identity-preserving synthesis often require resource-intensive fine-tuning and multiple reference images. The method uses a direct feed-forward mechanism, eliminating the need for fine-tuning. It employs a hybrid guidance module combining textual prompts, style images, and identity images to guide the image generation process of a latent diffusion model. For images with multiple identities, a multi-identity cross-attention mechanism maps guidance details to specific human segments. The proposed model outperforms existing methods in both qualitative and quantitative evaluations, particularly in efficiency and identity preservation. The model effectively synthesizes images with large pose changes while maintaining identity, showcasing its robustness. The multi-identity cross-attention mechanism enables the generation of multi-human images with distinct identities, surpassing baselines using vanilla cross-attention. The model is specifically tailored for human images, limiting its application to other subjects like animals or objects. The ability to generate realistic human images raises concerns regarding intellectual property and potential misuse in creating offensive content. image synthesis, identity preservation, diffusion models, hybrid guidance, multi-identity generation
2312.02625 Report Diffusion Noise Feature: Accurate and Fast Generated Image Detection Yichi Zhang, Xiaogang Xu Generative models have reached an advanced stage where they can produce remarkably realistic images. However, this remarkable generative capability also introduces the risk of disseminating false or misleading information. Notably, existing image detectors for generated images encounter challenges such as low accuracy and limited generalization. This paper seeks to address this issue by seeking a representation with strong generalization capabilities to enhance the detection of generated images. Our investigation has revealed that real and generated images display distinct latent Gaussian representations when subjected to an inverse diffusion process within a pre-trained diffusion model. Exploiting this disparity, we can amplify subtle artifacts in generated images. Building upon this insight, we introduce a novel image representation known as Diffusion Noise Feature (DNF). DNF is extracted from the estimated noise generated during the inverse diffusion process. A simple classifier, e.g., ResNet50, trained on DNF achieves high accuracy, robustness, and generalization capabilities for detecting generated images (even the corresponding generator is built with datasets/structures that are not seen during the classifier's training). We conducted experiments using four training datasets and five testsets, achieving state-of-the-art detection performance. This paper proposes Diffusion Noise Feature (DNF), a novel image representation for detecting generated images by leveraging the distinct latent Gaussian representations of real and generated images during the inverse diffusion process in a pre-trained diffusion model. Existing generated image detectors face limitations in accuracy and generalization, particularly with the increasing realism of images from state-of-the-art generative models. This necessitates a novel representation with enhanced generalization capabilities to distinguish real and generated images effectively. DNF is extracted by inputting an image into a pre-trained diffusion model, executing the inverse diffusion process, and collecting the estimated noise generated at each step. A fusion strategy, determined experimentally, combines these noise estimations to obtain the final DNF representation. The DNF classifier achieved state-of-the-art detection performance, significantly outperforming existing methods with a perfect 100% accuracy and precision on DiffusionForensics. DNF exhibited exceptional robustness against common image perturbations like Gaussian blur and JPEG compression, maintaining over 99.2% accuracy. The DNF classifier demonstrated strong cross-dataset and cross-generator generalization capabilities, accurately detecting images from unseen datasets and generators, including those based on different generative principles (e.g., Diffusion Models vs. GANs). The effectiveness of different fusion strategies for combining the estimated noise sequence needs further investigation to optimize DNF computation for diverse scenarios and challenging detection tasks. Future research will focus on developing novel detection models specifically tailored for DNF to address the evolving capabilities of emerging generative models like Stable Diffusion v3 and Sora. generated image detection, diffusion models, feature engineering, generalization capability, robustness
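The DNF representation is a by-product of running the inverse (DDIM) diffusion process and recording the predicted noise. The sketch below shows one plausible way to compute it, with `eps_model` and `alphas_cumprod` as placeholders for a real pre-trained diffusion model and simple averaging as the fusion rule (the paper compares several fusion strategies); the fused map is then fed to an ordinary classifier such as a ResNet-50.

```python
import torch

@torch.no_grad()
def diffusion_noise_feature(x0, eps_model, alphas_cumprod, num_steps=10):
    """Sketch of a DNF-style feature: deterministic DDIM *inversion* for a few
    steps with a pre-trained noise predictor, keeping the predicted noise at
    each step and averaging it. `eps_model(x, t)` and `alphas_cumprod` (the
    cumulative-alpha schedule, a 1-D tensor) are placeholders."""
    T = alphas_cumprod.shape[0]
    ts = torch.linspace(0, T - 1, num_steps + 1).long()
    x, feats = x0, []
    for i in range(num_steps):
        t, t_next = ts[i], ts[i + 1]
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        eps = eps_model(x, torch.full((x.shape[0],), int(t), device=x.device))
        feats.append(eps)
        # Deterministic DDIM inversion step: move x_t toward x_{t_next}.
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return torch.stack(feats).mean(dim=0)  # fused DNF, same shape as the input
```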
2312.02617 Report DreaMo: Articulated 3D Reconstruction From A Single Casual Video Tao Tu, Ming-Feng Li, Chieh Hubert Lin, Yen-Chi Cheng, Min Sun, Ming-Hsuan Yang Articulated 3D reconstruction has valuable applications in various domains, yet it remains costly and demands intensive work from domain experts. Recent advancements in template-free learning methods show promising results with monocular videos. Nevertheless, these approaches necessitate a comprehensive coverage of all viewpoints of the subject in the input video, thus limiting their applicability to casually captured videos from online sources. In this work, we study articulated 3D shape reconstruction from a single and casually captured internet video, where the subject's view coverage is incomplete. We propose DreaMo that jointly performs shape reconstruction while solving the challenging low-coverage regions with view-conditioned diffusion prior and several tailored regularizations. In addition, we introduce a skeleton generation strategy to create human-interpretable skeletons from the learned neural bones and skinning weights. We conduct our study on a self-collected internet video collection characterized by incomplete view coverage. DreaMo shows promising quality in novel-view rendering, detailed articulated shape reconstruction, and skeleton generation. Extensive qualitative and quantitative studies validate the efficacy of each proposed component, and show existing methods are unable to solve correct geometry due to the incomplete view coverage. This paper introduces DreaMo, a novel template-free framework designed to reconstruct articulated 3D models from single, casually captured videos with incomplete view coverage. Reconstructing 3D models from casual videos, which often lack comprehensive viewpoint coverage, is crucial for various applications but challenging for existing methods. DreaMo utilizes a neural implicit function to learn a rest-pose 3D model and employs a view-conditioned diffusion model to hallucinate plausible geometry in unseen or low-coverage regions. It further introduces regularization techniques to refine neural bone placement and enhance reconstruction quality. DreaMo outperforms state-of-the-art methods in reconstructing detailed 3D shapes with plausible textures from videos with limited viewpoints. The proposed regularization schemes are shown to effectively improve the placement of neural bones, leading to more intuitive skeletons and fewer geometric artifacts. DreaMo supports user control, enabling the manipulation of reconstructed models into novel poses by adjusting the generated skeletons. DreaMo, as a structure-from-motion method, requires a certain level of camera baseline and struggles with videos lacking sufficient viewpoint diversity. The hallucination of bones and articulations in entirely unseen regions remains a challenge, as the model relies on observing real-world motions to learn these features. 3d reconstruction, articulated shape reconstruction, diffusion models, view synthesis, skeleton generation
2312.02548 Report GeNIe: Generative Hard Negative Images Through Diffusion Soroush Abbasi Koohpayegani, Anuj Singh, K L Navaneet, Hadi Jamali-Rad, Hamed Pirsiavash Data augmentation is crucial in training deep models, preventing them from overfitting to limited data. Recent advances in generative AI, e.g., diffusion models, have enabled more sophisticated augmentation techniques that produce data resembling natural images. We introduce GeNIe, a novel augmentation method which leverages a latent diffusion model conditioned on a text prompt to merge contrasting data points (an image from the source category and a text prompt from the target category) to generate challenging samples. To achieve this, inspired by recent diffusion based image editing techniques, we limit the number of diffusion iterations to ensure the generated image retains low-level and background features from the source image while representing the target category, resulting in a hard negative sample for the source category. We further enhance the proposed approach by finding the appropriate noise level adaptively for each image (coined as GeNIe-Ada) leading to further performance improvement. Our extensive experiments, in both few-shot and long-tail distribution settings, demonstrate the effectiveness of our novel augmentation method and its superior performance over the prior art. Our code is available here: https://github.com/UCDvision/GeNIe GeNIe is a novel data augmentation method that leverages a text-prompted latent diffusion model to generate challenging (hard negative) samples. It achieves this by merging contrasting data points: an image from the source category and a text prompt from the target category. GeNIe addresses challenges in training deep models with limited data, particularly in few-shot and long-tailed learning scenarios where model generalization and robustness are crucial. It is also helpful in mitigating the effect of spurious correlations in datasets. GeNIe employs a two-step process: (1) It partially adds noise to the latent representation of a source image. (2) It leverages a text-conditioned diffusion model, prompted with the target category, to generate a new image that semantically aligns with the target category while preserving low-level features from the source image. An adaptive noise level selection strategy (GeNIe-Ada) is further proposed to automatically determine the optimal noise level for each image. GeNIe consistently improves the performance of few-shot image classification on mini-Imagenet and tiered-Imagenet, surpassing other state-of-the-art methods and data augmentation techniques. In long-tailed classification on ImageNet-LT, GeNIe leads to a significant performance boost, particularly for categories with limited samples, demonstrating its effectiveness in addressing data imbalance. For fine-grained few-shot classification, GeNIe consistently outperforms other text-based augmentation methods across various datasets like CUB200, Cars196, Food101, and FGVC-Aircraft. The augmentation process in GeNIe is slower than traditional methods due to the time required for the diffusion process, making it less suitable for online augmentation settings. GeNIe might face challenges with datasets where images significantly deviate from the generative model's training distribution or with unfamiliar category names, requiring potential fine-tuning of the model. data augmentation, diffusion models, few-shot learning, long-tailed classification, hard negative mining
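One way to approximate the GeNIe recipe with off-the-shelf tools is an SDEdit-style image-to-image call: partially noise the source-class image and denoise it under the target-class prompt, so a moderate strength keeps low-level and background cues from the source while the semantics flip to the target. The sketch below uses the diffusers img2img pipeline as a stand-in for the authors' code; GeNIe-Ada additionally picks the noise level per image, whereas here it is fixed.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Sketch: generate a hard negative for the source class by denoising a partially
# noised source image under a *target-class* prompt (GeNIe-style; not the authors' code).
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

source_image = Image.open("source_class_sample.jpg").convert("RGB").resize((512, 512))
target_class = "golden retriever"  # hypothetical target category

hard_negative = pipe(
    prompt=f"a photo of a {target_class}",
    image=source_image,
    strength=0.6,        # fraction of diffusion steps re-run; lower keeps more of the source
    guidance_scale=7.5,
).images[0]
hard_negative.save("hard_negative_for_source_class.png")
```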
2312.02503 Report SAVE: Protagonist Diversification with Structure Agnostic Video Editing Yeji Song, Wonsik Shin, Junsoo Lee, Jeesoo Kim, Nojun Kwak Driven by the upsurge progress in text-to-image (T2I) generation models, text-to-video (T2V) generation has experienced a significant advance as well. Accordingly, tasks such as modifying the object or changing the style in a video have been possible. However, previous works usually work well on trivial and consistent shapes, and easily collapse on a difficult target that has a largely different body shape from the original one. In this paper, we spot the bias problem in the existing video editing method that restricts the range of choices for the new protagonist and attempt to address this issue using the conventional image-level personalization method. We adopt motion personalization that isolates the motion from a single source video and then modifies the protagonist accordingly. To deal with the natural discrepancy between image and video, we propose a motion word with an inflated textual embedding to properly represent the motion in a source video. We also regulate the motion word to attend to proper motion-related areas by introducing a novel pseudo optical flow, efficiently computed from the pre-calculated attention maps. Finally, we decouple the motion from the appearance of the source video with an additional pseudo word. Extensive experiments demonstrate the editing capability of our method, taking a step toward more diverse and extensive video editing. This paper introduces SAVE, a novel single-shot video editing method that enables protagonist diversification while preserving the motion of the original subject, even with substantial changes in body structure. Existing video editing methods struggle to maintain motion fidelity when replacing protagonists with objects of significantly different shapes, limiting their flexibility and diversity. SAVE employs a motion personalization approach with a new motion word (S_mot) that captures the specific motion in a source video. This word utilizes expanded text embeddings with temporal information and is trained with a motion-aware cross-attention loss based on a novel pseudo optical flow. Additionally, pre-registration of the protagonist's appearance (S_pro) disentangles motion from appearance during training. SAVE successfully edits protagonists with diverse structures while maintaining the original motion, outperforming existing methods in qualitative comparisons. Quantitative evaluation shows SAVE achieves superior performance in text alignment, frame consistency, and user preference. Ablation studies confirm the contribution of each proposed component, including expanded text embeddings, cross-attention regularization, and pre-registration of the protagonist. The current method is limited to single protagonist motions and struggles with multiple protagonists. Future work will focus on expanding to broader motion types, including background and camera movements. video editing, motion personalization, text-to-video generation, diffusion models, protagonist diversification
2312.02432 Report Orthogonal Adaptation for Modular Customization of Diffusion Models Ryan Po, Guandao Yang, Kfir Aberman, Gordon Wetzstein Customization techniques for text-to-image models have paved the way for a wide range of previously unattainable applications, enabling the generation of specific concepts across diverse contexts and styles. While existing methods facilitate high-fidelity customization for individual concepts or a limited, pre-defined set of them, they fall short of achieving scalability, where a single model can seamlessly render countless concepts. In this paper, we address a new problem called Modular Customization, with the goal of efficiently merging customized models that were fine-tuned independently for individual concepts. This allows the merged model to jointly synthesize concepts in one image without compromising fidelity or incurring any additional computational costs. To address this problem, we introduce Orthogonal Adaptation, a method designed to encourage the customized models, which do not have access to each other during fine-tuning, to have orthogonal residual weights. This ensures that during inference time, the customized models can be summed with minimal interference. Our proposed method is both simple and versatile, applicable to nearly all optimizable weights in the model architecture. Through an extensive set of quantitative and qualitative evaluations, our method consistently outperforms relevant baselines in terms of efficiency and identity preservation, demonstrating a significant leap toward scalable customization of diffusion models. This paper introduces Orthogonal Adaptation, a novel method for modular customization of text-to-image diffusion models, enabling efficient merging of independently fine-tuned models for multi-concept image synthesis. Existing methods struggle with scalability in multi-concept customization, exhibiting degradation in concept quality when merged or requiring computationally expensive joint training. Orthogonal Adaptation encourages orthogonal residual weights during independent concept fine-tuning, minimizing interference and preserving identity during merging through simple summation. Orthogonal Adaptation maintains high fidelity in single-concept generations from merged models, outperforming baselines in identity preservation. It enables efficient merging of multiple concepts, exhibiting superior identity alignment compared to baselines, even with a large number of merged concepts. Quantitative evaluations demonstrate superior image and identity alignment scores while maintaining comparable text alignment to state-of-the-art methods. Generating images with complex compositions and interactions between multiple custom concepts remains challenging. The method currently requires modification of the fine-tuning process and cannot be applied post-hoc to existing fine-tuned models. diffusion models, text-to-image synthesis, model customization, orthogonal adaptation, multi-concept generation
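The orthogonality idea behind this paper can be illustrated on a single linear layer: each concept's LoRA down-projection is frozen to a distinct block of rows of one shared orthogonal basis, so independently trained updates interfere minimally when summed at merge time. The sketch below is our own toy rendering under those assumptions (the paper applies the scheme across the attention weights of a diffusion U-Net), and the class names and basis-splitting scheme are illustrative.

```python
import torch
import torch.nn as nn

class OrthogonalLoRALinear(nn.Module):
    """Sketch of orthogonal adaptation for one linear layer: A_i is frozen to a
    distinct slice of a shared orthogonal basis (so A_i A_j^T = 0 for i != j)
    and only B_i is trained per concept."""
    def __init__(self, base: nn.Linear, rank: int, concept_id: int, shared_basis: torch.Tensor):
        super().__init__()
        self.base = base.requires_grad_(False)
        rows = shared_basis[concept_id * rank:(concept_id + 1) * rank]   # (rank, in_features)
        self.A = nn.Parameter(rows.clone(), requires_grad=False)         # frozen, concept-specific subspace
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))      # trained per concept

    def forward(self, x):
        return self.base(x) + x @ self.A.t() @ self.B.t()

    def delta(self):
        return self.B @ self.A  # residual weight to add when merging


def merge(base: nn.Linear, adapters):
    """Merge independently fine-tuned concepts by simply summing their residuals."""
    merged = nn.Linear(base.in_features, base.out_features, bias=base.bias is not None)
    merged.load_state_dict(base.state_dict())
    with torch.no_grad():
        for a in adapters:
            merged.weight += a.delta()
    return merged


if __name__ == "__main__":
    in_f, out_f, rank = 64, 64, 4
    basis = torch.linalg.qr(torch.randn(in_f, in_f)).Q.t()   # rows are orthonormal
    base = nn.Linear(in_f, out_f)
    adapters = [OrthogonalLoRALinear(base, rank, cid, basis) for cid in range(2)]
    merged = merge(base, adapters)   # one layer carrying both concepts
    print(merged.weight.shape)
```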
2312.02420 Report Towards Granularity-adjusted Pixel-level Semantic Annotation Rohit Kundu, Sudipta Paul, Rohit Lal, Amit K. Roy-Chowdhury Recent advancements in computer vision predominantly rely on learning-based systems, leveraging annotations as the driving force to develop specialized models. However, annotating pixel-level information, particularly in semantic segmentation, presents a challenging and labor-intensive task, prompting the need for autonomous processes. In this work, we propose GranSAM which distinguishes itself by providing semantic segmentation at the user-defined granularity level on unlabeled data without the need for any manual supervision, offering a unique contribution in the realm of semantic mask annotation method. Specifically, we propose an approach to enable the Segment Anything Model (SAM) with semantic recognition capability to generate pixel-level annotations for images without any manual supervision. For this, we accumulate semantic information from synthetic images generated by the Stable Diffusion model or web crawled images and employ this data to learn a mapping function between SAM mask embeddings and object class labels. As a result, SAM, enabled with granularity-adjusted mask recognition, can be used for pixel-level semantic annotation purposes. We conducted experiments on the PASCAL VOC 2012 and COCO-80 datasets and observed a +17.95% and +5.17% increase in mIoU, respectively, compared to existing state-of-the-art methods when evaluated under our problem setting. This paper introduces GranSAM, a novel semantic segmentation based annotation framework that generates pixel-level annotations and semantic masks without requiring any manually labeled images or human interaction. Annotating pixel-level information for semantic segmentation is labor-intensive and expensive. GranSAM offers an autonomous solution, enhancing efficiency and reducing costs in developing specialized models for computer vision. The framework utilizes the Segment Anything Model (SAM) for region distinction and leverages synthetic images (generated via Stable Diffusion) or web crawled images to guide SAM's semantic understanding. A classifier head is trained on SAM's mask embeddings using a weakly-supervised multiple instance learning setup with uncertainty distillation to map masks to user-defined object classes. GranSAM achieves competitive performance compared to existing unsupervised semantic segmentation methods on PASCAL VOC and COCO-80 datasets despite being trained on a small set of synthetic or web crawled single-object images. The framework demonstrates superior performance over state-of-the-art unsupervised methods when tested on unseen data distributions, highlighting its generalization capabilities. Uncertainty distillation during training significantly improves the model's discriminative ability, especially on the challenging COCO-80 dataset. The performance of GranSAM on COCO-80, while exceeding baselines, highlights the challenges posed by complex datasets with diverse object classes and scenes. Further research can explore the incorporation of techniques like few-shot learning to further enhance the model's ability to generalize from limited training data and refine segmentation accuracy. semantic segmentation, automatic annotation, segment anything model, unsupervised learning, weakly supervised learning
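Since GranSAM only has image-level labels for its single-object synthetic or web-crawled images, the classifier head over SAM mask embeddings can be trained with a simple multiple-instance assumption. The sketch below is an illustrative simplification (it omits the uncertainty-distillation component); all names and sizes are ours.

```python
import torch
import torch.nn as nn

class MaskClassifierHead(nn.Module):
    """Sketch of the GranSAM idea: a lightweight head maps SAM mask embeddings
    to user-defined class logits."""
    def __init__(self, embed_dim: int = 256, num_classes: int = 20):
        super().__init__()
        self.head = nn.Sequential(nn.LayerNorm(embed_dim), nn.Linear(embed_dim, num_classes))

    def forward(self, mask_embeddings: torch.Tensor) -> torch.Tensor:
        # mask_embeddings: (num_masks, embed_dim) produced by SAM for one image.
        return self.head(mask_embeddings)                 # (num_masks, num_classes)


def mil_image_loss(mask_logits: torch.Tensor, image_label: int) -> torch.Tensor:
    """Image-level supervision via a noisy-max multiple-instance assumption:
    the image label is explained by the best-matching mask."""
    image_logits = mask_logits.max(dim=0).values          # (num_classes,)
    return nn.functional.cross_entropy(image_logits.unsqueeze(0),
                                       torch.tensor([image_label]))


if __name__ == "__main__":
    logits = MaskClassifierHead()(torch.randn(12, 256))   # 12 SAM masks for one image
    print(mil_image_loss(logits, image_label=3))
```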
2312.02362 Report PointNeRF++: A multi-scale, point-based Neural Radiance Field Weiwei Sun, Eduard Trulls, Yang-Che Tseng, Sneha Sambandam, Gopal Sharma, Andrea Tagliasacchi, Kwang Moo Yi Point clouds offer an attractive source of information to complement images in neural scene representations, especially when few images are available. Neural rendering methods based on point clouds do exist, but they do not perform well when the point cloud quality is low -- e.g., sparse or incomplete, which is often the case with real-world data. We overcome these problems with a simple representation that aggregates point clouds at multiple scale levels with sparse voxel grids at different resolutions. To deal with point cloud sparsity, we average across multiple scale levels -- but only among those that are valid, i.e., that have enough neighboring points in proximity to the ray of a pixel. To help model areas without points, we add a global voxel at the coarsest scale, thus unifying ``classical'' and point-based NeRF formulations. We validate our method on the NeRF Synthetic, ScanNet, and KITTI-360 datasets, outperforming the state of the art, with a significant gap compared to other NeRF-based methods, especially on more challenging scenes. This paper introduces PointNeRF++, a novel multi-scale, point-based neural radiance field representation for improved novel view synthesis, especially in challenging scenarios with sparse or incomplete point clouds. Existing point cloud-based neural rendering methods struggle with low-quality, sparse, or incomplete point clouds often encountered in real-world data. The method aggregates point clouds at multiple scale levels with sparse voxel grids, averaging features only across valid scales with sufficient neighboring points. It also incorporates a global voxel at the coarsest scale to model areas without points, unifying classic and point-based NeRF formulations. A tri-plane representation is utilized for coarser scales to effectively cover larger support regions. Significantly outperforms state-of-the-art methods on the NeRF Synthetic, ScanNet, and KITTI-360 datasets. Demonstrates superior performance, especially in handling sparse or incomplete point clouds, compared to PointNeRF and other baselines. Shows the effectiveness of multi-scale representation and the global voxel in capturing scene details and filling in gaps in point clouds. The computational cost is limited by the classic NeRF backbone. Future work includes exploring the combination of the multi-scale strategy with computationally efficient methods like 3D Gaussian Splatting. neural radiance fields, point clouds, multi-scale representation, novel view synthesis, 3d scene reconstruction
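The aggregation rule at the heart of PointNeRF++ averages per-scale features only over valid scales, i.e. scales with enough neighboring points near the ray sample, with the coarsest global scale always counted as valid. A minimal sketch of that rule, assuming pre-computed per-scale features and neighbor counts:

```python
import torch

def aggregate_multiscale(features, neighbor_counts, min_neighbors=3):
    """Sketch of the PointNeRF++ rule: average features across scales, but only
    over scales whose voxel has enough neighboring points; the last (global)
    scale is always treated as valid so empty regions still get a feature."""
    # features:        (num_scales, num_queries, dim)
    # neighbor_counts: (num_scales, num_queries)
    valid = (neighbor_counts >= min_neighbors).float()
    valid[-1] = 1.0                                        # global voxel scale is always valid
    weights = valid / valid.sum(dim=0, keepdim=True)       # normalize over valid scales
    return (features * weights.unsqueeze(-1)).sum(dim=0)   # (num_queries, dim)


if __name__ == "__main__":
    f = torch.randn(4, 1024, 32)
    counts = torch.randint(0, 10, (4, 1024))
    print(aggregate_multiscale(f, counts).shape)  # torch.Size([1024, 32])
```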
2312.02319 Report Kernel Diffusion: An Alternate Approach to Blind Deconvolution Yash Sanghvi, Yiheng Chi, Stanley H. Chan Blind deconvolution problems are severely ill-posed because neither the underlying signal nor the forward operator is known exactly. Conventionally, these problems are solved by alternating between estimation of the image and kernel while keeping the other fixed. In this paper, we show that this framework is flawed because of its tendency to get trapped in local minima and, instead, suggest the use of a kernel estimation strategy with a non-blind solver. This framework is employed by a diffusion method which is trained to sample the blur kernel from the conditional distribution with guidance from a pre-trained non-blind solver. The proposed diffusion method leads to state-of-the-art results on both synthetic and real blur datasets. This paper introduces Kernel-Diff, a novel diffusion-based blind deconvolution method that prioritizes kernel estimation over the conventional alternating minimization approach. Alternating minimization for blind deconvolution is prone to local minima. Directly estimating the kernel using a marginalization approach is more robust but computationally challenging. Kernel-Diff addresses this challenge using a diffusion model guided by a non-blind solver. Kernel-Diff employs a diffusion model trained to sample blur kernels from the conditional distribution p(k|y), effectively approximating the marginalization of the image space. A differentiable non-blind solver guides the diffusion process, minimizing the reblurring loss and ensuring a plausible kernel estimate. Kernel-Diff achieves state-of-the-art performance on both synthetic (BSD100) and real (RealBlur-50) blur datasets, outperforming existing methods in PSNR, SSIM, LPIPS and FID. An ablation study demonstrates the crucial role of the non-blind solver guidance in achieving superior performance. Analysis of the reblurring loss during diffusion confirms that the proposed kernel estimation strategy converges to a better local minimum compared to alternating minimization methods. The current implementation assumes spatially invariant blur, limiting its applicability to more general scenarios. Future work can explore better approximations of the image space marginalization or incorporate robustness to kernel inaccuracies in the non-blind solver. blind deconvolution, diffusion models, kernel estimation, non-blind solver, image restoration
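The guidance signal in Kernel-Diff comes from a reblurring consistency term: deconvolve with the current kernel estimate, re-blur, and compare to the observation. The sketch below shows that loss and one classifier-guidance-style correction step; `denoise_fn` and `nonblind_solver` are placeholders for the trained kernel diffusion network and a differentiable non-blind solver, and the update rule is our simplification of the paper's guided sampling.

```python
import torch
import torch.nn.functional as F

def reblur_loss(kernel, blurred, nonblind_solver):
    """Reblurring consistency: ||blur(nonblind(y, k), k) - y||^2.
    kernel: (1, 1, k, k); blurred: (B, C, H, W)."""
    sharp = nonblind_solver(blurred, kernel)
    pad = kernel.shape[-1] // 2
    reblurred = F.conv2d(sharp, kernel.expand(sharp.shape[1], 1, -1, -1),
                         padding=pad, groups=sharp.shape[1])
    return F.mse_loss(reblurred, blurred)


def guided_kernel_step(kernel_t, blurred, denoise_fn, nonblind_solver, guidance_scale=1.0):
    """One sketched guidance step: take the denoiser's kernel proposal and nudge
    it down the gradient of the reblurring loss (classifier-guidance style)."""
    kernel_t = kernel_t.detach().requires_grad_(True)
    proposal = denoise_fn(kernel_t, blurred)
    loss = reblur_loss(proposal, blurred, nonblind_solver)
    grad = torch.autograd.grad(loss, kernel_t)[0]
    return (proposal - guidance_scale * grad).detach()
```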
2312.02284 Report PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation Zhenyu Li, Shariq Farooq Bhat, Peter Wonka Single image depth estimation is a foundational task in computer vision and generative modeling. However, prevailing depth estimation models grapple with accommodating the increasing resolutions commonplace in today's consumer cameras and devices. Existing high-resolution strategies show promise, but they often face limitations, ranging from error propagation to the loss of high-frequency details. We present PatchFusion, a novel tile-based framework with three key components to improve the current state of the art: (1) A patch-wise fusion network that fuses a globally-consistent coarse prediction with finer, inconsistent tiled predictions via high-level feature guidance, (2) A Global-to-Local (G2L) module that adds vital context to the fusion network, discarding the need for patch selection heuristics, and (3) A Consistency-Aware Training (CAT) and Inference (CAI) approach, emphasizing patch overlap consistency and thereby eradicating the necessity for post-processing. Experiments on UnrealStereo4K, MVS-Synth, and Middleburry 2014 demonstrate that our framework can generate high-resolution depth maps with intricate details. PatchFusion is independent of the base model for depth estimation. Notably, our framework built on top of SOTA ZoeDepth brings improvements for a total of 17.3% and 29.4% in terms of the root mean squared error (RMSE) on UnrealStereo4K and MVS-Synth, respectively. This paper introduces PatchFusion, a novel tile-based framework for high-resolution monocular metric depth estimation that surpasses input resolution limitations of existing depth estimation models. Existing depth estimation models struggle with high-resolution images common in modern devices. PatchFusion addresses this by enabling the use of pre-trained models on high-resolution inputs without sacrificing accuracy or efficiency. PatchFusion uses three steps: (1) global scale-aware coarse depth estimation, (2) local fine-depth estimation on image patches, (3) fusion of coarse and fine predictions using a guided fusion network with a global-to-local module. It also employs consistency-aware training and inference for patch coherence. PatchFusion outperforms previous state-of-the-art methods on UnrealStereo4K and MVS-Synth datasets, showing significant improvements in RMSE, REL, and boundary delineation. The framework generalizes well to real-world images, as demonstrated on the Middlebury 2014 dataset in a zero-shot transfer setting. Ablation studies confirm the contribution of each component, particularly the effectiveness of the guided fusion network with the G2L module and consistency-aware training and inference. The computational efficiency of the framework can be further improved, especially when using a large number of randomly selected patches. The performance in real-world settings could benefit from the availability of large, high-resolution, real-world depth datasets for training. depth estimation, high-resolution, tile-based, monocular, deep learning
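Tile-based inference ultimately has to paste patch predictions back into one depth map. The sketch below shows the simplest version of that assembly, averaging in the overlaps; it is a stand-in for PatchFusion's consistency-aware inference, which additionally enforces agreement in the overlapping regions during prediction rather than only at blending time.

```python
import torch

def blend_tiles(tile_preds, tile_boxes, image_hw):
    """Paste per-patch depth predictions into a full-resolution map and average
    where patches overlap (simplified tile assembly, not the paper's code)."""
    H, W = image_hw
    depth = torch.zeros(H, W)
    weight = torch.zeros(H, W)
    for pred, (y0, x0, y1, x1) in zip(tile_preds, tile_boxes):
        depth[y0:y1, x0:x1] += pred
        weight[y0:y1, x0:x1] += 1.0
    return depth / weight.clamp(min=1.0)


if __name__ == "__main__":
    tiles = [torch.rand(128, 128) for _ in range(4)]
    boxes = [(0, 0, 128, 128), (0, 96, 128, 224), (96, 0, 224, 128), (96, 96, 224, 224)]
    print(blend_tiles(tiles, boxes, (224, 224)).shape)  # torch.Size([224, 224])
```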
2312.02256 Report EMDM: Efficient Motion Diffusion Model for Fast and High-Quality Motion Generation Wenyang Zhou, Zhiyang Dou, Zeyu Cao, Zhouyingcheng Liao, Jingbo Wang, Wenjia Wang, Yuan Liu, Taku Komura, Wenping Wang, Lingjie Liu We introduce Efficient Motion Diffusion Model (EMDM) for fast and high-quality human motion generation. Current state-of-the-art generative diffusion models have produced impressive results but struggle to achieve fast generation without sacrificing quality. On the one hand, previous works, like motion latent diffusion, conduct diffusion within a latent space for efficiency, but learning such a latent space can be a non-trivial effort. On the other hand, accelerating generation by naively increasing the sampling step size, e.g., DDIM, often leads to quality degradation as it fails to approximate the complex denoising distribution. To address these issues, we propose EMDM, which captures the complex distribution during multiple sampling steps in the diffusion model, allowing for much fewer sampling steps and significant acceleration in generation. This is achieved by a conditional denoising diffusion GAN to capture multimodal data distributions among arbitrary (and potentially larger) step sizes conditioned on control signals, enabling fewer-step motion sampling with high fidelity and diversity. To minimize undesired motion artifacts, geometric losses are imposed during network learning. As a result, EMDM achieves real-time motion generation and significantly improves the efficiency of motion diffusion models compared to existing methods while achieving high-quality motion generation. Our code will be publicly available upon publication. This paper introduces EMDM, an Efficient Motion Diffusion Model for real-time, high-quality human motion generation, addressing the speed-quality trade-off in existing diffusion-based methods. Current motion diffusion models struggle to achieve fast generation without compromising quality, limiting their real-world applicability. EMDM utilizes a conditional denoising diffusion GAN to model complex motion distributions over larger sampling step sizes. This allows for fewer denoising steps during generation, significantly improving speed. Additionally, geometric losses are incorporated during training to enhance motion quality. EMDM achieves real-time motion generation with competitive or superior quality compared to state-of-the-art methods. The model demonstrates significant speed improvements, particularly in text-to-motion tasks where it outperforms existing approaches. Ablation studies validate the contribution of key design choices like sampling step size and geometric loss weighting. The lack of physics-based considerations in the motion generation process may lead to artifacts like floating or ground penetration. Future work could explore incorporating physical constraints and expanding input modalities beyond text, such as visual or audio inputs. text-to-motion, motion generation, diffusion model, gan, efficient motion synthesis
2312.02253 Report Diversify, Don't Fine-Tune: Scaling Up Visual Recognition Training with Synthetic Images Zhuoran Yu, Chenchen Zhu, Sean Culatana, Raghuraman Krishnamoorthi, Fanyi Xiao, Yong Jae Lee Recent advances in generative deep learning have enabled the creation of high-quality synthetic images in text-to-image generation. Prior work shows that fine-tuning a pretrained diffusion model on ImageNet and generating synthetic training images from the finetuned model can enhance an ImageNet classifier's performance. However, performance degrades as synthetic images outnumber real ones. In this paper, we explore whether generative fine-tuning is essential for this improvement and whether it is possible to further scale up training using more synthetic data. We present a new framework leveraging off-the-shelf generative models to generate synthetic training images, addressing multiple challenges: class name ambiguity, lack of diversity in naive prompts, and domain shifts. Specifically, we leverage large language models (LLMs) and CLIP to resolve class name ambiguity. To diversify images, we propose contextualized diversification (CD) and stylized diversification (SD) methods, also prompted by LLMs. Finally, to mitigate domain shifts, we leverage domain adaptation techniques with auxiliary batch normalization for synthetic images. Our framework consistently enhances recognition model performance with more synthetic data, up to 6x of original ImageNet size showcasing the potential of synthetic data for improved recognition models and strong out-of-domain generalization. This paper proposes a new framework that leverages off-the-shelf generative models to generate synthetic training images, improving recognition model performance on large-scale datasets without the need for generative fine-tuning. Fine-tuning generative models for each dataset is resource-intensive and performance degrades when synthetic images outnumber real ones. This work explores the potential of using readily available generative models to overcome these limitations. The framework addresses challenges like class name ambiguity, lack of diversity in images, and domain shifts. It uses LLMs and CLIP to resolve ambiguity, introduces contextual and style diversification in prompts, and employs domain adaptation techniques with auxiliary batch normalization. The framework consistently improves recognition accuracy, outperforming methods using fine-tuned generative models, especially as synthetic data scales up. Models trained with synthetic data show strong out-of-domain generalization, achieving significant accuracy improvements on ImageNet variations. The method is effective in low-data and long-tail settings, demonstrating its potential to reduce annotation efforts. Training large vision transformer models with more synthetic data is computationally expensive. Further research is needed to optimize synthetic data generation for specific downstream tasks. synthetic data, image classification, diffusion models, domain adaptation, large language models
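The domain-adaptation component here is the classic auxiliary batch normalization trick: synthetic images get their own normalization statistics during training so they do not skew the statistics used for real images at test time. A minimal sketch of such a module (our own illustration of the technique, not the paper's exact implementation):

```python
import torch
import torch.nn as nn

class AuxiliaryBatchNorm2d(nn.Module):
    """Sketch of auxiliary BN for mixed real/synthetic training: real samples go
    through the main BN, synthetic samples through a parallel BN; at test time
    only the main branch is used."""
    def __init__(self, num_features: int):
        super().__init__()
        self.bn_real = nn.BatchNorm2d(num_features)
        self.bn_syn = nn.BatchNorm2d(num_features)

    def forward(self, x: torch.Tensor, is_synthetic: bool = False) -> torch.Tensor:
        if self.training and is_synthetic:
            return self.bn_syn(x)
        return self.bn_real(x)


if __name__ == "__main__":
    abn = AuxiliaryBatchNorm2d(64).train()
    real, syn = torch.randn(8, 64, 32, 32), torch.randn(8, 64, 32, 32)
    _ = abn(real, is_synthetic=False)
    _ = abn(syn, is_synthetic=True)
```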
2312.02238 Report X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model Lingmin Ran, Xiaodong Cun, Jia-Wei Liu, Rui Zhao, Song Zijie, Xintao Wang, Jussi Keppo, Mike Zheng Shou We introduce X-Adapter, a universal upgrader to enable the pretrained plug-and-play modules (e.g., ControlNet, LoRA) to work directly with the upgraded text-to-image diffusion model (e.g., SDXL) without further retraining. We achieve this goal by training an additional network to control the frozen upgraded model with the new text-image data pairs. In detail, X-Adapter keeps a frozen copy of the old model to preserve the connectors of different plugins. Additionally, X-Adapter adds trainable mapping layers that bridge the decoders from models of different versions for feature remapping. The remapped features will be used as guidance for the upgraded model. To enhance the guidance ability of X-Adapter, we employ a null-text training strategy for the upgraded model. After training, we also introduce a two-stage denoising strategy to align the initial latents of X-Adapter and the upgraded model. Thanks to our strategies, X-Adapter demonstrates universal compatibility with various plugins and also enables plugins of different versions to work together, thereby expanding the functionalities of diffusion community. To verify the effectiveness of the proposed method, we conduct extensive experiments and the results show that X-Adapter may facilitate wider application in the upgraded foundational diffusion model. X-Adapter is a universal adapter that upgrades pretrained plug-and-play modules for text-to-image diffusion models, enabling their use with newer models without retraining. The rapid development of plugins for diffusion models is often hindered by the emergence of newer models. X-Adapter solves this incompatibility, saving time and resources while enhancing plugin capabilities. X-Adapter freezes a copy of the old model and adds trainable mapping layers between its decoder and the upgraded model's decoder for feature remapping. This allows direct use of old plugins on the newer model, guided by the remapped features. A two-stage denoising strategy aligns the latent spaces of the models during inference. X-Adapter demonstrates universal compatibility with various plugins, including ControlNet and LoRA. It improves the performance of old plugins by leveraging the enhanced capabilities of upgraded models, as shown in quantitative and qualitative comparisons. X-Adapter enables plugin remixing, allowing plugins from different model versions to work together. X-Adapter may not fully preserve identity consistency for plugins like IP-Adapter that generate personalized concepts. Future work includes extending X-Adapter to improve concept customization capabilities. diffusion models, plug-and-play modules, model upgrading, parameter-efficient transfer learning, text-to-image generation
2312.02228 Report PixelLM: Pixel Reasoning with Large Multimodal Model Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, Xiaojie Jin While large multimodal models (LMMs) have achieved remarkable progress, generating pixel-level masks for image reasoning tasks involving multiple open-world targets remains a challenge. To bridge this gap, we introduce PixelLM, an effective and efficient LMM for pixel-level reasoning and understanding. Central to PixelLM is a novel, lightweight pixel decoder and a comprehensive segmentation codebook. The decoder efficiently produces masks from the hidden embeddings of the codebook tokens, which encode detailed target-relevant information. With this design, PixelLM harmonizes with the structure of popular LMMs and avoids the need for additional costly segmentation models. Furthermore, we propose a target refinement loss to enhance the model's ability to differentiate between multiple targets, leading to substantially improved mask quality. To advance research in this area, we construct MUSE, a high-quality multi-target reasoning segmentation benchmark. PixelLM excels across various pixel-level image reasoning and understanding tasks, outperforming well-established methods in multiple benchmarks, including MUSE, single- and multi-referring segmentation. Comprehensive ablations confirm the efficacy of each proposed component. All code, models, and datasets will be publicly available. PixelLM, an efficient and effective large multimodal model (LMM) for pixel-level reasoning and understanding, capable of handling tasks with multiple open-world targets and diverse reasoning complexities. Existing LMMs primarily generate textual descriptions and struggle with pixel-level responses like object masks, limiting their applications in tasks like image editing and robotics. PixelLM introduces a novel pixel decoder and a segmentation codebook. The codebook encodes target-relevant information at different visual scales, and the decoder generates masks based on these embeddings and image features. A target refinement loss further enhances the differentiation between multiple targets. PixelLM achieves state-of-the-art performance on multi-target reasoning segmentation, outperforming baselines including adapted LISA and SEEM. It demonstrates superior results on multi-referring segmentation benchmarks, surpassing LISA and its augmented variant. PixelLM also shows competitive performance on single-target referring segmentation (refCOCO series) despite not being specifically designed for this task. The model's performance might be further improved by incorporating object relationships into the data generation process. Investigating the integration of external knowledge bases to enhance reasoning capabilities for complex scenarios. large multimodal models, pixel-level reasoning, segmentation, image understanding, codebook
2312.02221 Report Slice3D: Multi-Slice, Occlusion-Revealing, Single View 3D Reconstruction Yizhi Wang, Wallace Lira, Wenqi Wang, Ali Mahdavi-Amiri, Hao Zhang We introduce multi-slice reasoning, a new notion for single-view 3D reconstruction which challenges the current and prevailing belief that multi-view synthesis is the most natural conduit between single-view and 3D. Our key observation is that object slicing is more advantageous than altering views to reveal occluded structures. Specifically, slicing is more occlusion-revealing since it can peel through any occluders without obstruction. In the limit, i.e., with infinitely many slices, it is guaranteed to unveil all hidden object parts. We realize our idea by developing Slice3D, a novel method for single-view 3D reconstruction which first predicts multi-slice images from a single RGB image and then integrates the slices into a 3D model using a coordinate-based transformer network for signed distance prediction. The slice images can be regressed or generated, both through a U-Net based network. For the former, we inject a learnable slice indicator code to designate each decoded image into a spatial slice location, while the slice generator is a denoising diffusion model operating on the entirety of slice images stacked on the input channels. We conduct extensive evaluation against state-of-the-art alternatives to demonstrate superiority of our method, especially in recovering complex and severely occluded shape structures, amid ambiguities. All Slice3D results were produced by networks trained on a single Nvidia A40 GPU, with an inference time less than 20 seconds. Slice3D, a novel single-view 3D reconstruction method that predicts multi-slice images to reveal occluded parts before reconstructing a 3D model. Addresses the fundamental challenge of single-view 3D reconstruction: faithfully reconstructing occluded parts from a single view, which prevailing multi-view synthesis methods struggle with. Predicts multi-slice images from a single RGB image using either a regressive U-Net with slice indicator codes or a generative denoising diffusion model, then integrates the slices into a 3D model using a coordinate-based transformer for signed distance prediction. Outperforms SOTA methods, including those using diffusion + NeRF, in recovering complex and severely occluded shapes. Demonstrates superior generalization ability across diverse object categories on Objaverse. Offers faster inference times compared to NeRF-based methods while achieving better reconstruction quality. Current implementation uses a fixed number of slices per direction, potentially limiting detail recovery. Slicing is mainly applicable to digital 3D models, limiting its applicability to real-world scenarios compared to multi-view methods. single-view 3d reconstruction, occlusion-revealing, multi-slice representation, denoising diffusion model, transformer network
2312.02218 Report WavePlanes: A compact Wavelet representation for Dynamic Neural Radiance Fields Adrian Azzarelli, Nantheera Anantrasirichai, David R Bull Dynamic Neural Radiance Fields (Dynamic NeRF) enhance NeRF technology to model moving scenes. However, they are resource intensive and challenging to compress. To address these issues, this paper presents WavePlanes, a fast and more compact explicit model. We propose a multi-scale space and space-time feature plane representation using N-level 2-D wavelet coefficients. The inverse discrete wavelet transform reconstructs feature signals at varying detail, which are linearly decoded to approximate the color and density of volumes in a 4-D grid. Exploiting the sparsity of wavelet coefficients, we compress the model using a Hash Map containing only non-zero coefficients and their locations on each plane. Compared to the state-of-the-art (SotA) plane-based models, WavePlanes is up to 15x smaller while being less resource demanding and competitive in performance and training time. Compared to other small SotA models WavePlanes preserves details better without requiring custom CUDA code or high performance computing resources. Our code is available at: https://github.com/azzarelli/waveplanes/ Presents WavePlanes, a novel dynamic Neural Radiance Field (NeRF) representation and compression method that utilizes wavelets to reduce computation and enable efficient compression. Dynamic NeRFs, while promising for modeling moving scenes, are resource intensive and difficult to compress. Existing solutions are either slow, large, or struggle with temporal detail. Decomposes the 4D scene into six 2D grids. Employs a multi-scale space-time feature plane representation using N-level 2D wavelet coefficients. Introduces a Zero-Agreement Masked (ZAM) fusion scheme for improved signal localization. Leverages wavelet sparsity for compression using a Hash Map. Achieves up to 15x compression compared to state-of-the-art plane-based models. Maintains competitive performance and training time despite being more compact. Preserves details better than other small dynamic NeRF models, particularly in regions of high frequency and occlusion. Limited in modeling objects outside the predefined bounding box. Modeling fast motion with a fixed temporal resolution can introduce noise. neural radiance fields, nerf, wavelets, compression, dynamic scenes
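The hash-map compression of sparse wavelet coefficients could be pictured roughly as follows, here for a single 2D feature plane using PyWavelets; the threshold, wavelet choice, and function names are assumptions, not the paper's implementation.

```python
import numpy as np
import pywt

def compress_plane(plane: np.ndarray, wavelet: str = "haar", level: int = 2, thresh: float = 1e-3):
    """Store only non-zero (thresholded) wavelet coefficients and their flat locations."""
    coeffs = pywt.wavedec2(plane, wavelet, level=level)
    arr, slices = pywt.coeffs_to_array(coeffs)
    arr[np.abs(arr) < thresh] = 0.0
    idx = np.nonzero(arr)
    # "Hash map": flat index -> coefficient value
    table = {int(i): float(v) for i, v in zip(np.ravel_multi_index(idx, arr.shape), arr[idx])}
    return table, arr.shape, slices

def decompress_plane(table, shape, slices, wavelet: str = "haar") -> np.ndarray:
    """Rebuild the dense coefficient array from the hash map, then inverse-transform."""
    arr = np.zeros(shape, dtype=np.float64)
    if table:
        flat = np.array(list(table.keys()))
        arr[np.unravel_index(flat, shape)] = np.array(list(table.values()))
    coeffs = pywt.array_to_coeffs(arr, slices, output_format="wavedec2")
    return pywt.waverec2(coeffs, wavelet)

if __name__ == "__main__":
    plane = np.outer(np.sin(np.linspace(0, 3, 64)), np.cos(np.linspace(0, 3, 64)))
    table, shape, slices = compress_plane(plane, thresh=1e-2)
    recon = decompress_plane(table, shape, slices)
    print(f"stored {len(table)} / {plane.size} coefficients, "
          f"max error {np.abs(recon - plane).max():.4f}")
```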
2312.02216 Report DragVideo: Interactive Drag-style Video Editing Yufan Deng, Ruida Wang, Yuhao Zhang, Yu-Wing Tai, Chi-Keung Tang Video generation models have shown their superior ability to generate photo-realistic video. However, how to accurately control (or edit) the video remains a formidable challenge. The main issues are: 1) how to perform direct and accurate user control in editing; 2) how to execute edits like changing shape, expression, and layout without unsightly distortion and artifacts to the edited content; and 3) how to maintain spatio-temporal consistency of video after editing. To address the above issues, we propose DragVideo, a general drag-style video editing framework. Inspired by DragGAN, DragVideo addresses issues 1) and 2) by proposing a drag-style video latent optimization method that gives the desired control by updating the noisy video latent according to drag instructions through a video-level drag objective function. We address issue 3) by integrating the video diffusion model with sample-specific LoRA and Mutual Self-Attention in DragVideo to ensure the edited result is spatio-temporally consistent. We also present a series of testing examples for drag-style video editing and conduct extensive experiments across a wide array of challenging editing tasks, such as motion and skeleton editing, underscoring that DragVideo can edit videos in an intuitive manner that is faithful to the user's intention, with nearly unnoticeable distortion and artifacts, while maintaining spatio-temporal consistency. While traditional prompt-based video editing fails at the former two and directly applying image drag editing fails at the last, these results emphasize DragVideo's versatility and generality. Github link: https://github.com/RickySkywalker/DragVideo-Official. This paper introduces DragVideo, the first end-to-end framework for drag-style video editing that enables accurate and intuitive video editing while maintaining spatio-temporal consistency. Existing video editing methods either struggle with accurate and artifact-free editing (e.g., prompt-based methods) or fail to maintain temporal consistency when directly applying image-based drag editing across frames. DragVideo uses a video diffusion model, sample-specific LoRA, and Mutual Self-Attention. It optimizes a noisy video latent based on user-provided drag instructions (points and masks), which are propagated across frames. DragVideo achieves high-quality, drag-based editing on real-world videos with accurate and artifact-free results. It effectively addresses the temporal inconsistency issues present in frame-by-frame drag editing approaches. Quantitative evaluations and user studies confirm DragVideo's superior performance in temporal consistency and editing effectiveness compared to baselines. Some edited outputs still exhibit blurriness and spatial inconsistency, suggesting a need for further optimization in visual quality. The framework currently has high computational costs, necessitating improvements in computational efficiency. video editing, drag-style editing, video diffusion models, temporal consistency, user-guided editing
2312.02214 Report FlashAvatar: High-fidelity Head Avatar with Efficient Gaussian Embedding Jun Xiang, Xuan Gao, Yudong Guo, Juyong Zhang We propose FlashAvatar, a novel and lightweight 3D animatable avatar representation that could reconstruct a digital avatar from a short monocular video sequence in minutes and render high-fidelity photo-realistic images at 300FPS on a consumer-grade GPU. To achieve this, we maintain a uniform 3D Gaussian field embedded in the surface of a parametric face model and learn extra spatial offset to model non-surface regions and subtle facial details. While full use of geometric priors can capture high-frequency facial details and preserve exaggerated expressions, proper initialization can help reduce the number of Gaussians, thus enabling super-fast rendering speed. Extensive experimental results demonstrate that FlashAvatar outperforms existing works regarding visual quality and personalized details and is almost an order of magnitude faster in rendering speed. Project page: https://ustc3dv.github.io/FlashAvatar/ FlashAvatar is a novel, lightweight, and animatable 3D avatar representation that can reconstruct high-fidelity digital avatars from short monocular videos. It addresses the limitations of existing methods, such as 3DMMs' inability to model complex features and NeRF's slow rendering speeds, paving the way for real-time interactive digital human applications. It leverages a mesh-embedded Gaussian field initialized on a parametric face model (FLAME), learns spatial offsets for non-surface details, and employs efficient UV sampling for optimal Gaussian distribution. Achieves photo-realistic rendering quality with fine details and subtle expressions. Outperforms previous methods in terms of visual quality and preserves personalized details. Enables super-fast rendering at 300FPS on consumer-grade GPUs due to a low Gaussian count (10K level). Performance is contingent on accurate FLAME mesh tracking, with potential for detail loss or misalignment due to tracking errors. Currently relies on tracked expression codes for animation, limiting its ability to model complex hair dynamics. digital avatar, 3d gaussian splatting, facial reenactment, real-time rendering, 3d reconstruction
2312.02201 Report ImageDream: Image-Prompt Multi-view Diffusion for 3D Generation Peng Wang, Yichun Shi We introduce "ImageDream," an innovative image-prompt, multi-view diffusion model for 3D object generation. ImageDream stands out for its ability to produce 3D models of higher quality compared to existing state-of-the-art, image-conditioned methods. Our approach utilizes a canonical camera coordination for the objects in images, improving visual geometry accuracy. The model is designed with various levels of control at each block inside the diffusion model based on the input image, where global control shapes the overall object layout and local control fine-tunes the image details. The effectiveness of ImageDream is demonstrated through extensive evaluations using a standard prompt list. For more information, visit our project page at https://Image-Dream.github.io. ImageDream: a novel Image-Prompt Multi-view diffusion model for high-quality 3D object generation from a single image, surpassing previous SoTA in geometry and texture quality. Images provide richer visual information for 3D generation than text, leading to more accurate and detailed 3D models. ImageDream uses a canonical camera coordination for consistent geometry and a multi-level image-prompt controller for granular control over object layout and appearance. It leverages a multi-view diffusion network and score distillation sampling for 3D model creation. Significantly outperforms other SoTA baselines in user studies based on geometry quality and similarity to the image prompt. Successfully addresses geometric inaccuracies present in previous methods. Maintains high image quality in both diffusion and post-3D fusion stages according to quantitative metrics like QIS and CLIP scores. Struggles with capturing fine details when image constraints are overly stringent, resulting in potential blurriness. Requires a better estimation of image intrinsic and extrinsic properties for optimal performance. 3d object generation, image-prompt, multi-view diffusion, canonical camera coordination, score distillation sampling
2312.02197 Report Test-Time Degradation Adaption for Open-Set Image Restoration Yuanbiao Gou, Haiyu Zhao, Boyun Li, Xinyan Xiao, Xi Peng In contrast to close-set scenarios that restore images from a predefined set of degradations, open-set image restoration aims to handle the unknown degradations that were unforeseen during the pretraining phase, a setting that has received little attention as far as we know. In this work, we explicitly study this challenging problem and reveal its essence, i.e., the unidentified distribution shifts between test and training data. Recently, test-time adaptation has emerged as a fundamental method to address these inherent disparities. Inspired by this, we propose a test-time degradation adaption framework for open-set image restoration, which involves three components, i.e., i) a pre-trained and degradation-agnostic diffusion model to generate clean images, ii) a test-time degradation adapter that adapts to the unknown degradations based on the input image during the testing phase, and iii) adapter-guided image restoration that guides the model through the adapter to produce the corresponding clean image. Through experiments on multiple degradations absent from the training data, we show that our method achieves comparable or even better performance than task-specific methods. This paper introduces the problem of open-set image restoration (OIR), where the task is to restore clean images from degradations not present in the training data, and proposes a Test-time degradation Adaption framework (TAO) to address it. Most existing image restoration methods operate under a close-set scenario, limiting their applicability to real-world situations where diverse and unforeseen degradations are common. OIR aims to tackle this limitation by enabling models to handle unknown degradations. TAO leverages a pre-trained diffusion model and incorporates two novel components: 1) a Test-time Degradation Adapter (TDA) that aligns the model to the unknown degradation during testing and 2) an Adapter-guided Image Restoration (AIR) module that dynamically adjusts supervision strategies throughout the denoising process. TAO achieves comparable or better performance than task-specific zero-shot methods on image dehazing, low-light enhancement, and denoising. The TDA effectively aligns the generated image domain to that of the degraded input. AIR significantly improves restoration quality by dynamically adjusting guidance strategies throughout the denoising process. The current method relies on heuristics for dividing the denoising process into stages for AIR. Exploration of alternative adapter architectures and more principled stage division strategies could further enhance performance. image restoration, open-set learning, test-time adaptation, diffusion models, domain adaptation
2312.02190 Report Diffusion Handles: Enabling 3D Edits for Diffusion Models by Lifting Activations to 3D Karran Pandey, Paul Guerrero, Matheus Gadelha, Yannick Hold-Geoffroy, Karan Singh, Niloy Mitra Diffusion Handles is a novel approach to enabling 3D object edits on diffusion images. We accomplish these edits using existing pre-trained diffusion models, and 2D image depth estimation, without any fine-tuning or 3D object retrieval. The edited results remain plausible, photo-real, and preserve object identity. Diffusion Handles address a critically missing facet of generative image based creative design, and significantly advance the state-of-the-art in generative image editing. Our key insight is to lift diffusion activations for an object to 3D using a proxy depth, 3D-transform the depth and associated activations, and project them back to image space. The diffusion process applied to the manipulated activations with identity control, produces plausible edited images showing complex 3D occlusion and lighting effects. We evaluate Diffusion Handles: quantitatively, on a large synthetic data benchmark; and qualitatively by a user study, showing our output to be more plausible, and better than prior art at both, 3D editing and identity control. Project Webpage: https://diffusionhandles.github.io/ Introduces Diffusion Handles, a method for 3D-aware object editing in diffusion-generated images using estimated depth maps and diffusion activations. Addresses limitations of existing diffusion-based image editing techniques by enabling plausible 3D object transformations while preserving object identity. Lifts diffusion activations to 3D using depth maps, applies 3D transformations, projects back to image space, and guides the diffusion process with the edited activations. Outperforms baselines in terms of plausibility, identity preservation, and edit adherence as measured by user study. Demonstrates robustness to depth map inaccuracies and artifacts. Shows consistent performance on a synthetic benchmark of randomly generated scenes and edits. Large edits revealing depth estimation errors can lead to low-quality outputs. Identity preservation, though improved, can be further enhanced. image editing, diffusion models, 3d-aware editing, generative models, depth estimation
2312.02189 Report StableDreamer: Taming Noisy Score Distillation Sampling for Text-to-3D Pengsheng Guo, Hans Hao, Adam Caccavale, Zhongzheng Ren, Edward Zhang, Qi Shan, Aditya Sankar, Alexander G. Schwing, Alex Colburn, Fangchang Ma In the realm of text-to-3D generation, utilizing 2D diffusion models through score distillation sampling (SDS) frequently leads to issues such as blurred appearances and multi-faced geometry, primarily due to the intrinsically noisy nature of the SDS loss. Our analysis identifies the core of these challenges as the interaction among noise levels in the 2D diffusion process, the architecture of the diffusion network, and the 3D model representation. To overcome these limitations, we present StableDreamer, a methodology incorporating three advances. First, inspired by InstructNeRF2NeRF, we formalize the equivalence of the SDS generative prior and a simple supervised L2 reconstruction loss. This finding provides a novel tool to debug SDS, which we use to show the impact of time-annealing noise levels on reducing multi-faced geometries. Second, our analysis shows that while image-space diffusion contributes to geometric precision, latent-space diffusion is crucial for vivid color rendition. Based on this observation, StableDreamer introduces a two-stage training strategy that effectively combines these aspects, resulting in high-fidelity 3D models. Third, we adopt an anisotropic 3D Gaussians representation, replacing Neural Radiance Fields (NeRFs), to enhance the overall quality, reduce memory usage during training, accelerate rendering speeds, and better capture semi-transparent objects. StableDreamer reduces multi-face geometries, generates fine details, and converges stably. StableDreamer is a text-to-3D generation framework that addresses blurry appearances and multi-face geometry problems common in existing score distillation sampling (SDS) methods. Existing text-to-3D approaches, particularly those using SDS, struggle with issues like blurry appearance, oversimplified geometry, multi-face artifacts, and slow optimization and rendering. The paper introduces three key advances: 1) reinterpretation of SDS loss as a supervised reconstruction task enabling noise annealing and training visualization, 2) a two-stage training approach utilizing both image-space and latent-space diffusion models for enhanced geometry and color quality, and 3) integration of 3D Gaussian Splatting (3DGS) representation with specialized initialization and density control for improved detail. Time-annealing of noise levels in SDS significantly reduces multi-face artifacts. Two-stage training with image-space diffusion followed by latent-space diffusion results in both geometric accuracy and vivid, detailed appearances. Integrating 3DGS with proposed regularization techniques leads to high-fidelity models with fast rendering speeds (over 30 FPS). The method still encounters failure cases with certain prompts where the 2D diffusion model struggles. Future work could explore techniques to address the remaining failure cases and improve the overall robustness of the system. text-to-3d generation, score distillation sampling, 3d gaussian splatting, diffusion models, multi-view consistency
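The reinterpretation of SDS as a supervised L2 reconstruction with time-annealed noise might look roughly like this sketch; the denoiser is stubbed out and the schedule is an assumption, not StableDreamer's exact formulation.

```python
import torch

def annealed_t_max(step: int, total_steps: int, t_start: float = 0.98, t_end: float = 0.5) -> float:
    """Linearly anneal the upper bound on the sampled diffusion time."""
    frac = min(step / max(total_steps, 1), 1.0)
    return t_start + frac * (t_end - t_start)

def sds_as_reconstruction(render: torch.Tensor, denoiser, alphas_cumprod: torch.Tensor,
                          step: int, total_steps: int) -> torch.Tensor:
    """SDS written as an L2 loss against a one-step denoised target.

    `denoiser(x_t, t)` is assumed to predict the noise added at timestep t.
    """
    num_t = alphas_cumprod.shape[0]
    t_max = annealed_t_max(step, total_steps)
    t = torch.randint(int(0.02 * num_t), int(t_max * num_t), (1,)).item()
    a_bar = alphas_cumprod[t]

    noise = torch.randn_like(render)
    x_t = a_bar.sqrt() * render + (1 - a_bar).sqrt() * noise
    with torch.no_grad():
        eps_pred = denoiser(x_t, t)
        x_hat = (x_t - (1 - a_bar).sqrt() * eps_pred) / a_bar.sqrt()  # one-step estimate
    # Gradient flows only through `render`, matching the SDS update direction up to weighting.
    return 0.5 * ((render - x_hat) ** 2).mean()

if __name__ == "__main__":
    betas = torch.linspace(1e-4, 0.02, 1000)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    dummy_denoiser = lambda x_t, t: torch.zeros_like(x_t)  # stand-in for a 2D diffusion model
    render = torch.rand(1, 3, 64, 64, requires_grad=True)
    loss = sds_as_reconstruction(render, dummy_denoiser, alphas_cumprod, step=100, total_steps=1000)
    loss.backward()
    print(loss.item(), render.grad.abs().mean().item())
```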
2312.02157 Report Mesh-Guided Neural Implicit Field Editing Can Wang, Mingming He, Menglei Chai, Dongdong Chen, Jing Liao Neural implicit fields have emerged as a powerful 3D representation for reconstructing and rendering photo-realistic views, yet they possess limited editability. Conversely, explicit 3D representations, such as polygonal meshes, offer ease of editing but may not be as suitable for rendering high-quality novel views. To harness the strengths of both representations, we propose a new approach that employs a mesh as a guiding mechanism in editing the neural radiance field. We first introduce a differentiable method using marching tetrahedra for polygonal mesh extraction from the neural implicit field and then design a differentiable color extractor to assign colors obtained from the volume renderings to this extracted mesh. This differentiable colored mesh allows gradient back-propagation from the explicit mesh to the implicit fields, empowering users to easily manipulate the geometry and color of neural implicit fields. To enhance user control from coarse-grained to fine-grained levels, we introduce an octree-based structure into its optimization. This structure prioritizes the edited regions and the surface part, making our method achieve fine-grained edits to the neural implicit field and accommodate various user modifications, including object additions, component removals, specific area deformations, and adjustments to local and global colors. Through extensive experiments involving diverse scenes and editing operations, we have demonstrated the capabilities and effectiveness of our method. Our project page is: https://cassiepython.github.io/MNeuEdit/ This paper introduces a novel mesh-guided editing method for neural implicit fields that enables users to edit the geometry and color of neural implicit fields with the ease of manipulating explicit 3D meshes. Editing neural implicit fields, while offering high-fidelity rendering, is challenging due to their implicit representation. This work aims to bridge the gap by using the intuitive editing capabilities of explicit 3D meshes to guide modifications in implicit fields, making the process user-friendly and compatible with existing 3D modeling workflows. The method leverages differentiable marching tetrahedra for mesh extraction from neural implicit fields and introduces a differentiable color extractor to assign colors to the mesh vertices. An octree-based structure is incorporated for optimization, allowing for fine-grained edits. The framework supports a two-step process: first, optimizing the density field for geometry editing, and then the color function for color modifications. The proposed method enables extensive editing capabilities, including object addition, component removal, deformation, and precise color editing, surpassing previous methods in flexibility. The octree-based optimization significantly reduces computational demands while allowing for fine-grained editing of geometry and color, addressing the limitations of using dense grids. The method outperforms existing approaches in achieving fine-grained and consistent editing results, especially in challenging scenarios like complex textures and uneven surfaces. The method currently lacks direct support for editing scene shading and lighting, requiring users to bake these features into vertex colors. Editing highly intricate structures that do not produce high-quality meshes, such as human hair, remains challenging due to the reliance on a reliable underlying mesh structure. neural implicit fields, 3d mesh editing, differentiable rendering, octree-based optimization, geometry and color editing
2312.02155 Report GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis Shunyuan Zheng, Boyao Zhou, Ruizhi Shao, Boning Liu, Shengping Zhang, Liqiang Nie, Yebin Liu We present a new approach, termed GPS-Gaussian, for synthesizing novel views of a character in a real-time manner. The proposed method enables 2K-resolution rendering under a sparse-view camera setting. Unlike the original Gaussian Splatting or neural implicit rendering methods that necessitate per-subject optimizations, we introduce Gaussian parameter maps defined on the source views and regress directly Gaussian Splatting properties for instant novel view synthesis without any fine-tuning or optimization. To this end, we train our Gaussian parameter regression module on a large amount of human scan data, jointly with a depth estimation module to lift 2D parameter maps to 3D space. The proposed framework is fully differentiable and experiments on several datasets demonstrate that our method outperforms state-of-the-art methods while achieving an exceeding rendering speed. Presents GPS-Gaussian, a novel method for real-time synthesis of high-fidelity novel views of human characters from sparse multi-view RGB inputs, utilizing a generalizable 3D Gaussian Splatting approach. Addresses the limitations of existing human NVS methods, which are either computationally expensive (e.g., NeRF-based) or lack generalizability (e.g., per-subject optimization in 3D Gaussian Splatting), hindering real-time applications. Introduces pixel-wise Gaussian parameter maps on 2D image planes, jointly learns an iterative depth estimation module and a Gaussian parameter regression module, and leverages differentiable rendering for end-to-end training. Achieves real-time performance exceeding 25 FPS for 2K resolution rendering on a single GPU. Outperforms state-of-the-art methods in terms of rendering quality, particularly in handling occlusions and thin structures. Demonstrates strong generalization ability, enabling instant rendering of unseen characters without requiring per-subject optimization. Requires accurate foreground matting as a preprocessing step, limiting its application to general scenes. Relies on ground truth depth for supervision during training, posing challenges for data acquisition. novel view synthesis, human performance rendering, 3d gaussian splatting, depth estimation, real-time rendering
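One ingredient of lifting 2D parameter maps to 3D can be pictured with a generic pinhole unprojection of an estimated depth map into per-pixel Gaussian centers; this is an illustrative fragment, not the authors' pipeline, and the remaining Gaussian properties would come from the regression module.

```python
import torch

def depth_to_gaussian_centers(depth: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """Unproject an HxW depth map into per-pixel 3D points (candidate Gaussian means).

    depth: (H, W) metric depth, K: (3, 3) pinhole intrinsics.
    Returns (H*W, 3) points in the camera frame.
    """
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=depth.dtype),
                          torch.arange(W, dtype=depth.dtype), indexing="ij")
    ones = torch.ones_like(u)
    pix = torch.stack([u, v, ones], dim=-1).reshape(-1, 3)   # homogeneous pixel coordinates
    rays = (torch.linalg.inv(K) @ pix.T).T                   # camera rays per pixel
    return rays * depth.reshape(-1, 1)                       # scale rays by depth

if __name__ == "__main__":
    H, W = 4, 6
    K = torch.tensor([[500.0, 0.0, W / 2], [0.0, 500.0, H / 2], [0.0, 0.0, 1.0]])
    depth = torch.full((H, W), 2.0)
    print(depth_to_gaussian_centers(depth, K).shape)  # torch.Size([24, 3])
```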
2312.02153 Report Aligning and Prompting Everything All at Once for Universal Visual Perception Yunhang Shen, Chaoyou Fu, Peixian Chen, Mengdan Zhang, Ke Li, Xing Sun, Yunsheng Wu, Shaohui Lin, Rongrong Ji Vision foundation models have been explored recently to build general-purpose vision systems. However, predominant paradigms, driven by casting instance-level tasks as an object-word alignment, bring heavy cross-modality interaction, which is not effective in prompting object detection and visual grounding. Another line of work that focuses on pixel-level tasks often encounters a large annotation gap of things and stuff, and suffers from mutual interference between foreground-object and background-class segmentation. In stark contrast to the prevailing methods, we present APE, a universal visual perception model for aligning and prompting everything all at once in an image to perform diverse tasks, i.e., detection, segmentation, and grounding, as an instance-level sentence-object matching paradigm. Specifically, APE advances the convergence of detection and grounding by reformulating language-guided grounding as open-vocabulary detection, which efficiently scales up model prompting to thousands of category vocabularies and region descriptions while maintaining the effectiveness of cross-modality fusion. To bridge the granularity gap of different pixel-level tasks, APE equalizes semantic and panoptic segmentation to proxy instance learning by considering any isolated regions as individual instances. APE aligns vision and language representation on broad data with natural and challenging characteristics all at once without task-specific fine-tuning. The extensive experiments on over 160 datasets demonstrate that, with only one suite of weights, APE outperforms (or is on par with) the state-of-the-art models, proving that an effective yet universal perception for anything aligning and prompting is indeed feasible. Codes and trained models are released at https://github.com/shenyunhang/APE. This paper introduces APE, a novel vision foundation model trained on diverse datasets for multiple tasks, including open-vocabulary object detection, various image segmentation types (semantic, instance, panoptic), and visual grounding. Existing VFMs face limitations such as heavy cross-modality interaction leading to inefficiency in prompting, large annotation gaps between things and stuff, and mutual interference between foreground and background segmentation. APE addresses these limitations. APE leverages an instance-level region-sentence matching paradigm. It utilizes compact sentence representations for efficient vision-language interaction, equalizes semantic and panoptic segmentation to a proxy instance learning objective, and aligns vision and language representations on broad data without task-specific fine-tuning. APE achieves state-of-the-art or competitive performance on over 160 datasets across all tasks with a single set of weights, demonstrating strong generalization ability. The model effectively handles large-scale text prompts for querying thousands of categories and sentences in a single forward pass. APE effectively unifies the learning of thing and stuff categories, addressing the granularity discrepancy in previous methods. The current implementation of APE relies on instance-level annotations, leading to a disadvantage in panoptic segmentation evaluation due to potentially overlapping segments. Future work could explore leveraging stronger language models and pre-training methods. vision foundation models, open-vocabulary object detection, image segmentation, visual grounding, vision-language models
2312.02150 Report Readout Guidance: Learning Control from Diffusion Features Grace Luo, Trevor Darrell, Oliver Wang, Dan B Goldman, Aleksander Holynski We present Readout Guidance, a method for controlling text-to-image diffusion models with learned signals. Readout Guidance uses readout heads, lightweight networks trained to extract signals from the features of a pre-trained, frozen diffusion model at every timestep. These readouts can encode single-image properties, such as pose, depth, and edges; or higher-order properties that relate multiple images, such as correspondence and appearance similarity. Furthermore, by comparing the readout estimates to a user-defined target, and back-propagating the gradient through the readout head, these estimates can be used to guide the sampling process. Compared to prior methods for conditional generation, Readout Guidance requires significantly fewer added parameters and training samples, and offers a convenient and simple recipe for reproducing different forms of conditional control under a single framework, with a single architecture and sampling procedure. We showcase these benefits in the applications of drag-based manipulation, identity-consistent generation, and spatially aligned control. Project page: https://readout-guidance.github.io. The paper proposes Readout Guidance, a method for controlling text-to-image diffusion models by training lightweight readout heads on diffusion features to extract and guide image generation towards desired properties. This method provides flexible user control over diffusion models without expensive finetuning, requiring significantly fewer training samples and time compared to existing techniques. Readout heads are trained on top of frozen diffusion models to extract single-image properties like pose and depth, or relative properties between images like appearance similarity. These learned readouts then guide the sampling process towards user-defined targets or references. Readout Guidance achieves state-of-the-art performance on drag-based image manipulation, outperforming previous methods that require per-example finetuning. The method enables identity-consistent generation, preserving the appearance of a subject from a reference image without subject-specific training. Readout Guidance is effective for spatially aligned controls like pose, depth, and edge guidance, achieving competitive performance with significantly less training data and parameters compared to finetuned models. Readout Guidance requires additional memory and runtime during sampling due to gradient computations. The method may sometimes produce unrealistic or cartoonish imagery while satisfying readout constraints, requiring careful tuning of guidance strength. diffusion models, image generation, conditional image synthesis, sampling-time guidance, image manipulation
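The guidance mechanism, back-propagating a readout error into the sampling step, can be sketched generically as below; the readout head and feature extractor are stand-ins, not the released implementation.

```python
import torch
import torch.nn as nn

class ReadoutHead(nn.Module):
    """Tiny stand-in for a readout head mapping diffusion features to a property map."""

    def __init__(self, feat_dim: int, out_dim: int = 1):
        super().__init__()
        self.proj = nn.Conv2d(feat_dim, out_dim, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)

def readout_guidance_step(latent: torch.Tensor, feature_fn, head: ReadoutHead,
                          target: torch.Tensor, guidance_scale: float = 1.0) -> torch.Tensor:
    """Nudge the latent so the readout estimate moves toward the user-defined target."""
    latent = latent.detach().requires_grad_(True)
    feats = feature_fn(latent)              # frozen diffusion features (stubbed below)
    readout = head(feats)
    loss = ((readout - target) ** 2).mean()
    grad = torch.autograd.grad(loss, latent)[0]
    return (latent - guidance_scale * grad).detach()

if __name__ == "__main__":
    feat_dim, H, W = 8, 16, 16
    head = ReadoutHead(feat_dim)
    feature_fn = nn.Conv2d(4, feat_dim, kernel_size=3, padding=1)  # stand-in for UNet features
    latent = torch.randn(1, 4, H, W)
    target = torch.zeros(1, 1, H, W)                               # e.g. a desired depth/pose map
    new_latent = readout_guidance_step(latent, feature_fn, head, target, guidance_scale=0.5)
    print((new_latent - latent).abs().mean().item())
```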
2312.02149 Report Generative Powers of Ten Xiaojuan Wang, Janne Kontkanen, Brian Curless, Steve Seitz, Ira Kemelmacher, Ben Mildenhall, Pratul Srinivasan, Dor Verbin, Aleksander Holynski We present a method that uses a text-to-image model to generate consistent content across multiple image scales, enabling extreme semantic zooms into a scene, e.g., ranging from a wide-angle landscape view of a forest to a macro shot of an insect sitting on one of the tree branches. We achieve this through a joint multi-scale diffusion sampling approach that encourages consistency across different scales while preserving the integrity of each individual sampling process. Since each generated scale is guided by a different text prompt, our method enables deeper levels of zoom than traditional super-resolution methods that may struggle to create new contextual structure at vastly different scales. We compare our method qualitatively with alternative techniques in image super-resolution and outpainting, and show that our method is most effective at generating consistent multi-scale content. This paper introduces a method for generating consistent content across multiple image scales using a text-to-image diffusion model, enabling extreme semantic zooms into a scene described by a series of text prompts at varying zoom levels. This method addresses the limitations of traditional super-resolution methods, which struggle to create new contextual structure at vastly different scales, by leveraging semantic information from text prompts. The method employs a joint multi-scale diffusion sampling approach, utilizing a zoom stack representation and multi-resolution blending to ensure consistency across different scales while preserving the integrity of individual sampling processes. The method successfully generates consistent and high-quality zoom sequences for various zoom factors and scenes. It outperforms baseline methods like diffusion-based outpainting and super-resolution models in terms of consistency and image quality. The method allows for user control by incorporating real images or editing text prompts to guide the generation process. Identifying appropriate text prompts that align with specific scales and the text-to-image model's training data remains challenging. Future work could explore optimizing geometric transformations or text embeddings for better alignment between zoom levels and prompts. text-to-image generation, semantic zoom, multi-scale representation, diffusion models, joint sampling
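The zoom-stack idea can be approximated by rendering each level so that its central region agrees with a downscaled copy of the next finer level; this is a simplified stand-in for the paper's joint multi-scale sampling and blending, with all names illustrative.

```python
import torch
import torch.nn.functional as F

def render_zoom_stack(stack: list, zoom_factor: int = 2) -> list:
    """Render each zoom level consistently with all finer levels.

    stack[i] is a (C, H, W) image; stack[i+1] depicts the central 1/zoom_factor
    region of stack[i]. Working from the finest level outward, each finer level
    is downscaled and pasted into the center of the coarser one.
    """
    rendered = [img.clone() for img in stack]
    for i in range(len(stack) - 2, -1, -1):
        C, H, W = rendered[i].shape
        h, w = H // zoom_factor, W // zoom_factor
        small = F.interpolate(rendered[i + 1][None], size=(h, w),
                              mode="bilinear", align_corners=False)[0]
        y0, x0 = (H - h) // 2, (W - w) // 2
        rendered[i][:, y0:y0 + h, x0:x0 + w] = small
    return rendered

if __name__ == "__main__":
    stack = [torch.rand(3, 64, 64) for _ in range(4)]  # coarse -> fine zoom levels
    out = render_zoom_stack(stack, zoom_factor=2)
    print(len(out), out[0].shape)
```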
2312.02147 Report Rejuvenating image-GPT as Strong Visual Representation Learners Sucheng Ren, Zeyu Wang, Hongru Zhu, Junfei Xiao, Alan Yuille, Cihang Xie This paper enhances image-GPT (iGPT), one of the pioneering works that introduce autoregressive pretraining to predict next pixels for visual representation learning. Two simple yet essential changes are made. First, we shift the prediction target from raw pixels to semantic tokens, enabling a higher-level understanding of visual content. Second, we supplement the autoregressive modeling by instructing the model to predict not only the next tokens but also the visible tokens. This pipeline is particularly effective when semantic tokens are encoded by discriminatively trained models, such as CLIP. We introduce this novel approach as D-iGPT. Extensive experiments showcase that D-iGPT excels as a strong learner of visual representations: A notable achievement of D-iGPT is its compelling performance on the ImageNet-1K dataset -- by training on publicly available datasets, D-iGPT achieves 89.5% top-1 accuracy with a vanilla ViT-Large model. This model also shows strong generalization on the downstream task and robustness on out-of-distribution samples. Code is available at https://github.com/OliverRensu/D-iGPT. This paper introduces D-iGPT, an enhanced version of the iGPT model that predicts semantic tokens instead of raw pixels for visual representation learning, achieving strong performance on ImageNet using publicly available data. The work addresses the limitations of existing self-supervised learning methods in computer vision, particularly the under-exploration of autoregressive pretraining for high-quality visual representation learning at scale. D-iGPT modifies iGPT by 1) predicting semantic tokens obtained from a discriminatively trained model like CLIP, and 2) adding supervision for visible tokens to improve training. D-iGPT achieves 89.5% top-1 accuracy on ImageNet-1K using publicly available datasets, exceeding previous state-of-the-art methods. D-iGPT demonstrates strong generalization by surpassing MAE counterparts on semantic segmentation using ADE20K. D-iGPT exhibits superior robustness compared to existing methods across various out-of-domain ImageNet datasets. The reliance on a separate model for generating semantic tokens introduces potential limitations depending on the chosen model's performance. Further exploration is needed to fully understand the impact of scaling D-iGPT to even larger datasets and model sizes. self-supervised learning, autoregressive pretraining, vision transformer, image classification, semantic segmentation
2312.02145 Report Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, Konrad Schindler Monocular depth estimation is a fundamental computer vision task. Recovering 3D depth from a single image is geometrically ill-posed and requires scene understanding, so it is not surprising that the rise of deep learning has led to a breakthrough. The impressive progress of monocular depth estimators has mirrored the growth in model capacity, from relatively modest CNNs to large Transformer architectures. Still, monocular depth estimators tend to struggle when presented with images with unfamiliar content and layout, since their knowledge of the visual world is restricted by the data seen during training, and challenged by zero-shot generalization to new domains. This motivates us to explore whether the extensive priors captured in recent generative diffusion models can enable better, more generalizable depth estimation. We introduce Marigold, a method for affine-invariant monocular depth estimation that is derived from Stable Diffusion and retains its rich prior knowledge. The estimator can be fine-tuned in a couple of days on a single GPU using only synthetic training data. It delivers state-of-the-art performance across a wide range of datasets, including over 20% performance gains in specific cases. Project page: https://marigoldmonodepth.github.io. This paper introduces Marigold, a resource-efficient fine-tuning protocol that transforms a pre-trained Latent Diffusion Model (Stable Diffusion v2) into an image-conditioned depth estimator, enabling state-of-the-art affine-invariant monocular depth estimation. Existing monocular depth estimators often struggle with unfamiliar content, highlighting the need for methods that leverage broader visual priors and generalize well. Diffusion models, trained on massive image datasets, offer such priors, making them promising candidates for this task. The authors freeze the pre-trained Stable Diffusion VAE and fine-tune its U-Net to estimate depth. The input image and depth map are encoded into a latent space, concatenated, and fed into the U-Net. Fine-tuning relies solely on synthetic data (Hypersim, Virtual KITTI) and an annealed multi-resolution noise schedule for faster convergence. A test-time ensembling scheme aggregates multiple predictions to boost performance. Marigold achieves state-of-the-art results on multiple zero-shot benchmarks (NYUv2, KITTI, ETH3D, ScanNet, DIODE), surpassing existing methods in most cases despite being trained solely on synthetic data. Training with multi-resolution noise and annealing consistently improves performance and reduces prediction variance, as demonstrated by ablation studies. The proposed test-time ensembling significantly enhances accuracy, with the most substantial gains observed when aggregating up to 10 predictions. The inference speed of Marigold is slower than feed-forward methods due to its iterative nature. The generative nature of diffusion models can lead to inconsistent depth predictions for similar input images, even with test-time ensembling. monocular depth estimation, diffusion models, stable diffusion, fine-tuning, zero-shot generalization
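The test-time ensembling of affine-invariant depth predictions can be approximated by least-squares scale/shift alignment followed by a per-pixel median; this is a simplified stand-in for the paper's joint optimization, with all names illustrative.

```python
import numpy as np

def align_scale_shift(pred: np.ndarray, ref: np.ndarray) -> np.ndarray:
    """Least-squares scale/shift so that s * pred + t best matches ref."""
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    s, t = np.linalg.lstsq(A, ref.ravel(), rcond=None)[0]
    return s * pred + t

def ensemble_depths(preds: list) -> np.ndarray:
    """Align every affine-invariant prediction to the first one, then take the median."""
    ref = preds[0]
    aligned = [ref] + [align_scale_shift(p, ref) for p in preds[1:]]
    return np.median(np.stack(aligned, axis=0), axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_depth = rng.uniform(1.0, 5.0, size=(32, 32))
    # Simulate predictions that differ by an unknown scale/shift plus noise.
    preds = [a * true_depth + b + 0.05 * rng.standard_normal(true_depth.shape)
             for a, b in [(1.0, 0.0), (2.0, 1.0), (0.5, -0.3)]]
    fused = ensemble_depths(preds)
    print(np.abs(fused - true_depth).mean())
```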
2312.02139 Report DiffiT: Diffusion Vision Transformers for Image Generation Ali Hatamizadeh, Jiaming Song, Guilin Liu, Jan Kautz, Arash Vahdat Diffusion models with their powerful expressivity and high sample quality have achieved State-Of-The-Art (SOTA) performance in the generative domain. The pioneering Vision Transformer (ViT) has also demonstrated strong modeling capabilities and scalability, especially for recognition tasks. In this paper, we study the effectiveness of ViTs in diffusion-based generative learning and propose a new model denoted as Diffusion Vision Transformers (DiffiT). Specifically, we propose a methodology for fine-grained control of the denoising process and introduce the Time-dependent Multihead Self-Attention (TMSA) mechanism. DiffiT is surprisingly effective in generating high-fidelity images with significantly better parameter efficiency. We also propose latent and image space DiffiT models and show SOTA performance on a variety of class-conditional and unconditional synthesis tasks at different resolutions. The Latent DiffiT model achieves a new SOTA FID score of 1.73 on the ImageNet-256 dataset while having 19.85% and 16.88% fewer parameters than other Transformer-based diffusion models such as MDT and DiT, respectively. Code: https://github.com/NVlabs/DiffiT Introduced DiffiT, a novel ViT-based diffusion model for latent and image space generation, featuring Time-dependent Multihead Self-Attention (TMSA) for efficient spatial-temporal dependency learning. Combines the strengths of ViTs (long-range dependency modeling, scalability) and diffusion models (high sample quality) for enhanced image generation. Leveraged TMSA within a U-Net architecture, enabling dynamic adaptation of attention across denoising stages by integrating temporal information into queries, keys, and values. Achieved SOTA FID score of 1.73 on ImageNet-256, outperforming competing models with fewer parameters. Demonstrated SOTA image generation performance on CIFAR-10 and FFHQ-64 datasets. Showed TMSA's effectiveness through ablation studies, highlighting its importance for superior generation quality and parameter efficiency. Exploration of TMSA's potential beyond image generation tasks, such as image-to-image translation. Investigation of alternative time embedding techniques for potential performance enhancement. image generation, diffusion models, vision transformers, time-dependent self-attention, generative modeling
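The TMSA idea, injecting the timestep embedding into queries, keys, and values, can be sketched as a small PyTorch module; this is a simplified reading, not the official implementation, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class TimeDependentSelfAttention(nn.Module):
    """Self-attention whose queries, keys, and values also depend on a time embedding."""

    def __init__(self, dim: int, time_dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_qkv_x = nn.Linear(dim, 3 * dim)        # spatial contribution
        self.to_qkv_t = nn.Linear(time_dim, 3 * dim)   # temporal (timestep) contribution

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) tokens, t_emb: (B, time_dim) timestep embedding
        qkv = self.to_qkv_x(x) + self.to_qkv_t(t_emb).unsqueeze(1)
        q, k, v = qkv.chunk(3, dim=-1)
        out, _ = self.attn(q, k, v)
        return out

if __name__ == "__main__":
    block = TimeDependentSelfAttention(dim=64, time_dim=32)
    tokens = torch.randn(2, 16 * 16, 64)
    t_emb = torch.randn(2, 32)
    print(block(tokens, t_emb).shape)  # torch.Size([2, 256, 64])
```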
2312.02135 Report Fast View Synthesis of Casual Videos Yao-Chih Lee, Zhoutong Zhang, Kevin Blackburn-Matzen, Simon Niklaus, Jianming Zhang, Jia-Bin Huang, Feng Liu Novel view synthesis from an in-the-wild video is difficult due to challenges like scene dynamics and lack of parallax. While existing methods have shown promising results with implicit neural radiance fields, they are slow to train and render. This paper revisits explicit video representations to synthesize high-quality novel views from a monocular video efficiently. We treat static and dynamic video content separately. Specifically, we build a global static scene model using an extended plane-based scene representation to synthesize temporally coherent novel video. Our plane-based scene representation is augmented with spherical harmonics and displacement maps to capture view-dependent effects and model non-planar complex surface geometry. We opt to represent the dynamic content as per-frame point clouds for efficiency. While such representations are inconsistency-prone, minor temporal inconsistencies are perceptually masked due to motion. We develop a method to quickly estimate such a hybrid video representation and render novel views in real time. Our experiments show that our method can render high-quality novel views from an in-the-wild video with comparable quality to state-of-the-art methods while being 100x faster in training and enabling real-time rendering. This paper presents an efficient novel view synthesis method for casual monocular videos using a hybrid explicit representation, achieving comparable quality to NeRF-based methods but 100x faster. Existing NeRF-based methods for dynamic novel view synthesis are computationally expensive and slow to train and render, making them impractical for real-time applications. The method separates static and dynamic content, using a global plane-based representation with spherical harmonics and displacement maps for the static background and per-frame point clouds for dynamic objects. It jointly optimizes the representation from monocular videos using a set of carefully designed loss functions. The method achieves comparable rendering quality to state-of-the-art NeRF-based methods on the NVIDIA and DAVIS datasets. It significantly outperforms previous approaches in terms of training and rendering speed, achieving real-time rendering at 27 FPS. Ablation studies demonstrate the effectiveness of view-dependent plane textures and temporal neighbor blending for improving synthesis quality. The method's performance depends on the accuracy of preprocessed video depth and pose estimation. It may struggle to separate objects with subtle motion from the static background, highlighting an area for future improvement. novel view synthesis, dynamic scenes, explicit representation, real-time rendering, monocular video
2312.02134 Report GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians Liangxiao Hu, Hongwen Zhang, Yuxiang Zhang, Boyao Zhou, Boning Liu, Shengping Zhang, Liqiang Nie We present GaussianAvatar, an efficient approach to creating realistic human avatars with dynamic 3D appearances from a single video. We start by introducing animatable 3D Gaussians to explicitly represent humans in various poses and clothing styles. Such an explicit and animatable representation can fuse 3D appearances more efficiently and consistently from 2D observations. Our representation is further augmented with dynamic properties to support pose-dependent appearance modeling, where a dynamic appearance network along with an optimizable feature tensor is designed to learn the motion-to-appearance mapping. Moreover, by leveraging the differentiable motion condition, our method enables a joint optimization of motions and appearances during avatar modeling, which helps to tackle the long-standing issue of inaccurate motion estimation in monocular settings. The efficacy of GaussianAvatar is validated on both the public dataset and our collected dataset, demonstrating its superior performances in terms of appearance quality and rendering efficiency. Presents GaussianAvatar, a novel method for reconstructing realistic human avatars with dynamic 3D appearances from a single video using animatable 3D Gaussians. Creating personalized avatars from monocular videos is challenging due to the inherent ambiguity and complexities in capturing dynamic human appearance, including wrinkles and cloth deformations. Introduces animatable 3D Gaussians with pose-dependent properties and a dynamic appearance network to model motion-to-appearance mapping. Jointly optimizes motion and appearance during training to refine inaccurate motion estimations from monocular input. Outperforms previous methods in reconstruction quality on People-Snapshot, NeuMan, and DynVideo datasets. Demonstrates robustness to initial motion estimations and effectively corrects misalignments in motion capture results. Enables realistic animation with challenging poses while maintaining real-time rendering speeds. May generate artifacts due to inaccurate foreground segmentation. Faces challenges in accurately modeling loose outfits, such as dresses, due to limitations in skinning weights derived from the SMPL model. human avatar reconstruction, animatable 3d gaussians, dynamic appearance modeling, motion optimization, single-view reconstruction
2312.02133 Report Style Aligned Image Generation via Shared Attention Amir Hertz, Andrey Voynov, Shlomi Fruchter, Daniel Cohen-Or Large-scale Text-to-Image (T2I) models have rapidly gained prominence across creative fields, generating visually compelling outputs from textual prompts. However, controlling these models to ensure consistent style remains challenging, with existing methods necessitating fine-tuning and manual intervention to disentangle content and style. In this paper, we introduce StyleAligned, a novel technique designed to establish style alignment among a series of generated images. By employing minimal `attention sharing' during the diffusion process, our method maintains style consistency across images within T2I models. This approach allows for the creation of style-consistent images using a reference style through a straightforward inversion operation. Our method's evaluation across diverse styles and text prompts demonstrates high-quality synthesis and fidelity, underscoring its efficacy in achieving consistent style across various inputs. Introduces "StyleAligned," a technique for consistent style interpretation across a set of images generated by text-to-image models, using minimal attention sharing during the diffusion process. Addresses the challenge of maintaining consistent style in AI-generated image sets, which is crucial for applications requiring a unified aesthetic. Employs minimal 'attention sharing' with AdaIN modulation during the diffusion process, where target images attend to a reference image's features, enabling style alignment without optimization or fine-tuning. Achieves significantly higher style consistency scores compared to standard text-to-image generation. Exhibits less content leakage and generates more diverse sets compared to full attention sharing. Outperforms personalization-based methods like DreamBooth and StyleDrop in terms of style consistency and adherence to text prompts. Achieving finer control over shape and appearance similarity among generated images. Overcoming limitations of diffusion inversion for more robust style transfer from input images. text-to-image synthesis, style consistency, diffusion models, attention mechanisms, generative ai
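The shared-attention mechanism with AdaIN modulation can be sketched roughly as follows; which statistics are matched and how reference and target tokens are concatenated are simplifying assumptions, not the exact released code.

```python
import torch

def adain(x: torch.Tensor, ref: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Shift/scale x's per-channel token statistics to match those of ref.

    x, ref: (B, N, D) token sequences; statistics are computed over the N tokens.
    """
    mu_x, std_x = x.mean(dim=1, keepdim=True), x.std(dim=1, keepdim=True) + eps
    mu_r, std_r = ref.mean(dim=1, keepdim=True), ref.std(dim=1, keepdim=True) + eps
    return (x - mu_x) / std_x * std_r + mu_r

def shared_attention(q_tgt, k_tgt, v_tgt, q_ref, k_ref, v_ref):
    """Target tokens attend over the concatenation of reference and target keys/values,
    with target queries/keys AdaIN-modulated toward the reference statistics."""
    q = adain(q_tgt, q_ref)
    k = adain(k_tgt, k_ref)
    keys = torch.cat([k_ref, k], dim=1)
    values = torch.cat([v_ref, v_tgt], dim=1)
    attn = torch.softmax(q @ keys.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ values

if __name__ == "__main__":
    B, N, D = 1, 64, 32
    q_t, k_t, v_t = (torch.randn(B, N, D) for _ in range(3))
    q_r, k_r, v_r = (torch.randn(B, N, D) for _ in range(3))
    print(shared_attention(q_t, k_t, v_t, q_r, k_r, v_r).shape)  # torch.Size([1, 64, 32])
```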
2312.02126 Report SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM Nikhil Keetha, Jay Karhade, Krishna Murthy Jatavallabhula, Gengshan Yang, Sebastian Scherer, Deva Ramanan, Jonathon Luiten Dense simultaneous localization and mapping (SLAM) is crucial for robotics and augmented reality applications. However, current methods are often hampered by the non-volumetric or implicit way they represent a scene. This work introduces SplaTAM, an approach that, for the first time, leverages explicit volumetric representations, i.e., 3D Gaussians, to enable high-fidelity reconstruction from a single unposed RGB-D camera, surpassing the capabilities of existing methods. SplaTAM employs a simple online tracking and mapping system tailored to the underlying Gaussian representation. It utilizes a silhouette mask to elegantly capture the presence of scene density. This combination enables several benefits over prior representations, including fast rendering and dense optimization, quickly determining if areas have been previously mapped, and structured map expansion by adding more Gaussians. Extensive experiments show that SplaTAM achieves up to 2x superior performance in camera pose estimation, map construction, and novel-view synthesis over existing methods, paving the way for more immersive high-fidelity SLAM applications. SplaTAM is the first dense RGB-D SLAM solution to use 3D Gaussian Splatting for high-fidelity online camera tracking and scene reconstruction. Existing dense SLAM methods suffer from limitations. Explicit representations struggle with novel view synthesis while implicit ones are computationally expensive and hard to edit. SplaTAM leverages the advantages of explicit volumetric representations to address these issues. SplaTAM represents the scene as a collection of 3D Gaussians. It utilizes differentiable rendering via Gaussian splatting to estimate camera poses and update the Gaussian map. This process involves camera tracking, Gaussian densification, and map updating. Achieves state-of-the-art camera pose estimation accuracy on multiple datasets, especially excelling in scenarios with large camera motion. Demonstrates high-fidelity novel view synthesis performance, comparable to methods using ground truth poses. Exhibits fast runtime comparable to methods using significantly fewer pixels for optimization due to the efficiency of Gaussian splatting. Shows sensitivity to motion blur, large depth noise, and aggressive rotation. Future work includes addressing these sensitivities, scaling to large-scale scenes, and removing dependencies on known camera intrinsics and dense depth. slam, 3d gaussian splatting, novel view synthesis, differentiable rendering, dense reconstruction
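The silhouette-masked tracking objective, photometric and depth errors evaluated only where the current map already explains the pixel, can be sketched as below; the threshold and weighting are illustrative assumptions, not the paper's exact settings.

```python
import torch

def tracking_loss(rendered_rgb: torch.Tensor, rendered_depth: torch.Tensor,
                  silhouette: torch.Tensor, gt_rgb: torch.Tensor, gt_depth: torch.Tensor,
                  sil_thresh: float = 0.99, depth_weight: float = 1.0) -> torch.Tensor:
    """L1 color + depth error, restricted to pixels the current map already explains.

    silhouette: (H, W) rendered opacity/visibility in [0, 1]; pixels below the
    threshold are treated as unmapped and excluded from tracking.
    """
    mask = (silhouette > sil_thresh).float()
    color_err = (rendered_rgb - gt_rgb).abs().sum(dim=0) * mask
    depth_err = (rendered_depth - gt_depth).abs() * mask
    denom = mask.sum().clamp(min=1.0)
    return (color_err.sum() + depth_weight * depth_err.sum()) / denom

if __name__ == "__main__":
    H, W = 48, 64
    loss = tracking_loss(torch.rand(3, H, W), torch.rand(H, W),
                         torch.rand(H, W), torch.rand(3, H, W), torch.rand(H, W))
    print(loss.item())
```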
2312.02116 Report GIVT: Generative Infinite-Vocabulary Transformers Michael Tschannen, Cian Eastwood, Fabian Mentzer We introduce generative infinite-vocabulary transformers (GIVT) which generate vector sequences with real-valued entries, instead of discrete tokens from a finite vocabulary. To this end, we propose two surprisingly simple modifications to decoder-only transformers: 1) at the input, we replace the finite-vocabulary lookup table with a linear projection of the input vectors; and 2) at the output, we replace the logits prediction (usually mapped to a categorical distribution) with the parameters of a multivariate Gaussian mixture model. Inspired by the image-generation paradigm of VQ-GAN and MaskGIT, where transformers are used to model the discrete latent sequences of a VQ-VAE, we use GIVT to model the unquantized real-valued latent sequences of a β-VAE. In class-conditional image generation GIVT outperforms VQ-GAN (and improved variants thereof) as well as MaskGIT, and achieves performance competitive with recent latent diffusion models. Finally, we obtain strong results outside of image generation when applying GIVT to panoptic segmentation and depth estimation with a VAE variant of the UViM framework. Introduces Generative Infinite-Vocabulary Transformers (GIVT), which generate sequences of real-valued vectors instead of discrete tokens, eliminating the need for quantization in visual data generation. Overcomes limitations of VQ-VAE based image generation, such as low codebook usage and large embedding matrices, while offering better quality and representation learning. Modifies decoder-only transformers by replacing embedding lookup with linear projection and outputting parameters of a Gaussian Mixture Model (GMM). Trained on latent sequences from a β-VAE. Outperforms VQGAN, MaskGIT, and some diffusion models in class-conditional image generation. Achieves strong representation learning capabilities, on par with or exceeding VQ-based models. Demonstrates competitive performance in panoptic segmentation and depth estimation with UViM framework. End-to-end training of VAE and GIVT poses challenges. Exploration of GIVT applications beyond image generation, such as audio and time-series modeling. generative models, transformers, image generation, quantization-free, infinite vocabulary
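The two modifications the entry describes are simple enough to sketch directly. Below is a minimal PyTorch illustration, assuming a per-dimension Gaussian mixture head and illustrative dimensions; it is a simplified stand-in for the released model, not the authors' code.

```python
# Sketch of a GIVT-style decoder (illustrative dimensions, simplified per-dimension mixture):
# (1) real-valued latents are projected linearly instead of looked up in a vocabulary;
# (2) the output head predicts Gaussian-mixture parameters instead of logits.
import torch
import torch.nn as nn


class GIVTSketch(nn.Module):
    def __init__(self, latent_dim=16, d_model=256, n_layers=4, n_mix=8):
        super().__init__()
        self.in_proj = nn.Linear(latent_dim, d_model)            # replaces nn.Embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)    # causal mask makes it a decoder
        # For each latent dimension: n_mix (weight, mean, log-scale) triplets.
        self.out_proj = nn.Linear(d_model, latent_dim * n_mix * 3)
        self.latent_dim, self.n_mix = latent_dim, n_mix

    def forward(self, z):                       # z: (B, T, latent_dim) unquantized VAE latents
        B, T, _ = z.shape
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.decoder(self.in_proj(z), mask=causal)
        params = self.out_proj(h).view(B, T, self.latent_dim, self.n_mix, 3)
        logits, mean, log_scale = params.unbind(-1)
        return logits, mean, log_scale

    def nll(self, z_next, logits, mean, log_scale):
        """Negative log-likelihood of the next latent under the predicted mixture."""
        comp = torch.distributions.Normal(mean, log_scale.exp())
        mix = torch.distributions.Categorical(logits=logits)
        gmm = torch.distributions.MixtureSameFamily(mix, comp)
        return -gmm.log_prob(z_next).mean()


if __name__ == "__main__":
    model = GIVTSketch()
    z = torch.randn(2, 10, 16)                  # stand-in for beta-VAE latent sequences
    logits, mean, log_scale = model(z[:, :-1])  # teacher forcing: predict the next latent
    print(model.nll(z[:, 1:], logits, mean, log_scale))
```

Sampling works the same way as with a categorical head, except that each step draws a real-valued vector from the predicted mixture instead of an index from a softmax.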
2312.02109 Report ArtAdapter: Text-to-Image Style Transfer using Multi-Level Style Encoder and Explicit Adaptation Dar-Yen Chen, Hamish Tennent, Ching-Wen Hsu This work introduces ArtAdapter, a transformative text-to-image (T2I) style transfer framework that transcends traditional limitations of color, brushstrokes, and object shape, capturing high-level style elements such as composition and distinctive artistic expression. The integration of a multi-level style encoder with our proposed explicit adaptation mechanism enables ArtAdapter to achieve unprecedented fidelity in style transfer, ensuring close alignment with textual descriptions. Additionally, the incorporation of an Auxiliary Content Adapter (ACA) effectively separates content from style, alleviating the borrowing of content from style references. Moreover, our novel fast finetuning approach could further enhance zero-shot style representation while mitigating the risk of overfitting. Comprehensive evaluations confirm that ArtAdapter surpasses current state-of-the-art methods. ArtAdapter, a novel text-to-image (T2I) style transfer framework that captures both low- and high-level artistic features, from textures to composition. Existing T2I style transfer methods struggle to capture high-level artistic style, often being limited to color and texture, or borrowing unwanted content from style references. Uses a multi-level style encoder, an explicit adaptation mechanism within the diffusion model's cross-attention layers, and an auxiliary content adapter (ACA) during training. Faithfully captures diverse artistic styles without compromising textual semantics. Enables flexible style mixing from different references and across different style levels. Outperforms state-of-the-art methods in both single- and multi-reference style transfer based on quantitative metrics and user study. High-level style embeddings can inadvertently incorporate lower-level elements during style mixing, requiring improved disentanglement. Future work can explore broader applications of ArtAdapter beyond style transfer, such as incorporating structural controls. text-to-image, style transfer, diffusion models, style mixing, deep learning
2312.02103 Report Learning Pseudo-Labeler beyond Noun Concepts for Open-Vocabulary Object Detection Sunghun Kang, Junbum Cha, Jonghwan Mun, Byungseok Roh, Chang D. Yoo Open-vocabulary object detection (OVOD) has recently gained significant attention as a crucial step toward achieving human-like visual intelligence. Existing OVOD methods extend target vocabulary from pre-defined categories to open-world by transferring knowledge of arbitrary concepts from vision-language pre-training models to the detectors. While previous methods have shown remarkable successes, they suffer from indirect supervision or limited transferable concepts. In this paper, we propose a simple yet effective method to directly learn region-text alignment for arbitrary concepts. Specifically, the proposed method aims to learn arbitrary image-to-text mapping for pseudo-labeling of arbitrary concepts, named Pseudo-Labeling for Arbitrary Concepts (PLAC). The proposed method shows competitive performance on the standard OVOD benchmark for noun concepts and a large improvement on referring expression comprehension benchmark for arbitrary concepts. This paper introduces PLAC, a pseudo-labeling method for open-vocabulary object detection (OVOD) that learns arbitrary image-to-text mapping to generate pseudo-labels, enabling the detection of concepts beyond simple nouns. Existing OVOD methods struggle to effectively transfer knowledge from open-world classifiers to detectors, often relying on indirect supervision or limiting themselves to noun-based concepts. PLAC overcomes these limitations by directly learning region-text alignment for arbitrary concepts, broadening the scope of detectable objects. PLAC leverages a module trained on image-text pairs to map CLIP image embeddings to corresponding text embeddings. These embeddings serve as pseudo-labels for training an OVOD model (Deformable DETR) with a two-stage matching strategy to handle the uncertainty of pseudo-labels. PLAC achieves competitive performance on the LVIS benchmark for OVOD, demonstrating its ability to effectively learn noun concepts. PLAC significantly outperforms previous state-of-the-art methods on the RefCOCOg referring expression comprehension benchmark, highlighting its capability to detect objects based on arbitrary concepts, including colors and specific object attributes. Ablation studies confirm the effectiveness of the proposed pseudo-labeling method, loss functions, and the two-stage matching strategy. The performance of PLAC on LVIS, while competitive, reveals that current benchmarks might not be ideal for evaluating the full potential of OVOD methods designed for arbitrary concept detection. Future work could explore incorporating additional modalities or knowledge sources to further enhance the richness and accuracy of the pseudo-labels. open-vocabulary object detection, pseudo-labeling, region-text alignment, vision-language pre-training, referring expression comprehension
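The core of the pseudo-labeling step described above is a learned image-to-text mapping over CLIP embeddings. The sketch below is an illustrative simplification: a small MLP is trained on paired image/caption embeddings with a cosine objective, and its outputs on region embeddings then serve as open-vocabulary pseudo-labels. The mapper architecture, loss, and dimensions are assumptions, not the paper's exact design.

```python
# Sketch of PLAC-style pseudo-labeling (illustrative; details differ from the paper).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ImageToTextMapper(nn.Module):
    def __init__(self, dim=512, hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, img_emb):
        return F.normalize(self.mlp(img_emb), dim=-1)


def mapping_loss(mapper, img_emb, txt_emb):
    """Pull the mapped image embedding toward its paired caption embedding."""
    pred = mapper(img_emb)
    txt = F.normalize(txt_emb, dim=-1)
    return (1.0 - (pred * txt).sum(dim=-1)).mean()   # cosine-distance loss


if __name__ == "__main__":
    # Stand-ins for CLIP image/text embeddings of paired data (dim 512 assumed).
    img_emb, txt_emb = torch.randn(32, 512), torch.randn(32, 512)
    mapper = ImageToTextMapper()
    opt = torch.optim.AdamW(mapper.parameters(), lr=1e-4)
    loss = mapping_loss(mapper, img_emb, txt_emb)
    loss.backward(); opt.step()
    # At detector-training time, region embeddings pass through the trained mapper
    # and the outputs act as pseudo region-text labels for the OVOD model.
    pseudo_labels = mapper(torch.randn(5, 512)).detach()
    print(loss.item(), pseudo_labels.shape)
```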
2312.02087 Report VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence Yuchao Gu, Yipin Zhou, Bichen Wu, Licheng Yu, Jia-Wei Liu, Rui Zhao, Jay Zhangjie Wu, David Junhao Zhang, Mike Zheng Shou, Kevin Tang Current diffusion-based video editing primarily focuses on structure-preserved editing by utilizing various dense correspondences to ensure temporal consistency and motion alignment. However, these approaches are often ineffective when the target edit involves a shape change. To embark on video editing with shape change, we explore customized video subject swapping in this work, where we aim to replace the main subject in a source video with a target subject having a distinct identity and potentially different shape. In contrast to previous methods that rely on dense correspondences, we introduce the VideoSwap framework that exploits semantic point correspondences, inspired by our observation that only a small number of semantic points are necessary to align the subject's motion trajectory and modify its shape. We also introduce various user-point interactions (e.g., removing points and dragging points) to address various semantic point correspondences. Extensive experiments demonstrate state-of-the-art video subject swapping results across a variety of real-world videos. This paper introduces VideoSwap, a novel framework for customized video subject swapping that leverages semantic point correspondences to align motion trajectories while enabling significant shape changes in the swapped subject. Existing diffusion-based video editing methods, often relying on dense correspondences, struggle with shape changes in the edited subject. VideoSwap addresses this limitation by utilizing sparse semantic point correspondences, allowing for flexible shape manipulation while preserving motion fidelity. VideoSwap extracts semantic point trajectories and embeddings from the source video. It then registers these points on the source video to guide the diffusion model during the editing process. The framework supports user interaction through point removal or dragging to handle various semantic correspondences between the source and target subjects. A layered neural atlas aids in propagating point displacements consistently across frames. VideoSwap demonstrates superior performance in customized video subject swapping compared to state-of-the-art video editing methods, as evidenced by both qualitative and quantitative evaluations. The use of semantic point correspondence enables VideoSwap to achieve significant shape changes in the target subject while maintaining accurate alignment with the source subject's motion. Ablation studies validate the contribution of key components, such as the use of DIFT embeddings for semantic point representation and the incorporation of a point patch loss and a semantic-enhanced schedule during training. The performance of VideoSwap relies on accurate point tracking, which can be challenged by self-occlusion and significant view changes in the video. The current implementation of VideoSwap incurs noticeable computational cost, limiting its practicality for real-time interactive editing. Future work may explore neural field acceleration and diffusion model distillation to address this limitation. video editing, diffusion models, semantic point correspondence, shape change, motion trajectory alignment
2312.02069 Report GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians Shenhan Qian, Tobias Kirschstein, Liam Schoneveld, Davide Davoli, Simon Giebenhain, Matthias Nießner We introduce GaussianAvatars, a new method to create photorealistic head avatars that are fully controllable in terms of expression, pose, and viewpoint. The core idea is a dynamic 3D representation based on 3D Gaussian splats that are rigged to a parametric morphable face model. This combination facilitates photorealistic rendering while allowing for precise animation control via the underlying parametric model, e.g., through expression transfer from a driving sequence or by manually changing the morphable model parameters. We parameterize each splat by a local coordinate frame of a triangle and optimize for explicit displacement offset to obtain a more accurate geometric representation. During avatar reconstruction, we jointly optimize for the morphable model parameters and Gaussian splat parameters in an end-to-end fashion. We demonstrate the animation capabilities of our photorealistic avatar in several challenging scenarios. For instance, we show reenactments from a driving video, where our method outperforms existing works by a significant margin. Introduces GaussianAvatars, a method for creating animatable and photorealistic head avatars by rigging 3D Gaussian splats to a parametric mesh (FLAME). Creating animatable avatars with photorealistic quality and controllability is crucial for various applications in gaming, VR/AR, etc. Binds 3D Gaussian splats to a FLAME mesh, enabling the Gaussians to move dynamically with the mesh. A binding inheritance strategy supports adding/removing Gaussians while maintaining controllability. Regularization ensures accurate animation without artifacts. Outperforms state-of-the-art methods in novel-view synthesis and self-reenactment, achieving higher PSNR and SSIM values. Exhibits superior visual quality, capturing fine details like wrinkles and hair strands, especially during animation. Demonstrates better generalization to novel expressions and poses compared to other methods. Lacks explicit control over regions not modeled by FLAME, such as hair or accessories. Relighting the avatar is currently not feasible. avatar creation, 3d gaussian splatting, parametric face model, novel view synthesis, facial reenactment
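The "local coordinate frame of a triangle" idea can be illustrated with a few lines of geometry: build an orthonormal frame from a triangle's edge and normal, then carry a splat's local mean offset into world space so the splat follows the deforming mesh. This is a simplified, illustrative parameterization (it omits scale/rotation handling and the paper's exact conventions).

```python
# Sketch of rigging a 3D Gaussian to a mesh triangle (illustrative, not the exact paper parameterization).
import torch
import torch.nn.functional as F


def triangle_frame(v0, v1, v2):
    """Return the centroid and a 3x3 rotation whose columns form a local frame."""
    e1 = F.normalize(v1 - v0, dim=-1)                                  # tangent along one edge
    n = F.normalize(torch.cross(v1 - v0, v2 - v0, dim=-1), dim=-1)     # triangle normal
    e2 = torch.cross(n, e1, dim=-1)                                    # completes the orthonormal frame
    origin = (v0 + v1 + v2) / 3.0
    R = torch.stack([e1, e2, n], dim=-1)                               # (3, 3)
    return origin, R


def local_to_world(mu_local, v0, v1, v2):
    """Map a per-triangle local Gaussian mean offset to world coordinates."""
    origin, R = triangle_frame(v0, v1, v2)
    return origin + (R @ mu_local.unsqueeze(-1)).squeeze(-1)


if __name__ == "__main__":
    # One triangle in rest pose and the same triangle after a mesh-driven deformation.
    rest = [torch.tensor(t, dtype=torch.float32) for t in ([0., 0., 0.], [1., 0., 0.], [0., 1., 0.])]
    posed = [v + torch.tensor([0.0, 0.0, 0.5]) for v in rest]
    mu_local = torch.tensor([0.1, 0.2, 0.05])        # learnable offset, fixed in the local frame
    print(local_to_world(mu_local, *rest))
    print(local_to_world(mu_local, *posed))          # the splat moves with its parent triangle
```

Because the local offset stays constant while the frame tracks the mesh, changing FLAME expression or pose parameters moves every bound splat automatically, which is what makes the avatar controllable.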
2312.01987 Report Bootstrapping SparseFormers from Vision Foundation Models Ziteng Gao, Zhan Tong, Kevin Qinghong Lin, Joya Chen, Mike Zheng Shou The recently proposed SparseFormer architecture provides an alternative approach to visual understanding by utilizing a significantly lower number of visual tokens via adjusting RoIs, greatly reducing computational costs while still achieving promising performance. However, training SparseFormers from scratch is still expensive, and scaling up the number of parameters can be challenging. In this paper, we propose to bootstrap SparseFormers from ViT-based vision foundation models in a simple and efficient way. Since the majority of SparseFormer blocks are the standard transformer ones, we can inherit weights from large-scale pre-trained vision transformers and freeze them as much as possible. Therefore, we only need to train the SparseFormer-specific lightweight focusing transformer to adjust token RoIs and fine-tune a few early pre-trained blocks to align the final token representation. In such a way, we can bootstrap SparseFormer architectures from various large-scale pre-trained models (e.g., IN-21K pre-trained AugRegs or CLIPs) using a much smaller number of training samples (e.g., IN-1K) and without labels or captions within just a few hours. As a result, the bootstrapped unimodal SparseFormer (from AugReg-ViT-L/16-384) can reach 84.9% accuracy on IN-1K with only 49 tokens, and the multimodal SparseFormer from CLIPs also demonstrates notable zero-shot performance with highly reduced computational cost without seeing any caption during the bootstrapping procedure. In addition, CLIP-bootstrapped SparseFormers, which align the output space with language without seeing a word, can serve as efficient vision encoders in multimodal large language models. Code and models are available at https://github.com/showlab/sparseformer. This paper presents a method to bootstrap SparseFormers from pre-trained vision foundation models (like AugReg and CLIP) by inheriting weights and aligning final representations using a smaller dataset. Training SparseFormers from scratch is expensive and scaling them is challenging. This work addresses these limitations by leveraging pre-trained models for faster and more efficient training. The method inherits weights from the pre-trained model into the SparseFormer's cortex transformer blocks. Then, it aligns the final representation of the SparseFormer with that of the pre-trained model using a cosine loss on a smaller dataset like ImageNet-1K. Bootstrapped SparseFormers achieve comparable accuracy to pre-trained models with significantly fewer tokens (e.g., 84.9% top-1 accuracy on ImageNet-1K with only 49 tokens). The method is applicable to both unimodal (classification) and multimodal (CLIP) models. Bootstrapped SparseFormers can serve as efficient backbones for downstream tasks like semantic segmentation and as vision encoders in multimodal large language models. The bootstrapping method assumes a transformer architecture for the vision foundation model, limiting its applicability to non-transformer models. The method requires access to pre-trained weights, which may not be available for some large-scale proprietary models. sparseformers, vision foundation models, model bootstrapping, efficient vision transformers, multimodal learning
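The bootstrapping objective itself, a cosine alignment between the student's final representation and that of a frozen pretrained model on unlabeled images, is easy to sketch. The encoders below are tiny stand-ins (not a real ViT or SparseFormer), and in the real setup only the focusing transformer and a few early blocks would be trainable.

```python
# Sketch of the bootstrapping objective (illustrative stand-in modules, not the released code).
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 768))   # stand-in for a frozen pretrained ViT
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 768))   # stand-in for a SparseFormer

for p in teacher.parameters():
    p.requires_grad_(False)

# In the real setup only the lightweight focusing transformer and a few early blocks train;
# here the whole stand-in student is trainable for brevity.
opt = torch.optim.AdamW([p for p in student.parameters() if p.requires_grad], lr=1e-4)


def alignment_loss(images):
    with torch.no_grad():
        target = F.normalize(teacher(images), dim=-1)   # final representation of the frozen teacher
    pred = F.normalize(student(images), dim=-1)
    return (1.0 - (pred * target).sum(dim=-1)).mean()   # cosine alignment, no labels or captions


if __name__ == "__main__":
    images = torch.randn(4, 3, 32, 32)                  # e.g. unlabeled ImageNet-1K images
    loss = alignment_loss(images)
    loss.backward(); opt.step()
    print(loss.item())
```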
2312.01985 Report UniGS: Unified Representation for Image Generation and Segmentation Lu Qi, Lehan Yang, Weidong Guo, Yu Xu, Bo Du, Varun Jampani, Ming-Hsuan Yang This paper introduces a novel unified representation of diffusion models for image generation and segmentation. Specifically, we use a colormap to represent entity-level masks, addressing the challenge of varying entity numbers while aligning the representation closely with the image RGB domain. Two novel modules, including the location-aware color palette and progressive dichotomy module, are proposed to support our mask representation. On the one hand, a location-aware palette guarantees the colors' consistency to entities' locations. On the other hand, the progressive dichotomy module can efficiently decode the synthesized colormap to high-quality entity-level masks in a depth-first binary search without knowing the cluster numbers. To tackle the issue of lacking large-scale segmentation training data, we employ an inpainting pipeline and then improve the flexibility of diffusion models across various tasks, including inpainting, image synthesis, referring segmentation, and entity segmentation. Comprehensive experiments validate the efficiency of our approach, demonstrating comparable segmentation mask quality to state-of-the-art and adaptability to multiple tasks. The code will be released at https://github.com/qqlu/Entity. This paper presents UniGS, a novel unified diffusion model for simultaneous image generation and entity-level segmentation using a colormap representation for masks. A unified representation for both generation and segmentation can refine image generation, enhance coherence between synthesized entities and their masks, and enable a single model to perform various dense prediction tasks. UniGS employs a UNet architecture with dual branches for image and mask generation. It introduces a location-aware color palette for consistent entity representation and a progressive dichotomy module for efficient colormap decoding to masks. The model is trained using an inpainting pipeline to address limited segmentation data. UniGS achieves comparable segmentation quality to state-of-the-art methods without using standard segmentation losses. It demonstrates strong performance in multi-class multi-region inpainting, image synthesis, referring segmentation, and entity segmentation. The model exhibits the ability to generate realistic shadows, even without explicit shadow supervision. There is still a performance gap between UniGS and state-of-the-art entity segmentation models. Future work includes exploring multi-task training for all tasks within a single model. diffusion models, image generation, semantic segmentation, unified representation, inpainting
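The "depth-first binary search without knowing the cluster numbers" can be illustrated with a toy decoder: recursively split pixel colors in two (a 2-means step) until each cluster is tight, so entity masks fall out of the synthesized colormap without a preset entity count. This numpy sketch is purely illustrative and assumes a simple color-compactness stopping rule, not the paper's exact criterion.

```python
# Toy sketch of progressive-dichotomy-style colormap decoding (illustrative, numpy only).
import numpy as np


def two_means(colors, iters=10):
    """One binary split of an (N, 3) color set; returns a boolean assignment."""
    c0, c1 = colors.min(axis=0), colors.max(axis=0)
    for _ in range(iters):
        assign = np.linalg.norm(colors - c0, axis=1) < np.linalg.norm(colors - c1, axis=1)
        if assign.all() or (~assign).all():
            break
        c0, c1 = colors[assign].mean(axis=0), colors[~assign].mean(axis=0)
    return assign


def dichotomy_decode(colormap, tol=0.05):
    """colormap: (H, W, 3) in [0, 1]; returns an (H, W) integer entity-id map."""
    H, W, _ = colormap.shape
    flat = colormap.reshape(-1, 3)
    labels = np.zeros(H * W, dtype=np.int64)
    stack, next_label = [np.arange(H * W)], 0
    while stack:                                    # depth-first binary search over clusters
        idx = stack.pop()
        colors = flat[idx]
        if colors.std(axis=0).max() < tol or len(idx) < 2:
            labels[idx] = next_label                # cluster is compact: emit one entity mask
            next_label += 1
            continue
        assign = two_means(colors)
        if assign.all() or (~assign).all():         # could not split further
            labels[idx] = next_label
            next_label += 1
            continue
        stack.append(idx[assign])
        stack.append(idx[~assign])
    return labels.reshape(H, W)


if __name__ == "__main__":
    cm = np.zeros((8, 8, 3))
    cm[:, 4:] = [1.0, 0.2, 0.2]                     # two flat color regions
    print(np.unique(dichotomy_decode(cm)))          # -> two entity ids
```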
2312.01841 Report VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior Xusen Sun, Longhao Zhang, Hao Zhu, Peng Zhang, Bang Zhang, Xinya Ji, Kangneng Zhou, Daiheng Gao, Liefeng Bo, Xun Cao Audio-driven talking head generation has drawn much attention in recent years, and many efforts have been made in lip-sync, expressive facial expressions, natural head pose generation, and high video quality. However, no model has yet led or tied on all these metrics due to the one-to-many mapping between audio and motion. In this paper, we propose VividTalk, a two-stage generic framework that supports generating high-visual-quality talking head videos with all the above properties. Specifically, in the first stage, we map the audio to mesh by learning two motions, including non-rigid expression motion and rigid head motion. For expression motion, both blendshape and vertex are adopted as the intermediate representation to maximize the representation ability of the model. For natural head motion, a novel learnable head pose codebook with a two-phase training mechanism is proposed. In the second stage, we propose a dual-branch motion-VAE and a generator to transform the meshes into dense motion and synthesize high-quality video frame by frame. Extensive experiments show that the proposed VividTalk can generate high-visual-quality talking head videos with lip-sync and realism enhanced by a large margin, and outperforms previous state-of-the-art works in objective and subjective comparisons. Proposes VividTalk, a two-stage framework for generating high-quality talking head videos with expressive facial expressions and natural head poses from audio and a single reference image. Existing methods struggle to simultaneously achieve lip-sync, expressive facial expressions, natural head poses, and high video quality due to the one-to-many mapping between audio and motion. The framework consists of: 1) **Audio-To-Mesh Generation**: maps audio to non-rigid facial expressions (using both blendshapes and vertex offsets) and rigid head poses (using a learnable head pose codebook) to generate driven meshes. 2) **Mesh-To-Video Generation**: transforms driven meshes into 2D dense motion using a dual-branch motion-VAE and synthesizes the final video frame-by-frame. Outperforms state-of-the-art methods in both objective and subjective evaluations for video quality, identity preservation, and head pose diversity. Generates accurate lip-synchronized and expressive facial motions. Demonstrates the effectiveness of using both blendshapes and vertex offsets, a learnable head pose codebook, and a dual-branch motion-VAE. Reliance on 3DMM for facial modeling, which may limit the representation of certain facial features. Further research on improving the temporal consistency and smoothness of generated head poses. talking head generation, audio-driven animation, deep learning, computer vision, 3d morphable model
2312.01790 Report Exploring Multi-Modal Fusion for Image Manipulation Detection and Localization Konstantinos Triaridis, Vasileios Mezaris Recent image manipulation localization and detection techniques usually leverage forensic artifacts and traces that are produced by a noise-sensitive filter, such as SRM and Bayar convolution. In this paper, we showcase that different filters commonly used in such approaches excel at unveiling different types of manipulations and provide complementary forensic traces. Thus, we explore ways of merging the outputs of such filters and aim to leverage the complementary nature of the artifacts produced to perform image manipulation localization and detection (IMLD). We propose two distinct methods: one that produces independent features from each forensic filter and then fuses them (this is referred to as late fusion) and one that performs early mixing of different modal outputs and produces early combined features (this is referred to as early fusion). We demonstrate that both approaches achieve competitive performance for both image manipulation localization and detection, outperforming state-of-the-art models across several datasets. This paper proposes two novel multi-modal fusion approaches for enhancing image manipulation detection and localization by leveraging complementary forensic artifacts from different filters (NoisePrint++, SRM, Bayar convolution). Image manipulation detection and localization are crucial for combating disinformation and fostering trust in digital media, especially with the advancement of sophisticated image editing tools. The study uses a dual-branch encoder-decoder architecture as a baseline and extends it with two fusion paradigms: 1) **Late Fusion:** Features from each forensic filter are extracted independently and then concatenated. Shared weights in the RGB branch mitigate overfitting. 2) **Early Fusion:** Features from different modalities are mixed early on using convolutional blocks before being fed into the encoder, promoting smoother feature integration. Both early and late fusion methods achieve state-of-the-art performance on five benchmark datasets for image manipulation localization (using pixel-level F1) and detection (using AUC and balanced accuracy). The study reveals that different forensic filters exhibit complementary strengths in detecting specific manipulation types, such as NoisePrint++ excelling in post-processing manipulations and SRM/Bayar in copy-move or diffusion-based manipulations. Both proposed fusion approaches demonstrate robustness against image degradations like Gaussian blurring and JPEG compression. The late fusion model's performance on detection tasks (bAcc) suggests potential for improvement through further regularization. Future work will explore the models' limitations against adversarial attacks. image forensics, image manipulation detection, image manipulation localization, multimodal fusion, deep learning
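The two fusion paradigms contrasted above can be sketched side by side. The modules below are small stand-ins for the paper's encoder-decoder branches, and plain tensors stand in for the RGB, noise-filter, and Bayar-style inputs; the point is only where the modality mixing happens.

```python
# Sketch of late vs. early fusion of forensic modalities (stand-in modules, not the paper's architecture).
import torch
import torch.nn as nn


def small_encoder(in_ch):
    return nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())


class LateFusion(nn.Module):
    """Independent features per forensic modality, concatenated late."""
    def __init__(self, n_modalities=3):
        super().__init__()
        self.branches = nn.ModuleList(small_encoder(3) for _ in range(n_modalities))
        self.head = nn.Conv2d(64 * n_modalities, 1, 1)            # localization logits
    def forward(self, modal_inputs):                              # list of (B, 3, H, W) filter outputs
        feats = [branch(x) for branch, x in zip(self.branches, modal_inputs)]
        return self.head(torch.cat(feats, dim=1))


class EarlyFusion(nn.Module):
    """Modal outputs mixed by convolutions before a shared encoder."""
    def __init__(self, n_modalities=3):
        super().__init__()
        self.mixer = nn.Conv2d(3 * n_modalities, 3, 3, padding=1)  # early mixing block
        self.encoder = small_encoder(3)
        self.head = nn.Conv2d(64, 1, 1)
    def forward(self, modal_inputs):
        mixed = self.mixer(torch.cat(modal_inputs, dim=1))
        return self.head(self.encoder(mixed))


if __name__ == "__main__":
    # Stand-ins for RGB, noise-residual, and Bayar-style inputs.
    modal_inputs = [torch.randn(2, 3, 64, 64) for _ in range(3)]
    print(LateFusion()(modal_inputs).shape, EarlyFusion()(modal_inputs).shape)
```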
2312.01771 Report IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks Jiarui Xu, Yossi Gandelsman, Amir Bar, Jianwei Yang, Jianfeng Gao, Trevor Darrell, Xiaolong Wang In-context learning allows adapting a model to new tasks given a task description at test time. In this paper, we present IMProv - a generative model that is able to in-context learn visual tasks from multimodal prompts. Given a textual description of a visual task (e.g. "Left: input image, Right: foreground segmentation"), a few input-output visual examples, or both, the model in-context learns to solve it for a new test input. We train a masked generative transformer on a new dataset of figures from computer vision papers and their associated captions, together with a captioned large-scale image-text dataset. During inference time, we prompt the model with text and/or image task example(s) and have the model inpaint the corresponding output. We show that training our model with text conditioning and scaling the dataset size improves in-context learning for computer vision tasks by over +10% AP for Foreground Segmentation, over +5% gains in AP for Single Object Detection, and almost 20% lower LPIPS in Colorization. Our empirical results suggest that vision and language prompts are complementary and it is advantageous to use both to achieve better in-context learning performance. Project page is available at https://jerryxu.net/IMProv. Presents IMProv, a generative model for in-context learning of visual tasks from multimodal prompts (text and image), enabling it to solve new tasks given a textual description, visual examples, or both. Addresses the limitations of vision-only in-context learning in computer vision by incorporating the complementary strengths of language and vision for clearer task instruction and ambiguity reduction. Trains a masked generative transformer on a new dataset of captioned computer vision paper figures (S2CV) combined with LAION-400M. At inference, the model receives visual and/or textual prompts and inpaints the output for the given task. Training with text conditioning and a larger dataset significantly improves in-context learning performance on various vision tasks. IMProv achieves superior results compared to vision-only approaches, e.g., +10% AP on foreground segmentation. Demonstrates a trade-off between visual and textual prompt quality - high-quality prompts of one type can compensate for lower-quality prompts of the other. Limited to generating pixel-based outputs, restricting its applicability to tasks representable in the pixel space. Future work involves investigating the impact of incorporating more diverse unstructured data sources on in-context learning capabilities. in-context learning, multimodal learning, computer vision, image inpainting, generative models
2312.01711 Report Regressor-Segmenter Mutual Prompt Learning for Crowd Counting Mingyue Guo, Li Yuan, Zhaoyi Yan, Binghui Chen, Yaowei Wang, Qixiang Ye Crowd counting has achieved significant progress by training regressors to predict instance positions. In heavily crowded scenarios, however, regressors are challenged by uncontrollable annotation variance, which causes density map bias and context information inaccuracy. In this study, we propose mutual prompt learning (mPrompt), which leverages a regressor and a segmenter as guidance for each other, solving bias and inaccuracy caused by annotation variance while distinguishing foreground from background. Specifically, mPrompt leverages point annotations to tune the segmenter and predict pseudo head masks via point prompt learning. It then uses the predicted segmentation masks, which serve as a spatial constraint, to rectify biased point annotations via context prompt learning. mPrompt defines a form of mutual information maximization through prompt learning, mitigating the impact of annotation variance while improving model accuracy. Experiments show that mPrompt significantly reduces the Mean Average Error (MAE), demonstrating its potential as a general framework for downstream vision tasks. This paper proposes a mutual prompt learning (mPrompt) framework for crowd counting that leverages a regressor and a segmenter to guide each other, mitigating the impact of annotation variance. Point annotations in crowded scenes suffer from variance, leading to density map bias and inaccurate context information in crowd counting models. mPrompt utilizes point annotations to train a segmenter, predicting pseudo head masks as point prompts. These masks then act as context prompts, refining the regressor's predictions. This mutual learning process optimizes both branches. mPrompt significantly reduces the Mean Average Error (MAE) on four benchmark datasets. Visualization analysis demonstrates mPrompt's capability to generate more accurate density maps. Ablation studies validate the effectiveness of each component in mPrompt. The current method relies on pre-trained segmenters; exploring end-to-end training without relying on box annotations is a potential future direction. Extending mPrompt to other downstream vision tasks with scarce or noisy labels, such as object detection and visual tracking, is promising. crowd counting, prompt learning, annotation variance, segmentation, density map regression
2312.01663 Report Customize your NeRF: Adaptive Source Driven 3D Scene Editing via Local-Global Iterative Training Runze He, Shaofei Huang, Xuecheng Nie, Tianrui Hui, Luoqi Liu, Jiao Dai, Jizhong Han, Guanbin Li, Si Liu In this paper, we target the adaptive source driven 3D scene editing task by proposing a CustomNeRF model that unifies a text description or a reference image as the editing prompt. However, obtaining desired editing results conformed with the editing prompt is nontrivial since there exist two significant challenges, including accurate editing of only foreground regions and multi-view consistency given a single-view reference image. To tackle the first challenge, we propose a Local-Global Iterative Editing (LGIE) training scheme that alternates between foreground region editing and full-image editing, aimed at foreground-only manipulation while preserving the background. For the second challenge, we also design a class-guided regularization that exploits class priors within the generation model to alleviate the inconsistency problem among different views in image-driven editing. Extensive experiments show that our CustomNeRF produces precise editing results under various real scenes for both text- and image-driven settings. CustomNeRF, a unified framework for adaptive source-driven 3D scene editing using text descriptions or reference images as prompts. Existing 3D scene editing methods lack the flexibility to perform specific edits based on user-provided reference images while preserving the background accurately. The authors propose a novel framework with (1) a foreground-aware NeRF for identifying editable regions, (2) a subject-aware T2I model for embedding reference image subjects into hybrid prompts, and (3) a Local-Global Iterative Editing (LGIE) training scheme for editing foregrounds while preserving backgrounds and a class-guided regularization for view consistency in image-driven editing. CustomNeRF produces precise and view-consistent editing results in both text- and image-driven settings, outperforming baseline methods. The LGIE training scheme effectively edits foreground regions while preserving background content. Class-guided regularization mitigates the Janus problem in image-driven editing, improving cross-view consistency. The method's reliance on Custom Diffusion for transferring subject appearance may result in inconsistencies if Custom Diffusion fails to replicate reference images perfectly. Currently limited to text and image prompts, future work could explore incorporating other editing sources like audio or sketches. 3d scene editing, neural radiance fields (nerf), text-to-image generation, image-driven editing, view consistency
2312.01629 Report CLAMP: Contrastive LAnguage Model Prompt-tuning Piotr Teterwak, Ximeng Sun, Bryan A. Plummer, Kate Saenko, Ser-Nam Lim Large language models (LLMs) have emerged as powerful general-purpose interfaces for many machine learning problems. Recent work has adapted LLMs to generative visual tasks like image captioning, visual question answering, and visual chat, using a relatively small amount of instruction-tuning data. In this paper, we explore whether modern LLMs can also be adapted to classifying an image into a set of categories. First, we evaluate multimodal LLMs that are tuned for generative tasks on zero-shot image classification and find that their performance is far below that of specialized models like CLIP. We then propose an approach for light fine-tuning of LLMs using the same contrastive image-caption matching objective as CLIP. Our results show that LLMs can, indeed, achieve good image classification performance when adapted this way. Our approach beats state-of-the-art mLLMs by 13% and slightly outperforms contrastive learning with a custom text model, while also retaining the LLM's generative abilities. LLM initialization appears to particularly help classification in domains under-represented in the visual pre-training data. The paper introduces CLAMP (Contrastive LAnguage Model Prompt-tuning), a technique to enhance the zero-shot image classification abilities of multimodal Large Language Models (mLLMs). Existing mLLMs excel at generative visual tasks like captioning but struggle with discriminative tasks such as zero-shot image classification. This is a significant limitation as classification is fundamental for a foundation model. CLAMP adapts an LLM by replacing the text encoder of a contrastive vision-language model with the LLM and fine-tunes it using a contrastive image-caption objective. It leverages techniques like Read-Only Prompts, Output Attention Pooling, and LoRA (Low Rank Adaptation) for efficient fine-tuning. CLAMP significantly outperforms state-of-the-art mLLMs on zero-shot image classification by 13%, approaching the performance of CLIP. The LLM initialization in CLAMP proves particularly beneficial for classification in domains under-represented in the visual pre-training data. CLAMP retains the LLM's generative capabilities, showing promise for universal models. One limitation is that CLAMP's current implementation requires separate adapters for discriminative and generative tasks. Future work could explore combining these adapters into a single set for a more unified model. multimodal llms, zero-shot classification, contrastive learning, prompt-tuning, parameter-efficient fine-tuning
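The adaptation objective described above is the standard CLIP-style contrastive image-caption matching loss, with the LLM standing in as the text encoder. The sketch below uses tiny stand-in encoders; the real setup pools LLM outputs (e.g., via read-only prompts and attention pooling) and trains LoRA adapters rather than full weights.

```python
# Sketch of the CLIP-style contrastive objective used to adapt an LLM as a text encoder
# (stand-in encoders and dimensions; not the paper's exact modules).
import torch
import torch.nn as nn
import torch.nn.functional as F

image_encoder = nn.Linear(512, 256)            # stand-in for the vision tower projection
text_encoder = nn.Linear(768, 256)             # stand-in for the pooled LLM text representation
log_temp = nn.Parameter(torch.tensor(2.659))   # learnable temperature (exp(2.659) ~ 14.3)


def clip_loss(img_feats, txt_feats):
    img = F.normalize(image_encoder(img_feats), dim=-1)
    txt = F.normalize(text_encoder(txt_feats), dim=-1)
    logits = log_temp.exp() * img @ txt.t()     # (B, B) similarity matrix
    labels = torch.arange(len(img))
    # Symmetric InfoNCE: each image matches its own caption and vice versa.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


if __name__ == "__main__":
    img_feats, txt_feats = torch.randn(8, 512), torch.randn(8, 768)
    loss = clip_loss(img_feats, txt_feats)
    loss.backward()
    print(loss.item())
```

At zero-shot inference time, class names are templated into captions, embedded by the adapted LLM, and the image is assigned to the class with the highest cosine similarity, exactly as with CLIP.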
2312.01623 Report Universal Segmentation at Arbitrary Granularity with Language Instruction Yong Liu, Cairong Zhang, Yitong Wang, Jiahao Wang, Yujiu Yang, Yansong Tang This paper aims to achieve universal segmentation at arbitrary semantic levels. Despite significant progress in recent years, specialist segmentation approaches are limited to specific tasks and data distributions. Retraining a new model to adapt to new scenarios or settings incurs substantial computation and time costs, which raises the demand for a versatile and universal segmentation model that can cater to various granularities. Although some attempts have been made to unify different segmentation tasks or generalize to various scenarios, limitations in the definition of paradigms and input-output spaces make it difficult for them to achieve an accurate understanding of content at arbitrary granularity. To this end, we present UniLSeg, a universal segmentation model that can perform segmentation at any semantic level with the guidance of language instructions. For training UniLSeg, we reorganize a group of tasks from original diverse distributions into a unified data format, taking images with texts describing segmentation targets as input and the corresponding masks as output. Combined with an automatic annotation engine for utilizing numerous unlabeled data, UniLSeg achieves excellent performance on various tasks and settings, surpassing both specialist and unified segmentation models. Proposes UniLSeg, a universal segmentation model that uses language instructions to segment images at any semantic level. Existing segmentation models are often task-specific and struggle to adapt to diverse scenarios and granularities. UniLSeg addresses this by using flexible language prompts for universal segmentation. UniLSeg employs a two-stream decoding structure for visual-linguistic interaction, enabling segmentation at various levels. It's trained on a unified dataset of images, masks, and captions, incorporating data from various segmentation tasks. An automatic annotation engine generates pseudo-labels for unlabeled data, enhancing training. Outperforms state-of-the-art methods in referring image segmentation, achieving 79.27% vs 73.41% IoU on G-Ref. Achieves state-of-the-art performance in salient object detection, surpassing previous methods on ECSSD, SOD, and PASCAL-S. Demonstrates strong performance in semantic segmentation, surpassing previous unified models in the in-vocabulary setting and achieving competitive results in the open-vocabulary setting. Current implementation processes videos frame-by-frame, lacking temporal understanding for video segmentation. Performance on semantic segmentation, while exceeding other unified models, remains slightly lower than specialized models. universal segmentation, language-guided vision, visual-linguistic interaction, multi-task learning, automatic annotation
2312.01597 Report SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference Feng Wang, Jieru Mei, Alan Yuille Recent advances in contrastive language-image pretraining (CLIP) have demonstrated strong capabilities in zero-shot classification by aligning visual representations with target text embeddings in an image level. However, in dense prediction tasks, CLIP often struggles to localize visual features within an image and fails to give accurate pixel-level predictions, which prevents it from functioning as a generalized visual foundation model. In this work, we aim to enhance CLIP's potential for semantic segmentation with minimal modifications to its pretrained models. By rethinking self-attention, we surprisingly find that CLIP can adapt to dense prediction tasks by simply introducing a novel Correlative Self-Attention (CSA) mechanism. Specifically, we replace the traditional self-attention block of CLIP vision encoder's last layer by our CSA module and reuse its pretrained projection matrices of query, key, and value, leading to a training-free adaptation approach for CLIP's zero-shot semantic segmentation. Extensive experiments show the advantage of CSA: we obtain a 38.2% average zero-shot mIoU across eight semantic segmentation benchmarks highlighted in this paper, significantly outperforming the existing SoTA's 33.9% and the vanilla CLIP's 14.1%. The paper proposes SCLIP, a segmentation-adapted CLIP model for zero-shot semantic segmentation. It leverages a novel Correlative Self-Attention (CSA) mechanism to improve CLIP's dense prediction capabilities without requiring fine-tuning. Vanilla CLIP struggles with localizing visual features in images, making it unsuitable for semantic segmentation. SCLIP addresses this issue and enhances CLIP's potential as a general-purpose visual foundation model. SCLIP introduces the CSA module, which replaces the original self-attention block in CLIP's vision encoder. CSA computes attention scores based on pairwise correlations between local visual tokens, promoting spatial covariance and enabling accurate localization. SCLIP achieves state-of-the-art zero-shot semantic segmentation results, significantly outperforming baselines like MaskCLIP and TCL on eight benchmarks. The CSA module is robust and insensitive to projection matrix parameters, allowing for training-free adaptation of pretrained CLIP models. SCLIP demonstrates the effectiveness of incorporating semantic correlations between local features for improved visual reasoning in dense prediction tasks. The paper primarily focuses on adapting CLIP's vision transformer encoder and does not explore modifications to the language encoder. Future work could explore alternative architectural choices within the CSA module or investigate its effectiveness in other dense prediction tasks beyond semantic segmentation. semantic segmentation, zero-shot learning, vision-language models, clip, self-attention
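One plausible instantiation of the Correlative Self-Attention described above (attention scores from pairwise token correlations, reusing the pretrained query/key projections) looks as follows. This is a single-head, illustrative sketch with random stand-in weights, not the released implementation; note that summing the two softmaxed maps gives rows that sum to 2, which only rescales the output.

```python
# Sketch of Correlative Self-Attention (CSA) for the last CLIP vision layer (single-head, illustrative).
import torch
import torch.nn.functional as F


def correlative_self_attention(x, w_q, w_k, w_v, w_o):
    """x: (B, N, C) visual tokens; w_*: pretrained (C, C) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scale = q.shape[-1] ** -0.5
    # Pairwise correlations among tokens, instead of the usual q @ k^T,
    # which keeps each token's attention concentrated on semantically similar locations.
    attn = F.softmax(scale * q @ q.transpose(-2, -1), dim=-1) \
         + F.softmax(scale * k @ k.transpose(-2, -1), dim=-1)
    return (attn @ v) @ w_o


if __name__ == "__main__":
    B, N, C = 1, 196, 768                       # e.g. ViT-B/16 patch tokens (assumed sizes)
    x = torch.randn(B, N, C)
    w_q, w_k, w_v, w_o = (torch.randn(C, C) * C ** -0.5 for _ in range(4))
    print(correlative_self_attention(x, w_q, w_k, w_v, w_o).shape)  # (1, 196, 768)
```

Because only the attention computation changes and all projection matrices are reused, the adaptation is training-free: the per-token outputs are simply compared against the class-name text embeddings to obtain pixel-level predictions.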
2312.01531 Report SANeRF-HQ: Segment Anything for NeRF in High Quality Yichen Liu, Benran Hu, Chi-Keung Tang, Yu-Wing Tai Recently, the Segment Anything Model (SAM) has showcased remarkable capabilities of zero-shot segmentation, while NeRF (Neural Radiance Fields) has gained popularity as a method for various 3D problems beyond novel view synthesis. Though there exist initial attempts to incorporate these two methods into 3D segmentation, they face the challenge of accurately and consistently segmenting objects in complex scenarios. In this paper, we introduce the Segment Anything for NeRF in High Quality (SANeRF-HQ) to achieve high-quality 3D segmentation of any target object in a given scene. SANeRF-HQ utilizes SAM for open-world object segmentation guided by user-supplied prompts, while leveraging NeRF to aggregate information from different viewpoints. To overcome the aforementioned challenges, we employ density field and RGB similarity to enhance the accuracy of segmentation boundary during the aggregation. Emphasizing on segmentation accuracy, we evaluate our method on multiple NeRF datasets where high-quality ground-truths are available or manually annotated. SANeRF-HQ shows a significant quality improvement over state-of-the-art methods in NeRF object segmentation, provides higher flexibility for object localization, and enables more consistent object segmentation across multiple views. Results and code are available at the project site: https://lyclyc52.github.io/SANeRF-HQ/. The paper introduces SANeRF-HQ, a novel framework that combines Segment Anything Model (SAM) and Neural Radiance Fields (NeRF) to achieve high-quality 3D segmentation in complex scenes. Existing methods for 3D segmentation in NeRF struggle with accuracy, consistency across views, and generalization to open-world scenarios. SANeRF-HQ addresses these limitations by leveraging the power of SAM and the multi-view aggregation capabilities of NeRF. SANeRF-HQ consists of a feature container (cache or distilled feature field), a mask decoder, and a mask aggregator. It encodes images into SAM features, propagates user prompts to generate 2D masks, and aggregates these masks in 3D using an object field. A Ray-Pair RGB loss further improves boundary accuracy. SANeRF-HQ quantitatively outperforms state-of-the-art methods like SA3D and ISRF on multiple NeRF datasets. The method generates consistent 3D segmentations across different viewpoints. The use of a Ray-Pair RGB loss leads to more accurate segmentation boundaries, especially in challenging cases. The performance of SANeRF-HQ relies on the quality of the pre-trained NeRF model and may be affected by scene complexity. The Ray-Pair RGB loss might not be universally applicable, particularly when dealing with objects that share similar colors and textures. Future work could focus on enhancing its robustness. 3d segmentation, neural radiance fields, segment anything model, multi-view consistency, zero-shot segmentation
2312.01409 Report Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models Shengqu Cai, Duygu Ceylan, Matheus Gadelha, Chun-Hao Paul Huang, Tuanfeng Yang Wang, Gordon Wetzstein Traditional 3D content creation tools empower users to bring their imagination to life by giving them direct control over a scene's geometry, appearance, motion, and camera path. Creating computer-generated videos, however, is a tedious manual process, which can be automated by emerging text-to-video diffusion models. Despite great promise, video diffusion models are difficult to control, hindering a user to apply their own creativity rather than amplifying it. To address this challenge, we present a novel approach that combines the controllability of dynamic 3D meshes with the expressivity and editability of emerging diffusion models. For this purpose, our approach takes an animated, low-fidelity rendered mesh as input and injects the ground truth correspondence information obtained from the dynamic mesh into various stages of a pre-trained text-to-image generation model to output high-quality and temporally consistent frames. We demonstrate our approach on various examples where motion can be obtained by animating rigged assets or changing the camera path. This paper presents Generative Rendering, a novel framework that combines the controllability of 3D modeling with the expressiveness of text-to-image (T2I) diffusion models for generating stylized animations. Existing text-to-video models lack fine-grained control over scene layout and motion, while traditional 3D workflows are time-consuming and require expertise. This work bridges this gap by enabling controllable video generation using pre-trained T2I models. The method leverages 4D spatio-temporal correspondences from animated 3D meshes to guide the image generation process. Key innovations include UV-space noise initialization for temporal consistency and correspondence-aware blending of self-attention features for consistent appearance synthesis. Generative Rendering demonstrates superior frame consistency and prompt fidelity compared to adapted baselines. The method supports camera and object rotations, physical simulations, and character animations, showcasing its versatility. The proposed UV-space feature injection and noise initialization significantly improve temporal consistency in generated animations. The method's reliance on multi-step diffusion inference limits real-time animation capabilities. Handling large environmental changes and dramatic perspective shifts remains challenging due to limitations in feature correspondence. video generation, diffusion models, 3d animation, text-to-image synthesis, generative ai
2312.01381 Report Language-driven All-in-one Adverse Weather Removal Hao Yang, Liyuan Pan, Yan Yang, Wei Liang All-in-one (AiO) frameworks restore various adverse weather degradations with a single set of networks jointly. To handle various weather conditions, an AiO framework is expected to adaptively learn weather-specific knowledge for different degradations and shared knowledge for common patterns. However, existing methods: 1) rely on extra supervision signals, which are usually unknown in real-world applications; 2) employ fixed network structures, which restrict the diversity of weather-specific knowledge. In this paper, we propose a Language-driven Restoration framework (LDR) to alleviate the aforementioned issues. First, we leverage the power of pre-trained vision-language (PVL) models to enrich the diversity of weather-specific knowledge by reasoning about the occurrence, type, and severity of degradation, generating description-based degradation priors. Then, with the guidance of degradation prior, we sparsely select restoration experts from a candidate list dynamically based on a Mixture-of-Experts (MoE) structure. This enables us to adaptively learn the weather-specific and shared knowledge to handle various weather conditions (e.g., unknown or mixed weather). Experiments on extensive restoration scenarios show our superior performance (see Fig. 1). The source code will be made available. This paper proposes LDR, a Language-driven Restoration framework for removing various adverse weather conditions in an all-in-one solution. Existing methods struggle to handle diverse weather conditions, often relying on extra supervision signals or fixed network structures. LDR overcomes these limitations by leveraging pre-trained vision-language models. LDR uses a pre-trained vision-language model to generate degradation priors, which are then used to dynamically select restoration experts from a candidate list based on a Mixture-of-Experts structure. The selected experts are applied pixel-wisely to restore weather-specific features. LDR significantly outperforms existing general and all-in-one methods on benchmark datasets. The method effectively handles images with varying degradation severity, outperforming baselines in heavily degraded cases. LDR generalizes well to unseen weather conditions, successfully restoring images degraded by haze even though it was trained only on rain, snow, and raindrop degradations. The reliance on pre-trained vision-language models introduces a dependence on the quality and reasoning capabilities of those models. Future work could explore extending LDR to handle a wider range of image degradations beyond adverse weather conditions. image restoration, adverse weather removal, vision-language models, mixture-of-experts, degradation prior
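The "sparsely select restoration experts" step can be sketched as generic top-k Mixture-of-Experts routing gated by a degradation-prior embedding. The experts and the prior vector below are stand-ins (the paper derives the prior from a pre-trained vision-language model); only the routing pattern is illustrated.

```python
# Sketch of degradation-prior-guided sparse expert selection (generic top-k MoE routing; stand-in modules).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseRestorationMoE(nn.Module):
    def __init__(self, channels=64, prior_dim=512, n_experts=6, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(channels, channels, 3, padding=1))
            for _ in range(n_experts))
        self.gate = nn.Linear(prior_dim, n_experts)    # scores experts from the degradation prior
        self.top_k = top_k

    def forward(self, feat, prior):                    # feat: (B, C, H, W), prior: (B, prior_dim)
        scores = self.gate(prior)                      # (B, n_experts)
        topv, topi = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(topv, dim=-1)              # renormalize over the selected experts
        outs = []
        for b in range(feat.shape[0]):                 # simple per-sample sparse dispatch
            mix = sum(w * self.experts[int(i)](feat[b:b + 1])
                      for w, i in zip(weights[b], topi[b]))
            outs.append(mix)
        return torch.cat(outs, dim=0)


if __name__ == "__main__":
    feat = torch.randn(2, 64, 32, 32)
    prior = torch.randn(2, 512)   # e.g. an embedding of the predicted degradation description
    print(SparseRestorationMoE()(feat, prior).shape)   # (2, 64, 32, 32)
```

Because different priors activate different expert subsets, weather-specific knowledge lives in individual experts while shared patterns stay in the common backbone, which is the adaptivity the summary highlights.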
2312.01305 Report ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models Jeong-gi Kwak, Erqun Dong, Yuhe Jin, Hanseok Ko, Shweta Mahajan, Kwang Moo Yi Generating novel views of an object from a single image is a challenging task. It requires an understanding of the underlying 3D structure of the object from an image and rendering high-quality, spatially consistent new views. While recent methods for view synthesis based on diffusion have shown great progress, achieving consistency among various view estimates and at the same time abiding by the desired camera pose remains a critical problem yet to be solved. In this work, we demonstrate a strikingly simple method, where we utilize a pre-trained video diffusion model to solve this problem. Our key idea is that synthesizing a novel view could be reformulated as synthesizing a video of a camera going around the object of interest -- a scanning video -- which then allows us to leverage the powerful priors that a video diffusion model would have learned. Thus, to perform novel-view synthesis, we create a smooth camera trajectory to the target view that we wish to render, and denoise using both a view-conditioned diffusion model and a video diffusion model. By doing so, we obtain a highly consistent novel view synthesis, outperforming the state of the art. This paper proposes a novel method for single-image novel view synthesis that leverages pre-trained video diffusion models to improve the consistency of generated views without requiring any additional training or fine-tuning. Existing diffusion-based novel view synthesis methods often struggle with maintaining consistency in object pose and content across different generated views. This method addresses this limitation by using video diffusion models as a prior to ensure smoother transitions and greater fidelity to the input image. The method synthesizes a sequence of views along a smooth camera trajectory from the input image to the target view. It then leverages a pre-trained view-conditioned diffusion model and a video diffusion model jointly during the denoising process to generate the final novel view. The method significantly improves consistency in object pose and content across generated views compared to existing 2D novel view synthesis techniques. It outperforms state-of-the-art methods on standard image quality metrics such as PSNR, SSIM, and LPIPS. A novel optical flow-based metric demonstrates the superior performance of the method in generating spatially consistent novel views. The method currently lacks an explicit 3D model and can exhibit inconsistencies when generating views from very different angles. Future work will explore incorporating explicit 3D pipelines and leveraging the method for high-resolution and editable novel view rendering. novel view synthesis, diffusion models, video diffusion models, single image view synthesis, 3d consistency
2312.01280 Report Brain Decodes Deep Nets Huzheng Yang, James Gee, Jianbo Shi We developed a tool for visualizing and analyzing large pre-trained vision models by mapping them onto the brain, thus exposing what is hidden inside them. Our innovation arises from a surprising usage of brain encoding: predicting brain fMRI measurements in response to images. We report two findings. First, explicit mapping between the brain and deep-network features across dimensions of space, layers, scales, and channels is crucial. This mapping method, FactorTopy, is plug-and-play for any deep-network; with it, one can paint a picture of the network onto the brain (literally!). Second, our visualization shows how different training methods matter: they lead to remarkable differences in hierarchical organization and scaling behavior, growing with more data or network capacity. It also provides insight into fine-tuning: how pre-trained models change when adapting to small datasets. We found that brain-like, hierarchically organized networks suffer less from catastrophic forgetting after fine-tuning. A novel visualization tool, FactorTopy, is introduced, using brain encoding models to map deep network features onto the brain, exposing their internal workings. This tool allows for analyzing how different training objectives and model scales affect the hierarchical organization of deep networks, ultimately impacting their performance and generalization capabilities. FactorTopy employs a factorized feature selection approach across space, layer, scale, and channel dimensions, constrained by brain topology for robust network-to-brain mapping. This mapping is then visualized by coloring the brain based on the dominant layer selected for each voxel. Training objectives matter: CLIP aligns hierarchically with the brain, while supervised methods like ImageNet and SAM show bottom-up and top-down structures. Scaling networks: CLIP's brain alignment improves with increasing size and data, while other models exhibit a decrease. Fine-tuning: CLIP maintains its hierarchical structure and suffers less catastrophic forgetting compared to models like DiNOv2 and SAM. Reliance on high-quality brain-encoding data, which is currently limited. Potential for limited applicability to network designs drastically different from the brain's structure. brain encoding, deep network visualization, hierarchical organization, fine-tuning, catastrophic forgetting
2312.01255 Report Meta ControlNet: Enhancing Task Adaptation via Meta Learning Junjie Yang, Jinze Zhao, Peihao Wang, Zhangyang Wang, Yingbin Liang Diffusion-based image synthesis has attracted extensive attention recently. In particular, ControlNet that uses image-based prompts exhibits powerful capability in image tasks such as canny edge detection and generates images well aligned with these prompts. However, vanilla ControlNet generally requires extensive training of around 5000 steps to achieve a desirable control for a single task. Recent context-learning approaches have improved its adaptability, but mainly for edge-based tasks, and rely on paired examples. Thus, two important open issues are yet to be addressed to reach the full potential of ControlNet: (i) zero-shot control for certain tasks and (ii) faster adaptation for non-edge-based tasks. In this paper, we introduce a novel Meta ControlNet method, which adopts the task-agnostic meta learning technique and features a new layer freezing design. Meta ControlNet significantly reduces learning steps to attain control ability from 5000 to 1000. Further, Meta ControlNet exhibits direct zero-shot adaptability in edge-based tasks without any finetuning, and achieves control within only 100 finetuning steps in more complex non-edge tasks such as Human Pose, outperforming all existing methods. The code is available at https://github.com/JunjieYang97/Meta-ControlNet. This paper introduces Meta ControlNet, a novel approach for fast and adaptable image synthesis by learning a generalizable ControlNet initialization using meta-learning and a novel layer freezing design. Vanilla ControlNet requires extensive training for task-specific control, and existing adaptations struggle with zero-shot learning and fast adaptation for non-edge-based tasks. This work addresses these limitations. The paper proposes Meta ControlNet, which leverages a FO-MAML framework with various image conditions as meta-tasks and freezes specific encoder and middle blocks during training to enable rapid adaptation. Meta ControlNet achieves control ability in 1000 steps, significantly faster than vanilla ControlNet's 5000 steps. The method demonstrates zero-shot adaptation for edge-based tasks like Canny edge detection. Meta ControlNet adapts quickly to challenging non-edge tasks, controlling human pose in 100 steps and human pose mapping in 200 steps. The model exhibits minor errors in distinguishing between humans and animals in tasks like human pose mapping. Future work can explore improving the adaptation speed disparity between tasks aligned with Stable Diffusion's strengths and those requiring learning new representations. image synthesis, controlnet, meta-learning, zero-shot learning, few-shot learning
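The FO-MAML-with-layer-freezing recipe mentioned above can be illustrated with a heavily simplified loop: a tiny stand-in network replaces ControlNet, `tasks` yields one (condition, target) batch per image condition (e.g., canny, depth, pose), and the same batch serves as both support and query for brevity. Everything here is an illustrative assumption, not the released training code.

```python
# Sketch of a first-order meta-learning step with frozen layers (stand-in model, simplified).
import copy
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 3, 3, padding=1))
# Freeze a subset of layers, analogous to freezing ControlNet encoder/middle blocks.
for p in model[0].parameters():
    p.requires_grad_(False)

trainable = [p for p in model.parameters() if p.requires_grad]
meta_opt = torch.optim.AdamW(trainable, lr=1e-4)
loss_fn = nn.MSELoss()


def fomaml_step(tasks, inner_lr=1e-3, inner_steps=2):
    meta_opt.zero_grad()
    for cond, target in tasks:                          # one batch per meta-task (image condition)
        fast = copy.deepcopy(model)                     # task-specific fast weights
        fast_params = [p for p in fast.parameters() if p.requires_grad]
        for _ in range(inner_steps):                    # inner-loop adaptation
            loss = loss_fn(fast(cond), target)
            grads = torch.autograd.grad(loss, fast_params)
            with torch.no_grad():
                for p, g in zip(fast_params, grads):
                    p -= inner_lr * g
        # First-order MAML: accumulate the post-adaptation gradient onto the shared weights.
        outer_loss = loss_fn(fast(cond), target)
        grads = torch.autograd.grad(outer_loss, fast_params)
        for p, g in zip(trainable, grads):
            p.grad = g if p.grad is None else p.grad + g
    meta_opt.step()


if __name__ == "__main__":
    tasks = [(torch.randn(2, 3, 32, 32), torch.randn(2, 3, 32, 32)) for _ in range(3)]
    fomaml_step(tasks)
    print("meta step done")
```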
2312.01196 Report Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction Devikalyan Das, Christopher Wewer, Raza Yunus, Eddy Ilg, Jan Eric Lenssen Reconstructing dynamic objects from monocular videos is a severely underconstrained and challenging problem, and recent work has approached it in various directions. However, owing to the ill-posed nature of this problem, there has been no solution that can provide consistent, high-quality novel views from camera positions that are significantly different from the training views. In this work, we introduce Neural Parametric Gaussians (NPGs) to take on this challenge by imposing a two-stage approach: first, we fit a low-rank neural deformation model, which then is used as regularization for non-rigid reconstruction in the second stage. The first stage learns the object's deformations such that it preserves consistency in novel views. The second stage obtains high reconstruction quality by optimizing 3D Gaussians that are driven by the coarse model. To this end, we introduce a local 3D Gaussian representation, where temporally shared Gaussians are anchored in and deformed by local oriented volumes. The resulting combined model can be rendered as radiance fields, resulting in high-quality photo-realistic reconstructions of the non-rigidly deforming objects. We demonstrate that NPGs achieve superior results compared to previous works, especially in challenging scenarios with few multi-view cues. Presents Neural Parametric Gaussians (NPGs), a two-stage approach for high-quality, non-rigid object reconstruction from monocular videos. Monocular non-rigid reconstruction is highly underconstrained, and existing methods struggle to produce temporally consistent results, especially for novel views. Stage 1 learns a coarse point model with low-rank deformation for temporal regularization. Stage 2 optimizes 3D Gaussians within local volumes defined by the point model, capturing fine details. Achieves state-of-the-art novel view synthesis on the D-NeRF dataset. Significantly outperforms previous methods on the challenging Unbiased4D dataset with limited multi-view cues. Provides temporally consistent reconstructions with high-frequency details even for complex object motions. Performance depends on the complexity of sequences (e.g., camera motion, speed, and extent of deformations). May struggle with scenes where the template initialization fails (e.g., collapses to a flat surface). non-rigid reconstruction, novel view synthesis, monocular video, 3d gaussians, neural parametric models
2312.01129 Report ControlDreamer: Stylized 3D Generation with Multi-View ControlNet Yeongtak Oh, Jooyoung Choi, Yongsung Kim, Minjun Park, Chaehun Shin, Sungroh Yoon Recent advancements in text-to-3D generation have significantly contributed to the automation and democratization of 3D content creation. Building upon these developments, we aim to address the limitations of current methods in generating 3D models with creative geometry and styles. We introduce multi-view ControlNet, a novel depth-aware multi-view diffusion model trained on generated datasets from a carefully curated text corpus. Our multi-view ControlNet is then integrated into our two-stage pipeline, ControlDreamer, enabling text-guided generation of stylized 3D models. Additionally, we present a comprehensive benchmark for 3D style editing, encompassing a broad range of subjects, including objects, animals, and characters, to further facilitate research on diverse 3D generation. Our comparative analysis reveals that this new pipeline outperforms existing text-to-3D methods as evidenced by human evaluations and CLIP score metrics. Introduces ControlDreamer, a two-stage text-to-3D generation pipeline that uses a novel depth-aware multi-view diffusion model (MV-ControlNet) to create stylized 3D models. Addresses limitations of current text-to-3D methods in generating creative geometry and styles by separating geometry and style generation stages and leveraging depth information. Combines NeRF generation with DMTet mesh refinement, trained with a novel MV-ControlNet that leverages depth maps for aligning style with generated geometry. Trained on a dataset generated from a curated 100K text corpus. Outperforms existing two-stage pipelines in generating styles on 3D models, as evidenced by CLIP score metrics and human assessments. MV-ControlNet effectively aligns diverse geometries and styles, leading to high-quality 3D model generation. Depth-aware MV-ControlNet surpasses normals and edges-aware variants in rendering detailed textures and geometries. Errors in pre-trained depth estimators can cause artifacts. Limited to 256x256 resolution due to MVDream's training resolution. text-to-3d, two-stage pipeline, multi-view diffusion model, controlnet, 3d style editing
2312.01068 Report DPHMs: Diffusion Parametric Head Models for Depth-based Tracking Jiapeng Tang, Angela Dai, Yinyu Nie, Lev Markhasin, Justus Thies, Matthias Niessner We introduce Diffusion Parametric Head Models (DPHMs), a generative model that enables robust volumetric head reconstruction and tracking from monocular depth sequences. While recent volumetric head models, such as NPHMs, can now excel in representing high-fidelity head geometries, tracking and reconstructing heads from real-world single-view depth sequences remains very challenging, as the fitting to partial and noisy observations is underconstrained. To tackle these challenges, we propose a latent diffusion-based prior to regularize volumetric head reconstruction and tracking. This prior-based regularizer effectively constrains the identity and expression codes to lie on the underlying latent manifold which represents plausible head shapes. To evaluate the effectiveness of the diffusion-based prior, we collect a dataset of monocular Kinect sequences consisting of various complex facial expression motions and rapid transitions. We compare our method to state-of-the-art tracking methods and demonstrate improved head identity reconstruction as well as robust expression tracking. Introduces Diffusion Parametric Head Models (DPHMs), a novel generative model for robust volumetric head reconstruction and tracking from monocular depth sequences, by incorporating diffusion-based priors into neural parametric head models (NPHMs). Addresses the challenge of reconstructing and tracking heads from noisy and partial depth data, which often leads to overfitting and unrealistic results with traditional NPHMs. Leverages a latent diffusion model to learn the distribution of identity and expression parameters in NPHMs, enabling effective regularization of these parameters during fitting to real-world depth sequences. Achieves more accurate head identity reconstruction compared to state-of-the-art methods, especially in capturing fine-grained hair geometries. Demonstrates robust and coherent facial expression tracking, even for complex and rapid transitions, by constraining latent optimization within plausible head shape manifolds. Outperforms existing methods in quantitative evaluations on a new challenging benchmark dataset (DPHM-Kinect) and a multi-view video dataset (NerSemble). Current implementation has slower inference due to test-time optimization of neural parametric models. Future work will focus on real-time head tracking solutions and incorporating RGB images for enhanced hair reconstruction. head reconstruction, facial tracking, diffusion models, depth sensors, 3d avatars
2312.01027 Report LDM-ISP: Enhancing Neural ISP for Low Light with Latent Diffusion Models Qiang Wen, Yazhou Xing, Zhefan Rao, Qifeng Chen Enhancing a low-light noisy RAW image into a well-exposed and clean sRGB image is a significant challenge for modern digital cameras. Prior approaches have difficulties in recovering fine-grained details and true colors of the scene under extremely low-light environments due to near-to-zero SNR. Meanwhile, diffusion models have shown significant progress towards general domain image generation. In this paper, we propose to leverage the pre-trained latent diffusion model to perform the neural ISP for enhancing extremely low-light images. Specifically, to tailor the pre-trained latent diffusion model to operate on the RAW domain, we train a set of lightweight taming modules to inject the RAW information into the diffusion denoising process via modulating the intermediate features of UNet. We further observe different roles of UNet denoising and decoder reconstruction in the latent diffusion model, which inspires us to decompose the low-light image enhancement task into latent-space low-frequency content generation and decoding-phase high-frequency detail maintenance. Through extensive experiments on representative datasets, we demonstrate our simple design not only achieves state-of-the-art performance in quantitative evaluations but also shows significant superiority in visual comparisons over strong baselines, which highlight the effectiveness of powerful generative priors for neural ISP under extremely low-light environments. The project page is available at https://csqiangwen.github.io/projects/ldm-isp/ This paper presents LDM-ISP, a novel method leveraging a pre-trained latent diffusion model and taming modules to enhance neural ISP for low-light image enhancement. Existing low-light enhancement methods struggle to recover fine-grained details and true colors, especially in extremely low-light conditions due to limited training data. This work explores the potential of powerful generative priors from pre-trained diffusion models to address these limitations. LDM-ISP inserts trainable taming modules into a frozen pre-trained latent diffusion model (Stable Diffusion). It employs 2D discrete wavelet transforms to decompose the input RAW image into low- and high-frequency sub-bands. The low-frequency sub-band guides the UNet for content generation, while the high-frequency sub-bands are used to maintain details during the decoding phase. LDM-ISP achieves state-of-the-art performance on three benchmark datasets (SID-Sony, ELD-Sony, LRD) quantitatively and qualitatively. The method effectively recovers structural information and enhances details, even in extremely dark and noisy regions. Taming the decoder with high-frequency information is crucial for accurate color correction and detail preservation. The inference speed of LDM-ISP is limited by the DDIM sampling process in the diffusion model. Exploring the combination of text prompts with the proposed method for flexible low-light image editing is a promising research direction. low-light image enhancement, neural image signal processing, latent diffusion model, generative priors, taming modules
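The frequency split described above can be illustrated with a single 2D discrete wavelet transform: the low-frequency sub-band would guide the UNet while the high-frequency sub-bands would guide the decoder. The snippet below uses PyWavelets on a random stand-in for a packed RAW frame; it only demonstrates the decomposition, not the paper's taming modules.

```python
import numpy as np
import pywt

raw = np.random.rand(512, 512).astype(np.float32)   # stand-in for a packed RAW frame
low, (lh, hl, hh) = pywt.dwt2(raw, "haar")           # LL sub-band + (LH, HL, HH) detail sub-bands
print(low.shape, lh.shape)                           # each sub-band is 256x256
```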
2312.01026 Report Token Fusion: Bridging the Gap between Token Pruning and Token Merging Minchul Kim, Shangqian Gao, Yen-Chang Hsu, Yilin Shen, Hongxia Jin Vision Transformers (ViTs) have emerged as powerful backbones in computer vision, outperforming many traditional CNNs. However, their computational overhead, largely attributed to the self-attention mechanism, makes deployment on resource-constrained edge devices challenging. Multiple solutions rely on token pruning or token merging. In this paper, we introduce "Token Fusion" (ToFu), a method that amalgamates the benefits of both token pruning and token merging. Token pruning proves advantageous when the model exhibits sensitivity to input interpolations, while token merging is effective when the model manifests close to linear responses to inputs. We combine this to propose a new scheme called Token Fusion. Moreover, we tackle the limitations of average merging, which doesn't preserve the intrinsic feature norm, resulting in distributional shifts. To mitigate this, we introduce MLERP merging, a variant of the SLERP technique, tailored to merge multiple tokens while maintaining the norm distribution. ToFu is versatile, applicable to ViTs with or without additional training. Our empirical evaluations indicate that ToFu establishes new benchmarks in both classification and image generation tasks concerning computational efficiency and model accuracy. Introduced "Token Fusion" (ToFu), a method that combines the advantages of token pruning and token merging to accelerate Vision Transformers (ViTs) while preserving accuracy. ViTs are powerful but computationally expensive, hindering deployment on resource-constrained devices. ToFu addresses this by reducing computational overhead without significant accuracy loss. ToFu dynamically switches between pruning and merging based on the model's sensitivity to input interpolations at different depths. It also introduces MLERP merging, a variant of SLERP, to preserve feature norm distribution during merging. ToFu achieves state-of-the-art speed and accuracy trade-offs on ImageNet classification compared to existing token reduction methods. MLERP merging outperforms average merging in both accuracy and speed. ToFu demonstrates effectiveness in image generation tasks, improving efficiency while maintaining image quality in Stable Diffusion. The selection of the merging strategy switching point (d) currently relies on a hyperparameter search. Further investigation into the theoretical properties of MLERP merging and its impact on model optimization. vision transformer, token pruning, token merging, model compression, efficient inference
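The norm-preservation issue is easy to see in code: averaging token embeddings shrinks their norm, which shifts the feature distribution. Below is a simplified norm-preserving merge in the spirit of MLERP, contrasted with plain averaging; the paper's actual SLERP-style formula differs, and the shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def average_merge(tokens):
    return tokens.mean(dim=0)                         # plain averaging shrinks the norm

def norm_preserving_merge(tokens):
    # tokens: (k, dim) — the group selected for merging
    direction = F.normalize(tokens.mean(dim=0), dim=-1)
    norm = tokens.norm(dim=-1).mean()                 # keep the average feature norm
    return direction * norm

group = torch.randn(4, 768)
print(average_merge(group).norm(),                    # noticeably smaller
      norm_preserving_merge(group).norm(),            # matches the mean token norm
      group.norm(dim=-1).mean())
```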
2312.00971 Report Consistent Mesh Diffusion Julian Knodt, Xifeng Gao Given a 3D mesh with a UV parameterization, we introduce a novel approach to generating textures from text prompts. While prior work uses optimization from Text-to-Image Diffusion models to generate textures and geometry, this is slow and requires significant compute resources. Alternatively, there are projection based approaches that use the same Text-to-Image models to paint images onto a mesh, but lack consistency at different viewing angles. We propose a method that uses a single Depth-to-Image diffusion network, and generates a single consistent texture when rendered on the 3D surface by first unifying multiple 2D images' diffusion paths, and hoisting that to 3D with MultiDiffusion. We demonstrate our approach on a dataset containing 30 meshes, taking approximately 5 minutes per mesh. To evaluate the quality of our approach, we use CLIP-score and Frechet Inception Distance (FID) on the renderings, and show our improvement over prior work. This paper presents a novel method for generating consistent textures on 3D meshes from text prompts using a single Depth-to-Image diffusion network. Generating high-quality 3D models with textures is crucial for various applications like games and shopping apps. Existing methods are either computationally expensive or produce inconsistent textures across different views. The proposed method leverages the concept of MultiDiffusion and extends it to 3D mesh texturing. It utilizes a spherical harmonic latent texture map to render the mesh in latent space, enabling joint denoising of multiple views from a single diffusion pass. To further enhance consistency, it incorporates GAN inversion in latent space and weights pixel importance based on surface normals. The method generates consistent textures with fewer seams and artifacts compared to previous approaches like TEXTure and Text2Tex. Quantitative evaluation using CLIP-Score and FID shows comparable or superior performance to baseline methods. The approach is computationally efficient, taking approximately 5 minutes per mesh on a single NVIDIA GeForce RTX 3090. The method may still suffer from the multi-Janus problem where multiple faces are generated from different views. The reliance on text prompts for texture generation can be imprecise and ambiguous, leading to inconsistent results. Exploring image-based guidance could address this limitation. mesh texturing, text-to-3d, diffusion models, multi-view consistency, gan inversion
2312.00944 Report Enhancing Diffusion Models with 3D Perspective Geometry Constraints Rishi Upadhyay, Howard Zhang, Yunhao Ba, Ethan Yang, Blake Gella, Sicheng Jiang, Alex Wong, Achuta Kadambi While perspective is a well-studied topic in art, it is generally taken for granted in images. However, for the recent wave of high-quality image synthesis methods such as latent diffusion models, perspective accuracy is not an explicit requirement. Since these methods are capable of outputting a wide gamut of possible images, it is difficult for these synthesized images to adhere to the principles of linear perspective. We introduce a novel geometric constraint in the training process of generative models to enforce perspective accuracy. We show that outputs of models trained with this constraint both appear more realistic and improve performance of downstream models trained on generated images. Subjective human trials show that images generated with latent diffusion models trained with our constraint are preferred over images from the Stable Diffusion V2 model 70% of the time. SOTA monocular depth estimation models such as DPT and PixelFormer, fine-tuned on our images, outperform the original models trained on real images by up to 7.03% in RMSE and 19.3% in SqRel on the KITTI test set for zero-shot transfer. The paper introduces a novel geometric constraint during the training process of latent diffusion models to improve the perspective accuracy of generated images. Current diffusion models often generate images that violate the principles of linear perspective, limiting their realism and usefulness for downstream tasks like depth estimation. The authors add a new loss term to the diffusion model training process. This term encourages the gradient field of the generated image to align with its expected vanishing points, calculated from ground truth data. Images generated with the proposed constraint appear more realistic and better preserve straight lines compared to the baseline Stable Diffusion V2 model. Human subjective tests show a strong preference (around 70%) for images generated by the enhanced model over the baseline model. Fine-tuning SOTA monocular depth estimation models on images generated by the enhanced model improves their performance on real-world datasets (KITTI, DIODE) compared to models trained on baseline images or even real images. The method requires a dataset with ground truth vanishing points for training, limiting its applicability to scenes with strong vanishing lines. Generating large synthetic datasets using diffusion models remains computationally expensive. diffusion models, perspective constraints, depth estimation, image generation, synthetic data
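One way to read the constraint is that image gradients along lines converging at a vanishing point should be perpendicular to the direction toward that point. The sketch below implements that reading as a NumPy penalty on a toy image; it is our illustrative interpretation, not the loss term used in the paper, and the vanishing-point coordinates are made up.

```python
import numpy as np

def vp_alignment_penalty(img, vp):
    gy, gx = np.gradient(img)                          # image gradient field
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dx, dy = vp[0] - xs, vp[1] - ys                    # direction from each pixel toward the VP
    d_norm = np.sqrt(dx**2 + dy**2) + 1e-8
    g_norm = np.sqrt(gx**2 + gy**2) + 1e-8
    cos = (gx * dx + gy * dy) / (g_norm * d_norm)      # gradient component along the VP direction
    return np.mean(g_norm * np.abs(cos))               # penalize misaligned, strong edges

img = np.random.rand(64, 64)
print(vp_alignment_penalty(img, vp=(32.0, -100.0)))    # hypothetical vanishing point above the image
```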
2312.00878 Report Grounding Everything: Emerging Localization Properties in Vision-Language Transformers Walid Bousselham, Felix Petersen, Vittorio Ferrari, Hilde Kuehne Vision-language foundation models have shown remarkable performance in various zero-shot settings such as image retrieval, classification, or captioning. But so far, those models seem to fall behind when it comes to zero-shot localization of referential expressions and objects in images. As a result, they need to be fine-tuned for this task. In this paper, we show that pretrained vision-language (VL) models allow for zero-shot open-vocabulary object localization without any fine-tuning. To leverage those capabilities, we propose a Grounding Everything Module (GEM) that generalizes the idea of value-value attention introduced by CLIPSurgery to a self-self attention path. We show that the concept of self-self attention corresponds to clustering, thus enforcing groups of tokens arising from the same object to be similar while preserving the alignment with the language space. To further guide the group formation, we propose a set of regularizations that allows the model to finally generalize across datasets and backbones. We evaluate the proposed GEM framework on various benchmark tasks and datasets for semantic segmentation. It shows that GEM not only outperforms other training-free open-vocabulary localization methods, but also achieves state-of-the-art results on the recently proposed OpenImagesV7 large-scale segmentation benchmark. This paper proposes GEM (Grounding Everything Module), a training-free method for open-vocabulary object localization using pre-trained vision-language models. Existing vision-language models often struggle with zero-shot localization tasks, requiring fine-tuning. This method leverages the inherent localization capabilities of these models without additional training. GEM employs a self-self attention mechanism inspired by CLIPSurgery, combining it with L2 normalization, adaptive temperature, iterative refinement, and qkv-ensemble to improve visual feature grouping and alignment with text embeddings. GEM outperforms other training-free methods and achieves competitive results against fine-tuned models on zero-shot semantic segmentation tasks. It achieves state-of-the-art results on the OpenImagesV7 dataset for zero-shot point prediction, demonstrating its effectiveness in large-scale open vocabulary settings. Analysis reveals that GEM enhances both visual distinctiveness (grouping of similar features) and vision-language alignment. The number of iterations in the self-self attention mechanism can impact performance depending on the number of classes in the dataset. Failure cases highlight potential limitations in the text encoder, suggesting future research directions. vision-language models, zero-shot learning, object localization, semantic segmentation, self-attention
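The self-self attention idea can be shown in a few lines: tokens attend to themselves through one projection, which acts like a clustering step that sharpens groups of tokens from the same object. The sketch below is a stripped-down, assumed version; the actual GEM module ensembles the q-q/k-k/v-v paths, normalizes, and uses an adaptive temperature.

```python
import torch
import torch.nn.functional as F

def self_self_attention(v, tau=0.07, iters=2):
    # v: (num_tokens, dim) value projections from a ViT block (assumed given)
    for _ in range(iters):
        vn = F.normalize(v, dim=-1)                  # L2-normalize before the dot product
        attn = F.softmax(vn @ vn.t() / tau, dim=-1)  # token-to-token affinity
        v = attn @ v                                 # iterative refinement / clustering
    return v

tokens = torch.randn(196, 768)                       # 14x14 patch tokens, illustrative
grouped = self_self_attention(tokens)
```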
2312.00869 Report Segment and Caption Anything Xiaoke Huang, Jianfeng Wang, Yansong Tang, Zheng Zhang, Han Hu, Jiwen Lu, Lijuan Wang, Zicheng Liu We propose a method to efficiently equip the Segment Anything Model (SAM) with the ability to generate regional captions. SAM presents strong generalizability to segment anything but falls short in semantic understanding. By introducing a lightweight query-based feature mixer, we align the region-specific features with the embedding space of language models for later caption generation. As the number of trainable parameters is small (typically in the order of tens of millions), it costs less computation, less memory usage, and less communication bandwidth, resulting in both fast and scalable training. To address the scarcity problem of regional caption data, we propose to first pre-train our model on object detection and segmentation tasks. We call this step weak supervision pretraining since the pre-training data only contains category names instead of full-sentence descriptions. The weak supervision pretraining allows us to leverage many publicly available object detection and segmentation datasets. We conduct extensive experiments to demonstrate the superiority of our method and validate each design choice. This work serves as a stepping stone towards scaling up regional captioning data and sheds light on exploring efficient ways to augment SAM with regional semantics. The project page, along with the associated code, can be accessed via https://xk-huang.github.io/segment-caption-anything/. This paper proposes an efficient method to augment the Segment Anything Model (SAM) with regional captioning capabilities by introducing a lightweight query-based feature mixer. SAM exhibits strong generalizability for segmenting anything but lacks semantic understanding. This work aims to bridge this gap by enabling SAM to generate regional captions. The method employs a lightweight hybrid feature mixer that aligns region-specific features with the embedding space of a frozen language model. Weak supervision pre-training is used to leverage existing object detection and segmentation datasets. The method achieves state-of-the-art performance on the Visual Genome benchmark. Weak supervision pre-training using large-scale datasets significantly improves performance. A larger pre-trained language model generally leads to better captioning results. The model might face challenges in predicting correct attributes (e.g., color) and distinguishing visually similar concepts. Future work includes exploring larger-scale weak supervision datasets and self-training for improved generalizability. regional captioning, segment anything model, weak supervision, vision-language models, interactive segmentation
2312.00863 Report EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything Yunyang Xiong, Bala Varadarajan, Lemeng Wu, Xiaoyu Xiang, Fanyi Xiao, Chenchen Zhu, Xiaoliang Dai, Dilin Wang, Fei Sun, Forrest Iandola, Raghuraman Krishnamoorthi, Vikas Chandra Segment Anything Model (SAM) has emerged as a powerful tool for numerous vision applications. A key component that drives the impressive performance for zero-shot transfer and high versatility is a super large Transformer model trained on the extensive high-quality SA-1B dataset. While beneficial, the huge computation cost of SAM model has limited its applications to wider real-world applications. To address this limitation, we propose EfficientSAMs, light-weight SAM models that exhibits decent performance with largely reduced complexity. Our idea is based on leveraging masked image pretraining, SAMI, which learns to reconstruct features from SAM image encoder for effective visual representation learning. Further, we take SAMI-pretrained light-weight image encoders and mask decoder to build EfficientSAMs, and finetune the models on SA-1B for segment anything task. We perform evaluations on multiple vision tasks including image classification, object detection, instance segmentation, and semantic object detection, and find that our proposed pretraining method, SAMI, consistently outperforms other masked image pretraining methods. On segment anything task such as zero-shot instance segmentation, our EfficientSAMs with SAMI-pretrained lightweight image encoders perform favorably with a significant gain (e.g., ~4 AP on COCO/LVIS) over other fast SAM models. This paper proposes EfficientSAMs, lightweight Segment Anything Model (SAM) variants that achieve comparable performance to SAM with significantly reduced complexity, enhancing real-world applicability. The high computational cost of SAM, particularly the image encoder, limits its practical deployment in real-time applications. EfficientSAMs address this limitation. The authors introduce SAM-leveraged masked image pretraining (SAMI), training lightweight ViT image encoders to reconstruct features from the SAM encoder. EfficientSAMs integrate these pretrained encoders with the SAM decoder and fine-tune them on the SA-1B dataset. SAMI consistently outperforms other masked image pretraining methods in transfer learning settings on image classification, object detection, instance segmentation, and semantic segmentation tasks. EfficientSAMs achieve state-of-the-art quality-efficiency trade-offs, demonstrating superior performance (e.g., ~4 AP improvement on COCO/LVIS) compared to other fast SAM models like MobileSAM and FastSAM. EfficientSAMs significantly reduce inference time (~20x) and parameter size (~20x) compared to SAM while maintaining competitive performance. The paper primarily focuses on efficiency improvements for the image encoder, leaving room for future exploration in optimizing the decoder for further computational gains. While demonstrating promising results in salient instance segmentation, further research is needed to refine and evaluate its performance thoroughly. segment anything model, efficient deep learning, masked image pretraining, vision transformers, instance segmentation
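A rough sketch of the SAMI pretraining objective: a lightweight encoder learns to reconstruct the features of the frozen SAM image encoder from partially masked inputs. Both encoders are tiny stand-ins below, and the masking ratio, shapes, and module names are illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn.functional as F

teacher = torch.nn.Conv2d(3, 256, kernel_size=16, stride=16).eval()   # "SAM encoder" stub, frozen
for p in teacher.parameters():
    p.requires_grad_(False)

student = torch.nn.Sequential(                                         # lightweight encoder stub
    torch.nn.Conv2d(3, 64, kernel_size=16, stride=16),
    torch.nn.GELU(),
    torch.nn.Conv2d(64, 256, kernel_size=1),                           # project to the teacher's dim
)
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

images = torch.randn(4, 3, 224, 224)
patch_mask = (torch.rand(4, 1, 14, 14) > 0.75).float()                 # keep roughly 25% of patches
masked = images * F.interpolate(patch_mask, size=224, mode="nearest")

with torch.no_grad():
    target = teacher(images)                                           # (4, 256, 14, 14) teacher features
loss = F.mse_loss(student(masked), target)                             # reconstruct teacher features
opt.zero_grad(); loss.backward(); opt.step()
```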
2312.00860 Report Segment Any 3D Gaussians Jiazhong Cen, Jiemin Fang, Chen Yang, Lingxi Xie, Xiaopeng Zhang, Wei Shen, Qi Tian Interactive 3D segmentation in radiance fields is an appealing task since its importance in 3D scene understanding and manipulation. However, existing methods face challenges in either achieving fine-grained, multi-granularity segmentation or contending with substantial computational overhead, inhibiting real-time interaction. In this paper, we introduce Segment Any 3D GAussians (SAGA), a novel 3D interactive segmentation approach that seamlessly blends a 2D segmentation foundation model with 3D Gaussian Splatting (3DGS), a recent breakthrough of radiance fields. SAGA efficiently embeds multi-granularity 2D segmentation results generated by the segmentation foundation model into 3D Gaussian point features through well-designed contrastive training. Evaluation on existing benchmarks demonstrates that SAGA can achieve competitive performance with state-of-the-art methods. Moreover, SAGA achieves multi-granularity segmentation and accommodates various prompts, including points, scribbles, and 2D masks. Notably, SAGA can finish the 3D segmentation within milliseconds, achieving nearly 1000x acceleration compared to previous SOTA. The project page is at https://jumpat.github.io/SAGA. SAGA is a novel interactive 3D segmentation method that achieves millisecond-level segmentation by distilling knowledge from the Segment Anything Model (SAM) into 3D Gaussians. Existing methods for interactive 3D segmentation in radiance fields are either computationally expensive or lack fine-grained segmentation capabilities. SAGA trains low-dimensional features for 3D Gaussians using a combination of SAM-guidance loss and correspondence loss to enable efficient and accurate segmentation from various prompts like points, scribbles, and masks. SAGA achieves competitive segmentation performance with previous state-of-the-art methods while being significantly faster. SAGA supports various prompt types, including points, scribbles, masks, bounding boxes, and text. SAGA is particularly well-suited for scenes with multiple objects requiring segmentation. SAGA's performance depends on the quality of the 3D Gaussian reconstruction. The semantic-agnostic nature of the post-processing step can lead to false positives. 3d segmentation, radiance fields, 3d gaussian splatting, interactive segmentation, segment anything model
2312.00853 Report Motion-Guided Latent Diffusion for Temporally Consistent Real-world Video Super-resolution Xi Yang, Chenhang He, Jianqi Ma, Lei Zhang Real-world low-resolution (LR) videos have diverse and complex degradations, imposing great challenges on video super-resolution (VSR) algorithms to reproduce their high-resolution (HR) counterparts with high quality. Recently, the diffusion models have shown compelling performance in generating realistic details for image restoration tasks. However, the diffusion process has randomness, making it hard to control the contents of restored images. This issue becomes more serious when applying diffusion models to VSR tasks because temporal consistency is crucial to the perceptual quality of videos. In this paper, we propose an effective real-world VSR algorithm by leveraging the strength of pre-trained latent diffusion models. To ensure the content consistency among adjacent frames, we exploit the temporal dynamics in LR videos to guide the diffusion process by optimizing the latent sampling path with a motion-guided loss, ensuring that the generated HR video maintains a coherent and continuous visual flow. To further mitigate the discontinuity of generated details, we insert a temporal module into the decoder and fine-tune it with an innovative sequence-oriented loss. The proposed motion-guided latent diffusion (MGLD) based VSR algorithm achieves significantly better perceptual quality than state-of-the-art methods on real-world VSR benchmark datasets, validating the effectiveness of the proposed model design and training strategies. This paper proposes Motion-Guided Latent Diffusion (MGLD), a novel video super-resolution (VSR) algorithm that leverages the generative power of pre-trained latent diffusion models to enhance the quality of real-world low-resolution videos. Real-world low-resolution videos present diverse and complex degradations, posing significant challenges for VSR algorithms. Existing methods often struggle to balance detail reproduction with artifact suppression. This work explores the use of latent diffusion models, which have shown impressive results in image restoration, to address these challenges in VSR. The proposed MGLD incorporates temporal dynamics into the VSR process through two key innovations: 1) a motion-guided diffusion sampling process that uses optical flow information from LR videos to ensure temporal consistency in the generated HR frames, and 2) a temporal-aware sequence decoder fine-tuned with a novel sequence-oriented loss to enhance the continuity and smoothness of generated details. MGLD outperforms state-of-the-art real-world VSR methods on benchmark datasets, exhibiting superior perceptual quality in terms of detail realism, texture richness, and artifact reduction. Quantitative evaluation using full-reference metrics (LPIPS, DISTS) on synthetic datasets and no-reference metrics (NIQE, BRISQUE, MUSIQ) on real-world datasets demonstrates the superior performance of MGLD. Ablation studies confirm the effectiveness of the proposed motion-guided sampling and temporal-aware decoding strategies, highlighting their synergistic contributions to the overall VSR performance. The computational complexity of MGLD is higher compared to non-diffusion based VSR methods, primarily due to the iterative nature of the diffusion process. Future work will investigate model distillation and efficient sampling techniques to enhance the inference speed. While the average warping error (WE) is commonly used to evaluate temporal consistency in VSR, it might not fully capture human perception. Future research will explore more sophisticated metrics to assess the temporal smoothness of generated videos. video super-resolution, real-world vsr, latent diffusion models, motion-guided sampling, temporal consistency
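The flow-warping consistency idea behind motion-guided sampling can be sketched in a few lines: warp the previous output frame with the LR video's optical flow and penalize disagreement with the current output frame. This is our simplified stand-in for the motion-guided loss (the flow here is a zero tensor, the frames are random, and the real method applies this guidance on latents during sampling).

```python
import torch
import torch.nn.functional as F

def warp(frame, flow):
    # frame: (N, C, H, W); flow: (N, 2, H, W) backward flow in pixels
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow     # sample coordinates
    grid_x = 2.0 * grid[:, 0] / (w - 1) - 1.0                           # normalize to [-1, 1]
    grid_y = 2.0 * grid[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)                        # (N, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)

prev, curr = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)                                        # stand-in for estimated optical flow
motion_loss = F.l1_loss(warp(prev, flow), curr)                         # temporal consistency penalty
```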
2312.00852 Report Beyond First-Order Tweedie: Solving Inverse Problems using Latent Diffusion Litu Rout, Yujia Chen, Abhishek Kumar, Constantine Caramanis, Sanjay Shakkottai, Wen-Sheng Chu Sampling from the posterior distribution poses a major computational challenge in solving inverse problems using latent diffusion models. Common methods rely on Tweedie's first-order moments, which are known to induce a quality-limiting bias. Existing second-order approximations are impractical due to prohibitive computational costs, making standard reverse diffusion processes intractable for posterior sampling. This paper introduces Second-order Tweedie sampler from Surrogate Loss (STSL), a novel sampler that offers efficiency comparable to first-order Tweedie with a tractable reverse process using second-order approximation. Our theoretical results reveal that the second-order approximation is lower bounded by our surrogate loss that only requires $O(1)$ compute using the trace of the Hessian, and by the lower bound we derive a new drift term to make the reverse process tractable. Our method surpasses SoTA solvers PSLD and P2L, achieving 4X and 8X reduction in neural function evaluations, respectively, while notably enhancing sampling quality on FFHQ, ImageNet, and COCO benchmarks. In addition, we show STSL extends to text-guided image editing and addresses residual distortions present from corrupted images in leading text-guided image editing methods. To our best knowledge, this is the first work to offer an efficient second-order approximation in solving inverse problems using latent diffusion and editing real-world images with corruptions. This paper introduces STSL, an efficient second-order Tweedie sampler for posterior sampling in latent diffusion models, improving image inversion and text-guided editing, especially for corrupted images. Existing first-order Tweedie samplers in diffusion-based inverse problem solvers suffer from bias, while second-order approximations are computationally expensive. This hinders high-fidelity reconstruction and editing, especially for real-world corrupted images. STSL leverages a novel surrogate loss function based on a tractable second-order Tweedie approximation. It uses Hutchinson's estimator to efficiently compute the trace of the Hessian, requiring only the readily available first-order score from diffusion models. This enables an efficient alternative reverse diffusion process for superior posterior sampling. STSL achieves 4x and 8x reduction in neural function evaluations compared to state-of-the-art solvers PSLD and P2L, respectively, while enhancing sampling quality. It outperforms existing methods in image inversion tasks, including denoising, inpainting, super-resolution, and deblurring, on FFHQ, ImageNet, and COCO benchmarks. STSL effectively extends to text-guided image editing, outperforming NTI in handling real-world corrupted images by enabling faithful edits and content preservation. The current implementation could be further optimized by utilizing a more sophisticated measurement operator as in P2L. Incorporating prompt-tuning into the pipeline could potentially improve text-guided editing. latent diffusion models, image inversion, image editing, second-order tweedie sampler, posterior sampling
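The O(1)-cost ingredient the surrogate loss relies on is a Hutchinson-style estimate of the Hessian trace, obtained from the first-order score alone. The sketch below demonstrates the estimator on a toy Gaussian log-density (so the true trace is known); in STSL the score would come from the diffusion model, not from an explicit log-density.

```python
import torch

def log_density(x):
    return -0.5 * (x ** 2).sum()            # toy log p(x); Hessian is -I, trace is -dim

def hutchinson_trace(x, num_probes=16):
    x = x.detach().requires_grad_(True)
    score = torch.autograd.grad(log_density(x), x, create_graph=True)[0]   # first-order score
    est = 0.0
    for _ in range(num_probes):
        v = torch.randint(0, 2, x.shape).float() * 2 - 1                   # Rademacher probe
        hvp = torch.autograd.grad((score * v).sum(), x, retain_graph=True)[0]
        est += (v * hvp).sum()                                             # v^T H v
    return est / num_probes

x = torch.randn(10)
print(hutchinson_trace(x))                   # ~ -10 for this toy density
```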
2312.00845 Report VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models Hyeonho Jeong, Geon Yeong Park, Jong Chul Ye Text-to-video diffusion models have advanced video generation significantly. However, customizing these models to generate videos with tailored motions presents a substantial challenge. Specifically, they encounter hurdles in (a) accurately reproducing motion from a target video, and (b) creating diverse visual variations. For example, straightforward extensions of static image customization methods to video often lead to intricate entanglements of appearance and motion data. To tackle this, here we present the Video Motion Customization (VMC) framework, a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models. Our approach introduces a novel motion distillation objective using residual vectors between consecutive frames as a motion reference. The diffusion process then preserves low-frequency motion trajectories while mitigating high-frequency motion-unrelated noise in image space. We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts. Our code, data, and the project demo can be found at https://video-motion-customization.github.io Presents VMC (Video Motion Customization), a one-shot tuning framework that adapts the temporal attention layers of a text-to-video diffusion model to reproduce the motion of a reference video while allowing diverse visual variations. Customizing text-to-video diffusion models to tailored motions is difficult: existing approaches struggle to accurately reproduce the target motion and to create diverse appearances, and straightforward extensions of static image customization methods entangle appearance and motion. Fine-tunes only the temporal attention layers with a motion distillation objective that uses residual vectors between consecutive frames as the motion reference, preserving low-frequency motion trajectories while suppressing high-frequency, motion-unrelated noise in image space. Validates the method against state-of-the-art video generative models across diverse real-world motions and contexts. Adapting only the temporal attention layers keeps the one-shot tuning lightweight and avoids the appearance-motion entanglement that arises when image customization methods are naively extended to video. video motion customization, text-to-video diffusion models, temporal attention, one-shot tuning, motion distillation
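The residual-vector idea from the abstract can be illustrated with a toy loss: align the frame-to-frame residuals of a generated clip with those of the reference clip, so that motion rather than appearance is what gets matched. This is a hedged sketch in our own notation, not the paper's exact objective (which operates in the diffusion training loop), and the tensors below are random stand-ins.

```python
import torch

def motion_distillation_loss(gen_frames, ref_frames):
    # both: (T, C, H, W)
    gen_res = gen_frames[1:] - gen_frames[:-1]     # residuals between consecutive frames
    ref_res = ref_frames[1:] - ref_frames[:-1]
    gen_flat = gen_res.flatten(1)
    ref_flat = ref_res.flatten(1)
    cos = torch.nn.functional.cosine_similarity(gen_flat, ref_flat, dim=1)
    return (1.0 - cos).mean()                      # align residual directions across frames

loss = motion_distillation_loss(torch.randn(8, 4, 32, 32), torch.randn(8, 4, 32, 32))
```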
2312.00833 Report Lasagna: Layered Score Distillation for Disentangled Object Relighting Dina Bashkirova, Arijit Ray, Rupayan Mallick, Sarah Adel Bargal, Jianming Zhang, Ranjay Krishna, Kate Saenko Professional artists, photographers, and other visual content creators use object relighting to establish their photo's desired effect. Unfortunately, manual tools that allow relighting have a steep learning curve and are difficult to master. Although generative editing methods now enable some forms of image editing, relighting is still beyond today's capabilities; existing methods struggle to keep other aspects of the image -- colors, shapes, and textures -- consistent after the edit. We propose Lasagna, a method that enables intuitive text-guided relighting control. Lasagna learns a lighting prior by using score distillation sampling to distill the prior of a diffusion model, which has been finetuned on synthetic relighting data. To train Lasagna, we curate a new synthetic dataset ReLiT, which contains 3D object assets re-lit from multiple light source locations. Despite training on synthetic images, quantitative results show that Lasagna relights real-world images while preserving other aspects of the input image, outperforming state-of-the-art text-guided image editing methods. Lasagna enables realistic and controlled results on natural images and digital art pieces and is preferred by humans over other methods in over 91% of cases. Finally, we demonstrate the versatility of our learning objective by extending it to allow colorization, another form of image editing. This paper introduces Lasagna, a novel method for text-guided object relighting in images that leverages a diffusion model prior. Relighting objects in images is a crucial aspect of visual content creation, but existing tools are often difficult to use or lack generalizability. Lasagna aims to provide an intuitive and realistic solution for text-guided relighting. Lasagna employs a layered score distillation sampling approach to learn a lighting prior from a diffusion model fine-tuned on a synthetic relighting dataset called ReLiT. It predicts separate editing layers for shading and lighting, which are then composed with the input image to achieve the desired relighting effect while preserving other image aspects. Lasagna outperforms state-of-the-art text-guided image editing methods in terms of realistic and controlled relighting, as evidenced by human evaluation. The method generalizes well to various image domains, including natural photos and digital art, despite being trained on synthetic data. Lasagna's layered editing framework can be extended to other image editing tasks, as demonstrated with a proof-of-concept for sketch colorization. Lasagna may struggle with highly abstract input images. The method can sometimes introduce over-exposure artifacts in the background, which could be addressed in future work with techniques like foreground masking. image editing, relighting, diffusion models, score distillation sampling, text-guided synthesis
2312.00785 Report Sequential Modeling Enables Scalable Learning for Large Vision Models Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, Alexei A Efros We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data. To do this, we define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources such as semantic segmentations and depth reconstructions without needing any meta-knowledge beyond the pixels. Once this wide variety of visual data (comprising 420 billion tokens) is represented as sequences, the model can be trained to minimize a cross-entropy loss for next token prediction. By training across various scales of model architecture and data diversity, we provide empirical evidence that our models scale effectively. Many different vision tasks can be solved by designing suitable visual prompts at test time. This paper introduces a novel sequential modeling approach to learning a Large Vision Model (LVM) without linguistic data. The goal is to create a foundation for LVMs that can scale with large datasets and address various vision tasks through prompting, similar to LLMs in NLP. The methodology involves: (1) Representing diverse visual data, including raw images/videos and annotations, as unified "visual sentences" - sequences of images. (2) Training a large transformer architecture to predict the next token in these visual sentences, using a learned tokenizer to convert images into discrete tokens. The model demonstrates scaling behavior with increasing model size and data size. Various vision tasks can be solved by designing suitable prompts, showcasing the potential for in-context learning. The model benefits significantly from the diversity and volume of unsupervised data used during training. Limitations in computational resources restricted the exploration of various aspects, like the impact of different datasets. The model's size, despite being large, is still relatively small compared to LLMs, leaving room for further exploration in generalization capabilities. large vision model, visual prompting, sequential modeling, vision-only training, unsupervised learning
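The training objective itself is just next-token prediction over discretized "visual sentences". The sketch below stubs the VQ tokenizer with random integers and the LVM with a trivial embedding-plus-linear model, so it only shows the shape of the loss, not the actual architecture or data pipeline.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 8192, 256, 4
tokens = torch.randint(0, vocab_size, (batch, seq_len))   # tokenized visual sentence (stand-in)

model = torch.nn.Sequential(                              # toy stand-in for the autoregressive transformer
    torch.nn.Embedding(vocab_size, 128),
    torch.nn.Linear(128, vocab_size),
)

logits = model(tokens[:, :-1])                            # predict token t+1 from the prefix (here: token t)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
loss.backward()
```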
2312.00784 Report ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts Mu Cai, Haotian Liu, Dennis Park, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Yong Jae Lee While existing large vision-language multimodal models focus on whole image understanding, there is a prominent gap in achieving region-specific comprehension. Current approaches that use textual coordinates or spatial encodings often fail to provide a user-friendly interface for visual prompting. To address this challenge, we introduce a novel multimodal model capable of decoding arbitrary visual prompts. This allows users to intuitively mark images and interact with the model using natural cues like a "red bounding box" or "pointed arrow". Our simple design directly overlays visual markers onto the RGB image, eliminating the need for complex region encodings, yet achieves state-of-the-art performance on region-understanding tasks like Visual7W, PointQA, and Visual Commonsense Reasoning benchmark. Furthermore, we present ViP-Bench, a comprehensive benchmark to assess the capability of models in understanding visual prompts across multiple dimensions, enabling future research in this domain. Code, data, and model are publicly available. The paper introduces ViP-LLaVA, a novel multimodal model designed for intuitive interaction with images using natural language and arbitrary visual prompts like arrows, boxes, or scribbles. Existing large vision-language models predominantly focus on whole-image understanding and lack the capacity to process region-specific information effectively, limiting their ability to understand user intent in complex scenes. ViP-LLaVA leverages CLIP's ability to encode both images and superimposed visual markers. By overlaying these prompts directly onto the image, the model learns to associate visual cues with specific regions, enhancing region-specific comprehension. ViP-LLaVA achieves state-of-the-art results on region understanding tasks, surpassing models specifically designed for region-based reasoning on benchmarks like Visual7W and PointQA. The model demonstrates strong generalization abilities, accurately interpreting user-drawn visual prompts at test time, even with variations in thickness or marker type. A new benchmark, ViP-Bench, is introduced to comprehensively evaluate multimodal models' region understanding capabilities under various visual prompts, covering aspects like recognition, OCR, knowledge, math, relationship reasoning, and language generation. Current LMMs, including ViP-LLaVA, still lag behind GPT-4V in tasks demanding strong language reasoning, particularly OCR, math, and language generation, indicating an area for future research. While ViP-LLaVA effectively leverages visual prompts for region understanding, exploring other region representation methods, such as combining visual prompts with textual coordinates, could further enhance performance. multimodal learning, visual prompting, region understanding, vision-language models, benchmarking
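The visual-prompt interface is literally an overlay on the RGB image: draw the marker, then feed the composited image to the model. The snippet below shows such an overlay with Pillow; the file name, coordinates, and marker choices are made up for illustration, and the call to the multimodal model itself is omitted.

```python
from PIL import Image, ImageDraw

img = Image.new("RGB", (640, 480), "white")                       # stand-in for a real photo
draw = ImageDraw.Draw(img)
draw.rectangle([100, 120, 300, 360], outline="red", width=6)      # red bounding box prompt
draw.line([500, 400, 320, 250], fill="red", width=6)              # arrow-like pointer toward a region
img.save("prompted_image.png")                                    # this composited image goes to the model
```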
2312.00778 Report MorpheuS: Neural Dynamic 360° Surface Reconstruction from Monocular RGB-D Video Hengyi Wang, Jingwen Wang, Lourdes Agapito Neural rendering has demonstrated remarkable success in dynamic scene reconstruction. Thanks to the expressiveness of neural representations, prior works can accurately capture the motion and achieve high-fidelity reconstruction of the target object. Despite this, real-world video scenarios often feature large unobserved regions where neural representations struggle to achieve realistic completion. To tackle this challenge, we introduce MorpheuS, a framework for dynamic 360° surface reconstruction from a casually captured RGB-D video. Our approach models the target scene as a canonical field that encodes its geometry and appearance, in conjunction with a deformation field that warps points from the current frame to the canonical space. We leverage a view-dependent diffusion prior and distill knowledge from it to achieve realistic completion of unobserved regions. Experimental results on various real-world and synthetic datasets show that our method can achieve high-fidelity 360° surface reconstruction of a deformable object from a monocular RGB-D video. MorpheuS: a novel framework for dynamic 360° surface reconstruction from casual monocular RGB-D videos, achieving photo-realistic completion of unobserved regions via diffusion priors. Existing dynamic scene reconstruction methods struggle to achieve realistic completion of unobserved regions, limiting their applications. MorpheuS represents the scene with a hyper-dimensional canonical field and a deformation field. It leverages a view-dependent diffusion prior and distills knowledge from it through Score Distillation Sampling (SDS) to complete unobserved geometry and appearance. MorpheuS achieves high-fidelity 360° surface reconstruction with accurate motion and geometry. The use of diffusion priors leads to photo-realistic completion of unobserved regions, outperforming previous methods. Canonical space regularization and temporal view-dependent SDS contribute to robust and accurate reconstruction. MorpheuS may fail in challenging scenarios like incomplete views, motion blur, or complex articulation due to limitations of the diffusion prior. The method currently lacks motion priors, hindering reconstruction in self-occluded regions. dynamic scene reconstruction, diffusion priors, neural implicit representations, 360° reconstruction, rgb-d video
2312.00777 Report VideoBooth: Diffusion-based Video Generation with Image Prompts Yuming Jiang, Tianxing Wu, Shuai Yang, Chenyang Si, Dahua Lin, Yu Qiao, Chen Change Loy, Ziwei Liu Text-driven video generation witnesses rapid progress. However, merely using text prompts is not enough to depict the desired subject appearance that accurately aligns with users' intents, especially for customized content creation. In this paper, we study the task of video generation with image prompts, which provide more accurate and direct content control beyond the text prompts. Specifically, we propose a feed-forward framework VideoBooth, with two dedicated designs: 1) We propose to embed image prompts in a coarse-to-fine manner. Coarse visual embeddings from image encoder provide high-level encodings of image prompts, while fine visual embeddings from the proposed attention injection module provide multi-scale and detailed encoding of image prompts. These two complementary embeddings can faithfully capture the desired appearance. 2) In the attention injection module at fine level, multi-scale image prompts are fed into different cross-frame attention layers as additional keys and values. This extra spatial information refines the details in the first frame and then it is propagated to the remaining frames, which maintains temporal consistency. Extensive experiments demonstrate that VideoBooth achieves state-of-the-art performance in generating customized high-quality videos with subjects specified in image prompts. Notably, VideoBooth is a generalizable framework where a single model works for a wide range of image prompts with feed-forward pass. Introduces VideoBooth, a novel feed-forward framework for generating videos using both image and text prompts, enabling customized content creation with accurate subject appearance control. Addresses the limitations of text-only prompts in video generation, which struggle to accurately depict specific subject appearances, particularly for customized content. Employs a coarse-to-fine visual embedding strategy: 1) a pretrained CLIP image encoder extracts coarse visual embeddings from image prompts, inserted into text embeddings; 2) an attention injection module refines details by incorporating multi-scale image prompt representations into the cross-frame attention of a pretrained text-to-video diffusion model. VideoBooth achieves state-of-the-art performance in generating customized videos with subject appearance faithful to the image prompts. Quantitative evaluation using CLIP-Image and DINO metrics demonstrate superior image alignment compared to baseline methods. User study confirms VideoBooth's superiority in image alignment, text alignment, and overall quality. The model was trained on a dataset with watermarks, requiring an additional module to remove them. Future work includes expanding the dataset and enhancing the model's ability to handle complex motions and diverse object appearances. video generation, image prompts, text-to-video, diffusion models, customized content creation
2312.00739 Report Adversarial Score Distillation: When score distillation meets GAN Min Wei, Jingkai Zhou, Junyao Sun, Xuesong Zhang Existing score distillation methods are sensitive to classifier-free guidance (CFG) scale: manifested as over-smoothness or instability at small CFG scales, while over-saturation at large ones. To explain and analyze these issues, we revisit the derivation of Score Distillation Sampling (SDS) and decipher existing score distillation with the Wasserstein Generative Adversarial Network (WGAN) paradigm. With the WGAN paradigm, we find that existing score distillation either employs a fixed sub-optimal discriminator or conducts incomplete discriminator optimization, resulting in the scale-sensitive issue. We propose the Adversarial Score Distillation (ASD), which maintains an optimizable discriminator and updates it using the complete optimization objective. Experiments show that the proposed ASD performs favorably in 2D distillation and text-to-3D tasks against existing methods. Furthermore, to explore the generalization ability of our WGAN paradigm, we extend ASD to the image editing task, which achieves competitive results. The project page and code are at https://github.com/2y7c3/ASD. This paper unveils the connection between score distillation and GANs, proposing Adversarial Score Distillation (ASD) to address the limitations of existing score distillation methods. Existing score distillation methods, like SDS, are sensitive to classifier-free guidance (CFG) scale, resulting in over-smoothing or instability at small scales and over-saturation at large scales. This paper aims to analyze and rectify these issues. The authors revisit the derivation of SDS and establish its connection to Wasserstein GAN (WGAN). They identify that existing methods either employ a fixed sub-optimal discriminator or conduct incomplete optimization. ASD, leveraging the WGAN paradigm, maintains an optimizable discriminator and updates it using the complete WGAN discriminator loss. ASD demonstrates superior performance in quality, stability, and diversity compared to SDS and VSD in both 2D distillation and text-to-3D tasks. The paper provides a theoretical analysis of VSD and CSD under the WGAN paradigm. ASD's application is extended to image editing, showcasing competitive results and highlighting the generalization ability of the proposed paradigm. ASD, while exhibiting strong performance, still suffers from speed limitations similar to VSD. Further exploration of dynamic gamma values in the discriminator loss function is suggested for potential improvement. score distillation, generative adversarial networks, text-to-3d synthesis, image editing, wasserstein gan
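As background for the analysis above, the shape of a plain SDS update (the first-order baseline the paper re-reads through the WGAN lens) is easy to sketch: the denoiser's noise-prediction error, weighted by w(t), is applied as a gradient on the optimized image without backpropagating through the UNet. The denoiser and noise schedule below are toy stand-ins, and the surrogate-loss trick is just one way to inject a precomputed gradient.

```python
import torch

def fake_denoiser(x_t, t):
    return 0.1 * x_t                                        # stand-in for eps_phi(x_t, t; prompt)

image = torch.randn(1, 3, 64, 64, requires_grad=True)       # "rendered" 2D particle being optimized
opt = torch.optim.Adam([image], lr=1e-2)

for _ in range(100):
    t = torch.randint(50, 950, (1,)).item()
    alpha_bar = 1.0 - t / 1000.0                            # toy noise schedule
    noise = torch.randn_like(image)
    x_t = alpha_bar**0.5 * image.detach() + (1 - alpha_bar)**0.5 * noise
    eps_pred = fake_denoiser(x_t, t)
    w_t = 1.0 - alpha_bar
    sds_grad = w_t * (eps_pred - noise)                     # SDS gradient; no backprop through the UNet
    loss = (sds_grad.detach() * image).sum()                # surrogate whose gradient w.r.t. image is sds_grad
    opt.zero_grad(); loss.backward(); opt.step()
```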
2312.00732 Report Gaussian Grouping: Segment and Edit Anything in 3D Scenes Mingqiao Ye, Martin Danelljan, Fisher Yu, Lei Ke The recent Gaussian Splatting achieves high-quality and real-time novel-view synthesis of the 3D scenes. However, it is solely concentrated on the appearance and geometry modeling, while lacking in fine-grained object-level scene understanding. To address this issue, we propose Gaussian Grouping, which extends Gaussian Splatting to jointly reconstruct and segment anything in open-world 3D scenes. We augment each Gaussian with a compact Identity Encoding, allowing the Gaussians to be grouped according to their object instance or stuff membership in the 3D scene. Instead of resorting to expensive 3D labels, we supervise the Identity Encodings during the differentiable rendering by leveraging the 2D mask predictions by SAM, along with an introduced 3D spatial consistency regularization. Compared to the implicit NeRF representation, we show that the discrete and grouped 3D Gaussians can reconstruct, segment and edit anything in 3D with high visual quality, fine granularity and efficiency. Based on Gaussian Grouping, we further propose a local Gaussian Editing scheme, which shows efficacy in versatile scene editing applications, including 3D object removal, inpainting, colorization and scene recomposition. Our code and models will be at https://github.com/lkeab/gaussian-grouping. Presents Gaussian Grouping, an extension of 3D Gaussian Splatting for joint reconstruction and segmentation of anything in open-world 3D scenes. Addresses the limitations of existing 3D scene understanding methods that rely on expensive 3D labels or struggle with fine-grained segmentation in open-world settings. Augments each Gaussian with a learnable Identity Encoding, supervised by 2D mask predictions from SAM and a 3D spatial consistency regularization, enabling grouping of Gaussians into object instances or stuff. Achieves high-quality reconstruction comparable to original Gaussian Splatting. Significantly outperforms existing open-vocabulary 3D segmentation methods on LERF-Mask dataset. Enables efficient and versatile scene editing applications, including object removal, inpainting, colorization, and scene recomposition. Currently limited to static 3D scenes due to the lack of dynamic modeling. Future work includes exploring fully unsupervised 3D Gaussian grouping without 2D mask supervision. 3d scene understanding, gaussian splatting, open-world segmentation, scene editing, segment anything model (sam)
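A toy version of the Identity Encoding idea may clarify how 2D masks supervise 3D features: each Gaussian carries a small learnable identity vector, per-pixel features are a weighted blend of contributing Gaussians, and a linear classifier maps them to SAM-derived mask labels. In the real method the blending weights come from the differentiable rasterizer; here they are random stand-ins, and all sizes are made up.

```python
import torch
import torch.nn.functional as F

num_gaussians, feat_dim, num_objects, num_pixels = 5000, 16, 20, 4096
identity = torch.randn(num_gaussians, feat_dim, requires_grad=True)   # per-Gaussian Identity Encoding
classifier = torch.nn.Linear(feat_dim, num_objects)

# Stand-in for splatting: per-pixel blending weights over Gaussians.
weights = torch.rand(num_pixels, num_gaussians)
weights = weights / weights.sum(dim=1, keepdim=True)
labels = torch.randint(0, num_objects, (num_pixels,))                 # pseudo-labels from SAM masks

opt = torch.optim.Adam([identity, *classifier.parameters()], lr=1e-2)
for _ in range(100):
    rendered = weights @ identity                                     # (num_pixels, feat_dim) "rendered" features
    loss = F.cross_entropy(classifier(rendered), labels)              # 2D mask supervision
    opt.zero_grad(); loss.backward(); opt.step()
```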
2312.00674 Report LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models Ying Nie, Wei He, Kai Han, Yehui Tang, Tianyu Guo, Fanyi Du, Yunhe Wang Vision-language pre-training like CLIP has shown promising performance on various downstream tasks such as zero-shot image classification and image-text retrieval. Most of the existing CLIP-alike works usually adopt relatively large image encoders like ResNet50 and ViT, while the lightweight counterparts are rarely discussed. In this paper, we propose a multi-level interaction paradigm for training lightweight CLIP models. Firstly, to mitigate the problem that some image-text pairs are not strictly one-to-one correspondence, we improve the conventional global instance-level alignment objective by softening the label of negative samples progressively. Secondly, a relaxed bipartite matching based token-level alignment objective is introduced for finer-grained alignment between image patches and textual words. Moreover, based on the observation that the accuracy of CLIP model does not increase correspondingly as the parameters of text encoder increase, an extra objective of masked language modeling (MLM) is leveraged for maximizing the potential of the shortened text encoder. In practice, an auxiliary fusion module injecting unmasked image embedding into masked text embedding at different network stages is proposed for enhancing the MLM. Extensive experiments show that without introducing additional computational cost during inference, the proposed method achieves a higher performance on multiple downstream tasks. This paper proposes LightCLIP, a multi-level interaction paradigm for training lightweight CLIP models that achieves higher performance on downstream tasks without introducing additional computational cost during inference. Existing CLIP models are difficult to deploy on edge devices due to their large parameter size, and directly adapting existing training methods to lightweight models leads to sub-optimal results. The authors propose: (1) a progressive softening of labels for the global instance-level alignment objective to account for noisy image-text pairs; (2) a relaxed bipartite matching based token-level alignment objective for finer-grained alignment between image patches and textual words; and (3) a masked language modeling (MLM) objective enhanced by fusing unmasked image embedding into masked text embedding at different network stages to maximize the potential of a shortened text encoder. LightCLIP outperforms CLIP, SLIP, and DeCLIP on zero-shot ImageNet classification with various lightweight image encoders. LightCLIP consistently achieves higher average zero-shot accuracy on 10 small datasets compared to other methods. LightCLIP shows significant improvements in zero-shot image-text retrieval on Flickr30K and MS-COCO, especially in image-to-text top-1 hit accuracy. The paper primarily focuses on YFCC15M-V2 dataset for pre-training, limiting the exploration of performance with larger datasets. Future work could explore alternative lightweight architectures and fusion strategies for both image and text encoders. vision-language pre-training, lightweight model, clip, zero-shot learning, image-text retrieval
2312.00596 Report BCN: Batch Channel Normalization for Image Classification Afifa Khaled, Chao Li, Jia Ning, Kun He Normalization techniques have been widely used in the field of deep learning due to their capability of enabling higher learning rates and being less sensitive to initialization. However, the effectiveness of popular normalization technologies is typically limited to specific areas. Unlike the standard Batch Normalization (BN) and Layer Normalization (LN), where BN computes the mean and variance along the (N,H,W) dimensions and LN computes the mean and variance along the (C,H,W) dimensions (N, C, H and W are the batch, channel, spatial height and width dimension, respectively), this paper presents a novel normalization technique called Batch Channel Normalization (BCN). To exploit both the channel and batch dependence, and to adaptively combine the advantages of BN and LN based on specific datasets or tasks, BCN separately normalizes inputs along the (N, H, W) and (C, H, W) axes, then combines the normalized outputs based on adaptive parameters. As a basic block, BCN can be easily integrated into existing models for various applications in the field of computer vision. Empirical results show that the proposed technique can be seamlessly applied to various versions of CNN or Vision Transformer architecture. The code is publicly available at https://github.com/AfifaKhaled/BatchChannel-Normalization This paper introduces Batch Channel Normalization (BCN), a novel normalization technique for deep learning that combines the strengths of Batch Normalization (BN) and Layer Normalization (LN). Existing normalization techniques like BN and LN have limitations, with BN requiring large batch sizes and LN not performing well on convolutional layers. BCN aims to overcome these limitations by exploiting both channel and batch dependencies. BCN normalizes inputs separately along the (N, H, W) and (C, H, W) axes, then combines these normalized outputs using adaptive parameters. This allows BCN to leverage the advantages of both BN and LN. BCN consistently outperforms BN, LN, and other normalization techniques in image classification tasks on CIFAR-10/100, SVHN, and ImageNet datasets. BCN improves the performance of self-supervised learning methods like BYOL. BCN shows consistent improvements when applied to Vision Transformer (ViT) models. Future work includes an ablation study on directly computing average and variance along (N, C, H, W) axes. Further investigation of BCN's effectiveness across a wider range of CNN architectures and applications is planned. batch normalization, layer normalization, deep learning, normalization techniques, computer vision
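A minimal PyTorch sketch of the idea, assuming a single per-channel sigmoid-gated mixing weight between the two normalized branches (the paper's exact parameterization of the adaptive combination may differ):

```python
import torch
import torch.nn as nn

class BatchChannelNorm(nn.Module):
    """Sketch of Batch Channel Normalization: normalize along (N, H, W) as in BN
    and along (C, H, W) as in LN, then mix the two branches with a learnable weight."""

    def __init__(self, num_channels, eps=1e-5, momentum=0.1):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_channels, eps=eps, momentum=momentum, affine=False)
        self.ln = nn.GroupNorm(1, num_channels, eps=eps, affine=False)  # LN over (C, H, W)
        self.mix = nn.Parameter(torch.zeros(num_channels))              # adaptive weight
        self.weight = nn.Parameter(torch.ones(num_channels))            # affine scale
        self.bias = nn.Parameter(torch.zeros(num_channels))             # affine shift

    def forward(self, x):                                # x: (N, C, H, W)
        w = torch.sigmoid(self.mix).view(1, -1, 1, 1)
        y = w * self.bn(x) + (1.0 - w) * self.ln(x)
        return y * self.weight.view(1, -1, 1, 1) + self.bias.view(1, -1, 1, 1)
```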
2312.00588 Report LucidDreaming: Controllable Object-Centric 3D Generation Zhaoning Wang, Ming Li, Chen Chen With the recent development of generative models, Text-to-3D generations have also seen significant growth. Nonetheless, achieving precise control over 3D generation continues to be an arduous task, as using text to control often leads to missing objects and imprecise locations. Contemporary strategies for enhancing controllability in 3D generation often entail the introduction of additional parameters, such as customized diffusion models. This often induces hardness in adapting to different diffusion models or creating distinct objects. In this paper, we present LucidDreaming as an effective pipeline capable of fine-grained control over 3D generation. It requires only minimal input of 3D bounding boxes, which can be deduced from a simple text prompt using a Large Language Model. Specifically, we propose clipped ray sampling to separately render and optimize objects with user specifications. We also introduce object-centric density blob bias, fostering the separation of generated objects. With individual rendering and optimizing of objects, our method excels not only in controlled content generation from scratch but also within the pre-trained NeRF scenes. In such scenarios, existing generative approaches often disrupt the integrity of the original scene, and current editing methods struggle to synthesize new content in empty spaces. We show that our method exhibits remarkable adaptability across a spectrum of mainstream Score Distillation Sampling-based 3D generation frameworks, and achieves superior alignment of 3D content when compared to baseline approaches. We also provide a dataset of prompts with 3D bounding boxes, benchmarking 3D spatial controllability. This paper introduces LucidDreaming, a plug-and-play pipeline enhancing controllability in 3D generation using bounding boxes or text prompts. Existing text-to-3D generation methods struggle with fine-grained control, often resulting in missing objects or inaccurate placements. While controllable methods exist, they rely on customized diffusion models and lack adaptability. The paper proposes: (1) Clipped ray sampling for individual object rendering and optimization within bounding boxes. (2) Object-centric density bias initialization to accurately position initial density within bounding boxes. (3) Integration of a Large Language Model to generate bounding boxes and object descriptions from text prompts. LucidDreaming demonstrates superior control over object placement and number compared to baseline methods. The method adapts to various SDS-based 3D generation frameworks (DreamFusion, Magic3D, ProlificDreamer). It allows controlled object generation within pre-trained NeRF scenes, unlike methods focused on modifying existing objects. Current implementation struggles with object interactions, relying on separate rendering. Training time increases linearly with the number of objects, posing challenges for complex scenes. 3d generation, controllability, text-to-3d, nerf, score distillation sampling
2312.00583 Report MD-Splatting: Learning Metric Deformation from 4D Gaussians in Highly Deformable Scenes Bardienus P. Duisterhof, Zhao Mandi, Yunchao Yao, Jia-Wei Liu, Mike Zheng Shou, Shuran Song, Jeffrey Ichnowski Accurate 3D tracking in highly deformable scenes with occlusions and shadows can facilitate new applications in robotics, augmented reality, and generative AI. However, tracking under these conditions is extremely challenging due to the ambiguity that arises with large deformations, shadows, and occlusions. We introduce MD-Splatting, an approach for simultaneous 3D tracking and novel view synthesis, using video captures of a dynamic scene from various camera poses. MD-Splatting builds on recent advances in Gaussian splatting, a method that learns the properties of a large number of Gaussians for state-of-the-art and fast novel view synthesis. MD-Splatting learns a deformation function to project a set of Gaussians with non-metric, thus canonical, properties into metric space. The deformation function uses a neural-voxel encoding and a multilayer perceptron (MLP) to infer Gaussian position, rotation, and a shadow scalar. We enforce physics-inspired regularization terms based on local rigidity, conservation of momentum, and isometry, which leads to trajectories with smaller trajectory errors. MD-Splatting achieves high-quality 3D tracking on highly deformable scenes with shadows and occlusions. Compared to state-of-the-art, we improve 3D tracking by an average of 23.9 %, while simultaneously achieving high-quality novel view synthesis. With sufficient texture such as in scene 6, MD-Splatting achieves a median tracking error of 3.39 mm on a cloth of 1 x 1 meters in size. Project website: https://md-splatting.github.io/. MD-Splatting is a novel method for simultaneous 3D tracking and novel view synthesis in highly deformable scenes, using video captures from various camera poses. It leverages Gaussian splatting and learns a deformation function to map canonical Gaussians into metric space for tracking and rendering. Accurate 3D tracking in deformable scenes is crucial for applications in robotics, AR, and AI, but it is challenging due to ambiguities caused by deformations, shadows, and occlusions. MD-Splatting learns a deformation function using a neural-voxel encoding and an MLP to infer Gaussian position, rotation, and a shadow scalar. It also enforces physics-inspired regularization terms for plausible deformations. MD-Splatting achieves state-of-the-art 3D tracking on deformable scenes, improving accuracy by 16.7% compared to previous methods. It achieves high-quality novel view reconstruction with an average PSNR of 39.1. The method exhibits robustness in textured environments and shows promising results even with lower time resolution. The method currently relies on a multi-camera setup, limiting its applicability in some real-world scenarios. The work primarily focuses on scenes with a single cloth object; expanding to more complex environments with diverse soft objects is an area for future exploration. 3d tracking, novel view synthesis, deformable objects, gaussian splatting, neural rendering
2312.00451 Report FSGS: Real-Time Few-shot View Synthesis using Gaussian Splatting Zehao Zhu, Zhiwen Fan, Yifan Jiang, Zhangyang Wang Novel view synthesis from limited observations remains an important and persistent task. However, high efficiency in existing NeRF-based few-shot view synthesis is often compromised to obtain an accurate 3D representation. To address this challenge, we propose a few-shot view synthesis framework based on 3D Gaussian Splatting that enables real-time and photo-realistic view synthesis with as few as three training views. The proposed method, dubbed FSGS, handles the extremely sparse initialized SfM points with a thoughtfully designed Gaussian Unpooling process. Our method iteratively distributes new Gaussians around the most representative locations, subsequently infilling local details in vacant areas. We also integrate a large-scale pre-trained monocular depth estimator within the Gaussians optimization process, leveraging online augmented views to guide the geometric optimization towards an optimal solution. Starting from sparse points observed from limited input viewpoints, our FSGS can accurately grow into unseen regions, comprehensively covering the scene and boosting the rendering quality of novel views. Overall, FSGS achieves state-of-the-art performance in both accuracy and rendering efficiency across diverse datasets, including LLFF, Mip-NeRF360, and Blender. Project website: https://zehaozhu.github.io/FSGS/. FSGS, a novel point-based framework for few-shot view synthesis, leveraging Proximity-guided Gaussian Unpooling and monocular depth priors. Addresses the challenge of high inefficiency and inaccurate 3D representation in existing NeRF-based few-shot view synthesis methods. Employs Proximity-guided Gaussian Unpooling to densify 3D Gaussians for scene coverage and integrates monocular depth priors, enhanced by pseudo view generation, for optimal Gaussian optimization. Achieves state-of-the-art rendering quality on LLFF, Mip-NeRF360, and Blender datasets. Enables real-time rendering speed (200+ FPS) suitable for practical applications. Significantly outperforms NeRF-based methods in rendering accuracy and speed, particularly in few-shot scenarios with limited training views. Reliance on accurate SfM for initialization, potentially limiting performance in challenging scenarios. Exploration of alternative depth priors beyond monocular depth estimators to further improve accuracy. novel view synthesis, few-shot learning, 3d gaussian splatting, monocular depth estimation, real-time rendering
2312.00330 Report StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter Gongye Liu, Menghan Xia, Yong Zhang, Haoxin Chen, Jinbo Xing, Xintao Wang, Yujiu Yang, Ying Shan Text-to-video (T2V) models have shown remarkable capabilities in generating diverse videos. However, they struggle to produce user-desired stylized videos due to (i) text's inherent clumsiness in expressing specific styles and (ii) the generally degraded style fidelity. To address these challenges, we introduce StyleCrafter, a generic method that enhances pre-trained T2V models with a style control adapter, enabling video generation in any style by providing a reference image. Considering the scarcity of stylized video datasets, we propose to first train a style control adapter using style-rich image datasets, then transfer the learned stylization ability to video generation through a tailor-made finetuning paradigm. To promote content-style disentanglement, we remove style descriptions from the text prompt and extract style information solely from the reference image using a decoupling learning strategy. Additionally, we design a scale-adaptive fusion module to balance the influences of text-based content features and image-based style features, which helps generalization across various text and style combinations. StyleCrafter efficiently generates high-quality stylized videos that align with the content of the texts and resemble the style of the reference images. Experiments demonstrate that our approach is more flexible and efficient than existing competitors. The paper introduces StyleCrafter, a novel method that enables pre-trained text-to-video (T2V) models to generate stylized videos using a single reference image. Existing T2V models struggle to produce stylized videos due to the difficulty of expressing specific styles through text prompts and the lack of large-scale stylized video datasets. The authors propose a two-stage training pipeline: 1) train a style adapter on a stylized image dataset to extract style features, 2) adapt a pre-trained T2V model by fine-tuning its temporal blocks with the style adapter incorporated. StyleCrafter generates high-quality stylized videos that are both text-aligned and style-conformant. The method outperforms existing single-reference and even some multi-reference based stylized video generation methods. Ablation studies validate the effectiveness of the proposed style adapter architecture, training scheme, and adaptive style-content fusion module. The model may not generate satisfactory results when the reference image inadequately represents the target style or the style is extremely uncommon. The reliance on pre-trained T2V models limits the quality of generated results in certain aspects, e.g., generating high-fidelity faces. text-to-video generation, stylized video generation, style adapter, content-style disentanglement, diffusion models
2312.00210 Report DREAM: Diffusion Rectification and Estimation-Adaptive Models Jinxin Zhou, Tianyu Ding, Tianyi Chen, Jiachen Jiang, Ilya Zharkov, Zhihui Zhu, Luming Liang We present DREAM, a novel training framework representing Diffusion Rectification and Estimation Adaptive Models, requiring minimal code changes (just three lines) yet significantly enhancing the alignment of training with sampling in diffusion models. DREAM features two components: diffusion rectification, which adjusts training to reflect the sampling process, and estimation adaptation, which balances perception against distortion. When applied to image super-resolution (SR), DREAM adeptly navigates the tradeoff between minimizing distortion and preserving high image quality. Experiments demonstrate DREAM's superiority over standard diffusion-based SR methods, showing a $2$ to $3\times $ faster training convergence and a $10$ to $20\times$ reduction in sampling steps to achieve comparable results. We hope DREAM will inspire a rethinking of diffusion model training paradigms. This paper presents DREAM, a novel training framework for diffusion models that effectively reduces the training-sampling discrepancy in conditional image generation tasks, such as super-resolution. Training diffusion models, especially for conditional generation tasks, suffers from a discrepancy between training and sampling processes, hindering their performance. DREAM addresses this issue with minimal code changes, leading to enhanced image quality, faster training, and improved sampling efficiency. DREAM consists of two main components: 1) Diffusion Rectification: adjusts training to reflect the sampling process by utilizing the model's own predictions for error estimation and rectification. 2) Estimation Adaptation: balances the benefits of standard diffusion and diffusion rectification by adaptively incorporating ground-truth information during training. DREAM significantly enhances both distortion and perception metrics across various diffusion-based super-resolution models and datasets. It achieves a 2-3 times faster training convergence and a 10-20 times reduction in sampling steps compared to standard diffusion training, yielding superior or comparable results. DREAM demonstrates superior robustness and generalization ability, achieving state-of-the-art out-of-distribution (OOD) super-resolution performance across diverse datasets and scales. While DREAM shows promising results, it primarily focuses on super-resolution tasks in this work. Further exploration of advanced network architectures and loss functions, such as incorporating GAN loss, could potentially lead to further enhancements in image quality. diffusion models, super-resolution, training-sampling discrepancy, generative models, image generation
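The following sketch illustrates one plausible reading of the rectification described in the abstract: take a no-grad self-estimate of the noise, blend it with the sampled noise using an adaptive weight, re-noise, and regress the blended target. The blend form, the `lam` value, and the `model(x_t, t)` signature are illustrative assumptions, not the paper's exact derivation.

```python
import torch
import torch.nn.functional as F

def dream_loss(model, x0, t, alphas_cumprod, lam=0.5):
    """Sketch of the diffusion-rectification idea: adjust the training target using
    the network's own (no-grad) noise estimate so training better mirrors sampling.
    model(x_t, t) predicts noise; lam stands in for the estimation-adaptation weight.
    """
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

    with torch.no_grad():                          # self-estimate, mirrors sampling
        eps_hat = model(x_t, t)

    target = (1.0 - lam) * noise + lam * eps_hat   # rectified target (assumed form)
    x_t_rect = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * target
    return F.mse_loss(model(x_t_rect, t), target)
```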
2312.00206 Report SparseGS: Real-Time 360° Sparse View Synthesis using Gaussian Splatting Haolin Xiong, Sairisheek Muttukuru, Rishi Upadhyay, Pradyumna Chari, Achuta Kadambi The problem of novel view synthesis has grown significantly in popularity recently with the introduction of Neural Radiance Fields (NeRFs) and other implicit scene representation methods. A recent advance, 3D Gaussian Splatting (3DGS), leverages an explicit representation to achieve real-time rendering with high-quality results. However, 3DGS still requires an abundance of training views to generate a coherent scene representation. In few shot settings, similar to NeRF, 3DGS tends to overfit to training views, causing background collapse and excessive floaters, especially as the number of training views are reduced. We propose a method to enable training coherent 3DGS-based radiance fields of 360-degree scenes from sparse training views. We integrate depth priors with generative and explicit constraints to reduce background collapse, remove floaters, and enhance consistency from unseen viewpoints. Experiments show that our method outperforms base 3DGS by 6.4% in LPIPS and by 12.2% in PSNR, and NeRF-based methods by at least 17.6% in LPIPS on the MipNeRF-360 dataset with substantially less training and inference cost. This paper presents SparseGS, a novel method for real-time 360° sparse view synthesis that leverages 3D Gaussian Splatting (3DGS) and incorporates depth priors, diffusion constraints, and a novel floater pruning technique. Existing view synthesis techniques like NeRFs and 3DGS often struggle in few-shot scenarios, leading to artifacts like floaters and background collapse, particularly in challenging 360° unbounded scenes. SparseGS integrates depth priors using a patch-based depth correlation loss based on a novel softmax depth rendering technique. It utilizes a score distillation sampling loss from a pre-trained diffusion model for refinement and employs image re-projection for data augmentation. A key innovation is an explicit, adaptive operator that directly prunes unwanted "floater" artifacts from the 3D Gaussian representation. SparseGS outperforms base 3DGS by 6.4% in LPIPS and 12.2% in PSNR on the MipNeRF-360 dataset. It surpasses NeRF-based methods by at least 17.6% in LPIPS. SparseGS enables real-time inference (100+ FPS) while maintaining high visual quality. SparseGS is highly reliant on the accuracy and detail of the initial point cloud provided by COLMAP, which can be problematic in sparse view settings where initial point clouds are small. Future work could explore point cloud densification techniques as data augmentation to address this limitation. novel view synthesis, 3d gaussian splatting, few-shot learning, depth priors, floater pruning
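A patch-wise Pearson depth-correlation loss of the kind described (monocular depth is only reliable up to scale and shift, so local correlation is penalized instead of absolute error) could look like the sketch below; the patch size and the simple loop layout are illustrative choices.

```python
import torch

def patch_depth_correlation_loss(rendered_depth, mono_depth, patch=32):
    """Penalize (1 - Pearson correlation) between rendered and monocular depth
    within local patches. rendered_depth, mono_depth: (H, W) tensors."""
    H, W = rendered_depth.shape
    loss, n = rendered_depth.new_tensor(0.0), 0
    for i in range(0, H - patch + 1, patch):
        for j in range(0, W - patch + 1, patch):
            a = rendered_depth[i:i + patch, j:j + patch].reshape(-1)
            b = mono_depth[i:i + patch, j:j + patch].reshape(-1)
            a = a - a.mean()
            b = b - b.mean()
            corr = (a * b).sum() / (a.norm() * b.norm() + 1e-8)
            loss = loss + (1.0 - corr)
            n += 1
    return loss / max(n, 1)
```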
2312.00195 Report Raising the Bar of AI-generated Image Detection with CLIP Davide Cozzolino, Giovanni Poggi, Riccardo Corvi, Matthias Nießner, Luisa Verdoliva The aim of this work is to explore the potential of pre-trained vision-language models (VLMs) for universal detection of AI-generated images. We develop a lightweight detection strategy based on CLIP features and study its performance in a wide variety of challenging scenarios. We find that, contrary to previous beliefs, it is neither necessary nor convenient to use a large domain-specific dataset for training. On the contrary, by using only a handful of example images from a single generative model, a CLIP-based detector exhibits surprising generalization ability and high robustness across different architectures, including recent commercial tools such as Dalle-3, Midjourney v5, and Firefly. We match the state-of-the-art (SoTA) on in-distribution data and significantly improve upon it in terms of generalization to out-of-distribution data (+6% AUC) and robustness to impaired/laundered data (+13%). Our project is available at https://grip-unina.github.io/ClipBased-SyntheticImageDetection/ This paper presents a lightweight AI-generated image detection method using CLIP features, demonstrating superior generalization ability and robustness across diverse generators, outperforming state-of-the-art methods. Detecting AI-generated images is crucial for combating disinformation and ensuring media authenticity, especially with the proliferation of advanced image synthesis tools. The method extracts CLIP features from real/fake image pairs with shared textual descriptions, training a linear SVM classifier. It analyzes the impact of reference set size, content, and CLIP pre-training. CLIP features achieve excellent generalization, requiring only a handful of examples for effective detection. Performance is influenced by the diversity of the reference set and benefits from large-scale pre-training. The method demonstrates strong robustness to image perturbations, surpassing existing methods, particularly on challenging commercial AI-generated images. The method's reliance on semantic features might be vulnerable to adversarial attacks targeting these aspects. Future work includes exploring few-shot adaptation for real-world scenarios and improving interpretability. ai-generated image detection, clip, generalization, robustness, image forensics
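The detection strategy is lightweight enough to sketch end to end: frozen CLIP image features plus a linear SVM trained on a handful of real/fake examples. The checkpoint name and SVM hyperparameters below are illustrative assumptions, not necessarily the authors' choices.

```python
import torch
import torch.nn.functional as F
from sklearn.svm import LinearSVC
from transformers import CLIPModel, CLIPProcessor

def fit_detector(real_images, fake_images):
    """Frozen CLIP features + linear classifier (sketch).
    real_images / fake_images: lists of PIL images."""
    name = "openai/clip-vit-large-patch14"
    model = CLIPModel.from_pretrained(name).eval()
    proc = CLIPProcessor.from_pretrained(name)

    def embed(images):
        inputs = proc(images=images, return_tensors="pt")
        with torch.no_grad():
            feats = model.get_image_features(**inputs)
        return F.normalize(feats, dim=-1).numpy()

    X = list(embed(real_images)) + list(embed(fake_images))
    y = [0] * len(real_images) + [1] * len(fake_images)
    clf = LinearSVC(C=1.0).fit(X, y)        # linear decision boundary on CLIP features
    return model, proc, clf
```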
2312.00116 Report S2ST: Image-to-Image Translation in the Seed Space of Latent Diffusion Or Greenberg, Eran Kishon, Dani Lischinski Image-to-image translation (I2IT) refers to the process of transforming images from a source domain to a target domain while maintaining a fundamental connection in terms of image content. In the past few years, remarkable advancements in I2IT were achieved by Generative Adversarial Networks (GANs), which nevertheless struggle with translations requiring high precision. Recently, Diffusion Models have established themselves as the engine of choice for image generation. In this paper we introduce S2ST, a novel framework designed to accomplish global I2IT in complex photorealistic images, such as day-to-night or clear-to-rain translations of automotive scenes. S2ST operates within the seed space of a Latent Diffusion Model, thereby leveraging the powerful image priors learned by the latter. We show that S2ST surpasses state-of-the-art GAN-based I2IT methods, as well as diffusion-based approaches, for complex automotive scenes, improving fidelity while respecting the target domain's appearance across a variety of domains. Notably, S2ST obviates the necessity for training domain-specific translation networks. Introduces S2ST, a novel diffusion-based unpaired image-to-image translation (I2IT) method for complex photorealistic images (e.g., automotive scenes), operating within the seed space of a Latent Diffusion Model (LDM). Addresses limitations of GAN-based I2IT methods in handling complex scene translations with high content fidelity, leveraging the power of pre-trained diffusion models for realistic and detailed image generation. Employs a two-step process: 1) Seed Translation optimizes the initial seed obtained by inverting the source image to match the target domain while preserving structure. 2) Trajectory Optimization refines the DDIM sampling trajectory to further enhance structural similarity between source and generated images. Outperforms state-of-the-art GAN-based methods in terms of target domain appearance and realism (measured by KID and SSIM) for day-to-night translations on BDD100k. Demonstrates superior performance in human evaluation for achieving target domain appearance while preserving source image content. Enables multi-domain translation using the same model, unlike GAN-based methods requiring separate training for each domain pair. High computational cost due to backpropagation through the entire sampling process. Lack of explicit cycle-consistency mechanism found in GANs, potentially limiting content preservation despite efforts through seed optimization and trajectory refinement. image-to-image translation, diffusion models, seed space, trajectory optimization, automotive scenes
2312.00109 Report Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, Limin Wang, Dahua Lin, Bo Dai Neural rendering methods have significantly advanced photo-realistic 3D scene rendering in various academic and industrial applications. The recent 3D Gaussian Splatting method has achieved the state-of-the-art rendering quality and speed combining the benefits of both primitive-based representations and volumetric representations. However, it often leads to heavily redundant Gaussians that try to fit every training view, neglecting the underlying scene geometry. Consequently, the resulting model becomes less robust to significant view changes, texture-less areas and lighting effects. We introduce Scaffold-GS, which uses anchor points to distribute local 3D Gaussians, and predicts their attributes on-the-fly based on viewing direction and distance within the view frustum. Anchor growing and pruning strategies are developed based on the importance of neural Gaussians to reliably improve the scene coverage. We show that our method effectively reduces redundant Gaussians while delivering high-quality rendering. We also demonstrate an enhanced capability to accommodate scenes with varying levels-of-detail and view-dependent observations, without sacrificing the rendering speed. This paper introduces Scaffold-GS, a novel 3D scene representation method for view-adaptive rendering using anchor points to guide neural 3D Gaussian distribution. Existing 3D Gaussian Splatting methods suffer from redundant Gaussians and lack of robustness to view changes. Scaffold-GS improves rendering quality and efficiency by leveraging scene structure and view-dependent neural Gaussians. The method initializes anchor points from SfM point clouds and dynamically predicts neural Gaussian attributes from anchor features and viewing information. It refines anchor points via growing and pruning based on neural Gaussian feedback. Scaffold-GS achieves comparable or better rendering quality than state-of-the-art methods like 3D-GS. It requires significantly less storage space while maintaining real-time rendering speed. The learned anchor features exhibit semantic clustering, indicating potential for scene understanding tasks. Performance heavily relies on the quality of initial SfM point clouds. The current filtering strategy by opacity may mask important neural Gaussians. neural rendering, 3d gaussian splatting, view-adaptive rendering, scene representation, anchor points
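A schematic of the view-adaptive decoding step, in which each anchor feature plus viewing direction and distance is mapped to the attributes of k neural Gaussians; the output parameterization and activations below are a plausible reading of the paper, not its exact head design.

```python
import torch
import torch.nn as nn

class AnchorGaussianHead(nn.Module):
    """Each anchor stores a feature vector and spawns k neural Gaussians whose
    attributes are predicted on the fly from the feature, view direction and distance."""

    def __init__(self, feat_dim=32, k=10, hidden=64):
        super().__init__()
        self.k = k
        in_dim = feat_dim + 3 + 1                        # feature + view dir + distance
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, k * (3 + 1 + 3 + 4 + 3)),  # offset, opacity, scale, quat, rgb
        )

    def forward(self, anchor_feat, view_dir, view_dist):
        # anchor_feat: (A, feat_dim); view_dir: (A, 3); view_dist: (A, 1)
        out = self.mlp(torch.cat([anchor_feat, view_dir, view_dist], dim=-1))
        out = out.view(-1, self.k, 14)
        offset, opacity, scale, quat, rgb = out.split([3, 1, 3, 4, 3], dim=-1)
        return offset, torch.sigmoid(opacity), torch.exp(scale), quat, torch.sigmoid(rgb)
```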
2312.00094 Report Fast ODE-based Sampling for Diffusion Models in Around 5 Steps Zhenyu Zhou, Defang Chen, Can Wang, Chun Chen Sampling from diffusion models can be treated as solving the corresponding ordinary differential equations (ODEs), with the aim of obtaining an accurate solution with as few number of function evaluations (NFE) as possible. Recently, various fast samplers utilizing higher-order ODE solvers have emerged and achieved better performance than the initial first-order one. However, these numerical methods inherently result in certain approximation errors, which significantly degrades sample quality with extremely small NFE (e.g., around 5). In contrast, based on the geometric observation that each sampling trajectory almost lies in a two-dimensional subspace embedded in the ambient space, we propose Approximate MEan-Direction Solver (AMED-Solver) that eliminates truncation errors by directly learning the mean direction for fast diffusion sampling. Besides, our method can be easily used as a plugin to further improve existing ODE-based samplers. Extensive experiments on image synthesis with the resolution ranging from 32 to 512 demonstrate the effectiveness of our method. With only 5 NFE, we achieve 6.61 FID on CIFAR-10, 10.74 FID on ImageNet 64$\times$64, and 13.20 FID on LSUN Bedroom. Our code is available at https://github.com/zju-pi/diff-sampler. This paper introduces AMED-Solver, a novel single-step ODE solver for diffusion models that minimizes discretization errors by predicting mean directions in each sampling step. Existing fast diffusion samplers suffer from significant sample quality degradation when using very few function evaluations (NFE), especially single-step solvers. AMED-Solver addresses this limitation, enabling high-quality generation in around 5 NFE. The method leverages the observation that sampling trajectories lie approximately in a 2D subspace. It then trains a shallow neural network (AMED predictor) to predict intermediate time steps and scaling factors that minimize the distance between student and teacher sampling trajectories. AMED-Solver outperforms other single-step ODE solvers and achieves comparable or superior results to multi-step solvers in many cases. The AMED-Plugin, a generalization of AMED-Solver, consistently improves the performance of existing fast ODE solvers across various datasets. The method achieves state-of-the-art results among solver-based methods in around 5 NFE, demonstrating significant FID improvements on CIFAR-10, ImageNet 64x64, and LSUN Bedroom. Fast ODE solvers, including AMED, show high sensitivity to time schedules, especially with limited NFE. Future work could explore adaptive time schedules based on sampling trajectory geometry. The paper primarily focuses on image generation. Exploring AMED's applicability to other diffusion model applications like image editing and restoration could be interesting. diffusion models, ode solvers, fast sampling, image generation, knowledge distillation
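Schematically, an AMED-style step behaves like a midpoint method whose intermediate time (and a scaling factor) comes from a small learned predictor. The sketch below is a didactic reading under an EDM-style probability-flow direction d(x, t) = (x - D(x, t)) / t; it does not reproduce the authors' exact parameterization or the predictor's feature inputs.

```python
def amed_style_step(d_fn, x, t_cur, t_next, predictor):
    """Didactic 'mean direction' step: a small predictor picks an intermediate time s
    (and a scale); the ODE direction evaluated there stands in for the average
    direction over [t_next, t_cur]. d_fn(x, t) is the probability-flow direction;
    `predictor` plays the role of the learned AMED-style module (simplified here).
    """
    d_cur = d_fn(x, t_cur)
    s, scale = predictor(x, t_cur, t_next)       # learned intermediate time and scaling
    x_mid = x + (s - t_cur) * d_cur              # Euler step to the intermediate time
    d_mid = d_fn(x_mid, s)
    return x + scale * (t_next - t_cur) * d_mid  # full step along the estimated mean direction
```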
2312.00093 Report GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs Gege Gao, Weiyang Liu, Anpei Chen, Andreas Geiger, Bernhard Schölkopf As pretrained text-to-image diffusion models become increasingly powerful, recent efforts have been made to distill knowledge from these text-to-image pretrained models for optimizing a text-guided 3D model. Most of the existing methods generate a holistic 3D model from a plain text input. This can be problematic when the text describes a complex scene with multiple objects, because the vectorized text embeddings are inherently unable to capture a complex description with multiple entities and relationships. Holistic 3D modeling of the entire scene further prevents accurate grounding of text entities and concepts. To address this limitation, we propose GraphDreamer, a novel framework to generate compositional 3D scenes from scene graphs, where objects are represented as nodes and their interactions as edges. By exploiting node and edge information in scene graphs, our method makes better use of the pretrained text-to-image diffusion model and is able to fully disentangle different objects without image-level supervision. To facilitate modeling of object-wise relationships, we use signed distance fields as representation and impose a constraint to avoid inter-penetration of objects. To avoid manual scene graph creation, we design a text prompt for ChatGPT to generate scene graphs based on text inputs. We conduct both qualitative and quantitative experiments to validate the effectiveness of GraphDreamer in generating high-fidelity compositional 3D scenes with disentangled object entities. GraphDreamer is a novel framework that leverages scene graphs to generate compositional 3D scenes, effectively disentangling objects and their relationships from text descriptions. Existing text-to-3D methods struggle with complex scenes involving multiple objects and their interactions, suffering from attribute confusion and guidance collapse. GraphDreamer overcomes these limitations by utilizing the structured representation of scene graphs. GraphDreamer decomposes scene graphs into object and relationship descriptions. It employs identity-aware positional encoders to represent individual object fields and a shared SDF network for geometry. By rendering objects and their combinations individually and globally, GraphDreamer utilizes SDS loss for optimization. GraphDreamer effectively disentangles objects in 3D scenes, as evidenced by the CLIP score analysis of individual object renderings. It outperforms state-of-the-art text-to-3D methods like Magic3D and MVDream in generating multi-object scenes, achieving higher CLIP scores and better visual fidelity. Ablation studies confirm that the use of scene graphs significantly improves performance, highlighting their importance for accurate guidance. Individual object generation quality remains constrained by the limitations of SDS optimization. Object decomposition may fail in cases of significant semantic dominance of one object over another. 3d scene generation, text-to-3d, scene graphs, score distillation sampling, compositional 3d modeling
2312.00085 Report X-Dreamer: Creating High-quality 3D Content by Bridging the Domain Gap Between Text-to-2D and Text-to-3D Generation Yiwei Ma, Yijun Fan, Jiayi Ji, Haowei Wang, Xiaoshuai Sun, Guannan Jiang, Annan Shu, Rongrong Ji In recent times, automatic text-to-3D content creation has made significant progress, driven by the development of pretrained 2D diffusion models. Existing text-to-3D methods typically optimize the 3D representation to ensure that the rendered image aligns well with the given text, as evaluated by the pretrained 2D diffusion model. Nevertheless, a substantial domain gap exists between 2D images and 3D assets, primarily attributed to variations in camera-related attributes and the exclusive presence of foreground objects. Consequently, employing 2D diffusion models directly for optimizing 3D representations may lead to suboptimal outcomes. To address this issue, we present X-Dreamer, a novel approach for high-quality text-to-3D content creation that effectively bridges the gap between text-to-2D and text-to-3D synthesis. The key components of X-Dreamer are two innovative designs: Camera-Guided Low-Rank Adaptation (CG-LoRA) and Attention-Mask Alignment (AMA) Loss. CG-LoRA dynamically incorporates camera information into the pretrained diffusion models by employing camera-dependent generation for trainable parameters. This integration enhances the alignment between the generated 3D assets and the camera's perspective. AMA loss guides the attention map of the pretrained diffusion model using the binary mask of the 3D object, prioritizing the creation of the foreground object. This module ensures that the model focuses on generating accurate and detailed foreground objects. Extensive evaluations demonstrate the effectiveness of our proposed method compared to existing text-to-3D approaches. Our project webpage: https://xmu-xiaoma666.github.io/Projects/X-Dreamer/ . This paper introduces X-Dreamer, a novel framework for high-quality text-to-3D content creation that bridges the gap between text-to-2D and text-to-3D generation by incorporating camera information and prioritizing foreground object generation. Existing text-to-3D methods face challenges due to the domain gap between 2D images and 3D assets, especially in handling camera parameters and focusing on foreground objects. X-Dreamer utilizes two innovative designs: 1) CG-LoRA dynamically integrates camera information into pretrained diffusion models for better alignment. 2) AMA loss guides the attention map to prioritize foreground object generation by aligning it with the rendered 3D object mask. X-Dreamer generates high-quality, photorealistic 3D assets from text prompts, starting from either an ellipsoid or a coarse-grained mesh. X-Dreamer outperforms SOTA methods like DreamFusion, Magic3D, and Fantasia3D in realism and achieves comparable results to ProlificDreamer with significantly less optimization time. Ablation studies demonstrate the significant contributions of CG-LoRA and AMA loss in enhancing geometry, appearance, and overall quality of generated 3D objects. X-Dreamer currently cannot generate multiple separate objects from a single text prompt, sometimes merging their properties. Future work could explore multi-object generation and address other limitations. text-to-3d synthesis, diffusion models, camera-aware generation, foreground object prioritization, 3d content creation
2312.00081 Report Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding Wujian Peng, Sicheng Xie, Zuyao You, Shiyi Lan, Zuxuan Wu Vision language models (VLM) have demonstrated remarkable performance across various downstream tasks. However, understanding fine-grained visual-linguistic concepts, such as attributes and inter-object relationships, remains a significant challenge. While several benchmarks aim to evaluate VLMs in finer granularity, their primary focus remains on the linguistic aspect, neglecting the visual dimension. Here, we highlight the importance of evaluating VLMs from both a textual and visual perspective. We introduce a progressive pipeline to synthesize images that vary in a specific attribute while ensuring consistency in all other aspects. Utilizing this data engine, we carefully design a benchmark, SPEC, to diagnose the comprehension of object size, position, existence, and count. Subsequently, we conduct a thorough evaluation of four leading VLMs on SPEC. Surprisingly, their performance is close to random guess, revealing significant limitations. With this in mind, we propose a simple yet effective approach to optimize VLMs in fine-grained understanding, achieving significant improvements on SPEC without compromising the zero-shot performance. Results on two additional fine-grained benchmarks also show consistent improvements, further validating the transferability of our approach. Code and data are available at https://github.com/wjpoom/SPEC. This paper introduces SPEC, a new benchmark for evaluating the fine-grained visual-linguistic comprehension of Vision Language Models (VLMs) concerning object size, position, existence, and count, and proposes a simple yet effective training method to enhance VLMs' understanding in these aspects. Existing VLMs show limitations in understanding fine-grained visual-linguistic concepts, highlighting a need for benchmarks like SPEC that go beyond evaluating object recognition and focus on compositional reasoning. The authors develop a progressive pipeline to synthesize images with controlled variations in specific attributes while maintaining consistency in other aspects. They use this pipeline to construct SPEC and evaluate four leading VLMs, revealing their shortcomings. A novel training method incorporating hard negative examples is then proposed and applied to CLIP to boost its fine-grained understanding. Even state-of-the-art VLMs perform close to random chance on SPEC, indicating significant limitations in fine-grained comprehension. The proposed training method significantly improves CLIP's performance on SPEC, boosting both image-to-text and text-to-image matching accuracy. The improvements obtained through the proposed method generalize to other fine-grained benchmarks like ARO and Eqben, showcasing its ability to enhance transferable fine-grained understanding. The current study focuses on evaluating four specific attributes, future work could explore more diverse visual-linguistic concepts. The benchmark is constructed using synthetic images, which may not fully encompass the complexity of real-world images. Future work should consider incorporating real-world images for evaluation. vision language models, fine-grained understanding, benchmarking, image synthesis, compositional reasoning
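The proposed optimization can be pictured as standard CLIP contrastive training with synthesized hard-negative captions appended to the logits; in the sketch below, the temperature, the per-image number of hard negatives M, and the symmetric loss weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def clip_loss_with_hard_negatives(img_emb, txt_emb, hard_txt_emb, temperature=0.07):
    """In-batch contrastive loss plus hard negatives: each image also sees captions that
    differ only in a fine-grained detail (size/position/existence/count) and must rank
    its true caption above them. img_emb, txt_emb: (B, D); hard_txt_emb: (B, M, D).
    All embeddings are assumed L2-normalized."""
    B = img_emb.shape[0]
    logits = img_emb @ txt_emb.t() / temperature                            # (B, B)
    hard = torch.einsum("bd,bmd->bm", img_emb, hard_txt_emb) / temperature  # (B, M)
    logits = torch.cat([logits, hard], dim=1)                               # append negatives
    labels = torch.arange(B, device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(txt_emb @ img_emb.t() / temperature, labels))
```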
2312.00079 Report HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models Zhonghao Wang, Wei Wei, Yang Zhao, Zhisheng Xiao, Mark Hasegawa-Johnson, Humphrey Shi, Tingbo Hou This paper explores advancements in high-fidelity personalized image generation through the utilization of pre-trained text-to-image diffusion models. While previous approaches have made significant strides in generating versatile scenes based on text descriptions and a few input images, challenges persist in maintaining the subject fidelity within the generated images. In this work, we introduce an innovative algorithm named HiFi Tuner to enhance the appearance preservation of objects during personalized image generation. Our proposed method employs a parameter-efficient fine-tuning framework, comprising a denoising process and a pivotal inversion process. Key enhancements include the utilization of mask guidance, a novel parameter regularization technique, and the incorporation of step-wise subject representations to elevate the sample fidelity. Additionally, we propose a reference-guided generation approach that leverages the pivotal inversion of a reference image to mitigate unwanted subject variations and artifacts. We further extend our method to a novel image editing task: substituting the subject in an image through textual manipulations. Experimental evaluations conducted on the DreamBooth dataset using the Stable Diffusion model showcase promising results. Fine-tuning solely on textual embeddings improves CLIP-T score by 3.6 points and improves DINO score by 9.6 points over Textual Inversion. When fine-tuning all parameters, HiFi Tuner improves CLIP-T score by 1.2 points and improves DINO score by 1.2 points over DreamBooth, establishing a new state of the art. This paper introduces HiFi Tuner, a novel parameter-efficient fine-tuning framework for personalized image generation using pre-trained text-to-image diffusion models, enhancing subject fidelity while preserving scene coverage. Existing methods struggle to balance sample quality with parameter efficiency, scene flexibility, and accurate preservation of subject appearance in personalized image generation. This work addresses these limitations to enhance the fidelity of personalized images. The proposed HiFi Tuner utilizes a denoising process with mask guidance, parameter regularization, and step-wise subject representations. It also employs a reference-guided generation approach leveraging pivotal inversion of a reference image to maintain subject details. Fine-tuning solely textual embeddings with HiFi Tuner improves CLIP-T score by 3.6 points and DINO score by 9.6 points compared to Textual Inversion. Fine-tuning all parameters with HiFi Tuner surpasses DreamBooth by 1.2 points in both CLIP-T and DINO scores. The method is extended to a novel image editing task, successfully substituting subjects in images through textual manipulations. The reference-guided generation is only applied to rigid objects due to the limited appearance variations in the dataset. Future work could explore applying HiFi Tuner to more complex scenes with multiple interacting objects. image generation, diffusion models, personalized image synthesis, text-to-image generation, fine-tuning
2312.00065 Report Unsupervised Keypoints from Pretrained Diffusion Models Eric Hedlin, Gopal Sharma, Shweta Mahajan, Xingzhe He, Hossam Isack, Abhishek Kar, Helge Rhodin, Andrea Tagliasacchi, Kwang Moo Yi Unsupervised learning of keypoints and landmarks has seen significant progress with the help of modern neural network architectures, but performance is yet to match the supervised counterpart, making their practicability questionable. We leverage the emergent knowledge within text-to-image diffusion models, towards more robust unsupervised keypoints. Our core idea is to find text embeddings that would cause the generative model to consistently attend to compact regions in images (i.e. keypoints). To do so, we simply optimize the text embedding such that the cross-attention maps within the denoising network are localized as Gaussians with small standard deviations. We validate our performance on multiple datasets: the CelebA, CUB-200-2011, Tai-Chi-HD, DeepFashion, and Human3.6m datasets. We achieve significantly improved accuracy, sometimes even outperforming supervised ones, particularly for data that is non-aligned and less curated. Our code is publicly available and can be found through our project page: https://ubc-vision.github.io/StableKeypoints/ This paper introduces a novel unsupervised keypoint learning method that leverages the knowledge embedded within pre-trained text-to-image diffusion models, specifically targeting cross-attention maps to identify semantically meaningful locations in images. Unsupervised keypoint detection methods currently lag behind their supervised counterparts in performance, especially on non-aligned, in-the-wild datasets. This work aims to bridge this gap by utilizing the power of large pre-trained generative models. The proposed method optimizes text embeddings (tokens) to enforce localized responses in the cross-attention maps of a diffusion model. This is achieved by minimizing the difference between attention maps and Gaussian distributions centered at their maxima, while also enforcing equivariance to geometric transformations. The final keypoints are then extracted as the maxima of these localized attention maps. The method achieves state-of-the-art results on several benchmark datasets, including CelebA, CUB-200-2011, Tai-Chi-HD, DeepFashion, and Human3.6m, particularly excelling in unaligned and less curated settings. The approach demonstrates strong generalization capability, effectively transferring learned keypoints to unseen datasets and even across different object categories. The method highlights the potential of leveraging large pre-trained diffusion models for downstream vision tasks without requiring fine-tuning. While demonstrating strong performance in unaligned cases, the method's performance in heavily pre-processed and aligned settings could be further investigated and potentially improved. Future work could explore the impact of different diffusion models and architectures on the quality of learned keypoints. unsupervised learning, keypoint detection, diffusion models, cross-attention, generalization
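The core localization objective can be sketched as follows: push each optimized token's cross-attention map toward a narrow Gaussian centered at the map's own maximum. The Gaussian width and max-normalization below are assumptions, and the transformation-equivariance term used in the paper is omitted.

```python
import torch

def localization_loss(attn_maps, sigma=0.1):
    """Encourage each token's cross-attention map to be a compact, keypoint-like blob.
    attn_maps: (K, H, W), non-negative cross-attention responses for K optimized tokens."""
    K, H, W = attn_maps.shape
    flat = attn_maps.reshape(K, -1)
    idx = flat.argmax(dim=-1)
    cy = torch.div(idx, W, rounding_mode="floor").float() / (H - 1)   # row of each maximum
    cx = (idx % W).float() / (W - 1)                                  # column of each maximum

    ys = torch.linspace(0, 1, H, device=attn_maps.device).view(1, H, 1)
    xs = torch.linspace(0, 1, W, device=attn_maps.device).view(1, 1, W)
    target = torch.exp(-((ys - cy.view(K, 1, 1)) ** 2 + (xs - cx.view(K, 1, 1)) ** 2)
                       / (2 * sigma ** 2))
    norm = attn_maps / (attn_maps.amax(dim=(1, 2), keepdim=True) + 1e-8)
    return ((norm - target) ** 2).mean()
```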
2312.00063 Report MoMask: Generative Masked Modeling of 3D Human Motions Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, Li Cheng We introduce MoMask, a novel masked modeling framework for text-driven 3D human motion generation. In MoMask, a hierarchical quantization scheme is employed to represent human motion as multi-layer discrete motion tokens with high-fidelity details. Starting at the base layer, with a sequence of motion tokens obtained by vector quantization, the residual tokens of increasing orders are derived and stored at the subsequent layers of the hierarchy. This is consequently followed by two distinct bidirectional transformers. For the base-layer motion tokens, a Masked Transformer is designated to predict randomly masked motion tokens conditioned on text input at training stage. During generation (i.e. inference) stage, starting from an empty sequence, our Masked Transformer iteratively fills up the missing tokens; Subsequently, a Residual Transformer learns to progressively predict the next-layer tokens based on the results from current layer. Extensive experiments demonstrate that MoMask outperforms the state-of-art methods on the text-to-motion generation task, with an FID of 0.045 (vs e.g. 0.141 of T2M-GPT) on the HumanML3D dataset, and 0.228 (vs 0.514) on KIT-ML, respectively. MoMask can also be seamlessly applied in related tasks without further model fine-tuning, such as text-guided temporal inpainting. Introduces MoMask, a generative masked modeling framework for text-driven 3D human motion generation using a hierarchical quantization scheme and bidirectional transformers. Addresses limitations of existing text-to-motion methods by improving motion quality, capturing subtle language nuances, and enabling efficient bidirectional decoding. Employs residual vector quantization (RVQ) to represent motion as multi-layer discrete tokens. Utilizes a Masked Transformer to predict base-layer tokens conditioned on text, and a Residual Transformer to progressively predict residual tokens. Achieves state-of-the-art performance on HumanML3D and KIT-ML datasets with FID scores of 0.045 and 0.228, respectively. Generates motions with higher quality and better understanding of subtle language concepts compared to baselines. Demonstrates effectiveness in text-guided temporal inpainting. Limited motion diversity compared to fidelity and faithfulness. Requires target motion length as input. text-to-motion generation, generative masked modeling, residual vector quantization, 3d human motion synthesis, motion inpainting
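The hierarchical quantization can be sketched as plain residual vector quantization: the base codebook quantizes the motion feature, each subsequent codebook quantizes what remains, and summing the selected codes reconstructs the input. Shapes and the nearest-neighbor lookup below are illustrative; commitment and codebook-update losses are omitted.

```python
import torch

def residual_vector_quantize(x, codebooks):
    """Residual VQ sketch. x: (N, D) motion features; codebooks: list of (V, D) tensors,
    one per layer. Returns per-layer token ids and the cumulative reconstruction."""
    residual = x
    recon = torch.zeros_like(x)
    ids = []
    for cb in codebooks:                    # one codebook per hierarchy layer
        d = torch.cdist(residual, cb)       # (N, V) pairwise distances
        idx = d.argmin(dim=-1)              # nearest code per vector
        q = cb[idx]
        ids.append(idx)
        recon = recon + q
        residual = residual - q             # pass the residual to the next layer
    return ids, recon
```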
2311.18837 Report VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models Zhen Xing, Qi Dai, Zihao Zhang, Hui Zhang, Han Hu, Zuxuan Wu, Yu-Gang Jiang Diffusion models have achieved significant success in image and video generation. This motivates a growing interest in video editing tasks, where videos are edited according to provided text descriptions. However, most existing approaches only focus on video editing for short clips and rely on time-consuming tuning or inference. We are the first to propose Video Instruction Diffusion (VIDiff), a unified foundation model designed for a wide range of video tasks. These tasks encompass both understanding tasks (such as language-guided video object segmentation) and generative tasks (video editing and enhancement). Our model can edit and translate the desired results within seconds based on user instructions. Moreover, we design an iterative auto-regressive method to ensure consistency in editing and enhancing long videos. We provide convincing generative results for diverse input videos and written instructions, both qualitatively and quantitatively. More examples can be found at our website https://ChenHsing.github.io/VIDiff. VIDiff is introduced, a unified diffusion-based framework for various video translation tasks guided by multimodal instructions. Existing video editing and understanding models often lack a unified approach, require time-consuming tuning, and struggle with long video consistency. A pre-trained T2I diffusion model is adapted using a multi-stage training process, incorporating temporal attention and a multimodal condition injection mechanism for image and text instructions. An iterative auto-regressive method ensures long video consistency. VIDiff achieves state-of-the-art performance on video editing benchmarks, outperforming methods requiring per-video tuning. The model excels in video enhancement tasks like deblurring, dehazing, and in-painting, surpassing existing instruction-guided techniques. The iterative generation method effectively maintains temporal consistency in long video translations. The performance on certain tasks is limited by the VAE encoder used in the LDM. Future work includes exploring integration with large language models for more complex video understanding tasks. video editing, video enhancement, diffusion models, multimodal learning, instruction-guided video translation
2311.18836 Report ChatPose: Chatting about 3D Human Pose Yao Feng, Jing Lin, Sai Kumar Dwivedi, Yu Sun, Priyanka Patel, Michael J. Black We introduce ChatPose, a framework employing Large Language Models (LLMs) to understand and reason about 3D human poses from images or textual descriptions. Our work is motivated by the human ability to intuitively understand postures from a single image or a brief description, a process that intertwines image interpretation, world knowledge, and an understanding of body language. Traditional human pose estimation and generation methods often operate in isolation, lacking semantic understanding and reasoning abilities. ChatPose addresses these limitations by embedding SMPL poses as distinct signal tokens within a multimodal LLM, enabling the direct generation of 3D body poses from both textual and visual inputs. Leveraging the powerful capabilities of multimodal LLMs, ChatPose unifies classical 3D human pose and generation tasks while offering user interactions. Additionally, ChatPose empowers LLMs to apply their extensive world knowledge in reasoning about human poses, leading to two advanced tasks: speculative pose generation and reasoning about pose estimation. These tasks involve reasoning about humans to generate 3D poses from subtle text queries, possibly accompanied by images. We establish benchmarks for these tasks, moving beyond traditional 3D pose generation and estimation methods. Our results show that ChatPose outperforms existing multimodal LLMs and task-specific methods on these newly proposed tasks. Furthermore, ChatPose's ability to understand and generate 3D human poses based on complex reasoning opens new directions in human pose analysis. ChatPose is a framework that enables Large Language Models (LLMs) to understand and reason about 3D human poses from images or textual descriptions, bridging the gap between traditional pose estimation/generation methods and LLMs' general reasoning abilities. Existing human pose estimation and generation methods lack semantic understanding and reasoning, operating in isolation. ChatPose leverages LLMs' world knowledge and reasoning capabilities to overcome these limitations, unifying pose analysis tasks and enabling novel applications. ChatPose embeds SMPL poses as tokens within a multimodal LLM. It's trained on image-to-SMPL and text-to-SMPL data, allowing it to generate 3D poses from textual and visual inputs. The LLM's reasoning abilities are further utilized for two new tasks: Speculative Pose Generation (SPG) and Reasoning-based Pose Estimation (RPE). ChatPose outperforms other multimodal LLMs on pose generation and estimation tasks. It demonstrates zero-shot capability in reasoning about human poses within multi-turn dialogues. The framework excels in handling complex scenarios requiring reasoning, such as SPG and RPE, surpassing traditional methods. The accuracy of 3D pose estimation from images is not yet on par with specialized regressors, highlighting the need for larger, higher quality datasets relating language to pose. Freezing the vision encoder during training poses a limitation, potentially addressed by more powerful backbones or whole-model fine-tuning. human pose estimation, pose generation, large language models, multimodal learning, reasoning
2311.18835 Report InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation Rongyao Fang, Shilin Yan, Zhaoyang Huang, Jingqiu Zhou, Hao Tian, Jifeng Dai, Hongsheng Li Empowering models to dynamically accomplish tasks specified through natural language instructions represents a promising path toward more capable and general artificial intelligence. In this work, we introduce InstructSeq, an instruction-conditioned multi-modal modeling framework that unifies diverse vision tasks through flexible natural language control and handling of both visual and textual data. InstructSeq employs a multimodal transformer architecture encompassing visual, language, and sequential modeling. We utilize a visual encoder to extract image features and a text encoder to encode instructions. An autoregressive transformer fuses the representations and generates sequential task outputs. By training with LLM-generated natural language instructions, InstructSeq acquires a strong comprehension of free-form instructions for specifying visual tasks. This provides an intuitive interface for directing capabilities using flexible natural instructions. Without any task-specific tuning, InstructSeq achieves compelling performance on semantic segmentation, referring expression segmentation/comprehension, and image captioning. The flexible control and multi-task unification empower the model with more human-like versatility and generalizability for computer vision. The code will be released soon at https://github.com/rongyaofang/InstructSeq. Introduced InstructSeq, an instruction-conditioned multi-modal model that unifies diverse vision tasks through flexible natural language instructions, handling both visual and textual data. Addresses limitations of existing multi-modal models that rely on fixed instruction templates and lack flexibility in handling various vision tasks requiring different output types. Employs a multi-modal transformer architecture with a visual encoder, a frozen instruction encoder, and an autoregressive transformer to generate visual or textual outputs based on the input instruction. Utilizes an LLM to generate natural language instructions for training, enabling the model to comprehend and respond to diverse phrasings. Achieves competitive performance on semantic segmentation, referring expression segmentation/comprehension, and image captioning without task-specific tuning. Demonstrates superior generalization ability to novel instructions compared to models trained on fixed templates. Provides confidence estimates for predictions through sampling-based token generation, enabling the identification of uncertain areas in outputs. Mixing textual and dense visual outputs during training might slightly degrade performance on specific tasks like referring segmentation. Computational constraints limit exploring larger model sizes and more diverse datasets. multi-modal learning, natural language instructions, vision generalist model, sequence generation, computer vision
2311.18834 Report ART$\boldsymbol{\cdot}$V: Auto-Regressive Text-to-Video Generation with Diffusion Models Wenming Weng, Ruoyu Feng, Yanhui Wang, Qi Dai, Chunyu Wang, Dacheng Yin, Zhiyuan Zhao, Kai Qiu, Jianmin Bao, Yuhui Yuan, Chong Luo, Yueyi Zhang, Zhiwei Xiong We present ART$\boldsymbol{\cdot}$V, an efficient framework for auto-regressive video generation with diffusion models. Unlike existing methods that generate entire videos in one-shot, ART$\boldsymbol{\cdot}$V generates a single frame at a time, conditioned on the previous ones. The framework offers three distinct advantages. First, it only learns simple continual motions between adjacent frames, therefore avoiding modeling complex long-range motions that require huge training data. Second, it preserves the high-fidelity generation ability of the pre-trained image diffusion models by making only minimal network modifications. Third, it can generate arbitrarily long videos conditioned on a variety of prompts such as text, image or their combinations, making it highly versatile and flexible. To combat the common drifting issue in AR models, we propose masked diffusion model which implicitly learns which information can be drawn from reference images rather than network predictions, in order to reduce the risk of generating inconsistent appearances that cause drifting. Moreover, we further enhance generation coherence by conditioning it on the initial frame, which typically contains minimal noise. This is particularly useful for long video generation. When trained for only two weeks on four GPUs, ART$\boldsymbol{\cdot}$V already can generate videos with natural motions, rich details and a high level of aesthetic quality. Besides, it enables various appealing applications, e.g., composing a long video from multiple text prompts. This paper introduces ART⋅V, a novel auto-regressive framework using diffusion models for generating videos from text and/or image prompts. Existing text-to-video generation methods struggle to create realistic, long-range motions due to the limitations of one-shot generation and training data size. ART⋅V addresses these challenges by generating frames sequentially and focusing on short, continuous motions. ART⋅V employs a pre-trained image diffusion model with minimal modifications. Key techniques include: 1) Masked Diffusion Model (MDM) to mitigate drifting by leveraging information from reference frames. 2) Noise Augmentation to bridge the gap between training and testing. 3) Anchored Conditioning on the initial frame to enhance long-term coherence. ART⋅V generates videos with natural motion, rich detail, and high aesthetic quality despite limited training resources. It outperforms existing methods in zero-shot video generation benchmarks (UCF-101, MSR-VTT), achieving state-of-the-art results when conditioned on ground truth images. The auto-regressive approach allows for generating arbitrarily long videos from multiple text prompts with seamless transitions. Training on higher resolution and quality datasets is expected to further improve visual fidelity. Exploring advanced temporal modeling techniques within the auto-regressive framework could enhance long-range motion quality. text-to-video generation, diffusion models, auto-regressive models, motion generation, video synthesis
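A compact sketch of the frame-by-frame auto-regressive loop summarized above; `generate_first_frame` and `generate_next_frame` are hypothetical wrappers around a text-to-image model and the conditioned video diffusion model, and the anchored conditioning follows the description rather than the exact implementation.

```python
from typing import Callable, List


def generate_video(
    prompt: str,
    num_frames: int,
    generate_first_frame: Callable,
    generate_next_frame: Callable,
) -> List:
    anchor = generate_first_frame(prompt)  # clean first frame from a T2I model
    frames = [anchor]
    for _ in range(num_frames - 1):
        nxt = generate_next_frame(
            prompt,
            previous=frames[-1],  # only short-range motion has to be modeled
            anchor=anchor,        # anchored conditioning for long-term coherence
        )
        frames.append(nxt)
    return frames
```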
2311.18832 Report Exploiting Diffusion Prior for Generalizable Dense Prediction Hsin-Ying Lee, Hung-Yu Tseng, Hsin-Ying Lee, Ming-Hsuan Yang Contents generated by recent advanced Text-to-Image (T2I) diffusion models are sometimes too imaginative for existing off-the-shelf dense predictors to estimate due to the immitigable domain gap. We introduce DMP, a pipeline utilizing pre-trained T2I models as a prior for dense prediction tasks. To address the misalignment between deterministic prediction tasks and stochastic T2I models, we reformulate the diffusion process through a sequence of interpolations, establishing a deterministic mapping between input RGB images and output prediction distributions. To preserve generalizability, we use low-rank adaptation to fine-tune pre-trained models. Extensive experiments across five tasks, including 3D property estimation, semantic segmentation, and intrinsic image decomposition, showcase the efficacy of the proposed method. Despite limited-domain training data, the approach yields faithful estimations for arbitrary images, surpassing existing state-of-the-art algorithms. The paper proposes DMP, a novel approach leveraging pre-trained text-to-image (T2I) diffusion models as priors for generalizable dense prediction tasks. Existing dense prediction models struggle with the domain gap between real-world and T2I-generated images, limiting their application on imaginative content. This work aims to bridge this gap and enable faithful estimations on arbitrary images. DMP introduces a deterministic image-to-prediction diffusion process, reformulating the stochastic T2I generation into a series of interpolations. This ensures deterministic mapping between input RGB and output predictions. Additionally, it employs low-rank adaptation to fine-tune pre-trained models on limited-domain data while preserving generalizability. DMP achieves superior accuracy compared to previous image-to-image translation and diffusion-based methods on tasks like depth, normal, and segmentation estimation. Despite training on a small dataset of labeled bedroom images, DMP exhibits remarkable generalization, providing plausible predictions even on out-of-domain and arbitrary images. The proposed deterministic diffusion process is shown to be crucial for achieving accurate and faithful estimations, outperforming alternative parameterizations and single-step prediction approaches. The performance on real-world multi-class semantic segmentation tasks remains limited due to challenges in encoding many classes in the image space. Future work includes exploring the application of real-world datasets with text descriptions generated by image captioning models for potentially better performance. dense prediction, diffusion models, text-to-image generation, generalizability, low-rank adaptation
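A hedged sketch of the deterministic image-to-prediction diffusion idea: intermediate states interpolate between the input RGB latent and the predicted output latent rather than Gaussian noise, giving a deterministic mapping. The schedule, the step count, and the `denoiser` callable are illustrative assumptions, not the paper's exact parameterization.

```python
import torch


def dmp_predict(rgb_latent: torch.Tensor, denoiser, num_steps: int = 10) -> torch.Tensor:
    """Start from the RGB latent and iteratively move toward the prediction latent."""
    x = rgb_latent
    for i in range(num_steps):
        pred = denoiser(x, i)                   # estimate of the clean prediction
        w = (i + 1) / num_steps                 # weight of the prediction grows each step
        x = w * pred + (1.0 - w) * rgb_latent   # deterministic interpolation
    return x
```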
2311.18830 Report MotionEditor: Editing Video Motion via Content-Aware Diffusion Shuyuan Tu, Qi Dai, Zhi-Qi Cheng, Han Hu, Xintong Han, Zuxuan Wu, Yu-Gang Jiang Existing diffusion-based video editing models have made gorgeous advances for editing attributes of a source video over time but struggle to manipulate the motion information while preserving the original protagonist's appearance and background. To address this, we propose MotionEditor, a diffusion model for video motion editing. MotionEditor incorporates a novel content-aware motion adapter into ControlNet to capture temporal motion correspondence. While ControlNet enables direct generation based on skeleton poses, it encounters challenges when modifying the source motion in the inverted noise due to contradictory signals between the noise (source) and the condition (reference). Our adapter complements ControlNet by involving source content to transfer adapted control signals seamlessly. Further, we build up a two-branch architecture (a reconstruction branch and an editing branch) with a high-fidelity attention injection mechanism facilitating branch interaction. This mechanism enables the editing branch to query the key and value from the reconstruction branch in a decoupled manner, making the editing branch retain the original background and protagonist appearance. We also propose a skeleton alignment algorithm to address the discrepancies in pose size and position. Experiments demonstrate the promising motion editing ability of MotionEditor, both qualitatively and quantitatively. The paper proposes MotionEditor, a novel diffusion model designed for video motion editing, which transfers motion from a reference video to a source video while preserving the original protagonist's appearance and background. Existing diffusion-based video editing models primarily focus on texture editing and struggle to manipulate motion information effectively while preserving the original protagonist and background. MotionEditor incorporates a content-aware motion adapter into ControlNet for temporal motion correspondence and a two-branch architecture (reconstruction and editing branches) with a high-fidelity attention injection mechanism to facilitate branch interaction and preserve source appearance. It also employs a skeleton alignment algorithm to address pose discrepancies between source and reference. MotionEditor demonstrates superior performance in motion editing compared to previous video editing and human motion transfer methods, both qualitatively and quantitatively. The proposed content-aware motion adapter enhances motion control and temporal consistency, while the high-fidelity attention injection mechanism preserves background details and protagonist appearance. Ablation studies confirm the importance of core components, such as the motion adapter, attention injection, and skeleton alignment, for achieving high-quality motion editing. MotionEditor might fail in cases where foreground and background latents are confused, resulting in artifacts. Future work can explore explicit decoupling of foreground and background before denoising and develop a learnable mixture adapter for more natural blending. video motion editing, diffusion models, content-aware motion adapter, high-fidelity attention injection, skeleton alignment
2311.18829 Report MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation Yanhui Wang, Jianmin Bao, Wenming Weng, Ruoyu Feng, Dacheng Yin, Tao Yang, Jingxu Zhang, Qi Dai, Zhiyuan Zhao, Chunyu Wang, Kai Qiu, Yuhui Yuan, Chuanxin Tang, Xiaoyan Sun, Chong Luo, Baining Guo We present MicroCinema, a straightforward yet effective framework for high-quality and coherent text-to-video generation. Unlike existing approaches that align text prompts with video directly, MicroCinema introduces a Divide-and-Conquer strategy which divides the text-to-video into a two-stage process: text-to-image generation and image&text-to-video generation. This strategy offers two significant advantages. a) It allows us to take full advantage of the recent advances in text-to-image models, such as Stable Diffusion, Midjourney, and DALLE, to generate photorealistic and highly detailed images. b) Leveraging the generated image, the model can allocate less focus to fine-grained appearance details, prioritizing the efficient learning of motion dynamics. To implement this strategy effectively, we introduce two core designs. First, we propose the Appearance Injection Network, enhancing the preservation of the appearance of the given image. Second, we introduce the Appearance Noise Prior, a novel mechanism aimed at maintaining the capabilities of pre-trained 2D diffusion models. These design elements empower MicroCinema to generate high-quality videos with precise motion, guided by the provided text prompts. Extensive experiments demonstrate the superiority of the proposed framework. Concretely, MicroCinema achieves SOTA zero-shot FVD of 342.86 on UCF-101 and 377.40 on MSR-VTT. See https://wangyanhui666.github.io/MicroCinema.github.io/ for video samples. This paper proposes MicroCinema, a two-stage text-to-video generation framework that leverages the strengths of existing text-to-image models for enhanced quality and coherence. Current text-to-video generation models struggle with appearance and temporal coherence, especially when trained directly from text-video pairs. This framework addresses these limitations by separating appearance and motion modeling. MicroCinema generates a key frame from text using an off-the-shelf text-to-image model. This key frame, along with the text prompt, guides a novel image&text-to-video model, featuring an Appearance Injection Network and an Appearance Noise Prior, to generate coherent videos. MicroCinema achieves state-of-the-art zero-shot FVD of 342.86 on UCF101 and 377.40 on MSR-VTT using only the WebVid-10M dataset for training. The proposed Appearance Injection Network and Appearance Noise Prior significantly improve appearance preservation and motion modeling. The framework allows flexible integration of different text-to-image models and exhibits strong controllability through text prompts. The model's performance on small objects, particularly faces, is limited by the reconstruction capabilities of the VAE used. Future work includes exploring joint spatial-temporal super-resolution for further quality enhancements. text-to-video generation, diffusion models, appearance modeling, motion modeling, divide-and-conquer
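A small sketch of an appearance noise prior in the spirit of the description above: the key-frame latent is mixed into the per-frame starting noise so the video model begins from a distribution already biased toward the given appearance. The mixing weight `lam` is an illustrative assumption.

```python
import torch


def appearance_noise(keyframe_latent: torch.Tensor,
                     num_frames: int,
                     lam: float = 0.1) -> torch.Tensor:
    """Per-frame initial noise biased toward a key-frame latent of shape (C, H, W)."""
    noise = torch.randn(num_frames, *keyframe_latent.shape)
    return noise + lam * keyframe_latent.unsqueeze(0)
```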
2311.18827 Report Motion-Conditioned Image Animation for Video Editing Wilson Yan, Andrew Brown, Pieter Abbeel, Rohit Girdhar, Samaneh Azadi We introduce MoCA, a Motion-Conditioned Image Animation approach for video editing. It leverages a simple decomposition of the video editing problem into image editing followed by motion-conditioned image animation. Furthermore, given the lack of robust evaluation datasets for video editing, we introduce a new benchmark that measures edit capability across a wide variety of tasks, such as object replacement, background changes, style changes, and motion edits. We present a comprehensive human evaluation of the latest video editing methods along with MoCA, on our proposed benchmark. MoCA establishes a new state-of-the-art, demonstrating greater human preference win-rate, and outperforming notable recent approaches including Dreamix (63%), MasaCtrl (75%), and Tune-A-Video (72%), with especially significant improvements for motion edits. Introduces MoCA, a motion-conditioned image animation approach for text-driven video editing that outperforms existing methods, and a new dataset focused on motion editing for comprehensive benchmarking. Addresses the limitations of current video editing methods that struggle with motion editing or specialize in a narrow range of edits. Decomposes video editing into image editing and motion-conditioned image animation. Leverages pre-trained image editing models and a video generation model trained with text, first frame, and optical flow conditioning. MoCA outperforms state-of-the-art video editing models across various edit types based on human evaluation. Motion conditioning is crucial for preserving original video motion during spatial edits. Existing automatic metrics for video editing show low correlation with human judgment, especially for motion-based edits. Reliance on video extrapolation limits fidelity in preserving aspects introduced after the first frame. Need for better automatic evaluation metrics aligned with human perception for video editing. video editing, video generation, motion conditioning, text-driven editing, diffusion models
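A minimal sketch of the edit-then-animate decomposition summarized above; `image_editor`, `flow_estimator`, and `animator` are hypothetical stand-ins for the pre-trained components MoCA composes.

```python
from typing import Callable, List


def edit_video(
    source_frames: List,
    edit_prompt: str,
    image_editor: Callable,
    flow_estimator: Callable,
    animator: Callable,
):
    edited_first = image_editor(source_frames[0], edit_prompt)  # image-editing step
    flows = flow_estimator(source_frames)                       # motion of the source video
    # Motion-conditioned animation of the edited first frame.
    return animator(edited_first, edit_prompt, flows)
```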
2311.18823 Report Initializing Models with Larger Ones Zhiqiu Xu, Yanjie Chen, Kirill Vishniakov, Yida Yin, Zhiqiang Shen, Trevor Darrell, Lingjie Liu, Zhuang Liu Weight initialization plays an important role in neural network training. Widely used initialization methods are proposed and evaluated for networks that are trained from scratch. However, the growing number of pretrained models now offers new opportunities for tackling this classical problem of weight initialization. In this work, we introduce weight selection, a method for initializing smaller models by selecting a subset of weights from a pretrained larger model. This enables the transfer of knowledge from pretrained weights to smaller models. Our experiments demonstrate that weight selection can significantly enhance the performance of small models and reduce their training time. Notably, it can also be used together with knowledge distillation. Weight selection offers a new approach to leverage the power of pretrained models in resource-constrained settings, and we hope it can be a useful tool for training small models in the large-model era. Code is available at https://github.com/OscarXZQ/weight-selection. This paper introduces "weight selection," a method for initializing smaller neural networks by selecting a subset of weights from pretrained larger models within the same family. This approach enables the transfer of knowledge from pretrained models to smaller ones, which is particularly beneficial in resource-constrained settings where large models are impractical. Weight selection involves three steps: 1) selecting corresponding layers from the teacher model, 2) mapping components between student and teacher layers, and 3) selecting elements from the teacher's weight tensors to initialize the student model. Weight selection significantly improves test accuracy across various image classification datasets, especially for smaller datasets. Weight selection substantially reduces training time compared to random initialization, achieving the same performance with fewer epochs. Weight selection is compatible with knowledge distillation and can be combined to further enhance performance. The effectiveness of weight selection may be limited by the availability of pretrained models within the same family and of a suitable size. Future work can explore different strategies for selecting weights, potentially incorporating importance or relevance metrics. weight initialization, transfer learning, knowledge distillation, model compression, pretrained models
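A minimal sketch of weight selection, assuming the student is a uniformly narrower model from the same family so every name-matched parameter tensor can simply be sliced along each dimension; the paper's exact selection rule (e.g. first-k versus uniform spacing) may differ from the uniform spacing used here.

```python
import torch


def select_weights(teacher_w: torch.Tensor, student_shape: torch.Size) -> torch.Tensor:
    """Pick uniformly spaced slices of a larger weight tensor to fit a smaller one."""
    w = teacher_w
    for dim, target in enumerate(student_shape):
        idx = torch.linspace(0, w.shape[dim] - 1, steps=target).round().long()
        w = w.index_select(dim, idx)
    return w.clone()


def init_from_larger(student: torch.nn.Module, teacher: torch.nn.Module) -> None:
    """Initialize every name-matched parameter of the student from the teacher."""
    teacher_params = dict(teacher.named_parameters())
    for name, param in student.named_parameters():
        if name in teacher_params:
            param.data.copy_(select_weights(teacher_params[name].data, param.shape))
```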
2311.18822 Report ElasticDiffusion: Training-free Arbitrary Size Image Generation through Global-Local Content Separation Moayed Haji-Ali, Guha Balakrishnan, Vicente Ordonez Diffusion models have revolutionized image generation in recent years, yet they are still limited to a few sizes and aspect ratios. We propose ElasticDiffusion, a novel training-free decoding method that enables pretrained text-to-image diffusion models to generate images with various sizes. ElasticDiffusion attempts to decouple the generation trajectory of a pretrained model into local and global signals. The local signal controls low-level pixel information and can be estimated on local patches, while the global signal is used to maintain overall structural consistency and is estimated with a reference image. We test our method on CelebA-HQ (faces) and LAION-COCO (objects/indoor/outdoor scenes). Our experiments and qualitative results show superior image coherence quality across aspect ratios compared to MultiDiffusion and the standard decoding strategy of Stable Diffusion. Project page: https://elasticdiffusion.github.io/ Introduces ElasticDiffusion, a training-free decoding method enabling pretrained text-to-image diffusion models to generate images at arbitrary sizes. Addresses the limitation of existing diffusion models that are typically trained on a few image sizes and struggle to maintain quality at different resolutions or aspect ratios. Decouples the generation trajectory into local (pixel-level details) and global signals (structural consistency) by leveraging insights from classifier-free guidance. Local signals are estimated on patches, while global signals are derived from a reference image. Generates coherent images at various resolutions, outperforming baselines like Stable Diffusion and MultiDiffusion. Achieves comparable FID and CLIP scores to SDXL at 1024x1024 resolution using a smaller base model (Stable Diffusion 1.4). Effectively handles diverse aspect ratios, surpassing baselines in maintaining image coherence and alignment with input prompts. Potential for artifact generation due to inaccuracies in estimating global/local signals. Limited effectiveness in generating images at significantly extended sizes (beyond 4x the training resolution). diffusion models, image generation, arbitrary size, classifier-free guidance, resolution independence
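A rough sketch of the global/local decoupling idea, assuming a generic epsilon-prediction UNet: the unconditional score supplies local pixel detail at the target size (in the real method it is assembled from overlapping patches), while the classifier-free-guidance direction computed on a reference latent at the training resolution is upsampled to supply global structure. The callable, the single-call patch handling, and the combination rule are simplifications.

```python
import torch
import torch.nn.functional as F


def elastic_score(unet, x_large, x_ref, t, text_emb, null_emb, guidance: float = 7.5):
    """`unet(x, t, cond)` is assumed to return an epsilon-prediction tensor."""
    # Local signal: unconditional prediction on the large latent
    # (the real method assembles this from overlapping patches; one call here).
    eps_local = unet(x_large, t, null_emb)

    # Global signal: classifier-free-guidance direction at the training
    # resolution on a reference latent, upsampled to the target size.
    global_dir = unet(x_ref, t, text_emb) - unet(x_ref, t, null_emb)
    global_dir = F.interpolate(global_dir, size=x_large.shape[-2:], mode="bilinear")

    return eps_local + guidance * global_dir
```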
2311.18815 Report IMMA: Immunizing text-to-image Models against Malicious Adaptation Amber Yijia Zheng, Raymond A. Yeh Advancements in text-to-image models and fine-tuning methods have led to the increasing risk of malicious adaptation, i.e., fine-tuning to generate harmful unauthorized content. Recent works, e.g., Glaze or MIST, have developed data-poisoning techniques which protect the data against adaptation methods. In this work, we consider an alternative paradigm for protection. We propose to ``immunize'' the model by learning model parameters that are difficult for the adaptation methods when fine-tuning malicious content; in short IMMA. Empirical results show IMMA's effectiveness against malicious adaptations, including mimicking the artistic style and learning of inappropriate/unauthorized content, over three adaptation methods: LoRA, Textual-Inversion, and DreamBooth. The paper introduces IMMA, a novel model immunization technique designed to safeguard text-to-image models from malicious adaptation, preventing the generation of harmful or unauthorized content. The rise of open-source text-to-image models and fine-tuning methods necessitates protection against misuse, such as generating harmful content or infringing on artists' rights. Existing data poisoning methods place the burden on content creators. IMMA addresses this by immunizing the model itself. IMMA utilizes a bi-level optimization program. It learns a set of model parameters that lead to poor performance when adapted for malicious purposes, effectively acting as a poor model initialization for malicious adaptation. IMMA successfully inhibits re-learning of erased concepts, demonstrated by quantitative metrics and user studies. IMMA effectively prevents adaptation towards personalized/unique concepts while preserving the model's adaptability for benign concepts. IMMA exhibits resilience against JPEG compression, surpassing data poisoning methods like MIST in this aspect. Immunizing against certain target concepts may negatively impact the model's performance on other concepts. Future research could explore methods to mitigate the potential negative impact on other concepts during immunization. model immunization, text-to-image generation, malicious adaptation, diffusion models, ai safety
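A hedged sketch of the bi-level immunization loop: the inner step simulates an attacker fine-tuning a lightweight adapter on the content to be protected, and the outer step updates the base weights so that this simulated adaptation performs poorly. `diffusion_loss` is a hypothetical helper; IMMA's actual objective and optimizer details may differ.

```python
import torch


def immunize(model, adapter, protect_batches, diffusion_loss,
             inner_lr: float = 1e-4, outer_lr: float = 1e-5) -> None:
    outer_opt = torch.optim.Adam(model.parameters(), lr=outer_lr)
    for batch in protect_batches:
        # Inner step: simulate malicious adaptation of the adapter.
        inner_opt = torch.optim.SGD(adapter.parameters(), lr=inner_lr)
        inner_opt.zero_grad()
        diffusion_loss(model, adapter, batch).backward()
        inner_opt.step()

        # Outer step: update base weights so the adapted model fits the
        # protected content worse (maximize the attacker's loss).
        outer_opt.zero_grad()
        (-diffusion_loss(model, adapter, batch)).backward()
        outer_opt.step()
```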
2311.18775 Report CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation Zineng Tang, Ziyi Yang, Mahmoud Khademi, Yang Liu, Chenguang Zhu, Mohit Bansal We present CoDi-2, a versatile and interactive Multimodal Large Language Model (MLLM) that can follow complex multimodal interleaved instructions, conduct in-context learning (ICL), reason, chat, edit, etc., in an any-to-any input-output modality paradigm. By aligning modalities with language for both encoding and generation, CoDi-2 empowers Large Language Models (LLMs) to not only understand complex modality-interleaved instructions and in-context examples, but also autoregressively generate grounded and coherent multimodal outputs in the continuous feature space. To train CoDi-2, we build a large-scale generation dataset encompassing in-context multimodal instructions across text, vision, and audio. CoDi-2 demonstrates a wide range of zero-shot capabilities for multimodal generation, such as in-context learning, reasoning, and compositionality of any-to-any modality generation through multi-round interactive conversation. CoDi-2 surpasses previous domain-specific models on tasks such as subject-driven image generation, vision transformation, and audio editing. CoDi-2 signifies a substantial breakthrough in developing a comprehensive multimodal foundation model adept at interpreting in-context language-vision-audio interleaved instructions and producing multimodal outputs. This paper presents CoDi-2, a versatile Multimodal Large Language Model (MLLM) that excels in following complex interleaved multimodal instructions, performing in-context learning (ICL), reasoning, engaging in conversations, and editing content in an any-to-any input-output modality paradigm. Current multimodal generative models struggle with zero-shot fine-grained control, are limited to single-round user interactions, and often handle only one or two input modalities. CoDi-2 addresses these limitations by enabling sophisticated multimodal generation, multi-round interactions, and understanding modality-interleaved inputs. CoDi-2 leverages a Large Language Model (LLM) as its core, enhanced with multimodal encoders and decoders. This architecture enables it to process text, image, and audio inputs aligned in the language space, facilitating in-context learning and reasoning. The model is trained on a diverse dataset incorporating existing multimodal resources and novel text-only datasets adapted for multimodal in-context learning. CoDi-2 achieves competitive zero-shot performance in subject-driven image generation, demonstrating its ability to generalize to unseen tasks. The model excels in audio manipulation tasks, surpassing previous methods in adding, dropping, and replacing audio elements. CoDi-2 demonstrates strong capabilities in various in-context multimodal generation tasks, including style adaptation, image composition, and exemplar-based editing. The training datasets, while diverse, might not cover all potential real-world applications, such as visual concept learning, despite the model showing promising results in this area. The model's performance might be further improved by exploring techniques to enhance its ability to learn and apply visual concepts. multimodal generation, large language models, in-context learning, multimodal reasoning, multimodal interaction
2311.18765 Report MLLMs-Augmented Visual-Language Representation Learning Yanqing Liu, Kai Wang, Wenqi Shao, Ping Luo, Yu Qiao, Mike Zheng Shou, Kaipeng Zhang, Yang You Visual-language pre-training has achieved remarkable success in many multi-modal tasks, largely attributed to the availability of large-scale image-text datasets. In this work, we demonstrate that Multi-modal Large Language Models (MLLMs) can enhance visual-language representation learning by establishing richer image-text associations for image-text datasets. Our approach is simple, utilizing MLLMs to extend multiple diverse captions for each image. To prevent the bias introduced by MLLMs' hallucinations and monotonous language styles, we propose "text shearing" to maintain the quality and availability of extended captions. In image-text retrieval, without introducing additional training cost, our method consistently obtains 5.6 ~ 35.0 and 16.8 ~ 46.1 improvement on Recall@1 under the fine-tuning and zero-shot settings, respectively. Notably, we obtain zero-shot results that are comparable to fine-tuning on target datasets, which encourages more exploration of the versatile use of MLLMs. This paper proposes to leverage Multi-modal Large Language Models (MLLMs) to enhance visual-language representation learning. Large-scale image-text datasets are crucial for visual-language pre-training, but simply removing mismatched pairs reduces data size and negatively impacts performance. This method improves the quality and diversity of image-text pairs without reducing the dataset size. Multiple MLLMs are used to generate diverse captions for each image, then "text shearing" is applied to truncate captions to the average length of original captions. This process maintains caption quality and reduces MLLMs' hallucinations. The method consistently improves zero-shot and fine-tuned image-text retrieval performance on MSCOCO and Flickr30K datasets by significant margins. Zero-shot CLIP with the proposed method outperforms vanilla CLIP fine-tuned on target datasets for image-text retrieval. The method consistently improves performance on various downstream tasks, including image classification, visual question answering, visual reasoning, image captioning, and video-language tasks. Noise from unreliable MLLMs' outputs limits performance. Future work could explore using more powerful MLLMs and larger datasets. visual-language pre-training, multi-modal large language models, image-text retrieval, image captioning, data augmentation
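A minimal sketch of the text-shearing step described above, under a simple whitespace-token assumption: MLLM-extended captions are truncated to the average length of the original captions to curb hallucinated, overly long descriptions.

```python
from typing import List


def shear_captions(original: List[str], extended: List[str]) -> List[str]:
    """Truncate extended captions to the average word length of the originals."""
    avg_len = round(sum(len(c.split()) for c in original) / len(original))
    return [" ".join(c.split()[:avg_len]) for c in extended]


# Example: the originals average 4 words, so the extended caption is cut to 4 words.
# shear_captions(["a dog on grass"],
#                ["a small brown dog is running on green grass near a tree"])
# -> ["a small brown dog"]
```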
2311.18763 Report Continual Diffusion with STAMINA: STack-And-Mask INcremental Adapters James Seale Smith, Yen-Chang Hsu, Zsolt Kira, Yilin Shen, Hongxia Jin Recent work has demonstrated a remarkable ability to customize text-to-image diffusion models to multiple, fine-grained concepts in a sequential (i.e., continual) manner while only providing a few example images for each concept. This setting is known as continual diffusion. Here, we ask the question: Can we scale these methods to longer concept sequences without forgetting? Although prior work mitigates the forgetting of previously learned concepts, we show that its capacity to learn new tasks reaches saturation over longer sequences. We address this challenge by introducing a novel method, STack-And-Mask INcremental Adapters (STAMINA), which is composed of low-ranked attention-masked adapters and customized MLP tokens. STAMINA is designed to enhance the robust fine-tuning properties of LoRA for sequential concept learning via learnable hard-attention masks parameterized with low rank MLPs, enabling precise, scalable learning via sparse adaptation. Notably, all introduced trainable parameters can be folded back into the model after training, inducing no additional inference parameter costs. We show that STAMINA outperforms the prior SOTA for the setting of text-to-image continual customization on a 50-concept benchmark composed of landmarks and human faces, with no stored replay data. Additionally, we extended our method to the setting of continual learning for image classification, demonstrating that our gains also translate to state-of-the-art performance in this standard benchmark. This paper presents STAMINA (STack-And-Mask INcremental Adapters), a novel method to improve continual learning in text-to-image diffusion models by enhancing low-rank adaptations with attention masking and learnable MLP tokens. Continual diffusion, the ability to sequentially customize models with new concepts without forgetting previous ones, is crucial for personalized applications but existing methods struggle to scale to longer concept sequences. STAMINA combines low-rank adapters with learnable hard-attention masks (parameterized with low-rank MLPs and Gumbel softmax) and replaces custom token embeddings with learnable MLPs. All introduced parameters can be folded back into the model after training, inducing no additional inference cost. STAMINA outperforms the previous state-of-the-art (C-LoRA) in continual text-to-image customization on a 50-concept benchmark composed of landmarks and human faces. The method requires significantly fewer training steps compared to C-LoRA. STAMINA also achieves state-of-the-art performance when applied to continual learning for image classification on a standard 20-task benchmark. The generation of multiple concepts in a single image still has a high failure rate and requires improvement. Ethical considerations regarding the generation of personal images (e.g., faces) and potential bias in generated images need careful attention. continual learning, text-to-image generation, diffusion models, sparse adaptation, attention masking
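A hedged sketch of a hard-attention-masked low-rank update in the spirit of STAMINA: a LoRA-style delta is gated element-wise by a binary mask sampled with a straight-through Gumbel-sigmoid from low-rank logits. The shapes, rank, and mask parameterization (the paper uses low-rank MLPs) are illustrative.

```python
import torch
import torch.nn as nn


class MaskedLoRA(nn.Module):
    def __init__(self, d_out: int, d_in: int, rank: int = 4, tau: float = 1.0):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        # Low-rank logits defining an element-wise hard mask over the delta.
        self.mask_u = nn.Parameter(torch.randn(d_out, rank) * 0.01)
        self.mask_v = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.tau = tau

    def delta(self) -> torch.Tensor:
        logits = self.mask_u @ self.mask_v
        u = torch.rand_like(logits)
        noise = torch.log(u) - torch.log1p(-u)             # logistic noise
        soft = torch.sigmoid((logits + noise) / self.tau)  # relaxed binary mask
        hard = (soft > 0.5).float()
        mask = hard + soft - soft.detach()                 # straight-through estimator
        return mask * (self.B @ self.A)                    # sparse low-rank update
```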
2311.18729 Report Learning One-Shot 4D Head Avatar Synthesis using Synthetic Data Yu Deng, Duomin Wang, Xiaohang Ren, Xingyu Chen, Baoyuan Wang Existing one-shot 4D head synthesis methods usually learn from monocular videos with the aid of 3DMM reconstruction, yet the latter is evenly challenging which restricts them from reasonable 4D head synthesis. We present a method to learn one-shot 4D head synthesis via large-scale synthetic data. The key is to first learn a part-wise 4D generative model from monocular images via adversarial learning, to synthesize multi-view images of diverse identities and full motions as training data; then leverage a transformer-based animatable triplane reconstructor to learn 4D head reconstruction using the synthetic data. A novel learning strategy is enforced to enhance the generalizability to real images by disentangling the learning process of 3D reconstruction and reenactment. Experiments demonstrate our superiority over the prior art. This paper introduces a novel method for one-shot 4D head avatar synthesis from a single image, leveraging a large-scale synthetic dataset generated by a novel 4D generative head model. Existing methods depend on 3DMM reconstruction from monocular videos, which limits their performance due to the inherent challenges of 3DMM estimation. This new method aims to overcome these limitations and achieve higher-fidelity 4D head synthesis. The method consists of two main components: 1) GenHead: a 4D generative model trained on monocular images to synthesize multi-view head images with diverse identities, full motion control, and background separation. 2) A one-shot 4D head synthesis model trained on the synthetic data, employing a transformer-based encoder-decoder architecture and a disentangled learning strategy to enhance generalizability to real images. Achieves high-fidelity 4D head reconstruction with reasonable geometry and complete motion control from single images. Outperforms previous state-of-the-art methods in terms of visual quality, identity preservation, and pose accuracy, particularly under large pose variations. Demonstrates the efficacy of using synthetic data for learning complex tasks like one-shot 4D head synthesis and opens up new possibilities for scalable head avatar creation. Limitations include difficulty handling complex accessories and makeups, potential for texture flickering, and challenges with extreme profile views. Future work involves improving the handling of high-frequency details, addressing artifacts under specific expressions, and exploring ways to incorporate real data and 3D priors for enhanced realism. 4d head avatar synthesis, one-shot learning, synthetic data, generative adversarial networks, neural rendering
2311.18654 Report Detailed Human-Centric Text Description-Driven Large Scene Synthesis Gwanghyun Kim, Dong Un Kang, Hoigi Seo, Hayeon Kim, Se Young Chun Text-driven large scene image synthesis has made significant progress with diffusion models, but controlling it is challenging. While using additional spatial controls with corresponding texts has improved the controllability of large scene synthesis, it is still challenging to faithfully reflect detailed text descriptions without user-provided controls. Here, we propose DetText2Scene, a novel text-driven large-scale image synthesis with high faithfulness, controllability, and naturalness in a global context for the detailed human-centric text description. Our DetText2Scene consists of 1) hierarchical keypoint-box layout generation from the detailed description by leveraging large language model (LLM), 2) view-wise conditioned joint diffusion process to synthesize a large scene from the given detailed text with LLM-generated grounded keypoint-box layout and 3) pixel perturbation-based pyramidal interpolation to progressively refine the large scene for global coherence. Our DetText2Scene significantly outperforms prior arts in text-to-large scene synthesis qualitatively and quantitatively, demonstrating strong faithfulness with detailed descriptions, superior controllability, and excellent naturalness in a global context. Proposes DetText2Scene, a novel method for text-driven large-scale image synthesis that generates highly controllable and natural images faithfully reflecting detailed human-centric text descriptions. Existing text-to-large-scene generation methods struggle to faithfully and controllably generate images from detailed descriptions, often lacking global coherence. Leverages a hierarchical approach with three stages: 1) Generates a keypoint-box layout from text using a fine-tuned large language model (LLM). 2) Synthesizes a large scene using a view-wise conditioned joint diffusion process guided by the layout and text. 3) Improves global coherence through pixel perturbation-based pyramidal interpolation. DetText2Scene outperforms prior arts in text-to-large scene synthesis both qualitatively and quantitatively. It demonstrates strong faithfulness to detailed descriptions, superior controllability over the number and attributes of generated instances, and excellent naturalness in global context. User studies confirm significant preference for DetText2Scene over existing methods regarding faithfulness, controllability, and naturalness. The LLM's understanding of visual context, especially 3D information, can be limited. The quality of generated images is contingent on the capabilities of the underlying text-to-image diffusion model (Stable Diffusion 1.5 in this case). text-to-image synthesis, large scene generation, diffusion models, large language models, layout generation
2311.18651 Report LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, Tao Chen Recent advances in Large Multimodal Models (LMM) have made it possible for various applications in human-machine interactions. However, developing LMMs that can comprehend, reason, and plan in complex and diverse 3D environments remains a challenging topic, especially considering the demand for understanding permutation-invariant point cloud 3D representations of the 3D scene. Existing works seek help from multi-view images, and project 2D features to 3D space as 3D scene representations. This, however, leads to huge computational overhead and performance degradation. In this paper, we present LL3DA, a Large Language 3D Assistant that takes point cloud as direct input and responds to both textual instructions and visual prompts. This helps LMMs better comprehend human interactions and further helps to remove the ambiguities in cluttered 3D scenes. Experiments show that LL3DA achieves remarkable results, and surpasses various 3D vision-language models on both 3D Dense Captioning and 3D Question Answering. This paper introduces LL3DA, a Large Language 3D Assistant capable of understanding, reasoning, and planning in complex 3D environments by responding to textual instructions and visual prompts. Developing models that can comprehend and interact with 3D environments using natural language is crucial for advancements in fields like autonomous driving and embodied AI. LL3DA leverages a multi-modal transformer (Interactor3D) to process 3D scene data, textual instructions, and visual prompts, generating a fixed-length representation. This representation is then used as a prefix to a frozen pre-trained Large Language Model (LLM) for response generation. LL3DA achieves state-of-the-art results on 3D Dense Captioning benchmarks ScanRefer and Nr3D. It also outperforms previous methods in 3D Question Answering on the ScanQA dataset. The addition of visual prompts, like user clicks, significantly improves LL3DA's performance by removing ambiguities in complex scenes. The generalist performance of LL3DA on Nr3D is limited due to not differentiating between Nr3D and ScanRefer datasets during training. Future work could focus on collecting higher-quality and more diverse 3D vision and language annotations to enhance the model's reasoning and planning abilities. 3d vision and language, large language models, instruction following, visual prompts, point cloud understanding
2311.18610 Report DiffCAD: Weakly-Supervised Probabilistic CAD Model Retrieval and Alignment from an RGB Image Daoyi Gao, Dávid Rozenberszki, Stefan Leutenegger, Angela Dai Perceiving 3D structures from RGB images based on CAD model primitives can enable an effective, efficient 3D object-based representation of scenes. However, current approaches rely on supervision from expensive annotations of CAD models associated with real images, and encounter challenges due to the inherent ambiguities in the task -- both in depth-scale ambiguity in monocular perception, as well as inexact matches of CAD database models to real observations. We thus propose DiffCAD, the first weakly-supervised probabilistic approach to CAD retrieval and alignment from an RGB image. We formulate this as a conditional generative task, leveraging diffusion to learn implicit probabilistic models capturing the shape, pose, and scale of CAD objects in an image. This enables multi-hypothesis generation of different plausible CAD reconstructions, requiring only a few hypotheses to characterize ambiguities in depth/scale and inexact shape matches. Our approach is trained only on synthetic data, leveraging monocular depth and mask estimates to enable robust zero-shot adaptation to various real target domains. Despite being trained solely on synthetic data, our multi-hypothesis approach can even surpass the supervised state-of-the-art on the Scan2CAD dataset by 5.9% with 8 hypotheses. This paper introduces DiffCAD, the first weakly-supervised probabilistic approach for retrieving and aligning CAD models to a single RGB image, addressing inherent ambiguities in depth, scale, and shape matching. Current methods for CAD-based 3D scene understanding rely on expensive real-image annotations and struggle with depth-scale ambiguities and inexact CAD matches. DiffCAD overcomes these limitations by learning probabilistic distributions for plausible reconstructions. The method uses diffusion models to capture distributions of scene scale, object pose (via Normalized Object Coordinates), and object shape (latent codes). Trained solely on synthetic data with depth and mask estimates, it generalizes to real images through a multi-hypothesis sampling scheme. DiffCAD outperforms state-of-the-art supervised methods on ScanNet, achieving 5.9% higher accuracy with only 8 hypotheses. The learned probabilistic models effectively capture ambiguities, with performance increasing as more hypotheses are considered. The method generalizes to unseen real-world datasets like ARKit, demonstrating robustness despite synthetic training data. The reliance on CAD models limits the reconstruction of objects without close database matches, suggesting future work in CAD model deformation. The current approach doesn't explicitly model object relations, potentially hindering performance in highly structured scenes. Integrating scene context is a promising direction. 3d vision, cad model retrieval, diffusion models, weakly-supervised learning, single-view reconstruction
2311.18608 Report Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing Hyelin Nam, Gihyun Kwon, Geon Yeong Park, Jong Chul Ye With the remarkable advent of text-to-image diffusion models, image editing methods have become more diverse and continue to evolve. A promising recent approach in this realm is Delta Denoising Score (DDS) - an image editing technique based on Score Distillation Sampling (SDS) framework that leverages the rich generative prior of text-to-image diffusion models. However, relying solely on the difference between scoring functions is insufficient for preserving specific structural elements from the original image, a crucial aspect of image editing. To address this, here we present an embarrassingly simple yet very powerful modification of DDS, called Contrastive Denoising Score (CDS), for latent diffusion models (LDM). Inspired by the similarities and differences between DDS and the contrastive learning for unpaired image-to-image translation(CUT), we introduce a straightforward approach using CUT loss within the DDS framework. Rather than employing auxiliary networks as in the original CUT approach, we leverage the intermediate features of LDM, specifically those from the self-attention layers, which possesses rich spatial information. Our approach enables zero-shot image-to-image translation and neural radiance field (NeRF) editing, achieving structural correspondence between the input and output while maintaining content controllability. Qualitative results and comparisons demonstrates the effectiveness of our proposed method. Project page: https://hyelinnam.github.io/CDS/ This paper introduces Contrastive Denoising Score (CDS), a novel method for text-driven image editing that integrates Contrastive Unpaired Translation (CUT) loss into the Delta Denoising Score (DDS) framework. Existing image editing techniques based on diffusion models often struggle to balance semantic changes guided by text prompts with preserving the structural integrity of the source image. CDS addresses this limitation by enhancing DDS with structural consistency. CDS leverages intermediate latent representations from the self-attention layers of a pre-trained Latent Diffusion Model (LDM) to compute the CUT loss. This eliminates the need for training a separate encoder network, enabling zero-shot image editing. CDS successfully translates source images, achieving a better balance between content transformation aligned with the target text prompt and maintaining the structural details of the source image compared to existing methods. Quantitative evaluations including CLIP accuracy, DINO-ViT structure distance, and LPIPS distance demonstrate that CDS outperforms previous state-of-the-art methods. The method is applicable to various domains beyond image editing, including Neural Radiance Fields (NeRF), showcasing its versatility and potential for broader applications. Failure cases can arise from unfavorable random patch selections or when the source object has unconventional poses. Future work includes exploring techniques to mitigate these limitations and further enhance the robustness of CDS. image editing, text-guided synthesis, diffusion models, contrastive learning, score distillation
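A compact sketch of the CUT-style patch contrastive loss computed on self-attention features of the latent diffusion UNet, as described above; the patch sampling and temperature are illustrative, and features from the source and edited branches are assumed to be extracted at matching spatial locations.

```python
import torch
import torch.nn.functional as F


def patch_nce_loss(feat_src: torch.Tensor, feat_edit: torch.Tensor,
                   num_patches: int = 256, tau: float = 0.07) -> torch.Tensor:
    """feat_*: (N, C) flattened self-attention features at the same locations."""
    idx = torch.randperm(feat_src.shape[0])[:num_patches]
    q = F.normalize(feat_edit[idx], dim=-1)   # queries from the edited image
    k = F.normalize(feat_src[idx], dim=-1)    # keys from the source image
    logits = q @ k.t() / tau                  # same location = positive pair
    labels = torch.arange(q.shape[0], device=logits.device)
    return F.cross_entropy(logits, labels)
```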
2311.18561 Report Periodic Vibration Gaussian: Dynamic Urban Scene Reconstruction and Real-time Rendering Yurui Chen, Chun Gu, Junzhe Jiang, Xiatian Zhu, Li Zhang Modeling dynamic, large-scale urban scenes is challenging due to their highly intricate geometric structures and unconstrained dynamics in both space and time. Prior methods often employ high-level architectural priors, separating static and dynamic elements, resulting in suboptimal capture of their synergistic interactions. To address this challenge, we present a unified representation model, called Periodic Vibration Gaussian (PVG). PVG builds upon the efficient 3D Gaussian splatting technique, originally designed for static scene representation, by introducing periodic vibration-based temporal dynamics. This innovation enables PVG to elegantly and uniformly represent the characteristics of various objects and elements in dynamic urban scenes. To enhance temporally coherent and large scene representation learning with sparse training data, we introduce a novel temporal smoothing mechanism and a position-aware adaptive control strategy respectively. Extensive experiments on Waymo Open Dataset and KITTI benchmarks demonstrate that PVG surpasses state-of-the-art alternatives in both reconstruction and novel view synthesis for both dynamic and static scenes. Notably, PVG achieves this without relying on manually labeled object bounding boxes or expensive optical flow estimation. Moreover, PVG exhibits 900-fold acceleration in rendering over the best alternative. This paper proposes Periodic Vibration Gaussian (PVG), a novel unified representation model for dynamic urban scene reconstruction and real-time rendering. Modeling dynamic urban scenes is challenging due to their complex geometry and unconstrained dynamics. Existing methods struggle to capture synergistic interactions between static and dynamic elements or suffer from low efficiency. PVG extends 3D Gaussian Splatting by introducing periodic vibration for temporal dynamics. It also incorporates a temporal smoothing mechanism for coherence and a position-aware adaptive control strategy for large scenes. PVG outperforms state-of-the-art methods in novel view synthesis on Waymo Open Dataset and KITTI benchmark. It achieves superior efficiency, with up to 900-fold rendering speedup compared to competitors. PVG effectively captures both static and dynamic elements without manual annotations or pre-trained models. PVG's geometric representation accuracy is limited by its highly adaptable design. Future work includes improving geometric accuracy and further enhancing its ability to depict complex urban scenes. dynamic scene reconstruction, novel view synthesis, 3d gaussian splatting, periodic vibration gaussian, autonomous driving
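A small sketch of a periodic-vibration Gaussian state: each Gaussian stores a life peak, a vibration period, a motion amplitude, and a lifespan, so its center oscillates over time and its opacity decays away from the life peak. The exact parameterization used in the paper may differ.

```python
import torch


def pvg_state(mean, amplitude, life_peak, period, base_opacity, lifespan, t):
    """Time-dependent center and opacity of one periodic-vibration Gaussian."""
    phase = torch.as_tensor(2 * torch.pi * (t - life_peak) / period)
    center = mean + amplitude * torch.sin(phase)
    decay = torch.as_tensor(-0.5 * ((t - life_peak) / lifespan) ** 2)
    opacity = base_opacity * torch.exp(decay)
    return center, opacity
```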
2311.18512 Report Revisiting Proposal-based Object Detection Aritra Bhowmik, Martin R. Oswald, Pascal Mettes, Cees G. M. Snoek This paper revisits the pipeline for detecting objects in images with proposals. For any object detector, the obtained box proposals or queries need to be classified and regressed towards ground truth boxes. The common solution for the final predictions is to directly maximize the overlap between each proposal and the ground truth box, followed by a winner-takes-all ranking or non-maximum suppression. In this work, we propose a simple yet effective alternative. For proposal regression, we solve a simpler problem where we regress to the area of intersection between proposal and ground truth. In this way, each proposal only specifies which part contains the object, avoiding a blind inpainting problem where proposals need to be regressed beyond their visual scope. In turn, we replace the winner-takes-all strategy and obtain the final prediction by taking the union over the regressed intersections of a proposal group surrounding an object. Our revisited approach comes with minimal changes to the detection pipeline and can be plugged into any existing method. We show that our approach directly improves canonical object detection and instance segmentation architectures, highlighting the utility of intersection-based regression and grouping. This paper proposes a revisited object detection pipeline that decomposes the proposal-to-ground truth regression and proposal-candidate selection into intersection and union problems, leading to a more accurate and robust detection. The traditional object detection pipeline suffers from ill-posed regression targets and discards valuable information from multiple proposals. This paper addresses these issues to improve object localization accuracy. The authors introduce Intersection-based Regression, where proposals regress only to the intersection with the ground truth, and Intersection-based Grouping, where the union of regressed intersections from multiple proposals forms the final detection. The proposed method consistently outperforms baseline detectors like Faster R-CNN, Mask R-CNN, and YOLOv3 on COCO and PASCAL VOC datasets. The approach demonstrates significant improvements in handling high IoU thresholds, indicating more accurate localization. An oracle experiment shows that the method's performance scales with improved classification accuracy, highlighting its potential for future detectors. The method encounters limitations in handling crowded scenes, where merging multiple instances into a single proposal is possible. Future work involves developing advanced grouping strategies to address the challenges posed by crowded scenes. object detection, intersection-based regression, intersection-based grouping, proposal combination, deep learning
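A minimal sketch of the two pieces described above: the regression target becomes the intersection of a proposal with its ground-truth box, and the final prediction is the union (enclosing box) over a group of regressed intersections. Boxes are (x1, y1, x2, y2) tensors; matching proposals to ground truth is assumed to have been done upstream.

```python
import torch


def intersection_target(proposal: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Regression target: the part of the GT box that lies inside the proposal."""
    return torch.stack([
        torch.maximum(proposal[0], gt[0]), torch.maximum(proposal[1], gt[1]),
        torch.minimum(proposal[2], gt[2]), torch.minimum(proposal[3], gt[3]),
    ])


def union_of_intersections(regressed: torch.Tensor) -> torch.Tensor:
    """Final box: union over a group of regressed intersections, shape (K, 4)."""
    return torch.stack([
        regressed[:, 0].min(), regressed[:, 1].min(),
        regressed[:, 2].max(), regressed[:, 3].max(),
    ])
```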
2311.18482 Report Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding Jin-Chuan Shi, Miao Wang, Hao-Bin Duan, Shao-Hua Guan Open-vocabulary querying in 3D space is challenging but essential for scene understanding tasks such as object localization and segmentation. Language-embedded scene representations have made progress by incorporating language features into 3D spaces. However, their efficacy heavily depends on neural networks that are resource-intensive in training and rendering. Although recent 3D Gaussians offer efficient and high-quality novel view synthesis, directly embedding language features in them leads to prohibitive memory usage and decreased performance. In this work, we introduce Language Embedded 3D Gaussians, a novel scene representation for open-vocabulary query tasks. Instead of embedding high-dimensional raw semantic features on 3D Gaussians, we propose a dedicated quantization scheme that drastically alleviates the memory requirement, and a novel embedding procedure that achieves smoother yet high accuracy query, countering the multi-view feature inconsistencies and the high-frequency inductive bias in point-based representations. Our comprehensive experiments show that our representation achieves the best visual quality and language querying accuracy across current language-embedded representations, while maintaining real-time rendering frame rates on a single desktop GPU. This paper introduces Language Embedded 3D Gaussians, a novel scene representation framework for open-vocabulary query tasks in 3D scenes, achieving high precision and efficiency. Open-vocabulary querying in 3D space is crucial for scene understanding, enabling tasks like object localization and segmentation. Existing methods struggle to balance efficiency and accuracy in embedding language features into 3D representations. The method quantizes dense language features from CLIP and DINO into a compact feature space, significantly reducing memory requirements. These features are embedded into 3D Gaussians, and a novel mechanism utilizing learned uncertainty values smooths semantic features spatially to address visual inconsistencies across viewpoints. Achieves state-of-the-art visual quality in novel view synthesis surpassing NeRF-based and 3D Gaussian baselines. Demonstrates superior accuracy in open-vocabulary querying tasks compared to existing language-embedded 3D representations. Maintains real-time rendering frame rates on a single desktop GPU due to efficient quantization and compact representation. Detecting highly reflective or translucent objects remains challenging due to limitations in current visual-language models. Fine-grained object geometry at high resolutions using CLIP-derived semantics needs improvement. 3d scene understanding, open-vocabulary query, language embedding, 3d gaussians, novel view synthesis
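A hedged sketch of the quantization step: dense CLIP/DINO features are assigned to a small codebook so that each 3D Gaussian only stores a compact index (or a low-dimensional embedding) instead of the raw high-dimensional feature. The nearest-neighbour assignment and codebook size are illustrative.

```python
import torch


def quantize_features(features: torch.Tensor, codebook: torch.Tensor):
    """features: (N, D) raw semantic features; codebook: (K, D) learned entries."""
    dists = torch.cdist(features, codebook)   # (N, K) pairwise distances
    indices = dists.argmin(dim=1)             # compact id stored per Gaussian
    return indices, codebook[indices]         # ids plus dequantized features


# Example: 10k features of dimension 512 mapped to ids into a 128-entry codebook.
# ids, deq = quantize_features(torch.randn(10_000, 512), torch.randn(128, 512))
```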
2311.18448 Report HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video Zicong Fan, Maria Parelli, Maria Eleni Kadoglou, Muhammed Kocabas, Xu Chen, Michael J. Black, Otmar Hilliges Since humans interact with diverse objects every day, the holistic 3D capture of these interactions is important to understand and model human behaviour. However, most existing methods for hand-object reconstruction from RGB either assume pre-scanned object templates or heavily rely on limited 3D hand-object data, restricting their ability to scale and generalize to more unconstrained interaction settings. To this end, we introduce HOLD -- the first category-agnostic method that reconstructs an articulated hand and object jointly from a monocular interaction video. We develop a compositional articulated implicit model that can reconstruct disentangled 3D hand and object from 2D images. We also further incorporate hand-object constraints to improve hand-object poses and consequently the reconstruction quality. Our method does not rely on 3D hand-object annotations while outperforming fully-supervised baselines in both in-the-lab and challenging in-the-wild settings. Moreover, we qualitatively show its robustness in reconstructing from in-the-wild videos. Code: https://github.com/zc-alexfan/hold This paper introduces HOLD, a novel category-agnostic method for reconstructing articulated hand and object 3D surfaces jointly from a single interaction video. Understanding human behavior requires capturing 3D hand-object interactions, but existing methods are limited by pre-scanned templates or limited training data, hindering scalability and generalization. HOLD initializes hand and object poses using off-the-shelf estimators and structure-from-motion, respectively. It then uses a compositional implicit neural network trained with a multi-class segmentation loss, eikonal loss, sparsity loss, and an SDF loss for shape regularization. Hand-object interaction constraints further refine pose estimates, leading to more accurate reconstructions. HOLD significantly outperforms state-of-the-art methods in hand pose and object reconstruction accuracy, generalizing to unseen object categories. Jointly modeling hand and object improves object reconstruction compared to modeling objects in isolation. Pose refinement with interaction constraints significantly enhances the accuracy of object and hand poses, resulting in better object reconstructions. Reconstruction of thin or textureless objects is limited by detector-based SfM. Reliance on raw RGB data for supervision may hinder the reconstruction of less visible object regions. hand-object reconstruction, 3d reconstruction, monocular video, neural implicit representation, interaction constraints
2311.18435 Report Layered Rendering Diffusion Model for Zero-Shot Guided Image Synthesis Zipeng Qi, Guoxi Huang, Zebin Huang, Qin Guo, Jinwen Chen, Junyu Han, Jian Wang, Gang Zhang, Lufei Liu, Errui Ding, Jingdong Wang This paper introduces innovative solutions to enhance spatial controllability in diffusion models reliant on text queries. We present two key innovations: Vision Guidance and the Layered Rendering Diffusion (LRDiff) framework. Vision Guidance, a spatial layout condition, acts as a clue in the perturbed distribution, greatly narrowing down the search space, to focus on the image sampling process adhering to the spatial layout condition. The LRDiff framework constructs an image-rendering process with multiple layers, each of which applies the vision guidance to instructively estimate the denoising direction for a single object. Such a layered rendering strategy effectively prevents issues like unintended conceptual blending or mismatches, while allowing for more coherent and contextually accurate image synthesis. The proposed method provides a more efficient and accurate means of synthesising images that align with specific spatial and contextual requirements. We demonstrate through our experiments that our method provides better results than existing techniques both quantitatively and qualitatively. We apply our method to three practical applications: bounding box-to-image, semantic mask-to-image and image editing. This paper introduces LRDiff, a zero-shot diffusion-based framework for layout-guided image synthesis using vision guidance. Existing T2I models struggle with spatial controllability. Fine-tuning methods are costly, while attention-based methods suffer from blending and mismatch issues. LRDiff uses vision guidance, a spatial condition added to the input, to guide denoising. It employs a layered rendering approach, estimating object denoising separately before combining them with global context. LRDiff achieves superior spatial controllability compared to previous zero-shot methods, as demonstrated by higher AP and IoU scores. It effectively mitigates object blending, a common issue in attention-manipulation based approaches. The method is robust to various scene descriptions and allows for image editing by inserting or replacing objects. There's a trade-off between image fidelity and spatial alignment depending on the denoising period. Generating small objects with high fidelity remains a challenge. text-to-image synthesis, diffusion models, layout guidance, vision guidance, layered rendering
2311.18387 Report On Exact Inversion of DPM-Solvers Seongmin Hong, Kyeonghyun Lee, Suh Yoon Jeon, Hyewon Bae, Se Young Chun Diffusion probabilistic models (DPMs) are a key component in modern generative models. DPM-solvers have achieved reduced latency and enhanced quality significantly, but have posed challenges to find the exact inverse (i.e., finding the initial noise from the given image). Here we investigate the exact inversions for DPM-solvers and propose algorithms to perform them when samples are generated by the first-order as well as higher-order DPM-solvers. For each explicit denoising step in DPM-solvers, we formulated the inversions using implicit methods such as gradient descent or forward step method to ensure the robustness to large classifier-free guidance unlike the prior approach using fixed-point iteration. Experimental results demonstrated that our proposed exact inversion methods significantly reduced the error of both image and noise reconstructions, greatly enhanced the ability to distinguish invisible watermarks and well prevented unintended background changes consistently during image editing. Project page: https://smhongok.github.io/inv-dpm.html. This paper proposes exact inversion methods for finding the initial noise of images generated by various Diffusion Probabilistic Models (DPMs), including high-order DPM solvers. Exact inversion is crucial for applications like image editing, style transfer, model attacks, watermark detection, and image restoration, enabling broader applications with DPMs. The authors propose using the backward Euler method for exact inversion of DDIM (a first-order DPM-solver). For high-order DPM-solvers, they introduce backward Euler with approximate high-order terms. To ensure robustness with large classifier-free guidance, they employ gradient descent or the forward step method. The proposed methods significantly reduce reconstruction errors compared to the naïve DDIM inversion for both images and noise, in both pixel-space DPM and LDM. They enable accurate reconstruction of noise-space watermarks, even improving the detection and enabling watermark classification. The methods substantially improve background-preserving image editing without requiring the original latent vectors. The proposed method has a significantly larger computational time compared to naïve DDIM inversion. It assumes prior knowledge of the prompt used in LDMs, leaving joint estimation of prompt and initial noise for future work. diffusion probabilistic models, generative models, image inversion, image editing, watermark detection
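The backward-Euler idea behind these exact inversions can be sketched in a few lines: treat each DDIM step as an equation in the unknown earlier latent and solve it numerically rather than approximating it. The toy below (PyTorch, with a stand-in linear denoiser and schedule) solves the implicit equation by gradient descent on the residual, one of the solvers the paper mentions; it is a sketch of the principle, not the released algorithm.

```python
import torch

def ddim_step(x_t, t, t_prev, eps_model, alphas_cumprod):
    """One deterministic DDIM denoising step from timestep t to t_prev (< t)."""
    a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    eps = eps_model(x_t, t)
    x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
    return a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps

def invert_step(x_prev, t, t_prev, eps_model, alphas_cumprod, n_iters=200, lr=0.1):
    """Find x_t such that ddim_step(x_t) == x_prev by minimizing the residual,
    i.e. an implicit (backward-Euler-style) solve instead of the naive inversion."""
    x_t = x_prev.clone().requires_grad_(True)   # initialize at the known output
    opt = torch.optim.SGD([x_t], lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        residual = ddim_step(x_t, t, t_prev, eps_model, alphas_cumprod) - x_prev
        (residual ** 2).sum().backward()
        opt.step()
    return x_t.detach()

# Toy check with a linear stand-in denoiser (hypothetical, not a trained DPM).
eps_model = lambda x, t: 0.1 * x
alphas_cumprod = torch.linspace(0.9999, 0.05, 1000)
x_t_true = torch.randn(4)
x_prev = ddim_step(x_t_true, 500, 480, eps_model, alphas_cumprod)
x_t_rec = invert_step(x_prev, 500, 480, eps_model, alphas_cumprod)
print((x_t_rec - x_t_true).abs().max())   # near-zero reconstruction error
```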
2311.18297 Report TrustMark: Universal Watermarking for Arbitrary Resolution Images Tu Bui, Shruti Agarwal, John Collomosse Imperceptible digital watermarking is important in copyright protection, misinformation prevention, and responsible generative AI. We propose TrustMark - a GAN-based watermarking method with novel design in architecture and spatio-spectra losses to balance the trade-off between watermarked image quality with the watermark recovery accuracy. Our model is trained with robustness in mind, withstanding various in- and out-place perturbations on the encoded image. Additionally, we introduce TrustMark-RM - a watermark remover method useful for re-watermarking. Our methods achieve state-of-art performance on 3 benchmarks comprising arbitrary resolution images. TrustMark, a novel GAN-based watermarking method for arbitrary resolution images, balances imperceptibility and watermark recovery while achieving robustness against perturbations. Addresses challenges in misinformation prevention, copyright protection, and responsible generative AI by enabling robust and imperceptible embedding of identifiers (e.g., provenance data) within images. Combines a novel architecture with 1x1 convolutional post-processing layers, focal frequency loss, extensive noise simulation during training, a resolution scaling method, and a watermark removal network (TrustMark-RM) for re-watermarking. Achieves state-of-the-art imperceptibility and watermark recovery performance on three benchmarks (CLIC, DIV2K, MetFace). Demonstrates robustness against various noise sources, severity levels, and adversarial attacks. Enables effective watermark removal and re-watermarking while preserving image quality. Performance slightly degrades on highly cluttered images. Watermark removal effectiveness weakens with repeated re-watermarking due to accumulated noise. watermarking, deep learning, image processing, content provenance, robustness
2311.18288 Report CosAvatar: Consistent and Animatable Portrait Video Tuning with Text Prompt Haiyao Xiao, Chenglai Zhong, Xuan Gao, Yudong Guo, Juyong Zhang Recently, text-guided digital portrait editing has attracted more and more attentions. However, existing methods still struggle to maintain consistency across time, expression, and view or require specific data prerequisites. To solve these challenging problems, we propose CosAvatar, a high-quality and user-friendly framework for portrait tuning. With only monocular video and text instructions as input, we can produce animatable portraits with both temporal and 3D consistency. Different from methods that directly edit in the 2D domain, we employ a dynamic NeRF-based 3D portrait representation to model both the head and torso. We alternate between editing the video frames' dataset and updating the underlying 3D portrait until the edited frames reach 3D consistency. Additionally, we integrate the semantic portrait priors to enhance the edited results, allowing precise modifications in specified semantic areas. Extensive results demonstrate that our proposed method can not only accurately edit portrait styles or local attributes based on text instructions but also support expressive animation driven by a source video. CosAvatar, a novel text-driven portrait editing framework using monocular dynamic NeRF to enable global style and local attribute editing with strong temporal and 3D consistency, while supporting animation. Existing methods struggle to maintain consistency across time, expression, and view during portrait editing, especially for dynamic sequences, limiting flexibility and generalizability. The method reconstructs a dynamic NeRF-based 3D portrait, separately modeling head and torso motion. It then leverages a structure-preserving image-conditioned diffusion model (InstructPix2Pix) to iteratively edit rendered frames, alternating between dataset updates and NeRF refinement. Semantic segmentation priors guide local attribute edits. Generates high-fidelity stylized portrait videos with strong temporal and 3D consistency from text instructions. Allows precise control over local semantic region edits, like hair or clothing style. Enables animation of the edited portrait driven by expressions and poses from reference videos. Fine-detail editing is limited due to InstructPix2Pix's limitations in object isolation and spatial reasoning. Lacks explicit geometric proxies for body parts, potentially impacting accuracy in scenarios with significant body movement. portrait editing, text-driven editing, dynamic nerf, diffusion models, semantic segmentation
2311.18266 Report Prompt-Based Exemplar Super-Compression and Regeneration for Class-Incremental Learning Ruxiao Duan, Yaoyao Liu, Jieneng Chen, Adam Kortylewski, Alan Yuille Replay-based methods in class-incremental learning (CIL) have attained remarkable success, as replaying the exemplars of old classes can significantly mitigate catastrophic forgetting. Despite their effectiveness, the inherent memory restrictions of CIL result in saving a limited number of exemplars with poor diversity, leading to data imbalance and overfitting issues. In this paper, we introduce a novel exemplar super-compression and regeneration method, ESCORT, which substantially increases the quantity and enhances the diversity of exemplars. Rather than storing past images, we compress images into visual and textual prompts, e.g., edge maps and class tags, and save the prompts instead, reducing the memory usage of each exemplar to 1/24 of the original size. In subsequent learning phases, diverse high-resolution exemplars are generated from the prompts by a pre-trained diffusion model, e.g., ControlNet. To minimize the domain gap between generated exemplars and real images, we propose partial compression and diffusion-based data augmentation, allowing us to utilize an off-the-shelf diffusion model without fine-tuning it on the target dataset. Therefore, the same diffusion model can be downloaded whenever it is needed, incurring no memory consumption. Comprehensive experiments demonstrate that our method significantly improves model performance across multiple CIL benchmarks, e.g., 5.0 percentage points higher than the previous state-of-the-art on 10-phase Caltech-256 dataset. This paper introduces ESCORT, an exemplar super-compression and regeneration method using prompts, to enhance the quantity and diversity of exemplars in replay-based class-incremental learning (CIL). Replay-based CIL methods are limited by the small quantity and poor diversity of stored exemplars, leading to data imbalance and overfitting. ESCORT compresses past images into visual (edge maps) and textual (class tags) prompts for storage. Diverse, high-resolution exemplars are then regenerated using a pre-trained diffusion model (ControlNet) from these prompts during subsequent CIL phases. Partial compression and diffusion-based data augmentation mitigate the domain gap between generated and real images. ESCORT significantly improves model performance, achieving state-of-the-art results on three image classification benchmarks (Caltech-256, Food-101, Places-100). The method consistently improves accuracy across various memory budgets, demonstrating its effectiveness in memory-constrained scenarios. Ablation studies confirm the benefits of partial compression and diffusion-based data augmentation in improving exemplar quality and diversity. The current implementation relies on a pre-selected diffusion model (ControlNet). Adapting to other generative models might require further exploration. The study primarily focuses on image classification tasks. Exploring ESCORT's applicability to other CIL domains, like object detection or semantic segmentation, is a promising future direction. class-incremental learning, exemplar compression, diffusion models, data augmentation, catastrophic forgetting
2311.18257 Report Diffusion Models Without Attention Jing Nathan Yan, Jiatao Gu, Alexander M. Rush In recent advancements in high-fidelity image generation, Denoising Diffusion Probabilistic Models (DDPMs) have emerged as a key player. However, their application at high resolutions presents significant computational challenges. Current methods, such as patchifying, expedite processes in UNet and Transformer architectures but at the expense of representational capacity. Addressing this, we introduce the Diffusion State Space Model (DiffuSSM), an architecture that supplants attention mechanisms with a more scalable state space model backbone. This approach effectively handles higher resolutions without resorting to global compression, thus preserving detailed image representation throughout the diffusion process. Our focus on FLOP-efficient architectures in diffusion training marks a significant step forward. Comprehensive evaluations on both ImageNet and LSUN datasets at two resolutions demonstrate that DiffuSSMs are on par or even outperform existing diffusion models with attention modules in FID and Inception Score metrics while significantly reducing total FLOP usage. Introduces Diffusion State Space Model (DiffuSSM), an attention-free diffusion architecture that replaces attention with a more scalable state space model backbone for high-resolution image generation. Addresses computational challenges of existing diffusion models at high resolutions, particularly the quadratic complexity of attention mechanisms, without compromising representational capacity by avoiding patchification or multi-scale resolution compression. Employs a gated state space model (SSM) backbone to process finer-grained image representations without global compression and incorporates an hourglass architecture in MLP layers to enhance efficiency. Achieves comparable or superior FID and Inception Score results to existing diffusion models on ImageNet and LSUN datasets at various resolutions. Demonstrates significantly reduced total FLOP usage compared to attention-based models like DiT. Shows improved robustness in spatial reconstruction and visual quality by avoiding patchification in qualitative analysis. Focuses primarily on (un)conditional image generation, leaving text-to-image approaches for future exploration. Does not incorporate recent advancements like masked image training, which could potentially further improve performance. diffusion models, image generation, state space models, attention mechanism, high-resolution
2311.18248 Report mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model Anwen Hu, Yaya Shi, Haiyang Xu, Jiabo Ye, Qinghao Ye, Ming Yan, Chenliang Li, Qi Qian, Ji Zhang, Fei Huang Recently, the strong text creation ability of Large Language Models (LLMs) has given rise to many tools for assisting paper reading or even writing. However, the weak diagram analysis abilities of LLMs or Multimodal LLMs greatly limit their application scenarios, especially for scientific academic paper writing. In this work, towards a more versatile copilot for academic paper writing, we mainly focus on strengthening the multi-modal diagram analysis ability of Multimodal LLMs. By parsing Latex source files of high-quality papers, we carefully build a multi-modal diagram understanding dataset M-Paper. By aligning diagrams in the paper with related paragraphs, we construct professional diagram analysis samples for training and evaluation. M-Paper is the first dataset to support joint comprehension of multiple scientific diagrams, including figures and tables in the format of images or Latex codes. Besides, to better align the copilot with the user's intention, we introduce the 'outline' as the control signal, which could be directly given by the user or revised based on auto-generated ones. Comprehensive experiments with a state-of-the-art Multimodal LLM demonstrate that training on our dataset shows stronger scientific diagram understanding performance, including diagram captioning, diagram analysis, and outline recommendation. The dataset, code, and model are available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/PaperOwl. This paper introduces PaperOwl, a new model fine-tuned for scientific diagram analysis in academic papers. It leverages outlines as control signals to align generated analyses with user intent and preceding text. Existing LLMs and MLLMs struggle with the complex diagram analysis required for academic writing, limiting their use as writing copilots. The authors built M-Paper, a dataset with aligned diagrams, captions, paragraph analyses, and outlines extracted from high-quality computer science papers. They then fine-tuned a pre-trained MLLM on this dataset, incorporating techniques like image cropping and outline-guided generation. PaperOwl significantly outperforms state-of-the-art MLLMs on diagram captioning, analysis, and outline recommendation tasks. Using outlines as control signals improves analysis quality and aligns it with user intent. Incorporating preceding text as context further enhances analysis coherence and accuracy. The cropping module for high-resolution images poses challenges for balancing multimodal information in diagram analysis. The model may sometimes prioritize following the outline over providing detailed insights from diagrams. multimodal learning, diagram understanding, academic paper writing, large language models, computer vision
2311.18243 Report DKiS: Decay weight invertible image steganography with private key Hang Yang, Yitian Xu, Xuhua Liu Image steganography, defined as the practice of concealing information within another image, traditionally encounters security challenges when its methods become publicly known or are under attack. To address this, a novel private key-based image steganography technique has been introduced. This approach ensures the security of the hidden information, as access requires a corresponding private key, regardless of the public knowledge of the steganography method. Experimental evidence has been presented, demonstrating the effectiveness of our method and showcasing its real-world applicability. Furthermore, a critical challenge in the invertible image steganography process has been identified by us: the transfer of non-essential, or `garbage', information from the secret to the host pipeline. To tackle this issue, the decay weight has been introduced to control the information transfer, effectively filtering out irrelevant data and enhancing the performance of image steganography. The code for this technique is publicly accessible at https://github.com/yanghangAI/DKiS, and a practical demonstration can be found at http://yanghang.site/hidekey. This paper introduces DKiS, a novel private key-based image steganography technique, which ensures the security of hidden information even if the method is publicly known. Traditional image steganography methods face security risks when the method is known or attacked. DKiS addresses this by incorporating preset private keys, enhancing security. DKiS uses invertible neural networks with a private key integrated into the encoding process. A decay weight is introduced to control information transfer, optimizing performance. DKiS significantly outperforms previous private key-based methods in image hiding quality. The private key effectively prevents unauthorized embedding and extraction attacks, as shown by attack simulation. DKiS demonstrates consistent performance across diverse datasets including COCO, ImageNet, and PubLayNet. The current implementation of DKiS is limited to hiding images within images. Exploring other data types for hiding is a potential future direction. While DKiS has shown robustness, investigating its resilience against a wider range of steganalysis attacks is crucial for future work. image steganography, deep learning, private key, security, invertible neural network
2311.18208 Report SMaRt: Improving GANs with Score Matching Regularity Mengfei Xia, Yujun Shen, Ceyuan Yang, Ran Yi, Wenping Wang, Yong-jin Liu Generative adversarial networks (GANs) usually struggle in learning from highly diverse data, whose underlying manifold is complex. In this work, we revisit the mathematical foundations of GANs, and theoretically reveal that the native adversarial loss for GAN training is insufficient to fix the problem of subsets with positive Lebesgue measure of the generated data manifold lying out of the real data manifold. Instead, we find that score matching serves as a promising solution to this issue thanks to its capability of persistently pushing the generated data points towards the real data manifold. We thereby propose to improve the optimization of GANs with score matching regularity (SMaRt). Regarding the empirical evidences, we first design a toy example to show that training GANs by the aid of a ground-truth score function can help reproduce the real data distribution more accurately, and then confirm that our approach can consistently boost the synthesis performance of various state-of-the-art GANs on real-world datasets with pre-trained diffusion models acting as the approximate score function. For instance, when training Aurora on the ImageNet 64x64 dataset, we manage to improve FID from 8.87 to 7.11, on par with the performance of one-step consistency model. The source code will be made public. This paper proposes SMaRt, a novel method using score matching as a regularity term during GAN training to address the gradient vanishing issue and enhance image generation quality. Gradient vanishing in GANs, often caused by generated data manifolds not fully aligning with real data manifolds, limits GAN performance and downstream applications. SMaRt aims to solve this by guiding generated samples towards the real data manifold. SMaRt leverages pre-trained diffusion models and incorporates a score matching loss term into the GAN objective function. This loss encourages the generator to produce samples that are more consistent with the real data distribution. SMaRt consistently improves the performance of various state-of-the-art GANs on diverse datasets, including CIFAR10, LSUN Bedroom, and ImageNet. SMaRt effectively addresses the gradient vanishing issue by persistently providing gradients to push generated samples towards the real data manifold. The paper demonstrates the effectiveness of SMaRt on a toy example with discrete data distribution, showcasing its ability to handle discreteness better than previous methods. The optimal choice of hyperparameters, such as the timestep interval and loss weight for score matching, is currently determined empirically and requires further investigation. Despite employing a lazy strategy, the additional score matching computation in SMaRt slightly increases training time compared to baseline GANs. generative adversarial networks (gans), score matching, diffusion models, gradient vanishing, image generation
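A natural way to picture the regularity term in this entry is as a denoising-score-matching loss, evaluated on generator outputs with a frozen pretrained diffusion model and added to the usual adversarial objective. The sketch below is one plausible instantiation under that reading (the paper's exact formulation, weighting, timestep range, and lazy-update schedule may differ); eps_model stands for the frozen noise predictor, and the commented usage names are hypothetical.

```python
import torch

def score_matching_regularizer(fake_images, eps_model, alphas_cumprod,
                               t_min=200, t_max=800):
    """Denoising score matching on generated samples, using a frozen pretrained
    diffusion model. Gradients flow back to the generator through fake_images,
    persistently pushing them toward the real-data manifold."""
    b = fake_images.shape[0]
    t = torch.randint(t_min, t_max, (b,), device=fake_images.device)
    a_t = alphas_cumprod[t].view(b, 1, 1, 1)
    noise = torch.randn_like(fake_images)
    noisy = a_t.sqrt() * fake_images + (1 - a_t).sqrt() * noise
    eps_pred = eps_model(noisy, t)        # frozen weights; the input stays differentiable
    return ((eps_pred - noise) ** 2).mean()

# Inside the generator update (hypothetical names for the GAN pieces):
#   fake = generator(z)
#   g_loss = adversarial_loss(discriminator(fake)) \
#            + lam * score_matching_regularizer(fake, eps_model, alphas_cumprod)
#   g_loss.backward()
```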
2311.18159 Report Compact3D: Compressing Gaussian Splat Radiance Field Models with Vector Quantization KL Navaneet, Kossar Pourahmadi Meibodi, Soroush Abbasi Koohpayegani, Hamed Pirsiavash 3D Gaussian Splatting is a new method for modeling and rendering 3D radiance fields that achieves much faster learning and rendering time compared to SOTA NeRF methods. However, it comes with a drawback in the much larger storage demand compared to NeRF methods since it needs to store the parameters for several 3D Gaussians. We notice that many Gaussians may share similar parameters, so we introduce a simple vector quantization method based on the k-means algorithm to quantize the Gaussian parameters. Then, we store the small codebook along with the index of the code for each Gaussian. Moreover, we compress the indices further by sorting them and using a method similar to run-length encoding. We do extensive experiments on standard benchmarks as well as a new benchmark which is an order of magnitude larger than the standard benchmarks. We show that our simple yet effective method can reduce the storage cost for the original 3D Gaussian Splatting method by a factor of almost 20x with a very small drop in the quality of rendered images. This paper introduces a vector quantization method based on the k-means algorithm to compress 3D Gaussian Splatting (3DGS) models for efficient storage and rendering of 3D radiance fields. 3DGS offers fast learning and rendering compared to NeRF methods, but requires significantly more storage. This work addresses this limitation, making 3DGS practical for applications with storage constraints, such as edge devices. The method quantizes the parameters of 3D Gaussians by grouping similar parameters and clustering them independently using k-means. It stores a codebook and indices for each Gaussian, enabling significant storage reduction. Further compression is achieved by sorting Gaussians and employing run-length encoding. The compressed model (CompGS) reduces storage size by a factor of nearly 20x compared to the original 3DGS while maintaining comparable quality to state-of-the-art NeRF approaches. CompGS preserves the real-time rendering capabilities of 3DGS. Experiments on a newly introduced large-scale benchmark (ARKit) demonstrate the effectiveness of CompGS in compressing large indoor scenes for potential VR applications. The compression method introduces computational overhead during training due to the k-means clustering. Future work could explore faster k-means implementations or alternative quantization techniques to further reduce training time. 3d gaussian splatting, radiance fields, vector quantization, model compression, novel view synthesis
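The storage trick is easy to prototype: cluster each parameter group with k-means, keep a small codebook plus one index per Gaussian, and run-length encode the sorted indices. A toy NumPy/scikit-learn sketch (random stand-in parameters, small codebook sizes; not the authors' code) is shown below.

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_params(params, k=256):
    """Vector-quantize one group of per-Gaussian parameters (e.g. all covariance
    vectors) with k-means; store a small codebook plus one integer index per Gaussian."""
    km = KMeans(n_clusters=k, n_init=1, max_iter=50).fit(params)
    return km.cluster_centers_.astype(np.float16), km.labels_.astype(np.int32)

def run_length_encode(indices):
    """Compress a (sorted) index array as (value, run length) pairs."""
    change = np.flatnonzero(np.diff(indices)) + 1
    starts = np.concatenate(([0], change))
    lengths = np.diff(np.concatenate((starts, [len(indices)])))
    return list(zip(indices[starts].tolist(), lengths.tolist()))

# Toy example: 20k "Gaussians" with 10-D parameter vectors.
params = np.random.randn(20_000, 10).astype(np.float32)
codebook, idx = quantize_params(params, k=256)
order = np.argsort(idx)                 # sort Gaussians so equal codes are adjacent
rle = run_length_encode(idx[order])
reconstructed = codebook[idx]           # decode: look up each Gaussian's code
print(codebook.shape, len(rle))
```

In practice each parameter group (covariance, color, etc.) would get its own codebook, and only the codebooks, the RLE-compressed indices, and any unquantized parameters need to be written to disk.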
2311.18068 Report ALSTER: A Local Spatio-Temporal Expert for Online 3D Semantic Reconstruction Silvan Weder, Francis Engelmann, Johannes L. Schönberger, Akihito Seki, Marc Pollefeys, Martin R. Oswald We propose an online 3D semantic segmentation method that incrementally reconstructs a 3D semantic map from a stream of RGB-D frames. Unlike offline methods, ours is directly applicable to scenarios with real-time constraints, such as robotics or mixed reality. To overcome the inherent challenges of online methods, we make two main contributions. First, to effectively extract information from the input RGB-D video stream, we jointly estimate geometry and semantic labels per frame in 3D. A key focus of our approach is to reason about semantic entities both in the 2D input and the local 3D domain to leverage differences in spatial context and network architectures. Our method predicts 2D features using an off-the-shelf segmentation network. The extracted 2D features are refined by a lightweight 3D network to enable reasoning about the local 3D structure. Second, to efficiently deal with an infinite stream of input RGB-D frames, a subsequent network serves as a temporal expert predicting the incremental scene updates by leveraging 2D, 3D, and past information in a learned manner. These updates are then integrated into a global scene representation. Using these main contributions, our method can enable scenarios with real-time constraints and can scale to arbitrary scene sizes by processing and updating the scene only in a local region defined by the new measurement. Our experiments demonstrate improved results compared to existing online methods that purely operate in local regions and show that complementary sources of information can boost the performance. We provide a thorough ablation study on the benefits of different architectural as well as algorithmic design decisions. Our method yields competitive results on the popular ScanNet benchmark and SceneNN dataset. This paper proposes an online 3D semantic segmentation method that incrementally reconstructs a semantically enriched 3D map from a stream of RGB-D frames. Online 3D semantic reconstruction is crucial for real-time applications like robotics and mixed reality, where agents need to interact with their environment without prior knowledge. The method uses a three-stage pipeline: a 2D encoder extracts features from RGB-D images, a 3D encoder incorporates geometric information, and a novel spatio-temporal expert network fuses 2D, 3D, and past scene information to update a learned 3D scene representation. The method achieves state-of-the-art results among online local reconstruction methods on the ScanNet and SceneNN benchmarks. The proposed temporal expert network effectively combines 2D and 3D information, leading to improved semantic segmentation compared to using either source alone. The approach is memory and computationally efficient, making it suitable for real-time applications on devices with limited resources. The 2D encoder is identified as the main bottleneck for real-time performance. Future work could explore alternative 2D encoders or optimization strategies to improve runtime speed further. 3d semantic segmentation, online reconstruction, rgb-d vision, spatio-temporal expert network, learned scene representation
2311.17977 Report GaussianShader: 3D Gaussian Splatting with Shading Functions for Reflective Surfaces Yingwenqi Jiang, Jiadong Tu, Yuan Liu, Xifeng Gao, Xiaoxiao Long, Wenping Wang, Yuexin Ma The advent of neural 3D Gaussians has recently brought about a revolution in the field of neural rendering, facilitating the generation of high-quality renderings at real-time speeds. However, the explicit and discrete representation encounters challenges when applied to scenes featuring reflective surfaces. In this paper, we present GaussianShader, a novel method that applies a simplified shading function on 3D Gaussians to enhance the neural rendering in scenes with reflective surfaces while preserving the training and rendering efficiency. The main challenge in applying the shading function lies in the accurate normal estimation on discrete 3D Gaussians. Specifically, we proposed a novel normal estimation framework based on the shortest axis directions of 3D Gaussians with a delicately designed loss to make the consistency between the normals and the geometries of Gaussian spheres. Experiments show that GaussianShader strikes a commendable balance between efficiency and visual quality. Our method surpasses Gaussian Splatting in PSNR on specular object datasets, exhibiting an improvement of 1.57dB. When compared to prior works handling reflective surfaces, such as Ref-NeRF, our optimization time is significantly accelerated (23h vs. 0.58h). Please click on our project website to see more results. GaussianShader enhances neural rendering in scenes with reflective surfaces using a simplified shading function on 3D Gaussians, improving visual quality while preserving training and rendering efficiency. Existing neural rendering methods struggle with reflective surfaces: NeRF methods are computationally expensive, and while 3D Gaussian Splatting is efficient, it lacks explicit appearance modeling, hindering realism in scenes with reflective surfaces. The method incorporates a simplified shading function considering diffuse colors and direct reflections, with a residual term for complex reflections. It introduces a novel normal estimation framework based on the shortest axis direction of 3D Gaussians and a normal-geometry consistency loss to ensure accurate normal estimation on discrete Gaussian spheres. GaussianShader surpasses Gaussian Splatting in PSNR on specular object datasets by 1.57dB. It achieves comparable visual quality to methods like Ref-NeRF and ENVIDR while significantly reducing optimization time (0.58h vs. 23h for Ref-NeRF). The method maintains real-time rendering capabilities, making it suitable for interactive applications. The method's performance on highly complex lighting scenarios with intricate indirect illumination might be limited. Future work could explore incorporating more sophisticated BRDF models for increased realism. neural rendering, 3d gaussian splatting, reflective surfaces, normal estimation, shading function
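The key geometric step, reading a normal off each discrete Gaussian, can be sketched directly: take the shortest of the three scaled principal axes, orient it toward the camera, and feed it into a simple diffuse-plus-reflection shading term. The NumPy toy below illustrates that idea; the environment lighting, tint value, and the omitted learned residual term are stand-ins, not the paper's exact shading function.

```python
import numpy as np

def quat_to_rotmat(q):
    """Rotation matrix from a unit quaternion (w, x, y, z)."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def gaussian_normal(quat, scales, view_dir):
    """Estimate a Gaussian's normal as its shortest axis, oriented toward the camera."""
    R = quat_to_rotmat(quat)
    n = R[:, np.argmin(scales)]          # each column is one principal axis in world space
    if np.dot(n, view_dir) > 0:          # flip so the normal faces the viewer
        n = -n
    return n

def shade(diffuse, tint, normal, view_dir, env_light):
    """Simplified shading: diffuse color plus a tinted reflection of the environment."""
    refl_dir = view_dir - 2 * np.dot(view_dir, normal) * normal
    return diffuse + tint * env_light(refl_dir)

# Tiny usage example with a constant environment light (hypothetical values).
quat, scales = np.array([1.0, 0.0, 0.0, 0.0]), np.array([0.5, 0.4, 0.01])
view = np.array([0.0, 0.0, -1.0])
n = gaussian_normal(quat, scales, view)
color = shade(np.array([0.2, 0.2, 0.2]), 0.5, n, view, lambda d: np.ones(3))
print(n, color)
```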
2311.17971 Report GeoDream: Disentangling 2D and Geometric Priors for High-Fidelity and Consistent 3D Generation Baorui Ma, Haoge Deng, Junsheng Zhou, Yu-Shen Liu, Tiejun Huang, Xinlong Wang Text-to-3D generation by distilling pretrained large-scale text-to-image diffusion models has shown great promise but still suffers from inconsistent 3D geometric structures (Janus problems) and severe artifacts. The aforementioned problems mainly stem from 2D diffusion models lacking 3D awareness during the lifting. In this work, we present GeoDream, a novel method that incorporates explicit generalized 3D priors with 2D diffusion priors to enhance the capability of obtaining unambiguous 3D consistent geometric structures without sacrificing diversity or fidelity. Specifically, we first utilize a multi-view diffusion model to generate posed images and then construct cost volume from the predicted image, which serves as native 3D geometric priors, ensuring spatial consistency in 3D space. Subsequently, we further propose to harness 3D geometric priors to unlock the great potential of 3D awareness in 2D diffusion priors via a disentangled design. Notably, disentangling 2D and 3D priors allows us to refine 3D geometric priors further. We justify that the refined 3D geometric priors aid in the 3D-aware capability of 2D diffusion priors, which in turn provides superior guidance for the refinement of 3D geometric priors. Our numerical and visual comparisons demonstrate that GeoDream generates more 3D consistent textured meshes with high-resolution realistic renderings (i.e., 1024x1024) and adheres more closely to semantic coherence. This paper presents GeoDream, a novel text-to-3D generation method that incorporates explicit 3D priors with 2D diffusion priors to improve 3D consistency and reduce artifacts, especially for asymmetric structures (Janus problems). Current text-to-3D generation methods based on large-scale text-to-image diffusion models struggle to generate 3D consistent structures, particularly for asymmetric shapes, due to the lack of 3D awareness in 2D diffusion models. GeoDream utilizes a multi-view diffusion model to generate posed images and constructs a cost volume representing 3D geometric priors. This volume, combined with a disentangled design for incorporating 2D diffusion priors, refines the 3D geometry and texture, resulting in high-fidelity textured meshes. GeoDream generates 3D consistent textured meshes with higher resolution (1024x1024) and realism than previous methods. The method demonstrates superior semantic coherence, as measured by a newly proposed Uni3D_score metric. GeoDream adapts to various multi-view diffusion models and benefits from a critical viewpoint sampling strategy for robust cost volume construction. The training process of GeoDream is relatively time-consuming, despite optimizations. Future work includes exploring larger batch sizes and multi-GPU training for faster generation. text-to-3d generation, diffusion models, 3d priors, janus problem, cost volume
2311.17963 Report M²Chat: Empowering VLM for Multimodal LLM Interleaved Text-Image Generation Xiaowei Chi, Rongyu Zhang, Zhengkai Jiang, Yijiang Liu, Yatian Wang, Xingqun Qi, Wenhan Luo, Peng Gao, Shanghang Zhang, Qifeng Liu, Yike Guo While current LLM chatbots like GPT-4V bridge the gap between human instructions and visual representations to enable text-image generations, they still lack efficient alignment methods for high-fidelity performance on multiple downstream tasks. In this paper, we propose M²Chat, a novel unified multimodal LLM framework for generating interleaved text-image conversation across various scenarios. Specifically, we propose an M³Adapter that efficiently integrates granular low-level visual information and high-level semantic features from multi-modality prompts. Upon the well-aligned fused feature, M³Adapter tailors a learnable gating strategy to balance the model creativity and consistency across various tasks adaptively. Moreover, to further enhance the effectiveness of M³Adapter while preserving the coherence of semantic context comprehension, we introduce a two-stage M³FT fine-tuning strategy. This strategy optimizes disjoint groups of parameters for image-text alignment and visual-instruction respectively. Extensive experiments demonstrate our M²Chat surpasses state-of-the-art counterparts across diverse benchmarks, showcasing its prowess in interleaving generation, storytelling, and multimodal dialogue systems. The demo and code are available at https://mattie-e.github.io/M2Chat.github.io. This paper proposes M²Chat, a novel unified multimodal large language model (LLM) framework that generates interleaved text-image conversations across various scenarios using an M³Adapter. Current LLM chatbots lack efficient alignment methods for high-fidelity performance on multiple downstream tasks requiring both creative and consistent text-image generation. M²Chat integrates Stable Diffusion XL with LLaMA-AdapterV2. It employs an M³Adapter to integrate visual and semantic features from multimodal prompts and a two-stage M³FT fine-tuning strategy to optimize image-text alignment and visual instruction. M²Chat outperforms state-of-the-art models in interleaved generation tasks, showcasing superior quality and semantic consistency. The M³Adapter effectively aligns visual and semantic features, enhancing text-image congruence and image fidelity. The two-stage M³FT strategy significantly improves generative quality by optimizing for alignment and instruction following. The model's performance relies heavily on the quality and diversity of the training data. Future work can explore incorporating more modalities, such as audio, to create richer and more engaging conversational experiences. multimodal generation, large language models, text-to-image synthesis, interleaved generation, multimodal dialogue
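The "learnable gating strategy" in this entry can be pictured as a sigmoid gate that blends the two feature streams before they condition the image decoder. The PyTorch sketch below is a minimal stand-in under that assumption; the dimensions, projections, and names are illustrative and not the paper's architecture.

```python
import torch
import torch.nn as nn

class GatedFeatureFusion(nn.Module):
    """Minimal sketch of a learnable gate that blends fine-grained visual features
    with high-level semantic features before conditioning an image decoder."""
    def __init__(self, dim):
        super().__init__()
        self.proj_visual = nn.Linear(dim, dim)
        self.proj_semantic = nn.Linear(dim, dim)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, visual, semantic):
        v, s = self.proj_visual(visual), self.proj_semantic(semantic)
        g = self.gate(torch.cat([v, s], dim=-1))   # per-feature blend weight
        return g * v + (1 - g) * s                 # creativity vs. consistency trade-off

fusion = GatedFeatureFusion(dim=768)
visual = torch.randn(2, 77, 768)     # e.g. low-level tokens from a vision encoder
semantic = torch.randn(2, 77, 768)   # e.g. LLM hidden states projected to the same space
cond = fusion(visual, semantic)
print(cond.shape)                    # torch.Size([2, 77, 768])
```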
2311.17957 Report HandRefiner: Refining Malformed Hands in Generated Images by Diffusion-based Conditional Inpainting Wenquan Lu, Yufei Xu, Jing Zhang, Chaoyue Wang, Dacheng Tao Diffusion models have achieved remarkable success in generating realistic images but suffer from generating accurate human hands, such as incorrect finger counts or irregular shapes. This difficulty arises from the complex task of learning the physical structure and pose of hands from training images, which involves extensive deformations and occlusions. For correct hand generation, our paper introduces a lightweight post-processing solution called HandRefiner. HandRefiner employs a conditional inpainting approach to rectify malformed hands while leaving other parts of the image untouched. We leverage the hand mesh reconstruction model that consistently adheres to the correct number of fingers and hand shape, while also being capable of fitting the desired hand pose in the generated image. Given a generated failed image due to malformed hands, we utilize ControlNet modules to re-inject such correct hand information. Additionally, we uncover a phase transition phenomenon within ControlNet as we vary the control strength. It enables us to take advantage of more readily available synthetic data without suffering from the domain gap between realistic and synthetic hands. Experiments demonstrate that HandRefiner can significantly improve the generation quality quantitatively and qualitatively. The code is available at https://github.com/wenquanlu/HandRefiner. This paper introduces HandRefiner, a lightweight post-processing solution to rectify malformed hands in images generated by diffusion models, without needing to retrain the models. Diffusion models struggle to generate realistic human hands due to their complex structure and occlusion variations, leading to incorrect finger counts and irregular shapes. HandRefiner uses a hand mesh reconstruction model to estimate hand depth maps from generated images. These maps are then used as guidance within a ControlNet-based inpainting pipeline to reconstruct hand regions. HandRefiner significantly improves hand generation quality, as evidenced by improved FID/KID scores and user study results. The paper identifies a phase transition phenomenon within ControlNet when varying control strength, allowing for effective use of synthetic data during training. Fine-tuning with synthetic data, incorporating negative prompts, and using an inpainting loss all contribute to improved performance. HandRefiner currently faces challenges in generating interacting hands due to mesh reconstruction difficulties and training data limitations. Future work could explore expanding HandRefiner's compatibility with larger diffusion models and addressing limitations related to generating small hands and interacting hands. diffusion models, hand generation, image inpainting, controlnet, synthetic data
2311.17953 Report Rethinking Image Editing Detection in the Era of Generative AI Revolution Zhihao Sun, Haipeng Fang, Xinying Zhao, Danding Wang, Juan Cao The accelerated advancement of generative AI significantly enhances the viability and effectiveness of generative regional editing methods. This evolution renders image manipulation more accessible, thereby intensifying the risk of altering the conveyed information within original images and even propagating misinformation. Consequently, there exists a critical demand for robust methods capable of detecting edited images. However, the lack of a comprehensive dataset containing images edited with abundant and advanced generative regional editing methods poses a substantial obstacle to the advancement of corresponding detection methods. We endeavor to fill the vacancy by constructing the GRE dataset, a large-scale generative regional editing dataset with the following advantages: 1) Collection of real-world original images, focusing on two frequently edited scenarios. 2) Integration of a logical and simulated editing pipeline, leveraging multiple large models in various modalities. 3) Inclusion of various editing approaches with distinct architectures. 4) Provision of comprehensive analysis tasks. We perform comprehensive experiments on the three proposed tasks: edited image classification, edited method attribution and edited region localization, providing analysis of distinct editing methods and evaluation of detection methods in related fields. We expect that the GRE dataset can promote further research and exploration in the field of generative region editing detection. The paper introduces GRE, a large-scale dataset for detecting and analyzing generative regional editing in images. Generative AI advancements increase the risk of malicious image manipulation. Existing datasets lack comprehensiveness for detecting edits from advanced generative methods, hindering detection development. The authors collect real-world images and build a simulated editing pipeline using large models (e.g., ChatGPT, Stable Diffusion) to generate logically coherent edited images with various methods (GAN-based, diffusion-based, black-box). They provide annotations for classification, attribution, and localization tasks. Models trained on seen editing methods show good generalization on unseen ones for classification but struggle with generalization on localization. Attribution of GAN-based methods is easier than diffusion-based methods. Generative editing creates less perceptible edits, making detection challenging and highlighting the GRE dataset's value. The paper focuses on known generative methods, leaving room for exploring detection against unknown methods. Future work includes incorporating new editing methods and large models into the pipeline. generative ai, image manipulation detection, dataset, regional editing, deep learning
2311.17946 Report DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback Jiao Sun, Deqing Fu, Yushi Hu, Su Wang, Royi Rassin, Da-Cheng Juan, Dana Alon, Charles Herrmann, Sjoerd van Steenkiste, Ranjay Krishna, Cyrus Rashtchian Despite their wide-spread success, Text-to-Image models (T2I) still struggle to produce images that are both aesthetically pleasing and faithful to the user's input text. We introduce DreamSync, a model-agnostic training algorithm by design that improves T2I models to be faithful to the text input. DreamSync builds off a recent insight from TIFA's evaluation framework -- that large vision-language models (VLMs) can effectively identify the fine-grained discrepancies between generated images and the text inputs. DreamSync uses this insight to train T2I models without any labeled data; it improves T2I models using its own generations. First, it prompts the model to generate several candidate images for a given input text. Then, it uses two VLMs to select the best generation: a Visual Question Answering model that measures the alignment of generated images to the text, and another that measures the generation's aesthetic quality. After selection, we use LoRA to iteratively finetune the T2I model to guide its generation towards the selected best generations. DreamSync does not need any additional human annotation, model architecture changes, or reinforcement learning. Despite its simplicity, DreamSync improves both the semantic alignment and aesthetic appeal of two diffusion-based T2I models, evidenced by multiple benchmarks (+1.7% on TIFA, +2.9% on DSG1K, +3.4% on VILA aesthetic) and human evaluation. Introduces DreamSync, a model-agnostic training algorithm that enhances the faithfulness and aesthetic quality of text-to-image generation models. Existing text-to-image models struggle to produce images that are both aesthetically pleasing and faithful to the user's input text. This framework addresses these challenges in a model-agnostic way without human feedback. DreamSync uses vision-language models (VLMs) to evaluate and select the best generated images for fine-tuning the text-to-image generation model. It iteratively refines the model by generating multiple candidate images, having VLMs select the best based on faithfulness and aesthetics, and then fine-tuning on the selected images using LoRA. DreamSync significantly improves both the semantic alignment and aesthetic quality of two diffusion-based T2I models, SDXL and SD v1.4, as evidenced by multiple benchmarks (+1.7% on TIFA, +2.9% on DSG1K, +3.4% on VILA aesthetic). DreamSync outperforms existing state-of-the-art alignment methods on both TIFA and DSG benchmarks while maintaining high visual appeal, as measured by VILA and human evaluation. Human evaluation on SDXL shows that DreamSync consistently improves image alignment across all categories in the DSG benchmark. The performance of DreamSync is limited by the pre-trained model it starts with, as certain complex compositions or attributes might not be adequately addressed. Occasional decline in texture details and shadows is observed in some generated images after applying DreamSync, indicating room for further quality improvement. text-to-image generation, image faithfulness, vision-language models, model-agnostic training, iterative bootstrapping
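One round of the generate-select-finetune loop described above can be written down compactly. The sketch below keeps only the control flow; the T2I sampler, the VQA-based faithfulness scorer, the aesthetic scorer, and the LoRA finetuning step are passed in as callables, and the stand-ins in the usage example are hypothetical.

```python
import random

def dreamsync_round(prompts, generate, vqa_score, aesthetic_score, finetune,
                    n_candidates=8, faithful_thresh=0.9, aesthetic_thresh=0.6):
    """One self-training round in the spirit of DreamSync: sample candidates per
    prompt, keep the most faithful one that both judges accept, then finetune."""
    selected = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_candidates)]
        scored = [(img, vqa_score(img, prompt), aesthetic_score(img))
                  for img in candidates]
        kept = [(img, f) for img, f, a in scored
                if f >= faithful_thresh and a >= aesthetic_thresh]
        if kept:                                   # best faithfulness among accepted
            selected.append((prompt, max(kept, key=lambda p: p[1])[0]))
    return finetune(selected)

# Toy run with stand-in callables (the real system plugs in the T2I model and VLMs).
print(dreamsync_round(
    prompts=["a red cube on a blue sphere"],
    generate=lambda p: {"prompt": p, "seed": random.random()},
    vqa_score=lambda img, p: random.random(),
    aesthetic_score=lambda img: random.random(),
    finetune=lambda pairs: f"finetuned on {len(pairs)} selected pairs",
))
```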
2311.17944 Report LALM: Long-Term Action Anticipation with Language Models Sanghwan Kim, Daoji Huang, Yongqin Xian, Otmar Hilliges, Luc Van Gool, Xi Wang Understanding human activity is a crucial yet intricate task in egocentric vision, a field that focuses on capturing visual perspectives from the camera wearer's viewpoint. While traditional methods heavily rely on representation learning trained on extensive video data, there exists a significant limitation: obtaining effective video representations proves challenging due to the inherent complexity and variability in human activities. Furthermore, exclusive dependence on video-based learning may constrain a model's capability to generalize across long-tail classes and out-of-distribution scenarios. In this study, we introduce a novel approach for long-term action anticipation using language models (LALM), adept at addressing the complex challenges of long-term activity understanding without the need for extensive training. Our method incorporates an action recognition model to track previous action sequences and a vision-language model to articulate relevant environmental details. By leveraging the context provided by these past events, we devise a prompting strategy for action anticipation using large language models (LLMs). Moreover, we implement Maximal Marginal Relevance for example selection to facilitate in-context learning of the LLMs. Our experimental results demonstrate that LALM surpasses the state-of-the-art methods in the task of long-term action anticipation on the Ego4D benchmark. We further validate LALM on two additional benchmarks, affirming its capacity for generalization across intricate activities with different sets of taxonomies. These are achieved without specific fine-tuning. This paper introduces a novel framework, LALM, for long-term action anticipation in egocentric videos leveraging pre-trained vision-language and large language models (LLMs). Understanding human activity from an egocentric viewpoint is crucial for applications like user-assistance systems and patient monitoring. Existing methods struggle with the complexity and variability of human actions and often lack generalizability. LALM addresses these challenges by leveraging the power of LLMs. LALM utilizes an action recognition model to track past actions and a vision-language model to describe the visual context. This information is then used to construct prompts for an LLM, which predicts future actions. The LLM leverages in-context learning with exemplars selected using Maximal Marginal Relevance (MMR) for improved generalization. LALM surpasses state-of-the-art methods on the Ego4D benchmark for long-term action anticipation. The method demonstrates strong generalization capabilities, achieving competitive results on EK-55 and EGTEA datasets without fine-tuning. Ablation studies highlight the importance of accurate action recognition, effective image captioning, and appropriate prompt design for optimal performance. The reliance on accurate past action descriptions poses a limitation, as errors in action recognition propagate to the prediction stage. Future work can explore incorporating temporal information and advanced LLM prompting techniques, such as chain-of-thought prompting, for enhanced performance. egocentric vision, action anticipation, large language models, vision-language models, in-context learning
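Maximal Marginal Relevance is the one component of this entry that is fully specified by its name, so a small sketch is easy to give: pick exemplars that are similar to the query while penalizing redundancy with those already chosen. The NumPy toy below assumes generic embedding vectors; how the query and the example pool are embedded is the paper's design and is not shown here.

```python
import numpy as np

def mmr_select(query_vec, example_vecs, k=4, lam=0.7):
    """Maximal Marginal Relevance: pick k in-context examples that are relevant
    to the query while staying mutually diverse (lam trades the two off)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    remaining = list(range(len(example_vecs)))
    chosen = []
    while remaining and len(chosen) < k:
        def mmr_score(i):
            relevance = cos(query_vec, example_vecs[i])
            redundancy = max((cos(example_vecs[i], example_vecs[j]) for j in chosen),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy usage: embeddings of a query action sequence and a pool of candidate exemplars.
rng = np.random.default_rng(0)
query = rng.normal(size=64)
pool = rng.normal(size=(20, 64))
print(mmr_select(query, pool, k=4))   # indices of the selected exemplars
```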
2311.17937 Report Unlocking Spatial Comprehension in Text-to-Image Diffusion Models Mohammad Mahdi Derakhshani, Menglin Xia, Harkirat Behl, Cees G. M. Snoek, Victor Rühle We propose CompFuser, an image generation pipeline that enhances spatial comprehension and attribute assignment in text-to-image generative models. Our pipeline enables the interpretation of instructions defining spatial relationships between objects in a scene, such as 'An image of a gray cat on the left of an orange dog', and generates corresponding images. This is especially important in order to provide more control to the user. CompFuser overcomes the limitation of existing text-to-image diffusion models by decoding the generation of multiple objects into iterative steps: first generating a single object and then editing the image by placing additional objects in their designated positions. To create training data for spatial comprehension and attribute assignment we introduce a synthetic data generation process, that leverages a frozen large language model and a frozen layout-based diffusion model for object placement. We compare our approach to strong baselines and show that our model outperforms state-of-the-art image generation models in spatial comprehension and attribute assignment, despite being 3x to 5x smaller in parameters. This paper introduces CompFuser, an image generation pipeline enhancing spatial comprehension and attribute assignment in text-to-image models by iteratively adding objects to a scene based on their relative positions. Existing text-to-image models struggle with accurately representing spatial relationships between multiple objects, limiting user control over image generation. The pipeline uses a large language model (LLM) to decode instructions into multiple generation steps, starting with a single object and iteratively adding others. A synthetic dataset, created using an LLM and layout-based diffusion model, trains the model to understand spatial relationships. CompFuser outperforms state-of-the-art models in spatial comprehension and attribute assignment. The model achieves significantly higher accuracy in placing objects according to textual instructions. Despite being smaller in size, CompFuser demonstrates superior performance compared to larger counterparts. The model is currently limited to two objects and left/right relationships. The LLM used for layout generation lacks a deep understanding of geometry, potentially limiting the complexity of generated scenes. text-to-image generation, spatial reasoning, attribute assignment, image editing, diffusion models
2311.17921 Report Do text-free diffusion models learn discriminative visual representations? Soumik Mukhopadhyay, Matthew Gwilliam, Yosuke Yamaguchi, Vatsal Agarwal, Namitha Padmanabhan, Archana Swaminathan, Tianyi Zhou, Abhinav Shrivastava While many unsupervised learning models focus on one family of tasks, either generative or discriminative, we explore the possibility of a unified representation learner: a model which addresses both families of tasks simultaneously. We identify diffusion models, a state-of-the-art method for generative tasks, as a prime candidate. Such models involve training a U-Net to iteratively predict and remove noise, and the resulting model can synthesize high-fidelity, diverse, novel images. We find that the intermediate feature maps of the U-Net are diverse, discriminative feature representations. We propose a novel attention mechanism for pooling feature maps and further leverage this mechanism as DifFormer, a transformer feature fusion of features from different diffusion U-Net blocks and noise steps. We also develop DifFeed, a novel feedback mechanism tailored to diffusion. We find that diffusion models are better than GANs, and, with our fusion and feedback mechanisms, can compete with state-of-the-art unsupervised image representation learning methods for discriminative tasks - image classification with full and semi-supervision, transfer for fine-grained classification, object detection and segmentation, and semantic segmentation. Our project website (https://mgwillia.github.io/diffssl/) and code (https://github.com/soumik-kanad/diffssl) are available publicly. This paper demonstrates that diffusion models, known for generative tasks, can also learn discriminative visual representations suitable for recognition tasks, making them promising candidates for unified self-supervised representation learning. Unified representation learning is important because it allows a single model to be used for various downstream tasks like image recognition, reconstruction, and synthesis, eliminating the need for training separate models for each task. The authors analyze diffusion model embeddings, propose an Attention head for feature pooling, introduce DifFormer (a transformer-based feature fusion method combining features from different diffusion U-Net blocks and noise steps), and develop DifFeed (a feedback mechanism for diffusion features). Diffusion models outperform GANs in image classification and generation. The discriminative power of diffusion features is distributed across network blocks, noise time steps, and feature resolutions, requiring intelligent fusion strategies. Proposed methods (Attention head, DifFormer, DifFeed) significantly improve diffusion model performance on ImageNet classification, semi-supervised learning, fine-grained visual classification, and semantic segmentation. While competitive, the proposed methods' performance on semantic segmentation doesn't surpass state-of-the-art methods like MAE (ViT-L). Exploration of diffusion features for object detection and instance segmentation is limited due to training costs. diffusion models, self-supervised learning, unified representation learning, image classification, semantic segmentation
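Probing intermediate U-Net activations is the basic operation behind this line of work, and it can be sketched with plain PyTorch forward hooks. The snippet below is a hypothetical stand-in: a tiny convolutional model replaces a pretrained diffusion U-Net, the chosen module names and global-average pooling are assumptions, and the paper's DifFormer/DifFeed fusion modules are not reproduced.

```python
import torch
import torch.nn as nn

class FeatureTap:
    """Collect intermediate activations from chosen sub-modules via forward hooks,
    mirroring the general recipe of probing U-Net block outputs for a linear classifier."""
    def __init__(self, model: nn.Module, module_names):
        self.feats = {}
        self.handles = []
        named = dict(model.named_modules())
        for name in module_names:
            self.handles.append(named[name].register_forward_hook(self._make_hook(name)))

    def _make_hook(self, name):
        def hook(_module, _inp, out):
            # global-average-pool spatial maps into one feature vector per example
            self.feats[name] = out.mean(dim=(-2, -1)).detach()
        return hook

    def close(self):
        for h in self.handles:
            h.remove()

# toy "U-Net-ish" model; in practice this would be a pretrained diffusion U-Net
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1, stride=2), nn.ReLU(),
)
tap = FeatureTap(model, module_names=["0", "2"])
x = torch.randn(4, 3, 64, 64)
_ = model(x)
pooled = torch.cat([tap.feats["0"], tap.feats["2"]], dim=1)  # fused features for a probe
probe = nn.Linear(pooled.shape[1], 10)                        # e.g. a 10-way linear classifier
print(probe(pooled).shape)                                    # torch.Size([4, 10])
tap.close()
```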
2311.17917 Report AvatarStudio: High-fidelity and Animatable 3D Avatar Creation from Text Jianfeng Zhang, Xuanmeng Zhang, Huichao Zhang, Jun Hao Liew, Chenxu Zhang, Yi Yang, Jiashi Feng We study the problem of creating high-fidelity and animatable 3D avatars from only textual descriptions. Existing text-to-avatar methods are either limited to static avatars which cannot be animated or struggle to generate animatable avatars with promising quality and precise pose control. To address these limitations, we propose AvatarStudio, a coarse-to-fine generative model that generates explicit textured 3D meshes for animatable human avatars. Specifically, AvatarStudio begins with a low-resolution NeRF-based representation for coarse generation, followed by incorporating SMPL-guided articulation into the explicit mesh representation to support avatar animation and high resolution rendering. To ensure view consistency and pose controllability of the resulting avatars, we introduce a 2D diffusion model conditioned on DensePose for Score Distillation Sampling supervision. By effectively leveraging the synergy between the articulated mesh representation and the DensePose-conditional diffusion model, AvatarStudio can create high-quality avatars from text that are ready for animation, significantly outperforming previous methods. Moreover, it is competent for many applications, e.g., multimodal avatar animations and style-guided avatar creation. For more results, please refer to our project page: http://jeff95.me/projects/avatarstudio.html This paper proposes AvatarStudio, a coarse-to-fine generative model that creates high-fidelity, animatable 3D avatars from text descriptions. Existing text-to-avatar methods are limited to static avatars or struggle to produce high-quality animatable avatars with accurate pose control. This limits practical applications that require realistic and controllable digital humans. The method uses a two-stage approach: 1) a low-resolution NeRF representation for coarse generation and 2) optimization of a SMPL-guided articulated textured mesh for high-resolution rendering. It leverages a DensePose-conditioned ControlNet for Score Distillation Sampling (SDS) to ensure view consistency and pose accuracy. AvatarStudio generates significantly higher-quality avatars with finer details compared to previous state-of-the-art methods. It supports multimodal animation, allowing users to control avatar motion through videos or text descriptions. By incorporating an adapter, it enables the creation of avatars with unique artistic styles guided by reference images. Currently, AvatarStudio does not support fine-grained facial expressions. The avatar generation process could be made more efficient as it currently takes around 2.5 hours. 3d avatar generation, text-to-3d, animatable avatars, densepose guidance, score distillation sampling
2311.17907 Report CG3D: Compositional Generation for Text-to-3D via Gaussian Splatting Alexander Vilesov, Pradyumna Chari, Achuta Kadambi With the onset of diffusion-based generative models and their ability to generate text-conditioned images, content generation has received a massive invigoration. Recently, these models have been shown to provide useful guidance for the generation of 3D graphics assets. However, existing work in text-conditioned 3D generation faces fundamental constraints: (i) inability to generate detailed, multi-object scenes, (ii) inability to textually control multi-object configurations, and (iii) physically realistic scene composition. In this work, we propose CG3D, a method for compositionally generating scalable 3D assets that resolves these constraints. We find that explicit Gaussian radiance fields, parameterized to allow for compositions of objects, possess the capability to enable semantically and physically consistent scenes. By utilizing a guidance framework built around this explicit representation, we show state of the art results, capable of even exceeding the guiding diffusion model in terms of object combinations and physics accuracy. Proposes CG3D, a method for generating scalable and composable 3D scenes from text prompts using explicit Gaussian radiance fields. Addresses limitations of existing text-to-3D methods in generating detailed multi-object scenes with controllable configurations and physically realistic compositions. Decomposes scene generation into object generation and interaction parameter estimation, leveraging Gaussian splatting and score distillation sampling from a pre-trained image diffusion model. Achieves zero-shot compositional generation of diverse scenes with plausible object poses and scales. Enables physically realistic compositions through gravity and contact constraints. Allows for efficient scene editing and object manipulation. Assumes rigid-body interactions, limiting its ability to model object deformations. Performance relies heavily on the quality of guidance from the pre-trained diffusion model, potentially leading to failures in cases of weak guidance or ambiguities. text-to-3d, compositional generation, gaussian splatting, score distillation sampling, 3d scene synthesis
2311.17902 Report Language-conditioned Detection Transformer Jang Hyun Cho, Philipp Krähenbühl We present a new open-vocabulary detection framework. Our framework uses both image-level labels and detailed detection annotations when available. Our framework proceeds in three steps. We first train a language-conditioned object detector on fully-supervised detection data. This detector gets to see the presence or absence of ground truth classes during training, and conditions prediction on the set of present classes. We use this detector to pseudo-label images with image-level labels. Our detector provides much more accurate pseudo-labels than prior approaches with its conditioning mechanism. Finally, we train an unconditioned open-vocabulary detector on the pseudo-annotated images. The resulting detector, named DECOLA, shows strong zero-shot performance in open-vocabulary LVIS benchmark as well as direct zero-shot transfer benchmarks on LVIS, COCO, Object365, and OpenImages. DECOLA outperforms the prior arts by 17.1 AP-rare and 9.4 mAP on zero-shot LVIS benchmark. DECOLA achieves state-of-the-art results in various model sizes, architectures, and datasets by only training on open-sourced data and academic-scale computing. Code is available at https://github.com/janghyuncho/DECOLA. This paper introduces DECOLA, a transformer-based object detector that conditions its predictions on language embeddings of object categories, enhancing open-vocabulary detection performance. Current open-vocabulary detectors struggle to generalize to unseen object categories due to their reliance on fixed vocabularies and limited language integration. DECOLA addresses this by adapting its inner workings to any arbitrary set of concepts represented in language. DECOLA utilizes a two-phase approach: 1) Training a language-conditioned object detector on fully-supervised data to generate accurate pseudo-labels for weakly-labeled images. 2) Training an unconditioned open-vocabulary detector on the combined dataset of human-annotated and pseudo-annotated images. DECOLA achieves state-of-the-art results on open-vocabulary LVIS benchmark, outperforming previous methods by significant margins. The language-conditioning mechanism leads to high-quality pseudo-labels, effectively expanding the training data and improving generalization to unseen categories. DECOLA demonstrates strong direct zero-shot transfer performance on various benchmarks, including LVIS, COCO, Object365, and OpenImages. The performance improvement from ImageNet-21K pretraining is less significant for Deformable DETR compared to CenterNet2. Future work includes exploring co-training language-conditioning and multi-class prediction in a single phase. open-vocabulary detection, language-conditioned detection, self-training, pseudo-labeling, zero-shot transfer
2311.17891 Report Pose Anything: A Graph-Based Approach for Category-Agnostic Pose Estimation Or Hirschorn, Shai Avidan Traditional 2D pose estimation models are limited by their category-specific design, making them suitable only for predefined object categories. This restriction becomes particularly challenging when dealing with novel objects due to the lack of relevant training data. To address this limitation, category-agnostic pose estimation (CAPE) was introduced. CAPE aims to enable keypoint localization for arbitrary object categories using a single model, requiring minimal support images with annotated keypoints. This approach not only enables object pose generation based on arbitrary keypoint definitions but also significantly reduces the associated costs, paving the way for versatile and adaptable pose estimation applications. We present a novel approach to CAPE that leverages the inherent geometrical relations between keypoints through a newly designed Graph Transformer Decoder. By capturing and incorporating this crucial structural information, our method enhances the accuracy of keypoint localization, marking a significant departure from conventional CAPE techniques that treat keypoints as isolated entities. We validate our approach on the MP-100 benchmark, a comprehensive dataset comprising over 20,000 images spanning more than 100 categories. Our method outperforms the prior state-of-the-art by substantial margins, achieving remarkable improvements of 2.16% and 1.82% under 1-shot and 5-shot settings, respectively. Furthermore, our method's end-to-end training demonstrates both scalability and efficiency compared to previous CAPE approaches. This paper introduces a novel category-agnostic pose estimation method that leverages geometrical relationships between keypoints using a Graph Transformer Decoder, improving accuracy in keypoint localization for arbitrary object categories. Category-agnostic pose estimation allows a single model to predict keypoints for various object categories, even those unseen during training, which is crucial for real-world applications with novel objects. The method employs a Graph Transformer Decoder within a DETR-like architecture. It leverages a pre-trained SwinV2 backbone and removes keypoint order dependency. The decoder uses a graph convolutional network to capture relationships between keypoints, enhancing localization accuracy. The method outperforms previous state-of-the-art methods, achieving a significant improvement in PCK accuracy on the MP-100 benchmark. The model demonstrates robustness and generalization by effectively handling out-of-distribution images, including cartoons and AI-generated images. Ablation studies confirm the importance of the graph structure, with performance dropping significantly when random graph connections are used. The method's reliance on accurate skeleton definitions might limit its applicability to objects with highly complex or variable structures. Future work can explore extending this approach to 3D pose estimation. pose estimation, category-agnostic, graph neural networks, transformer networks, computer vision
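The core idea of treating keypoints as nodes of a skeleton graph can be illustrated with a single graph-convolution layer. The sketch below is a minimal, assumed version: a symmetric normalized adjacency with self-loops and one linear update per node; the actual Graph Transformer Decoder interleaves this kind of mixing with attention and uses the MP-100 skeleton definitions, none of which appear here.

```python
import torch
import torch.nn as nn

class KeypointGCNLayer(nn.Module):
    """One graph-convolution step over keypoint features: each keypoint is updated
    from its skeleton neighbours using a symmetric normalized adjacency."""
    def __init__(self, dim, adjacency: torch.Tensor):
        super().__init__()
        A = adjacency + torch.eye(adjacency.shape[0])        # add self-loops
        d_inv_sqrt = A.sum(dim=1).pow(-0.5)
        self.register_buffer("A_hat", d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :])
        self.lin = nn.Linear(dim, dim)
        self.act = nn.GELU()

    def forward(self, x):                 # x: (batch, num_keypoints, dim)
        mixed = torch.einsum("ij,bjd->bid", self.A_hat, x)
        return self.act(self.lin(mixed))

# toy 5-keypoint "skeleton": a simple chain 0-1-2-3-4
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
A = torch.zeros(5, 5)
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
layer = KeypointGCNLayer(dim=64, adjacency=A)
feats = torch.randn(2, 5, 64)             # features for 5 query keypoints
print(layer(feats).shape)                  # torch.Size([2, 5, 64])
```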
2311.17874 Report FisherRF: Active View Selection and Uncertainty Quantification for Radiance Fields using Fisher Information Wen Jiang, Boshu Lei, Kostas Daniilidis This study addresses the challenging problem of active view selection and uncertainty quantification within the domain of Radiance Fields. Neural Radiance Fields (NeRF) have greatly advanced image rendering and reconstruction, but the limited availability of 2D images poses uncertainties stemming from occlusions, depth ambiguities, and imaging errors. Efficiently selecting informative views becomes crucial, and quantifying NeRF model uncertainty presents intricate challenges. Existing approaches either depend on model architecture or are based on assumptions regarding density distributions that are not generally applicable. By leveraging Fisher Information, we efficiently quantify observed information within Radiance Fields without ground truth data. This can be used for the next best view selection and pixel-wise uncertainty quantification. Our method overcomes existing limitations on model architecture and effectiveness, achieving state-of-the-art results in both view selection and uncertainty quantification, demonstrating its potential to advance the field of Radiance Fields. Our method with the 3D Gaussian Splatting backend could perform view selections at 70 fps. Presents FisherRF, a novel method for active view selection and uncertainty quantification in Radiance Fields, leveraging Fisher information. Efficiently selecting informative views is crucial for NeRF models when only limited 2D images are available due to occlusions, depth ambiguities, and imaging errors. Leverages Fisher information to quantify observed information in Radiance Fields and uses it to select the next best view with the highest information gain. Employs approximations and exploits sparsity for efficient computation. FisherRF achieves state-of-the-art results in active view selection, outperforming previous methods and random baselines on Blender and Mip-NeRF360 datasets. Method enables effective batch active view selection, crucial for real-world applications like view planning. Demonstrates strong performance in pixel-wise uncertainty quantification, achieving superior results compared to state-of-the-art methods on the Light Field dataset. Limited to static scenes in confined scenarios. Future work includes extending the method to large-scale and dynamically changing Radiance Fields. radiance fields, active view selection, uncertainty quantification, fisher information, volumetric rendering
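The view-selection criterion can be illustrated with a toy version of the Fisher/Gauss-Newton approximation: score each candidate view by how strongly its rendered pixels depend on the scene parameters. In the sketch below the "renderer" is just a random projection and the trace of J^T J serves as the score; the real system works on NeRF or 3D Gaussian parameters with sparse, diagonal approximations, which this does not attempt.

```python
import torch
from torch.autograd.functional import jacobian

def fisher_trace(render_fn, params, view):
    """Score a candidate view by the trace of the Gauss-Newton / Fisher approximation
    J^T J, where J is the Jacobian of rendered pixels w.r.t. the scene parameters.
    A larger trace means the view constrains the parameters more."""
    J = jacobian(lambda theta: render_fn(theta, view), params)  # (num_pixels, num_params)
    return float((J ** 2).sum())                                # trace(J^T J)

# toy "scene": the parameters are a small vector; each candidate "view" is a random
# projection of them, standing in for volume rendering of a radiance field.
torch.manual_seed(0)
params = torch.randn(8)
views = [torch.randn(16, 8) for _ in range(5)]   # 5 candidate views, 16 "pixels" each

def toy_render(theta, view_matrix):
    return torch.tanh(view_matrix @ theta)

scores = [fisher_trace(toy_render, params, v) for v in views]
best = max(range(len(views)), key=lambda i: scores[i])
print("next best view:", best)
```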
2311.17857 Report Gaussian Shell Maps for Efficient 3D Human Generation Rameen Abdal, Wang Yifan, Zifan Shi, Yinghao Xu, Ryan Po, Zhengfei Kuang, Qifeng Chen, Dit-Yan Yeung, Gordon Wetzstein Efficient generation of 3D digital humans is important in several industries, including virtual reality, social media, and cinematic production. 3D generative adversarial networks (GANs) have demonstrated state-of-the-art (SOTA) quality and diversity for generated assets. Current 3D GAN architectures, however, typically rely on volume representations, which are slow to render, thereby hampering the GAN training and requiring multi-view-inconsistent 2D upsamplers. Here, we introduce Gaussian Shell Maps (GSMs) as a framework that connects SOTA generator network architectures with emerging 3D Gaussian rendering primitives using an articulable multi shell--based scaffold. In this setting, a CNN generates a 3D texture stack with features that are mapped to the shells. The latter represent inflated and deflated versions of a template surface of a digital human in a canonical body pose. Instead of rasterizing the shells directly, we sample 3D Gaussians on the shells whose attributes are encoded in the texture features. These Gaussians are efficiently and differentiably rendered. The ability to articulate the shells is important during GAN training and, at inference time, to deform a body into arbitrary user-defined poses. Our efficient rendering scheme bypasses the need for view-inconsistent upsamplers and achieves high-quality multi-view consistent renderings at a native resolution of $512 \times 512$ pixels. We demonstrate that GSMs successfully generate 3D humans when trained on single-view datasets, including SHHQ and DeepFashion. This paper introduces Gaussian Shell Maps (GSMs), an efficient 3D GAN framework that combines CNN-based generators with 3D Gaussian rendering primitives for high-quality, real-time 3D human generation. Efficient 3D human generation is crucial for various industries. Existing methods either suffer from slow volume rendering or limited expressiveness. GSMs offer a solution by combining the efficiency of CNNs with the expressiveness of 3D Gaussians. GSMs anchor 3D Gaussians to "shells" derived from the SMPL human body template. A CNN generates texture maps encoding Gaussian parameters, enabling efficient rendering via Gaussian splatting. Articulation is achieved by deforming the shells. GSMs generate diverse and high-resolution (512x512) 3D humans with realistic clothing and accessories. The method achieves state-of-the-art rendering speed (125 FPS) without requiring upsampling, eliminating aliasing artifacts. GSMs outperform competing methods in pose control accuracy while achieving comparable visual quality and diversity. The method relies on a parametric deformation model, limiting its ability to handle complex dynamics of hair and loose clothing. Extracting accurate geometry and normals from the irregular and sparse Gaussians is not straightforward. Future work includes exploring surface splatting for better geometry extraction and incorporating multi-view data for enhanced realism. 3d human generation, generative adversarial networks, 3d gaussians, shell maps, real-time rendering
2311.17754 Report Cinematic Behavior Transfer via NeRF-based Differentiable Filming Xuekun Jiang, Anyi Rao, Jingbo Wang, Dahua Lin, Bo Dai In the evolving landscape of digital media and video production, the precise manipulation and reproduction of visual elements like camera movements and character actions are highly desired. Existing SLAM methods face limitations in dynamic scenes and human pose estimation often focuses on 2D projections, neglecting 3D statuses. To address these issues, we first introduce a reverse filming behavior estimation technique. It optimizes camera trajectories by leveraging NeRF as a differentiable renderer and refining SMPL tracks. We then introduce a cinematic transfer pipeline that is able to transfer various shot types to a new 2D video or a 3D virtual environment. The incorporation of 3D engine workflow enables superior rendering and control abilities, which also achieves a higher rating in the user study. This paper presents a reverse filming behavior estimation technique for transferring cinematic behavior from movie shots to new 2D/3D content using NeRF and SMPL models, enabling artists to reuse camera trajectories and character movements through a 3D engine-based workflow. Precise manipulation and reproduction of camera movements and character actions in video production is crucial for maintaining continuity, style, and mood. Existing methods often struggle with complex dynamic scenes and decoupling human and camera motions. The method predicts SMPL tracks and optimizes camera trajectories using NeRF as a differentiable renderer with image-level matching supervision. It refines character movements and applies them to new 2D videos or 3D virtual environments via a 3D engine workflow. The approach accurately extracts character movements and camera trajectories from various movie shots, enabling 2D and 3D cinematic transfers. It outperforms state-of-the-art methods in frame composition restoration and camera pose estimation across different shot types. User studies confirm the effectiveness and user satisfaction with the generated results, especially in terms of camera movement accuracy and character pose refinement. The method depends on SLAM for initial camera trajectory estimation, limiting its performance in highly dynamic scenes. It focuses primarily on shots with prominent human subjects, requiring adaptation for scenes centered around environments or objects. cinematic behavior transfer, camera trajectory optimization, character motion estimation, nerf, smpl
2311.17737 Report GenZI: Zero-Shot 3D Human-Scene Interaction Generation Lei Li, Angela Dai Can we synthesize 3D humans interacting with scenes without learning from any 3D human-scene interaction data? We propose GenZI, the first zero-shot approach to generating 3D human-scene interactions. Key to GenZI is our distillation of interaction priors from large vision-language models (VLMs), which have learned a rich semantic space of 2D human-scene compositions. Given a natural language description and a coarse point location of the desired interaction in a 3D scene, we first leverage VLMs to imagine plausible 2D human interactions inpainted into multiple rendered views of the scene. We then formulate a robust iterative optimization to synthesize the pose and shape of a 3D human model in the scene, guided by consistency with the 2D interaction hypotheses. In contrast to existing learning-based approaches, GenZI circumvents the conventional need for captured 3D interaction data, and allows for flexible control of the 3D interaction synthesis with easy-to-use text prompts. Extensive experiments show that our zero-shot approach has high flexibility and generality, making it applicable to diverse scene types, including both indoor and outdoor environments. The paper introduces GenZI, a zero-shot approach for generating 3D human-scene interactions (HSI) from text prompts, eliminating the need for 3D interaction training data. Existing HSI synthesis methods rely on large-scale 3D interaction datasets, which are costly and difficult to acquire, limiting their generalizability. This work explores zero-shot 3D HSI generation by leveraging powerful 2D vision-language models (VLMs). GenZI employs VLMs to imagine plausible 2D human interactions in multiple rendered scene views using a dynamic masking scheme for automated human inpainting. It then optimizes a 3D human model's pose and shape to ensure consistency with the inferred 2D interactions through a robust, iterative process. Perceptual studies show a strong preference for GenZI's generations over baselines. GenZI achieves the highest semantic consistency scores, indicating better alignment between generated interactions and text prompts. The method demonstrates strong generalization to diverse indoor and outdoor scenes, unlike data-driven baselines limited to specific interaction types. The quality of GenZI depends on the inpainting ability of the VLM, which can be limited by the model's capacity and biases. Inference speed is constrained by the iterative nature of diffusion models used for inpainting. 3d human-scene interaction, zero-shot learning, vision-language models, latent diffusion models, 3d human pose estimation
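The second stage is, at its core, a multi-view consistency fit: adjust a 3D body so that its projections match the 2D interaction hypotheses in every inpainted view. The sketch below is a heavily simplified, hypothetical stand-in that fits six free 3D "joints" to exact 2D targets across three posed cameras with a plain reprojection loss; the actual method optimizes SMPL pose and shape against VLM-inpainted images with robust weighting, none of which is modeled here.

```python
import math
import torch

def project(points, K, R, t):
    """Pinhole projection of (N, 3) world points with intrinsics K and extrinsics (R, t)."""
    cam = points @ R.T + t
    uvw = cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

# toy setup: 6 body "joints" observed from 3 posed renderings (all values assumed)
torch.manual_seed(0)
gt_joints = torch.randn(6, 3) * 0.5 + torch.tensor([0.0, 0.0, 5.0])
K = torch.tensor([[500.0, 0.0, 128.0], [0.0, 500.0, 128.0], [0.0, 0.0, 1.0]])
views = []
for ang, tx in [(0.0, 0.0), (0.3, -1.0), (-0.3, 1.0)]:
    c, s = math.cos(ang), math.sin(ang)
    R = torch.tensor([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    t = torch.tensor([tx, 0.0, 0.0])
    # in the real pipeline these 2D targets would come from inpainted views, not ground truth
    views.append((R, t, project(gt_joints, K, R, t)))

# optimize the 3D joints so their projections agree with every 2D hypothesis
joints = (torch.tensor([0.0, 0.0, 5.0]) + 0.1 * torch.randn(6, 3)).requires_grad_()
opt = torch.optim.Adam([joints], lr=0.05)
for _ in range(600):
    loss = sum(((project(joints, K, R, t) - uv) ** 2).mean() for R, t, uv in views)
    opt.zero_grad(); loss.backward(); opt.step()
print("mean 3D error after fitting:", float((joints - gt_joints).norm(dim=1).mean()))
```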
2311.17717 Report Receler: Reliable Concept Erasing of Text-to-Image Diffusion Models via Lightweight Erasers Chi-Pin Huang, Kai-Po Chang, Chung-Ting Tsai, Yung-Hsuan Lai, Fu-En Yang, Yu-Chiang Frank Wang Concept erasure in text-to-image diffusion models aims to disable pre-trained diffusion models from generating images related to a target concept. To perform reliable concept erasure, the properties of robustness and locality are desirable. The former refrains the model from producing images associated with the target concept for any paraphrased or learned prompts, while the latter preserves its ability in generating images with non-target concepts. In this paper, we propose Reliable Concept Erasing via Lightweight Erasers (Receler). It learns a lightweight Eraser to perform concept erasing while satisfying the above desirable properties by proposed concept-localized regularization and adversarial prompt learning schemes. Comprehensive experiments with various concepts verify the superiority of Receler over previous methods. Our code will be available upon acceptance. This paper proposes Receler, a method for reliably erasing concepts from text-to-image diffusion models using lightweight erasers while maintaining locality and robustness to paraphrased prompts. Concept erasure is crucial for mitigating the risks of generating NSFW or copyright-infringing content from pre-trained text-to-image diffusion models. Receler introduces a lightweight eraser that learns to remove target concepts from cross-attention layer outputs. It leverages concept-localized regularization for locality and adversarial prompt learning for robustness. Receler outperforms state-of-the-art methods in erasing objects and inappropriate content, demonstrating superior robustness and locality. It effectively defends against learned attack prompts, showcasing enhanced reliability. Receler allows for compositional concept erasure by combining outputs from separately trained erasers. The paper primarily focuses on erasing single concepts, leaving multi-concept erasure for future exploration. The impact of different pre-trained diffusion models on erasure effectiveness requires further investigation. concept erasing, diffusion models, parameter-efficient fine-tuning, adversarial prompt learning, text-to-image generation
2311.17707 Report SAMPro3D: Locating SAM Prompts in 3D for Zero-Shot Scene Segmentation Mutian Xu, Xingyilang Yin, Lingteng Qiu, Yang Liu, Xin Tong, Xiaoguang Han We introduce SAMPro3D for zero-shot 3D indoor scene segmentation. Given the 3D point cloud and multiple posed 2D frames of 3D scenes, our approach segments 3D scenes by applying the pretrained Segment Anything Model (SAM) to 2D frames. Our key idea involves locating 3D points in scenes as natural 3D prompts to align their projected pixel prompts across frames, ensuring frame-consistency in both pixel prompts and their SAM-predicted masks. Moreover, we suggest filtering out low-quality 3D prompts based on feedback from all 2D frames, for enhancing segmentation quality. We also propose to consolidate different 3D prompts if they are segmenting the same object, bringing a more comprehensive segmentation. Notably, our method does not require any additional training on domain-specific data, enabling us to preserve the zero-shot power of SAM. Extensive qualitative and quantitative results show that our method consistently achieves higher quality and more diverse segmentation than previous zero-shot or fully supervised approaches, and in many cases even surpasses human-level annotations. The project page can be accessed at https://mutianxu.github.io/sampro3d/. This paper introduces SAMPro3D, a novel framework for zero-shot 3D indoor scene segmentation using the pretrained Segment Anything Model (SAM) applied to posed 2D frames of the scene. Existing methods for 3D scene segmentation lack zero-shot capability or require domain-specific training, limiting their generalizability to new scenes. SAMPro3D leverages the zero-shot power of SAM for direct application to 3D scenes. SAMPro3D locates 3D points as prompts to align corresponding pixel prompts and predicted masks across different frames, ensuring frame-consistency. It then filters low-quality prompts and consolidates those segmenting the same object. SAMPro3D consistently achieves higher quality and more diverse segmentation than previous zero-shot or fully supervised approaches on ScanNet200. Qualitative results showcase superior segmentation across various scenes and objects, often surpassing human annotations in diversity. User studies confirm the effectiveness of SAMPro3D in terms of both segmentation accuracy and diversity, exceeding even human-level performance. The segmentation performance of SAMPro3D is inherently limited by the capabilities of SAM. Future work could explore real-time harmonization of 3D scene segmentation and reconstruction using Mobile-SAM and parallel processing. 3d scene segmentation, zero-shot learning, segment anything model (sam), prompt engineering, indoor scene understanding
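The frame-consistency trick rests on a standard operation: project each 3D prompt point into every posed frame and keep it only where it is actually visible. The snippet below sketches that projection plus a depth-based visibility test with toy intrinsics, an identity camera pose, and a flat depth map; SAM itself is not invoked, and the tolerance value is an arbitrary assumption.

```python
import numpy as np

def project_prompts(points_w, K, T_cw, depth, z_tol=0.05):
    """Project world-space prompt points into one posed frame.
    Returns (u, v) pixel prompts and a visibility mask: in front of the camera,
    inside the image, and consistent with the frame's depth map."""
    N = points_w.shape[0]
    pts_h = np.concatenate([points_w, np.ones((N, 1))], axis=1)   # (N, 4) homogeneous
    cam = (T_cw @ pts_h.T).T[:, :3]                               # world -> camera
    z = cam[:, 2]
    uvw = (K @ cam.T).T
    uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-6, None)
    H, W = depth.shape
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    inside = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    vis = np.zeros(N, dtype=bool)
    vis[inside] = np.abs(depth[v[inside], u[inside]] - z[inside]) < z_tol
    return np.stack([u, v], axis=1), vis

# toy scene: 3 prompt points, one 480x640 frame with a flat depth of 2.0 m
K = np.array([[600.0, 0, 320], [0, 600.0, 240], [0, 0, 1]])
T_cw = np.eye(4)                                   # camera at the world origin
points = np.array([[0.0, 0.0, 2.0], [0.5, 0.1, 2.0], [0.0, 0.0, -1.0]])
depth = np.full((480, 640), 2.0)
uv, visible = project_prompts(points, K, T_cw, depth)
print(uv, visible)   # the third point is behind the camera -> not visible
```

The visible pixel locations would then be fed to SAM as point prompts, one frame at a time, with the shared 3D point identity linking masks across frames.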
2311.17618 Report ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model Fukun Yin, Xin Chen, Chi Zhang, Biao Jiang, Zibo Zhao, Jiayuan Fan, Gang Yu, Taihao Li, Tao Chen The advent of large language models, enabling flexibility through instruction-driven approaches, has revolutionized many traditional generative tasks, but large models for 3D data, particularly in comprehensively handling 3D shapes with other modalities, are still under-explored. By achieving instruction-based shape generations, versatile multimodal generative shape models can significantly benefit various fields like 3D virtual construction and network-aided design. In this work, we present ShapeGPT, a shape-included multi-modal framework to leverage strong pre-trained language models to address multiple shape-relevant tasks. Specifically, ShapeGPT employs a word-sentence-paragraph framework to discretize continuous shapes into shape words, further assembles these words for shape sentences, as well as integrates shape with instructional text for multi-modal paragraphs. To learn this shape-language model, we use a three-stage training scheme, including shape representation, multimodal alignment, and instruction-based generation, to align shape-language codebooks and learn the intricate correlations among these modalities. Extensive experiments demonstrate that ShapeGPT achieves comparable performance across shape-relevant tasks, including text-to-shape, shape-to-text, shape completion, and shape editing. ShapeGPT, a unified shape-included multi-modal framework leveraging pre-trained LLMs to address multiple shape-related tasks. Existing methods lack a holistic understanding of the interplay between 3D shapes and other modalities, limiting their versatility across tasks. ShapeGPT discretizes shapes into shape words and sentences, integrates them with text for multi-modal paragraphs, and utilizes a three-stage training scheme (shape representation, multimodal alignment, instruction-based generation). ShapeGPT achieves comparable performance to state-of-the-art methods in image-to-shape, text-to-shape, and multi-modal-to-shape generation. It effectively handles additional shape-centric tasks like shape captioning, completion, reasoning, and editing within a single architecture. Ablation studies highlight the importance of shape token length, language model size, and pre-training for optimal performance. ShapeGPT's current capabilities are limited to single-object generation and lack texture generation. Future work aims to expand ShapeGPT's capabilities to include more shape-centric tasks, textured shapes, and support for additional modalities like voice and video. 3d shape generation, multimodal learning, large language models, shape-language pre-training, instruction-based generation
2311.17609 Report AnyLens: A Generative Diffusion Model with Any Rendering Lens Andrey Voynov, Amir Hertz, Moab Arar, Shlomi Fruchter, Daniel Cohen-Or State-of-the-art diffusion models can generate highly realistic images based on various conditioning like text, segmentation, and depth. However, an essential aspect often overlooked is the specific camera geometry used during image capture. The influence of different optical systems on the final scene appearance is frequently overlooked. This study introduces a framework that intimately integrates a text-to-image diffusion model with the particular lens geometry used in image rendering. Our method is based on a per-pixel coordinate conditioning method, enabling the control over the rendering geometry. Notably, we demonstrate the manipulation of curvature properties, achieving diverse visual effects, such as fish-eye, panoramic views, and spherical texturing using a single diffusion model. Introduces AnyLens, a framework integrating text-to-image diffusion models with specific lens geometries for enhanced realism and control over optical effects in generated images. Addresses the limitation of existing text-to-image diffusion models in replicating diverse optical effects produced by various camera lenses, enhancing realism and control over image synthesis. Utilizes per-pixel coordinate conditioning, providing the diffusion model with spatial locations of pixels in an undistorted view, and introduces self-attention re-weighting to account for content density variations caused by warping. Successfully generates images simulating diverse lens effects, including fish-eye and panoramic views. Enables spherical texturing and panorama generation with accurate surface curvature alignment. Maintains base text-to-image generation quality while enabling lens-based control. Relies on metric representations compatible with the simulated lens geometry. Limited extrapolation capabilities due to reliance on repetitions. diffusion models, lens simulation, image generation, spherical texturing, per-pixel conditioning
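The per-pixel coordinate conditioning can be pictured as a two-channel map that tells the model, for every output pixel, where that pixel sits in an undistorted view. The sketch below builds such a map for a single-parameter radial ("fisheye-like") distortion; the distortion model, the normalization to [-1, 1], and the k1 value are assumptions, not the paper's lens parameterization.

```python
import numpy as np

def coordinate_conditioning(h, w, k1=0.3):
    """Per-pixel conditioning map: for every pixel of the target (distorted) image,
    store the normalized (x, y) coordinate it corresponds to in an undistorted view.
    A single radial term k1 stands in for an arbitrary lens model."""
    ys, xs = np.meshgrid(
        np.linspace(-1.0, 1.0, h), np.linspace(-1.0, 1.0, w), indexing="ij"
    )
    r2 = xs ** 2 + ys ** 2
    scale = 1.0 + k1 * r2                                  # barrel-style radial distortion
    coords = np.stack([xs * scale, ys * scale], axis=-1)   # (h, w, 2)
    return coords.astype(np.float32)

cond = coordinate_conditioning(64, 64, k1=0.3)
print(cond.shape, cond.min(), cond.max())   # (64, 64, 2), roughly in [-1.6, 1.6] at the corners
```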
2311.17536 Report SmoothVideo: Smooth Video Synthesis with Noise Constraints on Diffusion Models for One-shot Video Tuning Liang Peng, Haoran Cheng, Zheng Yang, Ruisi Zhao, Linxuan Xia, Chaotian Song, Qinglin Lu, Boxi Wu, Wei Liu Recent one-shot video tuning methods, which fine-tune the network on a specific video based on pre-trained text-to-image models (e.g., Stable Diffusion), are popular in the community because of the flexibility. However, these methods often produce videos marred by incoherence and inconsistency. To address these limitations, this paper introduces a simple yet effective noise constraint across video frames. This constraint aims to regulate noise predictions across their temporal neighbors, resulting in smooth latents. It can be simply included as a loss term during the training phase. By applying the loss to existing one-shot video tuning methods, we significantly improve the overall consistency and smoothness of the generated videos. Furthermore, we argue that current video evaluation metrics inadequately capture smoothness. To address this, we introduce a novel metric that considers detailed features and their temporal dynamics. Experimental results validate the effectiveness of our approach in producing smoother videos on various one-shot video tuning baselines. The source codes and video demos are available at https://github.com/SPengLiang/SmoothVideo. This paper introduces a noise constraint loss to improve the smoothness and coherence of videos generated by one-shot video tuning methods. Existing one-shot video tuning methods often produce videos with incoherence and flicker, leading to noticeable artifacts and reduced visual quality. The authors analyze the relationship between noise predictions and latent representations in the DDIM reverse process. They propose a noise constraint loss that regularizes the noise predictions across adjacent video frames, encouraging smoother transitions and reducing flicker. Applying the noise constraint loss to existing one-shot video tuning methods (Tune-A-Video, ControlVideo, Make-A-Protagonist) significantly improves the smoothness of generated videos. The authors introduce a novel video smoothness metric, VL score, that outperforms traditional CLIP-based metrics in capturing temporal consistency. The proposed method can be readily integrated into training-free video editing techniques, leading to enhanced coherence and reduced flickering. The noise constraint loss can sometimes negatively impact text alignment in the generated videos. The sliding window design in the VL score metric cannot effectively handle complex scene motions like zoom in and zoom out. video generation, one-shot video tuning, diffusion models, video smoothness, noise constraint loss
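Because the constraint is just a loss term over neighboring frames' noise predictions, a minimal version is easy to write down. The sketch below penalizes the squared difference between each frame's predicted noise and its temporal neighbor and adds it to a standard diffusion loss; the tensor layout, the 0.1 weight, and the absence of any stop-gradient are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def noise_constraint_loss(eps_pred: torch.Tensor) -> torch.Tensor:
    """eps_pred: (batch, frames, channels, H, W) noise predicted by the video U-Net.
    Penalize the difference between each frame's prediction and its temporal neighbor,
    which encourages smoothly varying latents across frames."""
    return F.mse_loss(eps_pred[:, 1:], eps_pred[:, :-1])

# inside a (hypothetical) tuning step the term is simply added to the diffusion loss:
eps_pred = torch.randn(2, 8, 4, 32, 32, requires_grad=True)
eps_true = torch.randn(2, 8, 4, 32, 32)
lambda_smooth = 0.1                                   # assumed weighting
loss = F.mse_loss(eps_pred, eps_true) + lambda_smooth * noise_constraint_loss(eps_pred)
loss.backward()
print(float(loss))
```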
2311.17528 Report HiDiffusion: Unlocking Higher-Resolution Creativity and Efficiency in Pretrained Diffusion Models Shen Zhang, Zhaowei Chen, Zhenyu Zhao, Yuhao Chen, Yao Tang, Jiajun Liang Diffusion models have become a mainstream approach for high-resolution image synthesis. However, directly generating higher-resolution images from pretrained diffusion models will encounter unreasonable object duplication and exponentially increase the generation time. In this paper, we discover that object duplication arises from feature duplication in the deep blocks of the U-Net. Concurrently, we pinpoint the extended generation times to self-attention redundancy in U-Net's top blocks. To address these issues, we propose a tuning-free higher-resolution framework named HiDiffusion. Specifically, HiDiffusion contains Resolution-Aware U-Net (RAU-Net) that dynamically adjusts the feature map size to resolve object duplication and engages Modified Shifted Window Multi-head Self-Attention (MSW-MSA) that utilizes optimized window attention to reduce computations. We can integrate HiDiffusion into various pretrained diffusion models to scale image generation resolutions even to 4096x4096 at 1.5-6x the inference speed of previous methods. Extensive experiments demonstrate that our approach can address object duplication and heavy computation issues, achieving state-of-the-art performance on higher-resolution image synthesis tasks. This paper presents HiDiffusion, a tuning-free framework to enable pretrained diffusion models to generate higher-resolution images (e.g., 4096x4096) with high efficiency. Existing methods for higher-resolution image synthesis using diffusion models suffer from limitations such as object duplication, lack of fine details, and slow inference speed. HiDiffusion aims to address these issues and improve the scalability of pretrained models. HiDiffusion incorporates two novel components: 1) Resolution-Aware U-Net (RAU-Net) dynamically adjusts feature map sizes to mitigate object duplication and retain fine details. 2) Modified Shifted Window Multi-head Self-Attention (MSW-MSA) reduces computational cost by replacing global attention with optimized window attention. HiDiffusion successfully scales image generation resolutions up to 4096x4096 while preserving fine details and avoiding object duplication. The method demonstrates significant speed improvements, achieving 1.5-6x faster inference compared to previous methods. HiDiffusion is a tuning-free approach, meaning it can be seamlessly integrated with existing pretrained diffusion models like SD 1.5, SD 2.1, SDXL, and SDXL Turbo without requiring additional training. While effective, HiDiffusion still relies on the inherent capabilities of the base diffusion model, requiring prompt engineering for optimal results. Future work could explore better integration with super-resolution techniques to further enhance resolution and image quality. diffusion models, high-resolution image synthesis, image generation, model efficiency, tuning-free
2311.17461 Report When StyleGAN Meets Stable Diffusion: a $\mathscr{W}_+$ Adapter for Personalized Image Generation Xiaoming Li, Xinyu Hou, Chen Change Loy Text-to-image diffusion models have remarkably excelled in producing diverse, high-quality, and photo-realistic images. This advancement has spurred a growing interest in incorporating specific identities into generated content. Most current methods employ an inversion approach to embed a target visual concept into the text embedding space using a single reference image. However, the newly synthesized faces either closely resemble the reference image in terms of facial attributes, such as expression, or exhibit a reduced capacity for identity preservation. Text descriptions intended to guide the facial attributes of the synthesized face may fall short, owing to the intricate entanglement of identity information with identity-irrelevant facial attributes derived from the reference image. To address these issues, we present the novel use of the extended StyleGAN embedding space $\mathcal{W}_+$, to achieve enhanced identity preservation and disentanglement for diffusion models. By aligning this semantically meaningful human face latent space with text-to-image diffusion models, we succeed in maintaining high fidelity in identity preservation, coupled with the capacity for semantic editing. Additionally, we propose new training objectives to balance the influences of both prompt and identity conditions, ensuring that the identity-irrelevant background remains unaffected during facial attribute modifications. Extensive experiments reveal that our method adeptly generates personalized text-to-image outputs that are not only compatible with prompt descriptions but also amenable to common StyleGAN editing directions in diverse settings. Our source code will be available at https://github.com/csxmli2016/w-plus-adapter. This paper proposes a novel approach to personalized text-to-image generation that leverages the extended StyleGAN embedding space (W+) to enhance identity preservation and disentanglement for diffusion models. Existing methods struggle to simultaneously preserve identity, generate varied facial attributes, and create identity-irrelevant content aligned with text descriptions due to the entangled nature of textual embedding space. The approach involves two stages: 1) aligning W+ with Stable Diffusion by training a mapping network to project a w+ embedding to SD latent space and injecting this as an additional identity condition, 2) fine-tuning for in-the-wild generation using a residual cross-attention module and novel regularized training to disentangle identity-relevant and -irrelevant features. The method successfully generates personalized text-to-image outputs compatible with prompt descriptions and amenable to StyleGAN editing directions. Quantitative evaluation shows comparable or superior performance to state-of-the-art methods in terms of CLIP Score, identity distance, and face detection score. Ablation studies demonstrate the importance of each component, including the two-stage training, residual cross-attention, and regularization loss, in achieving a balance between identity preservation, attribute editability, and text prompt compatibility. The process of converting real images to w+ vectors can lead to a loss of detail, impacting identity fidelity. The current framework is limited to single-face generation and editing. text-to-image generation, personalized image synthesis, stylegan, stable diffusion, facial attribute editing
2311.17338 Report VideoAssembler: Identity-Consistent Video Generation with Reference Entities using Diffusion Model Haoyu Zhao, Tianyi Lu, Jiaxi Gu, Xing Zhang, Zuxuan Wu, Hang Xu, Yu-Gang Jiang Identity-consistent video generation seeks to synthesize videos that are guided by both textual prompts and reference images of entities. Current approaches typically utilize cross-attention layers to integrate the appearance of the entity, which predominantly captures semantic attributes, resulting in compromised fidelity of entities. Moreover, these methods necessitate iterative fine-tuning for each new entity encountered, thereby limiting their applicability. To address these challenges, we introduce VideoAssembler, a novel end-to-end framework for identity-consistent video generation that can conduct inference directly when encountering new entities. VideoAssembler is adept at producing videos that are not only flexible with respect to the input reference entities but also responsive to textual conditions. Additionally, by modulating the quantity of input images for the entity, VideoAssembler enables the execution of tasks ranging from image-to-video generation to sophisticated video editing. VideoAssembler comprises two principal components: the Reference Entity Pyramid (REP) encoder and the Entity-Prompt Attention Fusion (EPAF) module. The REP encoder is designed to infuse comprehensive appearance details into the denoising stages of the stable diffusion model. Concurrently, the EPAF module is utilized to integrate text-aligned features effectively. Furthermore, to mitigate the challenge of scarce data, we present a methodology for the preprocessing of training data. Our evaluation of the VideoAssembler framework on the UCF-101, MSR-VTT, and DAVIS datasets indicates that it achieves good performances in both quantitative and qualitative analyses (346.84 in FVD and 48.01 in IS on UCF-101). Our project page is at https://gulucaptain.github.io/videoassembler/. This paper introduces VideoAssembler, a novel end-to-end framework for identity-consistent video generation that can directly infer new entities without retraining. Identity-consistent video generation is challenging because it requires generating content-reasonable videos while accurately injecting given entity information. Existing methods struggle with appearance fidelity, weak action guidance, and reliance on few-shot fine-tuning. VideoAssembler utilizes a Reference Entity Pyramid (REP) encoder to infuse detailed appearance into the denoising stages of the stable diffusion model and an Entity-Prompt Attention Fusion (EPAF) module to integrate text-aligned features. It also introduces a data preprocessing methodology to address training data scarcity. VideoAssembler achieves state-of-the-art performance on UCF-101 and MSR-VTT datasets in terms of FVD and IS metrics. The method exhibits strong entity fidelity and action control, as evidenced by qualitative comparisons with other methods like VideoDreamer. It demonstrates flexibility in handling different numbers of input entities, enabling image-to-video generation and video editing tasks. The optimal number of input entities for the best performance needs further investigation. The model's generative creativity might be slightly limited when solely relying on the REP encoder. video generation, identity-consistent, reference entities, diffusion models, video editing
2311.17261 Report SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors Dave Zhenyu Chen, Haoxuan Li, Hsin-Ying Lee, Sergey Tulyakov, Matthias Nießner We propose SceneTex, a novel method for effectively generating high-quality and style-consistent textures for indoor scenes using depth-to-image diffusion priors. Unlike previous methods that either iteratively warp 2D views onto a mesh surface or distillate diffusion latent features without accurate geometric and style cues, SceneTex formulates the texture synthesis task as an optimization problem in the RGB space where style and geometry consistency are properly reflected. At its core, SceneTex proposes a multiresolution texture field to implicitly encode the mesh appearance. We optimize the target texture via a score-distillation-based objective function in respective RGB renderings. To further secure the style consistency across views, we introduce a cross-attention decoder to predict the RGB values by cross-attending to the pre-sampled reference locations in each instance. SceneTex enables various and accurate texture synthesis for 3D-FRONT scenes, demonstrating significant improvements in visual quality and prompt fidelity over the prior texture generation methods. SceneTex, a novel method for generating high-quality and style-consistent textures for indoor scenes using depth-to-image diffusion priors. Generating realistic and style-consistent textures for 3D scenes is crucial for various applications but remains challenging due to the need for accurate geometry and style cues. SceneTex introduces a multiresolution texture field to represent scene appearance and leverages a cross-attention texture decoder for global style consistency. It optimizes the texture by distilling knowledge from a pre-trained depth-conditioned diffusion prior. SceneTex generates high-quality textures superior to baseline methods based on CLIP score and Inception Score. The method effectively reflects input prompts in the generated textures, as demonstrated by user studies. Ablation studies confirm the importance of the multiresolution texture field and cross-attention decoder for achieving high-quality and style-consistent results. The generated textures sometimes exhibit shading effects, potentially due to the diffusion prior. Future work could explore addressing the shading issue and extending the method to handle more complex lighting conditions. texture synthesis, 3d scenes, diffusion models, cross-attention, style consistency
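The optimization driving the texture field is score distillation from a depth-conditioned diffusion prior. The sketch below shows only the generic SDS mechanics on a rendered latent, with a dummy noise predictor standing in for the depth-conditioned U-Net and a made-up noise schedule; the multiresolution texture field and cross-attention decoder that the paper actually optimizes are omitted.

```python
import torch

def sds_step(latents, noise_pred_fn, alphas_cumprod, guidance_w=1.0):
    """One generic score-distillation update on a rendered latent image.
    noise_pred_fn(noisy_latents, t) stands in for a (depth-conditioned) diffusion U-Net;
    the gradient direction (eps_pred - eps) is injected straight into the latents."""
    t = torch.randint(50, 950, (1,)).item()
    alpha_bar = alphas_cumprod[t]
    eps = torch.randn_like(latents)
    noisy = alpha_bar.sqrt() * latents + (1 - alpha_bar).sqrt() * eps
    with torch.no_grad():                       # no backprop through the diffusion prior
        eps_pred = noise_pred_fn(noisy, t)
    grad = guidance_w * (1 - alpha_bar) * (eps_pred - eps)
    # standard SDS trick: a surrogate loss whose gradient w.r.t. latents equals `grad`
    return (grad * latents).sum()

# toy usage with a dummy noise predictor and an assumed schedule
torch.manual_seed(0)
alphas_cumprod = torch.linspace(0.999, 0.01, 1000)
latents = torch.randn(1, 4, 32, 32, requires_grad=True)
dummy_unet = lambda x, t: 0.5 * x
opt = torch.optim.Adam([latents], lr=1e-2)
loss = sds_step(latents, dummy_unet, alphas_cumprod)
opt.zero_grad(); loss.backward(); opt.step()
print(latents.grad is None)   # False: the gradient reached the rendered latents
```

In the full pipeline the latents would be produced by rendering the texture field from a sampled view, so the same gradient flows back into the texture parameters rather than into a free tensor.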
2311.17245 Report LightGaussian: Unbounded 3D Gaussian Compression with 15x Reduction and 200+ FPS Zhiwen Fan, Kevin Wang, Kairun Wen, Zehao Zhu, Dejia Xu, Zhangyang Wang Recent advancements in real-time neural rendering using point-based techniques have paved the way for the widespread adoption of 3D representations. However, foundational approaches like 3D Gaussian Splatting come with a substantial storage overhead caused by growing the SfM points to millions, often demanding gigabyte-level disk space for a single unbounded scene, posing significant scalability challenges and hindering the splatting efficiency. To address this challenge, we introduce LightGaussian, a novel method designed to transform 3D Gaussians into a more efficient and compact format. Drawing inspiration from the concept of Network Pruning, LightGaussian identifies Gaussians that are insignificant in contributing to the scene reconstruction and adopts a pruning and recovery process, effectively reducing redundancy in Gaussian counts while preserving visual effects. Additionally, LightGaussian employs distillation and pseudo-view augmentation to distill spherical harmonics to a lower degree, allowing knowledge transfer to more compact representations while maintaining reflectance. Furthermore, we propose a hybrid scheme, VecTree Quantization, to quantize all attributes, resulting in lower bitwidth representations with minimal accuracy losses. In summary, LightGaussian achieves an averaged compression rate over 15x while boosting the FPS from 139 to 215, enabling an efficient representation of complex scenes on Mip-NeRF 360, Tank and Temple datasets. Project website: https://lightgaussian.github.io/ LightGaussian, a novel method that compresses 3D Gaussian representations for efficient novel view synthesis, achieving a 15x reduction in size and boosting rendering speed to 200+ FPS. 3D Gaussian Splatting offers high-quality novel view synthesis but suffers from a large storage overhead due to millions of Gaussians. LightGaussian employs 1) Gaussian Pruning & Recovery to eliminate redundant Gaussians based on global significance, 2) SH Distillation with pseudo-view augmentation to transfer high-degree SH information to compact lower-degree representations, and 3) Gaussian Attribute Vector Quantization based on global significance to reduce the representation bit-width. LightGaussian achieves over 15x compression on the Mip-NeRF 360 dataset, reducing storage from 724MB to 42MB. Rendering speed is improved from 119 FPS to 209 FPS. Visual fidelity is maintained with minimal quality loss (SSIM decrease of 0.005). VQ applied to all Gaussian attributes leads to significant accuracy loss. Future work includes exploring zero-shot compression for different 3D Gaussian Splatting frameworks. novel view synthesis, 3d gaussian splatting, model compression, knowledge distillation, vector quantization
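The pruning step can be pictured as "rank every Gaussian by a global significance score and keep the top slice". The sketch below uses a toy score (ray hit count x opacity x volume) and an arbitrary keep ratio; LightGaussian's actual significance definition and the recovery fine-tuning that follows pruning are not reproduced.

```python
import torch

def prune_gaussians(opacity, scales, hit_counts, keep_ratio=0.34):
    """Rank Gaussians by a toy 'global significance' score (how often a Gaussian is hit
    by training rays, weighted by its opacity and volume) and keep only the top fraction.
    The keep_ratio and the score itself are illustrative choices."""
    volume = scales.prod(dim=1)
    significance = hit_counts * opacity.squeeze(-1) * volume
    k = max(1, int(keep_ratio * significance.numel()))
    return torch.topk(significance, k).indices

# toy set of 10k Gaussians
N = 10_000
opacity = torch.rand(N, 1)
scales = torch.rand(N, 3) * 0.05
hit_counts = torch.randint(0, 500, (N,)).float()
keep = prune_gaussians(opacity, scales, hit_counts, keep_ratio=0.34)
print(f"kept {keep.numel()} of {N} Gaussians")
```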
2311.17216 Report Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation Hang Li, Chengzhi Shen, Philip Torr, Volker Tresp, Jindong Gu Diffusion-based models have gained significant popularity for text-to-image generation due to their exceptional image-generation capabilities. A risk with these models is the potential generation of inappropriate content, such as biased or harmful images. However, the underlying reasons for generating such undesired content from the perspective of the diffusion model's internal representation remain unclear. Previous work interprets vectors in an interpretable latent space of diffusion models as semantic concepts. However, existing approaches cannot discover directions for arbitrary concepts, such as those related to inappropriate concepts. In this work, we propose a novel self-supervised approach to find interpretable latent directions for a given concept. With the discovered vectors, we further propose a simple approach to mitigate inappropriate generation. Extensive experiments have been conducted to verify the effectiveness of our mitigation approach, namely, for fair generation, safe generation, and responsible text-enhancing generation. Project page: https://interpretdiffusion.github.io. This paper proposes a self-supervised approach to discover and utilize interpretable latent directions within diffusion models' internal representations for responsible text-to-image generation. Existing methods for interpreting and manipulating diffusion models struggle to discover directions for arbitrary concepts, especially those related to inappropriate content, hindering responsible generation. The authors optimize a concept vector by minimizing the reconstruction loss of images generated with concept-related prompts, forcing the vector to represent the missing concept information. This vector is then added to the model's internal activations during inference to guide responsible generation. The method successfully generates images with balanced representations of societal groups, mitigating bias in professions like doctors. It effectively eliminates harmful content from inappropriate prompts, outperforming existing safety methods. The approach enhances text guidance for responsible prompts, accurately representing concepts like 'no violence' in generated images. The linear manipulation of concepts might not fully capture complex relationships between different attributes. The approach's reliance on synthesized data for concept discovery might not fully represent real-world diversity. diffusion models, responsible ai, text-to-image generation, fairness, safety
2311.17138 Report Shadows Don't Lie and Lines Can't Bend! Generative Models don't know Projective Geometry...for now Ayush Sarkar, Hanlin Mai, Amitabh Mahapatra, Svetlana Lazebnik, D. A. Forsyth, Anand Bhattad Generative models can produce impressively realistic images. This paper demonstrates that generated images have geometric features different from those of real images. We build a set of collections of generated images, prequalified to fool simple, signal-based classifiers into believing they are real. We then show that prequalified generated images can be identified reliably by classifiers that only look at geometric properties. We use three such classifiers. All three classifiers are denied access to image pixels, and look only at derived geometric features. The first classifier looks at the perspective field of the image, the second looks at lines detected in the image, and the third looks at relations between detected objects and shadows. Our procedure detects generated images more reliably than SOTA local signal based detectors, for images from a number of distinct generators. Saliency maps suggest that the classifiers can identify geometric problems reliably. We conclude that current generators cannot reliably reproduce geometric properties of real images. This paper reveals that while AI-generated images are increasingly realistic, they often contain subtle geometric inconsistencies, particularly in perspective and shadow accuracy, which can be used to distinguish them from real images. This research is crucial as it highlights a key limitation in current generative models: their struggle to accurately replicate the rules of projective geometry, a fundamental aspect of realistic image creation. The authors curate a dataset of real and generated images, filtered to remove easily detectable artifacts. They then train three classifiers on geometric features: one analyzing line segments, another examining perspective fields, and the last focusing on object-shadow relationships. None of the classifiers see the image pixels. The classifiers, despite not analyzing image pixels directly, are highly effective at identifying generated images, achieving AUCs ranging from 0.72 to 0.97. The analysis suggests that generative models struggle to maintain consistent vanishing points, leading to inaccuracies in line convergence and perspective distortion. Shadow inconsistencies, such as mismatched directions and lengths relative to objects and light sources, are also reliably detected. The study primarily focuses on indoor and outdoor scenes, potentially limiting the generalizability of findings to other image types. Future work could explore the use of these geometric inconsistencies as feedback mechanisms to improve the realism of future generative models. generative models, projective geometry, image forensics, shadow analysis, perspective analysis
2311.17137 Report Generative Models: What do they know? Do they know things? Let's find out! Xiaodan Du, Nicholas Kolkin, Greg Shakhnarovich, Anand Bhattad Generative models have been shown to be capable of synthesizing highly detailed and realistic images. It is natural to suspect that they implicitly learn to model some image intrinsics such as surface normals, depth, or shadows. In this paper, we present compelling evidence that generative models indeed internally produce high-quality scene intrinsic maps. We introduce Intrinsic LoRA (I-LoRA), a universal, plug-and-play approach that transforms any generative model into a scene intrinsic predictor, capable of extracting intrinsic scene maps directly from the original generator network without needing additional decoders or fully fine-tuning the original network. Our method employs a Low-Rank Adaptation (LoRA) of key feature maps, with newly learned parameters that make up less than 0.6% of the total parameters in the generative model. Optimized with a small set of labeled images, our model-agnostic approach adapts to various generative architectures, including Diffusion models, GANs, and Autoregressive models. We show that the scene intrinsic maps produced by our method compare well with, and in some cases surpass those generated by leading supervised techniques. The paper introduces a universal, plug-and-play approach called Intrinsic LoRA (I-LoRA) that transforms any generative model into a scene intrinsic predictor, enabling the extraction of intrinsic scene maps like normals, depth, albedo, and shading directly from the original generator network. This is important for understanding the knowledge generative models possess, leveraging them for real image understanding, and potentially improving their quality. The methodology uses Low-Rank Adaptation (LoRA) of key feature maps (attention layers, affine layers, or convolutional attention layers depending on the model) to modulate the features and enable the extraction of scene intrinsics. The LoRA modules are trained using a small set of labeled images and pseudo-ground truth generated by state-of-the-art models. All tested generative models (diffusion, GANs, autoregressive) can be adapted to extract scene intrinsic maps using Intrinsic LoRA. High-quality intrinsic extraction is achieved, outperforming SOTA supervised methods in some tasks, using minimal additional parameters (<0.6%) and limited labeled data (as few as 250 images). A correlation exists between the quality of the generated images and the accuracy of extracted intrinsics, suggesting stronger generative models lead to better intrinsic predictions. Multi-step diffusion, while promising, introduces misalignment and color shift issues which need to be addressed. Further work is needed to explore explicitly incorporating scene intrinsics into the learning process of generative models and developing evaluation metrics based on physical properties. generative models, scene intrinsic extraction, low-rank adaptation (lora), diffusion models, gans
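The Low-Rank Adaptation mechanism the row describes can be sketched as a thin low-rank residual added to a frozen layer. The snippet below is a generic LoRA wrapper around an nn.Linear; the choice of layer, rank, and scaling is an assumption for illustration rather than the paper's exact setup.

```python
# A minimal LoRA sketch: low-rank residual weights added to a frozen linear layer.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)              # original weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)           # start as an identity modification
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

frozen = nn.Linear(64, 64)
adapted = LoRALinear(frozen, rank=4)
trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
total = sum(p.numel() for p in adapted.parameters())
# Only the low-rank factors train; at full-model scale this fraction is tiny.
print(f"trainable fraction for this toy layer: {trainable / total:.3f}")
```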
2311.17132 Report TransNeXt: Robust Foveal Visual Perception for Vision Transformers Dai Shi Due to the depth degradation effect in residual connections, many efficient Vision Transformers models that rely on stacking layers for information exchange often fail to form sufficient information mixing, leading to unnatural visual perception. To address this issue, in this paper, we propose Aggregated Attention, a biomimetic design-based token mixer that simulates biological foveal vision and continuous eye movement while enabling each token on the feature map to have a global perception. Furthermore, we incorporate learnable tokens that interact with conventional queries and keys, which further diversifies the generation of affinity matrices beyond merely relying on the similarity between queries and keys. Our approach does not rely on stacking for information exchange, thus effectively avoiding depth degradation and achieving natural visual perception. Additionally, we propose Convolutional GLU, a channel mixer that bridges the gap between GLU and SE mechanism, which empowers each token to have channel attention based on its nearest neighbor image features, enhancing local modeling capability and model robustness. We combine aggregated attention and convolutional GLU to create a new visual backbone called TransNeXt. Extensive experiments demonstrate that our TransNeXt achieves state-of-the-art performance across multiple model sizes. At a resolution of $224^2$, TransNeXt-Tiny attains an ImageNet accuracy of 84.0%, surpassing ConvNeXt-B with 69% fewer parameters. Our TransNeXt-Base achieves an ImageNet accuracy of 86.2% and an ImageNet-A accuracy of 61.6% at a resolution of $384^2$, a COCO object detection mAP of 57.1, and an ADE20K semantic segmentation mIoU of 54.7. This paper introduces TransNeXt, a novel visual backbone network for computer vision tasks, incorporating two key components: Aggregated Attention (a biomimetic token mixer inspired by foveal vision) and Convolutional GLU (a channel mixer with gated channel attention). The authors aim to address limitations in existing Vision Transformer (ViT) models, such as depth degradation effects and unnatural visual perception arising from stacking layers for information exchange. The paper presents Aggregated Attention, which combines dual-path design (fine-grained local and coarse-grained global perception), query embedding, and positional attention to mimic human visual information processing. They also propose Convolutional GLU, which integrates local feature-based channel attention into GLU, improving model robustness. TransNeXt achieves state-of-the-art performance across multiple model sizes on various tasks including image classification, object detection, and semantic segmentation. TransNeXt exhibits superior robustness compared to previous models, particularly on challenging datasets like ImageNet-A. The CUDA implementation significantly accelerates training and inference, showcasing the practical efficiency of the proposed architecture. The model's throughput, while competitive, has room for improvement compared to models utilizing highly optimized dense GPU operators. Further investigation into the potential trade-off associated with query embedding in out-of-distribution test sets is warranted. vision transformer, computer vision, biomimetic design, aggregated attention, convolutional glu
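As a rough illustration of the Convolutional GLU idea (a GLU whose gate is computed from nearest-neighbor image features), the sketch below gates a pointwise value branch with a depthwise-convolved, sigmoid-activated branch. Kernel size, expansion ratio, and layer placement are assumptions and not the paper's exact design.

```python
# Hedged sketch of a "Convolutional GLU" channel mixer: local, per-token gating.
import torch
import torch.nn as nn

class ConvGLU(nn.Module):
    def __init__(self, dim: int, expansion: int = 2):
        super().__init__()
        hidden = dim * expansion
        self.value = nn.Conv2d(dim, hidden, 1)
        self.gate_proj = nn.Conv2d(dim, hidden, 1)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.out = nn.Conv2d(hidden, dim, 1)

    def forward(self, x):                                      # x: (B, C, H, W)
        gate = torch.sigmoid(self.dwconv(self.gate_proj(x)))   # gate from local neighborhood
        return self.out(self.value(x) * gate)

x = torch.randn(1, 64, 14, 14)
print(ConvGLU(64)(x).shape)                                    # torch.Size([1, 64, 14, 14])
```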
2311.17126 Report Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis Xiaohui Chen, Yongfei Liu, Yingxiang Yang, Jianbo Yuan, Quanzeng You, Li-Ping Liu, Hongxia Yang Recent advancements in text-to-image (T2I) generative models have shown remarkable capabilities in producing diverse and imaginative visuals based on text prompts. Despite the advancement, these diffusion models sometimes struggle to translate the semantic content from the text into images entirely. While conditioning on the layout has shown to be effective in improving the compositional ability of T2I diffusion models, they typically require manual layout input. In this work, we introduce a novel approach to improving T2I diffusion models using Large Language Models (LLMs) as layout generators. Our method leverages the Chain-of-Thought prompting of LLMs to interpret text and generate spatially reasonable object layouts. The generated layout is then used to enhance the generated images' composition and spatial accuracy. Moreover, we propose an efficient adapter based on a cross-attention mechanism, which explicitly integrates the layout information into the stable diffusion models. Our experiments demonstrate significant improvements in image quality and layout accuracy, showcasing the potential of LLMs in augmenting generative image models. This paper introduces a novel approach to enhancing text-to-image diffusion models by employing Large Language Models (LLMs) as layout generators, leveraging Chain-of-Thought (CoT) prompting for improved compositionality. Existing text-to-image models often struggle with complex compositions involving multiple objects and spatial relations. This work leverages the reasoning and language understanding capabilities of LLMs to guide image generation with spatially-aware layouts. 1. LLMs are used to generate object layouts (bounding boxes) from text prompts using CoT prompting for improved spatial reasoning. 2. A novel adapter, LACA, is proposed to integrate these layouts into Stable Diffusion models via cross-attention masks, explicitly guiding object placement during image generation. The method significantly improves image quality and layout accuracy compared to baseline Stable Diffusion models. CoT prompting with in-context examples enhances the quality of LLM-generated layouts, resulting in more accurate object placements. The approach exhibits improved generative counting accuracy, accurately depicting the number of objects specified in the text prompt. The reliance on LLM-generated layouts introduces a dependency on the accuracy and robustness of the LLM. Future work can explore fine-tuning LLMs specifically for layout generation and investigate alternative layout representations beyond bounding boxes. text-to-image generation, diffusion models, large language models, chain-of-thought prompting, layout generation
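One concrete piece of this pipeline is turning the LLM-proposed bounding boxes into spatial masks that restrict each phrase's cross-attention to its region. The helper below is a plain-numpy sketch of that step; the mask resolution and how the adapter consumes the masks are assumptions.

```python
# Sketch: LLM-proposed bounding boxes -> per-phrase attention masks over a latent grid.
import numpy as np

def boxes_to_attention_masks(boxes, grid_hw=(64, 64)):
    """boxes: list of (phrase, (x0, y0, x1, y1)) with coordinates normalized to [0, 1]."""
    H, W = grid_hw
    masks = {}
    for phrase, (x0, y0, x1, y1) in boxes:
        m = np.zeros((H, W), dtype=np.float32)
        m[int(y0 * H):int(y1 * H), int(x0 * W):int(x1 * W)] = 1.0  # phrase may attend here
        masks[phrase] = m
    return masks

layout = [("a red apple", (0.1, 0.5, 0.4, 0.9)),
          ("a glass bottle", (0.55, 0.2, 0.9, 0.9))]
masks = boxes_to_attention_masks(layout)
print({k: int(v.sum()) for k, v in masks.items()})  # latent cells each phrase can attend to
```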
2311.17123 Report ConTex-Human: Free-View Rendering of Human from a Single Image with Texture-Consistent Synthesis Xiangjun Gao, Xiaoyu Li, Chaopeng Zhang, Qi Zhang, Yanpei Cao, Ying Shan, Long Quan In this work, we propose a method to address the challenge of rendering a 3D human from a single image in a free-view manner. Some existing approaches could achieve this by using generalizable pixel-aligned implicit fields to reconstruct a textured mesh of a human or by employing a 2D diffusion model as guidance with the Score Distillation Sampling (SDS) method, to lift the 2D image into 3D space. However, a generalizable implicit field often results in an over-smooth texture field, while the SDS method tends to lead to a texture-inconsistent novel view with the input image. In this paper, we introduce a texture-consistent back view synthesis module that could transfer the reference image content to the back view through depth and text-guided attention injection. Moreover, to alleviate the color distortion that occurs in the side region, we propose a visibility-aware patch consistency regularization for texture mapping and refinement combined with the synthesized back view texture. With the above techniques, we could achieve high-fidelity and texture-consistent human rendering from a single image. Experiments conducted on both real and synthetic data demonstrate the effectiveness of our method and show that our approach outperforms previous baseline methods. This paper introduces "ConTex-Human", a novel framework for generating high-fidelity, texture-consistent 3D human models from single images. Free-view human synthesis from single images is crucial for various applications, but existing methods struggle to generate high-fidelity results with consistent textures, especially in unseen areas. The framework uses a three-stage approach: (1) Reconstructing a coarse radiance field using a Zero-1-to-3 diffusion prior. (2) Synthesizing a texture-consistent back view image using a depth and text-guided attention injection module. (3) Refining the geometry and texture using a DMTet mesh and a visibility-aware patch consistency loss, guided by the synthesized back view and reference image. Outperforms baseline methods on THuman2.0 dataset in terms of PSNR, LPIPS, and CLIP metrics, demonstrating better alignment with ground truth. Generates higher-quality and more texture-consistent results than baseline methods on both synthetic and real datasets (THuman2.0 and SSHQ). Ablation studies confirm the importance of the texture-consistent back view synthesis and visibility-aware patch consistency loss for achieving high-quality and consistent texture. Generated geometry can be coarse, especially in hand and foot regions, and struggles to recover from significant errors in the coarse stage. Side and invisible regions, though color-consistent, exhibit lower quality and occasional noise compared to front and back views. 3d human rendering, single image reconstruction, texture consistency, diffusion models, score distillation sampling
2311.17117 Report Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation Li Hu, Xin Gao, Peng Zhang, Ke Sun, Bang Zhang, Liefeng Bo Character Animation aims to generate character videos from still images through driving signals. Currently, diffusion models have become the mainstream in visual generation research, owing to their robust generative capabilities. However, challenges persist in the realm of image-to-video, especially in character animation, where temporally maintaining consistency with detailed information from the character remains a formidable problem. In this paper, we leverage the power of diffusion models and propose a novel framework tailored for character animation. To preserve consistency of intricate appearance features from the reference image, we design ReferenceNet to merge detail features via spatial attention. To ensure controllability and continuity, we introduce an efficient pose guider to direct the character's movements and employ an effective temporal modeling approach to ensure smooth inter-frame transitions between video frames. By expanding the training data, our approach can animate arbitrary characters, yielding superior results in character animation compared to other image-to-video methods. Furthermore, we evaluate our method on benchmarks for fashion video and human dance synthesis, achieving state-of-the-art results. Presents *Animate Anyone*, a novel diffusion model-based framework for character animation that generates animated videos from character images and desired pose sequences while maintaining appearance consistency and temporal stability. Current image-to-video methods struggle to maintain temporal consistency with detailed information from character images, especially in character animation. Leverages Stable Diffusion architecture and incorporates: 1) ReferenceNet with spatial attention to preserve detailed appearance features, 2) a lightweight pose guider for controllable movements, and 3) a temporal layer for smooth inter-frame transitions. Maintains spatial and temporal consistency of character appearance in videos. Produces high-definition videos without temporal jitter or flickering. Achieves state-of-the-art results on fashion video and human dance synthesis benchmarks, outperforming existing image-to-video methods in character animation. May struggle to generate stable hand movements, leading to occasional distortions and motion blur. Generating unseen parts during character movement can be unstable due to limited information from a single-view image. Lower operational efficiency compared to non-diffusion-model-based methods due to DDPM sampling. character animation, diffusion models, image-to-video synthesis, appearance consistency, temporal stability
2311.17095 Report Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models Jiayun Luo, Siddhesh Khandelwal, Leonid Sigal, Boyang Li From image-text pairs, large-scale vision-language models (VLMs) learn to implicitly associate image regions with words, which prove effective for tasks like visual question answering. However, leveraging the learned association for open-vocabulary semantic segmentation remains a challenge. In this paper, we propose a simple, yet extremely effective, training-free technique, Plug-and-Play Open-Vocabulary Semantic Segmentation (PnP-OVSS) for this task. PnP-OVSS leverages a VLM with direct text-to-image cross-attention and an image-text matching loss. To balance between over-segmentation and under-segmentation, we introduce Salience Dropout; by iteratively dropping patches that the model is most attentive to, we are able to better resolve the entire extent of the segmentation mask. PnP-OVSS does not require any neural network training and performs hyperparameter tuning without the need for any segmentation annotations, even for a validation set. PnP-OVSS demonstrates substantial improvements over comparable baselines (+29.4% mIoU on Pascal VOC, +13.2% mIoU on Pascal Context, +14.0% mIoU on MS COCO, and +11.4% mIoU on ADE-20K.) and even outperforms most baselines that conduct additional network training on top of pretrained VLMs. Our codebase is at https://github.com/letitiabanana/PnP-OVSS. This paper introduces PnP-OVSS, a simple and effective training-free technique for open-vocabulary semantic segmentation, leveraging VLMs with cross-attention and image-text matching loss. Bridging the gap between VLMs' ability to associate image regions with words and their application in open-vocabulary semantic segmentation. PnP-OVSS extracts cross-attention maps from a VLM, sharpens them with GradCAM using ITM loss gradients, refines them iteratively with Salience Dropout, and applies Gaussian blur and Dense CRF for final segmentation. PnP-OVSS achieves substantial improvements over training-free baselines (e.g., +29.4% mIoU on Pascal VOC). It outperforms most methods requiring finetuning but not using image-text pairs (e.g., +13.7% mIoU on Pascal VOC). PnP-OVSS even surpasses several techniques requiring finetuning on image-text pairs, particularly on datasets with more classes per image. The performance of PnP-OVSS heavily relies on the choice of cross-attention layers and heads. PnP-OVSS struggles with images containing multiple small object instances or a clutter of different objects. open-vocabulary semantic segmentation, vision-language models, zero-shot learning, cross-attention, gradcam
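The Salience Dropout loop described above can be sketched schematically: drop the most-attended patches, re-query attention over what remains, and accumulate the union as the final mask. The attention function below is a random stand-in for the VLM's cross-attention, and the number of rounds and drop fraction are assumptions.

```python
# Schematic numpy version of Salience Dropout with a stand-in attention function.
import numpy as np

def salience_dropout(attn_fn, num_patches, rounds=3, drop_frac=0.2):
    active = np.ones(num_patches, dtype=bool)
    accumulated = np.zeros(num_patches, dtype=np.float32)
    for _ in range(rounds):
        attn = attn_fn(active)                          # attention over active patches only
        accumulated = np.maximum(accumulated, attn)     # union of evidence across rounds
        k = max(1, int(drop_frac * active.sum()))
        top = np.argsort(np.where(active, attn, -np.inf))[-k:]
        active[top] = False                             # drop the most-attended patches
    return accumulated

rng = np.random.default_rng(0)
fake_attn = lambda active: rng.random(64) * active      # placeholder for VLM cross-attention
print(salience_dropout(fake_attn, 64).round(2)[:8])
```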
2311.17092 Report SEED-Bench-2: Benchmarking Multimodal Large Language Models Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, Ying Shan Multimodal large language models (MLLMs), building upon the foundation of powerful large language models (LLMs), have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs (acting like a combination of GPT-4V and DALL-E 3). However, existing MLLM benchmarks remain limited to assessing only models' comprehension ability of single image-text inputs, failing to keep up with the strides made in MLLMs. A comprehensive benchmark is imperative for investigating the progress and uncovering the limitations of current MLLMs. In this work, we categorize the capabilities of MLLMs into hierarchical levels from $L_0$ to $L_4$ based on the modalities they can accept and generate, and propose SEED-Bench-2, a comprehensive benchmark that evaluates the hierarchical capabilities of MLLMs. Specifically, SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, which spans 27 dimensions, including the evaluation of both text and image generation. Multiple-choice questions with groundtruth options derived from human annotation enables an objective and efficient assessment of model performance, eliminating the need for human or GPT intervention during evaluation. We further evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations. By revealing the limitations of existing MLLMs through extensive evaluations, we aim for SEED-Bench-2 to provide insights that will motivate future research towards the goal of General Artificial Intelligence. Dataset and evaluation code are available at https://github.com/AILab-CVC/SEED-Bench. This paper presents SEED-Bench-2, a comprehensive benchmark designed to evaluate the hierarchical capabilities of Multimodal Large Language Models (MLLMs) up to L3, including their ability to generate both text and images from interleaved image-text inputs. Existing MLLM benchmarks primarily focus on single image-text comprehension and fail to showcase the full range of MLLM capabilities, hindering progress in the field. A comprehensive benchmark is crucial for effectively evaluating and advancing MLLMs towards general artificial intelligence. The authors categorize MLLM capabilities into hierarchical levels (L0-L4) and construct SEED-Bench-2 with 24K multiple-choice questions spanning 27 evaluation dimensions. The benchmark utilizes a sophisticated pipeline with foundation models, adapts existing datasets, and incorporates human-designed questions to ensure diversity and quality. Multiple-choice format enables objective evaluation using accuracy. Existing MLLMs have not yet reached the ceiling level of capability L1 for fixed-form image and text comprehension, with the top model achieving only 60% accuracy. MLLMs struggle with comprehending free-form interleaved image-text inputs (L2) more than fixed-format inputs, likely due to training data limitations. Only a few MLLMs have reached capability L3 (image and text generation), highlighting the need for more research in this area. Not all MLLMs with image generation capabilities utilize visual autoregression, limiting the evaluation strategy for image output. Future work includes incorporating evaluations for capability level L4 (open-form interleaved image-text input and output) and expanding evaluation dimensions. multimodal large language models, benchmarking, multimodal comprehension, image generation, artificial intelligence
2311.17091 Report Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models Zhihe Lu, Jiawang Bai, Xin Li, Zeyu Xiao, Xinchao Wang Fine-tuning pre-trained vision-language models (VLMs), e.g., CLIP, for the open-world generalization has gained increasing popularity due to its practical value. However, performance advancements are limited when relying solely on intricate algorithmic designs for a single model, even one exhibiting strong performance, e.g., CLIP-ViT-B/16. This paper, for the first time, explores the collaborative potential of leveraging much weaker VLMs to enhance the generalization of a robust single model. The affirmative findings motivate us to address the generalization problem from a novel perspective, i.e., ensemble of pre-trained VLMs. We introduce three customized ensemble strategies, each tailored to one specific scenario. Firstly, we introduce the zero-shot ensemble, automatically adjusting the logits of different models based on their confidence when only pre-trained VLMs are available. Furthermore, for scenarios with extra few-shot samples, we propose the training-free and tuning ensemble, offering flexibility based on the availability of computing resources. The proposed ensemble strategies are evaluated on zero-shot, base-to-new, and cross-dataset generalization, achieving new state-of-the-art performance. Notably, this work represents an initial stride toward enhancing the generalization performance of VLMs via ensemble. The code is available at https://github.com/zhiheLu/Ensemble_VLM.git. This paper explores ensemble learning to improve the open-world generalization of pre-trained vision-language models (VLMs), proposing three strategies: zero-shot ensemble, training-free ensemble, and tuning ensemble. Existing methods relying on a single VLM, even when powerful, have reached performance saturation in generalization tasks. This paper shows that leveraging multiple VLMs, even weaker ones, can significantly enhance performance. The authors introduce three ensemble strategies: (1) **Zero-shot ensemble:** Assigns confidence-aware weights to VLMs based on their prediction confidences. (2) **Training-free ensemble:** Uses a greedy search to find optimal weights on a small training set. (3) **Tuning ensemble:** Trains a sample-aware weight generator on a training set to dynamically generate weights for test samples. Zero-shot ensemble achieves an average accuracy gain of 2.61% across 11 diverse datasets. Tuning ensemble achieves state-of-the-art performance on base-to-new and cross-dataset generalization benchmarks. The paper demonstrates the effectiveness of 'weak helps strong' phenomenon, where weaker VLMs contribute significantly to the ensemble's performance. The explored ensemble strategies only scratch the surface of ensemble learning potential for VLM generalization, leaving room for further investigation. The current study primarily focuses on image classification tasks, future work could explore its applicability in other downstream tasks. vision-language models, ensemble learning, open-world generalization, zero-shot learning, few-shot learning
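The zero-shot ensemble strategy described above can be sketched as confidence-weighted logit averaging across models. The snippet below uses the maximum softmax probability as the per-sample confidence, which is an assumption; the paper's exact confidence measure and any temperature scaling may differ.

```python
# Sketch of a confidence-aware zero-shot ensemble over several VLMs' logits.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def confidence_ensemble(logits_per_model):
    """logits_per_model: list of (num_samples, num_classes) arrays from different VLMs."""
    weighted, weights = [], []
    for logits in logits_per_model:
        conf = softmax(logits).max(axis=-1, keepdims=True)   # per-sample confidence
        weighted.append(conf * logits)
        weights.append(conf)
    return sum(weighted) / sum(weights)                      # confidence-weighted average

strong = np.array([[2.0, 0.1, 0.1]])    # e.g. a strong backbone such as ViT-B/16
weak = np.array([[0.4, 0.3, 0.2]])      # e.g. a much weaker backbone
print(confidence_ensemble([strong, weak]).round(3))
```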
2311.17089 Report Multi-Scale 3D Gaussian Splatting for Anti-Aliased Rendering Zhiwen Yan, Weng Fei Low, Yu Chen, Gim Hee Lee 3D Gaussians have recently emerged as a highly efficient representation for 3D reconstruction and rendering. Despite their high rendering quality and speed at high resolutions, both deteriorate drastically when rendered at lower resolutions or from far-away camera positions. During low-resolution or far-away rendering, the pixel size of the image can fall below the Nyquist frequency relative to the screen size of each splatted 3D Gaussian, leading to aliasing effects. The rendering is also drastically slowed down by the sequential alpha blending of more splatted Gaussians per pixel. To address these issues, we propose a multi-scale 3D Gaussian splatting algorithm, which maintains Gaussians at different scales to represent the same scene. Higher-resolution images are rendered with more small Gaussians, and lower-resolution images are rendered with fewer larger Gaussians. With similar training time, our algorithm can achieve 13%-66% PSNR and 160%-2400% rendering speed improvement at 4$\times$-128$\times$ scale rendering on the Mip-NeRF360 dataset compared to single-scale 3D Gaussian splatting. This paper introduces a multi-scale 3D Gaussian splatting algorithm for novel view synthesis that enhances rendering quality and speed at low resolutions or when viewed from a distance. Existing 3D Gaussian splatting methods suffer from severe aliasing and slow rendering speeds at low resolutions, limiting their use in large-scale scenes. The algorithm utilizes multi-scale 3D Gaussians to represent the scene at varying levels of detail. Small Gaussians are aggregated into larger ones for coarser representations. During rendering, Gaussians are selectively chosen based on their 'pixel coverage', ensuring appropriate level of detail for the given resolution. The method achieves 13%-66% PSNR and 160%-2400% rendering speed improvement at 4x-128x downsampled scales on the Mip-NeRF360 dataset. It maintains comparable rendering quality and speed to single-scale methods at the original resolution. Qualitative comparisons demonstrate significant reduction in aliasing artifacts and improved visual fidelity at low resolutions. Gaussian filtering based on 'pixel coverage' requires splatting all Gaussians before filtering, introducing overhead at very low resolutions. Future work will explore lightweight criteria for filtering Gaussians before splatting for further speed enhancements. 3d gaussian splatting, novel view synthesis, anti-aliasing, multi-scale representation, computer graphics
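The 'pixel coverage' selection criterion mentioned above can be illustrated with a small sketch: keep a Gaussian for rendering only if its projected screen-space extent spans at least a minimum number of pixels at the target resolution. The pinhole projection and threshold below are simplifying assumptions, not the paper's exact formulation.

```python
# Sketch: filter Gaussians by projected pixel coverage at a given rendering resolution.
import numpy as np

def select_by_pixel_coverage(world_radii, depths, focal_px, min_coverage_px=1.0):
    """Return a boolean mask of Gaussians whose projected radius covers enough pixels."""
    projected_px = focal_px * world_radii / depths      # simple pinhole projection of radius
    return projected_px >= min_coverage_px

radii = np.array([0.01, 0.05, 0.30])                    # world-space Gaussian extents
depths = np.array([2.0, 2.0, 2.0])
full_res = select_by_pixel_coverage(radii, depths, focal_px=1000.0)
low_res = select_by_pixel_coverage(radii, depths, focal_px=1000.0 / 8)  # 8x downsampled view
print(full_res, low_res)   # small Gaussians drop out at low resolution; larger ones remain
```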
2311.17086 Report PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in non-English Text-to-Image Generation Jian Ma, Chen Chen, Qingsong Xie, Haonan Lu Text-to-image diffusion models are well-known for their ability to generate realistic images based on textual prompts. However, the existing works have predominantly focused on English, lacking support for non-English text-to-image models. The most commonly used translation methods cannot solve the generation problem related to language culture, while training from scratch on a specific language dataset is prohibitively expensive. In this paper, we are inspired to propose a simple plug-and-play language transfer method based on knowledge distillation. All we need to do is train a lightweight MLP-like parameter-efficient adapter (PEA) with only 6M parameters under teacher knowledge distillation along with a small parallel data corpus. We are surprised to find that freezing the parameters of UNet can still achieve remarkable performance on the language-specific prompt evaluation set, demonstrating that PEA can stimulate the potential generation ability of the original UNet. Additionally, it closely approaches the performance of the English text-to-image model on a general prompt evaluation set. Furthermore, our adapter can be used as a plugin to achieve significant results in downstream tasks in cross-lingual text-to-image generation. Code will be available at: https://github.com/OPPO-Mente-Lab/PEA-Diffusion This paper presents PEA-Diffusion, a plug-and-play adapter with knowledge distillation for parameter-efficient adaptation of English text-to-image diffusion models to other languages. Existing solutions for non-English text-to-image generation are costly (training from scratch) or struggle with capturing culture-specific concepts. PEA-Diffusion uses a lightweight MLP adapter trained with knowledge distillation from a pre-trained English Stable Diffusion model. It aligns representation spaces between the new (non-English) text encoder and the frozen image generator, requiring only a small amount of parallel data. PEA-Diffusion effectively captures cultural nuances, outperforming translation, multilingual models, and direct fine-tuning on language-specific prompts. The method retains strong general image generation abilities, achieving comparable results to the original English model on general prompts. The plug-and-play adapter seamlessly integrates with downstream applications like LoRA, ControlNet, Inpainting, etc., facilitating cross-lingual adaptation of the English SD ecosystem. Performance relies on the quality and representational power of the language-specific CLIP text encoder. The approach is bounded by the capabilities of the base English model, unable to surpass its general synthesis limits. text-to-image generation, cross-lingual transfer, knowledge distillation, parameter-efficient adaptation, diffusion models
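The adapter-plus-distillation idea described above can be sketched as a small MLP that maps the non-English text encoder's features toward the frozen English teacher's feature space, trained with a feature-matching loss on a parallel corpus. The encoders below are random stand-ins and the dimensions, loss, and training schedule are assumptions.

```python
# Sketch of a parameter-efficient adapter trained by feature-level knowledge distillation.
import torch
import torch.nn as nn

class PEAdapter(nn.Module):
    def __init__(self, dim_in: int, dim_out: int, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_in, hidden), nn.GELU(),
                                 nn.Linear(hidden, dim_out))

    def forward(self, x):
        return self.net(x)

# Stand-ins for frozen text encoders (real ones would be CLIP-style encoders).
teacher_en = nn.Linear(512, 768).eval()
student_xx = nn.Linear(512, 640).eval()
for p in list(teacher_en.parameters()) + list(student_xx.parameters()):
    p.requires_grad_(False)

adapter = PEAdapter(640, 768)
opt = torch.optim.Adam(adapter.parameters(), lr=1e-4)

for step in range(100):
    pair = torch.randn(32, 512)                   # stands in for a batch of parallel captions
    with torch.no_grad():
        t = teacher_en(pair)                      # English teacher features
        s = student_xx(pair)                      # non-English student features
    loss = nn.functional.mse_loss(adapter(s), t)  # distillation: align the two spaces
    opt.zero_grad(); loss.backward(); opt.step()
```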
2311.17083 Report CLiC: Concept Learning in Context Mehdi Safaee, Aryan Mikaeili, Or Patashnik, Daniel Cohen-Or, Ali Mahdavi-Amiri This paper addresses the challenge of learning a local visual pattern of an object from one image, and generating images depicting objects with that pattern. Learning a localized concept and placing it on an object in a target image is a nontrivial task, as the objects may have different orientations and shapes. Our approach builds upon recent advancements in visual concept learning. It involves acquiring a visual concept (e.g., an ornament) from a source image and subsequently applying it to an object (e.g., a chair) in a target image. Our key idea is to perform in-context concept learning, acquiring the local visual concept within the broader context of the objects they belong to. To localize the concept learning, we employ soft masks that contain both the concept within the mask and the surrounding image area. We demonstrate our approach through object generation within an image, showcasing plausible embedding of in-context learned concepts. We also introduce methods for directing acquired concepts to specific locations within target images, employing cross-attention mechanisms, and establishing correspondences between source and target objects. The effectiveness of our method is demonstrated through quantitative and qualitative experiments, along with comparisons against baseline techniques. This paper proposes a method for learning local visual patterns from a single image and transferring them to other objects or generating new objects with the learned pattern, all while preserving the context of the pattern within the object. Existing image editing methods struggle to effectively transfer local patterns while maintaining their context and relationship to the object they belong to. This method aims to address this challenge by learning patterns in the context of their surrounding object. The method utilizes a diffusion model with in-context concept learning. It learns a token representing the pattern by optimizing multiple loss functions that encourage the model to focus on the pattern region, learn it in the context of the object, and avoid overfitting to the specific instance in the source image. For transfer, it uses masked blended diffusion editing and cross-attention guidance. The method can successfully transfer various local patterns, such as ornaments, to objects of the same or different classes. It enables the generation of new objects that incorporate the learned pattern in a contextually relevant manner. Quantitative and qualitative comparisons demonstrate the superiority of the proposed method over existing personalization methods like Custom Diffusion, Break-A-Scene, and RealFill. The method's performance may deteriorate when the source and target images have significant domain differences. The optimization process, while effective, is time-consuming and not suitable for real-time applications. concept learning, image editing, image generation, diffusion models, pattern transfer
2311.17082 Report DreamPropeller: Supercharge Text-to-3D Generation with Parallel Sampling Linqi Zhou, Andy Shih, Chenlin Meng, Stefano Ermon Recent methods such as Score Distillation Sampling (SDS) and Variational Score Distillation (VSD) using 2D diffusion models for text-to-3D generation have demonstrated impressive generation quality. However, the long generation time of such algorithms significantly degrades the user experience. To tackle this problem, we propose DreamPropeller, a drop-in acceleration algorithm that can be wrapped around any existing text-to-3D generation pipeline based on score distillation. Our framework generalizes Picard iterations, a classical algorithm for parallel sampling an ODE path, and can account for non-ODE paths such as momentum-based gradient updates and changes in dimensions during the optimization process as in many cases of 3D generation. We show that our algorithm trades parallel compute for wallclock time and empirically achieves up to 4.7x speedup with a negligible drop in generation quality for all tested frameworks. This paper introduces DreamPropeller, a general acceleration algorithm applicable to any text-to-3D generation pipeline using score distillation. Current text-to-3D methods based on score distillation, while producing impressive results, suffer from prohibitively long generation times, hindering their practical use. The method leverages parallel computation by generalizing Picard iterations, a classic technique for parallel sampling of ODE paths, to accommodate the complexities of 3D generation, such as changing parameter dimensions and momentum-based gradient updates. DreamPropeller consistently achieves more than a 4x speedup across various 3D representations and score distillation frameworks. The algorithm's performance improves with higher computational demands per iteration, making it especially beneficial for complex methods like ProlificDreamer. DreamPropeller maintains high generation quality, comparable to the original non-parallelized methods. The current implementation relies on fixed random seeds to eliminate stochasticity during parallel computation, which might limit the exploration of diverse solutions. Further investigation is needed to explore more efficient strategies for handling LoRA model updates in VSD, potentially through asynchronous parameter sharing or gradient compression. text-to-3d generation, score distillation, parallel computation, picard iteration, 3d gaussian splatting
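The Picard-iteration idea that DreamPropeller generalizes can be shown with a plain-numpy example: all timesteps of an update path are refined in parallel until they match what sequential updates would produce. The toy step function below is an illustrative stand-in for a gradient or ODE step, not the paper's score-distillation update.

```python
# Plain-numpy illustration of Picard iteration for parallel path sampling.
import numpy as np

def step(x, t):
    """One 'sequential' update (stand-in for a gradient/ODE step)."""
    return x + 0.1 * np.cos(t) * x

def sequential(x0, T):
    xs = [x0]
    for t in range(T):
        xs.append(step(xs[-1], t))
    return np.array(xs)

def picard_parallel(x0, T, iters=50):
    xs = np.repeat(x0[None], T + 1, axis=0)     # initial guess for the whole path
    for _ in range(iters):
        new = xs.copy()
        for t in range(T):                      # each step depends only on the previous
            new[t + 1] = step(xs[t], t)         # *iterate*, so all T steps can run in
        if np.allclose(new, xs):                # parallel on real hardware
            break
        xs = new
    return xs

x0 = np.array([1.0, 2.0])
print(np.abs(picard_parallel(x0, 20) - sequential(x0, 20)).max())  # ~0 after convergence
```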
2311.17076 Report Compositional Chain-of-Thought Prompting for Large Multimodal Models Chancharik Mitra, Brandon Huang, Trevor Darrell, Roei Herzig The combination of strong visual backbones and Large Language Model (LLM) reasoning has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range of vision and language (VL) tasks. However, recent research has shown that even the most advanced LMMs still struggle to capture aspects of compositional visual reasoning, such as attributes and relationships between objects. One solution is to utilize scene graphs (SGs)--a formalization of objects and their relations and attributes that has been extensively used as a bridge between the visual and textual domains. Yet, scene graph data requires scene graph annotations, which are expensive to collect and thus not easily scalable. Moreover, finetuning an LMM based on SG data can lead to catastrophic forgetting of the pretraining objective. To overcome this, inspired by chain-of-thought methods, we propose Compositional Chain-of-Thought (CCoT), a novel zero-shot Chain-of-Thought prompting method that utilizes SG representations in order to extract compositional knowledge from an LMM. Specifically, we first generate an SG using the LMM, and then use that SG in the prompt to produce a response. Through extensive experiments, we find that the proposed CCoT approach not only improves LMM performance on several vision and language VL compositional benchmarks but also improves the performance of several popular LMMs on general multimodal benchmarks, without the need for fine-tuning or annotated ground-truth SGs. Code: https://github.com/chancharikmitra/CCoT This paper proposes CCoT, a novel zero-shot Chain-of-Thought prompting method for Large Multimodal Models (LMMs) to improve compositional visual reasoning by leveraging scene graph (SG) representations. Existing LMMs struggle to capture compositional aspects of visual scenes, often treating them as a 'bag of objects'. CCoT aims to address this by incorporating structured scene graph information into the reasoning process. CCoT employs a two-step process: 1) Scene Graph Generation: An LMM is prompted to generate a scene graph relevant to the input image and task. 2) Response Generation: The LMM is prompted with the image, task prompt, and the *generated* scene graph to produce a response, leveraging the compositional information. CCoT significantly improves performance on compositional visual reasoning benchmarks like Winoground and WHOOPS!. It also enhances performance on general multimodal benchmarks like SEEDBench, MMBench, and LLaVA-Bench In-the-Wild. Ablations confirm the importance of structured SGs over captions, JSON formatting, and optimal SG length for improved reasoning. The method's performance is limited by the context length of current LLM backbones used in LMMs. Scene graphs might not be suitable for multimodal tasks with a stronger emphasis on language over visual reasoning. large multimodal models, compositional reasoning, scene graphs, chain-of-thought prompting, zero-shot learning
2311.17043 Report LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models Yanwei Li, Chengyao Wang, Jiaya Jia In this work, we present a novel method to tackle the token generation challenge in Vision Language Models (VLMs) for video and image understanding, called LLaMA-VID. Current VLMs, while proficient in tasks like image captioning and visual question answering, face computational burdens when processing long videos due to the excessive visual tokens. LLaMA-VID addresses this issue by representing each frame with two distinct tokens, namely context token and content token. The context token encodes the overall image context based on user input, whereas the content token encapsulates visual cues in each frame. This dual-token strategy significantly reduces the overload of long videos while preserving critical information. Generally, LLaMA-VID empowers existing frameworks to support hour-long videos and pushes their upper limit with an extra context token. It is proven to surpass previous methods on most video- or image-based benchmarks. Code is available at https://github.com/dvlab-research/LLaMA-VID. LLaMA-VID is a novel method addressing the token generation challenge in Vision Language Models (VLMs) for video and image understanding. Existing VLMs struggle to process long videos due to the computational burden of excessive visual tokens from consecutive frames. LLaMA-VID represents each frame with two tokens: a context token encoding overall image context based on user input, and a content token encapsulating frame-specific visual cues. This reduces token overload while preserving vital information. LLaMA-VID enables existing VLMs to support hour-long videos. It achieves state-of-the-art results on multiple video and image understanding benchmarks. The method is computationally efficient, completing training in 2 days on a single machine with 8 A100 GPUs. Performance slightly decreases when the content token is significantly compressed (e.g., to 1 token/frame). Future work involves exploring dynamic token compression based on content importance and resource availability. vision language models, video understanding, token generation, long video processing, instruction tuning
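The two-token representation described above can be sketched as follows: a context token summarizes the frame with respect to the text query via attention pooling, and a content token compresses the frame's own visual features via average pooling. The pooling choices, dimensions, and absence of learned projections are simplifying assumptions for illustration.

```python
# Sketch: compress each video frame into a query-conditioned context token
# plus a query-agnostic content token.
import torch
import torch.nn.functional as F

def frame_to_two_tokens(frame_feats, query_feat):
    """frame_feats: (num_patches, d) visual features; query_feat: (d,) text embedding."""
    attn = F.softmax(frame_feats @ query_feat / frame_feats.shape[-1] ** 0.5, dim=0)
    context_token = attn @ frame_feats           # query-conditioned summary of the frame
    content_token = frame_feats.mean(dim=0)      # query-agnostic visual gist
    return torch.stack([context_token, content_token])   # (2, d) per frame

video = torch.randn(60, 256, 768)                # a minute of frames at 1 fps (toy numbers)
query = torch.randn(768)
tokens = torch.stack([frame_to_two_tokens(f, query) for f in video])
print(tokens.shape)                              # torch.Size([60, 2, 768]): 2 tokens per frame
```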
2311.17009 Report Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer Danah Yatim, Rafail Fridman, Omer Bar-Tal, Yoni Kasten, Tali Dekel We present a new method for text-driven motion transfer - synthesizing a video that complies with an input text prompt describing the target objects and scene while maintaining an input video's motion and scene layout. Prior methods are confined to transferring motion across two subjects within the same or closely related object categories and are applicable for limited domains (e.g., humans). In this work, we consider a significantly more challenging setting in which the target and source objects differ drastically in shape and fine-grained motion characteristics (e.g., translating a jumping dog into a dolphin). To this end, we leverage a pre-trained and fixed text-to-video diffusion model, which provides us with generative and motion priors. The pillar of our method is a new space-time feature loss derived directly from the model. This loss guides the generation process to preserve the overall motion of the input video while complying with the target object in terms of shape and fine-grained motion traits. This paper introduces a zero-shot method for text-driven motion transfer, enabling the transfer of motion from a source video to a target object specified by a text prompt, even when the source and target objects have significant differences in shape and motion characteristics. This approach addresses limitations of existing motion transfer techniques that struggle with significant structural deviations between source and target objects, particularly those relying on explicit pose estimation or similar object categories. The method leverages the generative motion priors of a pre-trained text-to-video diffusion model, guiding the generation process using a novel loss function based on pairwise differences of spatial marginal mean features extracted from the model. The method successfully transfers motion while accommodating significant shape variations and generating plausible scene elements. Quantitative evaluation demonstrates superior performance in preserving motion fidelity and achieving high edit fidelity compared to existing text-driven video editing methods. User studies confirm the method's effectiveness, with participants consistently preferring its results over baselines. Performance is limited by the pre-trained text-to-video model's ability to handle out-of-distribution motion and object combinations. Current text-to-video models have limitations in quality, resolution, and video length, restricting the applicability of the method. motion transfer, text-driven video editing, diffusion models, space-time feature analysis, generative motion priors
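A space-time feature loss in the spirit described above can be sketched by pooling each frame's diffusion features into a compact descriptor and matching the pairwise differences of these descriptors between the source and generated videos, so that motion rather than appearance is preserved. The exact pooling and feature choice in the paper may differ; this is an illustrative assumption.

```python
# Sketch of a pairwise-difference feature loss over spatially pooled per-frame features.
import torch

def spatial_marginal_mean(feats):
    """feats: (T, C, H, W) per-frame diffusion features -> (T, C) per-frame descriptors."""
    return feats.mean(dim=(2, 3))

def pairwise_diff_loss(src_feats, gen_feats):
    s = spatial_marginal_mean(src_feats)
    g = spatial_marginal_mean(gen_feats)
    # Differences between all frame pairs encode motion rather than static appearance.
    s_diff = s[:, None, :] - s[None, :, :]
    g_diff = g[:, None, :] - g[None, :, :]
    return ((s_diff - g_diff) ** 2).mean()

src = torch.randn(8, 64, 32, 32)
gen = torch.randn(8, 64, 32, 32, requires_grad=True)
loss = pairwise_diff_loss(src, gen)
loss.backward()                      # this gradient would guide the generation process
print(float(loss))
```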
2311.17002 Report Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following Yutong Feng, Biao Gong, Di Chen, Yujun Shen, Yu Liu, Jingren Zhou Existing text-to-image (T2I) diffusion models usually struggle in interpreting complex prompts, especially those with quantity, object-attribute binding, and multi-subject descriptions. In this work, we introduce a semantic panel as the middleware in decoding texts to images, supporting the generator to better follow instructions. The panel is obtained through arranging the visual concepts parsed from the input text by the aid of large language models, and then injected into the denoising network as a detailed control signal to complement the text condition. To facilitate text-to-panel learning, we come up with a carefully designed semantic formatting protocol, accompanied by a fully-automatic data preparation pipeline. Thanks to such a design, our approach, which we call Ranni, manages to enhance a pre-trained T2I generator regarding its textual controllability. More importantly, the introduction of the generative middleware brings a more convenient form of interaction (i.e., directly adjusting the elements in the panel or using language instructions) and further allows users to finely customize their generation, based on which we develop a practical system and showcase its potential in continuous generation and chatting-based editing. Our project page is at https://ranni-t2i.github.io/Ranni. Introduces Ranni, an improved text-to-image generation framework that uses a 'semantic panel' as a middleware to enhance the accuracy and controllability of image generation from text prompts. Addresses the limitations of existing text-to-image models in interpreting complex prompts, particularly those involving quantity, attribute binding, and multi-subject descriptions. Utilizes large language models (LLMs) to parse text prompts into visual concepts, which are then arranged into a structured semantic panel. This panel serves as a detailed control signal for a diffusion-based image generation model, enabling more precise control over image content and attributes. Demonstrates superior performance in following complex prompts, particularly those involving quantity, spatial relationships, and attribute binding. Enables interactive image editing through direct manipulation of the semantic panel, allowing for intuitive modifications to object attributes, positions, and relationships. Explores the potential of LLM-powered chatting-based editing, enabling users to refine images through natural language instructions. The text-to-panel stage, relying on LLMs, can sometimes produce inaccurate or overlapping object placements, leading to generation errors. The panel-to-image generation, while more controllable, still exhibits some degree of robustness, sometimes rectifying improper layouts from the text-to-panel stage, which might not always align with user intent. text-to-image synthesis, diffusion models, large language models, semantic panel, interactive image editing
2311.16974 Report COLE: A Hierarchical Generation Framework for Multi-Layered and Editable Graphic Design Peidong Jia, Chenxuan Li, Yuhui Yuan, Zeyu Liu, Yichao Shen, Bohan Chen, Xingru Chen, Yinglin Zheng, Dong Chen, Ji Li, Xiaodong Xie, Shanghang Zhang, Baining Guo Graphic design, which has been evolving since the 15th century, plays a crucial role in advertising. The creation of high-quality designs demands design-oriented planning, reasoning, and layer-wise generation. Unlike the recent CanvaGPT, which integrates GPT-4 with existing design templates to build a custom GPT, this paper introduces the COLE system - a hierarchical generation framework designed to comprehensively address these challenges. This COLE system can transform a vague intention prompt into a high-quality multi-layered graphic design, while also supporting flexible editing based on user input. Examples of such input might include directives like "design a poster for Hisaishi's concert." The key insight is to dissect the complex task of text-to-design generation into a hierarchy of simpler sub-tasks, each addressed by specialized models working collaboratively. The results from these models are then consolidated to produce a cohesive final output. Our hierarchical task decomposition can streamline the complex process and significantly enhance generation reliability. Our COLE system comprises multiple fine-tuned Large Language Models (LLMs), Large Multimodal Models (LMMs), and Diffusion Models (DMs), each specifically tailored for design-aware layer-wise captioning, layout planning, reasoning, and the task of generating images and text. Furthermore, we construct the DESIGNINTENTION benchmark to demonstrate the superiority of our COLE system over existing methods in generating high-quality graphic designs from user intent. Last, we present a Canva-like multi-layered image editing tool to support flexible editing of the generated multi-layered graphic design images. We perceive our COLE system as an important step towards addressing more complex and multi-layered graphic design generation tasks in the future. This paper presents a novel hierarchical generation framework named COLE, designed to simplify the complex process of graphic design generation. COLE leverages the power of large multimodal models (LMMs), large language models (LLMs), and diffusion models to decompose the task into manageable, coordinated sub-tasks. Existing text-to-image models often struggle with generating high-quality, editable graphic designs from simple user intentions. COLE addresses these limitations by enabling design-oriented planning, reasoning, and layer-wise generation, resulting in multi-layered and editable graphic designs. COLE employs a hierarchical approach, utilizing specialized models for each sub-task: 1) Design-LLM translates user intentions into structured JSON files. 2) Cascaded diffusion models generate background and object layers with visual planning and reasoning. 3) Typography-LMM predicts typography attributes based on visual content. 4) Multi-layered SVG editor allows for user editing. COLE demonstrates competitive performance against state-of-the-art image generators like DALL-E and CanvaGPT, achieving superior results in design layout, typography, and innovation according to GPT4-V evaluation. The hierarchical task decomposition and specialized models in COLE facilitate the generation of high-quality graphic design images that are both editable and aligned with user intentions. The proposed COLE system exhibits strong generalization capability in layout planning and typography attribute reasoning, as evidenced by its performance on the Crello text box placement task. The system exhibits limitations in typography block arrangement, the variety of editable visual elements, and typography color selection. Future work will focus on addressing these limitations and exploring the generation of more complex and diverse graphic designs. graphic design generation, hierarchical generation, large language models, diffusion models, typography
2311.16973 Report DemoFusion: Democratising High-Resolution Image Generation With No $$$ Ruoyi Du, Dongliang Chang, Timothy Hospedales, Yi-Zhe Song, Zhanyu Ma High-resolution image generation with Generative Artificial Intelligence (GenAI) has immense potential but, due to the enormous capital investment required for training, it is increasingly centralised to a few large corporations, and hidden behind paywalls. This paper aims to democratise high-resolution GenAI by advancing the frontier of high-resolution generation while remaining accessible to a broad audience. We demonstrate that existing Latent Diffusion Models (LDMs) possess untapped potential for higher-resolution image generation. Our novel DemoFusion framework seamlessly extends open-source GenAI models, employing Progressive Upscaling, Skip Residual, and Dilated Sampling mechanisms to achieve higher-resolution image generation. The progressive nature of DemoFusion requires more passes, but the intermediate results can serve as "previews", facilitating rapid prompt iteration. DemoFusion is a novel framework that extends open-source latent diffusion models (LDMs) for high-resolution image generation without requiring additional training or excessive memory resources. High-resolution image generation with GenAI is becoming increasingly centralized and commercialized. DemoFusion aims to democratize this technology by enabling users with modest hardware to generate high-resolution images using existing open-source models. DemoFusion builds upon the MultiDiffusion framework and introduces three key mechanisms: (i) Progressive Upscaling: Generates images iteratively from low to high resolutions, refining details in each phase. (ii) Skip Residual: Enhances global consistency by integrating noise-inverted representations from lower resolutions as residuals. (iii) Dilated Sampling: Improves global semantic coherence by using dilated sampling of denoising paths. DemoFusion successfully generates high-resolution images (up to 4096^2 and beyond) with rich local details and global semantic coherence. Quantitative evaluations using FID, IS, and CLIP Score demonstrate DemoFusion's superior performance compared to SDXL, MultiDiffusion, SDXL+BSRGAN, and SCALECRAFTER. DemoFusion enables high-resolution generation on consumer-grade GPUs, making it accessible to a wider audience. DemoFusion requires longer inference times compared to baseline methods due to its progressive upscaling and patch-wise denoising processes. Performance heavily relies on the underlying LDM's ability to generate coherent local patches at higher resolutions, and can be limited by the LDM's inherent biases. generative artificial intelligence, high-resolution image generation, latent diffusion models, multidiffusion, progressive upscaling
2311.16961 Report HumanRef: Single Image to 3D Human Generation via Reference-Guided Diffusion Jingbo Zhang, Xiaoyu Li, Qi Zhang, Yanpei Cao, Ying Shan, Jing Liao Generating a 3D human model from a single reference image is challenging because it requires inferring textures and geometries in invisible views while maintaining consistency with the reference image. Previous methods utilizing 3D generative models are limited by the availability of 3D training data. Optimization-based methods that lift text-to-image diffusion models to 3D generation often fail to preserve the texture details of the reference image, resulting in inconsistent appearances in different views. In this paper, we propose HumanRef, a 3D human generation framework from a single-view input. To ensure the generated 3D model is photorealistic and consistent with the input image, HumanRef introduces a novel method called reference-guided score distillation sampling (Ref-SDS), which effectively incorporates image guidance into the generation process. Furthermore, we introduce region-aware attention to Ref-SDS, ensuring accurate correspondence between different body regions. Experimental results demonstrate that HumanRef outperforms state-of-the-art methods in generating 3D clothed humans with fine geometry, photorealistic textures, and view-consistent appearances. HumanRef, a novel framework for generating 3D clothed humans from a single image, leveraging a reference-guided score distillation sampling (Ref-SDS) loss to produce realistic and view-consistent textures. Reconstructing 3D humans from single images is challenging due to the need to infer textures and geometries in unseen areas while maintaining consistency with the input view. HumanRef uses a hash-encoded SDF network optimized in a coarse-to-fine manner, incorporating human geometry constraints, a Ref-SDS loss that injects image guidance into a pretrained diffusion model, and region-aware attention for precise local-region guidance. Outperforms state-of-the-art methods in generating 3D clothed humans with fine geometry and photorealistic textures. Successfully generates view-consistent results matching the reference image. Produces more realistic textures compared to methods relying solely on text-guided diffusion models. May suffer from the Janus problem in side views due to lack of view-specific constraints. Can fail in cases of extreme poses where body estimation is inaccurate. 3d human generation, diffusion models, score distillation sampling, region-aware attention, single image reconstruction
2311.16933 Report SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, Bo Dai The development of text-to-video (T2V), i.e., generating videos with a given text prompt, has been significantly advanced in recent years. However, relying solely on text prompts often results in ambiguous frame composition due to spatial uncertainty. The research community thus leverages the dense structure signals, e.g., per-frame depth/edge sequences, to enhance controllability, whose collection accordingly increases the burden of inference. In this work, we present SparseCtrl to enable flexible structure control with temporally sparse signals, requiring only one or a few inputs, as shown in Figure 1. It incorporates an additional condition encoder to process these sparse signals while leaving the pre-trained T2V model untouched. The proposed approach is compatible with various modalities, including sketches, depth maps, and RGB images, providing more practical control for video generation and promoting applications such as storyboarding, depth rendering, keyframe animation, and interpolation. Extensive experiments demonstrate the generalization of SparseCtrl on both original and personalized T2V generators. Codes and models will be publicly available at https://guoyww.github.io/projects/SparseCtrl . This paper presents SparseCtrl, an efficient method for controlling text-to-video (T2V) generation using temporally sparse condition maps, such as sketches, depth maps, or RGB images, via an add-on encoder. Current T2V models struggle with fine-grained control and often require dense condition maps for every frame, leading to impractical inference costs. This paper addresses these limitations by enabling control with only a few keyframe conditions. SparseCtrl employs a condition encoder with temporal-aware layers to propagate information from conditioned keyframes to unconditioned frames. It leverages masking strategies to handle varying sparsity levels and improves upon ControlNet's design by removing the noised sample input to the encoder. SparseCtrl achieves high-fidelity control over the generated video content, closely adhering to the input conditions even with sparse input. The method demonstrates strong generalization ability, successfully applied to various tasks like sketch-to-video generation, depth-guided generation, image animation, and video interpolation. Extensive experiments and comparisons with existing methods showcase SparseCtrl's effectiveness in maintaining temporal consistency and achieving comparable or superior performance on chosen tasks. The quality and domain of generated videos are limited by the pre-trained T2V backbone and training data. Out-of-domain inputs, like anime images, can pose challenges due to data scarcity in the training dataset. Future work could explore domain-specific backbones or more diverse training data. text-to-video generation, sparse control, diffusion models, controllable video synthesis, keyframe animation
2311.16922 Report Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, Lidong Bing Large Vision-Language Models (LVLMs) have advanced considerably, intertwining visual recognition and language understanding to generate content that is not only coherent but also contextually attuned. Despite their success, LVLMs still suffer from the issue of object hallucinations, where models generate plausible yet incorrect outputs that include objects that do not exist in the images. To mitigate this issue, we introduce Visual Contrastive Decoding (VCD), a simple and training-free method that contrasts output distributions derived from original and distorted visual inputs. The proposed VCD effectively reduces the over-reliance on statistical bias and unimodal priors, two essential causes of object hallucinations. This adjustment ensures the generated content is closely grounded to visual inputs, resulting in contextually accurate outputs. Our experiments show that VCD, without either additional training or the usage of external tools, significantly mitigates the object hallucination issue across different LVLM families. Beyond mitigating object hallucinations, VCD also excels in general LVLM benchmarks, highlighting its wide-ranging applicability. This paper introduces Visual Contrastive Decoding (VCD), a training-free method to mitigate object hallucinations in Large Vision-Language Models (LVLMs) by contrasting output distributions from original and distorted visual inputs. Object hallucinations, the generation of plausible yet incorrect object descriptions, present a significant challenge to the reliability and applicability of LVLMs in real-world scenarios. VCD contrasts output distributions generated from original and distorted images, effectively calibrating the model's over-reliance on statistical bias and unimodal priors (language priors). This approach requires no additional training or external tools. VCD significantly reduces object hallucinations across different LVLM families (LLaVA-1.5, InstructBLIP, Qwen-VL) and datasets (MSCOCO, A-OKVQA, GQA). VCD shows consistent improvements on object hallucination benchmarks, including up to +7.4 F1 score boost on POPE and +18% improvement on MME. Beyond mitigating hallucinations, VCD enhances the general perception capabilities of LVLMs, as demonstrated by improved performance on MME and LLaVA-Bench. The current implementation relies on basic Gaussian noise for visual distortion; exploring more fine-grained techniques could be beneficial. The study focuses on image and text LVLMs, future work could explore extending VCD to video understanding. object hallucination, vision-language models, contrastive decoding, multimodal learning, artificial intelligence
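At decoding time, VCD's adjustment is a contrast between two next-token distributions plus an adaptive plausibility cutoff; the sketch below follows that recipe, with the alpha and beta values and the toy logits chosen arbitrarily for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def vcd_decode_step(logits_original, logits_distorted, alpha=1.0, beta=0.1):
    """Contrast next-token logits from the original image vs. a distorted (e.g. noised) image.
    Tokens the model prefers without clean visual evidence get pushed down."""
    contrasted = (1 + alpha) * logits_original - alpha * logits_distorted
    # Adaptive plausibility constraint: only keep tokens that are reasonably likely
    # under the original image, to avoid promoting implausible tokens.
    p_orig = softmax(logits_original)
    keep = p_orig >= beta * p_orig.max()
    contrasted = np.where(keep, contrasted, -np.inf)
    return softmax(contrasted)

# Toy usage with a 5-token vocabulary.
orig = np.array([2.0, 1.5, 0.2, -1.0, 0.0])   # conditioned on the real image
dist = np.array([1.0, 1.6, 0.1, -1.0, 0.3])   # conditioned on a heavily noised image
print(vcd_decode_step(orig, dist).round(3))
```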
2311.16918 Report RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D Lingteng Qiu, Guanying Chen, Xiaodong Gu, Qi Zuo, Mutian Xu, Yushuang Wu, Weihao Yuan, Zilong Dong, Liefeng Bo, Xiaoguang Han Lifting 2D diffusion for 3D generation is a challenging problem due to the lack of geometric prior and the complex entanglement of materials and lighting in natural images. Existing methods have shown promise by first creating the geometry through score-distillation sampling (SDS) applied to rendered surface normals, followed by appearance modeling. However, relying on a 2D RGB diffusion model to optimize surface normals is suboptimal due to the distribution discrepancy between natural images and normals maps, leading to instability in optimization. In this paper, recognizing that the normal and depth information effectively describe scene geometry and be automatically estimated from images, we propose to learn a generalizable Normal-Depth diffusion model for 3D generation. We achieve this by training on the large-scale LAION dataset together with the generalizable image-to-depth and normal prior models. In an attempt to alleviate the mixed illumination effects in the generated materials, we introduce an albedo diffusion model to impose data-driven constraints on the albedo component. Our experiments show that when integrated into existing text-to-3D pipelines, our models significantly enhance the detail richness, achieving state-of-the-art results. Our project page is https://aigc3d.github.io/richdreamer/. This paper presents RichDreamer, a novel text-to-3D generation method that leverages a generalizable Normal-Depth diffusion model for enhanced detail and fidelity. Existing text-to-3D methods struggle to generate high-quality, detailed objects due to the limitations of 2D diffusion models in capturing 3D geometry and material properties. The authors propose a two-stage approach: 1) Train a Normal-Depth diffusion model on a massive real-world dataset (LAION) and fine-tune it on a synthetic dataset (Objaverse) to provide robust geometric priors. 2) Introduce a depth-conditioned albedo diffusion model to disentangle albedo from lighting effects, leading to more accurate appearance modeling. RichDreamer significantly outperforms state-of-the-art methods in terms of both geometry and appearance quality, as evidenced by CLIP score comparisons and user studies. Pre-training the Normal-Depth diffusion model on a large-scale real-world dataset proves crucial for generalization ability. The depth-conditioned albedo diffusion model effectively separates albedo from lighting artifacts, resulting in more realistic relighting. The current method primarily focuses on object-level generation, limiting its applicability to complex scenes. Further research is needed to develop a comprehensive appearance prior model that regularizes both diffuse and specular components. text-to-3d, diffusion model, geometry prior, albedo diffusion, appearance modeling
2311.16854 Report A Unified Approach for Text- and Image-guided 4D Scene Generation Yufeng Zheng, Xueting Li, Koki Nagano, Sifei Liu, Karsten Kreis, Otmar Hilliges, Shalini De Mello Large-scale diffusion generative models are greatly simplifying image, video and 3D asset creation from user-provided text prompts and images. However, the challenging problem of text-to-4D dynamic 3D scene generation with diffusion guidance remains largely unexplored. We propose Dream-in-4D, which features a novel two-stage approach for text-to-4D synthesis, leveraging (1) 3D and 2D diffusion guidance to effectively learn a high-quality static 3D asset in the first stage; (2) a deformable neural radiance field that explicitly disentangles the learned static asset from its deformation, preserving quality during motion learning; and (3) a multi-resolution feature grid for the deformation field with a displacement total variation loss to effectively learn motion with video diffusion guidance in the second stage. Through a user preference study, we demonstrate that our approach significantly advances image and motion quality, 3D consistency and text fidelity for text-to-4D generation compared to baseline approaches. Thanks to its motion-disentangled representation, Dream-in-4D can also be easily adapted for controllable generation where appearance is defined by one or multiple images, without the need to modify the motion learning stage. Thus, our method offers, for the first time, a unified approach for text-to-4D, image-to-4D and personalized 4D generation tasks. Presents Dream-in-4D, a novel two-stage approach for text-to-4D dynamic 3D scene generation with diffusion guidance. Addresses the limitations of existing methods in generating high-quality, 3D-consistent, and text-faithful dynamic 3D scenes from text prompts. Leverages 3D and 2D diffusion guidance for high-quality static 3D asset generation in the first stage. Employs a deformable neural radiance field to disentangle static assets from motion, enabling motion learning with video diffusion guidance in the second stage. Introduces a multi-resolution feature grid and a displacement total variation loss for detailed and smooth motion. Significantly improves image and motion quality, 3D consistency, and text fidelity for text-to-4D generation compared to baseline approaches. Enables controllable generation where appearance is defined by one or multiple images, without modifying the motion learning stage. Offers a unified approach for text-to-4D, image-to-4D, and personalized 4D generation tasks. The combination of 3D and 2D diffusion priors may not always learn correct static 3D representations, particularly for complex prompts. The method cannot recover or learn correct motion if the initial static representation is inaccurate. text-to-4d, diffusion models, deformable nerf, 3d scene generation, motion synthesis
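The displacement total variation loss can be read as a smoothness penalty on a grid of 3D displacement vectors, encouraging neighbouring cells of the deformation field to move together. Below is a generic TV sketch under that reading, not the paper's exact multi-resolution implementation.

```python
import torch

def displacement_tv_loss(disp):
    """disp: (D, H, W, 3) grid of 3D displacement vectors.
    Penalises differences between neighbouring cells along each spatial axis."""
    dz = (disp[1:, :, :] - disp[:-1, :, :]).pow(2).mean()
    dy = (disp[:, 1:, :] - disp[:, :-1, :]).pow(2).mean()
    dx = (disp[:, :, 1:] - disp[:, :, :-1]).pow(2).mean()
    return dz + dy + dx

# Usage: a 16^3 displacement grid; the loss shrinks as the field becomes smoother.
grid = torch.randn(16, 16, 16, 3, requires_grad=True)
loss = displacement_tv_loss(grid)
loss.backward()
print(float(loss), grid.grad.shape)
```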
2311.16737 Report Point'n Move: Interactive Scene Object Manipulation on Gaussian Splatting Radiance Fields Jiajun Huang, Hongchuan Yu We propose Point'n Move, a method that achieves interactive scene object manipulation with exposed region inpainting. Interactivity here further comes from intuitive object selection and real-time editing. To achieve this, we adopt Gaussian Splatting Radiance Field as the scene representation and fully leverage its explicit nature and speed advantage. Its explicit representation formulation allows us to devise a 2D prompt points to 3D mask dual-stage self-prompting segmentation algorithm, perform mask refinement and merging, minimize change as well as provide good initialization for scene inpainting and perform editing in real-time without per-editing training, all leads to superior quality and performance. We test our method by performing editing on both forward-facing and 360 scenes. We also compare our method against existing scene object removal methods, showing superior quality despite being more capable and having a speed advantage. Point'n Move, an interactive method for scene object manipulation with exposed region inpainting on Gaussian Splatting Radiance Fields. Allows users to intuitively select, manipulate, and rearrange objects within 3D scenes for various applications like virtual home furnishing and AR/VR environment creation. Leverages the explicit nature and speed of Gaussian Splatting Radiance Fields to perform 2D point prompt to 3D segmentation, scene content revealing pruning, reprojection-based initialization for inpainting, and direct manipulation of primitives for real-time editing. Achieves high-quality object selection and editing in both 360 and forward-facing scenes. Demonstrates competitive performance against existing object removal methods, particularly in terms of inpainting quality and speed. Shows effectiveness of the dual-stage segmentation, content-revealing pruning, and reprojection-based initialization through ablation studies. Currently does not handle lighting or texture, focusing solely on geometry editing. Inaccuracies in segmentation can lead to artifacts in the inpainted regions, particularly with shadows. 3d scene manipulation, gaussian splatting radiance fields, exposed region inpainting, interactive editing, 3d segmentation
2311.16711 Report LEDITS++: Limitless Image Editing using Text-to-Image Models Manuel Brack, Felix Friedrich, Katharina Kornmeier, Linoy Tsaban, Patrick Schramowski, Kristian Kersting, Apolinário Passos Text-to-image diffusion models have recently received increasing interest for their astonishing ability to produce high-fidelity images from solely text inputs. Subsequent research efforts aim to exploit and apply their capabilities to real image editing. However, existing image-to-image methods are often inefficient, imprecise, and of limited versatility. They either require time-consuming fine-tuning, deviate unnecessarily strongly from the input image, and/or lack support for multiple, simultaneous edits. To address these issues, we introduce LEDITS++, an efficient yet versatile and precise textual image manipulation technique. LEDITS++'s novel inversion approach requires no tuning nor optimization and produces high-fidelity results with a few diffusion steps. Second, our methodology supports multiple simultaneous edits and is architecture-agnostic. Third, we use a novel implicit masking technique that limits changes to relevant image regions. We propose the novel TEdBench++ benchmark as part of our exhaustive evaluation. Our results demonstrate the capabilities of LEDITS++ and its improvements over previous methods. The project page is available at https://leditsplusplus-project.static.hf.space . LEDITS++ is a novel, efficient, versatile, and precise method for text-driven image editing using text-to-image diffusion models. Existing image-to-image editing methods are often inefficient, imprecise, and limited in versatility. They often require time-consuming fine-tuning, deviate significantly from the input image, and lack support for multiple simultaneous edits. LEDITS++ employs a three-pronged approach: 1) Efficient image inversion using a modified DPM-Solver++ for faster and perfect reconstruction. 2) Versatile textual editing through a novel guidance term that allows multiple edits with individual control. 3) Semantic grounding of edits by combining attention and noise-based masking to restrict changes to relevant image regions. LEDITS++ achieves perfect image reconstruction with significantly faster runtime compared to existing methods. LEDITS++ effectively performs various complex edits, including multi-concept editing, outperforming competing methods in fidelity and preserving image composition. Implicit masking within LEDITS++ is shown to effectively identify and ground edits to semantically relevant regions, as demonstrated through a segmentation task proxy. Editing success is partially dependent on the capabilities of the underlying pre-trained diffusion model. While excelling at compositional robustness, object coherence within the edited region can be further improved. image editing, text-to-image synthesis, diffusion models, semantic guidance, image inversion
2311.16635 Report MotionZero:Exploiting Motion Priors for Zero-shot Text-to-Video Generation Sitong Su, Litao Guo, Lianli Gao, Hengtao Shen, Jingkuan Song Zero-shot Text-to-Video synthesis generates videos based on prompts without any videos. Without motion information from videos, motion priors implied in prompts are vital guidance. For example, the prompt "airplane landing on the runway" indicates motion priors that the "airplane" moves downwards while the "runway" stays static. Whereas the motion priors are not fully exploited in previous approaches, thus leading to two nontrivial issues: 1) the motion variation pattern remains unaltered and prompt-agnostic for disregarding motion priors; 2) the motion control of different objects is inaccurate and entangled without considering the independent motion priors of different objects. To tackle the two issues, we propose a prompt-adaptive and disentangled motion control strategy coined as MotionZero, which derives motion priors from prompts of different objects by Large-Language-Models and accordingly applies motion control of different objects to corresponding regions in disentanglement. Furthermore, to facilitate videos with varying degrees of motion amplitude, we propose a Motion-Aware Attention scheme which adjusts attention among frames by motion amplitude. Extensive experiments demonstrate that our strategy could correctly control motion of different objects and support versatile applications including zero-shot video edit. MotionZero, a novel zero-shot text-to-video generation method, introduces a prompt-adaptive and disentangled motion control strategy by leveraging motion priors from prompts and first frames, enabling precise and realistic motion generation in synthesized videos. Existing zero-shot text-to-video generation methods fail to fully utilize motion information inherent in prompts, leading to unrealistic motion patterns and inaccurate control of multiple objects. 1) **Extracting Motion Priors**: LLMs are employed to extract motion priors from text prompts and the generated first frame, identifying moving objects and their directions. 2) **Disentangled Motion Control**: Motion priors are applied separately to corresponding objects in the feature space, utilizing a segmentation model to locate object positions. 3) **Motion-Aware Attention**: A novel attention scheme adjusts attention among frames based on motion amplitude to accommodate videos with varying degrees of motion. MotionZero demonstrates superior accuracy in motion control compared to existing methods, evidenced by quantitative evaluations and visual comparisons. The method effectively disentangles the motion of multiple objects, enabling independent and realistic movement within a scene. MotionZero supports various applications, including zero-shot video editing with background and foreground manipulation, human body control through skeleton information, camera motion simulation, and evolving event depiction. The reliance on external segmentation models and LLMs introduces computational overhead. The current implementation focuses on generating videos with a fixed number of frames. zero-shot learning, text-to-video generation, motion control, large language models, video editing
2311.16567 Report MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices Yang Zhao, Yanwu Xu, Zhisheng Xiao, Tingbo Hou The deployment of large-scale text-to-image diffusion models on mobile devices is impeded by their substantial model size and slow inference speed. In this paper, we propose \textbf{MobileDiffusion}, a highly efficient text-to-image diffusion model obtained through extensive optimizations in both architecture and sampling techniques. We conduct a comprehensive examination of model architecture design to reduce redundancy, enhance computational efficiency, and minimize model's parameter count, while preserving image generation quality. Additionally, we employ distillation and diffusion-GAN finetuning techniques on MobileDiffusion to achieve 8-step and 1-step inference respectively. Empirical studies, conducted both quantitatively and qualitatively, demonstrate the effectiveness of our proposed techniques. MobileDiffusion achieves a remarkable \textbf{sub-second} inference speed for generating a $512\times512$ image on mobile devices, establishing a new state of the art. This paper introduces MobileDiffusion, a highly efficient text-to-image diffusion model designed for mobile devices, achieved by optimizing the model architecture and sampling techniques. Deploying large-scale text-to-image models on mobile devices is challenging due to their size and slow inference speed. This work addresses this challenge by substantially reducing inference time and model size, paving the way for on-device text-to-image generation. The authors perform a comprehensive analysis and modification of the UNet architecture, including reducing redundancy in transformer blocks, employing separable convolutions, and pruning residual blocks. They also utilize distillation and diffusion-GAN finetuning to reduce sampling steps. MobileDiffusion achieves sub-second inference speed for generating 512x512 images on mobile devices (e.g., 0.2 seconds on iPhone 15 Pro). The model achieves comparable image quality to existing models like Stable Diffusion while being significantly smaller and faster. MobileDiffusion demonstrates strong performance in downstream tasks like controllable generation and LoRA finetuning. The model still struggles with generating images requiring uncommon knowledge or complex quantity interpretation, potentially limited by the text encoder. Future work could explore extending MobileDiffusion to pixel-based diffusion models. text-to-image generation, diffusion models, mobile ai, model compression, efficient architecture
2311.16513 Report Fine-grained Appearance Transfer with Diffusion Models Yuteng Ye, Guanwen Li, Hang Zhou, Cai Jiale, Junqing Yu, Yawei Luo, Zikai Song, Qilong Xing, Youjia Zhang, Wei Yang Image-to-image translation (I2I), and particularly its subfield of appearance transfer, which seeks to alter the visual appearance between images while maintaining structural coherence, presents formidable challenges. Despite significant advancements brought by diffusion models, achieving fine-grained transfer remains complex, particularly in terms of retaining detailed structural elements and ensuring information fidelity. This paper proposes an innovative framework designed to surmount these challenges by integrating various aspects of semantic matching, appearance transfer, and latent deviation. A pivotal aspect of our approach is the strategic use of the predicted $x_0$ space by diffusion models within the latent space of diffusion processes. This is identified as a crucial element for the precise and natural transfer of fine-grained details. Our framework exploits this space to accomplish semantic alignment between source and target images, facilitating mask-wise appearance transfer for improved feature acquisition. A significant advancement of our method is the seamless integration of these features into the latent space, enabling more nuanced latent deviations without necessitating extensive model retraining or fine-tuning. The effectiveness of our approach is demonstrated through extensive experiments, which showcase its ability to adeptly handle fine-grained appearance transfers across a wide range of categories and domains. We provide our code at https://github.com/babahui/Fine-grained-Appearance-Transfer This paper proposes a novel framework for fine-grained appearance transfer in image-to-image translation, leveraging the predicted x_0 space of diffusion models. Fine-grained appearance transfer, aiming to alter visual appearance while preserving structure, is challenging for existing diffusion models, particularly in maintaining details and fidelity. The framework integrates semantic matching in the x_0 space for detail alignment, appearance transfer based on matched features, and a latent deviation method for smooth integration of transferred features into the latent space. The method effectively transfers fine-grained details across diverse categories and domains, outperforming existing image-guided methods in qualitative and quantitative comparisons. The framework excels in preserving structural integrity and information fidelity, contrasting with limitations of text-guided methods in precise detail control. Ablation studies validate the importance of both semantic matching and latent deviation components for accurate and visually plausible results. The method's performance is susceptible to significant viewpoint differences and size discrepancies between source and target images. Future work will focus on extending the framework to broader image transfer contexts, addressing challenges beyond appearance transfer while maintaining fine-grained detail focus. image-to-image translation, appearance transfer, diffusion models, semantic matching, latent deviation
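The "predicted x0 space" referred to here is the standard DDPM reparameterisation that recovers an estimate of the clean sample from the current noisy latent and the predicted noise; the sketch below is that textbook formula with toy tensors, independent of the paper's matching and transfer modules.

```python
import torch

def predict_x0(x_t, eps_pred, alphas_cumprod, t):
    """Recover the model's current estimate of the clean sample x0 from a noisy x_t:
    x0_hat = (x_t - sqrt(1 - a_bar_t) * eps_pred) / sqrt(a_bar_t)."""
    a_bar = alphas_cumprod[t]
    return (x_t - (1 - a_bar).sqrt() * eps_pred) / a_bar.sqrt()

# Toy usage: with the true noise, x0 is recovered exactly (up to float error).
alphas_cumprod = torch.linspace(0.9999, 0.01, 1000)
x0 = torch.rand(1, 4, 32, 32)          # a clean latent
t = 400
noise = torch.randn_like(x0)
x_t = alphas_cumprod[t].sqrt() * x0 + (1 - alphas_cumprod[t]).sqrt() * noise
print(torch.allclose(predict_x0(x_t, noise, alphas_cumprod, t), x0, atol=1e-4))
```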
2311.16512 Report CoSeR: Bridging Image and Language for Cognitive Super-Resolution Haoze Sun, Wenbo Li, Jianzhuang Liu, Haoyu Chen, Renjing Pei, Xueyi Zou, Youliang Yan, Yujiu Yang Existing super-resolution (SR) models primarily focus on restoring local texture details, often neglecting the global semantic information within the scene. This oversight can lead to the omission of crucial semantic details or the introduction of inaccurate textures during the recovery process. In our work, we introduce the Cognitive Super-Resolution (CoSeR) framework, empowering SR models with the capacity to comprehend low-resolution images. We achieve this by marrying image appearance and language understanding to generate a cognitive embedding, which not only activates prior information from large text-to-image diffusion models but also facilitates the generation of high-quality reference images to optimize the SR process. To further improve image fidelity, we propose a novel condition injection scheme called "All-in-Attention", consolidating all conditional information into a single module. Consequently, our method successfully restores semantically correct and photorealistic details, demonstrating state-of-the-art performance across multiple benchmarks. Code: https://github.com/VINHYU/CoSeR Introduces Cognitive Super-Resolution (CoSeR), a novel framework that empowers super-resolution models with cognitive abilities by generating semantic embeddings and high-quality reference images from low-resolution inputs using text-to-image diffusion models. Existing SR models often neglect global semantic information, leading to the loss of crucial details or introduction of inaccurate textures. CoSeR addresses this by incorporating cognitive understanding similar to human perception. A cognitive encoder extracts semantic and textural embeddings from LR images. These embeddings generate high-fidelity reference images and are integrated with LR input into a denoising U-Net via a novel All-in-Attention (AiA) module. CoSeR achieves state-of-the-art performance on multiple benchmarks, including ImageNet Test2000, RealSR, and DRealSR. Generated reference images exhibit high semantic similarity to LR inputs, effectively guiding the restoration process. The AiA module enhances the fidelity of SR results, ensuring consistency with the input image. The improvement from using multiple reference images plateaus beyond 2-3 images. Further research on accelerating the sampling process in diffusion-based SR models is needed. image super-resolution, cognitive super-resolution, diffusion models, reference image generation, all-in-attention module
2311.16507 Report Exploring Straighter Trajectories of Flow Matching with Diffusion Guidance Siyu Xing, Jie Cao, Huaibo Huang, Xiao-Yu Zhang, Ran He Flow matching as a paradigm of generative model achieves notable success across various domains. However, existing methods use either multi-round training or knowledge within minibatches, posing challenges in finding a favorable coupling strategy for straight trajectories. To address this issue, we propose a novel approach, Straighter trajectories of Flow Matching (StraightFM). It straightens trajectories with the coupling strategy guided by diffusion model from entire distribution level. First, we propose a coupling strategy to straighten trajectories, creating couplings between image and noise samples under diffusion model guidance. Second, StraightFM also integrates real data to enhance training, employing a neural network to parameterize another coupling process from images to noise samples. StraightFM is jointly optimized with couplings from above two mutually complementary directions, resulting in straighter trajectories and enabling both one-step and few-step generation. Extensive experiments demonstrate that StraightFM yields high quality samples with fewer step. StraightFM generates visually appealing images with a lower FID among diffusion and traditional flow matching methods within 5 sampling steps when trained on pixel space. In the latent space (i.e., Latent Diffusion), StraightFM achieves a lower KID value compared to existing methods on the CelebA-HQ 256 dataset in fewer than 10 sampling steps. StraightFM is a novel approach for flow matching generative models that leverages diffusion model guidance to straighten trajectories and enable one-step and few-step generation. Existing flow matching methods rely on multi-round training or minibatch knowledge for coupling strategies, leading to limitations in finding favorable couplings for straight trajectories. StraightFM uses a pre-trained diffusion model to guide the coupling strategy by creating couplings between image and noise samples. It also integrates real data to enhance training by using a neural network to parameterize another coupling process from images to noise samples. StraightFM achieves straighter paths and high-quality image generation in fewer steps, even with one-step generation, on CIFAR-10. StraightFM outperforms latent diffusion models on CelebA-HQ 256x256 dataset in under 10 sampling steps. StraightFM demonstrates promising results in image inpainting, highlighting the efficacy of natural optimal transport couplings for flow matching in restoration tasks. The training of StraightFM depends on the coupling mechanism of diffusion model sampling, which can be slower than random coupling strategies. Future work can explore balancing coupling speed and sample quality. generative models, flow matching, diffusion models, optimal transport, image generation
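For context, plain flow matching regresses a velocity field onto the straight-line direction between a noise sample and a data sample; the sketch below shows that baseline training step with random couplings, i.e. the coupling strategy StraightFM replaces with diffusion-guided couplings. The tiny network and toy data are placeholders.

```python
import torch
import torch.nn as nn

# A tiny velocity network v_theta(x_t, t) for 2-D toy data.
net = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def flow_matching_step(x1):
    """One training step: sample noise x0, interpolate x_t = (1-t)*x0 + t*x1,
    and regress the predicted velocity onto the straight-line target x1 - x0."""
    x0 = torch.randn_like(x1)                  # random coupling with noise
    t = torch.rand(x1.shape[0], 1)
    x_t = (1 - t) * x0 + t * x1
    target_v = x1 - x0                         # velocity of the straight path
    pred_v = net(torch.cat([x_t, t], dim=1))
    loss = (pred_v - target_v).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Usage on toy Gaussian "data" centred at (2, 0).
data = torch.randn(256, 2) * 0.3 + torch.tensor([2.0, 0.0])
for step in range(5):
    print(round(flow_matching_step(data), 4))
```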
2311.16504 Report Rethinking Directional Integration in Neural Radiance Fields Congyue Deng, Jiawei Yang, Leonidas Guibas, Yue Wang Recent works use the Neural radiance field (NeRF) to perform multi-view 3D reconstruction, providing a significant leap in rendering photorealistic scenes. However, despite its efficacy, NeRF exhibits limited capability of learning view-dependent effects compared to light field rendering or image-based view synthesis. To that end, we introduce a modification to the NeRF rendering equation which is as simple as a few lines of code change for any NeRF variations, while greatly improving the rendering quality of view-dependent effects. By swapping the integration operator and the direction decoder network, we only integrate the positional features along the ray and move the directional terms out of the integration, resulting in a disentanglement of the view-dependent and independent components. The modified equation is equivalent to the classical volumetric rendering in ideal cases on object surfaces with Dirac densities. Furthermore, we prove that with the errors caused by network approximation and numerical integration, our rendering equation exhibits better convergence properties with lower error accumulations compared to the classical NeRF. We also show that the modified equation can be interpreted as light field rendering with learned ray embeddings. Experiments on different NeRF variations show consistent improvements in the quality of view-dependent effects with our simple modification. This paper introduces LiNeRF, a simple modification to the Neural Radiance Field (NeRF) rendering equation that enhances the rendering quality of view-dependent effects by disentangling view-dependent and view-independent components. NeRFs struggle to effectively model view-dependent effects due to redundant view-direction queries that over-consume network capacity. This modification addresses this issue by integrating positional features along rays and decoding the aggregated feature with view direction, leading to a more efficient and accurate representation. The core methodology involves swapping the integration operator and the direction decoder network in the NeRF rendering equation. This modification, which can be implemented with minimal code changes, integrates positional features along the ray and decodes the aggregated feature with view direction, similar to light field rendering with learned ray embeddings. LiNeRF demonstrates consistent improvements in rendering view-dependent effects across various NeRF architectures and input encodings. Theoretical analysis proves that LiNeRF provides a better numerical estimator of radiance integration with a tighter error bound compared to classic NeRF. Experimental results on synthetic and real-world datasets, including Shiny Blender and Shiny datasets, showcase LiNeRF's capability to effectively model diverse view-dependent effects like reflections, refractions, and light interferences. Limitations: LiNeRF, while showing significant improvements over classic NeRF, still lags behind image-based view synthesis methods specifically designed for non-Lambertian effects, which benefit from explicit pixel value representations. Future Work: The research team aims to investigate tighter integration of implicit radiance field rendering with explicit pixel-based rendering techniques and explore optimal feature selection strategies for integration based on network architectures. neural radiance fields, nerf, view synthesis, light field rendering, view-dependent effects
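The "few lines of code change" can be illustrated by where the direction-dependent decoder sits relative to the ray integral: classic NeRF decodes a colour per sample and then integrates, while the modified equation integrates positional features first and decodes once per ray. The toy contrast below uses a linear stand-in decoder and assumes compositing weights are already computed from densities.

```python
import torch

torch.manual_seed(0)
S, F = 64, 16                              # samples per ray, feature width
feat = torch.randn(S, F)                   # per-sample positional features
sigma = torch.rand(S)                      # per-sample densities
delta = torch.full((S,), 0.02)             # sample spacing along the ray

# Standard alpha-compositing weights w_i from densities.
alpha = 1 - torch.exp(-sigma * delta)
trans = torch.cumprod(torch.cat([torch.ones(1), 1 - alpha[:-1]]), dim=0)
w = trans * alpha

dir_embed = torch.randn(8)                 # encoded view direction
decoder = torch.nn.Linear(F + 8, 3)        # stand-in for the direction decoder

# Classic NeRF ordering: decode a colour per sample, then integrate colours.
per_sample_rgb = decoder(torch.cat([feat, dir_embed.expand(S, 8)], dim=1))
color_classic = (w.unsqueeze(1) * per_sample_rgb).sum(0)

# Modified ordering: integrate features along the ray, then decode once with direction.
ray_feat = (w.unsqueeze(1) * feat).sum(0)
color_modified = decoder(torch.cat([ray_feat, dir_embed]))

print(color_classic, color_modified)
```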
2311.16499 Report Deceptive-Human: Prompt-to-NeRF 3D Human Generation with 3D-Consistent Synthetic Images Shiu-hong Kao, Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang This paper presents Deceptive-Human, a novel Prompt-to-NeRF framework capitalizing state-of-the-art control diffusion models (e.g., ControlNet) to generate a high-quality controllable 3D human NeRF. Different from direct 3D generative approaches, e.g., DreamFusion and DreamHuman, Deceptive-Human employs a progressive refinement technique to elevate the reconstruction quality. This is achieved by utilizing high-quality synthetic human images generated through the ControlNet with view-consistent loss. Our method is versatile and readily extensible, accommodating multimodal inputs, including a text prompt and additional data such as 3D mesh, poses, and seed images. The resulting 3D human NeRF model empowers the synthesis of highly photorealistic novel views from 360-degree perspectives. The key to our Deceptive-Human for hallucinating multi-view consistent synthetic human images lies in our progressive finetuning strategy. This strategy involves iteratively enhancing views using the provided multimodal inputs at each intermediate step to improve the human NeRF model. Within this iterative refinement process, view-dependent appearances are systematically eliminated to prevent interference with the underlying density estimation. Extensive qualitative and quantitative experimental comparison shows that our deceptive human models achieve state-of-the-art application quality. Presents Deceptive-Human, a novel Prompt-to-NeRF framework that leverages control diffusion models to generate high-quality, controllable 3D human NeRFs. Addresses the challenge of creating realistic 3D human models, particularly in generating high-fidelity appearances and enabling controllability. Employs a progressive refinement technique. Generates coarse NeRF using view-consistent diffusion models. Iteratively enhances the coarse NeRF by denoising rendered images. Achieves state-of-the-art application quality for 3D human generation. Demonstrates the first 3D human model that accepts inputs with various controls like text, pose, style, edges, depth, and seed images. Generates highly photorealistic novel views from 360-degree perspectives, showcasing the model's capability for novel view synthesis. The quality of the generated 3D human relies on the accuracy of the mesh estimation module, which can be improved. Extending the framework to generate animated 3D humans, potentially leveraging style and pose controls, presents an interesting direction for future work. 3d human generation, neural radiance fields, diffusion models, controllable generation, progressive refinement
2311.16498 Report MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, Mike Zheng Shou This paper studies the human image animation task, which aims to generate a video of a certain reference identity following a particular motion sequence. Existing animation works typically employ the frame-warping technique to animate the reference image towards the target motion. Despite achieving reasonable results, these approaches face challenges in maintaining temporal consistency throughout the animation due to the lack of temporal modeling and poor preservation of reference identity. In this work, we introduce MagicAnimate, a diffusion-based framework that aims at enhancing temporal consistency, preserving reference image faithfully, and improving animation fidelity. To achieve this, we first develop a video diffusion model to encode temporal information. Second, to maintain the appearance coherence across frames, we introduce a novel appearance encoder to retain the intricate details of the reference image. Leveraging these two innovations, we further employ a simple video fusion technique to encourage smooth transitions for long video animation. Empirical results demonstrate the superiority of our method over baseline approaches on two benchmarks. Notably, our approach outperforms the strongest baseline by over 38% in terms of video fidelity on the challenging TikTok dancing dataset. Code and model will be made available. This paper presents MagicAnimate, a novel diffusion-based human image animation framework that leverages temporal modeling and robust appearance encoding for generating temporally consistent and high-fidelity animations. Existing animation methods often struggle with maintaining temporal consistency and preserving fine-grained details of the reference image, leading to flickering and unrealistic results. MagicAnimate employs a video diffusion model with temporal attention blocks for capturing temporal information and introduces an appearance encoder to retain detailed features from the reference image. It also utilizes an image-video joint training strategy and a video fusion technique for enhancing animation quality and smoothness. MagicAnimate achieves state-of-the-art performance on two benchmarks, TikTok and TED-talks, surpassing baselines in video fidelity and single-frame quality. The method demonstrates superior temporal consistency compared to existing diffusion-based animation approaches. MagicAnimate exhibits strong generalization ability, enabling cross-identity animation, unseen domain animation, and multi-person animation. The higher L1 error on the TED-talks dataset suggests potential limitations in handling dynamic backgrounds due to the use of DensePose as motion input. Future work includes exploring alternative motion representations and extending the framework to handle more complex scenes and interactions. image animation, diffusion models, temporal consistency, appearance encoding, human avatar
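Temporal attention blocks of the kind mentioned here are typically implemented by letting each spatial location attend across the frame axis only; the sketch below shows that generic pattern, not MagicAnimate's specific architecture.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention over the frame axis only: each spatial location attends across
    time, which is the usual way temporal layers are added to an image U-Net."""
    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                 # x: (B, T, H, W, C)
        b, t, h, w, c = x.shape
        tokens = x.permute(0, 2, 3, 1, 4).reshape(b * h * w, t, c)   # one sequence per pixel
        q = self.norm(tokens)
        out, _ = self.attn(q, q, q)
        out = (tokens + out).reshape(b, h, w, t, c).permute(0, 3, 1, 2, 4)
        return out

# Usage on a tiny 8-frame feature volume.
layer = TemporalAttention(channels=32)
feats = torch.randn(1, 8, 16, 16, 32)
print(layer(feats).shape)   # torch.Size([1, 8, 16, 16, 32])
```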
2311.16492 Report VLPrompt: Vision-Language Prompting for Panoptic Scene Graph Generation Zijian Zhou, Miaojing Shi, Holger Caesar Panoptic Scene Graph Generation (PSG) aims at achieving a comprehensive image understanding by simultaneously segmenting objects and predicting relations among objects. However, the long-tail problem among relations leads to unsatisfactory results in real-world applications. Prior methods predominantly rely on vision information or utilize limited language information, such as object or relation names, thereby overlooking the utility of language information. Leveraging the recent progress in Large Language Models (LLMs), we propose to use language information to assist relation prediction, particularly for rare relations. To this end, we propose the Vision-Language Prompting (VLPrompt) model, which acquires vision information from images and language information from LLMs. Then, through a prompter network based on attention mechanism, it achieves precise relation prediction. Our extensive experiments show that VLPrompt significantly outperforms previous state-of-the-art methods on the PSG dataset, proving the effectiveness of incorporating language information and alleviating the long-tail problem of relations. This paper proposes VLPrompt, a novel Vision-Language Prompting model for Panoptic Scene Graph Generation (PSG) that leverages the rich language information from Large Language Models (LLMs) to address the long-tail problem in relation categories. Current PSG models struggle to accurately predict rare relations due to the long-tail problem. This paper explores the use of LLMs to provide common sense knowledge and improve relation prediction, particularly for rare relations. VLPrompt consists of three components: (1) Vision feature extractor: extracts visual features from object pairs using a segmentation network; (2) Language feature extractor: employs designed prompts and LLMs to generate descriptions for potential relations and judgments on relation triplets, encoding them into features; (3) Vision-language prompter: utilizes an attention-based network to enable interaction between vision and language features for relation prediction, which are then fused for the final prediction. VLPrompt significantly outperforms previous state-of-the-art methods on the PSG dataset, demonstrating the effectiveness of incorporating language information. Ablation studies validate the contribution of each component and the effectiveness of design choices. VLPrompt shows significant improvement in predicting rare relations, effectively alleviating the long-tail problem. The model's efficiency could be further improved, as it currently has higher FLOPS compared to some previous models. The reliance on pre-trained LLMs may limit its generalizability to open-set relation prediction. panoptic scene graph generation, large language models, vision-language model, long-tail problem, relation prediction
2311.16473 Report GS-IR: 3D Gaussian Splatting for Inverse Rendering Zhihao Liang, Qi Zhang, Ying Feng, Ying Shan, Kui Jia We propose GS-IR, a novel inverse rendering approach based on 3D Gaussian Splatting (GS) that leverages forward mapping volume rendering to achieve photorealistic novel view synthesis and relighting results. Unlike previous works that use implicit neural representations and volume rendering (e.g. NeRF), which suffer from low expressive power and high computational complexity, we extend GS, a top-performance representation for novel view synthesis, to estimate scene geometry, surface material, and environment illumination from multi-view images captured under unknown lighting conditions. There are two main problems when introducing GS to inverse rendering: 1) GS does not support producing plausible normal natively; 2) forward mapping (e.g. rasterization and splatting) cannot trace the occlusion like backward mapping (e.g. ray tracing). To address these challenges, our GS-IR proposes an efficient optimization scheme that incorporates a depth-derivation-based regularization for normal estimation and a baking-based occlusion to model indirect lighting. The flexible and expressive GS representation allows us to achieve fast and compact geometry reconstruction, photorealistic novel view synthesis, and effective physically-based rendering. We demonstrate the superiority of our method over baseline methods through qualitative and quantitative evaluations on various challenging scenes. GS-IR, a novel 3D Gaussian-based inverse rendering framework, leverages forward mapping splatting to deduce the physical attributes of a complex scene from multi-view images captured under unknown lighting conditions. Existing inverse rendering methods using implicit neural representations suffer from low expressive power and high computational complexity. GS-IR, based on 3D Gaussian Splatting, offers a more compact and efficient representation for faster, real-time rendering while achieving high quality. GS-IR employs a three-stage strategy: 1) Optimizes 3D Gaussians for geometry reconstruction and uses depth gradient to supervise normal estimation. 2) Precomputes occlusion information and stores it in spherical harmonics-based architectures to model indirect illumination. 3) Uses differentiable splatting with a physically-based rendering pipeline to optimize illumination and material-aware 3D Gaussians. GS-IR achieves superior novel view synthesis and albedo quality compared to baseline methods on the TensoIR Synthetic dataset. The method demonstrates fast convergence and supports real-time rendering due to its efficient 3D Gaussian representation and tile-based rasterizer. GS-IR effectively handles complex real unbounded scenes, reconstructing high-fidelity geometry and materials. Modeling the specular term of indirect illumination remains a limitation. Spherical Harmonics used for occlusion modeling are only suitable for low-frequency details. inverse rendering, 3d gaussian splatting, physically-based rendering, novel view synthesis, relighting
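Depth-derived normal supervision generally rests on a standard recipe: back-project the depth map to camera-space points and take the cross product of local finite differences. The sketch below implements that generic recipe with assumed pinhole intrinsics; it is in the spirit of, but not identical to, the paper's regularization.

```python
import numpy as np

def normals_from_depth(depth, fx, fy, cx, cy):
    """depth: (H, W) metric depth map. Returns (H, W, 3) unit normals in camera space."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Back-project each pixel to a 3D point with pinhole intrinsics.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    pts = np.stack([x, y, depth], axis=-1)
    # Finite differences of the point map along the image axes.
    dx = np.zeros_like(pts); dx[:, 1:-1] = pts[:, 2:] - pts[:, :-2]
    dy = np.zeros_like(pts); dy[1:-1, :] = pts[2:, :] - pts[:-2, :]
    n = np.cross(dx, dy)
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-8
    return n

# Usage: a fronto-parallel plane at depth 2 yields normals along the optical axis.
depth = np.full((32, 32), 2.0)
print(normals_from_depth(depth, fx=50, fy=50, cx=16, cy=16)[16, 16])
```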
2311.16465 Report TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, Furu Wei The diffusion model has been proven a powerful generative model in recent years, yet remains a challenge in generating visual text. Several methods alleviated this issue by incorporating explicit text position and content as guidance on where and what text to render. However, these methods still suffer from several drawbacks, such as limited flexibility and automation, constrained capability of layout prediction, and restricted style diversity. In this paper, we present TextDiffuser-2, aiming to unleash the power of language models for text rendering. Firstly, we fine-tune a large language model for layout planning. The large language model is capable of automatically generating keywords for text rendering and also supports layout modification through chatting. Secondly, we utilize the language model within the diffusion model to encode the position and texts at the line level. Unlike previous methods that employed tight character-level guidance, this approach generates more diverse text images. We conduct extensive experiments and incorporate user studies involving human participants as well as GPT-4V, validating TextDiffuser-2's capacity to achieve a more rational text layout and generation with enhanced diversity. The code and model will be available at https://aka.ms/textdiffuser-2. This paper presents TextDiffuser-2, which leverages language models for visual text rendering: a fine-tuned large language model plans text layout and keywords, and a language model within the diffusion model encodes text position and content at the line level. Prior text-rendering methods that rely on explicit, often character-level, position and content guidance suffer from limited flexibility and automation, constrained layout prediction, and restricted style diversity. The approach fine-tunes a large language model for layout planning, which can automatically generate keywords and supports layout modification through chatting, and conditions the diffusion model on line-level position and text encodings instead of tight character-level guidance. Extensive experiments and user studies involving human participants as well as GPT-4V indicate that TextDiffuser-2 produces more rational text layouts and more diverse text images than previous methods. text rendering, text-to-image generation, diffusion models, large language models, layout planning
2311.16254 Report Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models Samuele Poppi, Tobia Poppi, Federico Cocchi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara Large-scale vision-and-language models, such as CLIP, are typically trained on web-scale data, which can introduce inappropriate content and lead to the development of unsafe and biased behavior. This, in turn, hampers their applicability in sensitive and trustworthy contexts and could raise significant concerns in their adoption. Our research introduces a novel approach to enhancing the safety of vision-and-language models by diminishing their sensitivity to NSFW (not safe for work) inputs. In particular, our methodology seeks to sever "toxic" linguistic and visual concepts, unlearning the linkage between unsafe linguistic or visual items and unsafe regions of the embedding space. We show how this can be done by fine-tuning a CLIP model on synthetic data obtained from a large language model trained to convert between safe and unsafe sentences, and a text-to-image generator. We conduct extensive experiments on the resulting embedding space for cross-modal retrieval, text-to-image, and image-to-text generation, where we show that our model can be remarkably employed with pre-trained generative models. Our source code and trained models are available at: https://github.com/aimagelab/safe-clip. Introduces Safe-CLIP, a fine-tuning methodology to enhance the safety of pre-trained CLIP models by diminishing their sensitivity to NSFW inputs. Addresses the issue of inappropriate content and biased behavior in large-scale vision-and-language models trained on web-scale data, enhancing their applicability in sensitive contexts. Fine-tunes CLIP using a synthetic dataset of safe and unsafe images and texts, generated via a toxic language model and a text-to-image generator. Employs multiple loss functions to redirect inappropriate content to safe regions while preserving the embedding space structure. Safe-CLIP significantly reduces the retrieval of NSFW images and text when using unsafe queries. Significantly reduces the probability of generating NSFW images using Stable Diffusion with both I2P and VISU prompts. Effectively reduces the probability of generating inappropriate textual descriptions by multimodal LLMs (e.g., LLaVA) when provided with NSFW images. Limited guarantee of success, with potential failure cases. Ethical implications of the toxic language model used for dataset generation. trustworthy ai, vision-and-language, nsfw concepts, cross-modal retrieval, text-to-image generation
2311.16122 Report Semantic Generative Augmentations for Few-Shot Counting Perla Doubinsky, Nicolas Audebert, Michel Crucianu, Hervé Le Borgne With the availability of powerful text-to-image diffusion models, recent works have explored the use of synthetic data to improve image classification performances. These works show that it can effectively augment or even replace real data. In this work, we investigate how synthetic data can benefit few-shot class-agnostic counting. This requires to generate images that correspond to a given input number of objects. However, text-to-image models struggle to grasp the notion of count. We propose to rely on a double conditioning of Stable Diffusion with both a prompt and a density map in order to augment a training dataset for few-shot counting. Due to the small dataset size, the fine-tuned model tends to generate images close to the training images. We propose to enhance the diversity of synthesized images by exchanging captions between images thus creating unseen configurations of object types and spatial layout. Our experiments show that our diversified generation strategy significantly improves the counting accuracy of two recent and performing few-shot counting models on FSC147 and CARPK. This paper introduces a novel data augmentation strategy for few-shot object counting that leverages the power of text-to-image diffusion models by conditioning them on both text prompts and density maps. Few-shot object counting suffers from limited data, hindering performance. This work addresses this challenge with a data augmentation strategy based on text-to-image diffusion models tailored for counting tasks. The authors fine-tune a pre-trained Stable Diffusion model using ControlNet to generate new images conditioned on both textual prompts and density maps. To enhance diversity, they propose a caption swapping mechanism based on semantic similarity, generating unseen combinations of objects and spatial layouts. The proposed diverse augmentation strategy significantly improves counting accuracy over traditional augmentation methods on the FSC147 benchmark dataset. The method also improves the generalization capability of the models, as demonstrated by state-of-the-art performance on the CARPK dataset for car counting. Experiments demonstrate the importance of caption similarity-based swapping and the optimal balance between real and synthetic data during training. Changing the object category through caption swapping may lead to inaccurate exemplar bounding boxes if the new object's shape differs significantly. Further exploration is needed for effectively refining exemplar boxes in cases where caption swapping leads to mismatches between the generated object and the original bounding box. few-shot learning, object counting, data augmentation, diffusion models, controlnet
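The caption-swapping step can be sketched as: embed all training captions and, for each image, borrow the most similar caption from a different image to create an unseen object/layout pairing. The embedding source and similarity threshold below are assumptions, not the paper's exact choices.

```python
import numpy as np

def swap_captions(caption_embs, min_sim=0.6):
    """caption_embs: (N, D) L2-normalised caption embeddings.
    For each image i, return the index of the most similar caption from another
    image (an unseen layout/caption pair), or i itself if nothing is similar enough."""
    n = len(caption_embs)
    sim = caption_embs @ caption_embs.T
    np.fill_diagonal(sim, -np.inf)               # never pair a caption with itself
    best = sim.argmax(axis=1)
    keep_own = sim.max(axis=1) < min_sim         # too dissimilar: keep the original caption
    best[keep_own] = np.arange(n)[keep_own]
    return best

# Toy usage with 4 random "caption embeddings".
rng = np.random.default_rng(0)
embs = rng.normal(size=(4, 32))
embs /= np.linalg.norm(embs, axis=1, keepdims=True)
print(swap_captions(embs, min_sim=-1.0))         # with no threshold, every caption swaps
```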
2311.16103 Report Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, Li Yuan Video-based large language models (Video-LLMs) have been recently introduced, targeting both fundamental improvements in perception and comprehension, and a diverse range of user inquiries. In pursuit of the ultimate goal of achieving artificial general intelligence, a truly intelligent Video-LLM model should not only see and understand the surroundings, but also possess human-level commonsense, and make well-informed decisions for the users. To guide the development of such a model, the establishment of a robust and comprehensive evaluation system becomes crucial. To this end, this paper proposes \textit{Video-Bench}, a new comprehensive benchmark along with a toolkit specifically designed for evaluating Video-LLMs. The benchmark comprises 10 meticulously crafted tasks, evaluating the capabilities of Video-LLMs across three distinct levels: Video-exclusive Understanding, Prior Knowledge-based Question-Answering, and Comprehension and Decision-making. In addition, we introduce an automatic toolkit tailored to process model outputs for various tasks, facilitating the calculation of metrics and generating convenient final scores. We evaluate 8 representative Video-LLMs using \textit{Video-Bench}. The findings reveal that current Video-LLMs still fall considerably short of achieving human-like comprehension and analysis of real-world videos, offering valuable insights for future research directions. The benchmark and toolkit are available at: \url{https://github.com/PKU-YuanGroup/Video-Bench}. This paper introduces "Video-Bench," a comprehensive benchmark and toolkit for evaluating Video-LLMs across three levels of capability: Video-exclusive Understanding, Prior Knowledge-based Question-Answering, and Comprehension and Decision-making. A robust evaluation system is crucial for guiding the development of Video-LLMs towards achieving artificial general intelligence, as existing benchmarks lack comprehensiveness in assessing these capabilities. The benchmark comprises 10 meticulously crafted tasks, evaluating various aspects of Video-LLM abilities. An automatic toolkit processes model outputs, calculates metrics, and generates final scores, streamlining the evaluation workflow. Current Video-LLMs excel at summarizing basic video content but struggle with temporal reasoning and detail-oriented tasks. Lack of domain-specific prior knowledge limits Video-LLMs' ability to understand and answer questions requiring external knowledge. Most tested models exhibit limited proficiency in comprehension and decision-making within complex scenarios, suggesting a need for larger-scale training and enhanced multimodal understanding. The reliance on multiple-choice questions, while simplifying evaluation, may not fully capture the nuances of Video-LLM responses. Future work should explore more robust evaluation metrics for long-form text responses and address the need for efficient long video understanding. video-llms, benchmarking, video understanding, multimodal learning, artificial general intelligence
2311.16101 Report How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs Haoqin Tu, Chenhang Cui, Zijun Wang, Yiyang Zhou, Bingchen Zhao, Junlin Han, Wangchunshu Zhou, Huaxiu Yao, Cihang Xie This work focuses on the potential of Vision LLMs (VLLMs) in visual reasoning. Different from prior studies, we shift our focus from evaluating standard performance to introducing a comprehensive safety evaluation suite, covering both out-of-distribution (OOD) generalization and adversarial robustness. For the OOD evaluation, we present two novel VQA datasets, each with one variant, designed to test model performance under challenging conditions. In exploring adversarial robustness, we propose a straightforward attack strategy for misleading VLLMs to produce visual-unrelated responses. Moreover, we assess the efficacy of two jailbreaking strategies, targeting either the vision or language component of VLLMs. Our evaluation of 21 diverse models, ranging from open-source VLLMs to GPT-4V, yields interesting observations: 1) Current VLLMs struggle with OOD texts but not images, unless the visual information is limited; and 2) These VLLMs can be easily misled by deceiving vision encoders only, and their vision-language training often compromise safety protocols. We release this safety evaluation suite at https://github.com/UCSC-VLAA/vllm-safety-benchmark. This paper introduces a comprehensive safety evaluation suite for Vision Large Language Models (VLLMs) encompassing out-of-distribution (OOD) generalization and adversarial robustness. Assessing the safety of VLLMs is crucial for their responsible integration into real-world applications, as existing benchmarks primarily focus on standard performance. The authors propose two novel VQA datasets for OOD evaluation and a simple attack strategy to mislead VLLMs. They also benchmark two jailbreaking attacks targeting vision and language components. Evaluation is performed on 21 models, including open-source VLLMs and GPT-4V. VLLMs excel in comprehending OOD visual content but struggle with OOD textual input, highlighting the importance of language understanding. VLLMs, including GPT-4V, face challenges processing sketch images due to limited information content. Simple attacks targeting CLIP's vision encoder effectively mislead VLLMs, while GPT-4V exhibits a higher tendency to refuse answers to inappropriate inputs. The study primarily focuses on CLIP-based VLLMs, leaving room for future research on other architectures. The proposed attack strategies, while effective, might not encompass the full spectrum of potential vulnerabilities in VLLMs. vision language models, safety evaluation, out-of-distribution generalization, adversarial robustness, jailbreaking attacks
2311.16099 Report GART: Gaussian Articulated Template Models Jiahui Lei, Yufu Wang, Georgios Pavlakos, Lingjie Liu, Kostas Daniilidis We introduce Gaussian Articulated Template Model GART, an explicit, efficient, and expressive representation for non-rigid articulated subject capturing and rendering from monocular videos. GART utilizes a mixture of moving 3D Gaussians to explicitly approximate a deformable subject's geometry and appearance. It takes advantage of a categorical template model prior (SMPL, SMAL, etc.) with learnable forward skinning while further generalizing to more complex non-rigid deformations with novel latent bones. GART can be reconstructed via differentiable rendering from monocular videos in seconds or minutes and rendered in novel poses faster than 150fps. This paper introduces GART, a novel explicit and efficient representation for capturing and rendering non-rigid articulated subjects from monocular videos using Gaussian Mixture Models (GMM). Current implicit methods like NeRFs, though high-quality, suffer from slow rendering speeds. Explicit methods, while efficient, often lack quality. GART bridges this gap by explicitly approximating the implicit radiance field, combining the strengths of both. GART leverages a template model (e.g., SMPL, SMAL) and represents the canonical shape and appearance using GMM. It employs learnable forward skinning for animation and introduces latent bones to capture complex deformations, such as loose clothing. GART achieves state-of-the-art performance on monocular human reconstruction and rendering benchmarks (ZJU-MoCap, People-Snapshot) with superior efficiency. GART demonstrates high fidelity in reconstructing challenging clothing like long dresses from the UBC-Fashion dataset, outperforming baselines like InstantAvatar. GART successfully extends to animal reconstruction, capturing detailed appearances of diverse dog breeds from in-the-wild videos using the D-SMAL template. The method currently relies on the availability of template pose estimators, limiting its applicability to species without readily available estimators. Future work could explore capturing category-level priors from large in-the-wild video collections to generalize beyond single-video fitting. 3d reconstruction, articulated motion, monocular video, gaussian mixture model, differentiable rendering
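GART's core animation step is template-driven forward skinning of the Gaussian centers. The snippet below sketches plain linear blend skinning with per-Gaussian weights, assuming numpy arrays for means, weights, and per-bone transforms; GART additionally learns the weights, covariances, appearance, and extra latent bones, which are omitted here.

```python
import numpy as np

def blend_skinning(means, weights, rotations, translations):
    """Forward linear blend skinning of Gaussian centers.

    means:        (N, 3) canonical Gaussian centers
    weights:      (N, B) per-Gaussian skinning weights over B bones (rows sum to 1)
    rotations:    (B, 3, 3) per-bone rotation matrices for the target pose
    translations: (B, 3) per-bone translations for the target pose
    """
    # Transform every center by every bone: (B, N, 3)
    per_bone = np.einsum("bij,nj->bni", rotations, means) + translations[:, None, :]
    # Blend with the skinning weights: (N, 3)
    return np.einsum("nb,bni->ni", weights, per_bone)

# Toy example: two bones, the second translated along x.
means = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
weights = np.array([[1.0, 0.0], [0.2, 0.8]])
rotations = np.stack([np.eye(3), np.eye(3)])
translations = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0]])
print(blend_skinning(means, weights, rotations, translations))  # [[0,0,0], [1.4,0,0]]
```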
2311.16097 Report CG-HOI: Contact-Guided 3D Human-Object Interaction Generation Christian Diller, Angela Dai We propose CG-HOI, the first method to address the task of generating dynamic 3D human-object interactions (HOIs) from text. We model the motion of both human and object in an interdependent fashion, as semantically rich human motion rarely happens in isolation without any interactions. Our key insight is that explicitly modeling contact between the human body surface and object geometry can be used as strong proxy guidance, both during training and inference. Using this guidance to bridge human and object motion enables generating more realistic and physically plausible interaction sequences, where the human body and corresponding object move in a coherent manner. Our method first learns to model human motion, object motion, and contact in a joint diffusion process, inter-correlated through cross-attention. We then leverage this learned contact for guidance during inference to synthesize realistic and coherent HOIs. Extensive evaluation shows that our joint contact-based human-object interaction approach generates realistic and physically plausible sequences, and we show two applications highlighting the capabilities of our method. Conditioned on a given object trajectory, we can generate the corresponding human motion without re-training, demonstrating strong human-object interdependency learning. Our approach is also flexible, and can be applied to static real-world 3D scene scans. This paper proposes CG-HOI, a novel method for generating dynamic 3D human-object interactions (HOIs) from text descriptions by jointly modeling human motion, object motion, and contact between them. Realistic modeling of human-object interactions is crucial for various applications, but previous methods struggled to generate plausible and coherent interactions due to neglecting the interdependency of human and object motions. CG-HOI utilizes a denoising diffusion process with cross-attention to learn the correlations between human, object, and contact representations. A contact-based object transform weighting scheme ensures object motion is primarily influenced by the body part in closest contact. During inference, a contact-based guidance refines generated sequences for physical plausibility. CG-HOI generates more realistic and physically plausible HOIs compared to baselines, effectively mitigating artifacts like object floating. The method demonstrates strong human-object interdependency learning, enabling conditional generation of human motion given object trajectories without retraining. CG-HOI can be applied to populate static 3D scene scans with realistic HOIs. The method currently focuses on interactions with a single object, limiting its applicability to more complex scenarios with multiple objects. The reliance on expensive 3D HOI data for training and manual text annotations poses challenges for scalability and generalization. 3d human-object interaction, denoising diffusion model, contact modeling, text-to-motion generation, scene population
2311.16096 Report Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling Zhe Li, Zerong Zheng, Lizhen Wang, Yebin Liu Modeling animatable human avatars from RGB videos is a long-standing and challenging problem. Recent works usually adopt MLP-based neural radiance fields (NeRF) to represent 3D humans, but it remains difficult for pure MLPs to regress pose-dependent garment details. To this end, we introduce Animatable Gaussians, a new avatar representation that leverages powerful 2D CNNs and 3D Gaussian splatting to create high-fidelity avatars. To associate 3D Gaussians with the animatable avatar, we learn a parametric template from the input videos, and then parameterize the template on two front \& back canonical Gaussian maps where each pixel represents a 3D Gaussian. The learned template is adaptive to the wearing garments for modeling looser clothes like dresses. Such template-guided 2D parameterization enables us to employ a powerful StyleGAN-based CNN to learn the pose-dependent Gaussian maps for modeling detailed dynamic appearances. Furthermore, we introduce a pose projection strategy for better generalization given novel poses. Overall, our method can create lifelike avatars with dynamic, realistic and generalized appearances. Experiments show that our method outperforms other state-of-the-art approaches. Code: https://github.com/lizhe00/AnimatableGaussians This paper presents Animatable Gaussians, a novel method for creating high-fidelity animatable human avatars from multi-view RGB videos using 3D Gaussian splatting and 2D CNNs. Existing methods struggle to model fine-grained, dynamic details due to the limitations of MLPs in representing implicit functions. This work aims to overcome these limitations by leveraging the strengths of explicit representations and 2D CNNs. The method learns a parametric template from the input videos to capture garment shape. It then parameterizes 3D Gaussians on this template and employs a StyleGAN-based network to predict pose-dependent Gaussian maps. Finally, it utilizes LBS for deforming Gaussians and differentiable splatting for rendering. Creates high-fidelity avatars with detailed dynamic appearances, surpassing existing methods in visual quality. Learns a character-specific template, allowing for accurate animation of complex garments like long dresses. Introduces a pose projection strategy, enhancing generalization to novel, out-of-distribution poses. Limited to entangled modeling of body and clothes, hindering applications like virtual try-on. Requires multi-view input for template reconstruction, limiting its applicability to monocular videos. animatable avatars, 3d gaussian splatting, human modeling, computer vision, deep learning
2311.16090 Report Self-correcting LLM-controlled Diffusion Models Tsung-Han Wu, Long Lian, Joseph E. Gonzalez, Boyi Li, Trevor Darrell Text-to-image generation has witnessed significant progress with the advent of diffusion models. Despite the ability to generate photorealistic images, current text-to-image diffusion models still often struggle to accurately interpret and follow complex input text prompts. In contrast to existing models that aim to generate images only with their best effort, we introduce Self-correcting LLM-controlled Diffusion (SLD). SLD is a framework that generates an image from the input prompt, assesses its alignment with the prompt, and performs self-corrections on the inaccuracies in the generated image. Steered by an LLM controller, SLD turns text-to-image generation into an iterative closed-loop process, ensuring correctness in the resulting image. SLD is not only training-free but can also be seamlessly integrated with diffusion models behind API access, such as DALL-E 3, to further boost the performance of state-of-the-art diffusion models. Experimental results show that our approach can rectify a majority of incorrect generations, particularly in generative numeracy, attribute binding, and spatial relationships. Furthermore, by simply adjusting the instructions to the LLM, SLD can perform image editing tasks, bridging the gap between text-to-image generation and image editing pipelines. We will make our code available for future research and applications. Introduces Self-correcting LLM-controlled Diffusion (SLD), a framework that enhances text-to-image alignment in diffusion models by iteratively identifying and rectifying errors in generated images through LLM-guided object detection and latent space operations. Addresses the limitations of existing text-to-image diffusion models that often struggle to accurately interpret and follow complex prompts, especially those requiring numeracy, spatial relationships, and attribute binding. Employs an LLM parser to extract key objects from user prompts, an open-vocabulary detector to locate objects in the image, and an LLM controller to analyze discrepancies and suggest correction operations (addition, deletion, repositioning, attribute modification) implemented via latent space composition. Significantly improves generation correctness over state-of-the-art diffusion models on complex prompts, as demonstrated by the LMD benchmark. Achieves substantial performance gains on numeracy, attribute binding, and spatial reasoning tasks. Effectively unifies text-to-image generation and image editing tasks within a single framework. Faces challenges with objects of complex shapes due to limitations in the object segmentation module. Future work includes exploring the integration of advanced LMMs for more streamlined image assessment and editing. text-to-image generation, diffusion models, large language models, image editing, self-correction
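The check-locate-rectify loop can be pictured as a small controller that diffs detector output against the parsed prompt and emits correction operations. The sketch below is a simplified stand-in: `detect` and `edit` are hypothetical callables for the open-vocabulary detector and the latent-space editor, and only count mismatches (add/delete) are handled, whereas SLD also covers repositioning and attribute changes.

```python
from dataclasses import dataclass

@dataclass
class Box:
    name: str
    xyxy: tuple   # (x0, y0, x1, y1); used for repositioning in the full method

def check(required, detected):
    """Diff the prompt's object counts against detections and list corrections."""
    counts = {}
    for box in detected:
        counts[box.name] = counts.get(box.name, 0) + 1
    ops = []
    for name, want in required.items():
        have = counts.get(name, 0)
        if have < want:
            ops.extend(("add", name) for _ in range(want - have))
        elif have > want:
            ops.extend(("delete", name) for _ in range(have - want))
    return ops

def self_correct(image, required, detect, edit, max_rounds=3):
    """Closed loop: detect objects, compare with the prompt, apply corrections."""
    for _ in range(max_rounds):
        ops = check(required, detect(image))
        if not ops:
            break
        image = edit(image, ops)   # e.g. latent-space addition / deletion
    return image

required = {"cat": 2, "dog": 1}
detected = [Box("cat", (0, 0, 10, 10)), Box("dog", (20, 0, 30, 10))]
print(check(required, detected))   # [('add', 'cat')]
```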
2311.16043 Report Relightable 3D Gaussian: Real-time Point Cloud Relighting with BRDF Decomposition and Ray Tracing Jian Gao, Chun Gu, Youtian Lin, Hao Zhu, Xun Cao, Li Zhang, Yao Yao We present a novel differentiable point-based rendering framework for material and lighting decomposition from multi-view images, enabling editing, ray-tracing, and real-time relighting of the 3D point cloud. Specifically, a 3D scene is represented as a set of relightable 3D Gaussian points, where each point is additionally associated with a normal direction, BRDF parameters, and incident lights from different directions. To achieve robust lighting estimation, we further divide incident lights of each point into global and local components, as well as view-dependent visibilities. The 3D scene is optimized through the 3D Gaussian Splatting technique while BRDF and lighting are decomposed by physically-based differentiable rendering. Moreover, we introduce an innovative point-based ray-tracing approach based on the bounding volume hierarchy for efficient visibility baking, enabling real-time rendering and relighting of 3D Gaussian points with accurate shadow effects. Extensive experiments demonstrate improved BRDF estimation and novel view rendering results compared to state-of-the-art material estimation approaches. Our framework showcases the potential to revolutionize the mesh-based graphics pipeline with a relightable, traceable, and editable rendering pipeline solely based on point cloud. Project page:https://nju-3dv.github.io/projects/Relightable3DGaussian/. This paper introduces a novel differentiable point-based rendering framework named Relightable 3D Gaussian for material and lighting decomposition from multi-view images. This enables editing, ray-tracing, and real-time relighting of the reconstructed 3D point cloud. The proposed framework offers a potential alternative to the mesh-based graphics pipeline with a relightable, traceable, and editable rendering pipeline solely based on point cloud. The framework represents a 3D scene as a set of relightable 3D Gaussian points, each associated with normal direction, BRDF parameters, and incident lights. It optimizes the scene representation using a combination of 3D Gaussian Splatting, physically-based differentiable rendering, and a novel point-based ray tracing approach based on the bounding volume hierarchy. The method achieves improved BRDF estimation compared to existing material estimation approaches. It enables high-quality novel view synthesis, outperforming several state-of-the-art methods. The framework allows for real-time rendering and relighting of scenes with realistic shadow effects. The method struggles with unbounded scenes and requires object masks during optimization. The integration of multi-view stereo (MVS) into the optimization process for more accurate geometry is left for future work. differentiable rendering, point-based rendering, material and lighting decomposition, ray tracing, 3d gaussian splatting
2311.16037 Report GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions Jiemin Fang, Junjie Wang, Xiaopeng Zhang, Lingxi Xie, Qi Tian Recently, impressive results have been achieved in 3D scene editing with text instructions based on a 2D diffusion model. However, current diffusion models primarily generate images by predicting noise in the latent space, and the editing is usually applied to the whole image, which makes it challenging to perform delicate, especially localized, editing for 3D scenes. Inspired by recent 3D Gaussian splatting, we propose a systematic framework, named GaussianEditor, to edit 3D scenes delicately via 3D Gaussians with text instructions. Benefiting from the explicit property of 3D Gaussians, we design a series of techniques to achieve delicate editing. Specifically, we first extract the region of interest (RoI) corresponding to the text instruction, aligning it to 3D Gaussians. The Gaussian RoI is further used to control the editing process. Our framework can achieve more delicate and precise editing of 3D scenes than previous methods while enjoying much faster training speed, i.e. within 20 minutes on a single V100 GPU, more than twice as fast as Instruct-NeRF2NeRF (45 minutes -- 2 hours). This paper proposes GaussianEditor, a novel framework to edit 3D scenes delicately using text instructions and 3D Gaussian splatting. Existing 3D scene editing methods using 2D diffusion models lack the ability to localize editing regions, making it difficult to perform delicate and precise 3D scene editing. The method consists of three steps: 1) Region of Interest (RoI) extraction from text instruction, 2) Aligning the instruction RoI to 3D Gaussians through an image grounding model and training, 3) Editing the original 3D Gaussians within the obtained Gaussian RoI by a 2D diffusion model. GaussianEditor enables separate foreground and background editing, even in complex multi-object scenes. It achieves more delicate and precise 3D scene editing compared to previous methods like Instruct-NeRF2NeRF. The method exhibits fast training time, completing within 20 minutes on a single V100 GPU. The scene description generation might be inaccurate when descriptions from different views of the same object vary significantly. The system's performance is limited by the accuracy of the grounding segmentation and diffusion models. 3d scene editing, text-guided editing, 3d gaussian splatting, region of interest, diffusion models
2311.15980 Report Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion Yuanxun Lu, Jingyang Zhang, Shiwei Li, Tian Fang, David McKinnon, Yanghai Tsin, Long Quan, Xun Cao, Yao Yao Recent advances in generative AI have unveiled significant potential for the creation of 3D content. However, current methods either apply a pre-trained 2D diffusion model with the time-consuming score distillation sampling (SDS), or a direct 3D diffusion model trained on limited 3D data losing generation diversity. In this work, we approach the problem by employing a multi-view 2.5D diffusion fine-tuned from a pre-trained 2D diffusion model. The multi-view 2.5D diffusion directly models the structural distribution of 3D data, while still maintaining the strong generalization ability of the original 2D diffusion model, filling the gap between 2D diffusion-based and direct 3D diffusion-based methods for 3D content generation. During inference, multi-view normal maps are generated using the 2.5D diffusion, and a novel differentiable rasterization scheme is introduced to fuse the almost consistent multi-view normal maps into a consistent 3D model. We further design a normal-conditioned multi-view image generation module for fast appearance generation given the 3D geometry. Our method is a one-pass diffusion process and does not require any SDS optimization as post-processing. We demonstrate through extensive experiments that, our direct 2.5D generation with the specially-designed fusion scheme can achieve diverse, mode-seeking-free, and high-fidelity 3D content generation in only 10 seconds. Project page: https://nju-3dv.github.io/projects/direct25. This paper introduces a novel approach to rapidly generate textured 3D meshes from text prompts, leveraging fine-tuned multi-view 2.5D diffusion models. This work bridges the gap between computationally expensive Score Distillation Sampling (SDS) methods and limited generalizability of direct 3D diffusion models for text-to-3D generation. The approach uses two fine-tuned diffusion models: one for multi-view normal map generation and another for texture generation conditioned on the normals. A differentiable rasterization scheme fuses the multi-view normals into a 3D mesh, and texture mapping completes the process. The method generates diverse, high-fidelity 3D content in just 10 seconds, significantly faster than SDS-based techniques. It exhibits strong generalization to complex text prompts, surpassing direct 3D diffusion methods. The two-stage architecture allows for geometry-appearance disentanglement, enabling flexible content control. The limited number of views (four) may result in incomplete reconstruction of unseen areas, such as concavities. Texture generation quality is constrained by the training data and could be enhanced with more sophisticated techniques. text-to-3d generation, diffusion models, multi-view synthesis, differentiable rasterization, 2.5d representation
2311.15864 Report InterControl: Generate Human Motion Interactions by Controlling Every Joint Zhenzhi Wang, Jingbo Wang, Yixuan Li, Dahua Lin, Bo Dai Text-conditioned human motion synthesis has made remarkable progress with the emergence of diffusion models in recent research. However, the majority of these motion diffusion models are primarily designed for a single character and overlook multi-human interactions. In our approach, we strive to explore this problem by synthesizing human motion with interactions for a group of characters of any size. The key aspect of our approach is the adaptation of human-wise interactions as pairs of human joints that can be either in contact or separated by a desired distance. In contrast to existing methods that necessitate training motion generation models on multi-human motion datasets with a fixed number of characters, our approach inherently possesses the flexibility to model human interactions involving an arbitrary number of individuals, thereby transcending the limitations imposed by the training data. We introduce a novel controllable motion generation method, InterControl, to encourage the synthesized motions maintaining the desired distance between joint pairs. It consists of a motion controller and an inverse kinematics guidance module that realistically and accurately aligns the joints of synthesized characters to the desired location. Furthermore, we demonstrate that the distance between joint pairs for human-wise interactions can be generated using an off-the-shelf Large Language Model (LLM). Experimental results highlight the capability of our framework to generate interactions with multiple human characters and its potential to work with off-the-shelf physics-based character simulators. InterControl generates multi-person interactions using a single-person motion generation model trained on single-person data by precisely controlling the position of every joint in every person at any time, conditioned on text prompts and joint relations. This approach overcomes the limitations of previous methods that require multi-human motion datasets with fixed numbers of characters and struggle with precise spatial control for realistic interactions. InterControl integrates a Motion ControlNet (inspired by ControlNet) to process spatial control signals and an Inverse Kinematics (IK) Guidance module to align the synthesized motions to the desired locations. It uses joint contact pairs, automatically generated from text prompts by an off-the-shelf LLM, as control signals for interaction generation. InterControl achieves state-of-the-art performance in semantic-level metrics (FID, R-precision, Diversity) on single-person motion generation. It demonstrates superior accuracy in spatial control metrics (Trajectory error, Location error, Average error) compared to previous spatially controllable methods. InterControl generates realistic multi-person interactions, confirmed by low spatial errors and a strong preference (80.4%) over prior work in a user study. InterControl's interaction definition currently relies on distance and orientation, potentially limiting the complexity of interactions. The plausibility of generated interactions depends on the quality of the single-person motion data and the LLM's ability to infer joint contact pairs consistent with interaction descriptions. motion synthesis, human interaction generation, diffusion models, controllable motion generation, inverse kinematics
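The interaction constraint boils down to an objective on distances between selected joint pairs, whose gradient is used to steer the synthesized motion. A minimal PyTorch sketch of that objective and a single guidance-style gradient step follows; the shapes, joint indices, and step size are illustrative, and the real method applies this through a Motion ControlNet and IK guidance inside the diffusion sampler.

```python
import torch

def contact_loss(joints_a, joints_b, pairs, target_dist):
    """Penalize deviation from desired distances between joint pairs.

    joints_a, joints_b: (T, J, 3) joint positions of two characters over T frames
    pairs:              (P, 2) long tensor of (joint_in_a, joint_in_b) indices
    target_dist:        (P,) desired distance per pair (0 for contact)
    """
    pa = joints_a[:, pairs[:, 0]]               # (T, P, 3)
    pb = joints_b[:, pairs[:, 1]]               # (T, P, 3)
    dist = torch.linalg.norm(pa - pb, dim=-1)   # (T, P)
    return ((dist - target_dist) ** 2).mean()

# Guidance-like sketch: nudge one character's joints toward the desired contact.
joints_a = torch.randn(60, 22, 3, requires_grad=True)
joints_b = torch.randn(60, 22, 3)
pairs = torch.tensor([[20, 20]])                # e.g. hand touches hand
loss = contact_loss(joints_a, joints_b, pairs, torch.tensor([0.0]))
loss.backward()
with torch.no_grad():
    joints_a -= 0.1 * joints_a.grad             # one gradient step toward contact
```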
2311.15841 Report Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation Siteng Huang, Biao Gong, Yutong Feng, Xi Chen, Yuqian Fu, Yu Liu, Donglin Wang This study focuses on a novel task in text-to-image (T2I) generation, namely action customization. The objective of this task is to learn the co-existing action from limited data and generalize it to unseen humans or even animals. Experimental results show that existing subject-driven customization methods fail to learn the representative characteristics of actions and struggle in decoupling actions from context features, including appearance. To overcome the preference for low-level features and the entanglement of high-level features, we propose an inversion-based method Action-Disentangled Identifier (ADI) to learn action-specific identifiers from the exemplar images. ADI first expands the semantic conditioning space by introducing layer-wise identifier tokens, thereby increasing the representational richness while distributing the inversion across different features. Then, to block the inversion of action-agnostic features, ADI extracts the gradient invariance from the constructed sample triples and masks the updates of irrelevant channels. To comprehensively evaluate the task, we present an ActionBench that includes a variety of actions, each accompanied by meticulously selected samples. Both quantitative and qualitative results show that our ADI outperforms existing baselines in action-customized T2I generation. Our project page is at https://adi-t2i.github.io/ADI. This paper introduces Action Customization for text-to-image generation, enabling the learning of specific actions from limited examples and their transfer to new subjects, including humans and animals. Generating images with specific actions is challenging due to the difficulty in providing precise text descriptions and the limitations of existing controllable generation methods relying on skeletons or sketches. The paper proposes ADI, which expands the semantic conditioning space with layer-wise identifier tokens and utilizes gradient masking to decouple action-related features from action-agnostic information like appearance. ADI achieves high accuracy in generating specified actions while maintaining the fidelity of generated subjects, outperforming baselines like Stable Diffusion and ControlNet. The learned action identifiers can be effectively combined with various characters and animals to generate high-quality images, demonstrating generalization ability. Ablation studies confirm the effectiveness of layer-wise identifier tokens and gradient masking strategies in improving action customization performance. The optimal masking ratio in ADI might need to be adjusted for different actions to achieve the best performance. Future work could explore incorporating action dynamics and temporal information to enhance the expressiveness of generated actions. text-to-image generation, action customization, diffusion models, gradient masking, controllable image synthesis
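One way to picture the gradient-masking idea: channels of the identifier embedding whose gradients stay stable across context-varied samples of the same action are treated as action-relevant, and only those channels are updated. The sketch below implements that intuition with a simple variance criterion; the keep ratio and the exact invariance measure are assumptions, and ADI's actual construction uses full sample triples and layer-wise identifier tokens.

```python
import torch

def invariance_mask(grads_context_varied, keep_ratio=0.5):
    """Keep only embedding channels whose gradients are stable across
    samples that share the same action but differ in appearance/context.

    grads_context_varied: (S, D) gradients of the identifier embedding
                          from S samples of the same action
    returns: (D,) binary mask over embedding channels
    """
    variance = grads_context_varied.var(dim=0)        # (D,)
    k = int(keep_ratio * variance.numel())
    keep = variance.topk(k, largest=False).indices    # most invariant channels
    mask = torch.zeros_like(variance)
    mask[keep] = 1.0
    return mask

# Usage inside a textual-inversion-style update (sketch only):
embedding = torch.randn(768, requires_grad=True)
grads = torch.randn(3, 768)                 # gradients from one sample triple
mask = invariance_mask(grads)
with torch.no_grad():
    embedding -= 1e-3 * (grads.mean(dim=0) * mask)    # masked update
```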
2311.15813 Report FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax Yu Lu, Linchao Zhu, Hehe Fan, Yi Yang Text-to-video (T2V) generation is a rapidly growing research area that aims to translate the scenes, objects, and actions within complex video text into a sequence of coherent visual frames. We present FlowZero, a novel framework that combines Large Language Models (LLMs) with image diffusion models to generate temporally-coherent videos. FlowZero uses LLMs to understand complex spatio-temporal dynamics from text, where LLMs can generate a comprehensive dynamic scene syntax (DSS) containing scene descriptions, object layouts, and background motion patterns. These elements in DSS are then used to guide the image diffusion model for video generation with smooth object motions and frame-to-frame coherence. Moreover, FlowZero incorporates an iterative self-refinement process, enhancing the alignment between the spatio-temporal layouts and the textual prompts for the videos. To enhance global coherence, we propose enriching the initial noise of each frame with motion dynamics to control the background movement and camera motion adaptively. By using spatio-temporal syntaxes to guide the diffusion process, FlowZero achieves improvement in zero-shot video synthesis, generating coherent videos with vivid motion. FlowZero, a novel framework that combines LLMs with image diffusion models to generate temporally coherent videos by converting text prompts into dynamic scene syntax (DSS) including scene descriptions, object layouts, and background motion patterns. Generating coherent dynamic visual scenes in videos from text prompts remains challenging due to the succinct and abstract nature of video text prompts. FlowZero uses LLMs to generate DSS, employs iterative self-refinement to ensure layout accuracy, and introduces motion-guided noise shifting to enhance global coherence. A modified U-Net with cross-attention mechanisms synthesizes the video frames. FlowZero generates videos with accurate object motion and transformations, surpassing existing zero-shot and some training-based methods. Self-refinement process significantly improves the alignment of generated layouts with text prompts, enhancing spatial and temporal accuracy. Motion-guided noise shifting effectively controls background motion, leading to smoother and more coherent video synthesis. The framework currently relies on pre-defined motion directions for background motion. Further research is needed to explore the generation of videos with longer durations and more complex scenes. text-to-video generation, large language models, diffusion models, dynamic scene syntax, temporal coherence
2311.15776 Report Stable Segment Anything Model Qi Fan, Xin Tao, Lei Ke, Mingqiao Ye, Yuan Zhang, Pengfei Wan, Zhongyuan Wang, Yu-Wing Tai, Chi-Keung Tang The Segment Anything Model (SAM) achieves remarkable promptable segmentation given high-quality prompts which, however, often require good skills to specify. To make SAM robust to casual prompts, this paper presents the first comprehensive analysis on SAM's segmentation stability across a diverse spectrum of prompt qualities, notably imprecise bounding boxes and insufficient points. Our key finding reveals that given such low-quality prompts, SAM's mask decoder tends to activate image features that are biased towards the background or confined to specific object parts. To mitigate this issue, our key idea consists of calibrating solely SAM's mask attention by adjusting the sampling locations and amplitudes of image features, while the original SAM model architecture and weights remain unchanged. Consequently, our deformable sampling plugin (DSP) enables SAM to adaptively shift attention to the prompted target regions in a data-driven manner, facilitated by our effective robust training strategy (RTS). During inference, dynamic routing plugin (DRP) is proposed that toggles SAM between the deformable and regular grid sampling modes, conditioned on the input prompt quality. Thus, our solution, termed Stable-SAM, offers several advantages: 1) improved SAM's segmentation stability across a wide range of prompt qualities, while 2) retaining SAM's powerful promptable segmentation efficiency and generality, with 3) minimal learnable parameters (0.08 M) and fast adaptation (by 1 training epoch). Extensive experiments across multiple datasets validate the effectiveness and advantages of our approach, underscoring Stable-SAM as a more robust solution for segmenting anything. Codes will be released upon acceptance. https://github.com/fanq15/Stable-SAM The paper introduces Stable-SAM, a novel method to enhance the robustness of the Segment Anything Model (SAM) to inaccurate or insufficient prompts. SAM's performance heavily relies on high-quality prompts, which are often difficult to obtain in real-world applications. This limits SAM's practical use in scenarios with casual or imprecise user inputs. The paper proposes a Deformable Sampling Plugin (DSP) that calibrates SAM's mask attention by adjusting the sampling positions and amplitudes of image features based on a learnable offset network. Additionally, a Dynamic Routing Plugin (DRP) is introduced to toggle between DSP and regular grid sampling based on prompt quality. A robust training strategy (RTS) incorporating diverse prompt qualities further enhances the model's stability. Stable-SAM significantly improves SAM's segmentation accuracy and stability across various prompt qualities, particularly for imprecise boxes and sparse points. Stable-SAM maintains SAM's zero-shot generalization ability and achieves competitive performance on multiple benchmarks, including MS COCO and SGinW. Stable-SAM exhibits strong model scalability, requiring minimal learnable parameters (0.08M) and achieving fast adaptation with only one epoch of training. The spatial attention mechanism is not as effective as the proposed deformable sampling plugin in adapting SAM to handle suboptimal prompts. The robust training strategy, while improving stability, slightly compromises performance with high-quality prompts. segment anything model, deformable attention, robust segmentation, zero-shot learning, prompt engineering
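The deformable sampling plugin amounts to resampling the image feature map at offset locations predicted by a tiny, zero-initialized head, leaving the frozen SAM weights untouched. A minimal PyTorch sketch of this idea is below; the head design and placement are assumptions, and the real DSP also predicts sampling amplitudes and is paired with the dynamic routing plugin.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSampling(nn.Module):
    """Resample a feature map at offset locations predicted from the features
    themselves, without touching the backbone weights."""

    def __init__(self, channels):
        super().__init__()
        # Tiny offset head: (dx, dy) per spatial location, zero-initialized
        # so the module starts as identity sampling.
        self.offset_head = nn.Conv2d(channels, 2, kernel_size=3, padding=1)
        nn.init.zeros_(self.offset_head.weight)
        nn.init.zeros_(self.offset_head.bias)

    def forward(self, feats):
        n, c, h, w = feats.shape
        # Base sampling grid in normalized [-1, 1] coordinates.
        ys = torch.linspace(-1, 1, h, device=feats.device)
        xs = torch.linspace(-1, 1, w, device=feats.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack((gx, gy), dim=-1).expand(n, h, w, 2)
        offsets = self.offset_head(feats).permute(0, 2, 3, 1)   # (N, H, W, 2)
        grid = (base + offsets).clamp(-1, 1)
        return F.grid_sample(feats, grid, mode="bilinear", align_corners=True)

feats = torch.randn(1, 256, 64, 64)
print(DeformableSampling(256)(feats).shape)   # torch.Size([1, 256, 64, 64])
```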
2311.15773 Report Check, Locate, Rectify: A Training-Free Layout Calibration System for Text-to-Image Generation Biao Gong, Siteng Huang, Yutong Feng, Shiwei Zhang, Yuyuan Li, Yu Liu Diffusion models have recently achieved remarkable progress in generating realistic images. However, challenges remain in accurately understanding and synthesizing the layout requirements in the textual prompts. To align the generated image with layout instructions, we present a training-free layout calibration system SimM that intervenes in the generative process on the fly during inference time. Specifically, following a "check-locate-rectify" pipeline, the system first analyses the prompt to generate the target layout and compares it with the intermediate outputs to automatically detect errors. Then, by moving the located activations and making intra- and inter-map adjustments, the rectification process can be performed with negligible computational overhead. To evaluate SimM over a range of layout requirements, we present a benchmark SimMBench that compensates for the lack of superlative spatial relations in existing datasets. And both quantitative and qualitative results demonstrate the effectiveness of the proposed SimM in calibrating the layout inconsistencies. Our project page is at https://simm-t2i.github.io/SimM. This paper presents SimM, a training-free layout calibration system for text-to-image generation that aligns generated images with layout instructions in textual prompts. Most text-to-image generators struggle to accurately understand and interpret textual layout instructions, compromising the quality and fidelity of generated images. SimM follows a "check-locate-rectify" pipeline. It checks for layout requirements and discrepancies, locates misplaced objects in intermediate cross-attention maps, and rectifies the activations by transferring them to target regions and performing intra-/inter-map activation adjustments. SimM achieves state-of-the-art generation accuracy on both DrawBench and a newly proposed benchmark focusing on superlative spatial relations. The system effectively rectifies layout inconsistencies while maintaining excellent image quality. SimM operates in real-time with negligible computational overhead compared to training-based or large language model-based layout control methods. A single adjustment strength parameter may not be optimal for all generation scenarios, leading to potential errors in complex layouts. The current implementation focuses on single-view image generation and could be extended to multi-view generation. text-to-image generation, layout calibration, diffusion models, spatial relations, real-time system
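The rectification step can be pictured as literally moving cross-attention mass from the wrongly activated region to the target region and renormalizing. The toy numpy function below does exactly that for a single token's attention map; the box handling, resizing, and normalization are simplifications of SimM's intra- and inter-map adjustments.

```python
import numpy as np

def move_activation(attn_map, src_box, dst_box, boost=1.0):
    """Move the activation of one region of a cross-attention map to a target
    region, then rescale so the map keeps its original total mass.

    attn_map: (H, W) cross-attention map of one text token
    src_box, dst_box: (x0, y0, x1, y1) in map coordinates
    """
    out = attn_map.copy()
    sx0, sy0, sx1, sy1 = src_box
    dx0, dy0, dx1, dy1 = dst_box
    patch = out[sy0:sy1, sx0:sx1].copy()
    out[sy0:sy1, sx0:sx1] = patch.min()                    # suppress the wrong location
    resized = np.resize(patch, (dy1 - dy0, dx1 - dx0))     # crude size adaptation
    out[dy0:dy1, dx0:dx1] = np.maximum(out[dy0:dy1, dx0:dx1], boost * resized)
    return out * (attn_map.sum() / max(out.sum(), 1e-8))   # keep total attention mass

attn = np.random.rand(16, 16)
moved = move_activation(attn, src_box=(0, 0, 4, 4), dst_box=(12, 12, 16, 16))
print(abs(moved.sum() - attn.sum()) < 1e-6)   # True: mass preserved
```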
2311.15744 Report One More Step: A Versatile Plug-and-Play Module for Rectifying Diffusion Schedule Flaws and Enhancing Low-Frequency Controls Minghui Hu, Jianbin Zheng, Chuanxia Zheng, Chaoyue Wang, Dacheng Tao, Tat-Jen Cham It is well known that many open-released foundational diffusion models have difficulty in generating images that substantially depart from average brightness, despite such images being present in the training data. This is due to an inconsistency: while denoising starts from pure Gaussian noise during inference, the training noise schedule retains residual data even in the final timestep distribution, due to difficulties in numerical conditioning in mainstream formulation, leading to unintended bias during inference. To mitigate this issue, certain $\epsilon$-prediction models are combined with an ad-hoc offset-noise methodology. In parallel, some contemporary models have adopted zero-terminal SNR noise schedules together with $\mathbf{v}$-prediction, which necessitate major alterations to pre-trained models. However, such changes risk destabilizing a large multitude of community-driven applications anchored on these pre-trained models. In light of this, our investigation revisits the fundamental causes, leading to our proposal of an innovative and principled remedy, called One More Step (OMS). By integrating a compact network and incorporating an additional simple yet effective step during inference, OMS elevates image fidelity and harmonizes the dichotomy between training and inference, while preserving original model parameters. Once trained, various pre-trained diffusion models with the same latent domain can share the same OMS module. This paper proposes "One More Step" (OMS), a plug-and-play method to improve image fidelity in pre-trained diffusion models without modifying their parameters. Existing diffusion models often generate images with average brightness due to a discrepancy in terminal noise distribution between training and inference. OMS introduces a compact, text-conditional network that maps pure Gaussian noise to the data-adulterated noise expected by pre-trained models at the start of sampling. OMS enables generation of images with a wider range of brightness levels. The method is adaptable to various diffusion models and can share the same module across models with the same latent domain. Modifying prompts in OMS allows control over low-frequency image aspects like brightness and color. Integrating OMS into the student model through distillation could reduce computational cost. Further exploration of OMS integration during model training from scratch or fine-tuning could be beneficial. diffusion models, text-to-image synthesis, noise schedule, image fidelity, one more step
2311.15732 Report GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition? Wenhao Wu, Huanjin Yao, Mengxi Zhang, Yuxin Song, Wanli Ouyang, Jingdong Wang This paper does not present a novel method. Instead, it delves into an essential, yet must-know baseline in light of the latest advancements in Generative Artificial Intelligence (GenAI): the utilization of GPT-4 for visual understanding. Our study centers on the evaluation of GPT-4's linguistic and visual capabilities in zero-shot visual recognition tasks: Firstly, we explore the potential of its generated rich textual descriptions across various categories to enhance recognition performance without any training. Secondly, we evaluate GPT-4's visual proficiency in directly recognizing diverse visual content. We conducted extensive experiments to systematically evaluate GPT-4's performance across images, videos, and point clouds, using 16 benchmark datasets to measure top-1 and top-5 accuracy. Our findings show that GPT-4, enhanced with rich linguistic descriptions, significantly improves zero-shot recognition, offering an average top-1 accuracy increase of 7% across all datasets. GPT-4 excels in visual recognition, outshining OpenAI-CLIP's ViT-L and rivaling EVA-CLIP's ViT-E, particularly in video datasets HMDB-51 and UCF-101, where it leads by 22% and 9%, respectively. We hope this research contributes valuable data points and experience for future studies. We release our code at https://github.com/whwu95/GPT4Vis. This paper presents a comprehensive evaluation of GPT-4's linguistic and visual capabilities for zero-shot visual recognition across images, videos, and point clouds. This evaluation is important because it provides quantitative insights into GPT-4's visual understanding abilities, a crucial aspect of multimodal AI development. The authors evaluate GPT-4's performance on 16 benchmark datasets using two approaches: 1) Leveraging GPT-4 to generate rich textual descriptions to enhance CLIP-based zero-shot recognition. 2) Directly evaluating GPT-4V's visual recognition accuracy. GPT-4's generated descriptions consistently improve zero-shot recognition, achieving an average 7% top-1 accuracy gain across all datasets. GPT-4V demonstrates strong visual recognition capabilities, rivaling or exceeding EVA-CLIP's ViT-E, particularly on video datasets like UCF-101 and HMDB-51. Both GPT-4-enhanced CLIP and GPT-4V struggle with tasks heavily reliant on temporal modeling, like Something-Something V1. The study focuses solely on visual recognition, neglecting other important vision tasks like object detection. The prompting strategy for GPT-4V is basic and may be suboptimal, potentially limiting performance. gpt-4, zero-shot learning, visual recognition, multimodal ai, computer vision
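The description-based recognition path works by ensembling CLIP-style text embeddings of several GPT-4-generated descriptions per class into one classifier vector. The sketch below shows that ensembling and the cosine-similarity classification step; `fake_encode_text` is a stand-in so the example runs without a CLIP checkpoint, and the prompt format is an assumption.

```python
import torch

def build_zero_shot_weights(encode_text, descriptions_per_class):
    """Average normalized text embeddings of several descriptions per class
    into one classifier weight vector per class.

    encode_text: callable mapping a list of strings to an (N, D) tensor
    descriptions_per_class: {class_name: [description, ...]} e.g. GPT-4 outputs
    """
    weights = []
    for name, descriptions in descriptions_per_class.items():
        emb = encode_text([f"{name}, {d}" for d in descriptions])
        emb = emb / emb.norm(dim=-1, keepdim=True)
        mean = emb.mean(dim=0)
        weights.append(mean / mean.norm())
    return torch.stack(weights)                 # (num_classes, D)

def classify(image_features, class_weights):
    """Cosine-similarity zero-shot classification."""
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    return (image_features @ class_weights.T).argmax(dim=-1)

def fake_encode_text(texts):
    """Placeholder encoder so the sketch runs without a real CLIP model."""
    torch.manual_seed(abs(hash(tuple(texts))) % (2 ** 31))
    return torch.randn(len(texts), 512)

w = build_zero_shot_weights(fake_encode_text, {
    "golden retriever": ["a friendly dog with a dense golden coat", "floppy ears"],
    "tabby cat": ["a cat with striped grey-brown fur", "green eyes"],
})
print(classify(torch.randn(4, 512), w))         # per-image predicted class index
```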
2311.15707 Report SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation Jiehong Lin, Lihua Liu, Dekun Lu, Kui Jia Zero-shot 6D object pose estimation involves the detection of novel objects with their 6D poses in cluttered scenes, presenting significant challenges for model generalizability. Fortunately, the recent Segment Anything Model (SAM) has showcased remarkable zero-shot transfer performance, which provides a promising solution to tackle this task. Motivated by this, we introduce SAM-6D, a novel framework designed to realize the task through two steps, including instance segmentation and pose estimation. Given the target objects, SAM-6D employs two dedicated sub-networks, namely Instance Segmentation Model (ISM) and Pose Estimation Model (PEM), to perform these steps on cluttered RGB-D images. ISM takes SAM as an advanced starting point to generate all possible object proposals and selectively preserves valid ones through meticulously crafted object matching scores in terms of semantics, appearance and geometry. By treating pose estimation as a partial-to-partial point matching problem, PEM performs a two-stage point matching process featuring a novel design of background tokens to construct dense 3D-3D correspondence, ultimately yielding the pose estimates. Without bells and whistles, SAM-6D outperforms the existing methods on the seven core datasets of the BOP Benchmark for both instance segmentation and pose estimation of novel objects. SAM-6D, a novel framework for zero-shot 6D object pose estimation using RGB-D images, leveraging the Segment Anything Model (SAM) for enhanced proposal generation and a two-stage point matching process for accurate pose prediction. Zero-shot 6D object pose estimation is crucial for real-world applications but challenging due to the need for model generalizability to novel objects. SAM-6D consists of two sub-networks: ISM leverages SAM for proposal generation and introduces an object matching score based on semantics, appearance, and geometry for proposal selection. PEM formulates pose estimation as a partial-to-partial point matching problem, using background tokens and a two-stage matching process with novel Sparse-to-Dense Point Transformers for accurate pose calculation. Outperforms existing methods in both instance segmentation and pose estimation of novel objects on seven BOP benchmark datasets. Proposed object matching score effectively identifies proposals corresponding to novel objects. Two-stage point matching process with background tokens and Sparse-to-Dense Point Transformers enables accurate pose estimation even with sparse correspondence. Reliance on depth information may limit applicability in scenarios where depth sensing is unreliable. Computational cost, especially with SAM-based segmentation, may hinder real-time performance in certain applications. 6d object pose estimation, zero-shot learning, segment anything model, point matching, instance segmentation
2311.15658 Report Regularization by Texts for Latent Diffusion Inverse Solvers Jeongsol Kim, Geon Yeong Park, Hyungjin Chung, Jong Chul Ye The recent advent of diffusion models has led to significant progress in solving inverse problems, leveraging these models as effective generative priors. Nonetheless, there remain challenges related to the ill-posed nature of such problems, often due to inherent ambiguities in measurements or intrinsic system symmetries. To address this, drawing inspiration from the human ability to resolve visual ambiguities through perceptual biases, here we introduce a novel latent diffusion inverse solver by regularization by texts (TReg). Specifically, TReg applies the textual description of the preconception of the solution during the reverse diffusion sampling, of which the description is dynamically reinforced through null-text optimization for adaptive negation. Our comprehensive experimental results demonstrate that TReg successfully mitigates ambiguity in the inverse problems, enhancing their effectiveness and accuracy. Introduces "Regularization by Text" (TReg), a novel latent diffusion inverse solver that uses textual descriptions to reduce ambiguity in inverse problems. Diffusion-based inverse solvers, while powerful, often struggle with inherent ambiguities in measurements. TReg aims to bridge this gap by incorporating human-like perceptual biases through textual descriptions. TReg integrates textual descriptions during the reverse diffusion sampling process using an adaptive negation method. This method dynamically refines the textual guidance through null-text optimization, ensuring alignment with the evolving image reconstruction. TReg successfully mitigates ambiguity in inverse problems, leading to more consistent and accurate solutions. Quantitative evaluations demonstrate superior performance in super-resolution and deblurring tasks compared to baseline methods, exhibiting lower LPIPS and y-MSE values. Qualitative results showcase TReg's ability to generate high-fidelity reconstructions that adhere to both the provided text prompts and the measurement data. The effectiveness of TReg can be limited by the specificity and accuracy of the provided text prompt. Identifying informative text prompts solely from severely degraded measurements in real-world applications poses a challenge. inverse problems, text regularization, latent diffusion models, generative priors, image reconstruction
2311.15657 Report Enhancing Diffusion Models with Text-Encoder Reinforcement Learning Chaofeng Chen, Annan Wang, Haoning Wu, Liang Liao, Wenxiu Sun, Qiong Yan, Weisi Lin Text-to-image diffusion models are typically trained to optimize the log-likelihood objective, which presents challenges in meeting specific requirements for downstream tasks, such as image aesthetics and image-text alignment. Recent research addresses this issue by refining the diffusion U-Net using human rewards through reinforcement learning or direct backpropagation. However, many of them overlook the importance of the text encoder, which is typically pretrained and fixed during training. In this paper, we demonstrate that by finetuning the text encoder through reinforcement learning, we can enhance the text-image alignment of the results, thereby improving the visual quality. Our primary motivation comes from the observation that the current text encoder is suboptimal, often requiring careful prompt adjustment. While fine-tuning the U-Net can partially improve performance, it still suffers from the suboptimal text encoder. Therefore, we propose to use reinforcement learning with low-rank adaptation to finetune the text encoder based on task-specific rewards, referred to as \textbf{TexForce}. We first show that finetuning the text encoder can improve the performance of diffusion models. Then, we illustrate that TexForce can be simply combined with existing U-Net finetuned models to get much better results without additional training. Finally, we showcase the adaptability of our method in diverse applications, including the generation of high-quality face and hand images. Presents TexForce, a novel method employing reinforcement learning with low-rank adaptation to fine-tune the text encoder in text-to-image diffusion models, enhancing text-image alignment and improving visual quality. Existing diffusion models often struggle with text-image alignment and achieving specific requirements for downstream tasks. While fine-tuning the U-Net has shown promise, the fixed, suboptimal text encoder limits overall efficacy. Leverages DDPO (a PPO variant for diffusion models) to update the text encoder based on task-specific rewards. Employs LoRA for efficient adaptation and combination of learned capabilities from diverse tasks. Fine-tuning the text encoder significantly improves text-image alignment and visual quality compared to the original Stable Diffusion model and other state-of-the-art methods. TexForce can be seamlessly integrated with existing fine-tuned U-Net models, further enhancing their performance without additional training. Demonstrates strong adaptability across various tasks, including generating high-quality face and hand images, and allows for combining learned capabilities from different tasks. Similar to other RL-based methods, TexForce faces challenges in terms of sample efficiency. Engineering suitable reward functions for specific tasks can be complex. text-to-image synthesis, diffusion models, reinforcement learning, text encoder fine-tuning, lora
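Low-rank adaptation here means wrapping frozen linear layers of the text encoder with small trainable matrices. A minimal LoRA wrapper is sketched below; the rank and scaling are illustrative, and TexForce trains such adapters with DDPO-style policy gradients from task rewards rather than the plain forward pass shown.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a low-rank trainable update:
    y = W x + (alpha / r) * B A x, where only A and B receive gradients."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)             # keep the pretrained weights fixed
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.normal_(self.lora_a.weight, std=0.01)
        nn.init.zeros_(self.lora_b.weight)      # start as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 77, 768))            # e.g. a batch of token embeddings
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)                     # torch.Size([2, 77, 768]) 6144
```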
2311.15648 Report Reinforcement Learning from Diffusion Feedback: Q* for Image Search Aboli Marathe Large vision-language models are steadily gaining personalization capabilities at the cost of fine-tuning or data augmentation. We present two models for image generation using model-agnostic learning that align semantic priors with generative capabilities. RLDF, or Reinforcement Learning from Diffusion Feedback, is a singular approach for visual imitation through prior-preserving reward function guidance. This employs Q-learning (with standard Q*) for generation and follows a semantic-rewarded trajectory for image search through finite encoding-tailored actions. The second proposed method, noisy diffusion gradient, is optimization driven. At the root of both methods is a special CFG encoding that we propose for continual semantic guidance. Using only a single input image and no text input, RLDF generates high-quality images over varied domains including retail, sports and agriculture showcasing class-consistency and strong visual diversity. Project website is available at https://infernolia.github.io/RLDF. Presents RLDF and nDg, model-agnostic learning models for class-driven semantic image imitation using a single input image, without text guidance or fine-tuning. Addresses the bottleneck of human feedback in visual prompt engineering by enabling context-driven image generation guided by semantic priors. Formulates image search as a Markov Decision Process (MDP) where an agent navigates a semantic encoding space derived from Context-Free Grammar. It employs Q-learning with semantic rewards based on diffusion feedback to guide the generation process towards the target image's semantic attributes. RLDF generates high-quality images across various domains with class-consistency and visual diversity. Demonstrates model-agnostic stability across DALLE-2, SD 1.4, and SD 2.1 models. Generates a photo-realistic ImageNet clone with a distribution closer to the original ImageNet compared to baseline methods. Computational cost increases in larger, more complex environments. Subject inconsistency persists, as the focus is on class-consistency over specific object replication. image generation, semantic guidance, reinforcement learning, diffusion models, text-to-image synthesis
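At its core, the image search is tabular Q-learning over a discrete semantic space with a reward computed from diffusion feedback. The toy sketch below keeps only that skeleton: the state space is a single 1-D attribute and the reward is a dummy indicator standing in for the diffusion/CLIP-based score, so it illustrates the update rule rather than RLDF's actual CFG encoding.

```python
import random
from collections import defaultdict

def q_learning(states, actions, reward, transition, episodes=200,
               alpha=0.1, gamma=0.9, epsilon=0.2):
    """Tabular Q-learning over a discrete search space."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = random.choice(states)
        for _ in range(20):                           # short episodes
            a = (random.choice(actions) if random.random() < epsilon
                 else max(actions, key=lambda act: Q[(s, act)]))
            s_next = transition(s, a)
            r = reward(s_next)
            best_next = max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q

# Toy semantic space: states are integers, the target attribute value is 7.
states = list(range(10))
actions = [-1, +1]                                    # edit one attribute token
transition = lambda s, a: min(max(s + a, 0), 9)
reward = lambda s: 1.0 if s == 7 else 0.0             # stands in for diffusion feedback
Q = q_learning(states, actions, reward, transition)
print(max(actions, key=lambda a: Q[(3, a)]))          # should prefer +1 (toward 7)
```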
2311.15561 Report ET3D: Efficient Text-to-3D Generation via Multi-View Distillation Yiming Chen, Zhiqi Li, Peidong Liu Recent breakthroughs in text-to-image generation have shown encouraging results via large generative models. Due to the scarcity of 3D assets, it is hard to transfer the success of text-to-image generation to that of text-to-3D generation. Existing text-to-3D generation methods usually adopt the paradigm of DreamFusion, which conducts per-asset optimization by distilling a pretrained text-to-image diffusion model. The generation speed usually ranges from several minutes to tens of minutes per 3D asset, which degrades the user experience and also imposes a burden on the service providers due to the high computational budget. In this work, we present an efficient text-to-3D generation method, which requires only around 8 $ms$ to generate a 3D asset given the text prompt on a consumer graphics card. The main insight is that we exploit the images generated by a large pre-trained text-to-image diffusion model to supervise the training of a text-conditioned 3D generative adversarial network. Once the network is trained, we are able to efficiently generate a 3D asset via a single forward pass. Our method requires no 3D training data and provides an alternative approach for efficient text-to-3D generation by distilling pre-trained image diffusion models. This paper proposes ET3D, an efficient text-to-3D generation method that distills knowledge from pre-trained text-to-multi-view image diffusion models to enable rapid 3D asset creation. Existing text-to-3D generation techniques, often relying on time-consuming optimization processes, hinder user experience and escalate computational costs. ET3D addresses this by offering a fast and efficient alternative. ET3D employs a teacher-student framework. A pre-trained text-to-multi-view image diffusion model acts as the teacher, generating multi-view images from text prompts. A text-conditioned GAN, the student, learns to generate 3D objects that, when rendered, match the teacher's multi-view image distribution. ET3D generates 3D assets in approximately 8ms on a consumer-grade GPU, significantly faster than optimization-based methods. Evaluations demonstrate that ET3D achieves comparable or superior text-to-3D alignment compared to state-of-the-art approaches. The method exhibits strong generalization ability, effectively handling unseen text prompts and composing novel objects and styles. The current implementation is trained on a limited set of text prompts due to resource constraints, potentially affecting performance on a wider range of concepts. Future work will explore incorporating larger and more diverse datasets to further enhance ET3D's generative capabilities. text-to-3d generation, generative adversarial networks, multi-view distillation, diffusion models, efficient 3d content creation
2311.15556 Report PKU-I2IQA: An Image-to-Image Quality Assessment Database for AI Generated Images Jiquan Yuan, Xinyan Cao, Changjin Li, Fanyi Yang, Jinlong Lin, Xixin Cao As image generation technology advances, AI-based image generation has been applied in various fields and Artificial Intelligence Generated Content (AIGC) has garnered widespread attention. However, the development of AI-based image generative models also brings new problems and challenges. A significant challenge is that AI-generated images (AIGI) may exhibit unique distortions compared to natural images, and not all generated images meet the requirements of the real world. Therefore, it is of great significance to evaluate AIGIs more comprehensively. Although previous work has established several human perception-based AIGC image quality assessment (AIGCIQA) databases for text-generated images, the AI image generation technology includes scenarios like text-to-image and image-to-image, and assessing only the images generated by text-to-image models is insufficient. To address this issue, we establish a human perception-based image-to-image AIGCIQA database, named PKU-I2IQA. We conduct a well-organized subjective experiment to collect quality labels for AIGIs and then conduct a comprehensive analysis of the PKU-I2IQA database. Furthermore, we have proposed two benchmark models: NR-AIGCIQA based on the no-reference image quality assessment method and FR-AIGCIQA based on the full-reference image quality assessment method. Finally, leveraging this database, we conduct benchmark experiments and compare the performance of the proposed benchmark models. The PKU-I2IQA database and benchmarks will be released to facilitate future research on \url{https://github.com/jiquan123/I2IQA}. This paper introduces PKU-I2IQA, the first human perception-based image-to-image database for assessing the quality of AI-generated images. Existing AIGC image quality assessment (AIGCIQA) methods primarily focus on text-to-image generation, neglecting the image-to-image scenario. This new database addresses this gap and enables more comprehensive evaluation of AIGC image quality. The researchers collected images from 200 ImageNet categories and used them as prompts for two image-to-image generation models. They then conducted subjective experiments to collect human ratings on the generated images' quality, authenticity, and text-image correspondence. Two benchmark models were proposed: NR-AIGCIQA (no-reference) and FR-AIGCIQA (full-reference), leveraging different input combinations during training and testing. FR-AIGCIQA outperforms NR-AIGCIQA, highlighting the benefit of using reference images. ResNet18 backbone network achieved the best performance for quality and correspondence scores. ResNet50 achieved the best overall performance. The proposed models show promise but have room for improvement in terms of performance. Future work will explore incorporating reference images for text-to-image generation and enhancing the generalization ability of AIGCIQA models across different AI image generators. aigc, image-to-image generation, image quality assessment, nr-aigciqa, fr-aigciqa
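The two benchmark models differ mainly in their inputs: NR-AIGCIQA regresses a score from the generated image alone, while FR-AIGCIQA also encodes the reference image and regresses from the combined features. The sketch below assumes a ResNet-18 backbone and a small MLP head; the exact feature fusion and training details are assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

class AIGCIQAModel(nn.Module):
    """Score regressor over backbone features. In no-reference (NR) mode it sees
    only the generated image; in full-reference (FR) mode it also sees the
    reference image and regresses from the concatenated features."""

    def __init__(self, full_reference: bool = False):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()                  # expose 512-d pooled features
        self.backbone = backbone
        self.full_reference = full_reference
        in_dim = 1024 if full_reference else 512
        self.head = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, generated, reference=None):
        feats = self.backbone(generated)
        if self.full_reference:
            feats = torch.cat([feats, self.backbone(reference)], dim=-1)
        return self.head(feats).squeeze(-1)          # predicted subjective score

gen = torch.randn(2, 3, 224, 224)
ref = torch.randn(2, 3, 224, 224)
print(AIGCIQAModel(False)(gen).shape, AIGCIQAModel(True)(gen, ref).shape)
```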
2311.15551 Report Instruct2Attack: Language-Guided Semantic Adversarial Attacks Jiang Liu, Chen Wei, Yuxiang Guo, Heng Yu, Alan Yuille, Soheil Feizi, Chun Pong Lau, Rama Chellappa We propose Instruct2Attack (I2A), a language-guided semantic attack that generates semantically meaningful perturbations according to free-form language instructions. We make use of state-of-the-art latent diffusion models, where we adversarially guide the reverse diffusion process to search for an adversarial latent code conditioned on the input image and text instruction. Compared to existing noise-based and semantic attacks, I2A generates more natural and diverse adversarial examples while providing better controllability and interpretability. We further automate the attack process with GPT-4 to generate diverse image-specific text instructions. We show that I2A can successfully break state-of-the-art deep neural networks even under strong adversarial defenses, and demonstrate great transferability among a variety of network architectures. The paper proposes Instruct2Attack (I2A), a novel language-guided semantic attack method that generates semantically meaningful adversarial perturbations using free-form language instructions. I2A addresses the limitations of noise-based and existing semantic attacks by generating more natural and diverse adversarial examples with better controllability and interpretability, providing insights into model failure modes beyond pixel-level perturbations. I2A leverages a latent conditional diffusion model, adversarially guiding the reverse diffusion process to find an adversarial latent code conditioned on the input image and text instruction. It also uses a perceptual constraint (LPIPS) to ensure similarity between the original and adversarial images. Additionally, it automates the instruction generation process with GPT-4. I2A achieves significantly higher attack success rates than baseline attacks on ImageNet, especially under strong defenses (e.g., adversarial training, DiffPure). I2A shows better transferability under black-box settings compared to noise-based and existing semantic attacks. The generated adversarial examples are visually appealing and interpretable, reflecting the vulnerabilities of DNNs to common natural semantic modifications. The current implementation of I2A has high computational cost. The quality and plausibility of automatically generated instructions need further improvement. adversarial attack, semantic attack, diffusion model, language-guided image editing, gpt-4
2311.15537 Report SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation Bin Xie, Jiale Cao, Jin Xie, Fahad Shahbaz Khan, Yanwei Pang Open-vocabulary semantic segmentation strives to distinguish pixels into different semantic groups from an open set of categories. Most existing methods explore utilizing pre-trained vision-language models, in which the key is to adopt the image-level model for the pixel-level segmentation task. In this paper, we propose a simple encoder-decoder, named SED, for open-vocabulary semantic segmentation, which comprises a hierarchical encoder-based cost map generation and a gradual fusion decoder with category early rejection. The hierarchical encoder-based cost map generation employs a hierarchical backbone, instead of a plain transformer, to predict the pixel-level image-text cost map. Compared to a plain transformer, a hierarchical backbone better captures local spatial information and has linear computational complexity with respect to input size. Our gradual fusion decoder employs a top-down structure to combine the cost map and the feature maps of different backbone levels for segmentation. To accelerate inference speed, we introduce a category early rejection scheme in the decoder that rejects many non-existing categories at the early layers of the decoder, resulting in at most 4.7 times acceleration without accuracy degradation. Experiments are performed on multiple open-vocabulary semantic segmentation datasets, which demonstrate the efficacy of our SED method. When using ConvNeXt-B, our SED method achieves an mIoU score of 31.6% on ADE20K with 150 categories at 82 milliseconds (ms) per image on a single A6000. We will release it at https://github.com/xb534/SED.git. This paper proposes SED, a novel encoder-decoder model for open-vocabulary semantic segmentation, featuring a hierarchical encoder for improved cost map generation and a gradual fusion decoder with category early rejection for efficient inference. Existing open-vocabulary semantic segmentation methods struggle with either weak local spatial information, high computational cost, or slow inference speed. This work addresses these limitations to achieve a better balance between accuracy and efficiency. The hierarchical encoder extracts multi-scale features to generate a pixel-level image-text cost map. The gradual fusion decoder combines the cost map and hierarchical features for segmentation, employing category early rejection to accelerate inference by eliminating unlikely categories early on. SED outperforms state-of-the-art methods on multiple open-vocabulary semantic segmentation benchmarks, including ADE20K, PASCAL VOC, and PASCAL-Context. The hierarchical encoder significantly improves performance compared to plain transformer-based encoders, thanks to its ability to capture rich local spatial information. The category early rejection scheme accelerates inference speed by up to 4.7 times without noticeable performance degradation. The model sometimes struggles with differentiating near-synonym categories. Future work includes exploring category attention strategies and leveraging large-scale fine-grained datasets to address the synonym challenge. open-vocabulary semantic segmentation, vision-language models, hierarchical encoder, gradual fusion decoder, category early rejection
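The sketch below illustrates, under assumed shapes, the two ideas named in the SED entry: a pixel-level image-text cost map computed as cosine similarity between pixel features and category text embeddings, and a category early rejection step that keeps only the top-responding categories. The function names, temperature, and `keep` budget are hypothetical choices for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F


def cost_map(pixel_feats, text_embeds, tau=0.07):
    """Pixel-level image-text cost map: cosine similarity between every pixel feature
    and every category text embedding.
    pixel_feats: (B, C, H, W) from an image encoder; text_embeds: (K, C), one per category.
    Returns a (B, K, H, W) cost volume."""
    p = F.normalize(pixel_feats, dim=1)
    t = F.normalize(text_embeds, dim=1)
    return torch.einsum("bchw,kc->bkhw", p, t) / tau


def early_reject(cost, keep=16):
    """Keep only the `keep` categories with the strongest response anywhere in the image,
    so that later decoder layers process far fewer categories."""
    scores = cost.flatten(2).max(dim=-1).values              # (B, K) peak response per category
    idx = scores.topk(keep, dim=-1).indices                  # surviving category indices
    gather_idx = idx[:, :, None, None].expand(-1, -1, *cost.shape[2:])
    return torch.gather(cost, 1, gather_idx), idx
```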
2311.15510 Report CaesarNeRF: Calibrated Semantic Representation for Few-shot Generalizable Neural Rendering Haidong Zhu, Tianyu Ding, Tianyi Chen, Ilya Zharkov, Ram Nevatia, Luming Liang Generalizability and few-shot learning are key challenges in Neural Radiance Fields (NeRF), often due to the lack of a holistic understanding in pixel-level rendering. We introduce CaesarNeRF, an end-to-end approach that leverages scene-level CAlibratEd SemAntic Representation along with pixel-level representations to advance few-shot, generalizable neural rendering, facilitating a holistic understanding without compromising high-quality details. CaesarNeRF explicitly models pose differences of reference views to combine scene-level semantic representations, providing a calibrated holistic understanding. This calibration process aligns various viewpoints with precise location and is further enhanced by sequential refinement to capture varying details. Extensive experiments on public datasets, including LLFF, Shiny, mip-NeRF 360, and MVImgNet, show that CaesarNeRF delivers state-of-the-art performance across varying numbers of reference views, proving effective even with a single reference image. The project page of this work can be found at https://haidongz-usc.github.io/project/caesarnerf. Introduces CaesarNeRF, a novel few-shot generalizable NeRF method leveraging calibrated scene-level semantic representations alongside pixel-level features, enabling high-quality rendering of novel scenes from as few as one reference view. Addresses the limitations of existing generalizable NeRF methods that struggle with few-shot rendering due to their reliance solely on pixel-level features, lacking a holistic scene understanding. Employs a shared encoder to generate both scene-level and pixel-level features. It calibrates semantic representations across views using camera pose transformations and introduces a sequential refinement module to capture varying details at different rendering stages. Achieves state-of-the-art performance on LLFF, Shiny, mip-NeRF 360, and MVImgNet datasets, demonstrating superior quality and consistency, especially with one or two reference views. Shows significant improvement over existing methods in few-shot scenarios, effectively mitigating depth ambiguity and producing sharper, more detailed renderings. Demonstrates adaptability by integrating the Caesar pipeline with other state-of-the-art NeRF architectures, leading to consistent performance gains. CaesarNeRF's performance could be further enhanced by incorporating explicit depth information. Exploring the integration of generative capabilities within the Caesar framework could further improve rendering quality. neural radiance fields, novel view synthesis, few-shot learning, generalizable nerf, semantic representation
2311.15478 Report HawkI: Homography & Mutual Information Guidance for 3D-free Single Image to Aerial View Divya Kothandaraman, Tianyi Zhou, Ming Lin, Dinesh Manocha We present HawkI, for synthesizing aerial-view images from text and an exemplar image, without any additional multi-view or 3D information for finetuning or at inference. HawkI uses techniques from classical computer vision and information theory. It seamlessly blends the visual features from the input image within a pretrained text-to-2D-image stable diffusion model with a test-time optimization process for a careful bias-variance trade-off, which uses an Inverse Perspective Mapping (IPM) homography transformation to provide subtle cues for aerial-view synthesis. At inference, HawkI employs a unique mutual information guidance formulation to steer the generated image towards faithfully replicating the semantic details of the input image, while maintaining a realistic aerial perspective. Mutual information guidance maximizes the semantic consistency between the generated image and the input image, without enforcing pixel-level correspondence between vastly different viewpoints. Through extensive qualitative and quantitative comparisons against text + exemplar-image based methods and 3D/multi-view based novel-view synthesis methods on proposed synthetic and real datasets, we demonstrate that our method achieves a significantly better bias-variance trade-off towards generating high-fidelity aerial-view images. Code and data are available at https://github.com/divyakraman/HawkI2024. HawkI synthesizes aerial-view images from text and a single exemplar image without relying on multi-view or 3D data during finetuning or inference. This method is valuable for generating diverse aerial-view synthetic data for tasks like aerial perception and providing weak supervision in cross-view synthesis applications like localization and mapping. HawkI employs a test-time optimization process to incorporate the input image's features into a pretrained text-to-2D-image stable diffusion model. It utilizes Inverse Perspective Mapping (IPM) for weak aerial-view guidance and a novel mutual information guidance formulation to ensure semantic consistency between generated aerial views and input images. HawkI generates more accurate aerial viewpoints compared to text + exemplar-image based methods like DreamBooth and Imagic. It demonstrates superior fidelity to the input image compared to prior text-based aerial view synthesis techniques, as evidenced by higher CLIP-I, SSCD, and DINO scores. Despite being 3D-free, HawkI achieves comparable or better results than 3D-based novel-view synthesis methods on benchmark tasks, highlighting the effectiveness of its classical guidance approaches. The lack of explicit 3D information limits precise camera angle control in the generated scenes. Further improvement in fidelity with respect to the input image is needed for more accurate cross-view synthesis applications. aerial view synthesis, text-to-image generation, stable diffusion, inverse perspective mapping, mutual information guidance
2311.15477 Report DreamCreature: Crafting Photorealistic Virtual Creatures from Imagination Kam Woh Ng, Xiatian Zhu, Yi-Zhe Song, Tao Xiang Recent text-to-image (T2I) generative models allow for high-quality synthesis following either text instructions or visual examples. Despite their capabilities, these models face limitations in creating new, detailed creatures within specific categories (e.g., virtual dog or bird species), which are valuable in digital asset creation and biodiversity analysis. To bridge this gap, we introduce a novel task, Virtual Creatures Generation: Given a set of unlabeled images of the target concepts (e.g., 200 bird species), we aim to train a T2I model capable of creating new, hybrid concepts within diverse backgrounds and contexts. We propose a new method called DreamCreature, which identifies and extracts the underlying sub-concepts (e.g., body parts of a specific species) in an unsupervised manner. The T2I thus adapts to generate novel concepts (e.g., new bird species) with faithful structures and photorealistic appearance by seamlessly and flexibly composing learned sub-concepts. To enhance sub-concept fidelity and disentanglement, we extend the textual inversion technique by incorporating an additional projector and tailored attention loss regularization. Extensive experiments on two fine-grained image benchmarks demonstrate the superiority of DreamCreature over prior methods in both qualitative and quantitative evaluation. Ultimately, the learned sub-concepts facilitate diverse creative applications, including innovative consumer product designs and nuanced property modifications. This paper introduces DreamCreature, a novel method for virtual creature generation that automatically discovers and composes sub-concepts from unlabeled images, enabling the creation of new, hybrid concepts (e.g., novel bird species). Existing text-to-image models struggle to create new, detailed concepts within specific categories, limiting their application in areas like digital asset creation and biodiversity analysis. DreamCreature addresses this gap by enabling the creation of novel concepts with realistic appearances and structures. DreamCreature uses unsupervised learning to identify sub-concepts (e.g., body parts) within a dataset. It then leverages textual inversion with a dedicated projector and an attention loss to disentangle and learn representations for each sub-concept, enabling their flexible composition during generation. DreamCreature outperforms existing personalization methods in generating new creatures by combining sub-concepts from different species, as evidenced by higher Exact Matching Rate (EMR) and Cosine Similarity (CoSim) scores. The method demonstrates superior performance in conventional image generation tasks, achieving better FID, CLIP, and DINO scores compared to other approaches. The learned sub-concepts exhibit strong transferability, allowing for creative applications like property modification in images and innovative digital asset design. The accuracy of sub-concept discovery may be limited by the use of a self-supervised pre-trained feature extractor. Composing small sub-concepts (e.g., tails, legs) presents a challenge. virtual creature generation, text-to-image synthesis, sub-concept learning, textual inversion, creative ai
2311.15475 Report MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers Yawar Siddiqui, Antonio Alliegro, Alexey Artemov, Tatiana Tommasi, Daniele Sirigatti, Vladislav Rosov, Angela Dai, Matthias Nießner We introduce MeshGPT, a new approach for generating triangle meshes that reflects the compactness typical of artist-created meshes, in contrast to dense triangle meshes extracted by iso-surfacing methods from neural fields. Inspired by recent advances in powerful large language models, we adopt a sequence-based approach to autoregressively generate triangle meshes as sequences of triangles. We first learn a vocabulary of latent quantized embeddings, using graph convolutions, which inform these embeddings of the local mesh geometry and topology. These embeddings are sequenced and decoded into triangles by a decoder, ensuring that they can effectively reconstruct the mesh. A transformer is then trained on this learned vocabulary to predict the index of the next embedding given previous embeddings. Once trained, our model can be autoregressively sampled to generate new triangle meshes, directly generating compact meshes with sharp edges, more closely imitating the efficient triangulation patterns of human-crafted meshes. MeshGPT demonstrates a notable improvement over state of the art mesh generation methods, with a 9% increase in shape coverage and a 30-point enhancement in FID scores across various categories. Introduces MeshGPT, a novel method for generating compact and efficient triangle meshes, mimicking the style of human-crafted meshes, using a GPT-inspired transformer trained on a vocabulary of learned geometric embeddings. Existing 3D shape generation methods often rely on representations like voxels, point clouds, or neural fields, which require post-processing to convert into meshes, resulting in dense and over-tessellated outputs. MeshGPT addresses this by directly generating compact meshes, reflecting the efficient triangulation patterns found in artist-created models. Learns a vocabulary of quantized geometric embeddings from mesh triangles using graph convolutions. A GPT-style decoder-only transformer is trained on this vocabulary to autoregressively predict sequences of triangle embeddings, which are then decoded into mesh faces. Achieves a 9% improvement in shape coverage and a 30-point enhancement in FID scores compared to state-of-the-art methods. Generates compact meshes with sharp edges and high fidelity, surpassing baselines in visual quality. Demonstrates shape novelty, producing shapes that differ from the training dataset while maintaining realism. Autoregressive generation leads to slower sampling times, posing challenges for real-time applications. Limited context window size of the transformer might restrict the generation of large-scale scenes. mesh generation, transformers, geometric deep learning, generative models, 3d shape synthesis
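As a rough illustration of the decoder-only generation step described for MeshGPT, here is a generic autoregressive sampling loop over a learned token vocabulary. The `transformer` interface, the start/stop tokens, and temperature sampling are assumptions for the sketch; the mapping from sampled indices back to triangle embeddings and mesh faces is omitted.

```python
import torch


@torch.no_grad()
def sample_token_sequence(transformer, start_token, stop_token, max_len=4096, temperature=1.0):
    """Generic decoder-only sampling loop (a sketch, not MeshGPT's actual code).

    `transformer` is assumed to map a (1, T) index sequence to (1, T, vocab) logits.
    Sampled indices would afterwards be looked up in the learned triangle vocabulary
    and decoded back into mesh faces."""
    seq = torch.tensor([[start_token]])
    while seq.shape[1] < max_len:
        logits = transformer(seq)[:, -1, :] / temperature        # logits for the next token
        next_tok = torch.multinomial(logits.softmax(dim=-1), 1)  # (1, 1) sampled index
        if next_tok.item() == stop_token:
            break
        seq = torch.cat([seq, next_tok], dim=1)
    return seq[:, 1:]                                             # drop the start token
```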
2311.15435 Report Functional Diffusion Biao Zhang, Peter Wonka We propose a new class of generative diffusion models, called functional diffusion. In contrast to previous work, functional diffusion works on samples that are represented by functions with a continuous domain. Functional diffusion can be seen as an extension of classical diffusion models to an infinite-dimensional domain. Functional diffusion is very versatile as images, videos, audio, 3D shapes, deformations, etc., can be handled by the same framework with minimal changes. In addition, functional diffusion is especially suited for irregular data or data defined in non-standard domains. In our work, we derive the necessary foundations for functional diffusion and propose a first implementation based on the transformer architecture. We show generative results on complicated signed distance functions and deformation functions defined on 3D surfaces. Introduces functional diffusion, a novel class of generative diffusion models that operate on samples represented as functions with continuous domains, extending diffusion models to infinite-dimensional spaces. Provides a versatile framework for generating various data types (images, videos, audio, 3D shapes, deformations) within a unified framework, especially suitable for irregular data or non-standard domains. Represents functions using continuous latent vectors and sampled function values, trains a denoising network to progressively denoise functions from noisy initial states, and leverages a DDIM-based sampling method for efficient inference. Generates high-quality, detailed 3D shapes from sparse point clouds, outperforming existing methods in terms of visual fidelity and quantitative metrics. Successfully models and generates 3D deformation fields from sparse correspondences, demonstrating superior performance compared to baseline methods. Demonstrates the capability to generate raw signed distance functions (SDFs) directly, unlike previous methods that predict binary occupancies or truncated SDFs. Requires significant computational resources for training, potentially limiting its scalability to large datasets. Involves exploring the sampling rate of the sampled function representation as a hyperparameter during training. generative diffusion models, functional data, 3d shape generation, deformation fields, neural fields
2311.15383 Report Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding Zhihao Yuan, Jinke Ren, Chun-Mei Feng, Hengshuang Zhao, Shuguang Cui, Zhen Li 3D Visual Grounding (3DVG) aims at localizing 3D objects based on textual descriptions. Conventional supervised methods for 3DVG often necessitate extensive annotations and a predefined vocabulary, which can be restrictive. To address this issue, we propose a novel visual programming approach for zero-shot open-vocabulary 3DVG, leveraging the capabilities of large language models (LLMs). Our approach begins with a unique dialog-based method, engaging with LLMs to establish a foundational understanding of zero-shot 3DVG. Building on this, we design a visual program that consists of three types of modules, i.e., view-independent, view-dependent, and functional modules. These modules, specifically tailored for 3D scenarios, work collaboratively to perform complex reasoning and inference. Furthermore, we develop an innovative language-object correlation module to extend the scope of existing 3D object detectors into open-vocabulary scenarios. Extensive experiments demonstrate that our zero-shot approach can outperform some supervised baselines, marking a significant stride towards effective 3DVG. This paper proposes a novel zero-shot visual programming approach for 3D Visual Grounding (3DVG) that leverages the capabilities of large language models (LLMs) to localize 3D objects in a scene based on textual descriptions, without the need for extensive annotations or a predefined vocabulary. Existing supervised 3DVG methods are limited by the need for extensive annotations and a predefined vocabulary, making them difficult to apply in real-world scenarios. The approach involves a dialog-based method to establish an understanding of zero-shot 3DVG with LLMs and designs a visual program consisting of view-independent, view-dependent, and functional modules for reasoning and inference. It also introduces a language-object correlation (LOC) module to extend 3D object detectors to open-vocabulary scenarios. The zero-shot approach outperforms some existing supervised methods on the ScanRefer and Nr3D datasets. The LOC module effectively combines 3D geometric information and 2D appearance features for improved object localization in open-vocabulary settings. The visual programming approach demonstrates the ability to handle complex spatial relations and perform multi-step reasoning for 3DVG. The accuracy of the approach heavily relies on the quality of the generated visual programs and the performance of the LLMs. Expanding the range of spatial relations and modules within the visual programming framework can further enhance the capabilities and address more complex 3DVG scenarios. 3d visual grounding, large language models, zero-shot learning, visual programming, open vocabulary
2311.15368 Report Flow-Guided Diffusion for Video Inpainting Bohai Gu, Yongsheng Yu, Heng Fan, Libo Zhang Video inpainting has been challenged by complex scenarios like large movements and low-light conditions. Current methods, including emerging diffusion models, face limitations in quality and efficiency. This paper introduces the Flow-Guided Diffusion model for Video Inpainting (FGDVI), a novel approach that significantly enhances temporal consistency and inpainting quality by reusing an off-the-shelf image generation diffusion model. We employ optical flow for precise one-step latent propagation and introduce a model-agnostic flow-guided latent interpolation technique. This technique expedites denoising, seamlessly integrating with any Video Diffusion Model (VDM) without additional training. Our FGDVI demonstrates a remarkable 10% improvement in flow warping error E_warp over existing state-of-the-art methods. Our comprehensive experiments validate the superior performance of FGDVI, offering a promising direction for advanced video inpainting. The code and detailed results will be publicly available at https://github.com/NevSNev/FGDVI. This paper presents FGDVI, a novel flow-guided diffusion model for video inpainting that leverages optical flow and reuses an off-the-shelf image generation diffusion model for enhanced temporal consistency and inpainting quality. Video inpainting in complex scenarios with large movements and low-light conditions remains challenging for existing methods, demanding improved quality and efficiency. FGDVI employs optical flow for precise one-step latent propagation and introduces a model-agnostic flow-guided latent interpolation technique to accelerate denoising, integrating seamlessly with any Video Diffusion Model (VDM) without additional training. FGDVI demonstrates a remarkable 10% improvement in flow warping error (E_warp) over state-of-the-art methods. The proposed flow-guided latent interpolation method boosts inference speed by approximately 29% compared to vanilla diffusion. FGDVI excels in qualitative and quantitative evaluations, especially in handling complex scenarios with large masks and object removal. The paper uses a pre-trained LDM instead of more powerful Stable Diffusion models with cross-attention for text input. Future work aims to design algorithms with fewer keyframes for flow-based interpolation to further enhance temporal consistency. video inpainting, diffusion models, optical flow, latent interpolation, temporal consistency
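The flow-guided propagation step in the FGDVI entry relies on warping a latent with an optical flow field. The sketch below shows a standard backward warp with `grid_sample`, assuming a pixel-space flow of shape (B, 2, H, W); it illustrates the general operation only, not FGDVI's exact one-step latent propagation or interpolation scheme.

```python
import torch
import torch.nn.functional as F


def warp_with_flow(latent, flow):
    """Backward-warp a latent/feature map with a dense optical flow field.
    latent: (B, C, H, W); flow: (B, 2, H, W) in pixels, flow[:, 0] = dx, flow[:, 1] = dy."""
    B, C, H, W = latent.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(latent)       # (2, H, W) pixel coordinates
    coords = base.unsqueeze(0) + flow                             # where to sample each output pixel from
    # normalize to [-1, 1] in (x, y) order, as expected by grid_sample
    grid_x = 2 * coords[:, 0] / (W - 1) - 1
    grid_y = 2 * coords[:, 1] / (H - 1) - 1
    grid = torch.stack((grid_x, grid_y), dim=-1)                  # (B, H, W, 2)
    return F.grid_sample(latent, grid, align_corners=True)
```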
2311.15308 Report AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset Zhixi Cai, Shreya Ghosh, Aman Pankaj Adatia, Munawar Hayat, Abhinav Dhall, Kalin Stefanov The detection and localization of highly realistic deepfake audio-visual content are challenging even for the most advanced state-of-the-art methods. While most of the research efforts in this domain are focused on detecting high-quality deepfake images and videos, only a few works address the problem of the localization of small segments of audio-visual manipulations embedded in real videos. In this research, we emulate the process of such content generation and propose the AV-Deepfake1M dataset. The dataset contains content-driven (i) video manipulations, (ii) audio manipulations, and (iii) audio-visual manipulations for more than 2K subjects resulting in a total of more than 1M videos. The paper provides a thorough description of the proposed data generation pipeline accompanied by a rigorous analysis of the quality of the generated data. The comprehensive benchmark of the proposed dataset utilizing state-of-the-art deepfake detection and localization methods indicates a significant drop in performance compared to previous datasets. The proposed dataset will play a vital role in building the next-generation deepfake localization methods. The dataset and associated code are available at https://github.com/ControlNet/AV-Deepfake1M . This paper presents AV-Deepfake1M, a large-scale content-driven audio-visual dataset for temporal deepfake localization generated using ChatGPT for realistic transcript manipulation and state-of-the-art audio and video generation methods. Discriminating real from fake content is increasingly challenging with advancements in content generation, making reliable detection methods vital, especially for localized manipulations within real content, which existing datasets lack. The dataset is generated in a three-stage pipeline: 1) ChatGPT manipulates real transcripts with insertions, deletions, and replacements, 2) High-quality audio is generated using VITS and YourTTS, 3) Lip-synced visual frames are generated using TalkLip. AV-Deepfake1M significantly surpasses previous datasets in scale and diversity with over 2K subjects and 1M videos, including diverse fake segment lengths and lower average proportions of manipulations. Benchmarking state-of-the-art deepfake detection and localization methods on AV-Deepfake1M reveals a significant performance drop compared to previous datasets, highlighting the dataset's difficulty. Human evaluation shows that even experts struggle to detect and localize deepfakes in AV-Deepfake1M, emphasizing the need for advanced detection methods. The dataset exhibits an imbalance in terms of the number of fake and real videos. Potential misuse of the dataset exists despite distribution restrictions and end-user license agreements. deepfakes, dataset, temporal localization, content-driven manipulation, large language model
2311.15291 Report Obj-NeRF: Extract Object NeRFs from Multi-view Images Zhiyi Li, Lihe Ding, Tianfan Xue Neural Radiance Fields (NeRFs) have demonstrated remarkable effectiveness in novel view synthesis within 3D environments. However, extracting a radiance field of one specific object from multi-view images encounters substantial challenges due to occlusion and background complexity, thereby presenting difficulties in downstream applications such as NeRF editing and 3D mesh extraction. To solve this problem, in this paper, we propose Obj-NeRF, a comprehensive pipeline that recovers the 3D geometry of a specific object from multi-view images using a single prompt. This method combines the 2D segmentation capabilities of the Segment Anything Model (SAM) in conjunction with the 3D reconstruction ability of NeRF. Specifically, we first obtain multi-view segmentation for the indicated object using SAM with a single prompt. Then, we use the segmentation images to supervise NeRF construction, integrating several effective techniques. Additionally, we construct a large object-level NeRF dataset containing diverse objects, which can be useful in various downstream tasks. To demonstrate the practicality of our method, we also apply Obj-NeRF to various applications, including object removal, rotation, replacement, and recoloring. This paper presents Obj-NeRF, a novel pipeline for extracting and reconstructing the 3D geometry of specific objects from multi-view images using a single prompt. Extracting object-specific radiance fields from multi-view images is challenging due to occlusion and background complexity, hindering downstream applications like NeRF editing and 3D mesh extraction. Obj-NeRF addresses this by leveraging the strengths of 2D segmentation and 3D NeRF reconstruction. Obj-NeRF leverages the Segment Anything Model (SAM) for multi-view segmentation based on user prompts and combines it with NeRF reconstruction techniques. It employs a sparse point cloud for multi-view consistency, handles object obstruction, and incorporates sparse and dense depth supervision for enhanced novel view synthesis. Obj-NeRF effectively segments and reconstructs objects from various multi-view datasets, outperforming previous methods in quality. The pipeline enables the creation of a large, multi-view object NeRF dataset beneficial for tasks like 3D generation. Extracted object NeRFs are demonstrated for applications like object removal, replacement, rotation, and color changing within existing NeRF scenes. Future work includes extending the constructed object NeRF dataset to broader 3D generation tasks. Investigating methods for further improving the reconstruction quality and handling complex object interactions is crucial. neural radiance fields, nerf, 3d object segmentation, novel view synthesis, segment anything model (sam)
2311.15260 Report NeuRAD: Neural Rendering for Autonomous Driving Adam Tonderski, Carl Lindström, Georg Hess, William Ljungbergh, Lennart Svensson, Christoffer Petersson Neural radiance fields (NeRFs) have gained popularity in the autonomous driving (AD) community. Recent methods show NeRFs' potential for closed-loop simulation, enabling testing of AD systems, and as an advanced training data augmentation technique. However, existing methods often require long training times, dense semantic supervision, or lack generalizability. This, in turn, hinders the application of NeRFs for AD at scale. In this paper, we propose NeuRAD, a robust novel view synthesis method tailored to dynamic AD data. Our method features simple network design, extensive sensor modeling for both camera and lidar -- including rolling shutter, beam divergence and ray dropping -- and is applicable to multiple datasets out of the box. We verify its performance on five popular AD datasets, achieving state-of-the-art performance across the board. To encourage further development, we will openly release the NeuRAD source code. See https://github.com/georghess/NeuRAD . NeuRAD is a novel view synthesis method for dynamic autonomous driving data, capable of handling large-scale scenes and generalizing to multiple datasets. NeRFs have potential for closed-loop simulation and data augmentation in autonomous driving, but existing methods are limited by long training times, reliance on dense supervision, and lack of generalizability. NeuRAD uses a single network with an actor-aware hash encoding for static and dynamic elements. It models sensor characteristics like rolling shutter, beam divergence, and ray dropping. It employs a CNN decoder and proposal sampling for efficiency. Achieves state-of-the-art novel view synthesis performance on five AD datasets (PandaSet, nuScenes, KITTI, Argoverse 2, ZOD). Significantly outperforms previous methods in lidar simulation, accurately capturing ray dropping effects. Demonstrates generalization to novel viewpoints and actor manipulations, enabling realistic scenario generation. Assumes rigid actors, limiting its applicability to pedestrians and other deformable objects. Struggles with challenging conditions like night scenes and time-dependent object appearance (e.g., brake lights). neural radiance fields, autonomous driving, novel view synthesis, lidar simulation, scene generation
2311.15230 Report GAIA: Zero-shot Talking Avatar Generation Tianyu He, Junliang Guo, Runyi Yu, Yuchi Wang, Jialiang Zhu, Kaikai An, Leyi Li, Xu Tan, Chunyu Wang, Han Hu, HsiangTao Wu, Sheng Zhao, Jiang Bian Zero-shot talking avatar generation aims at synthesizing natural talking videos from speech and a single portrait image. Previous methods have relied on domain-specific heuristics such as warping-based motion representation and 3D Morphable Models, which limit the naturalness and diversity of the generated avatars. In this work, we introduce GAIA (Generative AI for Avatar), which eliminates the domain priors in talking avatar generation. In light of the observation that the speech only drives the motion of the avatar while the appearance of the avatar and the background typically remain the same throughout the entire video, we divide our approach into two stages: 1) disentangling each frame into motion and appearance representations; 2) generating motion sequences conditioned on the speech and reference portrait image. We collect a large-scale high-quality talking avatar dataset and train the model on it with different scales (up to 2B parameters). Experimental results verify the superiority, scalability, and flexibility of GAIA as 1) the resulting model beats previous baseline models in terms of naturalness, diversity, lip-sync quality, and visual quality; 2) the framework is scalable since larger models yield better results; 3) it is general and enables different applications like controllable talking avatar generation and text-instructed avatar generation. Introduces GAIA (Generative AI for Avatar), a novel framework for zero-shot talking avatar generation that eliminates domain-specific priors like warping-based motion representations and 3D Morphable Models. Existing methods rely on domain-specific heuristics that limit the naturalness and diversity of generated avatars. GAIA aims to overcome these limitations by directly learning from data distributions. GAIA uses a two-stage approach: 1) Disentangling motion and appearance representations of video frames with a Variational AutoEncoder (VAE). 2) Generating motion sequences conditioned on speech and a reference portrait image using a diffusion model. GAIA outperforms previous state-of-the-art methods in subjective evaluations of naturalness, diversity, lip-sync quality, and visual quality. The framework is scalable, with larger models yielding better results. GAIA is a general framework enabling applications like controllable talking avatar generation and text-instructed avatar generation. Reliance on pre-trained landmark and head pose extractors might hinder end-to-end learning. Future work includes exploring fully end-to-end learning and disentangling motion and appearance without landmarks. talking avatar generation, zero-shot learning, diffusion models, variational autoencoder, motion and appearance disentanglement
2311.15157 Report Advancing Vision Transformers with Group-Mix Attention Chongjian Ge, Xiaohan Ding, Zhan Tong, Li Yuan, Jiangliu Wang, Yibing Song, Ping Luo Vision Transformers (ViTs) have been shown to enhance visual recognition through modeling long-range dependencies with multi-head self-attention (MHSA), which is typically formulated as Query-Key-Value computation. However, the attention map generated from the Query and Key captures only token-to-token correlations at one single granularity. In this paper, we argue that self-attention should have a more comprehensive mechanism to capture correlations among tokens and groups (i.e., multiple adjacent tokens) for higher representational capacity. Thereby, we propose Group-Mix Attention (GMA) as an advanced replacement for traditional self-attention, which can simultaneously capture token-to-token, token-to-group, and group-to-group correlations with various group sizes. To this end, GMA splits the Query, Key, and Value into segments uniformly and performs different group aggregations to generate group proxies. The attention map is computed based on the mixtures of tokens and group proxies and used to re-combine the tokens and groups in Value. Based on GMA, we introduce a powerful backbone, namely GroupMixFormer, which achieves state-of-the-art performance in image classification, object detection, and semantic segmentation with fewer parameters than existing models. For instance, GroupMixFormer-L (with 70.3M parameters and 384^2 input) attains 86.2% Top-1 accuracy on ImageNet-1K without external data, while GroupMixFormer-B (with 45.8M parameters) attains 51.2% mIoU on ADE20K. This paper proposes Group-Mix Attention (GMA), an advanced attention mechanism for Vision Transformers (ViTs) that captures token-to-token, token-to-group, and group-to-group correlations to enhance representational capacity. Standard self-attention in ViTs only captures token-to-token correlations at a single granularity, limiting their ability to model complex visual patterns. GMA addresses this by incorporating correlations among token groups of various sizes. GMA divides input tokens into segments and uses sliding-window-based aggregators (e.g., depth-wise convolutions) to generate group proxies. It then computes attention on mixtures of individual tokens and these group proxies, enabling multi-granularity correlation modeling. GroupMixFormer, a hierarchical ViT built on GMA, achieves state-of-the-art performance on ImageNet-1K classification, COCO object detection, and ADE20K semantic segmentation. Experiments show that GMA effectively models group correlations, leading to fine-grained visual representations beneficial for various vision tasks. Incorporating GMA into other ViT architectures like Swin and PVT also consistently improves their performance. The current implementation of GMA with depth-wise convolutions as aggregators leads to slower inference speed, though this can be improved by using more efficient aggregators. Exploring alternative aggregator implementations and further optimizing the kernel size configurations could yield additional performance gains. vision transformer, self-attention, group-mix attention, image classification, object detection, semantic segmentation
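To illustrate the core idea of Group-Mix Attention from the entry above, here is a simplified PyTorch module that splits Q/K/V into segments, turns the segments into group proxies with depth-wise convolutions of different kernel sizes (kernel size 1 keeps plain tokens), and runs standard attention over the mixture. The segment count, kernel sizes, and single-head formulation are illustrative choices, not GroupMixFormer's actual configuration.

```python
import torch
import torch.nn as nn


class GroupMixAttentionSketch(nn.Module):
    """Minimal sketch of group-mix style attention (illustrative, not the paper's exact design)."""
    def __init__(self, dim, num_segments=4, kernel_sizes=(1, 3, 5, 7)):
        super().__init__()
        assert dim % num_segments == 0 and len(kernel_sizes) == num_segments
        self.seg_dim = dim // num_segments
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # one depth-wise conv aggregator per segment; larger kernels build larger group proxies
        self.aggregators = nn.ModuleList([
            nn.Conv2d(self.seg_dim, self.seg_dim, k, padding=k // 2, groups=self.seg_dim)
            for k in kernel_sizes
        ])

    def forward(self, x, h, w):
        # x: (B, N, C) token sequence with N == h * w
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def mix(t):
            out = []
            for seg, agg in zip(t.split(self.seg_dim, dim=-1), self.aggregators):
                seg2d = seg.transpose(1, 2).reshape(B, self.seg_dim, h, w)
                out.append(agg(seg2d).reshape(B, self.seg_dim, N).transpose(1, 2))
            return torch.cat(out, dim=-1)

        q, k, v = mix(q), mix(k), mix(v)                       # mixtures of tokens and group proxies
        attn = (q @ k.transpose(-2, -1)) / (C ** 0.5)
        return self.proj(attn.softmax(dim=-1) @ v)


# usage: x = torch.randn(2, 14 * 14, 64); y = GroupMixAttentionSketch(64)(x, 14, 14)
```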
2311.15040 Report InstaStyle: Inversion Noise of a Stylized Image is Secretly a Style Adviser Xing Cui, Zekun Li, Pei Pei Li, Huaibo Huang, Zhaofeng He Stylized text-to-image generation focuses on creating images from textual descriptions while adhering to a style specified by a few reference images. However, subtle style variations within different reference images can hinder the model from accurately learning the target style. In this paper, we propose InstaStyle, a novel approach that excels in generating high-fidelity stylized images with only a single reference image. Our approach is based on the finding that the inversion noise from a stylized reference image inherently carries the style signal, as evidenced by their non-zero signal-to-noise ratio. We employ DDIM inversion to extract this noise from the reference image and leverage a diffusion model to generate new stylized images from the "style" noise. Additionally, the inherent ambiguity and bias of textual prompts impede the precise conveying of style. To address this, we introduce a learnable style token via prompt refinement, which enhances the accuracy of the style description for the reference image. Qualitative and quantitative experimental results demonstrate that InstaStyle achieves superior performance compared to current benchmarks. Furthermore, our approach also showcases its capability in the creative task of style combination with mixed inversion noise. Proposes InstaStyle, a novel stylized text-to-image generation method that effectively captures and generates images in specific styles using only a single reference image. Addresses limitations of existing methods that struggle to capture subtle style variations from multiple reference images or rely on ambiguous textual prompts. Leverages DDIM inversion to extract style information from a single reference image and employs a prompt refinement scheme to learn a style token, enhancing style accuracy and enabling style combination. Generates high-fidelity stylized images with fine-grained style details from a single reference image. Learned style token effectively avoids ambiguity and bias present in human-written textual style descriptions. Supports creative style combination by mixing inversion noise and employing a composed guidance mechanism. Limited exploration of the impact of different masking strategies and prompt mix ratios on style combination. Reliance on manual selection for prompt refinement, which could be automated in future work. stylized image generation, text-to-image synthesis, diffusion models, ddim inversion, prompt refinement
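The "style noise" in the InstaStyle entry comes from DDIM inversion of the stylized reference image. The following sketch shows the standard deterministic DDIM inversion update, with `eps_model` standing in for a pretrained noise predictor and `timesteps` given in increasing noise order; it is the generic procedure only, not the paper's full pipeline (which additionally learns a style token via prompt refinement).

```python
import torch


@torch.no_grad()
def ddim_invert(x0, eps_model, alphas_cumprod, timesteps):
    """Deterministic DDIM inversion sketch (eta = 0).

    x0: clean latent of the reference image
    eps_model(x, t): placeholder for a pretrained noise predictor
    alphas_cumprod: 1-D tensor of cumulative alphas, indexed by timestep
    timesteps: increasing list of timesteps, e.g. [0, 20, 40, ..., 980]
    Returns the inverted latent, which carries the reference style signal."""
    x = x0
    for t_prev, t in zip(timesteps[:-1], timesteps[1:]):
        a_prev, a_t = alphas_cumprod[t_prev], alphas_cumprod[t]
        eps = eps_model(x, t_prev)
        x0_pred = (x - (1 - a_prev).sqrt() * eps) / a_prev.sqrt()   # predicted clean latent
        x = a_t.sqrt() * x0_pred + (1 - a_t).sqrt() * eps           # step toward higher noise level
    return x
```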
2311.15027 Report Double-Flow-based Steganography without Embedding for Image-to-Image Hiding Bingbing Song, Derui Wang, Tianwei Zhang, Renyang Liu, Yu Lin, Wei Zhou As an emerging concept, steganography without embedding (SWE) hides a secret message without directly embedding it into a cover. Thus, SWE has the unique advantage of being immune to typical steganalysis methods and can better protect the secret message from being exposed. However, existing SWE methods are generally criticized for their poor payload capacity and low fidelity of recovered secret messages. In this paper, we propose a novel steganography-without-embedding technique, named DF-SWE, which addresses the aforementioned drawbacks and produces diverse and natural stego images. Specifically, DF-SWE employs a reversible circulation of double flow to build a reversible bijective transformation between the secret image and the generated stego image. Hence, it provides a way to directly generate stego images from secret images without a cover image. Besides leveraging the invertible property, DF-SWE can invert a secret image from a generated stego image in a nearly lossless manner and increases the fidelity of extracted secret images. To the best of our knowledge, DF-SWE is the first SWE method that can hide large images and multiple images into one image with the same size, significantly enhancing the payload capacity. According to the experimental results, DF-SWE achieves a payload capacity of 24-72 BPP, which is 8000-16000 times that of its competitors, while producing diverse images to minimize the exposure risk. Importantly, DF-SWE can be applied in the steganography of secret images in various domains without requiring training data from the corresponding domains. This domain-agnostic property suggests that DF-SWE can 1) be applied to hiding private data and 2) be deployed in resource-limited systems. This paper proposes DF-SWE, a novel steganography-without-embedding technique that uses a reversible circulation of double flow to hide large and multiple images within a single, naturally generated stego image. Existing SWE methods suffer from limited payload capacity and low fidelity of recovered secret messages. DF-SWE addresses these limitations by enabling the hiding of large images, even multiple images, without a cover image, thereby significantly enhancing security against steganalysis. DF-SWE employs two flow-based models to establish a reversible bijective transformation between secret images and generated stego images. It leverages prior knowledge sampling for initialization, high-dimensional space replacement for information transfer, and distribution consistency transformation to ensure high-quality stego image generation. DF-SWE achieves a payload capacity of 24-72 BPP, significantly higher than existing SWE methods. The method ensures a low extraction error, enabling near-lossless recovery of hidden images. DF-SWE exhibits domain generalization, enabling it to hide images from different domains without requiring domain-specific training data. While achieving excellent secret image recovery, the method is not completely lossless. Future work includes exploring complete lossless recovery and extending the method to multi-modal data hiding. image steganography, steganography without embedding, flow-based model, image hiding, domain generalization
2311.14768 Report AdaDiff: Adaptive Step Selection for Fast Diffusion Hui Zhang, Zuxuan Wu, Zhen Xing, Jie Shao, Yu-Gang Jiang Diffusion models, as a type of generative model, have achieved impressive results in generating images and videos conditioned on textual conditions. However, the generation process of diffusion models involves denoising for dozens of steps to produce photorealistic images/videos, which is computationally expensive. Unlike previous methods that design "one-size-fits-all" approaches to speed up generation, we argue that denoising steps should be sample-specific, conditioned on the richness of input texts. To this end, we introduce AdaDiff, a lightweight framework designed to learn instance-specific step usage policies, which are then used by the diffusion model for generation. AdaDiff is optimized using a policy gradient method to maximize a carefully designed reward function, balancing inference time and generation quality. We conduct experiments on three image generation and two video generation benchmarks and demonstrate that our approach achieves similar results in terms of visual quality compared to the baseline using a fixed schedule of 50 denoising steps while reducing inference time by at least 33%, going as high as 40%. Furthermore, our qualitative analysis shows that our method allocates more steps to more informative text conditions and fewer steps to simpler text conditions. This paper introduces AdaDiff, an end-to-end framework that learns instance-specific step usage policies for diffusion models conditioned on textual prompts to reduce computational cost and inference time. Diffusion models, while effective, require dozens of computationally expensive denoising steps for generating high-quality images/videos. This paper argues that the number of steps should be adaptive to the complexity of the input prompt, unlike traditional "one-size-fits-all" approaches. AdaDiff employs a lightweight step selection network trained using reinforcement learning with a policy gradient method. The network learns to maximize a reward function that balances image/video quality (evaluated using an IQS model) and the number of steps saved. AdaDiff reduces inference time by 33%-40% compared to fixed-step baselines while maintaining similar visual quality across various image and video generation benchmarks. The learned adaptive policies demonstrate superior performance over random step selection, achieving better visual quality with similar computational resources. AdaDiff can be seamlessly integrated with other diffusion model acceleration methods and exhibits promising zero-shot transfer capabilities to different datasets. The current implementation primarily focuses on a predefined set of discrete steps for the DDIM sampler. The IQS model, while effective, might not fully encapsulate the nuances of human perception in all scenarios. diffusion models, generative models, text-to-image generation, text-to-video generation, reinforcement learning
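A minimal sketch of the policy-gradient training signal behind instance-specific step selection: sample a step budget from a categorical policy over the text prompt, run generation with that budget, score the result, and apply a REINFORCE update with a baseline. The reward shape and all interfaces here are assumptions for illustration, not AdaDiff's exact reward design.

```python
import torch
from torch.distributions import Categorical


def select_steps(step_logits):
    """Sample a step budget per prompt from the policy's categorical distribution.
    step_logits: (B, num_step_options) from a lightweight policy network."""
    dist = Categorical(logits=step_logits)
    actions = dist.sample()
    return actions, dist.log_prob(actions)


def policy_gradient_loss(log_probs, rewards, baseline):
    """REINFORCE objective: `rewards` is assumed to combine a generation-quality score
    with a penalty proportional to the number of denoising steps actually used."""
    advantage = rewards - baseline          # baseline subtraction reduces gradient variance
    return -(log_probs * advantage).mean()


# usage sketch:
#   actions, logp = select_steps(policy(text_embeddings))
#   run the diffusion sampler with `actions` steps, score the outputs -> rewards (B,)
#   loss = policy_gradient_loss(logp, rewards, rewards.mean().detach())
```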
2311.14760 Report SinSR: Diffusion-Based Image Super-Resolution in a Single Step Yufei Wang, Wenhan Yang, Xinyuan Chen, Yaohui Wang, Lanqing Guo, Lap-Pui Chau, Ziwei Liu, Yu Qiao, Alex C. Kot, Bihan Wen While super-resolution (SR) methods based on diffusion models exhibit promising results, their practical application is hindered by the substantial number of required inference steps. Recent methods utilize degraded images in the initial state, thereby shortening the Markov chain. Nevertheless, these solutions either rely on a precise formulation of the degradation process or still necessitate a relatively lengthy generation path (e.g., 15 iterations). To enhance inference speed, we propose a simple yet effective method for achieving single-step SR generation, named SinSR. Specifically, we first derive a deterministic sampling process from the most recent state-of-the-art (SOTA) method for accelerating diffusion-based SR. This allows the mapping between the input random noise and the generated high-resolution image to be obtained in a reduced and acceptable number of inference steps during training. We show that this deterministic mapping can be distilled into a student model that performs SR within only one inference step. Additionally, we propose a novel consistency-preserving loss to simultaneously leverage the ground-truth image during the distillation process, ensuring that the performance of the student model is not solely bound by the feature manifold of the teacher model, resulting in further performance improvement. Extensive experiments conducted on synthetic and real-world datasets demonstrate that the proposed method can achieve comparable or even superior performance compared to both previous SOTA methods and the teacher model, in just one sampling step, resulting in a remarkable speedup of up to 10x for inference. Our code will be released at https://github.com/wyf0912/SinSR. This paper proposes SinSR, a novel method for single-step super-resolution (SR) image generation using a distilled deterministic mapping from a pre-trained diffusion model. Existing diffusion-based SR methods, while effective, suffer from slow inference speed due to the numerous steps required in the Markov chain. The authors first derive a deterministic sampling process from a state-of-the-art SR diffusion model (ResShift). Then, they train a student network to learn the deterministic mapping between input noise and the generated HR image in a single step using a novel consistency-preserving loss that leverages ground-truth images. SinSR achieves comparable or superior performance to state-of-the-art SR methods on both synthetic and real-world datasets. The method reduces inference steps from 15 to 1, resulting in a significant speedup. Directly learning the deterministic mapping between noise and HR images is shown to be more effective than denoising at different noise levels. The training process, while faster than training from scratch, still involves solving ODEs, which can be computationally expensive. Further exploration of alternative teacher diffusion models and distillation strategies could potentially yield additional performance gains. super-resolution, diffusion models, image generation, single-step inference, knowledge distillation
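Conceptually, the distillation in the SinSR entry pairs a one-step student prediction against the teacher's deterministic multi-step output while also anchoring it to the ground-truth HR image. The sketch below uses plain MSE terms and assumed interfaces (`student`, `teacher_ode_solve`) as a simplified stand-in for the paper's losses, including its consistency-preserving term.

```python
import torch
import torch.nn.functional as F


def distill_step(student, teacher_ode_solve, lr_img, noise, hr_gt, w_gt=1.0):
    """One training step of distilling a deterministic multi-step teacher mapping into a
    single-step student (a sketch under assumed interfaces, not the official SinSR code).

    teacher_ode_solve(noise, lr_img): runs the teacher's deterministic sampler to completion
    student(noise, lr_img): predicts the HR image in a single forward pass"""
    with torch.no_grad():
        target = teacher_ode_solve(noise, lr_img)       # teacher's multi-step deterministic output
    pred = student(noise, lr_img)                       # one-step student prediction
    loss = F.mse_loss(pred, target)                     # match the teacher's noise-to-image mapping
    loss = loss + w_gt * F.mse_loss(pred, hr_gt)        # simplified stand-in for the ground-truth term
    return loss
```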
2311.14749 Report Compositional Zero-shot Learning via Progressive Language-based Observations Lin Li, Guikun Chen, Jun Xiao, Long Chen Compositional zero-shot learning aims to recognize unseen state-object compositions by leveraging known primitives (state and object) during training. However, effectively modeling interactions between primitives and generalizing knowledge to novel compositions remains a perennial challenge. There are two key factors: object-conditioned and state-conditioned variance, i.e., the appearance of states (or objects) can vary significantly when combined with different objects (or states). For instance, the state "old" can signify a vintage design for a "car" or an advanced age for a "cat". In this paper, we argue that these variances can be mitigated by predicting composition categories based on pre-observed primitive. To this end, we propose Progressive Language-based Observations (PLO), which can dynamically determine a better observation order of primitives. These observations comprise a series of concepts or languages that allow the model to understand image content in a step-by-step manner. Specifically, PLO adopts pre-trained vision-language models (VLMs) to empower the model with observation capabilities. We further devise two variants: 1) PLO-VLM: a two-step method, where a pre-observing classifier dynamically determines the observation order of two primitives. 2) PLO-LLM: a multi-step scheme, which utilizes large language models (LLMs) to craft composition-specific prompts for step-by-step observing. Extensive ablations on three challenging datasets demonstrate the superiority of PLO compared with state-of-the-art methods, affirming its abilities in compositional recognition. The paper proposes Progressive Language-based Observations (PLO), a novel approach for compositional zero-shot learning that dynamically determines the order of observations using language to recognize unseen state-object compositions. Effectively modeling interactions between primitives (states and objects) and generalizing to novel compositions in CZSL is challenging due to object-conditioned and state-conditioned variance in visual appearance. PLO leverages pre-trained vision-language models (VLMs) to enable models to observe image content step-by-step. It has two variants: PLO-VLM uses a pre-observing classifier to dynamically determine the observation order of primitives, while PLO-LLM utilizes large language models (LLMs) to craft composition-specific prompts for step-by-step observation. PLO outperforms state-of-the-art CZSL methods on three benchmark datasets (MIT-States, UT-Zappos, and C-GQA) in both closed-world and open-world settings. Dynamically determining the observation order based on image content leads to better performance than fixed observation orders. Increasing the number of observation prompts in PLO-LLM generally improves accuracy. PLO primarily focuses on recognizing novel compositions of seen states and objects, not entirely new state or object categories. PLO-LLM's reliance on external language model APIs introduces cost constraints, especially with a large number of composition categories. compositional zero-shot learning, vision-language models, large language models, dynamic observation order, progressive observation
2311.14671 Report SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation Lingchen Meng, Shiyi Lan, Hengduo Li, Jose M. Alvarez, Zuxuan Wu, Yu-Gang Jiang In-context segmentation aims at segmenting novel images using a few labeled example images, termed as "in-context examples", exploring content similarities between examples and the target. The resulting models can be generalized seamlessly to novel segmentation tasks, significantly reducing the labeling and training costs compared with conventional pipelines. However, in-context segmentation is more challenging than classic ones requiring the model to learn segmentation rules conditioned on a few samples. Unlike previous work with ad-hoc or non-end-to-end designs, we propose SEGIC, an end-to-end segment-in-context framework built upon a single vision foundation model (VFM). In particular, SEGIC leverages the emergent correspondence within VFM to capture dense relationships between target images and in-context samples. As such, information from in-context samples is then extracted into three types of instructions, i.e. geometric, visual, and meta instructions, serving as explicit conditions for the final mask prediction. SEGIC is a straightforward yet effective approach that yields state-of-the-art performance on one-shot segmentation benchmarks. Notably, SEGIC can be easily generalized to diverse tasks, including video object segmentation and open-vocabulary segmentation. Code will be available at https://github.com/MengLcool/SEGIC. SEGIC is an end-to-end segment-in-context framework that leverages the emergent correspondence of a single frozen vision foundation model for in-context segmentation. In-context learning in vision, particularly for segmentation, is challenging but highly desirable as it allows models to generalize to novel segmentation tasks with low training costs. SEGIC leverages a pre-trained vision foundation model to establish dense correspondences between target images and in-context examples. It then extracts geometric, visual, and meta instructions from in-context samples to guide a lightweight mask decoder for segmentation. SEGIC achieves state-of-the-art performance on one-shot segmentation benchmarks, including COCO-20i, FSS-1000, and LVIS-92i. Without fine-tuning on video data, SEGIC demonstrates competitive zero-shot video object segmentation performance on DAVIS-17 and YouTube-VOS-18. It shows strong performance on generic semantic segmentation (COCO, ADE20k) and open-vocabulary semantic segmentation (PC-459, ADE-847) benchmarks. Current work mainly focuses on utilizing one in-context example per entity. Instance-level segmentation in an open-world setting is not extensively explored. in-context learning, segmentation generalist, vision foundation model, emergent correspondence, one-shot segmentation
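A minimal sketch of how emergent correspondence can turn an in-context mask into a soft prior on the target image: normalize frozen-VFM features, compute dense cosine similarity, and attention-average the example's mask. The shapes, temperature, and single-example setting are assumptions; SEGIC additionally extracts geometric, visual, and meta instructions and uses a learned mask decoder.

```python
import torch
import torch.nn.functional as F


def correspondence_mask_prior(feat_tgt, feat_ref, mask_ref, tau=0.07):
    """Propagate an in-context mask onto a target image via dense feature correspondence.

    feat_tgt: (C, H, W) frozen-VFM features of the target image
    feat_ref: (C, H, W) features of the in-context example
    mask_ref: (H, W) binary mask of the in-context example
    Returns an (H, W) soft mask prior for the target image."""
    C, H, W = feat_tgt.shape
    t = F.normalize(feat_tgt.reshape(C, -1), dim=0)        # (C, HW_tgt), unit-norm per location
    r = F.normalize(feat_ref.reshape(C, -1), dim=0)        # (C, HW_ref)
    sim = (t.T @ r) / tau                                  # (HW_tgt, HW_ref) cosine similarities
    attn = sim.softmax(dim=-1)                             # each target location attends to the example
    prior = attn @ mask_ref.reshape(-1, 1).float()         # weighted average of the example's mask
    return prior.reshape(H, W)
```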
2311.14631 Report CatVersion: Concatenating Embeddings for Diffusion-Based Text-to-Image Personalization Ruoyu Zhao, Mingrui Zhu, Shiyin Dong, Nannan Wang, Xinbo Gao We propose CatVersion, an inversion-based method that learns the personalized concept through a handful of examples. Subsequently, users can utilize text prompts to generate images that embody the personalized concept, thereby achieving text-to-image personalization. In contrast to existing approaches that emphasize word embedding learning or parameter fine-tuning for the diffusion model, which potentially causes concept dilution or overfitting, our method concatenates embeddings on the feature-dense space of the text encoder in the diffusion model to learn the gap between the personalized concept and its base class, aiming to maximize the preservation of prior knowledge in diffusion models while restoring the personalized concepts. To this end, we first dissect the text encoder's integration in the image generation process to identify the feature-dense space of the encoder. Afterward, we concatenate embeddings on the Keys and Values in this space to learn the gap between the personalized concept and its base class. In this way, the concatenated embeddings ultimately manifest as a residual on the original attention output. To more accurately and unbiasedly quantify the results of personalized image generation, we improve the CLIP image alignment score based on masks. Qualitatively and quantitatively, CatVersion helps to restore personalization concepts more faithfully and enables more robust editing. CatVersion, a novel text-to-image personalization method, concatenates embeddings into a highly integrated feature space within the text encoder of diffusion models, learning the difference between a personalized concept and its base class. Existing T2I personalization methods struggle with concept dilution or overfitting when learning personalized concepts. This work addresses these limitations by learning in a feature-dense space within the text encoder, improving the fidelity of personalized concept restoration and text-guided editability. The authors first identify a feature-dense space within the last few layers of the CLIP text encoder. Then, learnable embeddings are concatenated to the Keys and Values in this space and optimized to learn the difference between the personalized concept and its base class. This difference is ultimately represented as a residual on the original attention output. CatVersion demonstrates superior performance in restoring personalized concepts and enabling text-guided editing compared to baseline methods. Optimizing embeddings in a feature-dense space leads to better learning of the target concept and improves contextual understanding. Concatenating residual embeddings significantly enhances the reconstruction ability of personalized concepts. The current implementation requires separate optimization for each concept, impacting inversion speed. The method is limited to learning a single concept per optimization process. text-to-image generation, personalization, diffusion models, clip, concept inversion
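CatVersion's core operation, concatenating learnable embeddings to the Keys and Values of the text encoder's attention so that they act as a residual on the attention output, can be sketched as below. This is a minimal single-head approximation with made-up module and parameter names, not the authors' implementation.

```python
import torch
import torch.nn as nn

class KVConcatAttention(nn.Module):
    """Cross-attention whose Keys/Values are extended with learnable concept embeddings."""

    def __init__(self, dim=768, n_concept_tokens=4):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        # only these embeddings would be optimized; the rest of the model stays frozen
        self.concept_k = nn.Parameter(torch.zeros(n_concept_tokens, dim))
        self.concept_v = nn.Parameter(torch.zeros(n_concept_tokens, dim))

    def forward(self, x, text_ctx):
        q = self.to_q(x)             # (B, N, C) queries
        k = self.to_k(text_ctx)      # (B, T, C) keys from the text context
        v = self.to_v(text_ctx)
        B = x.shape[0]
        k = torch.cat([k, self.concept_k.expand(B, -1, -1)], dim=1)
        v = torch.cat([v, self.concept_v.expand(B, -1, -1)], dim=1)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v              # the concept tokens contribute a residual-like term

out = KVConcatAttention()(torch.randn(2, 64, 768), torch.randn(2, 77, 768))
```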
2311.14603 Report Animate124: Animating One Image to 4D Dynamic Scene Yuyang Zhao, Zhiwen Yan, Enze Xie, Lanqing Hong, Zhenguo Li, Gim Hee Lee We introduce Animate124 (Animate-one-image-to-4D), the first work to animate a single in-the-wild image into 3D video through textual motion descriptions, an underexplored problem with significant applications. Our 4D generation leverages an advanced 4D grid dynamic Neural Radiance Field (NeRF) model, optimized in three distinct stages using multiple diffusion priors. Initially, a static model is optimized using the reference image, guided by 2D and 3D diffusion priors, which serves as the initialization for the dynamic NeRF. Subsequently, a video diffusion model is employed to learn the motion specific to the subject. However, the object in the 3D videos tends to drift away from the reference image over time. This drift is mainly due to the misalignment between the text prompt and the reference image in the video diffusion model. In the final stage, a personalized diffusion prior is therefore utilized to address the semantic drift. As the pioneering image-text-to-4D generation framework, our method demonstrates significant advancements over existing baselines, evidenced by comprehensive quantitative and qualitative assessments. Animate124 is the first framework to animate a single in-the-wild image into 3D video with motion defined by a text prompt. Dynamic 3D scenes effectively represent the real world and have applications in video games, AR, and VR. A static-to-dynamic and coarse-to-fine strategy optimizes a 4D grid dynamic NeRF using diffusion priors from 2D image, 3D, and personalized image diffusion models in three stages. Animate124 outperforms baselines in generating coherent 3D videos from single images and text prompts. Animate124 exhibits superior control over the protagonist's motion compared to MAV3D. Semantic refinement using a personalized diffusion prior effectively mitigates semantic drift. The reliance on a large CFG scale for SDS can lead to over-saturation and over-smoothing. Limited availability of diverse and high-quality image-text-4D datasets. 3d video generation, dynamic nerf, diffusion models, text-to-3d, image animation
2311.14552 Report Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models Yufei Zhan, Yousong Zhu, Zhiyang Chen, Fan Yang, Ming Tang, Jinqiao Wang Replicating the innate human ability to detect all objects based on free-form texts at any granularity remains a formidable challenge for Vision-Language models. Current Large Vision Language Models (LVLMs) are predominantly constrained to grounding a single, pre-existing object, relying solely on data from Referring Expression Comprehension tasks. The limitation leads to a compromise in model design, necessitating the introduction of visual expert models or the integration of customized head structures. Beyond these constraints, our research delves into the untapped potential of LVLMs and uncovers their inherent capability for basic object perception, allowing them to accurately identify and locate objects of interest. Building on this insight, we introduce a novel language-prompted localization dataset designed to fully unleash the capabilities of LVLMs in integrating fine-grained object perception with precise location awareness. More importantly, we present $\textbf{Griffon}$, a purely LVLM-based baseline, which does not require the introduction of any special tokens, expert models, or additional detection modules. It simply maintains a consistent structure with popular LVLMs by unifying data formats across various localization-related scenarios and is trained end-to-end through a well-designed pipeline. Comprehensive experiments demonstrate that $\textbf{Griffon}$ not only achieves state-of-the-art performance on the fine-grained RefCOCO series but also approaches the capabilities of the expert model Faster RCNN on the detection benchmark MSCOCO. This paper introduces a novel language-prompted localization dataset and a purely LVLM-based baseline model called Griffon, capable of localizing objects at any granularity based on free-form input texts. Existing Vision-Language models struggle to locate multiple objects from complex text descriptions, often relying on external expert models or specialized heads, limiting their generalizability and efficiency. The authors construct a large-scale dataset with various localization scenarios and train Griffon in two stages: (1) basic scenario pre-training for multi-object perception and (2) full scenario instruction tuning for user intention comprehension. A training-free scoring mechanism ranks object outputs for improved confidence. Griffon achieves state-of-the-art results on the RefCOCO series for single referent localization. Griffon approaches the performance of the expert model Faster RCNN on the MSCOCO object detection benchmark. Griffon effectively handles complex scenarios, including localizing multiple objects of the same category and refusing to output for non-existing objects. The current work mainly focuses on localization tasks. Future work will explore integrating other vision and language tasks into Griffon. vision-language models, object localization, referring expression comprehension, object detection, multi-object perception
2311.14521 Report GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting Yiwen Chen, Zilong Chen, Chi Zhang, Feng Wang, Xiaofeng Yang, Yikai Wang, Zhongang Cai, Lei Yang, Huaping Liu, Guosheng Lin 3D editing plays a crucial role in many areas such as gaming and virtual reality. Traditional 3D editing methods, which rely on representations like meshes and point clouds, often fall short in realistically depicting complex scenes. On the other hand, methods based on implicit 3D representations, like Neural Radiance Field (NeRF), render complex scenes effectively but suffer from slow processing speeds and limited control over specific scene areas. In response to these challenges, our paper presents GaussianEditor, an innovative and efficient 3D editing algorithm based on Gaussian Splatting (GS), a novel 3D representation. GaussianEditor enhances precision and control in editing through our proposed Gaussian semantic tracing, which traces the editing target throughout the training process. Additionally, we propose Hierarchical Gaussian splatting (HGS) to achieve stabilized and fine results under stochastic generative guidance from 2D diffusion models. We also develop editing strategies for efficient object removal and integration, a challenging task for existing methods. Our comprehensive experiments demonstrate GaussianEditor's superior control, efficacy, and rapid performance, marking a significant advancement in 3D editing. Project Page: https://buaacyw.github.io/gaussian-editor/ Presents GaussianEditor, a novel 3D editing algorithm based on Gaussian Splatting for fast and controllable 3D scene editing. Traditional mesh-based editing struggles with complex scenes, while NeRF-based editing is slow and lacks local control. GaussianEditor leverages Gaussian Splatting's advantages for speed and controllability in 3D editing. Introduces Gaussian semantic tracing for precise editing target localization and Hierarchical Gaussian Splatting (HGS) for stable optimization under generative guidance. Develops specific algorithms for object removal and integration in Gaussian Splatting. Achieves superior control over editing areas compared to previous methods like Instruct-Nerf2Nerf. Enables efficient object removal and integration within minutes, significantly faster than NeRF-based editing. Demonstrates effectiveness in various editing tasks, including scene modification, facial swaps, and object manipulation. Reliance on 2D diffusion models for guidance limits editing capabilities for complex prompts. Future work includes exploring alternative guidance mechanisms and expanding editing functionalities. 3d editing, gaussian splatting, generative guidance, semantic tracing, 3d inpainting
2311.14494 Report MVControl: Adding Conditional Control to Multi-view Diffusion for Controllable Text-to-3D Generation Zhiqi Li, Yiming Chen, Lingzhe Zhao, Peidong Liu We introduce MVControl, a novel neural network architecture that enhances existing pre-trained multi-view 2D diffusion models by incorporating additional input conditions, e.g. edge maps. Our approach enables the generation of controllable multi-view images and view-consistent 3D content. To achieve controllable multi-view image generation, we leverage MVDream as our base model, and train a new neural network module as additional plugin for end-to-end task-specific condition learning. To precisely control the shapes and views of generated images, we innovatively propose a new conditioning mechanism that predicts an embedding encapsulating the input spatial and view conditions, which is then injected to the network globally. Once MVControl is trained, score-distillation (SDS) loss based optimization can be performed to generate 3D content, in which process we propose to use a hybrid diffusion prior. The hybrid prior relies on a pre-trained Stable-Diffusion network and our trained MVControl for additional guidance. Extensive experiments demonstrate that our method achieves robust generalization and enables the controllable generation of high-quality 3D content. Code available at https://github.com/WU-CVGL/MVControl/. Introduces MVControl, a novel neural network architecture enhancing pre-trained multi-view 2D diffusion models with additional input conditions (e.g., edge maps) for controllable text-to-3D generation. Addresses the limitations of existing text-to-3D generation methods in achieving fine-grained control over generated content, similar to ControlNet in text-to-image generation. Leverages MVDream as the base model and incorporates a trainable control network. Employs a conditioning module to predict embeddings from input conditions (e.g., edge maps, camera poses), injecting them into the network for control. Utilizes a hybrid diffusion prior with Stable-Diffusion and MVControl for controllable text-to-3D generation via score distillation optimization. Achieves fine-grained control over the shapes and views of generated multi-view images. Generates high-fidelity controllable multi-view images and view-consistent 3D content. Demonstrates robust generalization and superior performance compared to prior text-to-3D methods. Current implementation primarily explores edge maps as conditional input. Reliance on pre-trained models might limit the generation of novel object categories. text-to-3d generation, multi-view diffusion models, controllable image synthesis, score distillation sampling, 3d deep learning
2311.14284 Report Paragraph-to-Image Generation with Information-Enriched Diffusion Model Weijia Wu, Zhuang Li, Yefei He, Mike Zheng Shou, Chunhua Shen, Lele Cheng, Yan Li, Tingting Gao, Di Zhang, Zhongyuan Wang Text-to-image (T2I) models have recently experienced rapid development, achieving astonishing performance in terms of fidelity and textual alignment capabilities. However, given a long paragraph (up to 512 words), these generation models still struggle to achieve strong alignment and are unable to generate images depicting complex scenes. In this paper, we introduce an information-enriched diffusion model for the paragraph-to-image generation task, termed ParaDiffusion, which delves into the transference of the extensive semantic comprehension capabilities of large language models to the task of image generation. At its core is using a large language model (e.g., Llama V2) to encode long-form text, followed by fine-tuning with LoRA to align the text-image feature spaces in the generation task. To facilitate the training of long-text semantic alignment, we also curated a high-quality paragraph-image pair dataset, namely ParaImage. This dataset contains a small amount of high-quality, meticulously annotated data, and a large-scale synthetic dataset with long text descriptions being generated using a vision-language model. Experiments demonstrate that ParaDiffusion outperforms state-of-the-art models (SD XL, DeepFloyd IF) on ViLG-300 and ParaPrompts, achieving up to 15% and 45% human voting rate improvements for visual appeal and text faithfulness, respectively. The code and dataset will be released to foster community research on long-text alignment. Introduces ParaDiffusion, an information-enriched diffusion model for paragraph-to-image generation, tackling the challenge of aligning long-form text with images. Existing T2I models struggle with long paragraphs due to limitations in data (short captions) and architecture (text encoder constraints). 1. Created ParaImage, a dataset with paragraph-image pairs (up to 400 words), including synthetic (ParaImage-Big) and manually annotated (ParaImage-Small) data. 2. Employed Llama V2 as a text encoder, fine-tuned with LoRA to align text and image feature spaces. 3. Three-stage training: pre-training, paragraph-image alignment learning, quality tuning. Outperforms state-of-the-art models (SD XL, DeepFloyd IF) on visual appeal and text faithfulness by up to 15% and 45%, respectively. Fine-tuning LLM with LoRA significantly improves performance compared to using frozen LLMs. ParaImage dataset, especially the manually annotated portion, proves crucial for high-quality results. Inference speed needs optimization (consider ODE solvers, consistency models). Occasional unrealistic image generation (address with data augmentation, geometric/semantic constraints). text-to-image generation, long-text alignment, diffusion models, large language models, dataset creation
2311.14282 Report Image Super-Resolution with Text Prompt Diffusion Zheng Chen, Yulun Zhang, Jinjin Gu, Xin Yuan, Linghe Kong, Guihai Chen, Xiaokang Yang Image super-resolution (SR) methods typically model degradation to improve reconstruction accuracy in complex and unknown degradation scenarios. However, extracting degradation information from low-resolution images is challenging, which limits the model performance. To boost image SR performance, one feasible approach is to introduce additional priors. Inspired by advancements in multi-modal methods and text prompt image processing, we introduce text prompts to image SR to provide degradation priors. Specifically, we first design a text-image generation pipeline to integrate text into the SR dataset through the text degradation representation and degradation model. The text representation applies a discretization manner based on the binning method to describe the degradation abstractly. This method maintains the flexibility of the text and is user-friendly. Meanwhile, we propose the PromptSR to realize the text prompt SR. The PromptSR utilizes the pre-trained language model (e.g., T5 or CLIP) to enhance restoration. We train the model on the generated text-image dataset. Extensive experiments indicate that introducing text prompts into SR, yields excellent results on both synthetic and real-world images. Code is available at: https://github.com/zhengchen1999/PromptSR. This paper introduces text prompts as degradation priors to enhance image super-resolution by providing additional information about image degradation. Modeling degradation in image super-resolution is crucial, especially in complex real-world scenarios, and using text prompts can provide richer degradation information than solely relying on low-resolution images. The authors propose a text-image generation pipeline to create a dataset containing low-resolution images, corresponding high-resolution images, and text prompts describing the degradation. They also introduce PromptSR, a network based on a diffusion model and a pre-trained language model, to perform super-resolution conditioned on both the low-resolution image and the text prompt. Introducing text prompts significantly improves super-resolution performance compared to methods without text guidance. The proposed method exhibits flexibility in handling different degradation and prompt formats, including random order and simplified descriptions. PromptSR achieves superior performance on both synthetic and real-world datasets, demonstrating the effectiveness of incorporating text prompts in image super-resolution. The performance slightly drops when using randomly ordered degradation operations compared to a fixed order. Future work could explore combining image content descriptions with degradation descriptions in the text prompt for potential further improvements. image super-resolution, text prompt, diffusion model, degradation prior, blind super-resolution
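The text degradation representation in PromptSR discretizes degradation parameters into words via binning. A toy version of that idea is sketched below; the bin edges, vocabulary, and prompt format are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np

# hypothetical bin edges and vocabulary; the paper's actual discretization may differ
LEVELS = ["none", "light", "medium", "heavy"]

def level(value, edges):
    """Map a continuous degradation parameter to a coarse text level via binning."""
    return LEVELS[int(np.digitize(value, edges))]

def degradation_prompt(blur_sigma, noise_sigma, jpeg_quality, scale=4):
    """Compose a degradation description to pair with a synthesized low-resolution image."""
    return ", ".join([
        f"{level(blur_sigma, [0.5, 1.5, 3.0])} blur",
        f"{level(noise_sigma, [5, 15, 30])} noise",
        f"{level(100 - jpeg_quality, [10, 30, 50])} compression",  # lower quality = heavier
        f"downsample x{scale}",
    ])

print(degradation_prompt(blur_sigma=2.1, noise_sigma=12, jpeg_quality=60))
# -> "medium blur, light noise, medium compression, downsample x4"
```

Such a prompt would then be encoded by the pre-trained language model (e.g., T5 or CLIP) and fed to the restoration network alongside the low-resolution input.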
2311.14208 Report ECRF: Entropy-Constrained Neural Radiance Fields Compression with Frequency Domain Optimization Soonbin Lee, Fangwen Shu, Yago Sanchez, Thomas Schierl, Cornelius Hellge Explicit feature-grid based NeRF models have shown promising results in terms of rendering quality and significant speed-up in training. However, these methods often require a significant amount of data to represent a single scene or object. In this work, we present a compression model that aims to minimize the entropy in the frequency domain in order to effectively reduce the data size. First, we propose using the discrete cosine transform (DCT) on the tensorial radiance fields to compress the feature-grid. This feature-grid is transformed into coefficients, which are then quantized and entropy encoded, following a similar approach to the traditional video coding pipeline. Furthermore, to achieve a higher level of sparsity, we propose using an entropy parameterization technique for the frequency domain, specifically for DCT coefficients of the feature-grid. Since the transformed coefficients are optimized during the training phase, the proposed model does not require any fine-tuning or additional information. Our model only requires a lightweight compression pipeline for encoding and decoding, making it easier to apply volumetric radiance field methods for real-world applications. Experimental results demonstrate that our proposed frequency domain entropy model can achieve superior compression performance across various datasets. The source code will be made publicly available. This paper introduces Entropy-Constrained Radiance Fields (ECRF), a novel compression framework for tensorial radiance fields that minimizes entropy in the DCT coefficient domain for efficient compression. Explicit grid-based NeRF models, while efficient in training and rendering, often lead to large storage sizes, hindering their practicality. This work addresses this issue by significantly compressing these models without compromising rendering quality. The proposed ECRF employs a frequency-domain entropy parameterization technique. This involves applying DCT to the feature-grid, quantizing the coefficients to 8-bit, and finally employing entropy coding for a compact representation. ECRF achieves superior compression performance, especially at low bitrates, outperforming existing methods. The use of DCT and entropy minimization in the frequency domain leads to more sparse and efficient representations compared to spatial domain methods. The proposed compression pipeline, including quantization and entropy coding, achieves a significant reduction in model size (up to 28x) with minimal impact on rendering quality (PSNR drop of only 0.1 dB). The entropy calculation adds overhead to the training time (4-5 minutes longer than the baseline). Extremely low bitrates can lead to block artifacts due to the loss of high-frequency information. neural radiance fields, nerf compression, 3d scene representation, frequency domain compression, discrete cosine transform
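ECRF's pipeline, DCT on the feature grid followed by quantization and entropy coding, can be illustrated on a toy 2D feature plane as below. The fixed step size and the empirical-entropy proxy (instead of a learned entropy model and a real entropy coder) are simplifications for illustration, not the paper's method.

```python
import numpy as np
from scipy.fft import dctn, idctn

def compress_plane(feature_plane, step=0.02):
    """Toy frequency-domain pipeline: DCT -> uniform quantization -> entropy estimate -> decode."""
    coeffs = dctn(feature_plane, norm="ortho")       # transform the feature grid to DCT coefficients
    q = np.round(coeffs / step).astype(np.int32)     # uniform quantization (stand-in for 8-bit)
    # empirical entropy in bits per coefficient, a proxy for the entropy-coded size
    _, counts = np.unique(q, return_counts=True)
    p = counts / counts.sum()
    bits = float(-(p * np.log2(p)).sum())
    recon = idctn(q * step, norm="ortho")            # decode: dequantize and inverse DCT
    return recon, bits

plane = (np.random.randn(64, 64) * 0.1).astype(np.float32)
recon, bits = compress_plane(plane)
print(f"~{bits:.2f} bits/coeff, MSE={np.mean((plane - recon) ** 2):.2e}")
```

In the paper this entropy term is minimized during training so that the coefficients themselves become sparse; the snippet only measures it after the fact.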
2311.14097 Report ACT-Diffusion: Efficient Adversarial Consistency Training for One-step Diffusion Models Fei Kong, Jinhao Duan, Lichao Sun, Hao Cheng, Renjing Xu, Hengtao Shen, Xiaofeng Zhu, Xiaoshuang Shi, Kaidi Xu Though diffusion models excel in image generation, their step-by-step denoising leads to slow generation speeds. Consistency training addresses this issue with single-step sampling but often produces lower-quality generations and requires high training costs. In this paper, we show that optimizing consistency training loss minimizes the Wasserstein distance between target and generated distributions. As timestep increases, the upper bound accumulates previous consistency training losses. Therefore, larger batch sizes are needed to reduce both current and accumulated losses. We propose Adversarial Consistency Training (ACT), which directly minimizes the Jensen-Shannon (JS) divergence between distributions at each timestep using a discriminator. Theoretically, ACT enhances generation quality and convergence. By incorporating a discriminator into the consistency training framework, our method achieves improved FID scores on the CIFAR10, ImageNet 64$\times$64, and LSUN Cat 256$\times$256 datasets, retains zero-shot image inpainting capabilities, and uses less than $1/6$ of the original batch size and fewer than $1/2$ of the model parameters and training steps compared to the baseline method, leading to a substantial reduction in resource consumption. Our code is available: https://github.com/kong13661/ACT This paper introduces Adversarial Consistency Training (ACT), a novel method for improving the efficiency and performance of consistency training in diffusion models. Diffusion models excel in image generation but suffer from slow generation speeds due to iterative denoising. Consistency training accelerates this process with single-step sampling, but often compromises generation quality. This work aims to address these limitations. The paper analyzes consistency training loss and proves its equivalence to optimizing the upper bound of the Wasserstein distance. To mitigate accumulated errors, ACT incorporates a discriminator into consistency training, directly minimizing the Jensen-Shannon divergence between distributions at each timestep. Additionally, it utilizes gradient penalty-based adaptive data augmentation to further enhance performance. ACT achieves significantly better FID scores compared to standard consistency training on CIFAR10, ImageNet 64x64, and LSUN Cat 256x256 datasets. ACT achieves these improvements with a significantly smaller batch size (less than 1/6th), fewer model parameters, and fewer training steps compared to the baseline consistency training method. The proposed method retains the zero-shot image inpainting capability inherent to consistency models. The interaction between consistency training loss and the adversarial loss introduced by the discriminator requires further investigation. Exploration of distances beyond Jensen-Shannon Divergence for minimizing the discrepancy between generated and target distributions could be beneficial. diffusion models, generative adversarial networks, consistency training, image generation, adversarial consistency training
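Schematically, ACT augments the usual consistency-training objective with a discriminator term at each timestep. The sketch below shows one plausible form of the two losses; the model, discriminator, noise schedule, and non-saturating GAN formulation are placeholders, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def act_losses(f_theta, f_ema, disc, x0, sigmas, t):
    """Schematic per-step losses for adversarial consistency training.

    f_theta, f_ema: student and EMA consistency models mapping (x_noisy, sigma) -> x0 estimate.
    disc:           discriminator scoring realism of x0 estimates (returns logits).
    x0:             clean images (B, C, H, W); sigmas: 1-D noise schedule; t: sampled indices (B,).
    """
    s_hi = sigmas[t + 1].view(-1, 1, 1, 1)   # adjacent noise levels for the sampled timesteps
    s_lo = sigmas[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)             # consistency training shares the noise across levels
    pred = f_theta(x0 + s_hi * noise, s_hi)
    with torch.no_grad():
        target = f_ema(x0 + s_lo * noise, s_lo)
    consistency = F.mse_loss(pred, target)   # standard consistency-training term
    # non-saturating GAN terms: push predictions toward the data distribution at each timestep
    g_loss = F.softplus(-disc(pred)).mean()
    d_loss = F.softplus(-disc(x0)).mean() + F.softplus(disc(pred.detach())).mean()
    return consistency + g_loss, d_loss      # optimize the first with f_theta, the second with disc
```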
2311.14029 Report Understanding the Vulnerability of CLIP to Image Compression Cangxiong Chen, Vinay P. Namboodiri, Julian Padget CLIP is a widely used foundational vision-language model that is used for zero-shot image recognition and other image-text alignment tasks. We demonstrate that CLIP is vulnerable to change in image quality under compression. This surprising result is further analysed using an attribution method-Integrated Gradients. Using this attribution method, we are able to better understand both quantitatively and qualitatively exactly the nature in which the compression affects the zero-shot recognition accuracy of this model. We evaluate this extensively on CIFAR-10 and STL-10. Our work provides the basis to understand this vulnerability of CLIP and can help us develop more effective methods to improve the robustness of CLIP and other vision-language models. This paper investigates the sensitivity of CLIP, a popular vision-language model, to image compression in zero-shot image recognition tasks. CLIP, trained on massive datasets with diverse image qualities, is expected to be robust to image degradation. However, this paper discovers its vulnerability to compression, which is crucial for understanding and improving the reliability of vision-language models. The authors evaluate CLIP's performance on compressed CIFAR-10 and STL-10 datasets with different image encoders. They employ Integrated Gradients, an attribution method, to analyze how compression affects predictions at the pixel level. CLIP's accuracy significantly decreases with increasing image compression on both CIFAR-10 and STL-10. Integrated Gradients effectively quantifies and visualizes the impact of compression on CLIP's predictions. The visualizations reveal the inductive biases of different image encoders, such as ResNet-50's locality and ViT-B/32's global attention. The study primarily focuses on JPEG compression and a fixed text prompt, limiting the generalizability of findings. Future work includes investigating mitigation strategies like data augmentation to enhance CLIP's robustness to image quality variations. clip, vision-language models, image compression, robustness, integrated gradients
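The evaluation protocol described in this entry, re-encoding images at decreasing JPEG quality and measuring CLIP's zero-shot prediction, can be approximated with the sketch below. It uses the public OpenAI CLIP package and Pillow; the prompt template, class list, and example file name are assumptions, and dataset-level accuracy would aggregate these per-image predictions.

```python
import io
import torch
import clip                      # the openai/CLIP package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)   # or a ResNet-50 backbone

def jpeg(img: Image.Image, quality: int) -> Image.Image:
    """Round-trip an image through JPEG at the given quality."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

@torch.no_grad()
def zero_shot_predict(img, class_names, quality):
    image = preprocess(jpeg(img, quality)).unsqueeze(0).to(device)
    text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    img_f = model.encode_image(image).float()
    txt_f = model.encode_text(text).float()
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    return (img_f @ txt_f.T).argmax(dim=-1).item()

# sweep compression levels for one image; "example.png" is a placeholder path
classes = ["airplane", "bird", "cat", "dog", "ship"]        # e.g. a subset of CIFAR-10 labels
img = Image.open("example.png").convert("RGB")
for q in (95, 75, 50, 25, 10):
    print(q, classes[zero_shot_predict(img, classes, q)])
```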
2311.13833 Report Lego: Learning to Disentangle and Invert Concepts Beyond Object Appearance in Text-to-Image Diffusion Models Saman Motamed, Danda Pani Paudel, Luc Van Gool Diffusion models have revolutionized generative content creation and text-to-image (T2I) diffusion models in particular have increased the creative freedom of users by allowing scene synthesis using natural language. T2I models excel at synthesizing concepts such as nouns, appearances, and styles. To enable customized content creation based on a few example images of a concept, methods such as Textual Inversion and DreamBooth invert the desired concept and enable synthesizing it in new scenes. However, inverting more general concepts that go beyond object appearance and style (adjectives and verbs) through natural language, remains a challenge. Two key characteristics of these concepts contribute to the limitations of current inversion methods. 1) Adjectives and verbs are entangled with nouns (subject) and can hinder appearance-based inversion methods, where the subject appearance leaks into the concept embedding and 2) describing such concepts often extends beyond single word embeddings (being frozen in ice, walking on a tightrope, etc.) that current methods do not handle. In this study, we introduce Lego, a textual inversion method designed to invert subject entangled concepts from a few example images. Lego disentangles concepts from their associated subjects using a simple yet effective Subject Separation step and employs a Context Loss that guides the inversion of single/multi-embedding concepts. In a thorough user study, Lego-generated concepts were preferred over 70% of the time when compared to the baseline. Additionally, visual question answering using a large language model suggested Lego-generated concepts are better aligned with the text description of the concept. Introduced "Lego," a novel textual inversion method for text-to-image diffusion models that inverts general concepts from images, focusing on adjectives and verbs entangled with subjects. Current text-to-image models and inversion techniques struggle to represent concepts beyond object appearance, especially adjectives and verbs entangled with subjects, limiting creative control in image generation. Lego augments Textual Inversion with (1) "Subject Separation" to disentangle concept embeddings from subject appearance and (2) a contrastive "Context Loss" to guide learning of multi-embedding concepts. Lego successfully inverts concepts like "melting," "frozen in ice," and "walking on a rope," outperforming language-guided models. Human evaluation shows a strong preference for Lego-generated concepts (over 70%) compared to baseline language descriptions. Lego demonstrates compositionality by combining learned concepts and handling complex, multi-word embedding concepts. Lego's ability to invert concepts is limited by the capabilities of the backbone diffusion model (e.g., facial expressions with earlier versions). Future work includes extending Lego to learn dynamic concepts from videos. text-to-image synthesis, diffusion models, textual inversion, concept learning, generative ai
2311.13831 Report Posterior Distillation Sampling Juil Koo, Chanho Park, Minhyuk Sung We introduce Posterior Distillation Sampling (PDS), a novel optimization method for parametric image editing based on diffusion models. Existing optimization-based methods, which leverage the powerful 2D prior of diffusion models to handle various parametric images, have mainly focused on generation. Unlike generation, editing requires a balance between conforming to the target attribute and preserving the identity of the source content. Recent 2D image editing methods have achieved this balance by leveraging the stochastic latent encoded in the generative process of diffusion models. To extend the editing capabilities of diffusion models shown in pixel space to parameter space, we reformulate the 2D image editing method into an optimization form named PDS. PDS matches the stochastic latents of the source and the target, enabling the sampling of targets in diverse parameter spaces that align with a desired attribute while maintaining the source's identity. We demonstrate that this optimization resembles running a generative process with the target attribute, but aligning this process with the trajectory of the source's generative process. Extensive editing results in Neural Radiance Fields and Scalable Vector Graphics representations demonstrate that PDS is capable of sampling targets to fulfill the aforementioned balance across various parameter spaces. The paper introduces Posterior Distillation Sampling (PDS), a new optimization method for editing parametric images generated by diffusion models. Existing editing methods for parametric images struggle to balance conforming to the target attribute while preserving the source content's identity. PDS addresses this by aligning the generative process of the target image with that of the source image. PDS reformulates stochastic diffusion inversion, a 2D image editing method, into an optimization form. It matches the stochastic latents of the source and target images during the generative process, ensuring the target image inherits the source's identity. PDS enables complex geometric changes and object addition in NeRF editing, outperforming existing methods in both qualitative and quantitative comparisons. In SVG editing, PDS makes minimal changes to align with target prompts while preserving structural semantics better than other optimization methods. User studies for both NeRF and SVG editing demonstrate a strong preference for PDS results over baseline methods. The paper notes occasional artifacts in NeRF editing results, mitigated by a refinement stage using SDEdit and a reconstruction loss. Future work could explore further applications of PDS in other parametric image domains. diffusion models, image editing, parametric images, neural radiance fields (nerf), scalable vector graphics (svg)
2311.13681 Report Compact 3D Gaussian Representation for Radiance Field Joo Chan Lee, Daniel Rho, Xiangyu Sun, Jong Hwan Ko, Eunbyung Park Neural Radiance Fields (NeRFs) have demonstrated remarkable potential in capturing complex 3D scenes with high fidelity. However, one persistent challenge that hinders the widespread adoption of NeRFs is the computational bottleneck due to the volumetric rendering. On the other hand, 3D Gaussian splatting (3DGS) has recently emerged as an alternative representation that leverages a 3D Gaussian-based representation and adopts the rasterization pipeline to render the images rather than volumetric rendering, achieving very fast rendering speed and promising image quality. However, a significant drawback arises as 3DGS entails a substantial number of 3D Gaussians to maintain the high fidelity of the rendered images, which requires a large amount of memory and storage. To address this critical issue, we place a specific emphasis on two key objectives: reducing the number of Gaussian points without sacrificing performance and compressing the Gaussian attributes, such as view-dependent color and covariance. To this end, we propose a learnable mask strategy that significantly reduces the number of Gaussians while preserving high performance. In addition, we propose a compact but effective representation of view-dependent color by employing a grid-based neural field rather than relying on spherical harmonics. Finally, we learn codebooks to compactly represent the geometric attributes of Gaussians by vector quantization. With model compression techniques such as quantization and entropy coding, we consistently show over 25$\times$ reduced storage and enhanced rendering speed, while maintaining the quality of the scene representation, compared to 3DGS. Our work provides a comprehensive framework for 3D scene representation, achieving high performance, fast training, compactness, and real-time rendering. Our project page is available at https://maincold2.github.io/c3dgs/. This paper introduces a novel method for compactly representing 3D scenes using 3D Gaussians, significantly reducing storage and enhancing rendering speed in 3D Gaussian Splatting (3DGS) without compromising quality. 3DGS, despite its fast rendering, requires significant memory and storage due to the large number of Gaussians and their attributes. This work addresses this limitation, paving the way for efficient and high-quality 3D scene representation. The proposed method employs a learnable masking strategy to remove redundant Gaussians based on volume and transparency. It also utilizes a grid-based neural field for compact view-dependent color representation and learnable codebooks for efficient storage of Gaussian geometry (scale and rotation). Achieves over 25x storage reduction and significantly enhanced rendering speed compared to 3DGS across various datasets. Maintains high-quality scene reconstruction, comparable or even superior to 3DGS. Demonstrates the effectiveness of volume-based masking, compact color representation, and geometry codebooks through ablation studies. The training time of the proposed method is slightly longer than 3DGS due to the additional learning components. Future work includes exploring more efficient neural field architectures and codebook compression techniques for further reducing storage and memory requirements. 3d gaussian splatting, 3d scene representation, neural rendering, model compression, real-time rendering
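A plausible form of the learnable mask strategy in this entry is a per-Gaussian sigmoid score that is hard-thresholded in the forward pass but receives gradients through the soft value (a straight-through estimator). The sketch below is an illustration under those assumptions, not the released code; masking both opacity and scale follows the entry's description of volume- and transparency-based pruning, and the threshold value is made up.

```python
import torch
import torch.nn as nn

class GaussianMask(nn.Module):
    """Learnable per-Gaussian binary mask with a straight-through estimator."""

    def __init__(self, num_gaussians, threshold=0.01):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_gaussians))
        self.threshold = threshold

    def forward(self, opacity, scale):
        soft = torch.sigmoid(self.logits)              # (N,) soft keep-probabilities
        hard = (soft > self.threshold).float()         # binary keep/prune decision
        mask = hard + soft - soft.detach()             # hard value forward, soft gradient backward
        # pruned Gaussians contribute neither opacity nor volume during rasterization
        return opacity * mask.unsqueeze(-1), scale * mask.unsqueeze(-1), soft.mean()

masker = GaussianMask(num_gaussians=100_000)
opacity, scale = torch.rand(100_000, 1), torch.rand(100_000, 3)
o, s, sparsity_reg = masker(opacity, scale)   # add a weighted sparsity_reg to the rendering loss
```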
2311.13655 Report GAN-Avatar: Controllable Personalized GAN-based Human Head Avatar Berna Kabadayi, Wojciech Zielonka, Bharat Lal Bhatnagar, Gerard Pons-Moll, Justus Thies Digital humans and, especially, 3D facial avatars have raised a lot of attention in the past years, as they are the backbone of several applications like immersive telepresence in AR or VR. Despite the progress, facial avatars reconstructed from commodity hardware are incomplete and miss out on parts of the side and back of the head, severely limiting the usability of the avatar. This limitation in prior work stems from their requirement of face tracking, which fails for profile and back views. To address this issue, we propose to learn person-specific animatable avatars from images without assuming to have access to precise facial expression tracking. At the core of our method, we leverage a 3D-aware generative model that is trained to reproduce the distribution of facial expressions from the training data. To train this appearance model, we only assume to have a collection of 2D images with the corresponding camera parameters. For controlling the model, we learn a mapping from 3DMM facial expression parameters to the latent space of the generative model. This mapping can be learned by sampling the latent space of the appearance model and reconstructing the facial parameters from a normalized frontal view, where facial expression estimation performs well. With this scheme, we decouple 3D appearance reconstruction and animation control to achieve high fidelity in image synthesis. In a series of experiments, we compare our proposed technique to state-of-the-art monocular methods and show superior quality while not requiring expression tracking of the training data. This paper introduces a novel method for generating animatable 3D human head avatars from images without requiring precise facial expression tracking during training. Existing methods heavily rely on accurate facial expression tracking, which often fails for profile or back views, limiting their ability to reconstruct complete head avatars. The method utilizes a 3D-aware generative model (EG3D) to learn person-specific appearance and geometry from images and camera parameters. It then employs a mapping network to map 3DMM facial expression parameters to the latent space of the generative model for animation control. The method achieves superior visual quality compared to state-of-the-art monocular avatar reconstruction techniques, particularly in challenging regions like teeth and hair. It enables the generation of 3D-consistent novel views, leading to the reconstruction of complete 360-degree head avatars. The approach demonstrates robustness to imperfect camera poses, outperforming baseline methods when trained on noisy data. The training process is computationally expensive, requiring several hours on high-end GPUs. The method is limited to the facial expressions present in the training data and cannot extrapolate to unseen expressions. 3d avatar reconstruction, generative adversarial networks, facial expression mapping, novel view synthesis, tracker-free appearance learning
2311.13620 Report The Challenges of Image Generation Models in Generating Multi-Component Images Tham Yik Foong, Shashank Kotyan, Po Yuan Mao, Danilo Vasconcellos Vargas Recent advances in text-to-image generators have led to substantial capabilities in image generation. However, the complexity of prompts acts as a bottleneck in the quality of images generated. A particular under-explored facet is the ability of generative models to create high-quality images comprising multiple components given as a prior. In this paper, we propose and validate a metric called Components Inclusion Score (CIS) to evaluate the extent to which a model can correctly generate multiple components. Our results reveal that the evaluated models struggle to incorporate all the visual elements from prompts with multiple components (8.53% drop in CIS per component for all evaluated models). We also identify a significant decline in the quality of the images and context awareness within an image as the number of components increases (15.91% decrease in Inception Score and 9.62% increase in Frechet Inception Distance). To remedy this issue, we fine-tuned Stable Diffusion V2 on a custom-created test dataset with multiple components, outperforming its vanilla counterpart. To conclude, these findings reveal a critical limitation in existing text-to-image generators, shedding light on the challenge of generating multiple components within a single image using a complex prompt. This paper introduces Components Inclusion Score (CIS), a novel metric to evaluate the ability of text-to-image generators to accurately incorporate multiple components from a prompt into a single image. Current text-to-image generators struggle to generate high-quality images with multiple components, limiting their ability to handle complex prompts. This work provides a way to quantify this limitation and analyze the factors contributing to it. The paper proposes the CIS metric, which uses a CLIP model to evaluate the presence of each component mentioned in the prompt within the generated image. Additionally, a new dataset MCID is created by combining images from ImageNet to train and evaluate models on multi-component image generation. Existing image generators show a significant drop in CIS as the number of components in the prompt increases, indicating their difficulty in handling multi-component generation. The quality of generated images, as measured by IS and FID, also deteriorates with an increase in the number of components. Fine-tuning Stable Diffusion on MCID leads to improved CIS, emphasizing the importance of data distribution with multi-component images. The accuracy of CIS is limited by the capability of the CLIP model used for component identification. The MCID dataset, while diverse, may not fully represent the complexity of real-world multi-component scenes with natural interactions. text-to-image generation, multi-component generation, evaluation metric, clip, stable diffusion
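A rough approximation of a Components Inclusion Score: for each component named in the prompt, ask CLIP whether the generated image contains it, and report the fraction judged present. The contrastive prompt pair, decision threshold, and file name below are assumptions, not the paper's exact protocol.

```python
import torch
import clip                      # the openai/CLIP package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def components_inclusion_score(image_path, components, threshold=0.5):
    """Fraction of prompted components judged present in the image (an approximation of CIS)."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0).to(device)
    img_f = model.encode_image(image).float()
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    hits = 0
    for comp in components:
        # contrast "contains X" against a generic negative for a per-component decision
        text = clip.tokenize([f"a photo containing a {comp}", f"a photo with no {comp}"]).to(device)
        txt_f = model.encode_text(text).float()
        txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
        probs = (100.0 * img_f @ txt_f.T).softmax(dim=-1)[0]
        hits += int(probs[0].item() > threshold)
    return hits / len(components)

# "generated.png" is a placeholder for an image synthesized from a multi-component prompt
print(components_inclusion_score("generated.png", ["dog", "umbrella", "park bench"]))
```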
2311.13617 Report Boosting3D: High-Fidelity Image-to-3D by Boosting 2D Diffusion Prior to 3D Prior with Progressive Learning Kai Yu, Jinlin Liu, Mengyang Feng, Miaomiao Cui, Xuansong Xie We present Boosting3D, a multi-stage single image-to-3D generation method that can robustly generate reasonable 3D objects in different data domains. The point of this work is to solve the view consistency problem in single image-guided 3D generation by modeling a reasonable geometric structure. For this purpose, we propose to utilize better 3D prior to training the NeRF. More specifically, we train an object-level LoRA for the target object using original image and the rendering output of NeRF. And then we train the LoRA and NeRF using a progressive training strategy. The LoRA and NeRF will boost each other while training. After the progressive training, the LoRA learns the 3D information of the generated object and eventually turns to an object-level 3D prior. In the final stage, we extract the mesh from the trained NeRF and use the trained LoRA to optimize the structure and appearance of the mesh. The experiments demonstrate the effectiveness of the proposed method. Boosting3D learns object-specific 3D prior which is beyond the ability of pre-trained diffusion priors and achieves state-of-the-art performance in the single image-to-3d generation task. Boosting3D, a multi-stage single image-to-3D generation method that robustly generates 3D objects in different data domains by modeling geometric structure. Addresses the view consistency problem in single image-guided 3D generation, which struggles with uncommon or asymmetrical objects. Uses a three-stage optimization process: coarse NeRF generation, fine NeRF refinement using a progressively trained object-level LoRA, and mesh refinement using the trained LoRA. Generates high-quality and stable 3D objects from single images. Learns object-specific 3D priors beyond pre-trained diffusion models. Achieves state-of-the-art performance in single image-to-3D generation for both real and synthetic images. High computational cost, requiring over an hour of training time. Future work will focus on optimizing speed using faster 3D representations. image-to-3d generation, 3d reconstruction, diffusion models, nerf, lora
2311.13608 Report Breathing Life Into Sketches Using Text-to-Video Priors Rinon Gal, Yael Vinker, Yuval Alaluf, Amit H. Bermano, Daniel Cohen-Or, Ariel Shamir, Gal Chechik A sketch is one of the most intuitive and versatile tools humans use to convey their ideas visually. An animated sketch opens another dimension to the expression of ideas and is widely used by designers for a variety of purposes. Animating sketches is a laborious process, requiring extensive experience and professional design skills. In this work, we present a method that automatically adds motion to a single-subject sketch (hence, "breathing life into it"), merely by providing a text prompt indicating the desired motion. The output is a short animation provided in vector representation, which can be easily edited. Our method does not require extensive training, but instead leverages the motion prior of a large pretrained text-to-video diffusion model using a score-distillation loss to guide the placement of strokes. To promote natural and smooth motion and to better preserve the sketch's appearance, we model the learned motion through two components. The first governs small local deformations and the second controls global affine transformations. Surprisingly, we find that even models that struggle to generate sketch videos on their own can still serve as a useful backbone for animating abstract representations. This paper introduces a method for automatically animating single-subject sketches using text prompts, leveraging the motion priors of pre-trained text-to-video diffusion models. Animating sketches is a laborious task that requires significant artistic expertise. This method simplifies the process, requiring only a static sketch and a text prompt, making animation accessible to a wider audience. The method uses a neural network trained with a score-distillation sampling loss to predict displacements for the control points of a vector-based sketch representation. It separates motion into local deformations and global affine transformations to ensure smooth and natural movement while preserving the original sketch's characteristics. The method effectively animates sketches across diverse domains and prompts, capturing complex movements like swaying, dancing, and swirling. It outperforms existing pixel-based image-to-video methods in preserving sketch fidelity and aligning motion with text prompts. User studies confirm that the method produces animations that are both consistent with the input sketch and aligned with the desired motion. The method is currently limited to single-subject sketches and may struggle with complex scenes or sketches with multiple objects. There is a trade-off between motion quality and preserving the sketch's appearance, requiring careful hyperparameter tuning. sketch animation, text-to-video generation, score-distillation sampling, vector graphics, motion priors
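The two-component motion model in this entry, a per-frame global affine transform plus small local offsets applied to the sketch's control points, could look roughly like the module below. The network sizes, the 0.05 local-deformation scale, and the identity-plus-residual affine parameterization are illustrative assumptions; in the paper these displacements are optimized with a score-distillation loss from a text-to-video model, which is omitted here.

```python
import torch
import torch.nn as nn

class SketchMotion(nn.Module):
    """Predict per-frame control-point positions as global affine motion + local deformation."""

    def __init__(self, num_frames, hidden=128):
        super().__init__()
        # small per-point, per-frame offsets conditioned on point position and normalized time
        self.local = nn.Sequential(nn.Linear(2 + 1, hidden), nn.ReLU(), nn.Linear(hidden, 2))
        self.global_params = nn.Parameter(torch.zeros(num_frames, 6))  # residual affine per frame
        self.num_frames = num_frames

    def forward(self, points):                       # points: (P, 2) control points of the sketch
        frames = []
        for f in range(self.num_frames):
            t = torch.full((points.shape[0], 1), f / max(self.num_frames - 1, 1))
            local = 0.05 * self.local(torch.cat([points, t], dim=-1))   # keep deformations small
            a, b, tx, c, d, ty = self.global_params[f]
            A = torch.stack([torch.stack([1 + a, b]), torch.stack([c, 1 + d])])  # identity + residual
            frames.append(points @ A.T + torch.stack([tx, ty]) + local)
        return torch.stack(frames)                   # (F, P, 2): animated control points

animated = SketchMotion(num_frames=24)(torch.rand(64, 2))
```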
2311.13601 Report Visual In-Context Prompting Feng Li, Qing Jiang, Hao Zhang, Tianhe Ren, Shilong Liu, Xueyan Zou, Huaizhe Xu, Hongyang Li, Chunyuan Li, Jianwei Yang, Lei Zhang, Jianfeng Gao In-context prompting in large language models (LLMs) has become a prevalent approach to improve zero-shot capabilities, but this idea is less explored in the vision domain. Existing visual prompting methods focus on referring segmentation to segment the most relevant object, falling short of addressing many generic vision tasks like open-set segmentation and detection. In this paper, we introduce a universal visual in-context prompting framework for both tasks. In particular, we build on top of an encoder-decoder architecture, and develop a versatile prompt encoder to support a variety of prompts like strokes, boxes, and points. We further enhance it to take an arbitrary number of reference image segments as the context. Our extensive explorations show that the proposed visual in-context prompting elicits extraordinary referring and generic segmentation capabilities to refer and detect, yielding competitive performance to close-set in-domain datasets and showing promising results on many open-set segmentation datasets. By joint training on COCO and SA-1B, our model achieves $57.7$ PQ on COCO and $23.2$ PQ on ADE20K. Code will be available at https://github.com/UX-Decoder/DINOv. This paper proposes DINOv, a novel visual in-context prompting framework for both referring and generic image segmentation, enabling open-set segmentation using only visual prompts. In-context prompting is powerful for LLMs but less explored in vision, particularly for generic tasks like open-set segmentation. Existing visual prompting methods mainly focus on referring segmentation. DINOv leverages an encoder-decoder architecture with a prompt encoder to handle various prompts (strokes, boxes, points). It utilizes reference image-prompt pairs to learn visual concepts and adapts prompts to target images. The model is trained jointly on COCO and SA-1B datasets for both referring and generic segmentation. DINOv achieves comparable performance to close-set models on in-domain datasets like COCO. It demonstrates promising generalization ability on open-set benchmarks like ADE20K and SegInW using only visual prompts. The framework effectively handles video object segmentation in a zero-shot manner by leveraging learned visual prompts from previous frames. The model's performance could be further improved by scaling up the semantically labeled data. Future work can explore incorporating text prompts for enhanced multi-modal understanding. visual prompting, in-context learning, open-set segmentation, referring segmentation, video object segmentation
2311.13600 Report ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs Viraj Shah, Nataniel Ruiz, Forrester Cole, Erika Lu, Svetlana Lazebnik, Yuanzhen Li, Varun Jampani Methods for finetuning generative models for concept-driven personalization generally achieve strong results for subject-driven or style-driven generation. Recently, low-rank adaptations (LoRA) have been proposed as a parameter-efficient way of achieving concept-driven personalization. While recent work explores the combination of separate LoRAs to achieve joint generation of learned styles and subjects, existing techniques do not reliably address the problem; they often compromise either subject fidelity or style fidelity. We propose ZipLoRA, a method to cheaply and effectively merge independently trained style and subject LoRAs in order to achieve generation of any user-provided subject in any user-provided style. Experiments on a wide range of subject and style combinations show that ZipLoRA can generate compelling results with meaningful improvements over baselines in subject and style fidelity while preserving the ability to recontextualize. Project page: https://ziplora.github.io Proposes ZipLoRA, a method for merging independently trained style and subject LoRAs to generate images of any subject in any style using diffusion models. Solves the open problem of generating a specific subject in a specific style with diffusion models, enabling greater control and personalization. Leverages the sparsity of LoRA updates and minimizes cosine similarity between merged columns to reduce signal interference while preserving individual LoRA capabilities. ZipLoRA generates high-quality stylized images superior to direct merging and joint training. It retains the ability to re-contextualize subjects and control the extent of stylization. Quantitative user studies and image/text alignment scores demonstrate the effectiveness of ZipLoRA over baselines. Relies on the style learning capability of SDXL, which needs further investigation. Image/text alignment metrics used for evaluation might not perfectly capture stylistic nuances. image stylization, diffusion models, lora, personalized image generation, stable diffusion
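Conceptually, ZipLoRA merges a subject LoRA and a style LoRA per layer with learnable column-wise coefficients while penalizing cosine similarity between the two scaled updates. The schematic below captures that structure with stand-in "preservation" targets computed on sample activations; the real method evaluates the diffusion model on subject and style prompts, and the names and hyperparameters here are assumptions.

```python
import torch
import torch.nn.functional as F

def merge_lora_layer(dW_subject, dW_style, x_subject, x_style, steps=200, lr=1e-2, lam=0.01):
    """Schematic per-layer merge of two LoRA weight updates (d_out, d_in) for the same base layer.

    x_subject, x_style: sample activations (N, d_in) standing in for subject / style inputs.
    Learns per-column coefficients so the merged update behaves like each LoRA on its own
    inputs, while a cosine penalty discourages interference between the two scaled updates.
    """
    d_in = dW_subject.shape[1]
    m1 = torch.ones(d_in, requires_grad=True)
    m2 = torch.ones(d_in, requires_grad=True)
    opt = torch.optim.Adam([m1, m2], lr=lr)
    for _ in range(steps):
        merged = dW_subject * m1 + dW_style * m2                 # column-wise scaling, then sum
        loss = (F.mse_loss(x_subject @ merged.T, x_subject @ dW_subject.T)
                + F.mse_loss(x_style @ merged.T, x_style @ dW_style.T)
                + lam * F.cosine_similarity(dW_subject * m1, dW_style * m2, dim=0).abs().sum())
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (dW_subject * m1 + dW_style * m2).detach()

merged = merge_lora_layer(torch.randn(64, 32) * 0.01, torch.randn(64, 32) * 0.01,
                          torch.randn(16, 32), torch.randn(16, 32))
```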
2311.13596 Report T-Rex: Counting by Visual Prompting Qing Jiang, Feng Li, Tianhe Ren, Shilong Liu, Zhaoyang Zeng, Kent Yu, Lei Zhang We introduce T-Rex, an interactive object counting model designed to first detect and then count any objects. We formulate object counting as an open-set object detection task with the integration of visual prompts. Users can specify the objects of interest by marking points or boxes on a reference image, and T-Rex then detects all objects with a similar pattern. Guided by the visual feedback from T-Rex, users can also interactively refine the counting results by prompting on missing or falsely-detected objects. T-Rex has achieved state-of-the-art performance on several class-agnostic counting benchmarks. To further exploit its potential, we established a new counting benchmark encompassing diverse scenarios and challenges. Both quantitative and qualitative results show that T-Rex possesses exceptional zero-shot counting capabilities. We also present various practical application scenarios for T-Rex, illustrating its potential in the realm of visual prompting. Introduces T-Rex, an interactive object counting model that uses visual prompts (boxes or points) to detect and count objects in an image, achieving state-of-the-art performance on several benchmarks. Object counting is important for various fields but existing methods have limitations like unintuitive visualization, closed-set detectors, or reliance on textual descriptions. T-Rex addresses these with visual prompts and interactive refinement. T-Rex utilizes an image encoder, prompt encoder, and box decoder. It supports positive-only, positive with negative, and cross-image prompt modes for accurate and user-refined counting. A new benchmark, CA-44, was created to test its capabilities. T-Rex outperforms state-of-the-art methods on FSC147 and FSCD-LVIS benchmarks. It demonstrates superior performance on CA-44, showcasing its zero-shot counting ability across diverse domains. T-Rex shows higher accuracy than GPT-4V in counting, suggesting its advantage in object perception for this task. T-Rex faces challenges in single-target scenes with dense clusters, dense multi-object scenes, and cross-image workflows. Future work will focus on improving its performance and robustness. object counting, visual prompting, interactive model, open-set detection, computer vision
2311.13570 Report WildFusion: Learning 3D-Aware Latent Diffusion Models in View Space Katja Schwarz, Seung Wook Kim, Jun Gao, Sanja Fidler, Andreas Geiger, Karsten Kreis Modern learning-based approaches to 3D-aware image synthesis achieve high photorealism and 3D-consistent viewpoint changes for the generated images. Existing approaches represent instances in a shared canonical space. However, for in-the-wild datasets a shared canonical system can be difficult to define or might not even exist. In this work, we instead model instances in view space, alleviating the need for posed images and learned camera distributions. We find that in this setting, existing GAN-based methods are prone to generating flat geometry and struggle with distribution coverage. We hence propose WildFusion, a new approach to 3D-aware image synthesis based on latent diffusion models (LDMs). We first train an autoencoder that infers a compressed latent representation, which additionally captures the images' underlying 3D structure and enables not only reconstruction but also novel view synthesis. To learn a faithful 3D representation, we leverage cues from monocular depth prediction. Then, we train a diffusion model in the 3D-aware latent space, thereby enabling synthesis of high-quality 3D-consistent image samples, outperforming recent state-of-the-art GAN-based methods. Importantly, our 3D-aware LDM is trained without any direct supervision from multiview images or 3D geometry and does not require posed images or learned pose or camera distributions. It directly learns a 3D representation without relying on canonical camera coordinates. This opens up promising research avenues for scalable 3D-aware image synthesis and 3D content creation from in-the-wild image data. See https://katjaschwarz.github.io/wildfusion for videos of our 3D results. Introduces WildFusion, a 3D-aware latent diffusion model for image synthesis that operates in *view space*, eliminating the need for posed images or pre-defined camera distributions. Existing 3D-aware generative models struggle with in-the-wild datasets lacking a shared canonical coordinate system and often suffer from limitations like mode collapse in GAN-based approaches. Two-stage approach: 1) Trains a 3D-aware autoencoder with adversarial supervision for novel views and incorporates monocular depth cues for improved geometry. 2) Fits a latent diffusion model on the compressed, 3D-aware latent space learned by the autoencoder. Outperforms state-of-the-art 3D-aware GANs on unposed image datasets, demonstrating superior image quality, geometry, and diversity. Achieves high-quality novel view synthesis directly from single images, surpassing GAN-based methods requiring inversion. Demonstrates promising applications in 3D-aware image manipulation, including semantic interpolation and generative resampling. While modeling in *view space* is advantageous, achieving sharp 3D geometry remains challenging. Current implementation is limited to a predefined range of viewpoints and cannot generate full 360° views. 3d-aware image synthesis, latent diffusion models, view space, unposed images, novel view synthesis
2311.13535 Report DiffusionMat: Alpha Matting as Sequential Refinement Learning Yangyang Xu, Shengfeng He, Wenqi Shao, Kwan-Yee K. Wong, Yu Qiao, Ping Luo In this paper, we introduce DiffusionMat, a novel image matting framework that employs a diffusion model for the transition from coarse to refined alpha mattes. Diverging from conventional methods that utilize trimaps merely as loose guidance for alpha matte prediction, our approach treats image matting as a sequential refinement learning process. This process begins with the addition of noise to trimaps and iteratively denoises them using a pre-trained diffusion model, which incrementally guides the prediction towards a clean alpha matte. The key innovation of our framework is a correction module that adjusts the output at each denoising step, ensuring that the final result is consistent with the input image's structures. We also introduce the Alpha Reliability Propagation, a novel technique designed to maximize the utility of available guidance by selectively enhancing the trimap regions with confident alpha information, thus simplifying the correction task. To train the correction module, we devise specialized loss functions that target the accuracy of the alpha matte's edges and the consistency of its opaque and transparent regions. We evaluate our model across several image matting benchmarks, and the results indicate that DiffusionMat consistently outperforms existing methods. Project page: https://cnnlstm.github.io/DiffusionMat Presents DiffusionMat, a novel image matting framework that uses a diffusion model to refine alpha mattes from coarse to refined, treating image matting as a sequential refinement learning process. Overcomes limitations of conventional methods that treat trimaps as static guidance, instead leveraging the iterative feedback of diffusion models to enhance the matting of unknown regions. Trains a diffusion model on alpha mattes, injects noise into the input trimap, then iteratively denoises it. Employs a correction module at each step to ensure consistency with the input image. Introduces Alpha Reliability Propagation to focus on refining ambiguous regions. Achieves state-of-the-art performance on portrait matting benchmarks (P3M-10K, Human-2K) and general image matting (Composition-1k). Exhibits robustness against inaccurate trimaps by leveraging the generative prior learned from extensive alpha matte datasets. Produces perceptually favorable alpha mattes with finer details compared to conventional methods. Computational efficiency is lower compared to single-pass methods. Future work includes exploring more efficient diffusion models to improve speed. image matting, diffusion models, sequential refinement learning, alpha matte prediction, trimap guidance
2311.13443 Report Guided Flows for Generative Modeling and Decision Making Qinqing Zheng, Matt Le, Neta Shaul, Yaron Lipman, Aditya Grover, Ricky T. Q. Chen Classifier-free guidance is a key component for enhancing the performance of conditional generative models across diverse tasks. While it has previously demonstrated remarkable improvements in sample quality, it has so far been employed exclusively for diffusion models. In this paper, we integrate classifier-free guidance into Flow Matching (FM) models, an alternative simulation-free approach that trains Continuous Normalizing Flows (CNFs) based on regressing vector fields. We explore the usage of Guided Flows for a variety of downstream applications. We show that Guided Flows significantly improves the sample quality in conditional image generation and zero-shot text-to-speech synthesis, boasting state-of-the-art performance. Notably, we are the first to apply flow models for plan generation in the offline reinforcement learning setting, showcasing a 10x speedup in computation compared to diffusion models while maintaining comparable performance. This paper integrates classifier-free guidance into Flow Matching (FM) models, enhancing their performance in conditional generation tasks. This integration is crucial as it allows FM models, a computationally efficient alternative to diffusion models, to leverage conditional information more effectively, leading to significant improvements in sample quality for various downstream applications. The authors introduce 'Guided Flows', an adaptation of classifier-free guidance for FM models. They modify velocity vector fields by combining unconditional and conditional velocity fields, weighted by a guidance parameter. Guided Flows significantly enhance sample quality in conditional image generation and zero-shot text-to-speech synthesis, achieving state-of-the-art performance. This paper demonstrates the first successful application of flow models for return-conditioned plan generation in offline reinforcement learning, achieving comparable performance to diffusion models but with a 10x speedup in computation. Guided Flows outperform unguided flows in generating coherent state sequences for locomotion tasks, highlighting the importance of guidance for planning. The theoretical justification for Guided Flows relies on an assumption that may not hold perfectly in practice, suggesting a need for further investigation. While replanning at every timestep ensures planning accuracy, exploring heuristics to reuse previously generated plans could further improve computational efficiency. flow matching, classifier-free guidance, generative modeling, offline reinforcement learning, plan generation
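A minimal sketch of the guidance idea described above for Guided Flows: at sampling time the unconditional and conditional velocity fields are blended with a guidance weight and the resulting ODE is integrated with a simple Euler solver. The `velocity_net` interface, the particular blending form, and the integrator are illustrative assumptions, not the authors' implementation.

```python
import torch

def guided_velocity(velocity_net, x, t, cond, w: float):
    """Blend unconditional and conditional velocities (classifier-free guidance).

    One common formulation: v_guided = (1 - w) * v(x, t) + w * v(x, t | cond),
    which reduces to the conditional field at w = 1 and extrapolates for w > 1.
    """
    v_uncond = velocity_net(x, t, cond=None)   # model trained with condition dropout
    v_cond = velocity_net(x, t, cond=cond)
    return (1.0 - w) * v_uncond + w * v_cond

@torch.no_grad()
def sample_guided_flow(velocity_net, x0, cond, w=2.0, steps=50):
    """Euler integration of the guided flow from t = 0 (noise) to t = 1 (data)."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + dt * guided_velocity(velocity_net, x, t, cond, w)
    return x
```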
2311.13435 Report PG-Video-LLaVA: Pixel Grounding Large Video-Language Models Shehan Munasinghe, Rusiru Thushara, Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, Mubarak Shah, Fahad Khan Extending image-based Large Multimodal Models (LMMs) to videos is challenging due to the inherent complexity of video data. The recent approaches extending image-based LMMs to videos either lack the grounding capabilities (e.g., VideoChat, Video-ChatGPT, Video-LLaMA) or do not utilize the audio-signals for better video understanding (e.g., Video-ChatGPT). Addressing these gaps, we propose PG-Video-LLaVA, the first LMM with pixel-level grounding capability, integrating audio cues by transcribing them into text to enrich video-context understanding. Our framework uses an off-the-shelf tracker and a novel grounding module, enabling it to spatially localize objects in videos following user instructions. We evaluate PG-Video-LLaVA using video-based generative and question-answering benchmarks and introduce new benchmarks specifically designed to measure prompt-based object grounding performance in videos. Further, we propose the use of Vicuna over GPT-3.5, as utilized in Video-ChatGPT, for video-based conversation benchmarking, ensuring reproducibility of results which is a concern with the proprietary nature of GPT-3.5. Our framework builds on SoTA image-based LLaVA model and extends its advantages to the video domain, delivering promising gains on video-based conversation and grounding tasks. Project Page: https://github.com/mbzuai-oryx/Video-LLaVA The paper introduces PG-Video-LLaVA, the first video-based Large Multimodal Model (LMM) capable of pixel-level grounding, integrating audio cues to enhance video understanding. Extending image-based LMMs to videos is challenging due to the complexity of video data. Existing video LMMs lack either grounding capabilities or the ability to utilize audio signals effectively. PG-Video-LLaVA leverages a CLIP-based visual encoder, audio transcription, and a novel grounding module for object localization. It's trained on a large video instruction dataset and evaluated on video-based generative and question-answering benchmarks. PG-Video-LLaVA outperforms existing video-based conversational models like Video-ChatGPT and Video-LLaMA in ungrounded dialogues. The model effectively localizes objects in videos based on user instructions, demonstrating superior spatial grounding capabilities. Incorporating audio transcripts significantly enhances the model's understanding of video content, leading to improved accuracy in tasks like question answering. The spatial grounding module's reliance on scene segmentation and object tracking can introduce errors in complex scenarios. Further research is needed to explore more sophisticated methods for integrating audio and visual information for a deeper understanding of video content. large multimodal models, video understanding, visual grounding, audio-visual integration, video question answering
2311.13398 Report Depth-Regularized Optimization for 3D Gaussian Splatting in Few-Shot Images Jaeyoung Chung, Jeongtaek Oh, Kyoung Mu Lee In this paper, we present a method to optimize Gaussian splatting with a limited number of images while avoiding overfitting. Representing a 3D scene by combining numerous Gaussian splats has yielded outstanding visual quality. However, it tends to overfit the training views when only a small number of images are available. To address this issue, we introduce a dense depth map as a geometry guide to mitigate overfitting. We obtained the depth map using a pre-trained monocular depth estimation model and aligning the scale and offset using sparse COLMAP feature points. The adjusted depth aids in the color-based optimization of 3D Gaussian splatting, mitigating floating artifacts, and ensuring adherence to geometric constraints. We verify the proposed method on the NeRF-LLFF dataset with varying numbers of few images. Our approach demonstrates robust geometry compared to the original method that relies solely on images. Project page: robot0321.github.io/DepthRegGS This paper proposes a novel method to optimize 3D Gaussian Splatting using a limited number of images by leveraging depth information to prevent overfitting. Reconstructing 3D scenes from a few images is crucial for practical applications but challenging due to limited geometric information, leading to overfitting in existing methods like 3D Gaussian Splatting. The method utilizes a pre-trained monocular depth estimation model to obtain dense depth maps, adjusts their scale and offset using sparse COLMAP feature points, and integrates the adjusted depth as a geometry guide during the 3D Gaussian splatting optimization process. Additionally, an early stopping strategy and a smoothness constraint on the depth map further enhance the optimization stability. The proposed depth-guided optimization significantly improves the performance of 3D Gaussian Splatting in few-shot scenarios, achieving better visual quality and geometric accuracy compared to the original method. The approach successfully mitigates overfitting issues and generates plausible 3D reconstructions even with a limited number of input images. Ablation studies confirm the effectiveness of each component, including depth adjustment, depth loss, smoothness constraint, and early stopping strategy. The performance heavily relies on the accuracy of the pre-trained monocular depth estimation model and its generalization ability to different scenes and domains. Reliance on COLMAP points for depth adjustment limits the applicability to scenes where COLMAP might struggle, such as textureless regions. Future work includes exploring alternative depth regularization methods. 3d gaussian splatting, few-shot learning, depth estimation, 3d reconstruction, novel view synthesis
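The scale-and-offset alignment step described above can be read as a simple least-squares fit between the monocular depth map and the sparse COLMAP depths at projected feature locations; the sketch below assumes that reading and uses hypothetical array layouts (pixel coordinates in `sparse_uv`, metric depths in `sparse_depth`), not the authors' code.

```python
import numpy as np

def align_depth_scale_offset(mono_depth, sparse_uv, sparse_depth):
    """Fit scale s and offset b so that s * mono_depth + b matches the sparse
    COLMAP depths at the projected feature pixels (closed-form least squares).

    mono_depth:   (H, W) monocular depth prediction.
    sparse_uv:    (N, 2) pixel coordinates of COLMAP features in this view.
    sparse_depth: (N,) depths of those features from the sparse reconstruction.
    """
    u, v = sparse_uv[:, 0].astype(int), sparse_uv[:, 1].astype(int)
    d_mono = mono_depth[v, u]                          # monocular depth at feature pixels
    A = np.stack([d_mono, np.ones_like(d_mono)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, sparse_depth, rcond=None)
    return s * mono_depth + b                          # adjusted dense depth map
```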
2311.13384 Report LucidDreamer: Domain-free Generation of 3D Gaussian Splatting Scenes Jaeyoung Chung, Suyoung Lee, Hyeongjin Nam, Jaerin Lee, Kyoung Mu Lee With the widespread usage of VR devices and contents, demand for 3D scene generation techniques has grown. Existing 3D scene generation models, however, limit the target scene to a specific domain, primarily due to their training strategies using 3D scan datasets that are far from the real world. To address this limitation, we propose LucidDreamer, a domain-free scene generation pipeline by fully leveraging the power of existing large-scale diffusion-based generative models. Our LucidDreamer has two alternate steps: Dreaming and Alignment. First, to generate multi-view consistent images from inputs, we set the point cloud as a geometrical guideline for each image generation. Specifically, we project a portion of the point cloud to the desired view and provide the projection as a guidance for inpainting using the generative model. The inpainted images are lifted to 3D space with estimated depth maps, composing new points. Second, to aggregate the new points into the 3D scene, we propose an aligning algorithm which harmoniously integrates the portions of newly generated 3D scenes. The finally obtained 3D scene serves as initial points for optimizing Gaussian splats. LucidDreamer produces Gaussian splats that are highly detailed compared to the previous 3D scene generation methods, with no constraint on domain of the target scene. Project page: https://luciddreamer-cvlab.github.io/ LucidDreamer, a domain-free 3D scene generation pipeline that leverages Stable Diffusion, depth estimation, and 3D Gaussian splatting to create diverse, high-quality scenes from various inputs (text, RGB, RGBD). Existing 3D scene generation methods are limited to specific domains due to training on restricted 3D scan datasets. LucidDreamer overcomes this by leveraging the power of pre-trained, large-scale image generation models for diverse and high-quality results. 1) Point Cloud Construction: Starting from an initial image/depth map, LucidDreamer iteratively expands the point cloud. It projects the existing points to a new camera view, inpaints the missing regions using Stable Diffusion, estimates depth, and lifts the inpainted pixels to 3D, aligning them for consistency. 2) Gaussian Splat Optimization: The final point cloud initializes a 3D Gaussian Splatting model, further optimized using reprojected images for a continuous, high-fidelity 3D scene representation. Generates high-quality, multi-view consistent 3D scenes from various input domains (realistic, anime, lego) and formats (text, RGB, RGBD). Outperforms existing methods like RGBD2 in terms of visual quality, resolution, and domain generalization. Ablation studies validate the importance of point cloud initialization and masked training for Gaussian Splat optimization. Reliance on multiple off-the-shelf models (Stable Diffusion, depth estimation) could lead to accumulated errors. Exploration of more efficient point cloud aggregation and alignment strategies for larger-scale scenes. 3d scene generation, diffusion models, gaussian splatting, multi-view consistency, domain generalization
2311.13231 Report Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Qimai Li, Weihan Shen, Xiaolong Zhu, Xiu Li Using reinforcement learning with human feedback (RLHF) has shown significant promise in fine-tuning diffusion models. Previous methods start by training a reward model that aligns with human preferences, then leverage RL techniques to fine-tune the underlying models. However, crafting an efficient reward model demands extensive datasets, optimal architecture, and manual hyperparameter tuning, making the process both time and cost-intensive. The direct preference optimization (DPO) method, effective in fine-tuning large language models, eliminates the necessity for a reward model. However, the extensive GPU memory requirement of the diffusion model's denoising process hinders the direct application of the DPO method. To address this issue, we introduce the Direct Preference for Denoising Diffusion Policy Optimization (D3PO) method to directly fine-tune diffusion models. The theoretical analysis demonstrates that although D3PO omits training a reward model, it effectively functions as the optimal reward model trained using human feedback data to guide the learning process. This approach requires no training of a reward model, proving to be more direct, cost-effective, and minimizing computational overhead. In experiments, our method uses the relative scale of objectives as a proxy for human preference, delivering comparable results to methods using ground-truth rewards. Moreover, D3PO demonstrates the ability to reduce image distortion rates and generate safer images, overcoming challenges lacking robust reward models. Our code is publicly available at https://github.com/yk7333/D3PO. Introduces D3PO, a method for directly fine-tuning diffusion models using human feedback without needing a separate reward model. Existing RLHF methods for fine-tuning diffusion models require resource-intensive reward model training, making them inefficient and costly. D3PO aims to address this by directly incorporating human preferences. D3PO reinterprets the diffusion model's denoising process as a multi-step MDP. By extending the DPO theory to this MDP framework, D3PO directly updates the diffusion model's policy based on human feedback at each denoising step. Achieves comparable performance to reward-model-based methods on quantitative objectives like image compressibility and aesthetic quality. Successfully reduces image distortions in generated hands and anime characters. Demonstrates the ability to enhance image safety and improve prompt-image alignment based on human feedback. Assumes that all state-action pairs within a preferred segment are better than those in a less preferred segment. Relies on human evaluation, which can be subjective and difficult to scale. diffusion models, reinforcement learning, human feedback, direct preference optimization, image generation
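The core of D3PO is a DPO-style objective applied to preferred/rejected denoising trajectories rather than a learned reward. The sketch below shows one step of such a loss given per-action log-probabilities under the current policy and a frozen reference diffusion model; it is a simplified reading of the idea, not the released code, and how the log-probabilities and segment preferences are obtained is left to the actual method.

```python
import torch
import torch.nn.functional as F

def d3po_step_loss(logp_win, logp_lose, logp_ref_win, logp_ref_lose, beta=0.1):
    """DPO-style preference loss on a (preferred, rejected) pair of denoising actions.

    logp_* are log-probabilities of the sampled denoising action under the current
    policy and the frozen reference model; no reward model is trained. Minimizing
    this pushes the policy to favor actions from the human-preferred trajectory.
    """
    ratio_win = logp_win - logp_ref_win        # implicit reward of the preferred action
    ratio_lose = logp_lose - logp_ref_lose     # implicit reward of the rejected action
    return -F.logsigmoid(beta * (ratio_win - ratio_lose)).mean()
```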
2311.13073 Report FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline Vladimir Arkhipkin, Zein Shaheen, Viacheslav Vasilev, Elizaveta Dakhova, Andrey Kuznetsov, Denis Dimitrov Multimedia generation approaches occupy a prominent place in artificial intelligence research. Text-to-image models achieved high-quality results over the last few years. However, video synthesis methods have only recently started to develop. This paper presents a new two-stage latent diffusion text-to-video generation architecture based on the text-to-image diffusion model. The first stage concerns keyframe synthesis to outline the storyline of a video, while the second is devoted to interpolation-frame generation to make the movements of the scene and objects smooth. We compare several temporal conditioning approaches for keyframe generation. The results show the advantage of using separate temporal blocks over temporal layers in terms of metrics reflecting video generation quality aspects and human preference. The design of our interpolation model significantly reduces computational costs compared to other masked frame interpolation approaches. Furthermore, we evaluate different configurations of the MoVQ-based video decoding scheme to improve consistency and achieve higher PSNR, SSIM, MSE, and LPIPS scores. Finally, we compare our pipeline with existing solutions and achieve top-2 scores overall and top-1 among open-source solutions: CLIPSIM = 0.2976 and FVD = 433.054. Project page: https://ai-forever.github.io/kandinsky-video/ This paper presents FusionFrames, a novel two-stage latent diffusion model for text-to-video generation, focusing on improving video quality, consistency, and smoothness. Video generation models are computationally expensive and require large datasets. This paper addresses these challenges by leveraging pre-trained text-to-image models and introducing efficient architectural designs. The model uses a pre-trained T2I model for keyframe generation and introduces separate temporal blocks for enhanced temporal consistency. A novel interpolation model generates intermediate frames, and a MoVQ-GAN based decoder with architectural variations is used for improved decoding. Separate temporal blocks outperform traditional mixed spatial-temporal layers in video quality metrics and human evaluation. The proposed interpolation architecture generates higher-quality interpolated frames with faster inference compared to masked frame interpolation. A MoVQ-GAN decoder with 3D convolutions and temporal attention yields the best decoding quality. Ambiguities in calculating metrics like FVD and IS make comparison with other studies challenging. Lack of open-source solutions for latent space interpolation limits direct comparison with existing interpolation methods. text-to-video generation, latent diffusion models, video frame interpolation, temporal consistency, movq-gan
2311.12981 Report SD-NAE: Generating Natural Adversarial Examples with Stable Diffusion Yueqian Lin, Jingyang Zhang, Yiran Chen, Hai Li Natural Adversarial Examples (NAEs), images arising naturally from the environment and capable of deceiving classifiers, are instrumental in robustly evaluating and identifying vulnerabilities in trained models. In this work, unlike prior works that passively collect NAEs from real images, we propose to actively synthesize NAEs using the state-of-the-art Stable Diffusion. Specifically, our method formulates a controlled optimization process, where we perturb the token embedding that corresponds to a specified class to generate NAEs. This generation process is guided by the gradient of loss from the target classifier, ensuring that the created image closely mimics the ground-truth class yet fools the classifier. Named SD-NAE (Stable Diffusion for Natural Adversarial Examples), our innovative method is effective in producing valid and useful NAEs, which is demonstrated through a meticulously designed experiment. Code is available at https://github.com/linyueqian/SD-NAE. This paper introduces SD-NAE, a novel method for actively synthesizing Natural Adversarial Examples (NAEs) using Stable Diffusion by optimizing the class token embedding in the condition embedding space. Robustly evaluating deep image classifiers is challenging, and NAEs are instrumental in identifying model vulnerabilities. Unlike prior passive NAE collection methods, SD-NAE offers greater flexibility and control over generating specific challenging examples. SD-NAE optimizes the class-related token embedding of a pre-trained Stable Diffusion model, guided by the loss gradient of a target image classifier. This process aims to induce misclassification while maintaining the image's ground-truth semantic meaning. SD-NAE achieves a 43.5% fooling rate against a ResNet-50 ImageNet classifier, demonstrating its effectiveness in generating NAEs. The generated NAEs exhibit variations in color, background, view angle, and style, highlighting SD-NAE's potential for evaluating model generalization. Compared to a GAN-based NAE generation method, SD-NAE shows superior performance in both fooling rate and the quality of generated images. SD-NAE can be computationally expensive, especially with a large number of optimization steps. The generated images might sometimes exhibit unnatural appearances, inheriting limitations from the underlying Stable Diffusion model. natural adversarial examples, stable diffusion, robustness evaluation, image classification, adversarial machine learning
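A hedged sketch of the controlled optimization loop described for SD-NAE: the class token embedding is perturbed by gradient steps that increase the target classifier's loss while a penalty keeps the embedding close to its original value. `generate_image` (a differentiable text-to-image call conditioned on the embedding) and `classifier` are placeholder callables for illustration, not actual Stable Diffusion APIs.

```python
import torch
import torch.nn.functional as F

def optimize_token_embedding(generate_image, classifier, class_embed,
                             target_label, steps=20, lr=1e-3, lam=1e-2):
    """Perturb the class token embedding so the generated image fools the
    classifier while still being synthesized from (approximately) the same class.

    generate_image(embed) -> image tensor (must be differentiable w.r.t. embed).
    classifier(img)       -> logits over ImageNet-style classes.
    """
    delta = torch.zeros_like(class_embed, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        img = generate_image(class_embed + delta)
        logits = classifier(img)
        # Maximize the classifier's loss on the true class; keep the perturbation small.
        loss = -F.cross_entropy(logits, target_label) + lam * delta.pow(2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return class_embed + delta.detach()
```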
2311.12908 Report Diffusion Model Alignment Using Direct Preference Optimization Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, Nikhil Naik Large language models (LLMs) are fine-tuned using human comparison data with Reinforcement Learning from Human Feedback (RLHF) methods to make them better aligned with users' preferences. In contrast to LLMs, human preference learning has not been widely explored in text-to-image diffusion models; the best existing approach is to fine-tune a pretrained model using carefully curated high quality images and captions to improve visual appeal and text alignment. We propose Diffusion-DPO, a method to align diffusion models to human preferences by directly optimizing on human comparison data. Diffusion-DPO is adapted from the recently developed Direct Preference Optimization (DPO), a simpler alternative to RLHF which directly optimizes a policy that best satisfies human preferences under a classification objective. We re-formulate DPO to account for a diffusion model notion of likelihood, utilizing the evidence lower bound to derive a differentiable objective. Using the Pick-a-Pic dataset of 851K crowdsourced pairwise preferences, we fine-tune the base model of the state-of-the-art Stable Diffusion XL (SDXL)-1.0 model with Diffusion-DPO. Our fine-tuned base model significantly outperforms both base SDXL-1.0 and the larger SDXL-1.0 model consisting of an additional refinement model in human evaluation, improving visual appeal and prompt alignment. We also develop a variant that uses AI feedback and has comparable performance to training on human preferences, opening the door for scaling of diffusion model alignment methods. The paper introduces Diffusion-DPO, a novel method for aligning text-to-image diffusion models with human preferences by directly optimizing the model on pairwise comparison data. Current text-to-image diffusion models lack a robust alignment stage with human preferences, limiting their ability to generate images that truly reflect user desires. The authors adapt the Direct Preference Optimization (DPO) method from LLMs to diffusion models, utilizing the evidence lower bound to derive a differentiable objective function. They fine-tune Stable Diffusion XL (SDXL)-1.0 model using Diffusion-DPO on the Pick-a-Pic dataset, a large dataset of crowdsourced pairwise preferences. Diffusion-DPO significantly outperforms the baseline SDXL and the larger SDXL-(base + refinement) model in human evaluation, achieving a 69% preference rate on the PartiPrompts dataset. Diffusion-DPO-tuned SDXL generates images with superior visual appeal, better prompt alignment, and finer details compared to the baseline models. The method can also effectively learn from AI feedback, demonstrating promising results for scaling diffusion model alignment using pretrained scoring networks. Potential biases in human preference data could be reflected in the trained model. Current work is an offline algorithm, requiring further research for online learning methods. text-to-image generation, diffusion models, human preference learning, dpo, stable diffusion xl
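Diffusion-DPO replaces exact likelihoods with an ELBO-based surrogate, which amounts to comparing the noise-prediction errors of the current and reference models on the preferred and rejected images at a shared timestep. The sketch below collapses the timestep-dependent weighting into a single constant `beta_t`; it is an approximation of the published objective, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(eps_theta_w, eps_ref_w, eps_theta_l, eps_ref_l,
                       eps_true_w, eps_true_l, beta_t=500.0):
    """Preference loss on noise predictions for a (preferred, rejected) image pair.

    eps_theta_* / eps_ref_*: predictions of the trained and frozen reference UNets
    on the noised preferred (w) and rejected (l) latents; eps_true_* is the noise
    actually added. beta_t folds the beta / T / SNR weighting into one constant.
    """
    err = lambda pred, target: (pred - target).pow(2).flatten(1).sum(dim=1)
    diff_w = err(eps_theta_w, eps_true_w) - err(eps_ref_w, eps_true_w)
    diff_l = err(eps_theta_l, eps_true_l) - err(eps_ref_l, eps_true_l)
    return -F.logsigmoid(-beta_t * (diff_w - diff_l)).mean()
```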
2311.12897 Report An Efficient 3D Gaussian Representation for Monocular/Multi-view Dynamic Scenes Kai Katsumata, Duc Minh Vo, Hideki Nakayama In novel view synthesis of scenes from multiple input views, 3D Gaussian splatting emerges as a viable alternative to existing radiance field approaches, delivering great visual quality and real-time rendering. While successful in static scenes, the present advancement of 3D Gaussian representation, however, faces challenges in dynamic scenes in terms of memory consumption and the need for numerous observations per time step, due to the onus of storing 3D Gaussian parameters per time step. In this study, we present an efficient 3D Gaussian representation tailored for dynamic scenes in which we define positions and rotations as functions of time while leaving other time-invariant properties of the static 3D Gaussian unchanged. Notably, our representation reduces memory usage, which is consistent regardless of the input sequence length. Additionally, it mitigates the risk of overfitting observed frames by accounting for temporal changes. The optimization of our Gaussian representation based on image and flow reconstruction results in a powerful framework for dynamic scene view synthesis in both monocular and multi-view cases. We obtain the highest rendering speed of 118 frames per second (FPS) at a resolution of 1352×1014 with a single GPU, showing the practical usability and effectiveness of our proposed method in dynamic scene rendering scenarios. This paper proposes an efficient dynamic 3D Gaussian representation for real-time novel view synthesis of dynamic scenes from monocular or multi-view videos. Existing methods for dynamic scene novel view synthesis either suffer from slow rendering speed (neural radiance fields) or high memory consumption in dynamic scenes (3D Gaussian splatting). The method represents 3D Gaussian parameters (position, rotation) as a function of time, allowing for compact representation of dynamic motion. It optimizes the Gaussian parameters by minimizing the reconstruction loss between rendered and target images, and further enhances temporal consistency using optical flow supervision. Achieves competitive visual quality with state-of-the-art neural rendering methods on D-NeRF, DyNeRF, and HyperNeRF datasets. Significantly faster rendering speed than previous high-quality methods, achieving real-time performance even at high resolutions. Demonstrates lower memory consumption compared to methods storing parameters per timestamp, especially beneficial for long sequences. The method assumes Gaussian existence throughout the scene, limiting its ability to model topological changes like fluid motion. The explicit representation sacrifices continuity and smoothness of neural rendering, leading to artifacts with inaccurate camera poses and lower generalization performance. novel view synthesis, dynamic scenes, 3d gaussian splatting, real-time rendering, optical flow
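One way to realize "positions and rotations as functions of time" is a small per-Gaussian coefficient tensor over a temporal basis; the polynomial basis below is an illustrative assumption and may differ from the parameterization actually used in the paper.

```python
import torch
import torch.nn as nn

class DynamicGaussians(nn.Module):
    """Per-Gaussian motion model: position and rotation vary with time, while
    scale and opacity stay time-invariant (colors omitted for brevity)."""

    def __init__(self, num_gaussians: int, degree: int = 3):
        super().__init__()
        self.pos_coeff = nn.Parameter(torch.zeros(num_gaussians, degree + 1, 3))
        self.rot_coeff = nn.Parameter(torch.zeros(num_gaussians, degree + 1, 4))
        self.rot_coeff.data[:, 0, 0] = 1.0        # identity quaternion at t = 0
        self.log_scale = nn.Parameter(torch.zeros(num_gaussians, 3))
        self.opacity_logit = nn.Parameter(torch.zeros(num_gaussians, 1))

    def forward(self, t: float):
        powers = torch.arange(self.pos_coeff.shape[1], device=self.pos_coeff.device)
        basis = torch.as_tensor(float(t), device=powers.device) ** powers  # [1, t, t^2, ...]
        pos = torch.einsum('k,nkc->nc', basis, self.pos_coeff)   # time-varying centers
        rot = torch.einsum('k,nkc->nc', basis, self.rot_coeff)
        rot = rot / rot.norm(dim=-1, keepdim=True)               # unit quaternions
        return pos, rot, self.log_scale.exp(), torch.sigmoid(self.opacity_logit)
```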
2311.12891 Report Text-Guided Texturing by Synchronized Multi-View Diffusion Yuxin Liu, Minshan Xie, Hanyuan Liu, Tien-Tsin Wong This paper introduces a novel approach to synthesize texture to dress up a given 3D object, given a text prompt. Based on the pretrained text-to-image (T2I) diffusion model, existing methods usually employ a project-and-inpaint approach, in which a view of the given object is first generated and warped to another view for inpainting. But it tends to generate inconsistent texture due to the asynchronous diffusion of multiple views. We believe such asynchronous diffusion and insufficient information sharing among views are the root causes of the inconsistent artifact. In this paper, we propose a synchronized multi-view diffusion approach that allows the diffusion processes from different views to reach a consensus of the generated content early in the process, and hence ensures the texture consistency. To synchronize the diffusion, we share the denoised content among different views in each denoising step, specifically blending the latent content in the texture domain from views with overlap. Our method demonstrates superior performance in generating consistent, seamless, highly detailed textures, comparing to state-of-the-art methods. This paper introduces Synchronized Multi-View Diffusion (MVD), a novel approach to generate consistent, seamless, and highly detailed textures on 3D objects from text prompts, leveraging pre-trained text-to-image diffusion models. Existing project-and-inpaint methods for text-guided 3D object texturing suffer from inconsistencies and artifacts due to the asynchronous nature of diffusion across multiple views. MVD synchronizes the diffusion process across multiple views by sharing denoised latent information in overlapping texture regions during each denoising step, enabling early consensus on texture structure and color distribution. It also leverages self-attention reuse for enhanced consistency. MVD generates consistent and seamless textures, effectively addressing the limitations of existing approaches. The method produces highly detailed textures, preserving fine-grained features. Quantitative evaluation demonstrates superior performance compared to state-of-the-art methods, achieving the best FID score. The method inherits the pre-trained model's bias towards common viewpoints, making it challenging to generate textures for less common views. Depth discontinuities can lead to imperfect boundaries in denoised views, potentially causing color bleeding during texture extraction. Future work could explore optimization-based extraction methods with perceptual losses or boundary masking. texture synthesis, text-guided synthesis, 3d object texturing, diffusion models, multi-view consistency
2311.12847 Report CopyScope: Model-level Copyright Infringement Quantification in the Diffusion Workflow Junlei Zhou, Jiashi Gao, Ziwei Wang, Xuetao Wei Web-based AI image generation has become an innovative art form that can generate novel artworks with the rapid development of the diffusion model. However, this new technique brings potential copyright infringement risks as it may incorporate the existing artworks without the owners' consent. Copyright infringement quantification is the primary and challenging step towards AI-generated image copyright traceability. Previous work only focused on data attribution from the training data perspective, which is unsuitable for tracing and quantifying copyright infringement in practice because of the following reasons: (1) the training datasets are not always available in public; (2) the model provider is the responsible party, not the image. Motivated by this, in this paper, we propose CopyScope, a new framework to quantify the infringement of AI-generated images from the model level. We first rigorously identify pivotal components within the AI image generation pipeline. Then, we propose to take advantage of Fréchet Inception Distance (FID) to effectively capture the image similarity that fits human perception naturally. We further propose the FID-based Shapley algorithm to evaluate the infringement contribution among models. Extensive experiments demonstrate that our work not only reveals the intricacies of infringement quantification but also effectively depicts the infringing models quantitatively, thus promoting accountability in AI image-generation tasks. Proposes CopyScope, a framework to quantify copyright infringement in AI-generated images at the model level, addressing limitations of data attribution methods. AI image generation tools raise copyright concerns as they may infringe on existing artworks, necessitating model-level infringement quantification for accountability. Identifies key infringement components in the diffusion workflow, uses Fréchet Inception Distance (FID) to measure image similarity, and employs a FID-based Shapley algorithm to evaluate model contributions. FID effectively captures image similarity aligning with human perception. FID-Shapley algorithm accurately quantifies infringement contributions of different models in the diffusion workflow. CopyScope provides a promising solution for copyright traceability and promotes legal AI-generated content use. Current work focuses on a single image, Mona Lisa, for evaluation. Future work will explore extending CopyScope to broader image datasets and real-world infringement scenarios. copyright infringement, ai image generation, diffusion models, accountability, fid
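The FID-based Shapley step can be illustrated with the textbook exact Shapley computation over the small set of pipeline components; `value_fn` stands in for an FID-derived infringement score of the images generated with a given subset of components active. The score convention and component naming are assumptions for illustration, not CopyScope's exact implementation.

```python
from itertools import combinations
from math import factorial

def shapley_values(components, value_fn):
    """Exact Shapley attribution over a small set of pipeline components.

    components: hashable identifiers of the models in the diffusion workflow.
    value_fn(subset) -> infringement score for images generated with that subset
    active (e.g., a score derived from FID against the copyrighted reference).
    """
    n = len(components)
    phi = {c: 0.0 for c in components}
    for c in components:
        others = [x for x in components if x != c]
        for k in range(n):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[c] += weight * (value_fn(set(subset) | {c}) - value_fn(set(subset)))
    return phi
```

With only a handful of components the exact enumeration is cheap; larger component sets would call for Monte Carlo sampling of permutations instead.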
2311.12793 Report ShareGPT4V: Improving Large Multi-Modal Models with Better Captions Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, Dahua Lin In the realm of large multi-modal models (LMMs), efficient modality alignment is crucial yet often constrained by the scarcity of high-quality image-text data. To address this bottleneck, we introduce the ShareGPT4V dataset, a pioneering large-scale resource featuring 1.2 million highly descriptive captions, which surpasses existing datasets in diversity and information content, covering world knowledge, object properties, spatial relationships, and aesthetic evaluations. Specifically, ShareGPT4V originates from a curated 100K high-quality captions collected from advanced GPT4-Vision and has been expanded to 1.2M with a superb caption model trained on this subset. ShareGPT4V first demonstrates its effectiveness for the Supervised Fine-Tuning (SFT) phase, by substituting an equivalent quantity of detailed captions in existing SFT datasets with a subset of our high-quality captions, significantly enhancing the LMMs like LLaVA-7B, LLaVA-1.5-13B, and Qwen-VL-Chat-7B on the MME and MMBench benchmarks, with respective gains of 222.8/22.0/22.3 and 2.7/1.3/1.5. We further incorporate ShareGPT4V data into both the pre-training and SFT phases, obtaining ShareGPT4V-7B, a superior LMM based on a simple architecture that has remarkable performance across a majority of the multi-modal benchmarks. This project is available at https://ShareGPT4V.github.io to serve as a pivotal resource for advancing the LMMs community. This paper introduces ShareGPT4V, a large-scale dataset containing 1.2 million image-text pairs with highly descriptive captions generated by GPT4-Vision, designed for improving multi-modal model training. Existing image-text datasets often rely on simplistic captions, hindering the ability of Large Multi-Modal Models (LMMs) to effectively align visual and textual information. ShareGPT4V addresses this by providing high-quality captions rich in details, knowledge, and relationships, enabling better modality alignment. The authors first collected 100K images from diverse sources and used carefully crafted prompts to generate detailed descriptions using GPT4-Vision. This data was then used to train a general caption model, Share-Captioner. Finally, they used Share-Captioner to generate captions for 1.2M images, creating the ShareGPT4V-PT dataset. They also developed ShareGPT4V-7B, a LMM trained using the dataset. Replacing existing SFT captions with those from ShareGPT4V significantly improves LLM performance across various benchmarks. ShareGPT4V-7B, a 7B parameter model, outperforms many state-of-the-art LMMs with larger sizes and training datasets on 11 multi-modal benchmarks. Ablation studies confirm the importance of high-quality captions and fine-tuning strategies for pre-training and fine-tuning LMMs. The ShareGPT4V-PT dataset currently uses images from existing public datasets; exploring new image sources could further enhance diversity. While ShareGPT4V-7B achieves impressive results, future work can investigate scaling the model size and exploring alternative architectures. multi-modal learning, image captioning, large language models, dataset, vision-language
2311.12775 Report SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering Antoine Guédon, Vincent Lepetit We propose a method to allow precise and extremely fast mesh extraction from 3D Gaussian Splatting. Gaussian Splatting has recently become very popular as it yields realistic rendering while being significantly faster to train than NeRFs. It is however challenging to extract a mesh from the millions of tiny 3D gaussians as these gaussians tend to be unorganized after optimization and no method has been proposed so far. Our first key contribution is a regularization term that encourages the gaussians to align well with the surface of the scene. We then introduce a method that exploits this alignment to extract a mesh from the Gaussians using Poisson reconstruction, which is fast, scalable, and preserves details, in contrast to the Marching Cubes algorithm usually applied to extract meshes from Neural SDFs. Finally, we introduce an optional refinement strategy that binds gaussians to the surface of the mesh, and jointly optimizes these Gaussians and the mesh through Gaussian splatting rendering. This enables easy editing, sculpting, rigging, animating, compositing and relighting of the Gaussians using traditional softwares by manipulating the mesh instead of the gaussians themselves. Retrieving such an editable mesh for realistic rendering is done within minutes with our method, compared to hours with the state-of-the-art methods on neural SDFs, while providing a better rendering quality. Our project page is the following: https://anttwo.github.io/sugar/ This paper introduces SuGaR, a method for fast and accurate mesh extraction from 3D Gaussian Splatting representations. Mesh-based scene representations are valuable for editing, sculpting, animation, and relighting in Computer Graphics, but extracting meshes from the unstructured point clouds of Gaussian Splatting has been challenging. SuGaR first encourages alignment of Gaussian Splats with the scene surface through a novel regularization term during optimization. Then, it efficiently samples points on a level set of the Gaussian density function and uses Poisson reconstruction to generate a mesh. Optionally, it refines the mesh and binds new Gaussians to it for improved rendering. SuGaR extracts detailed meshes from Gaussian Splatting representations within minutes on a single GPU. The method outperforms state-of-the-art mesh-based Novel View Synthesis techniques in terms of rendering quality. Binding refined Gaussians to the mesh allows for high-quality rendering and facilitates intuitive scene manipulation using traditional mesh editing tools. The rendering quality of SuGaR, while exceeding other mesh-based methods, is not yet on par with the best NeRF models or vanilla Gaussian Splatting in all cases. Future work could explore more sophisticated methods for distinguishing foreground and background points during mesh extraction. gaussian splatting, mesh extraction, novel view synthesis, 3d scene representation, computer graphics
2311.12631 Report GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning Jiaxi Lv, Yi Huang, Mingfu Yan, Jiancheng Huang, Jianzhuang Liu, Yifan Liu, Yafei Wen, Xiaoxin Chen, Shifeng Chen Recent advances in text-to-video generation have harnessed the power of diffusion models to create visually compelling content conditioned on text prompts. However, they usually encounter high computational costs and often struggle to produce videos with coherent physical motions. To tackle these issues, we propose GPT4Motion, a training-free framework that leverages the planning capability of large language models such as GPT, the physical simulation strength of Blender, and the excellent image generation ability of text-to-image diffusion models to enhance the quality of video synthesis. Specifically, GPT4Motion employs GPT-4 to generate a Blender script based on a user textual prompt, which commands Blender's built-in physics engine to craft fundamental scene components that encapsulate coherent physical motions across frames. Then these components are inputted into Stable Diffusion to generate a video aligned with the textual prompt. Experimental results on three basic physical motion scenarios, including rigid object drop and collision, cloth draping and swinging, and liquid flow, demonstrate that GPT4Motion can generate high-quality videos efficiently in maintaining motion coherency and entity consistency. GPT4Motion offers new insights in text-to-video research, enhancing its quality and broadening its horizon for further explorations. GPT4Motion, a training-free text-to-video generation framework that leverages GPT-4's planning ability to drive Blender simulations and guide Stable Diffusion for generating physically coherent videos. Current text-to-video models struggle to generate videos with coherent physical motions due to the lack of physical understanding. This work introduces a novel approach using LLMs and physics engine to address the challenge. GPT-4 generates Blender scripts from user prompts to simulate physical scenarios, producing edge and depth maps as conditions for Stable Diffusion. Cross-frame attention in SDXL enhances temporal consistency. GPT4Motion accurately controls physical properties like gravity, wind strength, and viscosity. Outperforms baselines in generating realistic physical motions with better motion smoothness and less flickering. User study confirms superior performance in physical accuracy, text-video alignment, and flicker reduction. Extending to more complex motions requiring refined LLM instructions is a challenge. Occasional flickering in generated videos needs further investigation. text-to-video generation, physical simulation, large language models, blender, stable diffusion
2311.12490 Report Hyb-NeRF: A Multiresolution Hybrid Encoding for Neural Radiance Fields Yifan Wang, Yi Gong, Yuan Zeng Recent advances in Neural radiance fields (NeRF) have enabled high-fidelity scene reconstruction for novel view synthesis. However, NeRF requires hundreds of network evaluations per pixel to approximate a volume rendering integral, making it slow to train. Caching NeRFs into explicit data structures can effectively enhance rendering speed but at the cost of higher memory usage. To address these issues, we present Hyb-NeRF, a novel neural radiance field with a multi-resolution hybrid encoding that achieves efficient neural modeling and fast rendering, which also allows for high-quality novel view synthesis. The key idea of Hyb-NeRF is to represent the scene using different encoding strategies from coarse-to-fine resolution levels. Hyb-NeRF exploits memory-efficient learnable positional features at coarse resolutions and the fast optimization speed and local details of hash-based feature grids at fine resolutions. In addition, to further boost performance, we embed cone tracing-based features in our learnable positional encoding that eliminates encoding ambiguity and reduces aliasing artifacts. Extensive experiments on both synthetic and real-world datasets show that Hyb-NeRF achieves faster rendering speed with better rendering quality and even a lower memory footprint in comparison to previous state-of-the-art methods. Presents Hyb-NeRF, a novel neural radiance field representation using multi-resolution hybrid encoding for memory-efficient and high-quality scene representation and fast rendering. Addresses limitations of existing NeRF methods that are either slow to train or memory-intensive, aiming to achieve both fast and high-quality novel view synthesis. Combines learnable positional features at coarse resolution levels with hash-based feature grids at fine resolution levels. Integrates cone tracing-based features in the positional encoding to reduce aliasing and improve accuracy. Employs shallow MLPs for efficient processing. Achieves faster rendering speed and better rendering quality compared to previous state-of-the-art methods. Demonstrates significantly lower memory footprint than previous voxel-based methods. Successfully reconstructs high-quality radiance fields in 9 minutes with the smallest model achieving better rendering quality in 4 minutes. Limited exploration of higher resolution levels due to memory constraints. Further investigation into the application of hybrid encoding in dynamic scene representation. neural radiance fields, novel view synthesis, multi-resolution encoding, hybrid encoding, learnable positional encoding
2311.12386 Report Point, Segment and Count: A Generalized Framework for Object Counting Zhizhong Huang, Mingliang Dai, Yi Zhang, Junping Zhang, Hongming Shan Class-agnostic object counting aims to count all objects in an image with respect to example boxes or class names, a.k.a. few-shot and zero-shot counting. In this paper, we propose a generalized framework for both few-shot and zero-shot object counting based on detection. Our framework combines the superior advantages of two foundation models without compromising their zero-shot capability: (i) SAM to segment all possible objects as mask proposals, and (ii) CLIP to classify proposals to obtain accurate object counts. However, this strategy meets the obstacles of efficiency overhead and the small crowded objects that cannot be localized and distinguished. To address these issues, our framework, termed PseCo, follows three steps: point, segment, and count. Specifically, we first propose a class-agnostic object localization to provide accurate yet minimal point prompts for SAM, which consequently not only reduces computation costs but also avoids missing small objects. Furthermore, we propose a generalized object classification that leverages CLIP image/text embeddings as the classifier, following a hierarchical knowledge distillation to obtain discriminative classifications among hierarchical mask proposals. Extensive experimental results on FSC-147, COCO, and LVIS demonstrate that PseCo achieves state-of-the-art performance in both few-shot/zero-shot object counting/detection. Code: https://github.com/Hzzone/PseCo This paper proposes a novel generalized framework named PseCo for few-shot and zero-shot object counting and detection by leveraging the strengths of SAM and CLIP. Existing class-agnostic object counting methods often rely on density maps, lacking interpretability and struggling with small object detection. This paper addresses these limitations by combining the power of SAM for segmentation and CLIP for classification. PseCo employs a three-step approach: 1) Class-agnostic object localization using a point decoder to identify potential object locations, 2) Segmentation with SAM using the identified points as prompts, 3) Classification of the segmented proposals using CLIP image/text embeddings with a hierarchical knowledge distillation strategy. PseCo achieves state-of-the-art performance on few-shot and zero-shot object counting on the FSC-147 dataset, outperforming both density-based and detection-based methods. The method demonstrates superior performance on object detection tasks compared to baselines, achieving significant improvements on FSC-147 and FSCD-LVIS datasets. Evaluation on large-scale datasets like COCO and LVIS shows that PseCo achieves substantial performance gains over existing open-vocabulary object detection methods like Detic. The method's reliance on the inference of SAM's mask decoder introduces computational overhead compared to traditional object detection methods. PseCo may face challenges in extremely crowded scenes or with inaccurate example images/text prompts, as highlighted in the failure cases. object counting, object detection, few-shot learning, zero-shot learning, sam, clip
2311.12342 Report LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis Peiang Zhao, Han Li, Ruiyang Jin, S. Kevin Zhou Recent text-to-image diffusion models have reached an unprecedented level in generating high-quality images. However, their exclusive reliance on textual prompts often falls short in precise control of image compositions. In this paper, we propose LoCo, a training-free approach for layout-to-image Synthesis that excels in producing high-quality images aligned with both textual prompts and layout instructions. Specifically, we introduce a Localized Attention Constraint (LAC), leveraging semantic affinity between pixels in self-attention maps to create precise representations of desired objects and effectively ensure the accurate placement of objects in designated regions. We further propose a Padding Token Constraint (PTC) to leverage the semantic information embedded in previously neglected padding tokens, improving the consistency between object appearance and layout instructions. LoCo seamlessly integrates into existing text-to-image and layout-to-image models, enhancing their performance in spatial control and addressing semantic failures observed in prior methods. Extensive experiments showcase the superiority of our approach, surpassing existing state-of-the-art training-free layout-to-image methods both qualitatively and quantitatively across multiple benchmarks. This paper introduces LoCo, a training-free approach for layout-to-image synthesis that leverages localized attention constraints and padding token information to generate high-quality images adhering to both textual and layout conditions. Existing text-to-image synthesis methods struggle with precise control of image compositions, making it challenging to accurately place objects in desired locations. LoCo addresses this limitation by providing accurate spatial control without requiring model training. LoCo utilizes two novel constraints: (1) Localized Attention Constraint (LAC) enhances cross-attention maps using self-attention to achieve precise object representation and alignment with layout instructions. (2) Padding Tokens Constraint (PTC) leverages semantic information in padding tokens to enhance consistency between object appearance and layout. LoCo outperforms state-of-the-art training-free layout-to-image methods on standard benchmarks (HRS-Bench, DrawBench) in terms of spatial accuracy and image quality. LoCo effectively handles both bounding box and semantic mask layout instructions, demonstrating its versatility. The method can be seamlessly integrated into fully-supervised layout-to-image models (e.g., GLIGEN) as a plug-and-play booster, enhancing their performance. The performance of LoCo depends on the choice of hyperparameters, requiring careful tuning for optimal results. The current implementation primarily focuses on single-image generation. Exploring extensions to video or sequential image synthesis is a potential area for future work. image synthesis, diffusion models, layout-to-image synthesis, spatial control, attention mechanisms
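LoCo's constraints operate on attention maps during denoising; the sketch below shows only a generic box-layout attention penalty of the kind used by training-free layout guidance (push each object token's cross-attention mass into its box). LoCo's actual Localized Attention Constraint additionally exploits self-attention affinities, and the Padding Token Constraint is not shown; the tensor shapes here are illustrative assumptions.

```python
import torch

def localized_attention_loss(attn_maps, boxes, token_ids):
    """Generic layout constraint: penalize cross-attention mass that falls
    outside each object token's target box.

    attn_maps: (num_text_tokens, H, W) cross-attention at one denoising step.
    boxes:     list of (x0, y0, x1, y1) in normalized [0, 1] coordinates.
    token_ids: index of the object token associated with each box.
    """
    _, H, W = attn_maps.shape
    loss = attn_maps.new_zeros(())
    for tok, (x0, y0, x1, y1) in zip(token_ids, boxes):
        mask = torch.zeros(H, W, device=attn_maps.device)
        mask[int(y0 * H):int(y1 * H), int(x0 * W):int(x1 * W)] = 1.0
        a = attn_maps[tok]
        inside = (a * mask).sum() / (a.sum() + 1e-8)   # fraction of attention in the box
        loss = loss + (1.0 - inside) ** 2
    return loss
```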
2311.12193 Report Disentangling Structure and Appearance in ViT Feature Space Narek Tumanyan, Omer Bar-Tal, Shir Amir, Shai Bagon, Tali Dekel We present a method for semantically transferring the visual appearance of one natural image to another. Specifically, our goal is to generate an image in which objects in a source structure image are "painted" with the visual appearance of their semantically related objects in a target appearance image. To integrate semantic information into our framework, our key idea is to leverage a pre-trained and fixed Vision Transformer (ViT) model. Specifically, we derive novel disentangled representations of structure and appearance extracted from deep ViT features. We then establish an objective function that splices the desired structure and appearance representations, interweaving them together in the space of ViT features. Based on our objective function, we propose two frameworks of semantic appearance transfer -- "Splice", which works by training a generator on a single and arbitrary pair of structure-appearance images, and "SpliceNet", a feed-forward real-time appearance transfer model trained on a dataset of images from a specific domain. Our frameworks do not involve adversarial training, nor do they require any additional input information such as semantic segmentation or correspondences. We demonstrate high-resolution results on a variety of in-the-wild image pairs, under significant variations in the number of objects, pose, and appearance. Code and supplementary material are available in our project page: splice-vit.github.io. This paper introduces Splice and SpliceNet, novel methods for semantically transferring the visual appearance of one image to another by leveraging disentangled representations of structure and appearance from pre-trained DINO-ViT features. Semantic appearance transfer enables generating an image where objects in a source image are “painted” with the appearance of semantically similar objects in a target image, facilitating realistic and meaningful visual transformations. The methods use disentangled structure representations (self-similarity of keys in the deepest attention module) and appearance representations (global [CLS] token) extracted from DINO-ViT. Splice trains a generator on a single input image pair, while SpliceNet uses a feed-forward model trained on a domain-specific dataset. Splice achieves high-quality semantic appearance transfer on diverse in-the-wild image pairs, outperforming baselines in user studies. SpliceNet enables real-time semantic appearance transfer within a specific domain, demonstrating superior performance compared to GAN-based methods. Both Splice and SpliceNet demonstrate the effectiveness of leveraging pre-trained ViT features for encoding and manipulating structure and appearance information. The performance of Splice and SpliceNet depends on the quality of semantic representations learned by DINO-ViT. SpliceNet is limited to a specific domain due to its domain-specific training dataset, while Splice requires training from scratch for each image pair. style transfer, vision transformers, appearance transfer, feature disentanglement, semantic image editing
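The structure representation described above (self-similarity of keys from DINO-ViT's deepest attention block) and the corresponding structure loss can be sketched as below; how the keys are extracted from the ViT and the appearance term on the [CLS] token are omitted, so treat this as a schematic rather than the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def key_self_similarity(keys):
    """Structure descriptor: cosine self-similarity of ViT keys.

    keys: (num_patches, dim) from the deepest attention block
    -> (num_patches, num_patches) self-similarity matrix.
    """
    k = F.normalize(keys, dim=-1)
    return k @ k.t()

def structure_loss(keys_source, keys_generated):
    """Match the self-similarity of the source structure image and the generated
    image (Frobenius distance), leaving appearance free to change; an appearance
    term on the global [CLS] token would be added separately."""
    return (key_self_similarity(keys_source)
            - key_self_similarity(keys_generated)).pow(2).mean()
```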
2311.12174 Report LABELMAKER: Automatic Semantic Label Generation from RGB-D Trajectories Silvan Weder, Hermann Blum, Francis Engelmann, Marc Pollefeys Semantic annotations are indispensable to train or evaluate perception models, yet very costly to acquire. This work introduces a fully automated 2D/3D labeling framework that, without any human intervention, can generate labels for RGB-D scans at equal (or better) level of accuracy than comparable manually annotated datasets such as ScanNet. Our approach is based on an ensemble of state-of-the-art segmentation models and 3D lifting through neural rendering. We demonstrate the effectiveness of our LabelMaker pipeline by generating significantly better labels for the ScanNet datasets and automatically labelling the previously unlabeled ARKitScenes dataset. Code and models are available at https://labelmaker.org This work presents LabelMaker, a fully automated 2D/3D labeling framework that leverages an ensemble of state-of-the-art segmentation models and 3D lifting through neural rendering to generate accurate labels for RGB-D scans without human intervention. Semantic annotations are crucial for training and evaluating perception models, but acquiring them is expensive and time-consuming. LabelMaker addresses this challenge by enabling the generation of high-quality labels at scale without human effort, facilitating the development and evaluation of perception models and potentially unlocking the potential of large unlabeled datasets like ARKitScenes. LabelMaker employs an ensemble of 2D and 3D segmentation models (InternImage, OVSeg, CMX, Mask3D), projects their predictions into a common label space, and applies a consensus voting mechanism to obtain 2D labels for each frame. These 2D predictions are then lifted into 3D using a neural radiance field, which helps to improve consistency and detail, enabling the generation of both 2D and 3D semantic labels. LabelMaker generates labels on par with or better than human-annotated datasets like ScanNet, as demonstrated by evaluations on ScanNet and Replica datasets. The method outperforms existing baselines, including human-annotated ScanNet labels and labels refined with SemanticNeRF, in both 2D and 3D semantic segmentation metrics. LabelMaker can be used to automatically label large unlabeled datasets, such as ARKitScenes, paving the way for utilizing these datasets in training and evaluating 3D perception models. LabelMaker is currently limited to a fixed set of classes, which could be addressed by incorporating language embeddings for more flexibility and ambiguity resolution. The 3D lifting component relies on SDFStudio, which has many hyperparameters, and further optimization could potentially improve results. semantic segmentation, 3d labeling, neural rendering, rgb-d, scannet
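A minimal sketch of the consensus step: once each model's prediction is mapped into the shared label space, labels can be fused by per-pixel voting with a minimum-agreement threshold. The voting rule and the `min_votes`/`ignore_id` parameters are illustrative assumptions; LabelMaker's actual consensus mechanism and the subsequent neural-rendering-based 3D lifting are more involved.

```python
import numpy as np

def consensus_vote(predictions, num_classes, min_votes=2, ignore_id=255):
    """Per-pixel majority vote over an ensemble of segmentation models whose
    outputs were already mapped into a shared label space.

    predictions: list of (H, W) integer label maps, one per model.
    Pixels where fewer than `min_votes` models agree are marked `ignore_id`.
    """
    stack = np.stack(predictions)                                   # (M, H, W)
    votes = np.zeros((num_classes,) + stack.shape[1:], dtype=np.int32)
    for pred in stack:
        # Increment the vote counter of class pred[i, j] at every pixel (i, j).
        np.add.at(votes, (pred, *np.indices(pred.shape)), 1)
    best = votes.argmax(axis=0)
    return np.where(votes.max(axis=0) >= min_votes, best, ignore_id)
```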
2311.12079 Report FreeKD: Knowledge Distillation via Semantic Frequency Prompt Yuan Zhang, Tao Huang, Jiaming Liu, Tao Jiang, Kuan Cheng, Shanghang Zhang Knowledge distillation (KD) has been applied to various tasks successfully, and mainstream methods typically boost the student model via spatial imitation losses. However, the consecutive downsampling applied in the spatial domain of the teacher model is a form of corruption, hindering the student from analyzing what specific information needs to be imitated and resulting in accuracy degradation. To better understand the underlying pattern of corrupted feature maps, we shift our attention to the frequency domain. During frequency distillation, we encounter a new challenge: the low-frequency bands convey general but minimal context, while the high-frequency bands are more informative but also introduce noise. Not every pixel within the frequency bands contributes equally to the performance. To address the above problem: (1) We propose the Frequency Prompt, plugged into the teacher model, which absorbs the semantic frequency context during finetuning. (2) During the distillation period, a pixel-wise frequency mask is generated via the Frequency Prompt to localize the pixels of interest (PoIs) in various frequency bands. Additionally, we employ a position-aware relational frequency loss for dense prediction tasks, delivering a high-order spatial enhancement to the student model. We dub our frequency knowledge distillation method FreeKD, which determines the optimal localization and extent for the frequency distillation. Extensive experiments demonstrate that FreeKD not only outperforms spatial-based distillation methods consistently on dense prediction tasks (e.g., FreeKD brings 3.8 AP gains for RepPoints-R50 on COCO2017 and 4.55 mIoU gains for PSPNet-R18 on Cityscapes), but also confers greater robustness on the student. Notably, we also validate the generalization of our approach on large-scale vision models (e.g., DINO and SAM). This paper introduces FreeKD, a novel knowledge distillation method that operates in the frequency domain for dense prediction tasks. Existing spatial-based distillation methods suffer from downsampling corruption, hindering students from effectively learning valuable information. FreeKD addresses this by distilling knowledge from frequency bands. FreeKD utilizes the Discrete Wavelet Transform for frequency band decomposition. It incorporates a semantic Frequency Prompt to identify crucial pixels of interest in frequency bands and a position-aware relational frequency loss for improved spatial understanding. FreeKD consistently outperforms state-of-the-art spatial distillation methods on object detection (COCO) and semantic segmentation (Cityscapes). Students trained with FreeKD exhibit enhanced robustness and domain generalization capabilities, validated on COCO-C. The method's efficacy extends to large-scale vision models like DINO and SAM, demonstrating its generality. The study primarily focuses on dense prediction tasks; exploring other vision tasks could be beneficial. Future work could investigate different interaction methods between the Frequency Prompt and frequency bands. knowledge distillation, frequency domain, dense prediction, frequency prompt, robustness
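The core mechanics can be sketched as a one-level wavelet decomposition of teacher and student feature maps followed by a masked L2 loss per band. In FreeKD the mask comes from the learned Frequency Prompt and the loss is relational; below the mask is simply an input and the loss is plain per-pixel L2, so this is only an illustration under those assumptions:

```python
import torch

def haar_dwt2(x: torch.Tensor):
    """One-level 2-D Haar DWT of a feature map x: [B, C, H, W] (H, W even).
    Returns the low-frequency band LL and high-frequency bands (LH, HL, HH)."""
    a = x[..., 0::2, 0::2]
    b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]
    d = x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, (lh, hl, hh)

def masked_frequency_distill(f_student, f_teacher, masks):
    """L2 distillation on frequency bands, weighted by per-pixel masks.
    masks: dict with one [B, 1, H/2, W/2] mask per band, e.g. produced by a
    learned Frequency Prompt (here it is simply given)."""
    s_ll, s_high = haar_dwt2(f_student)
    t_ll, t_high = haar_dwt2(f_teacher)
    loss = (masks["ll"] * (s_ll - t_ll) ** 2).mean()
    for name, s_b, t_b in zip(("lh", "hl", "hh"), s_high, t_high):
        loss = loss + (masks[name] * (s_b - t_b) ** 2).mean()
    return loss

# Toy check on random feature maps.
fs, ft = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
m = {k: torch.rand(2, 1, 16, 16) for k in ("ll", "lh", "hl", "hh")}
print(masked_frequency_distill(fs, ft, m))
```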
2311.12075 Report BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning Siyuan Liang, Mingli Zhu, Aishan Liu, Baoyuan Wu, Xiaochun Cao, Ee-Chien Chang Studying backdoor attacks is valuable for model copyright protection and enhancing defenses. While existing backdoor attacks have successfully infected multimodal contrastive learning models such as CLIP, they can be easily countered by specialized backdoor defenses for MCL models. This paper reveals the threat that, in this practical scenario, backdoor attacks can remain effective even after defenses are applied, and introduces the BadCLIP attack, which is resistant to backdoor detection and model fine-tuning defenses. To achieve this, we draw motivation from the perspective of the Bayesian rule and propose a dual-embedding guided framework for backdoor attacks. Specifically, we ensure that visual trigger patterns approximate the textual target semantics in the embedding space, making it challenging to detect the subtle parameter variations induced by backdoor learning on such natural trigger patterns. Additionally, we optimize the visual trigger patterns to align the poisoned samples with target vision features in order to hinder backdoor unlearning through clean fine-tuning. Extensive experiments demonstrate that our attack significantly outperforms state-of-the-art baselines (+45.3% ASR) in the presence of SoTA backdoor defenses, rendering these mitigation and detection strategies virtually ineffective. Furthermore, our approach effectively attacks some more rigorous scenarios like downstream tasks. We believe that this paper raises awareness regarding the potential threats associated with the practical application of multimodal contrastive learning and encourages the development of more robust defense mechanisms. This paper introduces BadCLIP, a novel backdoor attack framework for Multimodal Contrastive Learning (MCL) models like CLIP, which demonstrates resistance against existing backdoor detection and mitigation techniques. The research highlights a significant threat in the practical application of MCL: even with defense mechanisms like backdoor detection and fine-tuning, backdoor attacks can remain effective, potentially compromising the reliability of pre-trained MCL models. Inspired by the Bayesian rule, the authors propose a dual-embedding guided framework. This framework optimizes visual trigger patterns to achieve two key goals: 1) minimizing parameter deviations from the clean model to evade detection and 2) aligning poisoned samples with target vision features to resist unlearning during clean fine-tuning. BadCLIP significantly outperforms state-of-the-art backdoor attacks by +45.3% ASR against fine-tuning defenses. The attack successfully evades detection by DECREE, achieving a high PL1-norm score (0.082), indicating the difficulty of detecting the implanted backdoor. BadCLIP maintains high effectiveness (87.21% ASR) even when defenders fine-tune the poisoned model with clean data from a different domain. The paper primarily focuses on image classification tasks and acknowledges the need to investigate backdoor attacks on more complex tasks built upon MCL. The authors highlight the need for developing more robust and advanced backdoor detection and mitigation methods specifically designed for MCL models to counter the threats posed by attacks like BadCLIP. backdoor attack, multimodal contrastive learning, clip, backdoor detection, fine-tuning defense
2311.12066 Report EditShield: Protecting Unauthorized Image Editing by Instruction-guided Diffusion Models Ruoxi Chen, Haibo Jin, Jinyin Chen, Lichao Sun Text-to-image diffusion models have emerged as a transformative tool for producing creative content in image synthesis. Building on the impressive generation abilities of these models, instruction-guided diffusion models can edit images given simple instructions and input images. While they empower users to obtain their desired edited images with ease, they have raised concerns about unauthorized image manipulation. Prior research has delved into the unauthorized use of personalized diffusion models; however, this problem for instruction-guided diffusion models remains largely unexplored. In this paper, we first propose EditShield, a protection method against unauthorized modifications by such models. Specifically, EditShield works by adding imperceptible perturbations that shift the latent representation used in the diffusion process, forcing models to generate unrealistic images with mismatched subjects. Our extensive experiments demonstrate EditShield's effectiveness on synthetic and real-world datasets. Moreover, EditShield maintains robustness against various editing types and synonymous instruction phrases. This paper proposes EditShield, a method to protect images from unauthorized editing using instruction-guided diffusion models. Instruction-guided diffusion models, while powerful for image editing, pose risks of unauthorized manipulation and misuse, necessitating protective measures. EditShield crafts imperceptible perturbations that disrupt the latent representation of images, leading to unrealistic outputs after editing. EditShield effectively protects against unauthorized editing, as demonstrated by quantitative metrics and qualitative results. The method exhibits robustness against various editing types and synonymous instruction phrases. EditShield remains partially effective even against potential countermeasures like spatial smoothing and JPEG compression. The effectiveness of EditShield might be reduced by specific countermeasures designed to mitigate the added perturbations. Future work includes exploring more sophisticated countermeasures and defenses for a stronger protection mechanism. image protection, diffusion models, image editing, unauthorized manipulation, adversarial perturbations
2311.12063 Report DatasetNeRF: Efficient 3D-aware Data Factory with Generative Radiance Fields Yu Chi, Fangneng Zhan, Sibo Wu, Christian Theobalt, Adam Kortylewski Progress in 3D computer vision tasks demands a huge amount of data, yet annotating multi-view images with 3D-consistent annotations, or point clouds with part segmentation is both time-consuming and challenging. This paper introduces DatasetNeRF, a novel approach capable of generating infinite, high-quality 3D-consistent 2D annotations alongside 3D point cloud segmentations, while utilizing minimal 2D human-labeled annotations. Specifically, we leverage the strong semantic prior within a 3D generative model to train a semantic decoder, requiring only a handful of fine-grained labeled samples. Once trained, the decoder efficiently generalizes across the latent space, enabling the generation of infinite data. The generated data is applicable across various computer vision tasks, including video segmentation and 3D point cloud segmentation. Our approach not only surpasses baseline models in segmentation quality, achieving superior 3D consistency and segmentation precision on individual images, but also demonstrates versatility by being applicable to both articulated and non-articulated generative models. Furthermore, we explore applications stemming from our approach, such as 3D-aware semantic editing and 3D inversion. DatasetNeRF is a novel framework that leverages pre-trained 3D GANs to generate infinite, high-quality, 3D-consistent 2D annotations and 3D point cloud segmentations, requiring minimal 2D human-labeled annotations. Annotating multi-view images or point clouds with 3D-consistent labels is time-consuming and challenging, hindering progress in 3D computer vision tasks that require large amounts of data. The method trains a semantic segmentation branch on a pre-trained 3D GAN, enhancing the feature tri-plane for semantic volumetric rendering. A depth prior from the 3D GAN backbone ensures 3D consistency and enables back-projection of 2D segmentations to 3D point cloud segmentations. DatasetNeRF surpasses baseline models in segmentation quality, achieving superior 3D consistency and segmentation precision. The framework demonstrates versatility by being applicable to both articulated and non-articulated generative models. DatasetNeRF enables applications such as 3D-aware semantic editing and 3D inversion. The performance improvement plateaus with increasing training samples beyond a certain point. The current method focuses on generating annotations for specific object categories and could be extended to broader and more complex scenes. 3d computer vision, semantic segmentation, point cloud segmentation, generative adversarial networks, dataset generation
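DatasetNeRF obtains its 3D point cloud segmentations by back-projecting 2D semantic predictions with the depth prior from the 3D GAN. The sketch below is the generic pinhole unprojection this relies on, written with numpy and assumed intrinsics/pose inputs; it is not the paper's code:

```python
import numpy as np

def backproject_labels(depth, labels, K, cam_to_world):
    """Lift a 2-D label map into a labeled 3-D point cloud.

    depth:        [H, W] per-pixel depth (same camera as `labels`).
    labels:       [H, W] integer semantic labels.
    K:            [3, 3] pinhole intrinsics.
    cam_to_world: [4, 4] camera-to-world transform.
    Returns points [N, 3] and their labels [N].
    """
    H, W = depth.shape
    v, u = np.indices((H, W))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T            # camera-space directions (z = 1)
    pts_cam = rays * depth.reshape(-1, 1)      # scale by depth
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    pts_world = (pts_h @ cam_to_world.T)[:, :3]
    valid = depth.reshape(-1) > 0              # drop pixels with no depth
    return pts_world[valid], labels.reshape(-1)[valid]

# Toy usage with identity intrinsics and pose.
d = np.full((2, 2), 2.0)
lab = np.arange(4).reshape(2, 2)
pts, lab3d = backproject_labels(d, lab, np.eye(3), np.eye(4))
print(pts.shape, lab3d)
```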
2311.12024 Report PF-LRM: Pose-Free Large Reconstruction Model for Joint Pose and Shape Prediction Peng Wang, Hao Tan, Sai Bi, Yinghao Xu, Fujun Luan, Kalyan Sunkavalli, Wenping Wang, Zexiang Xu, Kai Zhang We propose a Pose-Free Large Reconstruction Model (PF-LRM) for reconstructing a 3D object from a few unposed images even with little visual overlap, while simultaneously estimating the relative camera poses in ~1.3 seconds on a single A100 GPU. PF-LRM is a highly scalable method utilizing the self-attention blocks to exchange information between 3D object tokens and 2D image tokens; we predict a coarse point cloud for each view, and then use a differentiable Perspective-n-Point (PnP) solver to obtain camera poses. When trained on a huge amount of multi-view posed data of ~1M objects, PF-LRM shows strong cross-dataset generalization ability, and outperforms baseline methods by a large margin in terms of pose prediction accuracy and 3D reconstruction quality on various unseen evaluation datasets. We also demonstrate our model's applicability in downstream text/image-to-3D task with fast feed-forward inference. Our project website is at: https://totoro97.github.io/pf-lrm . This paper proposes PF-LRM, a method for reconstructing a 3D object from a few unposed images while simultaneously estimating relative camera poses. Many real-world scenarios involve sparse image capture with little overlap, making traditional Structure-from-Motion methods unreliable. PF-LRM addresses this by jointly learning camera poses and 3D shapes. The method utilizes a single-stream transformer model processing image and 3D object tokens. It predicts a coarse point cloud for each view, enabling camera pose estimation via a differentiable Perspective-n-Point solver. PF-LRM achieves state-of-the-art pose estimation accuracy on unseen datasets like OmniObject3D, GSO, and ABO. It demonstrates strong cross-dataset generalization ability, outperforming baselines in novel view synthesis quality. The method has potential applications in downstream tasks like text/image-to-3D generation. Limitations include ignoring background information for pose estimation and not modeling view-dependent effects. Future work involves incorporating background cues, handling view-dependent appearance, and increasing reconstruction resolution. 3d reconstruction, pose estimation, transformer, nerf, sparse views
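PF-LRM recovers each camera from the coarse point cloud it predicts for that view via a differentiable PnP solver. To illustrate the underlying geometry only, the sketch below uses OpenCV's (non-differentiable) solver with assumed intrinsics; the paper's solver is a differentiable layer inside the network:

```python
import cv2
import numpy as np

def pose_from_predicted_points(points_3d, points_2d, K):
    """Recover a camera pose from predicted 3-D points and their 2-D pixel
    locations for one view. cv2.solvePnP is only a stand-in for the paper's
    differentiable PnP layer."""
    ok, rvec, tvec = cv2.solvePnP(
        points_3d.astype(np.float64),
        points_2d.astype(np.float64),
        K.astype(np.float64),
        distCoeffs=None,
        flags=cv2.SOLVEPNP_EPNP,
    )
    R, _ = cv2.Rodrigues(rvec)      # world-to-camera rotation
    return ok, R, tvec

# Toy example: project random points with an identity pose, then recover it.
K = np.array([[500.0, 0, 128], [0, 500.0, 128], [0, 0, 1]])
pts3d = np.random.uniform(-1, 1, (50, 3)) + np.array([0, 0, 4.0])
proj = pts3d @ K.T
pts2d = proj[:, :2] / proj[:, 2:3]          # identity rotation, zero translation
ok, R, t = pose_from_predicted_points(pts3d, pts2d, K)
print(ok, np.round(t.ravel(), 3))           # t should be close to zero
```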
2311.11700 Report GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting Chi Yan, Delin Qu, Dan Xu, Bin Zhao, Zhigang Wang, Dong Wang, Xuelong Li In this paper, we introduce GS-SLAM, which is the first to utilize a 3D Gaussian representation in a Simultaneous Localization and Mapping (SLAM) system. It facilitates a better balance between efficiency and accuracy. Compared to recent SLAM methods employing neural implicit representations, our method utilizes a real-time differentiable splatting rendering pipeline that offers a significant speedup in map optimization and RGB-D rendering. Specifically, we propose an adaptive expansion strategy that adds new or deletes noisy 3D Gaussians in order to efficiently reconstruct newly observed scene geometry and improve the mapping of previously observed areas. This strategy is essential for extending the 3D Gaussian representation to reconstruct whole scenes, rather than the static objects synthesized by existing methods. Moreover, in the pose tracking process, an effective coarse-to-fine technique is designed to select reliable 3D Gaussian representations to optimize the camera pose, reducing runtime and yielding robust estimation. Our method achieves competitive performance compared with existing state-of-the-art real-time methods on the Replica and TUM-RGBD datasets. Project page: https://gs-slam.github.io/. GS-SLAM, a novel dense visual SLAM method that leverages 3D Gaussian Splatting for efficient and accurate scene reconstruction and camera pose estimation. Existing SLAM methods struggle to balance efficiency and accuracy, particularly in generating detailed dense maps. GS-SLAM addresses this by utilizing the speed of splatting rendering and 3D Gaussian representations. GS-SLAM optimizes camera tracking and mapping with a differentiable RGB-D rendering approach using 3D Gaussians and splatting. It employs an adaptive expansion strategy to manage 3D Gaussian elements and a coarse-to-fine approach for pose estimation. Achieves state-of-the-art performance in dense neural RGB-D SLAM on Replica and TUM-RGBD datasets. Exhibits superior rendering performance, achieving up to 100x faster speeds than previous methods. Effectively balances efficiency and accuracy for real-time tracking, mapping, and rendering. Reliance on high-quality depth data may limit performance in certain conditions. High memory requirements for large scenes, suggesting future work on optimization via techniques like quantization or clustering. slam, 3d gaussian splatting, dense mapping, camera pose estimation, real-time rendering
2311.11697 Report Cut-and-Paste: Subject-Driven Video Editing with Attention Control Zhichao Zuo, Zhao Zhang, Yan Luo, Yang Zhao, Haijun Zhang, Yi Yang, Meng Wang This paper presents a novel framework termed Cut-and-Paste for real-world semantic video editing under the guidance of a text prompt and an additional reference image. While text-driven video editing has demonstrated a remarkable ability to generate highly diverse videos following given text prompts, fine-grained semantic edits are hard to control with a plain textual prompt alone in terms of object details and the edited region, and cumbersome, lengthy text descriptions are usually needed for the task. We therefore investigate subject-driven video editing for more precise control of the edited regions, background preservation, and fine-grained semantic generation. We achieve this goal by introducing a reference image as a supplementary input to text-driven video editing, which spares the user from devising a cumbersome text prompt describing the detailed appearance of the object. To limit the editing area, we adapt a cross-attention control method from image editing and extend it to video editing by fusing the attention maps of adjacent frames, which strikes a balance between preserving the video background and maintaining spatio-temporal consistency. Compared with current methods, the whole process of our method resembles "cutting" the source object to be edited and then "pasting" the target object provided by the reference image. We demonstrate that our method performs favorably over prior art for video editing under the guidance of a text prompt and an extra reference image, as measured by both quantitative and subjective evaluations. This paper proposes Cut-and-Paste, a novel subject-driven video editing framework that uses both text prompts and reference images for fine-grained control over semantic video editing. Existing text-driven video editing methods lack precise control over semantic edits and struggle to preserve the original video's background and temporal consistency. This paper aims to solve this by leveraging the semantic information of a reference image in addition to the text prompts. The proposed method combines a pre-trained text-to-image Latent Diffusion Model (LDM) with a multimodal encoder (BLIP-2) to fuse text prompts and reference image representations. It employs an attention control mechanism with adjacent frames to maintain spatio-temporal consistency. Cut-and-Paste demonstrates superior performance over state-of-the-art text-driven video editing methods in terms of fine-grained control, background preservation, and spatio-temporal consistency. Quantitative evaluations using CLIP Score and LPIPS show that Cut-and-Paste achieves higher text-image similarity and lower deviation from the original video frames. A user study confirms that users strongly prefer Cut-and-Paste for both text-video alignment and video fidelity. The current method faces limitations in editing multiple objects simultaneously and changing object size on a large scale. Future work includes enhancing the model's capability to handle multiple objects and remove existing objects in video frames. Additionally, eliminating the fine-tuning process could make the approach more user-friendly. video editing, diffusion models, text-guided synthesis, attention control, subject-driven editing
2311.11695 Report Clarity ChatGPT: An Interactive and Adaptive Processing System for Image Restoration and Enhancement Yanyan Wei, Zhao Zhang, Jiahuan Ren, Xiaogang Xu, Richang Hong, Yi Yang, Shuicheng Yan, Meng Wang The generalization capability of existing image restoration and enhancement (IRE) methods is constrained by their limited pre-training datasets, making it difficult to handle agnostic inputs such as degradation levels and scenarios beyond their design scope. Moreover, they are not equipped with interactive mechanisms to consider user preferences or feedback, and their end-to-end settings cannot provide users with more choices. Faced with the limited performance and insufficient interactivity of existing IRE methods, we approach the problem at the engineering and system-framework levels. Specifically, we propose Clarity ChatGPT, a transformative system that combines the conversational intelligence of ChatGPT with multiple IRE methods. Clarity ChatGPT can automatically detect image degradation types and select appropriate IRE methods to restore images, or iteratively generate satisfactory results based on user feedback. Its innovative features include a CLIP-powered detector for accurate degradation classification, no-reference image quality evaluation for performance assessment, region-specific processing for precise enhancements, and advanced fusion techniques for optimal restoration results. Clarity ChatGPT marks a significant advancement in integrating language and vision, enhancing image-text interactions, and providing a robust, high-performance IRE solution. Our case studies demonstrate that Clarity ChatGPT effectively improves generalization and interaction capabilities in IRE, and also fills a gap in the low-level domain left by existing vision-language models. Clarity ChatGPT, a system bridging conversational AI (ChatGPT) with image restoration and enhancement (IRE) methods using Visual and Restoration & Enhancement Foundation Models (VFMs & REFMs). Existing IRE methods lack adaptability to diverse degradation types and user feedback. Clarity ChatGPT addresses these limitations by integrating LLMs with VFMs and REFMs for interactive, user-centric IRE solutions. Clarity ChatGPT employs a pipeline including: (1) CLIP-powered degradation detection, (2) no-reference IQA, (3) region-specific processing using SAM and GroundingDINO, and (4) multi-result fusion with a U-Net architecture. Fine-tuned CLIP achieves 94.57% Top-1 accuracy for degradation classification, significantly outperforming the original CLIP (38.27%). Multiple-result fusion for low-light enhancement with denoising shows superior performance (PSNR: 27.23, SSIM: 0.823) compared to individual methods. Case studies demonstrate Clarity ChatGPT's capability in handling complex IRE tasks, including region-specific enhancements and challenging degradation types, exceeding ChatGPT-4V's performance. Limited model sharing and collaborative development of IRE algorithms. Lack of a comprehensive user feedback mechanism for system optimization and personalization. image restoration, image enhancement, chatgpt, vision-language models, interactive image processing
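The CLIP-powered degradation detector can be pictured as image-text matching over a set of degradation prompts. The sketch below uses OpenAI's `clip` package in zero-shot mode with a hypothetical prompt template and degradation list; Clarity ChatGPT fine-tunes CLIP for this task, so this is only the classification scaffold, not the paper's detector:

```python
import clip
import torch
from PIL import Image

DEGRADATIONS = ["rain", "haze", "low light", "noise", "motion blur", "jpeg artifacts"]

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical prompt template; the paper fine-tunes CLIP rather than relying
# on zero-shot prompts.
text = clip.tokenize([f"a photo degraded by {d}" for d in DEGRADATIONS]).to(device)

def classify_degradation(image_path: str) -> str:
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)
    return DEGRADATIONS[int(probs.argmax())]

# Example call with a hypothetical file name:
# print(classify_degradation("rainy_street.png"))
```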
2311.11666 Report OmniSeg3D: Omniversal 3D Segmentation via Hierarchical Contrastive Learning Haiyang Ying, Yixuan Yin, Jinzhi Zhang, Fan Wang, Tao Yu, Ruqi Huang, Lu Fang Towards holistic understanding of 3D scenes, a general 3D segmentation method is needed that can segment diverse objects without restrictions on object quantity or categories, while also reflecting the inherent hierarchical structure. To achieve this, we propose OmniSeg3D, an omniversal segmentation method aims for segmenting anything in 3D all at once. The key insight is to lift multi-view inconsistent 2D segmentations into a consistent 3D feature field through a hierarchical contrastive learning framework, which is accomplished by two steps. Firstly, we design a novel hierarchical representation based on category-agnostic 2D segmentations to model the multi-level relationship among pixels. Secondly, image features rendered from the 3D feature field are clustered at different levels, which can be further drawn closer or pushed apart according to the hierarchical relationship between different levels. In tackling the challenges posed by inconsistent 2D segmentations, this framework yields a global consistent 3D feature field, which further enables hierarchical segmentation, multi-object selection, and global discretization. Extensive experiments demonstrate the effectiveness of our method on high-quality 3D segmentation and accurate hierarchical structure understanding. A graphical user interface further facilitates flexible interaction for omniversal 3D segmentation. This paper presents OmniSeg3D, an omniversal 3D segmentation method that segments diverse objects in 3D without restrictions on categories or quantity, while also capturing hierarchical structure. Holistic 3D scene understanding requires a general 3D segmentation method that overcomes limitations of existing methods, such as category restrictions and inability to reflect hierarchical structure. The method leverages multi-view 2D segmentations and lifts them into a consistent 3D feature field through a hierarchical contrastive learning framework. This is achieved by: (1) Designing a hierarchical 2D representation based on category-agnostic segmentations to model multi-level relationships. (2) Hierarchically clustering image features rendered from the 3D feature field, drawing them closer or pushing them apart based on their hierarchical relationships. OmniSeg3D achieves state-of-the-art performance on hierarchical 3D segmentation benchmarks, demonstrating its ability to understand scene structure across scales. The method outperforms baseline methods in 3D instance segmentation tasks, showcasing its effectiveness in segmenting individual objects. A user-friendly graphical user interface enables interactive 3D segmentation, facilitating applications like annotation and object manipulation. The lack of a clear definition for hierarchy levels in the current method may lead to inconsistent segmentation levels across different objects. Objects that never appear in the same image might exhibit similar semantic features due to contrastive learning being applied on single images. 3d segmentation, hierarchical representation learning, contrastive learning, multi-view consistency, interactive segmentation
2311.11600 Report Deep Equilibrium Diffusion Restoration with Parallel Sampling Jiezhang Cao, Yue Shi, Kai Zhang, Yulun Zhang, Radu Timofte, Luc Van Gool Diffusion model-based image restoration (IR) aims to use diffusion models to recover high-quality (HQ) images from degraded images, achieving promising performance. Due to the inherent property of diffusion models, most existing methods need long serial sampling chains to restore HQ images step-by-step, resulting in expensive sampling time and high computation costs. Moreover, such long sampling chains hinder understanding the relationship between inputs and restoration results since it is hard to compute the gradients in the whole chains. In this work, we aim to rethink the diffusion model-based IR models through a different perspective, i.e., a deep equilibrium (DEQ) fixed point system, called DeqIR. Specifically, we derive an analytical solution by modeling the entire sampling chain in these IR models as a joint multivariate fixed point system. Based on the analytical solution, we can conduct parallel sampling and restore HQ images without training. Furthermore, we compute fast gradients via DEQ inversion and found that initialization optimization can boost image quality and control the generation direction. Extensive experiments on benchmarks demonstrate the effectiveness of our method on typical IR tasks and real-world settings. This paper presents DeqIR, a zero-shot image restoration method based on deep equilibrium (DEQ) fixed-point systems for parallel sampling and initialization optimization in diffusion models. Existing diffusion model-based IR methods suffer from long serial sampling chains, leading to high computational costs and difficulties in understanding the relationship between inputs and outputs. The authors model the entire sampling chain as a joint multivariate fixed point system, deriving an analytical solution for parallel sampling. DEQ inversion enables efficient gradient computation for initialization optimization. DeqIR achieves parallel sampling, enabling faster inference and multi-GPU training compared to sequential sampling methods. DEQ inversion allows for efficient computation of gradients, facilitating initialization optimization to improve restoration quality and control generation direction. Extensive experiments demonstrate DeqIR's effectiveness on various IR tasks, outperforming existing zero-shot methods and achieving comparable results to supervised approaches, with promising real-world applicability. The performance of DeqIR depends on the accuracy of the degradation matrix, which might be unknown or inaccurate in some real-world scenarios. Exploring the application of DEQ inversion to extend DeqIR for supervised learning is a potential future direction. image restoration, diffusion models, deep equilibrium models, parallel sampling, initialization optimization
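The parallel-sampling idea can be illustrated as iterating all states of the sampling chain jointly until they stop changing, instead of unrolling the chain serially. The toy sketch below uses a placeholder update function and simple Picard iteration; DeqIR derives an analytical joint fixed-point formulation and uses DEQ inversion for gradients, neither of which is reproduced here:

```python
import torch

def parallel_fixed_point_sampling(step_fn, x_init, num_iters=50, tol=1e-4):
    """Treat the whole sampling chain as one joint fixed point.

    step_fn: callable mapping the stacked states x[T], ..., x[1] to the states
             one denoising step later (placeholder for a DDIM/DDPM update).
    x_init:  [T, B, C, H, W] initial guess for every state in the chain.
    Sequential sampling applies step_fn T times in order; here every state is
    refreshed from the previous *iterate*, so each sweep updates all T states
    in parallel.
    """
    x = x_init
    for _ in range(num_iters):
        x_new = step_fn(x)
        diff = (x_new - x).abs().mean().item()
        x = x_new
        if diff < tol:          # converged to the joint fixed point
            break
    return x[-1]                # the final (clean) state of the chain

# Toy placeholder update: each state is pulled toward its predecessor and zero.
def toy_step(x):
    prev = torch.roll(x, shifts=1, dims=0)
    prev[0] = x[0]
    return 0.5 * prev + 0.4 * x

x0 = torch.randn(10, 1, 3, 8, 8)
print(parallel_fixed_point_sampling(toy_step, x0).shape)
```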
2311.11469 Report DiffGANPaint: Fast Inpainting Using Denoising Diffusion GANs Moein Heidari, Alireza Morsali, Tohid Abedini, Samin Heydarian Free-form image inpainting is the task of reconstructing parts of an image specified by an arbitrary binary mask. In this task, it is typically desired to generalize model capabilities to unseen mask types, rather than learning certain mask distributions. Capitalizing on the advances in diffusion models, in this paper we propose a Denoising Diffusion Probabilistic Model (DDPM)-based model capable of filling in missing pixels quickly, as it models the backward diffusion process using the generator of a generative adversarial network (GAN) to reduce the sampling cost of diffusion models. Experiments on general-purpose image inpainting datasets verify that our approach performs better than or on par with most contemporary works. Presents DiffGANPaint, a novel image inpainting method combining a Denoising Diffusion Probabilistic Model (DDPM) with a Generative Adversarial Network (GAN) for fast and high-quality reconstruction of missing image regions. Addresses the computational expense of traditional DDPM-based image inpainting methods while maintaining high visual quality. Utilizes a trained DDPM to denoise the input image, then employs a trained GAN generator to fill in the masked regions, leveraging the structural consistency of DDPM and the generation speed of GANs. DiffGANPaint generates high-quality inpainted images with superior or comparable performance to contemporary methods. The method demonstrates strong generalization capabilities across diverse datasets, including CelebA-HQ faces and generic images. DiffGANPaint achieves fast inpainting with a low computational budget compared to traditional DDPM approaches. The paper does not provide quantitative comparisons to other state-of-the-art inpainting methods. Further exploration of different GAN architectures and their impact on inpainting quality is a potential avenue for future work. image inpainting, diffusion models, generative adversarial networks, ddpm, gan
2311.11465 Report Understanding Segment Anything Model: SAM is Biased Towards Texture Rather than Shape Chaoning Zhang, Yu Qiao, Shehbaz Tariq, Sheng Zheng, Chenshuang Zhang, Chenghao Li, Hyundong Shin, Choong Seon Hong In contrast to human vision, which mainly depends on shape for recognizing objects, deep image recognition models are widely known to be biased toward texture. Recently, the Meta research team released the first foundation model for image segmentation, termed the segment anything model (SAM), which has attracted significant attention. In this work, we study SAM from the perspective of texture vs. shape. Unlike label-oriented recognition tasks, SAM is trained to predict a mask covering the object shape based on a prompt. With this said, it seems self-evident that SAM is biased towards shape. In this work, however, we reveal an interesting finding: SAM is strongly biased towards texture-like dense features rather than shape. This intriguing finding is supported by a novel setup where we disentangle texture and shape cues and design texture-shape cue conflicts for mask prediction. This paper investigates whether the Segment Anything Model (SAM) prioritizes texture or shape cues when predicting object masks, revealing a surprising bias towards texture. Understanding the role of texture and shape in SAM's decision-making process is crucial for comprehending its capabilities and limitations as a foundation model for image segmentation. The authors disentangle texture and shape cues by creating images with only one type of cue and images with conflicting cues, then analyze SAM's mask predictions on these manipulated images. Texture alone can be sufficient for accurate mask prediction. Shape alone leads to less accurate mask predictions. In cases of conflicting cues, SAM predominantly relies on texture over shape. The analysis primarily focuses on silhouette-based images, potentially limiting the generalizability of findings. Further research is needed to investigate the impact of different texture types and complexities on SAM's bias. segment anything model (sam), image segmentation, texture bias, shape bias, computer vision
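One simple way to build a texture-shape cue-conflict input is to fill the silhouette of one object with the texture of another, so the two cues disagree. The numpy sketch below illustrates that compositing step only; the paper's exact protocol for constructing and evaluating its cue-conflict images is not reproduced here:

```python
import numpy as np

def make_cue_conflict(texture: np.ndarray, shape_mask: np.ndarray,
                      background: int = 255) -> np.ndarray:
    """Fill the silhouette of one object (shape cue) with the texture of
    another (texture cue).

    texture:    [H, W, 3] uint8 image providing the texture cue.
    shape_mask: [H, W] boolean silhouette providing the shape cue.
    """
    out = np.full_like(texture, background)
    out[shape_mask] = texture[shape_mask]
    return out

# Toy example: a striped "texture" pasted inside a circular "shape".
H = W = 128
yy, xx = np.indices((H, W))
circle = (yy - 64) ** 2 + (xx - 64) ** 2 < 40 ** 2
stripes = ((xx // 8) % 2 * 255).astype(np.uint8)
stripes = np.stack([stripes] * 3, axis=-1)
conflict = make_cue_conflict(stripes, circle)
print(conflict.shape)
```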
2311.11325 Report MoVideo: Motion-Aware Video Generation with Diffusion Models Jingyun Liang, Yuchen Fan, Kai Zhang, Radu Timofte, Luc Van Gool, Rakesh Ranjan While recent years have witnessed great progress on using diffusion models for video generation, most of them are simple extensions of image generation frameworks, which fail to explicitly consider one of the key differences between videos and images, i.e., motion. In this paper, we propose a novel motion-aware video generation (MoVideo) framework that takes motion into consideration from two aspects: video depth and optical flow. The former regulates motion by per-frame object distances and spatial layouts, while the later describes motion by cross-frame correspondences that help in preserving fine details and improving temporal consistency. More specifically, given a key frame that exists or generated from text prompts, we first design a diffusion model with spatio-temporal modules to generate the video depth and the corresponding optical flows. Then, the video is generated in the latent space by another spatio-temporal diffusion model under the guidance of depth, optical flow-based warped latent video and the calculated occlusion mask. Lastly, we use optical flows again to align and refine different frames for better video decoding from the latent space to the pixel space. In experiments, MoVideo achieves state-of-the-art results in both text-to-video and image-to-video generation, showing promising prompt consistency, frame consistency and visual quality. This paper proposes MoVideo, a novel motion-aware video generation framework that explicitly incorporates depth and optical flow to control video motion. Existing video generation diffusion models often lack explicit motion modeling and struggle to generate videos with natural and consistent motion. MoVideo addresses this by leveraging depth for spatial layout guidance and optical flow for temporal consistency. MoVideo consists of four stages: 1) Key frame generation from text prompts using Latent Diffusion, 2) Video depth and optical flow generation conditioned on the key frame using a 3D diffusion model, 3) Latent video generation guided by depth, optical flow-warped latent video, and occlusion mask using another 3D diffusion model, 4) Optical flow-augmented video decoding for enhanced temporal consistency. MoVideo achieves state-of-the-art results in both text-to-video and image-to-video generation on various datasets like WebVid-10M and DAVIS. The generated videos exhibit strong prompt consistency, frame consistency, and high visual quality. Ablation studies validate the contribution of each component, particularly the use of warped video, occlusion masks, and flow-augmented decoding. The current model is limited to generating videos from a single key frame. Exploring higher-resolution video generation and more complex motion patterns is left for future work. video generation, diffusion models, motion modeling, optical flow, depth estimation
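The flow-based guidance in this kind of pipeline boils down to backward-warping latents with optical flow and masking out occluded regions via a forward-backward consistency check. The sketch below shows those two generic operations with `grid_sample`; it is an illustration under standard conventions (pixel-space flow, bilinear sampling), not MoVideo's exact formulation:

```python
import torch
import torch.nn.functional as F

def warp_with_flow(src: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp `src` [B, C, H, W] with optical flow [B, 2, H, W]
    (flow in pixels, mapping each target pixel to its source location)."""
    B, _, H, W = src.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], dim=0).float().to(src)            # [2, H, W]
    coords = grid.unsqueeze(0) + flow                              # sample positions
    # normalize to [-1, 1] for grid_sample (x first, then y)
    coords_x = 2.0 * coords[:, 0] / (W - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (H - 1) - 1.0
    norm_grid = torch.stack([coords_x, coords_y], dim=-1)          # [B, H, W, 2]
    return F.grid_sample(src, norm_grid, mode="bilinear", align_corners=True)

def occlusion_mask(flow_fwd, flow_bwd, thresh=1.0):
    """1 where forward-backward flow consistency holds, 0 where likely occluded."""
    bwd_warped = warp_with_flow(flow_bwd, flow_fwd)
    err = (flow_fwd + bwd_warped).norm(dim=1, keepdim=True)        # cycle error
    return (err < thresh).float()

# Toy usage: warp a latent frame and build its occlusion mask (zero flow).
lat = torch.randn(1, 4, 32, 32)
f_fwd = torch.zeros(1, 2, 32, 32)
f_bwd = torch.zeros(1, 2, 32, 32)
warped = warp_with_flow(lat, f_fwd)
mask = occlusion_mask(f_fwd, f_bwd)
print(warped.shape, mask.mean().item())   # identity warp, mask all ones
```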
2311.11284 Report LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching Yixun Liang, Xin Yang, Jiantao Lin, Haodong Li, Xiaogang Xu, Yingcong Chen The recent advancements in text-to-3D generation mark a significant milestone in generative models, unlocking new possibilities for creating imaginative 3D assets across various real-world scenarios. While recent advancements in text-to-3D generation have shown promise, they often fall short in rendering detailed and high-quality 3D models. This problem is especially prevalent as many methods base themselves on Score Distillation Sampling (SDS). This paper identifies a notable deficiency in SDS, that it brings inconsistent and low-quality updating direction for the 3D model, causing the over-smoothing effect. To address this, we propose a novel approach called Interval Score Matching (ISM). ISM employs deterministic diffusing trajectories and utilizes interval-based score matching to counteract over-smoothing. Furthermore, we incorporate 3D Gaussian Splatting into our text-to-3D generation pipeline. Extensive experiments show that our model largely outperforms the state-of-the-art in quality and training efficiency. This paper proposes LucidDreamer, a novel text-to-3D generation framework that leverages Interval Score Matching (ISM) to enhance the fidelity of generated 3D models. Existing text-to-3D generation methods often produce overly smooth models lacking intricate details. This stems from limitations in Score Distillation Sampling (SDS), which relies on inconsistent and low-quality pseudo-ground-truth data. The authors introduce ISM, which utilizes deterministic diffusing trajectories through DDIM inversion and conducts matching between interval steps in the diffusion process. This, coupled with employing 3D Gaussian Splatting as the 3D representation, facilitates high-quality 3D generation. LucidDreamer generates highly realistic and detailed 3D models, surpassing state-of-the-art methods in visual quality. ISM effectively addresses the over-smoothing issue prevalent in SDS-based approaches. The proposed framework exhibits efficiency in training and rendering, enabling high-resolution outputs with reduced computational burden. The influence of interval length on generation quality necessitates further investigation and potential refinements. Exploring the full potential of ISM for advanced editing tasks, such as 2D/3D manipulation and control, holds promise for future work. text-to-3d generation, score distillation sampling, interval score matching, 3d gaussian splatting, generative models
2311.11261 Report Adversarial Prompt Tuning for Vision-Language Models Jiaming Zhang, Xingjun Ma, Xin Wang, Lingyu Qiu, Jiaqi Wang, Yu-Gang Jiang, Jitao Sang With the rapid advancement of multimodal learning, pre-trained Vision-Language Models (VLMs) such as CLIP have demonstrated remarkable capacities in bridging the gap between visual and language modalities. However, these models remain vulnerable to adversarial attacks, particularly in the image modality, presenting considerable security risks. This paper introduces Adversarial Prompt Tuning (AdvPT), a novel technique to enhance the adversarial robustness of image encoders in VLMs. AdvPT innovatively leverages learnable text prompts and aligns them with adversarial image embeddings, to address the vulnerabilities inherent in VLMs without the need for extensive parameter training or modification of the model architecture. We demonstrate that AdvPT improves resistance against white-box and black-box adversarial attacks and exhibits a synergistic effect when combined with existing image-processing-based defense techniques, further boosting defensive capabilities. Comprehensive experimental analyses provide insights into adversarial prompt tuning, a novel paradigm devoted to improving resistance to adversarial images through textual input modifications, paving the way for future robust multimodal learning research. These findings open up new possibilities for enhancing the security of VLMs. Our code is available at https://github.com/jiamingzhang94/Adversarial-Prompt-Tuning. This paper proposes Adversarial Prompt Tuning (AdvPT), a novel technique to enhance the adversarial robustness of image encoders in VLMs by aligning learnable text prompts with adversarial image embeddings. Existing VLMs are vulnerable to adversarial attacks, particularly in the image modality, presenting security risks. AdvPT addresses this by improving robustness without extensive parameter training or model architecture modification. AdvPT generates an adversarial image embedding bank. It then optimizes learnable text prompts by aligning them with these adversarial embeddings through backpropagation in the text encoder, leaving the image encoder untouched. AdvPT significantly improves robustness against both white-box and black-box attacks compared to vanilla CLIP. It demonstrates synergy with existing image-based defense techniques, further boosting robustness. The paper provides insights into AdvPT's working mechanism, generalization-robustness trade-off, and transferability across datasets. The evaluation of adversarial robustness is limited by the specific attacks used. The focus is restricted to image recognition tasks. adversarial robustness, vision-language models, prompt tuning, adversarial attacks, multimodal learning
2311.11243 Report AutoStory: Generating Diverse Storytelling Images with Minimal Human Effort Wen Wang, Canyu Zhao, Hao Chen, Zhekai Chen, Kecheng Zheng, Chunhua Shen Story visualization aims to generate a series of images that match the story described in texts, and it requires the generated images to satisfy high quality, alignment with the text description, and consistency in character identities. Given the complexity of story visualization, existing methods drastically simplify the problem by considering only a few specific characters and scenarios, or requiring the users to provide per-image control conditions such as sketches. However, these simplifications render these methods incompetent for real applications. To this end, we propose an automated story visualization system that can effectively generate diverse, high-quality, and consistent sets of story images, with minimal human interactions. Specifically, we utilize the comprehension and planning capabilities of large language models for layout planning, and then leverage large-scale text-to-image models to generate sophisticated story images based on the layout. We empirically find that sparse control conditions, such as bounding boxes, are suitable for layout planning, while dense control conditions, e.g., sketches and keypoints, are suitable for generating high-quality image content. To obtain the best of both worlds, we devise a dense condition generation module to transform simple bounding box layouts into sketch or keypoint control conditions for final image generation, which not only improves the image quality but also allows easy and intuitive user interactions. In addition, we propose a simple yet effective method to generate multi-view consistent character images, eliminating the reliance on human labor to collect or draw character images. Proposes AutoStory, a fully automated story visualization system that generates diverse, high-quality, and consistent story images with minimal human interaction, using LLMs for layout planning and text-to-image models for image generation. Story visualization is important for various applications like art creation, education, and cultural heritage, but existing methods are limited in versatility and require significant user effort. Uses LLMs to generate layouts from story texts, a dense condition generation module to transform layouts into sketch or keypoint conditions, and a multi-subject customization model for image generation. A training-free method generates multi-view consistent character images, eliminating the need for user-provided character images. Generates high-quality, text-aligned, and identity-consistent story images in diverse styles. Achieves superior quantitative results in text-to-image and image-to-image similarity compared to existing methods. Outperforms competing approaches in user studies evaluating text alignment, identity preservation, and image quality. Multi-concept customization process can be slow. Future work aims to accelerate the customization for real-time generation. story visualization, text-to-image generation, large language models, diffusion models, controllable image generation
2311.11221 Report GaussianDiffusion: 3D Gaussian Splatting for Denoising Diffusion Probabilistic Models with Structured Noise Xinhai Li, Huaibin Wang, Kuo-Kun Tseng Text-to-3D, known for its efficient generation methods and expansive creative potential, has garnered significant attention in the AIGC domain. However, the amalgamation of Nerf and 2D diffusion models frequently yields oversaturated images, posing severe limitations on downstream industrial applications due to the constraints of pixelwise rendering method. Gaussian splatting has recently superseded the traditional pointwise sampling technique prevalent in NeRF-based methodologies, revolutionizing various aspects of 3D reconstruction. This paper introduces a novel text to 3D content generation framework based on Gaussian splatting, enabling fine control over image saturation through individual Gaussian sphere transparencies, thereby producing more realistic images. The challenge of achieving multi-view consistency in 3D generation significantly impedes modeling complexity and accuracy. Taking inspiration from SJC, we explore employing multi-view noise distributions to perturb images generated by 3D Gaussian splatting, aiming to rectify inconsistencies in multi-view geometry. We ingeniously devise an efficient method to generate noise that produces Gaussian noise from diverse viewpoints, all originating from a shared noise source. Furthermore, vanilla 3D Gaussian-based generation tends to trap models in local minima, causing artifacts like floaters, burrs, or proliferative elements. To mitigate these issues, we propose the variational Gaussian splatting technique to enhance the quality and stability of 3D appearance. To our knowledge, our approach represents the first comprehensive utilization of Gaussian splatting across the entire spectrum of 3D content generation processes. This paper presents GaussianDiffusion, a novel text-to-3D generation framework based on Gaussian splatting for accelerated rendering and realistic 3D content creation from text prompts. Existing text-to-3D methods suffer from limitations like oversaturated images, slow rendering speed, multi-view inconsistency, and artifacts in generation. This work addresses these limitations with a novel Gaussian splatting based framework. The proposed GaussianDiffusion framework leverages Gaussian splatting for 3D representation and addresses multi-view consistency through a structured noise injection approach. It further introduces variational Gaussian splatting to enhance appearance quality and mitigate artifacts. GaussianDiffusion achieves significantly faster convergence compared to previous state-of-the-art methods like SJC and 3DFuse. The introduction of structured noise effectively addresses multi-view geometric inconsistency, leading to better 3D structure generation. Variational Gaussian splatting enhances the generated 3D appearance by reducing artifacts such as floaters, burrs, and proliferative elements. The use of variational Gaussian splatting, while improving realism, introduces some blurriness and haze in the generated output. Future work will focus on refining the variational Gaussian splatting technique to mitigate blurriness and enhance overall appearance quality. text-to-3d, gaussian splatting, 3d content generation, multi-view consistency, variational gaussian splatting
2311.11207 Report On the Noise Scheduling for Generating Plausible Designs with Diffusion Models Jiajie Fan, Laure Vuaille, Thomas Bäck, Hao Wang Deep Generative Models (DGMs) are widely used to create innovative designs across multiple industries, ranging from fashion to the automotive sector. In addition to generating images of high visual quality, the task of structural design generation imposes more stringent constraints on the semantic expression, e.g., no floating material or missing parts, which we refer to as plausibility in this work. We delve into the impact of the noise schedules of diffusion models on the plausibility of the outcome: there exists a range of noise levels at which the model's performance determines the plausibility of the result. We also propose two techniques to determine such a range for a given image set and devise a novel parametric noise schedule for better plausibility. We apply this noise schedule to the training and sampling of the well-known diffusion model EDM and compare it to its default noise schedule. Compared to EDM, our schedule significantly improves the rate of plausible designs from 83.4% to 93.5% and the Fréchet Inception Distance (FID) from 7.84 to 4.87. Further applications of advanced image editing tools demonstrate the model's solid understanding of structure. This paper proposes a Plausibility-oriented Diffusion Model (PoDM) that prioritizes a specific range of noise levels during training and sampling to improve the plausibility of generated structural designs. Existing diffusion models often prioritize visual quality over the plausibility of generated structures, leading to unrealistic designs. The authors identify a 'plausibility-relevant' range of noise levels in the diffusion process. They then modify the noise schedule of an existing diffusion model (EDM) to prioritize this range during both training and sampling. PoDM significantly increases the rate of plausible designs from 83.4% (EDM) to 93.5%, almost reaching the performance of DDPM (94%) but with a much faster sampling speed. PoDM achieves a FID of 4.87, improving upon EDM's 7.84. The authors demonstrate PoDM's ability to semantically manipulate structural designs using image editing techniques like interpolation, dragging, and inpainting. The study focuses solely on the BIKED dataset, potentially limiting the generalizability of the findings. Future work could explore automated methods for evaluating the plausibility of generated images. diffusion models, generative design, structural design, noise schedule, image plausibility
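The mechanism can be pictured as biasing the noise levels sampled during training toward an identified "plausibility-relevant" band. The sketch below mixes a log-uniform draw inside an assumed band [sigma_lo, sigma_hi] with EDM's default log-normal proposal (P_mean = -1.2, P_std = 1.2); the paper derives a specific parametric schedule, which this does not reproduce:

```python
import torch

def sample_training_sigmas(batch, sigma_lo, sigma_hi, p_band=0.7,
                           p_mean=-1.2, p_std=1.2):
    """Sample per-image noise levels biased toward a plausibility-relevant band.

    With probability p_band, draw log-uniformly inside [sigma_lo, sigma_hi];
    otherwise draw from EDM's default log-normal proposal ln(sigma) ~ N(p_mean, p_std^2).
    p_band, sigma_lo and sigma_hi are assumed hyperparameters, not the paper's values.
    """
    in_band = torch.rand(batch) < p_band
    lo, hi = torch.log(torch.tensor(sigma_lo)), torch.log(torch.tensor(sigma_hi))
    band = torch.exp(lo + torch.rand(batch) * (hi - lo))        # log-uniform in band
    lognormal = torch.exp(p_mean + p_std * torch.randn(batch))  # EDM default
    return torch.where(in_band, band, lognormal)

sigmas = sample_training_sigmas(8, sigma_lo=0.5, sigma_hi=5.0)
print(sigmas)
```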
2311.10995 Report Behavior Optimized Image Generation Varun Khurana, Yaman K Singla, Jayakumar Subramanian, Rajiv Ratn Shah, Changyou Chen, Zhiqiang Xu, Balaji Krishnamurthy The last few years have witnessed great success in image generation, which has crossed the acceptance threshold of aesthetics, making it directly applicable to personal and commercial applications. However, images, especially in marketing and advertising applications, are often created as a means to an end as opposed to just aesthetic concerns. The goal can be increasing sales, getting more clicks, likes, or image sales (in the case of stock businesses). Therefore, the generated images need to perform well on these key performance indicators (KPIs), in addition to being aesthetically good. In this paper, we make the first endeavor to answer the question: "How can one infuse knowledge of the end goal within the image generation process itself to create not just better-looking images but also better-performing images?" We propose BoigLLM, an LLM that understands both image content and user behavior. BoigLLM knows how an image should look to get a certain required KPI. We show that BoigLLM outperforms 13x larger models such as GPT-3.5 and GPT-4 in this task, demonstrating that while these state-of-the-art models can understand images, they lack information on how these images perform in the real world. To generate actual pixels of behavior-conditioned images, we train a diffusion-based model (BoigSD) to align with a proposed BoigLLM-defined reward. We show the performance of the overall pipeline on two datasets covering two different behaviors: a stock dataset with the number of forward actions as the KPI and a dataset containing tweets with the total likes as the KPI, denoted as BoigBench. To advance research in the direction of utility-driven image generation and understanding, we release BoigBench, a benchmark dataset containing 168 million enterprise tweets with their media, brand account names, time of post, and total likes. Introduces behavior-optimized image generation (BOIG), focusing on generating images that not only look good but also perform well on key performance indicators (KPIs) like likes and downloads. Images often serve a purpose beyond aesthetics, especially in marketing. Aligning image generation with user behavior can lead to more effective marketing campaigns. 1. Creates BoigLLM, an LLM fine-tuned to understand image content and predict user behavior (likes, downloads). 2. Uses BoigLLM as a reward model to train BoigSD, a diffusion model that generates images optimized for desired KPIs. BoigLLM outperforms larger LLMs (GPT-3.5, GPT-4) in predicting image attributes based on desired behavior. BoigSD generates images that score higher on BoigLLM's reward model, indicating better alignment with desired KPIs. Supervised fine-tuning of stable diffusion on high-KPI images alone does not improve performance. The reward function relies on non-differentiable featurizers, limiting the use of end-to-end analytic policy gradients. Current work focuses on likes and downloads; exploring other KPIs and user behaviors is crucial. image generation, user behavior, large language models, diffusion models, marketing
2311.10982 Report Make Pixels Dance: High-Dynamic Video Generation Yan Zeng, Guoqiang Wei, Jiani Zheng, Jiaxin Zou, Yang Wei, Yuchen Zhang, Hang Li Creating high-dynamic videos such as motion-rich actions and sophisticated visual effects poses a significant challenge in the field of artificial intelligence. Unfortunately, current state-of-the-art video generation methods, primarily focusing on text-to-video generation, tend to produce video clips with minimal motions despite maintaining high fidelity. We argue that relying solely on text instructions is insufficient and suboptimal for video generation. In this paper, we introduce PixelDance, a novel approach based on diffusion models that incorporates image instructions for both the first and last frames in conjunction with text instructions for video generation. Comprehensive experimental results demonstrate that PixelDance trained with public data exhibits significantly better proficiency in synthesizing videos with complex scenes and intricate motions, setting a new standard for video generation. This paper introduces PixelDance, a novel video generation approach using diffusion models that incorporates image instructions for the first and last frames alongside text instructions. Current video generation models struggle to create high-dynamic videos with complex scenes and motions. This approach aims to address this limitation by providing more direct visual guidance. The method utilizes a latent diffusion model conditioned on text and image instructions. The image instructions, encoded using a VAE, are concatenated with the latent video representation. The model is trained to avoid directly replicating the last frame, allowing flexibility during inference. PixelDance achieves state-of-the-art results on zero-shot video generation benchmarks MSR-VTT and UCF-101, outperforming existing methods in metrics like FVD and CLIP-similarity. The model demonstrates superior performance in generating long videos with temporal consistency compared to autoregressive and hierarchical approaches. PixelDance exhibits strong generalization ability, generating high-quality videos in out-of-domain styles like comics and cartoons despite being trained primarily on realistic data. The model's performance could be further enhanced by training on larger, higher-quality, and more diverse video datasets. Incorporating annotated texts describing key video elements and motions could improve alignment with user instructions. video generation, diffusion models, image instruction, long video generation, zero-shot video editing
2311.10807 Report SENetV2: Aggregated dense layer for channelwise and global representations Mahendran Narayanan Convolutional Neural Networks (CNNs) have revolutionized image classification by extracting spatial features and enabling state-of-the-art accuracy in vision-based tasks. The module proposed in the squeeze-and-excitation network gathers channel-wise representations of the input. Multilayer perceptrons (MLPs) learn global representations from the data and are used in most image classification models to learn from the extracted image features. In this paper, we introduce a novel aggregated multilayer perceptron, a multi-branch dense layer, within the Squeeze excitation residual module designed to surpass the performance of existing architectures. Our approach combines the squeeze-and-excitation module with dense layers. This fusion enhances the network's ability to capture channel-wise patterns and acquire global knowledge, leading to a better feature representation. The proposed model has a negligible increase in parameters compared to SENet. We conduct extensive experiments on benchmark datasets to validate the model and compare it with established architectures. Experimental results demonstrate a remarkable increase in the classification accuracy of the proposed model. This paper introduces SENetV2, an enhanced Squeeze and Excitation Network (SENet) module called Squeeze Aggregated Excitation (SaE) that improves feature representation by incorporating multi-branch fully connected layers within the SENet architecture. The authors aim to address the limitations of CNNs in capturing global representations and enhance the performance of SENet by introducing an aggregated multi-layer perceptron (MLP) within the Squeeze Excitation Residual Module. The authors propose a novel SaE module which incorporates multi-branch fully connected layers within the squeeze operation of the SENet module. This enables the model to learn richer global representations while maintaining a relatively lightweight structure. The proposed module is integrated into a ResNet architecture and evaluated on CIFAR-10, CIFAR-100, and a modified ImageNet dataset. SENetV2 outperforms vanilla ResNet and SENet on CIFAR-10 and CIFAR-100 datasets, demonstrating the effectiveness of the aggregated FC layers. The proposed model achieves competitive results on the modified ImageNet dataset, further validating its capability in improving image classification accuracy. The SaE module proves to be effective in enhancing feature representation by combining spatial, channel-wise, and global representations. The paper acknowledges the computational limitations, particularly with the modified ImageNet dataset, which could have limited the full potential of SENetV2. Further exploration of different cardinality values and reduction sizes within the SaE module could lead to additional performance improvements. image classification, convolutional neural networks, squeeze and excitation networks, aggregated modules, global representations
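A sketch of what a Squeeze Aggregated Excitation (SaE) block could look like in PyTorch, following the row above: global average pooling followed by multiple parallel FC branches whose outputs are aggregated before the excitation projection. The concatenation-based aggregation, branch count, and reduction ratio are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class SqueezeAggregatedExcitation(nn.Module):
    """SaE-style block sketch: squeeze (global average pool), multi-branch FC
    bottlenecks, aggregation, then a sigmoid excitation over channels."""
    def __init__(self, channels, reduction=16, branches=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True))
            for _ in range(branches)
        ])
        self.excite = nn.Sequential(
            nn.Linear(branches * (channels // reduction), channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        s = self.pool(x).flatten(1)                                 # squeeze: (B, C)
        z = torch.cat([branch(s) for branch in self.branches], 1)   # aggregated branches
        w = self.excite(z).view(b, c, 1, 1)                         # channel weights
        return x * w

# usage sketch
print(SqueezeAggregatedExcitation(64)(torch.randn(2, 64, 8, 8)).shape)  # torch.Size([2, 64, 8, 8])
```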
2311.10794 Report Text-to-Sticker: Style Tailoring Latent Diffusion Models for Human Expression Animesh Sinha, Bo Sun, Anmol Kalia, Arantxa Casanova, Elliot Blanchard, David Yan, Winnie Zhang, Tony Nelli, Jiahui Chen, Hardik Shah, Licheng Yu, Mitesh Kumar Singh, Ankit Ramchandani, Maziar Sanjabi, Sonal Gupta, Amy Bearman, Dhruv Mahajan We introduce Style Tailoring, a recipe to finetune Latent Diffusion Models (LDMs) in a distinct domain with high visual quality, prompt alignment and scene diversity. We choose sticker image generation as the target domain, as the images significantly differ from photorealistic samples typically generated by large-scale LDMs. We start with a competent text-to-image model, like Emu, and show that relying on prompt engineering with a photorealistic model to generate stickers leads to poor prompt alignment and scene diversity. To overcome these drawbacks, we first finetune Emu on millions of sticker-like images collected using weak supervision to elicit diversity. Next, we curate human-in-the-loop (HITL) Alignment and Style datasets from model generations, and finetune to improve prompt alignment and style alignment respectively. Sequential finetuning on these datasets poses a tradeoff between better style alignment and prompt alignment gains. To address this tradeoff, we propose a novel fine-tuning method called Style Tailoring, which jointly fits the content and style distribution and achieves best tradeoff. Evaluation results show our method improves visual quality by 14%, prompt alignment by 16.2% and scene diversity by 15.3%, compared to prompt engineering the base Emu model for stickers generation. Introduces Style Tailoring, a novel fine-tuning method for Latent Diffusion Models (LDMs) to generate images in a distinct domain (sticker images) with high visual quality, prompt alignment, and scene diversity. Addresses the limitations of existing LDM fine-tuning methods that struggle to simultaneously improve prompt alignment, visual diversity, visual appeal, and adherence to a specific style. Employs a multi-stage fine-tuning approach: (1) Domain alignment using weakly aligned sticker-like images. (2) Prompt alignment using a human-in-the-loop (HITL) dataset. (3) Style alignment using an expert-in-the-loop (EITL) dataset. Introduces Style Tailoring, which jointly optimizes for content and style by training on different data distributions at different denoising timesteps. Style Tailoring achieves the best trade-off between prompt alignment, style alignment, visual quality, and scene diversity compared to baseline methods and sequential fine-tuning. Domain alignment fine-tuning significantly improves scene diversity and moderately enhances prompt alignment. Human and expert-in-the-loop datasets are crucial for achieving high prompt and style alignment, respectively. Rare occurrences of photorealistic backgrounds in generated stickers, potentially due to unseen concepts during training. Subjectivity in human evaluation of generative models, as preferences can shift over time. latent diffusion models, fine-tuning, style transfer, text-to-image generation, human-in-the-loop
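A loose sketch of the timestep-dependent data mixing that Style Tailoring is summarized as performing above: the prompt-alignment (content) data supervises one range of denoising timesteps and the style data the other. The switch point, which range receives which distribution, and the training interfaces (`unet`, `vae_encode`, `text_encode`, `scheduler`) are all assumptions here, not the paper's implementation.

```python
import torch

def style_tailoring_step(unet, vae_encode, text_encode, content_batch, style_batch,
                         scheduler, t_switch=600, num_train_timesteps=1000):
    """One training step sketch: noisier timesteps use the content (prompt-alignment)
    batch, less-noisy timesteps use the style batch; standard epsilon-prediction loss.
    Which range gets which distribution is an assumption."""
    t = torch.randint(0, num_train_timesteps, (1,)).item()
    images, prompts = (content_batch if t >= t_switch else style_batch)
    latents = vae_encode(images)
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, torch.tensor([t]))
    pred = unet(noisy, torch.tensor([t]), text_encode(prompts))
    return torch.nn.functional.mse_loss(pred, noise)
```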
2311.10708 Report SelfEval: Leveraging the discriminative nature of generative models for evaluation Sai Saketh Rambhatla, Ishan Misra In this work, we show that text-to-image generative models can be 'inverted' to assess their own text-image understanding capabilities in a completely automated manner. Our method, called SelfEval, uses the generative model to compute the likelihood of real images given text prompts, making the generative model directly applicable to discriminative tasks. Using SelfEval, we repurpose standard datasets created for evaluating multimodal text-image discriminative models to evaluate generative models in a fine-grained manner: assessing their performance on attribute binding, color recognition, counting, shape recognition, spatial understanding. To the best of our knowledge SelfEval is the first automated metric to show a high degree of agreement for measuring text-faithfulness with the gold-standard human evaluations across multiple models and benchmarks. Moreover, SelfEval enables us to evaluate generative models on challenging tasks such as Winoground image-score where they demonstrate competitive performance to discriminative models. We also show severe drawbacks of standard automated metrics such as CLIP-score to measure text faithfulness on benchmarks such as DrawBench, and how SelfEval sidesteps these issues. We hope SelfEval enables easy and reliable automated evaluation for diffusion models. The paper introduces "SelfEval," a method to automatically assess the text-image understanding of text-to-image generative models by inverting them to perform discriminative tasks. Automated evaluation of text-to-image models is crucial for efficient research and comparison but current methods rely on external models like CLIP, introducing biases and limitations. SelfEval estimates the likelihood of real images given text prompts using the diffusion model itself, converting it into a discriminative model for image-text matching tasks. SelfEval's ranking of text-faithfulness across different diffusion models aligns with human evaluation. Latent diffusion models show superior text-faithfulness compared to pixel diffusion models, confirmed by both SelfEval and human evaluations. SelfEval enables diffusion models to achieve competitive performance on challenging benchmarks like Winoground, surpassing previous methods and some discriminative models. SelfEval's computational cost is directly proportional to the number of timesteps in the diffusion process. Future work could explore generalizing SelfEval to non-diffusion based generative models. text-to-image generation, diffusion models, automated evaluation, text faithfulness, image-text matching
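SelfEval scores image-text pairs with the generative model itself; the sketch below uses the common simplification of ranking candidate prompts by average noise-prediction error instead of the paper's exact likelihood estimate. The `eps_model` callable and the linear-beta noise schedule are assumptions.

```python
import torch

@torch.no_grad()
def score_prompts(eps_model, x0_latent, text_embs, num_samples=50, T=1000):
    """Return one score per candidate prompt for a single image latent: the average
    denoising MSE over randomly sampled timesteps (lower = better image-text match).
    `eps_model(x_t, t, text_emb)` is any callable returning the predicted noise."""
    betas = torch.linspace(1e-4, 0.02, T)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    scores = []
    for emb in text_embs:
        errs = []
        for _ in range(num_samples):
            t = torch.randint(0, T, (1,))
            noise = torch.randn_like(x0_latent)
            x_t = alphas_bar[t].sqrt() * x0_latent + (1 - alphas_bar[t]).sqrt() * noise
            errs.append(torch.mean((eps_model(x_t, t, emb) - noise) ** 2))
        scores.append(torch.stack(errs).mean())
    return torch.stack(scores)   # pick text_embs[scores.argmin()] as the predicted match
```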
2311.10522 Report Enhancing Object Coherence in Layout-to-Image Synthesis Yibin Wang, Weizhong Zhang, Jianwei Zheng, Cheng Jin Layout-to-image synthesis is an emerging technique in conditional image generation. It aims to generate complex scenes, where users require fine control over the layout of the objects in a scene. However, it remains challenging to control the object coherence, including semantic coherence (e.g., the cat looks at the flowers or not) and physical coherence (e.g., the hand and the racket should not be misaligned). In this paper, we propose a novel diffusion model with effective global semantic fusion (GSF) and self-similarity feature enhancement modules to guide the object coherence for this task. For semantic coherence, we argue that the image caption contains rich information for defining the semantic relationship within the objects in the images. Instead of simply employing cross-attention between captions and generated images, which addresses the highly relevant layout restriction and semantic coherence separately and thus leads to unsatisfying results shown in our experiments, we develop GSF to fuse the supervision from the layout restriction and semantic coherence requirement and exploit it to guide the image synthesis process. Moreover, to improve the physical coherence, we develop a Self-similarity Coherence Attention (SCA) module to explicitly integrate local contextual physical coherence into each pixel's generation process. Specifically, we adopt a self-similarity map to encode the coherence restrictions and employ it to extract coherent features from text embedding. Through visualization of our self-similarity map, we explore the essence of SCA, revealing that its effectiveness is not only in capturing reliable physical coherence patterns but also in enhancing complex texture generation. Extensive experiments demonstrate the superiority of our proposed method in both image generation quality and controllability. This paper presents EOCNet, a novel diffusion model for layout-to-image synthesis (LIS) that addresses object coherence challenges by incorporating global semantic fusion (GSF) and self-similarity feature enhancement (SFE) modules. LIS often struggles with maintaining object coherence, both semantically (e.g., ensuring a cat looks at flowers) and physically (e.g., aligning a hand with a racket). EOCNet tackles these issues to achieve higher quality and controllability in generated images. EOCNet leverages a pre-trained text-to-image diffusion model. GSF integrates semantic coherence cues from captions and layout restrictions. SFE, comprising rectified cross-attention (RCA) and self-similarity coherence attention (SCA), refines object generation with contextual awareness. EOCNet outperforms SOTA methods on FID and DS, indicating superior image quality and diversity. Visualization of SCA's self-similarity maps reveals its effectiveness in capturing physical coherence patterns and enhancing complex texture generation. Caption integration enables fine-grained control over semantic coherence and image style. EOCNet encounters difficulties generating highly intricate textures, like realistic hands. Semantic misalignments may occur when the caption's coherence requirements conflict with the layout. layout-to-image synthesis, diffusion models, object coherence, semantic fusion, self-similarity attention
2311.10329 Report High-fidelity Person-centric Subject-to-Image Synthesis Yibin Wang, Weizhong Zhang, Jianwei Zheng, Cheng Jin Current subject-driven image generation methods encounter significant challenges in person-centric image generation. The reason is that they learn the semantic scene and person generation by fine-tuning a common pre-trained diffusion model, which involves an irreconcilable training imbalance. Precisely, to generate realistic persons, they need to sufficiently tune the pre-trained model, which inevitably causes the model to forget the rich semantic scene prior and makes scene generation over-fit to the training data. Moreover, even with sufficient fine-tuning, these methods still cannot generate high-fidelity persons since joint learning of the scene and person generation also leads to quality compromise. In this paper, we propose Face-diffuser, an effective collaborative generation pipeline to eliminate the above training imbalance and quality compromise. Specifically, we first develop two specialized pre-trained diffusion models, i.e., Text-driven Diffusion Model (TDM) and Subject-augmented Diffusion Model (SDM), for scene and person generation, respectively. The sampling process is divided into three sequential stages, i.e., semantic scene construction, subject-scene fusion, and subject enhancement. The first and last stages are performed by TDM and SDM respectively. The subject-scene fusion stage is a collaboration achieved through a novel and highly effective mechanism, Saliency-adaptive Noise Fusion (SNF). Specifically, it is based on our key observation that there exists a robust link between classifier-free guidance responses and the saliency of generated images. In each time step, SNF leverages the unique strengths of each model and allows for the spatial blending of predicted noises from both models automatically in a saliency-aware manner. Extensive experiments confirm the impressive effectiveness and robustness of the Face-diffuser. This paper introduces Face-diffuser, a novel pipeline for person-centric image generation that addresses limitations of existing methods by independently training two diffusion models for semantic scenes and person generation, and then seamlessly fusing their outputs using a Saliency-adaptive Noise Fusion mechanism. Existing subject-driven image generation methods struggle with person-centric generation due to training imbalance (overfitting to text prompts and forgetting scene priors) and compromised person quality from joint scene and person learning. The proposed method utilizes two independently trained diffusion models, one for scenes (TDM) and one for persons (SDM). During sampling, a three-stage process unfolds: 1) TDM constructs the scene, 2) a novel Saliency-adaptive Noise Fusion (SNF) mechanism combines outputs from TDM and SDM based on saliency maps derived from classifier-free guidance responses, and 3) SDM refines person details. Face-diffuser quantitatively outperforms state-of-the-art methods in both single- and multi-subject generation, demonstrating superior identity preservation and prompt consistency. Qualitative results showcase Face-diffuser's ability to generate high-fidelity persons consistently embedded within diverse semantic scenes, surpassing the capabilities of existing methods. Ablation studies confirm the importance of each stage in the pipeline and the effectiveness of the proposed SNF mechanism for seamless and high-quality image synthesis. The strong reliance on reference images for person generation raises privacy concerns due to the potential for unauthorized use of facial features. The current method faces limitations in editing specific attributes of generated persons. Future work aims to address these limitations and enhance control over attribute editing. image generation, diffusion models, person-centric generation, saliency-adaptive fusion, classifier-free guidance
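A sketch of how a Saliency-adaptive Noise Fusion step could be implemented for the Face-diffuser row above: per-location saliency is taken as the magnitude of each model's classifier-free-guidance response, and each location adopts the noise prediction of the model that responds more strongly. The hard argmax-style fusion and the channel-summed saliency are assumptions.

```python
import torch

def snf_fuse(eps_tdm_cond, eps_tdm_uncond, eps_sdm_cond, eps_sdm_uncond, guidance_scale=7.5):
    """Blend the scene model's (TDM) and person model's (SDM) classifier-free-guided
    noise predictions with a spatial saliency mask. All inputs: (B, C, H, W)."""
    guided_tdm = eps_tdm_uncond + guidance_scale * (eps_tdm_cond - eps_tdm_uncond)
    guided_sdm = eps_sdm_uncond + guidance_scale * (eps_sdm_cond - eps_sdm_uncond)
    # saliency = spatial magnitude of each model's guidance response
    sal_tdm = (eps_tdm_cond - eps_tdm_uncond).abs().sum(dim=1, keepdim=True)   # (B, 1, H, W)
    sal_sdm = (eps_sdm_cond - eps_sdm_uncond).abs().sum(dim=1, keepdim=True)
    mask = (sal_sdm > sal_tdm).float()                # 1 where the person model dominates
    return mask * guided_sdm + (1.0 - mask) * guided_tdm
```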
2311.10123 Report MetaDreamer: Efficient Text-to-3D Creation With Disentangling Geometry and Texture Lincong Feng, Muyu Wang, Maoyu Wang, Kuo Xu, Xiaoli Liu Generative models for 3D object synthesis have seen significant advancements with the incorporation of prior knowledge distilled from 2D diffusion models. Nevertheless, challenges persist in the form of multi-view geometric inconsistencies and slow generation speeds within the existing 3D synthesis frameworks. This can be attributed to two factors: firstly, the deficiency of abundant geometric a priori knowledge in optimization, and secondly, the entanglement issue between geometry and texture in conventional 3D generation methods. In response, we introduce MetaDreamer, a two-stage optimization approach that leverages rich 2D and 3D prior knowledge. In the first stage, our emphasis is on optimizing the geometric representation to ensure multi-view consistency and accuracy of 3D objects. In the second stage, we concentrate on fine-tuning the geometry and optimizing the texture, thereby achieving a more refined 3D object. Through leveraging 2D and 3D prior knowledge in two stages, respectively, we effectively mitigate the interdependence between geometry and texture. MetaDreamer establishes clear optimization objectives for each stage, resulting in significant time savings in the 3D generation process. Ultimately, MetaDreamer can generate high-quality 3D objects based on textual prompts within 20 minutes, and to the best of our knowledge, it is the most efficient text-to-3D generation method. Furthermore, we introduce image control into the process, enhancing the controllability of 3D generation. Extensive empirical evidence confirms that our method is not only highly efficient but also achieves a quality level that is at the forefront of current state-of-the-art 3D generation techniques. MetaDreamer is a novel text-to-3D generation method that employs a two-stage, coarse-to-fine optimization process to efficiently generate high-quality 3D geometry and textures. Existing 3D generation methods suffer from slow generation speeds and struggle to balance geometric accuracy with high-quality textures. This is due to a lack of geometric prior knowledge and entanglement of geometry and texture optimization. MetaDreamer disentangles geometry and texture learning by using 3D priors (view-dependent diffusion model, depth, and reference image) in the first stage for coarse geometric optimization. In the second stage, it utilizes fine-tuned 2D priors (text-to-image diffusion model) for texture refinement and geometric detailing. MetaDreamer generates high-quality 3D objects with strong multi-view consistency and detailed textures within 20 minutes, outperforming state-of-the-art methods in both speed and quality. Quantitative evaluations using CLIP similarity and T3Bench demonstrate MetaDreamer's superior performance in text-3D consistency and visual quality. Ablation studies confirm the effectiveness of the two-stage disentanglement approach, highlighting the complementary roles of 3D and 2D priors. MetaDreamer faces limitations in multi-object generation scenarios due to the lack of multi-object priors in current geometric knowledge. Future work will focus on incorporating richer multi-object geometric priors to enhance the model's capabilities. text-to-3d generation, 3d object synthesis, disentanglement learning, geometric priors, texture priors
2311.10081 Report DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, Ajay Divakaran We present DRESS, a large vision language model (LVLM) that innovatively exploits Natural Language feedback (NLF) from Large Language Models to enhance its alignment and interactions by addressing two key limitations in the state-of-the-art LVLMs. First, prior LVLMs generally rely only on the instruction finetuning stage to enhance alignment with human preferences. Without incorporating extra feedback, they are still prone to generate unhelpful, hallucinated, or harmful responses. Second, while the visual instruction tuning data is generally structured in a multi-turn dialogue format, the connections and dependencies among consecutive conversational turns are weak. This reduces the capacity for effective multi-turn interactions. To tackle these, we propose a novel categorization of the NLF into two key types: critique and refinement. The critique NLF identifies the strengths and weaknesses of the responses and is used to align the LVLMs with human preferences. The refinement NLF offers concrete suggestions for improvement and is adopted to improve the interaction ability of the LVLMs, which focuses on LVLMs' ability to refine responses by incorporating feedback in multi-turn interactions. To address the non-differentiable nature of NLF, we generalize conditional reinforcement learning for training. Our experimental results demonstrate that DRESS can generate more helpful (9.76%), honest (11.52%), and harmless (21.03%) responses, and more effectively learn from feedback during multi-turn interactions compared to SOTA LVLMs. This paper proposes DRESS, a large vision language model (LVLM) that utilizes Natural Language Feedback (NLF) from Large Language Models to enhance its alignment with human preferences and improve multi-turn interaction capabilities. Existing LVLMs often generate unhelpful, hallucinated, or harmful responses due to limited alignment with human preferences and weak multi-turn interaction abilities. The approach categorizes NLF into 'critique' for evaluating response quality and 'refinement' for suggesting improvements. DRESS is trained using a generalized conditional reinforcement learning algorithm to incorporate this non-differentiable feedback. DRESS generates responses that are significantly more helpful, honest, and harmless compared to state-of-the-art LVLMs. The model demonstrates superior multi-turn interaction ability, effectively learning from feedback to refine responses iteratively. The paper introduces a new dataset, VLSafe, designed for evaluating and aligning LVLMs for harmlessness. The reliance on GPT-4 for feedback and evaluation introduces a dependency on its capabilities and limitations. Future work could explore scaling up the RLAIF stage using web-scale data and developing more sophisticated refinement NLF modeling techniques. large vision language models, natural language feedback, alignment, multi-turn interaction, harmlessness
2311.09753 Report DIFFNAT: Improving Diffusion Image Quality Using Natural Image Statistics Aniket Roy, Maitreya Suin, Anshul Shah, Ketul Shah, Jiang Liu, Rama Chellappa Diffusion models have advanced generative AI significantly in terms of editing and creating naturalistic images. However, efficiently improving generated image quality is still of paramount interest. In this context, we propose a generic "naturalness" preserving loss function, viz., kurtosis concentration (KC) loss, which can be readily applied to any standard diffusion model pipeline to elevate the image quality. Our motivation stems from the projected kurtosis concentration property of natural images, which states that natural images have nearly constant kurtosis values across different band-pass versions of the image. To retain the "naturalness" of the generated images, we enforce reducing the gap between the highest and lowest kurtosis values across the band-pass versions (e.g., Discrete Wavelet Transform (DWT)) of images. Note that our approach does not require any additional guidance like classifier or classifier-free guidance to improve the image quality. We validate the proposed approach for three diverse tasks, viz., (1) personalized few-shot finetuning using text guidance, (2) unconditional image generation, and (3) image super-resolution. Integrating the proposed KC loss has improved the perceptual quality across all these tasks in terms of FID, MUSIQ score, and user evaluation. This paper proposes DiffNat, a novel kurtosis concentration (KC) loss function to improve the image quality of diffusion models by leveraging the statistical properties of natural images. Despite advancements in diffusion models, generated images can lack naturalness, especially in few-shot learning scenarios. This new loss function aims to address this limitation. The KC loss leverages the kurtosis concentration property of natural images, which states that the kurtosis values across different bandpass filtered versions of an image tend to be constant. The KC loss minimizes the difference between maximum and minimum kurtosis values across DWT filtered versions of the generated image, thereby enhancing naturalness. Adding the KC loss to DreamBooth and Custom Diffusion for few-shot finetuning results in improved image quality as measured by FID and MUSIQ scores. Integrating the KC loss with DDPM for unconditional image generation leads to better perceptual quality across diverse datasets. Incorporating the KC loss in image super-resolution diffusion models (Guided Diffusion and Latent Diffusion) significantly enhances the perceptual quality of super-resolved images. The paper primarily focuses on visual quality improvement and does not explicitly address potential limitations related to computational overhead or generalization ability. Future work could explore the application of KC loss to other generative tasks and investigate its effectiveness in conjunction with different diffusion model architectures. diffusion models, image quality, natural image statistics, kurtosis concentration, generative ai
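The kurtosis concentration loss described above is concrete enough to sketch: decompose the image into band-pass subbands, compute the kurtosis of each, and penalize the gap between the largest and smallest values. This sketch uses a single-level Haar decomposition as the band-pass transform, which is an assumption; the paper may use a different DWT configuration.

```python
import torch
import torch.nn.functional as F

def kurtosis(x, eps=1e-8):
    """Pearson kurtosis of all values in x (pooled over the batch for simplicity)."""
    x = x.flatten()
    mu = x.mean()
    var = x.var(unbiased=False) + eps
    return ((x - mu) ** 4).mean() / var ** 2

def kc_loss(img):
    """Kurtosis-concentration loss sketch: one-level Haar detail subbands (LH, HL, HH),
    kurtosis of each band, and the max-min gap as the penalty. img: (B, C, H, W)."""
    lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
    hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    filters = torch.stack([lh, hl, hh]).unsqueeze(1)       # (3, 1, 2, 2)
    b, c, h, w = img.shape
    x = img.reshape(b * c, 1, h, w)
    bands = F.conv2d(x, filters.to(img), stride=2)          # (B*C, 3, H/2, W/2)
    ks = torch.stack([kurtosis(bands[:, i]) for i in range(3)])
    return ks.max() - ks.min()

# usage sketch: add kc_loss(decoded_image) as an auxiliary term to the diffusion loss
print(kc_loss(torch.rand(2, 3, 64, 64)))
```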
2311.09571 Report 3D Paintbrush: Local Stylization of 3D Shapes with Cascaded Score Distillation Dale Decatur, Itai Lang, Kfir Aberman, Rana Hanocka In this work we develop 3D Paintbrush, a technique for automatically texturing local semantic regions on meshes via text descriptions. Our method is designed to operate directly on meshes, producing texture maps which seamlessly integrate into standard graphics pipelines. We opt to simultaneously produce a localization map (to specify the edit region) and a texture map which conforms to it. This synergistic approach improves the quality of both the localization and the stylization. To enhance the details and resolution of the textured area, we leverage multiple stages of a cascaded diffusion model to supervise our local editing technique with generative priors learned from images at different resolutions. Our technique, referred to as Cascaded Score Distillation (CSD), simultaneously distills scores at multiple resolutions in a cascaded fashion, enabling control over both the granularity and global understanding of the supervision. We demonstrate the effectiveness of 3D Paintbrush to locally texture a variety of shapes within different semantic regions. Project page: https://threedle.github.io/3d-paintbrush 3D Paintbrush is a method for automatically texturing local semantic regions on 3D meshes using text descriptions, producing texture maps compatible with standard graphics pipelines. Existing 3D editing methods struggle with precise local edits based on text prompts. 3D Paintbrush addresses this by generating both detailed texture maps and accurate localization maps for specified regions on meshes. The method uses neural networks to represent localization and texture maps. It leverages a novel Cascaded Score Distillation (CSD) technique that utilizes multiple stages of a cascaded diffusion model for high-resolution, text-driven supervision. 3D Paintbrush generates highly detailed and localized textures on a variety of 3D shapes. Simultaneous optimization of localization and texture maps improves the quality and detail of both. CSD allows for control over the granularity and global understanding of the text-driven supervision, enabling high-resolution results. Currently, editing capabilities are limited to textures. Future work includes expanding to other localized edits like deformations and materials, as well as co-texturing multiple shapes. 3d texturing, local editing, text-to-3d, cascaded diffusion models, score distillation
2311.09221 Report Single-Image 3D Human Digitization with Shape-Guided Diffusion Badour AlBahar, Shunsuke Saito, Hung-Yu Tseng, Changil Kim, Johannes Kopf, Jia-Bin Huang We present an approach to generate a 360-degree view of a person with a consistent, high-resolution appearance from a single input image. NeRF and its variants typically require videos or images from different viewpoints. Most existing approaches taking monocular input either rely on ground-truth 3D scans for supervision or lack 3D consistency. While recent 3D generative models show promise of 3D consistent human digitization, these approaches do not generalize well to diverse clothing appearances, and the results lack photorealism. Unlike existing work, we utilize high-capacity 2D diffusion models pretrained for general image synthesis tasks as an appearance prior of clothed humans. To achieve better 3D consistency while retaining the input identity, we progressively synthesize multiple views of the human in the input image by inpainting missing regions with shape-guided diffusion conditioned on silhouette and surface normal. We then fuse these synthesized multi-view images via inverse rendering to obtain a fully textured high-resolution 3D mesh of the given person. Experiments show that our approach outperforms prior methods and achieves photorealistic 360-degree synthesis of a wide range of clothed humans with complex textures from a single image. This paper presents a novel approach to generate a 360-degree view of a person with consistent, high-resolution appearance from a single image. Creating photorealistic 3D human models typically requires multi-view images or 3D scans, which are difficult to obtain. This work aims to address this challenge by enabling personalized 3D human digitization from easily accessible single images. The method leverages a pre-trained 2D diffusion model for general image synthesis as a human appearance prior. It reconstructs the 3D geometry, synthesizes multi-view images via shape-guided diffusion inpainting using normal and silhouette maps, and finally fuses these images into a textured 3D mesh. The approach outperforms previous methods in generating high-fidelity textured 3D humans from single images. It effectively leverages the power of large-scale pre-trained 2D diffusion models for 3D human digitization. Shape guidance using both normal and silhouette maps during inpainting significantly improves the preservation of shape and structural details. The approach currently relies on off-the-shelf methods for base geometry reconstruction and back-view synthesis, inheriting their limitations. The generated textures lack view-dependency, which could be addressed in future work. digital humans, single-image 3d reconstruction, diffusion models, shape-guided synthesis, multi-view fusion
2311.09215 Report ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy Kirill Vishniakov, Zhiqiang Shen, Zhuang Liu Modern computer vision offers a great variety of models to practitioners, and selecting a model from multiple options for specific applications can be challenging. Conventionally, competing model architectures and training protocols are compared by their classification accuracy on ImageNet. However, this single metric does not fully capture performance nuances critical for specialized tasks. In this work, we conduct an in-depth comparative analysis of model behaviors beyond ImageNet accuracy, for both ConvNet and Vision Transformer architectures, each across supervised and CLIP training paradigms. Although our selected models have similar ImageNet accuracies and compute requirements, we find that they differ in many other aspects: types of mistakes, output calibration, transferability, and feature invariance, among others. This diversity in model characteristics, not captured by traditional metrics, highlights the need for more nuanced analysis when choosing among different models. Our code is available at https://github.com/kirill-vish/Beyond-INet. This paper presents a comparative analysis of ConvNet (ConvNeXt) and Vision Transformer (ViT) models trained with supervised and CLIP paradigms, going beyond traditional ImageNet accuracy evaluation to explore their behavioral nuances. Selecting appropriate vision models for specific tasks is challenging with numerous architectures and training methods. Relying solely on ImageNet accuracy is insufficient as it overlooks important model behaviors, particularly for specialized applications. The authors analyze four pretrained models (ConvNeXt and ViT, each with supervised and CLIP training) with similar ImageNet accuracies and computational costs. They evaluate various properties like model mistakes, shape/texture bias, calibration, robustness, transferability, performance on synthetic data, and transformation invariance. CLIP models exhibit better transferability and fewer classification errors relative to their ImageNet accuracy, while supervised models excel in robustness benchmarks and calibration. Supervised ConvNeXt demonstrates strong performance across various benchmarks, including transferability, challenging the dominance of CLIP models in this aspect. ConvNeXt outperforms ViT on synthetic data, while ViT shows a higher shape bias. This highlights architecture-specific strengths and weaknesses beyond ImageNet performance. The robustness evaluation is limited to ImageNet variants, potentially biasing the results. The study primarily focuses on pretrained models, neglecting the impact of fine-tuning on specific downstream tasks. model selection, convnext, vision transformer, clip, benchmarking
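Output calibration is one of the properties compared in the row above; for reference, a standard expected calibration error (ECE) computation of the kind typically used for such an analysis (generic metric code, not taken from the paper):

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=15):
    """Standard ECE: bin samples by predicted confidence and average the gap between
    each bin's accuracy and its mean confidence, weighted by the bin's share of samples."""
    confidences = np.asarray(confidences, dtype=float)
    correct = (np.asarray(predictions) == np.asarray(labels)).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return ece

# toy usage
print(expected_calibration_error([0.9, 0.6, 0.8], [1, 0, 2], [1, 1, 2]))
```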
2311.09191 Report Domain Aligned CLIP for Few-shot Classification Muhammad Waleed Gondal, Jochen Gast, Inigo Alonso Ruiz, Richard Droste, Tommaso Macri, Suren Kumar, Luitpold Staudigl Large vision-language representation learning models like CLIP have demonstrated impressive performance for zero-shot transfer to downstream tasks while largely benefiting from inter-modal (image-text) alignment via contrastive objectives. This downstream performance can further be enhanced by full-scale fine-tuning which is often compute intensive, requires large labelled data, and can reduce out-of-distribution (OOD) robustness. Furthermore, sole reliance on inter-modal alignment might overlook the rich information embedded within each individual modality. In this work, we introduce a sample-efficient domain adaptation strategy for CLIP, termed Domain Aligned CLIP (DAC), which improves both intra-modal (image-image) and inter-modal alignment on target distributions without fine-tuning the main model. For intra-modal alignment, we introduce a lightweight adapter that is specifically trained with an intra-modal contrastive objective. To improve inter-modal alignment, we introduce a simple framework to modulate the precomputed class text embeddings. The proposed few-shot fine-tuning framework is computationally efficient, robust to distribution shifts, and does not alter CLIP's parameters. We study the effectiveness of DAC by benchmarking on 11 widely used image classification tasks with consistent improvements in 16-shot classification upon strong baselines by about 2.3% and demonstrate competitive performance on 4 OOD robustness benchmarks. This paper proposes Domain Aligned CLIP (DAC), a sample-efficient domain adaptation strategy for CLIP that improves few-shot classification by aligning both intra-modal (image-image) and inter-modal (image-text) representations on target distributions. Adapting large vision-language models like CLIP to downstream tasks often requires fine-tuning, which can be resource-intensive and prone to overfitting. DAC offers a computationally efficient alternative that leverages few-shot data for improved domain adaptation. DAC utilizes a two-stage adaptation strategy. First, a lightweight adapter layer is trained with a supervised contrastive objective to improve intra-modal alignment. Second, CLIP's text embeddings are fine-tuned to enhance inter-modal alignment, resulting in DAC-VT. DAC-VT consistently outperforms competitive few-shot CLIP adaptation baselines on 11 image classification benchmarks, demonstrating the effectiveness of aligning both intra- and inter-modal representations. DAC-V, which only aligns visual features, shows better robustness to distribution shifts compared to methods focusing solely on inter-modal alignment. Analysis reveals that DAC's intra- and inter-modal classifiers make uncorrelated errors, leading to improved performance through ensembling. The two-stage adaptation process increases the computational overhead during fine-tuning compared to some baselines. Further improvement of ensembling intra- and inter-modal classifiers is possible as both still exhibit uncorrelated errors. few-shot learning, domain adaptation, vision-language models, contrastive learning, clip
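A sketch of the intra-modal part of DAC described above: a lightweight residual adapter over frozen CLIP image features trained with a supervised contrastive objective, where same-class few-shot samples in the batch are positives. The adapter shape, residual mixing ratio, and temperature are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAdapter(nn.Module):
    """Small residual MLP adapter applied on top of frozen CLIP image features."""
    def __init__(self, dim=512, hidden=256, alpha=0.5):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, dim))
        self.alpha = alpha

    def forward(self, feats):
        out = self.alpha * self.mlp(feats) + (1 - self.alpha) * feats
        return F.normalize(out, dim=-1)

def supcon_loss(feats, labels, temperature=0.07):
    """Supervised contrastive loss: for each sample, same-class batch members are positives."""
    sim = feats @ feats.t() / temperature
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    self_mask = torch.eye(len(labels), dtype=torch.bool)
    sim = sim.masked_fill(self_mask, float('-inf'))       # exclude self-similarity
    pos_mask = pos_mask.masked_fill(self_mask, 0.0)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_count = pos_mask.sum(1).clamp(min=1)
    return -((pos_mask * log_prob).sum(1) / pos_count).mean()

# usage sketch: feats = FeatureAdapter()(frozen_clip_image_features)
loss = supcon_loss(FeatureAdapter()(torch.randn(8, 512)), torch.randint(0, 3, (8,)))
```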
2311.08403 Report Instant3D: Instant Text-to-3D Generation Ming Li, Pan Zhou, Jia-Wei Liu, Jussi Keppo, Min Lin, Shuicheng Yan, Xiangyu Xu Text-to-3D generation has attracted much attention from the computer vision community. Existing methods mainly optimize a neural field from scratch for each text prompt, relying on heavy and repetitive training cost which impedes their practical deployment. In this paper, we propose a novel framework for fast text-to-3D generation, dubbed Instant3D. Once trained, Instant3D is able to create a 3D object for an unseen text prompt in less than one second with a single run of a feedforward network. We achieve this remarkable speed by devising a new network that directly constructs a 3D triplane from a text prompt. The core innovation of our Instant3D lies in our exploration of strategies to effectively inject text conditions into the network. In particular, we propose to combine three key mechanisms: cross-attention, style injection, and token-to-plane transformation, which collectively ensure precise alignment of the output with the input text. Furthermore, we propose a simple yet effective activation function, the scaled-sigmoid, to replace the original sigmoid function, which speeds up the training convergence by more than ten times. Finally, to address the Janus (multi-head) problem in 3D generation, we propose an adaptive Perp-Neg algorithm that can dynamically adjust its concept negation scales according to the severity of the Janus problem during training, effectively reducing the multi-head effect. Extensive experiments on a wide variety of benchmark datasets demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods both qualitatively and quantitatively, while achieving significantly better efficiency. The code, data, and models are available at https://github.com/ming1993li/Instant3DCodes. This paper presents Instant3D, a novel framework for fast text-to-3D generation capable of creating a 3D object from an unseen text prompt in under one second with a single run of a feedforward network. Existing text-to-3D generation methods rely on computationally expensive optimization for each new text prompt, hindering their practical deployment due to slow response times and lack of shared 3D priors across objects. Instant3D leverages a conditional feedforward network that directly constructs a 3D triplane representation from a text prompt. It employs three key mechanisms for effective text condition injection: cross-attention, style injection with Adaptive Instance Normalization (AdaIN), and token-to-plane transformation. It also introduces a scaled-sigmoid activation function for faster training convergence and an adaptive Perp-Neg algorithm to address the multi-head problem. Instant3D achieves high-quality 3D generation with accurate text-3D alignment, outperforming state-of-the-art methods qualitatively and quantitatively on various benchmark datasets. The proposed method demonstrates superior efficiency, generating 3D objects in under a second compared to hours required by existing optimization-based approaches. Ablation studies confirm the effectiveness of each proposed component, highlighting their contributions to fast and accurate text-to-3D generation. Current benchmark prompt sets, while diverse, are relatively small compared to text-to-image datasets, limiting the model's generalization ability. The computational cost of training text-to-3D networks remains high, posing challenges for scaling up to larger datasets. text-to-3d generation, neural radiance fields, deep learning, computer vision, generative models
2311.08400 Report Towards Open-Ended Visual Recognition with Large Language Model Qihang Yu, Xiaohui Shen, Liang-Chieh Chen Localizing and recognizing objects in the open-ended physical world poses a long-standing challenge within the domain of machine perception. Recent methods have endeavored to address the issue by employing a class-agnostic mask (or box) proposal model, complemented by an open-vocabulary classifier (e.g., CLIP) using pre-extracted text embeddings. However, it is worth noting that these open-vocabulary recognition models still exhibit limitations in practical applications. On one hand, they rely on the provision of class names during testing, where the recognition performance heavily depends on this predefined set of semantic classes by users. On the other hand, when training with multiple datasets, human intervention is required to alleviate the label definition conflict between them. In this paper, we introduce the OmniScient Model (OSM), a novel Large Language Model (LLM) based mask classifier, as a straightforward and effective solution to the aforementioned challenges. Specifically, OSM predicts class labels in a generative manner, thus removing the supply of class names during both training and testing. It also enables cross-dataset training without any human interference, exhibiting robust generalization capabilities due to the world knowledge acquired from the LLM. By combining OSM with an off-the-shelf mask proposal model, we present promising results on various benchmarks, and demonstrate its effectiveness in handling novel concepts. Code/model are available at https://github.com/bytedance/OmniScient-Model. This paper presents OmniScient Model (OSM), a novel generative framework for open-ended recognition tasks that leverages a Large Language Model (LLM) to predict class labels directly without predefined vocabularies. Existing open-vocabulary recognition models rely on predefined class names, hindering their applicability to real-world scenarios with novel concepts and complicating training with multiple datasets. OSM combines a frozen CLIP-ViT for feature extraction, a trainable bridging module (Mask Query Former) for mask-aware feature resampling, and a frozen LLM for generative class label prediction. The model is trained with an instruction tuning approach on multiple segmentation datasets. OSM achieves comparable accuracy to discriminative models when evaluated on mask classification with ground-truth masks, demonstrating the effectiveness of generative models for discriminative tasks. OSM exhibits strong generalization ability, achieving state-of-the-art performance on open-vocabulary benchmarks and handling novel concepts beyond predefined vocabularies. The proposed Mode Query mechanism allows OSM to balance between vocabulary-specific and vocabulary-agnostic predictions, making it adaptable to diverse real-world scenarios. The trade-off between accuracy and generalization ability in OSM requires further investigation to mitigate potential overfitting to training vocabularies. Exploring stronger base models and larger datasets with better diversity could further improve OSM's performance and expressiveness in class label prediction. open-ended recognition, open-vocabulary, generative model, large language model, segmentation
2311.08046 Report Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, Li Yuan Large language models have demonstrated impressive universal capabilities across a wide range of open-ended tasks and have extended their utility to encompass multimodal conversations. However, existing methods encounter challenges in effectively handling both image and video understanding, particularly with limited visual tokens. In this work, we introduce Chat-UniVi, a Unified Vision-language model capable of comprehending and engaging in conversations involving images and videos through a unified visual representation. Specifically, we employ a set of dynamic visual tokens to uniformly represent images and videos. This representation framework empowers the model to efficiently utilize a limited number of visual tokens to simultaneously capture the spatial details necessary for images and the comprehensive temporal relationship required for videos. Moreover, we leverage a multi-scale representation, enabling the model to perceive both high-level semantic concepts and low-level visual details. Notably, Chat-UniVi is trained on a mixed dataset containing both images and videos, allowing direct application to tasks involving both mediums without requiring any modifications. Extensive experimental results demonstrate that Chat-UniVi consistently outperforms even existing methods exclusively designed for either images or videos. Code is available at https://github.com/PKU-YuanGroup/Chat-UniVi. This paper introduces Chat-UniVi, a unified vision-language model capable of understanding and engaging in conversations involving both images and videos through a shared representation framework. Existing methods for multimodal conversations often specialize in either image or video understanding, struggling to effectively capture both spatial details and temporal relationships with limited visual tokens. Chat-UniVi leverages dynamic visual tokens to uniformly represent images and videos. It employs a token merging method based on the DPC-KNN clustering algorithm to progressively merge visual tokens with similar semantic meanings, reducing the token number while preserving crucial information. Additionally, a multi-scale representation is used to capture both high-level semantic concepts and low-level visual details. Chat-UniVi consistently outperforms existing methods exclusively designed for either images or videos in both GPT-based and question-answering evaluations. The model achieves impressive results in object hallucination benchmarks, indicating its strong capability to comprehend visual content and resist generating unrealistic descriptions. Joint training on a mixed dataset of images and videos is shown to be crucial, allowing Chat-UniVi to excel in tasks involving both media types without requiring any modifications. The model currently relies on the capabilities of pre-trained large language models, inheriting their potential vulnerabilities such as hallucination and limitations in long sequence processing. While natural language serves as a flexible interface for various tasks, it might not be optimal for tasks demanding structured outputs or generating dense predictions. multimodal learning, vision-language model, large language model, dynamic visual tokens, multi-scale representation
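A simplified sketch of DPC-KNN-style token merging as summarized above for Chat-UniVi: density is estimated from k-nearest-neighbor distances, cluster centers are chosen by density times separation, and the remaining tokens are assigned to the nearest center and averaged. The exact scoring, weighting, and multi-scale details in the paper may differ.

```python
import torch

def dpc_knn_merge(tokens, num_clusters=16, k=5):
    """tokens: (N, D) visual tokens for one sample -> (num_clusters, D) merged tokens."""
    dist = torch.cdist(tokens, tokens)                        # (N, N)
    knn_d, _ = dist.topk(k + 1, largest=False)                # includes self at distance 0
    density = (-(knn_d[:, 1:] ** 2).mean(dim=1)).exp()        # local density per token
    # separation: distance to the nearest token of higher density (max distance if none)
    higher = density.unsqueeze(0) > density.unsqueeze(1)      # higher[i, j]: token j denser than i
    sep = dist.masked_fill(~higher, float('inf')).min(dim=1).values
    sep[torch.isinf(sep)] = dist.max()
    centers = (density * sep).topk(num_clusters).indices      # density-peak centers
    assign = dist[:, centers].argmin(dim=1)                   # nearest center for every token
    merged = torch.stack([
        tokens[assign == c].mean(dim=0) if (assign == c).any() else tokens[centers[c]]
        for c in range(num_clusters)
    ])
    return merged

# usage sketch: merge 100 patch tokens into 16 dynamic visual tokens
print(dpc_knn_merge(torch.randn(100, 64)).shape)   # torch.Size([16, 64])
```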
2311.07885 Report One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion Minghua Liu, Ruoxi Shi, Linghao Chen, Zhuoyang Zhang, Chao Xu, Xinyue Wei, Hansheng Chen, Chong Zeng, Jiayuan Gu, Hao Su Recent advancements in open-world 3D object generation have been remarkable, with image-to-3D methods offering superior fine-grained control over their text-to-3D counterparts. However, most existing models fall short in simultaneously providing rapid generation speeds and high fidelity to input images - two features essential for practical applications. In this paper, we present One-2-3-45++, an innovative method that transforms a single image into a detailed 3D textured mesh in approximately one minute. Our approach aims to fully harness the extensive knowledge embedded in 2D diffusion models and priors from valuable yet limited 3D data. This is achieved by initially finetuning a 2D diffusion model for consistent multi-view image generation, followed by elevating these images to 3D with the aid of multi-view conditioned 3D native diffusion models. Extensive experimental evaluations demonstrate that our method can produce high-quality, diverse 3D assets that closely mirror the original input image. Our project webpage: https://sudo-ai-3d.github.io/One2345plus_page. Presents One-2-3-45++, a method that generates textured 3D meshes from a single image in approximately one minute, leveraging 2D diffusion models and 3D data priors for fast generation and high fidelity to the input. Addresses limitations of existing image-to-3D methods that are either slow and high-quality (optimization-based) or fast and low-quality (feed-forward). 1. **Consistent Multi-view Generation:** Fine-tunes a 2D diffusion model to generate consistent multi-view images from a single input image. 2. **3D Diffusion with Multi-View Condition:** Employs a multi-view conditioned 3D diffusion model to generate a textured mesh from the multi-view images. 3. **Texture Refinement:** Uses a lightweight optimization technique to refine the texture of the generated mesh using the multi-view images. Achieves state-of-the-art results on the GSO dataset in terms of F-Score, CLIP similarity, and user preference. Outperforms existing text-to-3D methods in terms of CLIP similarity and user preference. Demonstrates significant speed advantages over optimization-based methods while maintaining high fidelity to the input image. Potential to improve geometry robustness and detail by incorporating additional guiding conditions from 2D diffusion models. Reliance on accurate camera pose estimation for multi-view generation. 3d generation, image-to-3d, text-to-3d, diffusion models, multi-view consistency
2311.07414 Report FIRST: A Million-Entry Dataset for Text-Driven Fashion Synthesis and Design Zhen Huang, Yihao Li, Dong Pei, Jiapeng Zhou, Xuliang Ning, Jianlin Han, Xiaoguang Han, Xuejun Chen Text-driven fashion synthesis and design is an extremely valuable part of artificial intelligence generative content (AIGC), which has the potential to propel a tremendous revolution in the traditional fashion industry. To advance the research on text-driven fashion synthesis and design, we introduce a new dataset comprising a million high-resolution fashion images with rich structured textual (FIRST) descriptions. In FIRST, there is a wide range of attire categories and each image-paired textual description is organized at multiple hierarchical levels. Experiments on prevalent generative models trained over FIRST show the necessity of FIRST. We invite the community to further develop more intelligent fashion synthesis and design systems that make fashion design more creative and imaginative based on our dataset. The dataset will be released soon. Introduces FIRST, a million-entry dataset of high-resolution fashion images with rich, structured textual descriptions for advancing text-driven fashion synthesis and design. Existing fashion datasets lack either textual descriptions or have limited scale and unstructured text, hindering the development of intelligent fashion design systems. Collected over a million raw images from the internet and commercial partners, cleaned for quality, and hierarchically annotated using GPT-4V and human revision. FIRST is the largest fashion dataset with hierarchical annotations, covering diverse attire categories and photographic scenes. Fine-tuning Stable Diffusion on FIRST significantly improves FID and CLIP-S scores, demonstrating enhanced generation quality and text control. Human feedback confirms improved quality and text-image alignment of generated images after fine-tuning on FIRST. Current diffusion models struggle with long text prompts like those in FIRST, limiting their capacity to handle detailed descriptions. Generating cohesive fashion collections from shared design philosophies remains a challenge, requiring models to understand abstract concepts and translate them into coherent visual styles. fashion synthesis, text-to-image generation, dataset, diffusion models, computer vision
2311.06978 Report Augmented Bridge Matching Valentin De Bortoli, Guan-Horng Liu, Tianrong Chen, Evangelos A. Theodorou, Weili Nie Flow and bridge matching are a novel class of processes which encompass diffusion models. One of the main aspects of their increased flexibility is that these models can interpolate between arbitrary data distributions, i.e., they generalize beyond generative modeling and can be applied to learning stochastic (and deterministic) processes of arbitrary transfer tasks between two given distributions. In this paper, we highlight that while flow and bridge matching processes preserve the information of the marginal distributions, they do not necessarily preserve the coupling information unless additional, stronger optimality conditions are met. This can be problematic if one aims at preserving the original empirical pairing. We show that a simple modification of the matching process recovers this coupling by augmenting the velocity field (or drift) with the information of the initial sample point. Doing so, we lose the Markovian property of the process but preserve the coupling information between distributions. We illustrate the efficiency of our augmentation in learning a mixture of image translation tasks. This paper investigates flow/bridge matching, showing that while it preserves marginal distributions, it doesn't always preserve coupling information, crucial for tasks like image translation where paired data relationships are key. Preserving coupling information is important in applications like image translation where the paired training data encodes the relationship between degraded and clean images. The authors leverage Doob h-transform theory to analyze flow/bridge matching fixed points and propose "Augmented Bridge Matching," which modifies the drift term to explicitly incorporate initial sample information, thus preserving the coupling. Bridge matching preserves the original coupling if and only if the training coupling is the optimal transport coupling (Schrödinger Bridge). Augmenting the bridge matching drift term with initial sample information allows for the preservation of the training coupling. Augmented Bridge Matching outperforms standard bridge matching in multi-domain image-to-image translation tasks, both qualitatively and quantitatively (using FID). The impact of intermediate augmentation levels (conditioning on X_{αt} with α ∈ (0,1)) on coupling preservation remains unclear. High entropy in the training coupling can hinder the training process due to increased loss variance. diffusion models, bridge matching, coupling preservation, image translation, doob h-transform
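For the Brownian-bridge case, the contrast drawn above can be written out explicitly; a sketch with notation chosen here (unit time horizon, diffusion coefficient absorbed into the bridge) rather than taken from the paper:

```latex
% Standard bridge matching: regress the Markovian drift, which only sees X_t.
\mathcal{L}_{\mathrm{BM}}(\theta)=
\mathbb{E}_{(X_0,X_1)\sim\pi,\; t\sim\mathcal{U}[0,1],\; X_t\sim q_{t\mid 0,1}(\cdot\mid X_0,X_1)}
\left[\Bigl\| v_\theta(t,X_t)-\tfrac{X_1-X_t}{1-t}\Bigr\|^2\right],
\qquad
v^\star(t,x)=\mathbb{E}\!\left[\tfrac{X_1-X_t}{1-t}\,\middle|\,X_t=x\right].

% The minimizer v^* preserves the marginals of pi but, in general, not the coupling itself.
% Augmented bridge matching: condition the drift on the start point X_0 as well, so that
% sampling from a given X_0 targets the X_1 it was paired with, preserving the coupling.
\mathcal{L}_{\mathrm{ABM}}(\theta)=
\mathbb{E}_{(X_0,X_1)\sim\pi,\; t,\; X_t}
\left[\Bigl\| v_\theta(t,X_t,X_0)-\tfrac{X_1-X_t}{1-t}\Bigr\|^2\right].
```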
2311.06791 Report InfMLLM: A Unified Framework for Visual-Language Tasks Qiang Zhou, Zhibin Wang, Wei Chu, Yinghui Xu, Hao Li, Yuan Qi Large language models (LLMs) have proven their remarkable versatility in handling a comprehensive range of language-centric applications. To expand LLMs' capabilities to a broader spectrum of modal inputs, multimodal large language models (MLLMs) have attracted growing interest. This work delves into enabling LLMs to tackle more vision-language-related tasks, particularly image captioning, visual question answering (VQA), and visual grounding. To this end, we implemented a three-stage training scheme: starting with lightweight alignment pretraining, then moderate-weight multitask hybrid training, and finally, LLM fine-tuning to improve instruction following capability. Throughout the training process, the requirements on GPU memory gradually increase. To effectively manage the number of visual embeddings passed to the LLM while preserving their positional information, we introduce a straightforward visual adapter module dubbed pool-adapter. Our experiments demonstrate that preserving the positional information of visual embeddings through the pool-adapter is particularly beneficial for tasks like visual grounding. We name our proposed approach InfMLLM and have evaluated it extensively on various benchmark datasets. Our results demonstrate that InfMLLM achieves either state-of-the-art (SOTA) performance or performance comparable to recent MLLMs. The code and model will be made open-source at: \url{https://github.com/mightyzau/InfMLLM}. Presents InfMLLM, a MultiModal Large Language Model framework that uses a pool-adapter to adjust the number of image embeddings dynamically while preserving positional information for enhanced performance in vision-language tasks Extends the capabilities of LLMs to multimodal domains, enabling them to handle tasks like image captioning, visual question answering, and visual grounding more effectively Implements a three-stage training scheme: lightweight alignment pretraining of a visual adapter, moderate-weight multitask hybrid training, and LLM fine-tuning for improved instruction following. Introduces pool-adapter to align visual features with text embeddings while maintaining positional information InfMLLM achieves state-of-the-art results in visual grounding and visual question answering tasks Demonstrates competitive performance in image captioning and text-oriented VQA tasks Shows that increasing visual embeddings generally improves performance, and online adjustment of embedding quantity offers a balance between speed and accuracy Multitask finetuning presents optimization conflicts between individual tasks, requiring careful tuning of loss weights and data ratios Exploring more effective solutions for multitask finetuning is crucial multimodal learning, large language models, vision-language tasks, image captioning, visual question answering, visual grounding
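A sketch of a pool-adapter of the kind described above: patch tokens are reshaped to their 2D grid, adaptively average-pooled to a smaller grid (so relative positions survive), and projected to the LLM width. The pooling target, projection depth, and dimensions are assumptions, not InfMLLM's actual configuration.

```python
import torch
import torch.nn as nn

class PoolAdapter(nn.Module):
    """Pool a ViT patch-token sequence to a smaller grid while keeping the 2D layout."""
    def __init__(self, vis_dim=1024, llm_dim=4096, out_grid=8):
        super().__init__()
        self.out_grid = out_grid
        self.proj = nn.Sequential(nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, patch_tokens):                                  # (B, N, C), N = g*g
        B, N, C = patch_tokens.shape
        g = int(N ** 0.5)
        x = patch_tokens.transpose(1, 2).reshape(B, C, g, g)          # back to the 2D grid
        x = nn.functional.adaptive_avg_pool2d(x, self.out_grid)       # (B, C, out, out)
        x = x.flatten(2).transpose(1, 2)                              # row-major order preserved
        return self.proj(x)                                           # (B, out*out, llm_dim)

# usage: 576 CLIP patch tokens (24x24) pooled to 64 embeddings for the LLM
print(PoolAdapter()(torch.randn(1, 576, 1024)).shape)   # torch.Size([1, 64, 4096])
```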
2311.06783 Report Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Kaixin Xu, Chunyi Li, Jingwen Hou, Guangtao Zhai, Geng Xue, Wenxiu Sun, Qiong Yan, Weisi Lin Multi-modality foundation models, as represented by GPT-4V, have brought a new paradigm for low-level visual perception and understanding tasks, that can respond to a broad range of natural human instructions in a model. While existing foundation models have shown exciting potential on low-level visual tasks, their related abilities are still preliminary and need to be improved. In order to enhance these models, we conduct a large-scale subjective experiment collecting a vast number of real human feedbacks on low-level vision. Each feedback follows a pathway that starts with a detailed description of the low-level visual appearance (e.g., clarity, color, brightness) of an image, and ends with an overall conclusion, with an average length of 45 words. The constructed Q-Pathway dataset includes 58K detailed human feedbacks on 18,973 images with diverse low-level appearance. Moreover, to enable foundation models to robustly respond to diverse types of questions, we design a GPT-participated conversion to process these feedbacks into diverse-format 200K instruction-response pairs. Experimental results indicate that Q-Instruct consistently elevates low-level perception and understanding abilities across several foundational models. We anticipate that our datasets can pave the way for a future in which general intelligence can perceive and understand low-level visual appearance and evaluate visual quality like a human. Our dataset, model zoo, and demo are published at: https://q-future.github.io/Q-Instruct. This paper introduces Q-Instruct, the first large-scale dataset for low-level visual instruction tuning of Multi-modality Large Language Models (MLLMs). Existing MLLMs excel at high-level visual tasks but struggle with low-level visual perception and understanding due to the lack of dedicated training data. The authors first collected Q-Pathway, a dataset of 58K human text feedbacks on the low-level aspects of 18,973 images. They then used GPT to automatically transform Q-Pathway into Q-Instruct, a dataset of 200K instruction-response pairs suitable for instruction tuning. Fine-tuning MLLMs on Q-Instruct significantly improves their performance on low-level visual question answering (up to 17% improvement on distortion-related questions). Q-Instruct enhances the ability of MLLMs to provide detailed descriptions of low-level visual attributes and image quality. Remarkably, text-driven instruction tuning with Q-Instruct effectively aligns MLLMs with numerical image quality assessment, exhibiting strong generalization even to unseen image types. While improving low-level visual abilities, fine-tuning with Q-Instruct might compromise performance on general-purpose or reasoning-intensive tasks. Despite the improvement, Q-Instruct tuned models still fall short of human performance and may require further development to fully replace human judgment on low-level visual tasks. multi-modality large language models, low-level vision, instruction tuning, image quality assessment, visual question answering
2311.06612 Report PerceptionGPT: Effectively Fusing Visual Perception into LLM Renjie Pi, Lewei Yao, Jiahui Gao, Jipeng Zhang, Tong Zhang The integration of visual inputs with large language models (LLMs) has led to remarkable advancements in multi-modal capabilities, giving rise to visual large language models (VLLMs). However, effectively harnessing VLLMs for intricate visual perception tasks remains a challenge. In this paper, we present a novel end-to-end framework named PerceptionGPT, which efficiently and effectively equips VLLMs with visual perception abilities by leveraging the representation power of LLMs' token embeddings. Our proposed method treats the token embedding of the LLM as the carrier of spatial information, then leverages lightweight visual task encoders and decoders to perform visual perception tasks (e.g., detection, segmentation). Our approach significantly alleviates the training difficulty suffered by previous approaches that formulate the visual outputs as discrete tokens, and enables achieving superior performance with fewer trainable parameters, less training data, and shorter training time. Moreover, as only one token embedding is required to decode the visual outputs, the resulting sequence length during inference is significantly reduced. Consequently, our approach enables accurate and flexible representations, seamless integration of visual perception tasks, and efficient handling of multiple visual outputs. We validate the effectiveness and efficiency of our approach through extensive experiments. The results demonstrate significant improvements over previous methods with much fewer trainable parameters and GPU hours, which facilitates future research in enabling LLMs with visual perception abilities. This paper introduces PerceptionGPT, a novel framework for efficiently training perception-enhanced vision language models (P-VLMs) by leveraging the representation power of the LLM's token embedding. Existing methods for integrating visual perception into VLLMs face challenges such as training difficulty, quantization errors from discrete token representation, and increased context length. PerceptionGPT addresses these limitations. PerceptionGPT utilizes lightweight visual task encoders and decoders to represent visual perception signals (bounding boxes, segmentation masks) within the LLM's token embedding space, eliminating the need for discrete tokenization. PerceptionGPT achieves state-of-the-art performance on referring expression comprehension and segmentation tasks with only parameter-efficient tuning. The method significantly reduces training difficulty, enabling good performance even with a small fraction of tunable parameters. By representing perception signals with a single token embedding, PerceptionGPT accelerates decoding speed, especially for complex information like segmentation masks. The paper primarily focuses on object detection and segmentation, with potential for incorporating other perception tasks. Further exploration of model scaling and its impact on performance is an area for future research. vision language model, visual perception, large language model, token embedding, multi-modal learning
2311.06243 Report Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization Weiyang Liu, Zeju Qiu, Yao Feng, Yuliang Xiu, Yuxuan Xue, Longhui Yu, Haiwen Feng, Zhen Liu, Juyeon Heo, Songyou Peng, Yandong Wen, Michael J. Black, Adrian Weller, Bernhard Schölkopf Large foundation models are becoming ubiquitous, but training them from scratch is prohibitively expensive. Thus, efficiently adapting these powerful models to downstream tasks is increasingly important. In this paper, we study a principled finetuning paradigm -- Orthogonal Finetuning (OFT) -- for downstream task adaptation. Despite demonstrating good generalizability, OFT still uses a fairly large number of trainable parameters due to the high dimensionality of orthogonal matrices. To address this, we start by examining OFT from an information transmission perspective, and then identify a few key desiderata that enable better parameter-efficiency. Inspired by how the Cooley-Tukey fast Fourier transform algorithm enables efficient information transmission, we propose an efficient orthogonal parameterization using butterfly structures. We apply this parameterization to OFT, creating a novel parameter-efficient finetuning method, called Orthogonal Butterfly (BOFT). By subsuming OFT as a special case, BOFT introduces a generalized orthogonal finetuning framework. Finally, we conduct an extensive empirical study of adapting large vision transformers, large language models, and text-to-image diffusion models to various downstream tasks in vision and language. This paper proposes Orthogonal Butterfly (BOFT), a parameter-efficient finetuning method for foundation models that leverages butterfly structures to create dense orthogonal matrices for weight updates. Efficiently adapting large foundation models to downstream tasks is crucial, and BOFT offers a principled approach to finetuning that improves upon existing methods like Orthogonal Finetuning (OFT) and LoRA. BOFT parameterizes a dense orthogonal matrix as a product of multiple sparse orthogonal matrices, inspired by the butterfly structures used in the fast Fourier transform algorithm. This allows for a significant reduction in trainable parameters while maintaining expressiveness and stability. BOFT consistently outperforms LoRA and OFT in terms of accuracy and parameter efficiency across various tasks, including natural language understanding, mathematical reasoning, image classification, high-quality segmentation, and controllable text-to-image generation. The butterfly structure in BOFT introduces a beneficial inductive bias for generalization, as evidenced by its superior performance compared to OFT with the same effective block size. BOFT enables smooth weight interpolation by gradually setting trained orthogonal butterfly components to identity matrices, highlighting its ability to preserve semantic information and explore a favorable weight space. BOFT introduces a slight training runtime overhead compared to OFT due to the multiplication of multiple orthogonal matrices. The optimality of the butterfly structure for information transmission in this context remains an open question, and exploring other network topologies could potentially yield further improvements. parameter-efficient finetuning, foundation models, orthogonal matrices, butterfly structures, information transmission
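The butterfly idea in the BOFT entry above can be illustrated with a toy construction: a dense orthogonal matrix assembled from log2(d) sparse factors built out of 2x2 rotations. BOFT itself uses Cayley-parameterized block factors, so treat this purely as a sketch of the butterfly structure and not the paper's exact parameterization.

```python
import math
import torch

def butterfly_orthogonal(thetas: torch.Tensor, d: int) -> torch.Tensor:
    """Compose a dense d x d orthogonal matrix from log2(d) sparse butterfly
    factors of 2x2 rotations. `thetas` has shape (log2(d), d // 2) and would be
    the trainable parameters; the pretrained weight stays frozen."""
    assert d & (d - 1) == 0, "this sketch assumes d is a power of two"
    m = int(math.log2(d))
    R = torch.eye(d)
    for level in range(m):
        stride = 2 ** level
        F = torch.eye(d)
        k = 0
        for start in range(0, d, 2 * stride):
            for off in range(stride):
                i, j = start + off, start + off + stride
                c, s = torch.cos(thetas[level, k]), torch.sin(thetas[level, k])
                F[i, i], F[i, j] = c, -s
                F[j, i], F[j, j] = s, c
                k += 1
        R = F @ R              # product of orthogonal factors stays orthogonal
    return R

# Finetuning would multiply a frozen pretrained weight W by R: W_adapted = R @ W.
d = 8
thetas = 0.1 * torch.randn(int(math.log2(d)), d // 2)   # trainable in practice
R = butterfly_orthogonal(thetas, d)
assert torch.allclose(R @ R.T, torch.eye(d), atol=1e-5)
```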
2311.05770 Report PolyMaX: General Dense Prediction with Mask Transformer Xuan Yang, Liangzhe Yuan, Kimberly Wilber, Astuti Sharma, Xiuye Gu, Siyuan Qiao, Stephanie Debats, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Liang-Chieh Chen Dense prediction tasks, such as semantic segmentation, depth estimation, and surface normal prediction, can be easily formulated as per-pixel classification (discrete outputs) or regression (continuous outputs). This per-pixel prediction paradigm has remained popular due to the prevalence of fully convolutional networks. However, on the recent frontier of segmentation task, the community has been witnessing a shift of paradigm from per-pixel prediction to cluster-prediction with the emergence of transformer architectures, particularly the mask transformers, which directly predicts a label for a mask instead of a pixel. Despite this shift, methods based on the per-pixel prediction paradigm still dominate the benchmarks on the other dense prediction tasks that require continuous outputs, such as depth estimation and surface normal prediction. Motivated by the success of DORN and AdaBins in depth estimation, achieved by discretizing the continuous output space, we propose to generalize the cluster-prediction based method to general dense prediction tasks. This allows us to unify dense prediction tasks with the mask transformer framework. Remarkably, the resulting model PolyMaX demonstrates state-of-the-art performance on three benchmarks of NYUD-v2 dataset. We hope our simple yet effective design can inspire more research on exploiting mask transformers for more dense prediction tasks. Code and model will be made available. Proposes PolyMaX, a novel mask transformer framework that unifies various dense prediction tasks such as semantic segmentation, depth estimation, and surface normal prediction using a cluster-prediction paradigm. Addresses the limitations of existing dense prediction models that struggle to generalize across tasks with different output domains (discrete vs. continuous) by introducing a unified architecture based on mask transformers. Extends the cluster-prediction approach used in semantic segmentation to continuous domains by discretizing the output space into learnable clusters. Employs a mask transformer to learn cluster centers and their corresponding probability distribution maps, which are then linearly combined to generate the final predictions. Achieves state-of-the-art performance on NYUD-v2 dataset for semantic segmentation, depth estimation, and surface normal prediction, outperforming existing methods by a significant margin. Demonstrates superior scalability compared to conventional per-pixel prediction methods when pretrained on larger datasets. Provides high-quality pseudo-labels for semantic segmentation on Taskonomy dataset, facilitating future research in multi-task dense prediction. Despite achieving high performance, the model still exhibits limitations in handling transparent and reflective surfaces in depth and surface normal prediction. Future work may explore better loss functions to address the issue of over-smoothness observed in depth and surface normal predictions. dense prediction, mask transformer, cluster-prediction, semantic segmentation, depth estimation, surface normal prediction
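The cluster-prediction step described in the PolyMaX entry above reduces to a soft assignment over learned clusters followed by a linear combination with per-cluster values. Below is a minimal sketch for depth; the tensor shapes and names are assumptions for illustration only.

```python
import torch

def clusters_to_depth(mask_logits: torch.Tensor, cluster_centers: torch.Tensor) -> torch.Tensor:
    """Turn mask-transformer outputs into a continuous depth map.

    mask_logits:     (B, K, H, W)  per-pixel logits over K learned clusters
    cluster_centers: (B, K)        a depth value predicted for each cluster
    """
    probs = mask_logits.softmax(dim=1)                                   # soft assignment per pixel
    depth = (probs * cluster_centers[:, :, None, None]).sum(dim=1)       # (B, H, W)
    return depth

depth = clusters_to_depth(torch.randn(2, 32, 60, 80), torch.rand(2, 32) * 10.0)
```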
2311.05613 Report Window Attention is Bugged: How not to Interpolate Position Embeddings Daniel Bolya, Chaitanya Ryali, Judy Hoffman, Christoph Feichtenhofer Window attention, position embeddings, and high resolution finetuning are core concepts in the modern transformer era of computer vision. However, we find that naively combining these near ubiquitous components can have a detrimental effect on performance. The issue is simple: interpolating position embeddings while using window attention is wrong. We study two state-of-the-art methods that have these three components, namely Hiera and ViTDet, and find that both do indeed suffer from this bug. To fix it, we introduce a simple absolute window position embedding strategy, which solves the bug outright in Hiera and allows us to increase both speed and performance of the model in ViTDet. We finally combine the two to obtain HieraDet, which achieves 61.7 box mAP on COCO, making it state-of-the-art for models that only use ImageNet-1k pretraining. This all stems from what is essentially a 3 line bug fix, which we name "absolute win". This paper identifies a bug that occurs when interpolating absolute position embeddings in models using window attention, particularly during high-resolution fine-tuning. The bug negatively impacts performance in tasks like image recognition and object detection, hindering the effectiveness of high-resolution fine-tuning in vision transformers. The authors analyze the interaction between window attention and position embeddings, demonstrating the misalignment caused by naive interpolation. They propose "absolute win", a method separating position embeddings into window and global embeddings, enabling correct interpolation. Absolute win significantly improves image recognition accuracy when fine-tuning at higher resolutions, outperforming baselines like Swin and MViTv2. In object detection, absolute win boosts performance in both ViTDet and HieraDet, achieving state-of-the-art results with ImageNet-1k pretraining. The method also increases inference speed by mitigating the need for computationally expensive relative position embeddings. The study primarily focuses on Hiera and ViTDet, leaving the exploration of absolute win's impact on other architectures for future work. Further investigation into the optimal strategies for training fully supervised transformers and closing the performance gap with MAE pre-trained models is needed. vision transformers, position embeddings, window attention, high-resolution fine-tuning, object detection
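A minimal sketch of the "absolute win" fix described above: keep a window-local embedding that is tiled (never interpolated) and a global embedding that is interpolated to the new resolution, then add the two. The grid sizes, channel count, and bicubic interpolation are illustrative assumptions, not the authors' exact code.

```python
import torch
import torch.nn.functional as F

def absolute_win_pos_embed(window_embed: torch.Tensor,
                           global_embed: torch.Tensor,
                           out_size: int, window_size: int) -> torch.Tensor:
    """window_embed: (1, C, w, w) per-window positions, tiled across windows.
    global_embed:    (1, C, g, g) coarse global positions, interpolated."""
    tiles = out_size // window_size
    tiled = window_embed.repeat(1, 1, tiles, tiles)              # (1, C, out, out)
    glob = F.interpolate(global_embed, size=(out_size, out_size),
                         mode="bicubic", align_corners=False)
    return tiled + glob

pos = absolute_win_pos_embed(torch.randn(1, 96, 8, 8), torch.randn(1, 96, 7, 7),
                             out_size=64, window_size=8)          # (1, 96, 64, 64)
```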
2311.05556 Report LCM-LoRA: A Universal Stable-Diffusion Acceleration Module Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick von Platen, Apolinário Passos, Longbo Huang, Jian Li, Hang Zhao Latent Consistency Models (LCMs) have achieved impressive performance in accelerating text-to-image generative tasks, producing high-quality images with minimal inference steps. LCMs are distilled from pre-trained latent diffusion models (LDMs), requiring only ~32 A100 GPU training hours. This report further extends LCMs' potential in two aspects: First, by applying LoRA distillation to Stable-Diffusion models including SD-V1.5, SSD-1B, and SDXL, we have expanded LCM's scope to larger models with significantly less memory consumption, achieving superior image generation quality. Second, we identify the LoRA parameters obtained through LCM distillation as a universal Stable-Diffusion acceleration module, named LCM-LoRA. LCM-LoRA can be directly plugged into various Stable-Diffusion fine-tuned models or LoRAs without training, thus representing a universally applicable accelerator for diverse image generation tasks. Compared with previous numerical PF-ODE solvers such as DDIM, DPM-Solver, LCM-LoRA can be viewed as a plug-in neural PF-ODE solver that possesses strong generalization abilities. Project page: https://github.com/luosiallen/latent-consistency-model. This work introduces LCM-LoRA, a universal training-free acceleration module for Stable-Diffusion (SD) that acts as an independent neural network-based solver module to predict the solution of Probability Flow ODE (PF-ODE), enabling fast inference with minimal steps on various fine-tuned SD models and LoRAs. Current open-source models and acceleration techniques have yet to achieve real-time generation on standard consumer GPUs, highlighting the need for a balance between speed and quality in LDM-generated imagery. The work extends Latent Consistency Models (LCMs) by: (1) applying LoRA distillation to Stable-Diffusion models (SD-V1.5, SSD-1B, and SDXL) to reduce memory consumption and achieve superior image generation quality, and (2) identifying LoRA parameters from LCM distillation as a universal SD acceleration module (LCM-LoRA), which can be directly plugged into fine-tuned SD models or LoRAs without training. LCM-LoRA significantly reduces memory requirements during training. LCD paradigm effectively scales to larger models like SDXL and SSD-1B. LCM-LoRA demonstrates robust generalization capabilities, achieving fast inference with minimal steps when combined with other fine-tuned SD models and LoRAs. Further investigation is needed to fully understand the impact of combining LCM-LoRA with LoRA parameters from various datasets. Exploration of different linear combination strategies for acceleration and style vectors may further improve performance. stable diffusion, latent consistency models, image generation, model acceleration, lora
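Plugging LCM-LoRA into an existing Stable Diffusion pipeline is typically a few lines with Hugging Face diffusers; the sketch below assumes a recent diffusers release with LCMScheduler and LoRA-loading support, a CUDA device, and the public SDXL and LCM-LoRA checkpoints named in the code.

```python
import torch
from diffusers import DiffusionPipeline, LCMScheduler

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Swap in the LCM solver and plug the LCM-LoRA acceleration module into the UNet.
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")

# 4 steps instead of the usual 25-50; LCM-LoRA expects a low guidance scale.
image = pipe("a photo of an astronaut riding a horse",
             num_inference_steps=4, guidance_scale=1.0).images[0]
image.save("lcm_lora_sample.png")
```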
2311.05463 Report ControlStyle: Text-Driven Stylized Image Generation Using Diffusion Priors Jingwen Chen, Yingwei Pan, Ting Yao, Tao Mei Recently, the multimedia community has witnessed the rise of diffusion models trained on large-scale multi-modal data for visual content creation, particularly in the field of text-to-image generation. In this paper, we propose a new task for ``stylizing'' text-to-image models, namely text-driven stylized image generation, that further enhances editability in content creation. Given input text prompt and style image, this task aims to produce stylized images which are both semantically relevant to input text prompt and meanwhile aligned with the style image in style. To achieve this, we present a new diffusion model (ControlStyle) via upgrading a pre-trained text-to-image model with a trainable modulation network enabling more conditions of text prompts and style images. Moreover, diffusion style and content regularizations are simultaneously introduced to facilitate the learning of this modulation network with these diffusion priors, pursuing high-quality stylized text-to-image generation. Extensive experiments demonstrate the effectiveness of our ControlStyle in producing more visually pleasing and artistic results, surpassing a simple combination of text-to-image model and conventional style transfer techniques. This paper introduces a new diffusion model, ControlStyle, for text-driven stylized image generation, which allows users to create images that match both a given text prompt and the style of a given image. The task enhances editability in visual content creation, moving beyond existing methods that require a content image or struggle with accurate style descriptions. ControlStyle builds on a pre-trained text-to-image diffusion model, adding a trainable modulation network that incorporates style image information and utilizes diffusion style and content regularizations to maintain image structure and style consistency. ControlStyle produces higher quality results, with better visual appeal and style alignment, compared to cascaded text-to-image and style transfer methods, as per quantitative metrics and user study. The use of diffusion regularizations, leveraging image priors from the diffusion model's auto-encoder, proves more effective than perceptual loss, resulting in fewer artifacts. ControlStyle demonstrates strong generalizability by effectively adapting to styles not present in its training dataset. The selection of features from the upsampling blocks for diffusion regularizations requires careful consideration to avoid performance degradation. Further exploration of combining ControlStyle with other conditional control models like ControlNet could lead to even more powerful and interesting applications. diffusion models, text-to-image generation, style transfer, stylized image generation, content creation
2311.04498 Report NExT-Chat: An LMM for Chat, Detection and Segmentation Ao Zhang, Yuan Yao, Wei Ji, Zhiyuan Liu, Tat-Seng Chua The development of large language models (LLMs) has greatly advanced the field of multimodal understanding, leading to the emergence of large multimodal models (LMMs). In order to enhance the level of visual comprehension, recent studies have equipped LMMs with region-level understanding capabilities by representing object bounding box coordinates as a series of text sequences (pix2seq). In this paper, we introduce a novel paradigm for object location modeling called the pix2emb method, where we ask the LMM to output location embeddings and then decode them with different decoders. This paradigm allows us to use different location formats (such as bounding boxes and masks) in multimodal conversations. Leveraging the proposed pix2emb method, we train an LMM named NExT-Chat and demonstrate its capability of handling multiple tasks like visual grounding, region captioning, and grounded reasoning. Comprehensive experiments show the effectiveness of our NExT-Chat on various tasks, e.g., NExT-Chat (87.7) vs. Shikra (86.9) on POPE-Random, NExT-Chat (68.9) vs. LISA (67.9) on the referring expression segmentation task, and NExT-Chat (79.6) vs. Kosmos-2 (62.3) on the region caption task. The code and model are released at https://github.com/NExT-ChatV/NExT-Chat. This paper introduces pix2emb, a novel paradigm for object location modeling in large multimodal models (LMMs) that utilizes embeddings to accommodate different location formats like bounding boxes and segmentation masks. Existing LMMs often rely on pix2seq, which is limited to discrete coordinate outputs and struggles with fine-grained formats like masks. Pix2emb addresses these limitations by enabling flexible output formats and leveraging established localization practices. Pix2emb introduces two special tokens: one that triggers localization and one that serves as a placeholder for location embeddings. This allows LMMs to predict various location formats and utilize existing practices like L1, IoU, and GIoU loss functions. The authors train NExT-Chat, an LMM based on pix2emb, using a three-stage process: pre-training, instruction tuning, and segmentation training. NExT-Chat achieves state-of-the-art results on the POPE benchmark for image hallucination diagnosis. It outperforms existing methods in referring expression segmentation, showing superior cIoU scores on RefCOCO, RefCOCO+, and RefCOCOg datasets. NExT-Chat exhibits strong performance in region captioning, surpassing baselines like Kosmos-2 in CIDEr score on RefCOCOg. NExT-Chat is primarily trained on single image inputs, limiting its ability to handle multiple images. Lack of diverse training data hinders its performance in specialized domains like medical or satellite imagery. large multimodal models, object location modeling, pix2emb, visual grounding, region captioning
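To illustrate the pix2emb idea above: the hidden state at a single location-placeholder token can be decoded into a normalized box by a small MLP, so standard detection losses (L1/GIoU) apply directly. The class name and sizes below are assumptions for illustration, not the released NExT-Chat code.

```python
import torch
import torch.nn as nn

class BoxDecoder(nn.Module):
    """Decode one location-token embedding into a normalized box (cx, cy, w, h)."""
    def __init__(self, llm_dim: int = 4096, hidden: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(llm_dim, hidden), nn.GELU(),
            nn.Linear(hidden, 4), nn.Sigmoid(),   # keep coordinates in [0, 1]
        )

    def forward(self, loc_token_hidden: torch.Tensor) -> torch.Tensor:
        # loc_token_hidden: (B, llm_dim) hidden state at the location placeholder
        return self.mlp(loc_token_hidden)

boxes = BoxDecoder()(torch.randn(2, 4096))        # (2, 4) normalized boxes
```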
2311.04391 Report 3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features Chenfeng Xu, Huan Ling, Sanja Fidler, Or Litany We present 3DiffTection, a state-of-the-art method for 3D object detection from single images, leveraging features from a 3D-aware diffusion model. Annotating large-scale image data for 3D detection is resource-intensive and time-consuming. Recently, pretrained large image diffusion models have become prominent as effective feature extractors for 2D perception tasks. However, these features are initially trained on paired text and image data, which are not optimized for 3D tasks, and often exhibit a domain gap when applied to the target data. Our approach bridges these gaps through two specialized tuning strategies: geometric and semantic. For geometric tuning, we fine-tune a diffusion model to perform novel view synthesis conditioned on a single image, by introducing a novel epipolar warp operator. This task meets two essential criteria: the necessity for 3D awareness and reliance solely on posed image data, which are readily available (e.g., from videos) and does not require manual annotation. For semantic refinement, we further train the model on target data with detection supervision. Both tuning phases employ ControlNet to preserve the integrity of the original feature capabilities. In the final step, we harness these enhanced capabilities to conduct a test-time prediction ensemble across multiple virtual viewpoints. Through our methodology, we obtain 3D-aware features that are tailored for 3D detection and excel in identifying cross-view point correspondences. Consequently, our model emerges as a powerful 3D detector, substantially surpassing previous benchmarks, e.g., Cube-RCNN, a precedent in single-view 3D detection by 9.43\% in AP3D on the Omni3D-ARkitscene dataset. Furthermore, 3DiffTection showcases robust data efficiency and generalization to cross-domain data. This paper presents 3DiffTection, a novel method for single-image 3D object detection by leveraging pre-trained 2D diffusion models and enhancing their 3D awareness. Annotating data for 3D object detection is expensive and time-consuming. 3DiffTection addresses this challenge by leveraging readily available pre-trained 2D diffusion models and enhancing their capabilities for 3D tasks. The method uses two ControlNets: a geometric ControlNet trained for novel view synthesis using epipolar warping to instill 3D awareness, and a semantic ControlNet jointly trained with a 3D detection head for task-specific adaptation. It further enhances detection by ensembling predictions across virtually generated views. 3DiffTection significantly outperforms previous state-of-the-art methods on the Omni3D-ARKitScenes dataset for single-view 3D object detection. The method shows strong data efficiency, achieving superior performance with significantly less training data than competing approaches. 3DiffTection exhibits strong generalization to cross-domain data, effectively transferring learned 3D awareness to new datasets. The method currently relies on accurate camera pose information, which can be challenging to obtain for in-the-wild video data. The use of Stable Diffusion architecture leads to high memory and runtime demands, limiting its applicability in real-time settings and requiring further optimization. 3d object detection, diffusion models, novel view synthesis, controlnet, data efficiency
2311.04315 Report A Data Perspective on Enhanced Identity Preservation for Diffusion Personalization Xingzhe He, Zhiwen Cao, Nicholas Kolkin, Lantao Yu, Kun Wan, Helge Rhodin, Ratheesh Kalarot Large text-to-image models have revolutionized the ability to generate imagery using natural language. However, particularly unique or personal visual concepts, such as pets and furniture, will not be captured by the original model. This has led to interest in how to personalize a text-to-image model. Despite significant progress, this task remains a formidable challenge, particularly in preserving the subject's identity. Most researchers attempt to address this issue by modifying model architectures. These methods are capable of keeping the subject structure and color but fail to preserve identity details. Towards this issue, our approach takes a data-centric perspective. We introduce a novel regularization dataset generation strategy on both the text and image level. This strategy enables the model to preserve fine details of the desired subjects, such as text and logos. Our method is architecture-agnostic and can be flexibly applied on various text-to-image models. We show on established benchmarks that our data-centric approach forms the new state of the art in terms of identity preservation and text alignment. This paper introduces a data-centric approach for enhancing identity preservation in diffusion-based text-to-image personalization, addressing overfitting issues observed in prior methods. Existing methods for personalizing text-to-image models struggle to maintain subject identity and often overfit to training data, leading to reduced quality and diversity in generated images. The proposed method generates a structured regularization dataset using formatted prompts that describe both foreground and background elements. This regularization dataset, combined with enhanced training prompts, improves the model's ability to learn personalized concepts without overfitting. The method demonstrates superior subject identity preservation, retaining intricate details like logos and textures. It exhibits improved text alignment, generating images that are more faithful to the input text prompts. The approach is effective with both inanimate objects and living entities, showing adaptability across diverse subject types. Generating the regularization dataset adds computational overhead to the training process. The current approach assumes manual annotation of training images, which could be automated in future work. text-to-image generation, diffusion models, personalization, identity preservation, regularization dataset
2311.04287 Report Holistic Evaluation of Text-To-Image Models Tony Lee, Michihiro Yasunaga, Chenlin Meng, Yifan Mai, Joon Sung Park, Agrim Gupta, Yunzhi Zhang, Deepak Narayanan, Hannah Benita Teufel, Marco Bellagente, Minguk Kang, Taesung Park, Jure Leskovec, Jun-Yan Zhu, Li Fei-Fei, Jiajun Wu, Stefano Ermon, Percy Liang The stunning qualitative improvement of recent text-to-image models has led to their widespread attention and adoption. However, we lack a comprehensive quantitative understanding of their capabilities and risks. To fill this gap, we introduce a new benchmark, Holistic Evaluation of Text-to-Image Models (HEIM). Whereas previous evaluations focus mostly on text-image alignment and image quality, we identify 12 aspects, including text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency. We curate 62 scenarios encompassing these aspects and evaluate 26 state-of-the-art text-to-image models on this benchmark. Our results reveal that no single model excels in all aspects, with different models demonstrating different strengths. We release the generated images and human evaluation results for full transparency at https://crfm.stanford.edu/heim/v1.1.0 and the code at https://github.com/stanford-crfm/helm, which is integrated with the HELM codebase. The paper introduces HEIM, a novel benchmark for evaluating text-to-image models across 12 aspects, including aesthetics, originality, bias, and toxicity, addressing the limitations of existing benchmarks that primarily focus on image quality and text-image alignment. Existing benchmarks for text-to-image models lack comprehensiveness, often overlooking crucial aspects like originality, aesthetics, bias, and toxicity. HEIM aims to fill this gap by providing a holistic evaluation framework. HEIM evaluates 26 text-to-image models on 24 scenarios using both automated metrics (e.g., CLIPScore, FID) and human evaluation to provide a comprehensive assessment across the 12 identified aspects. No single model excels in all aspects, highlighting the need for models with balanced capabilities. Weak correlations between human and automated metrics, particularly for photorealism and aesthetics, underscore the importance of human evaluation. Most models show poor performance in reasoning and multilinguality, indicating areas needing further research. The 12 identified aspects may not be exhaustive and could be expanded in future work. The reliance on crowdsourced human evaluation, while valuable, has limitations, particularly for subjective aspects like aesthetics and originality. benchmark, text-to-image generation, evaluation, bias, toxicity
2311.04257 Report mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou Multi-modal Large Language Models (MLLMs) have demonstrated impressive instruction abilities across various open-ended tasks. However, previous methods primarily focus on enhancing multi-modal capabilities. In this work, we introduce a versatile multi-modal large language model, mPLUG-Owl2, which effectively leverages modality collaboration to improve performance in both text and multi-modal tasks. mPLUG-Owl2 utilizes a modularized network design, with the language decoder acting as a universal interface for managing different modalities. Specifically, mPLUG-Owl2 incorporates shared functional modules to facilitate modality collaboration and introduces a modality-adaptive module that preserves modality-specific features. Extensive experiments reveal that mPLUG-Owl2 is capable of generalizing both text tasks and multi-modal tasks and achieving state-of-the-art performances with a single generic model. Notably, mPLUG-Owl2 is the first MLLM model that demonstrates the modality collaboration phenomenon in both pure-text and multi-modal scenarios, setting a pioneering path in the development of future multi-modal foundation models. Presents mPLUG-Owl2, a multi-modal large language model that leverages modality collaboration to enhance performance in both text and multi-modal tasks. Existing multi-modal LLMs struggle to balance the benefits of cross-modal interaction with the risk of modality interference, limiting their ability to excel in both text and multi-modal tasks simultaneously. Introduces a modularized network design with a modality-adaptive module to facilitate cross-modal interaction while preserving modality-specific features. Employs a two-stage training paradigm consisting of vision-language pre-training and joint vision-language instruction tuning. mPLUG-Owl2 achieves state-of-the-art performance on 8 classic vision-language benchmarks and ranks first or second on 5 recent zero-shot multi-modal benchmarks. mPLUG-Owl2 demonstrates state-of-the-art results on multiple pure-text benchmarks, highlighting the benefits of modality collaboration for enhancing text-based capabilities. Analysis confirms the positive impact of modality collaboration, especially in improving text-based understanding, knowledge, and reasoning abilities. Limited number of test samples in certain benchmarks (e.g., MME) may lead to performance fluctuations. Despite efforts to mitigate bias, the model may still inherit some biases from the pre-trained LLM and web-sourced data. multi-modal large language models, modality collaboration, modality-adaptive module, joint vision-language instruction tuning, zero-shot learning
2311.04251 Report MixtureGrowth: Growing Neural Networks by Recombining Learned Parameters Chau Pham, Piotr Teterwak, Soren Nelson, Bryan A. Plummer Most deep neural networks are trained under fixed network architectures and require retraining when the architecture changes. If expanding the network's size is needed, it is necessary to retrain from scratch, which is expensive. To avoid this, one can grow from a small network by adding random weights over time to gradually achieve the target network size. However, this naive approach falls short in practice as it brings too much noise to the growing process. Prior work tackled this issue by leveraging the already learned weights and training data for generating new weights through conducting a computationally expensive analysis step. In this paper, we introduce MixtureGrowth, a new approach to growing networks that circumvents the initialization overhead in prior work. Before growing, each layer in our model is generated with a linear combination of parameter templates. Newly grown layer weights are generated by using a new linear combination of existing templates for a layer. On one hand, these templates are already trained for the task, providing a strong initialization. On the other, the new coefficients provide flexibility for the added layer weights to learn something new. We show that our approach boosts top-1 accuracy over the state-of-the-art by 2-2.5% on CIFAR-100 and ImageNet datasets, while achieving comparable performance with fewer FLOPs to a larger network trained from scratch. Code is available at https://github.com/chaudatascience/mixturegrowth. Introduces MixtureGrowth, a novel method for growing neural networks by reusing and recombining learned parameter templates from smaller, pre-trained networks. Reduces the computational cost of training large neural networks by leveraging knowledge from smaller models and avoiding expensive weight initialization procedures used in prior work. Trains two small networks with shared parameter templates, fuses them into a larger network, initializes new weights by learning new linear combinations of existing templates, and fine-tunes the entire network. Achieves 2-2.5% higher top-1 accuracy than state-of-the-art growing methods on CIFAR-100 and ImageNet. Outperforms target network accuracy on CIFAR-100 with half the FLOPs. Demonstrates robustness to growth point and benefits from fusing two small models over growing from a single one. Limited exploration of growth beyond doubling network size. Further investigation into the relationship between template diversity and growth performance is needed. neural network growing, template mixing, parameter sharing, model fusion, computational efficiency
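The template-recombination step in the MixtureGrowth entry above can be sketched as layers whose weights are linear combinations of shared parameter templates; growing the network then means adding layers with fresh mixing coefficients over templates that are already trained. The class and sizes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TemplateMixLinear(nn.Module):
    """A linear layer whose weight is a learned mixture of shared templates."""
    def __init__(self, templates: nn.Parameter):
        super().__init__()
        self.templates = templates                                     # (T, out, in), shared across layers
        self.coeffs = nn.Parameter(torch.randn(templates.shape[0]) / templates.shape[0])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = torch.einsum("t,toi->oi", self.coeffs, self.templates)     # mix templates into one weight
        return nn.functional.linear(x, w)

# Templates would come from the two small pre-trained networks; a newly grown
# layer reuses them and only needs new mixing coefficients as a strong init.
templates = nn.Parameter(torch.randn(4, 64, 64) * 0.02)
old_layer = TemplateMixLinear(templates)      # existing layer
new_layer = TemplateMixLinear(templates)      # grown layer: fresh coefficients only
y = new_layer(old_layer(torch.randn(8, 64)))  # (8, 64)
```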
2311.04246 Report ADFactory: An Effective Framework for Generalizing Optical Flow with Nerf Han Ling A significant challenge facing current optical flow methods is the difficulty in generalizing them well to the real world. This is mainly due to the high cost of hand-crafted datasets, and existing self-supervised methods are limited by indirect loss and occlusions, resulting in fuzzy outcomes. To address this challenge, we introduce a novel optical flow training framework: automatic data factory (ADF). ADF only requires RGB images as input to effectively train the optical flow network on the target data domain. Specifically, we use advanced Nerf technology to reconstruct scenes from photo groups collected by a monocular camera, and then calculate optical flow labels between camera pose pairs based on the rendering results. To eliminate erroneous labels caused by defects in the scene reconstructed by Nerf, we screened the generated labels from multiple aspects, such as optical flow matching accuracy, radiation field confidence, and depth consistency. The filtered labels can be directly used for network supervision. Experimentally, the generalization ability of ADF on KITTI surpasses existing self-supervised optical flow and monocular scene flow algorithms. In addition, ADF achieves impressive results in real-world zero-point generalization evaluations and surpasses most supervised methods. This paper proposes Automated Data Factory (ADF), a novel optical flow training framework utilizing scenes generated by Neural Radiance Fields (NeRF) to train deep optical flow networks. ADF addresses the challenge of generalizing optical flow methods to real-world scenarios by providing a cost-effective way to generate large-scale, high-quality optical flow training data without manual annotation or expensive equipment. ADF uses NeRF to reconstruct scenes from monocular camera images, generates optical flow labels between rendered camera poses, and employs data filtering techniques like structural similarity, radiation field confidence, and depth consistency to ensure label accuracy. ADF-trained models demonstrate superior zero-shot generalization on real-world optical flow estimation compared to existing self-supervised and supervised methods. The proposed data filtering mechanism significantly improves the performance of trained optical flow models. ADF proves effective in training both traditional optical flow networks like RAFT and more advanced normalized scene flow models like Scale-flow. ADF currently relies on static scenes due to limitations of NeRF, hindering its application in dynamic scenarios like KITTI raw data. Further research is needed to bridge the gap between the optical flow generated by ADF (closer to light flow) and the object flow typically used in supervised learning. optical flow, neural radiance fields, self-supervised learning, zero-shot generalization, data generation
2311.04219 Report OtterHD: A High-Resolution Multi-modality Model Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, Ziwei Liu In this paper, we present OtterHD-8B, an innovative multimodal model evolved from Fuyu-8B, specifically engineered to interpret high-resolution visual inputs with granular precision. Unlike conventional models that are constrained by fixed-size vision encoders, OtterHD-8B boasts the ability to handle flexible input dimensions, ensuring its versatility across various inference requirements. Alongside this model, we introduce MagnifierBench, an evaluation framework designed to scrutinize models' ability to discern minute details and spatial relationships of small objects. Our comparative analysis reveals that while current leading models falter on this benchmark, OtterHD-8B, particularly when directly processing high-resolution inputs, outperforms its counterparts by a substantial margin. The findings illuminate the structural variances in visual information processing among different models and the influence that the vision encoders' pre-training resolution disparities have on model effectiveness within such benchmarks. Our study highlights the critical role of flexibility and high-resolution input capabilities in large multimodal models and also exemplifies the potential inherent in the Fuyu architecture's simplicity for handling complex visual data. Presents OtterHD-8B, a novel multimodal model based on Fuyu-8B, designed to process high-resolution visual inputs with flexibility and introduces MagnifierBench, a benchmark to evaluate models' ability to discern minute details in large images. Addresses the limitations of conventional LMMs that rely on fixed-size vision encoders and lack fine-grained perception abilities, crucial for tasks requiring detailed visual understanding. Extends Fuyu-8B with instruction tuning to handle various resolutions up to 1024x1024 pixels and develops MagnifierBench using images from PVSG dataset with meticulously designed question-answer pairs focused on small objects. OtterHD-8B outperforms existing LMMs on MagnifierBench, demonstrating its superior fine-grained perception abilities. Increasing input resolution leads to improved performance on MagnifierBench, highlighting the importance of resolution flexibility. Dynamic resolution training further enhances OtterHD-8B's ability to generalize to unseen resolutions. Limited instruction tuning data compared to other state-of-the-art LMMs. Further exploration of image augmentation methods like random cropping is needed. multimodal learning, large language models, computer vision, fine-grained perception, benchmarking
2311.04212 Report Video Instance Matting Jiachen Li, Roberto Henschel, Vidit Goel, Marianna Ohanyan, Shant Navasardyan, Humphrey Shi Conventional video matting outputs one alpha matte for all instances appearing in a video frame so that individual instances are not distinguished. While video instance segmentation provides time-consistent instance masks, results are unsatisfactory for matting applications, especially due to applied binarization. To remedy this deficiency, we propose Video Instance Matting (VIM), that is, estimating alpha mattes of each instance at each frame of a video sequence. To tackle this challenging problem, we present MSG-VIM, a Mask Sequence Guided Video Instance Matting neural network, as a novel baseline model for VIM. MSG-VIM leverages a mixture of mask augmentations to make predictions robust to inaccurate and inconsistent mask guidance. It incorporates temporal mask and temporal feature guidance to improve the temporal consistency of alpha matte predictions. Furthermore, we build a new benchmark for VIM, called VIM50, which comprises 50 video clips with multiple human instances as foreground objects. To evaluate performances on the VIM task, we introduce a suitable metric called Video Instance-aware Matting Quality (VIMQ). Our proposed model MSG-VIM sets a strong baseline on the VIM50 benchmark and outperforms existing methods by a large margin. The project is open-sourced at https://github.com/SHI-Labs/VIM. This paper proposes Video Instance Matting (VIM), a new task focused on estimating alpha mattes for each instance within a video sequence, addressing the limitations of conventional video matting and instance segmentation. VIM enables instance-aware video editing, surpassing the limitations of traditional methods by providing separate alpha mattes for individual instances, crucial for applications like instance-selective removal. The authors introduce MSG-VIM, a Mask Sequence Guided VIM network, utilizing mask sequences from video instance segmentation as guidance. MSG-VIM employs a mixture of mask augmentations for robustness, and temporal mask and feature guidance for temporal consistency. MSG-VIM significantly outperforms existing video matting, instance segmentation, and image matting methods on the newly created VIM50 benchmark. A proposed mixture of mask augmentations successfully enhances the robustness of MSG-VIM against inaccurate mask guidance. The incorporation of temporal guidance, both for masks and features, demonstrably improves the temporal consistency of alpha matte predictions. The reliance on an external video instance segmentation model for mask guidance introduces a dependency on its accuracy. The computational demands of processing longer video chunks are constrained by memory limitations. video matting, instance segmentation, alpha matte, video editing, deep learning
2311.04145 Report I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, Jingren Zhou Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models. However, it still encounters challenges in terms of semantic accuracy, clarity and spatio-temporal continuity. They primarily arise from the scarcity of well-aligned text-video data and the complex inherent structure of videos, making it difficult for the model to simultaneously ensure semantic and qualitative excellence. In this report, we propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors and ensures the alignment of the input data by utilizing static images as a form of crucial guidance. I2VGen-XL consists of two stages: i) the base stage guarantees coherent semantics and preserves content from input images by using two hierarchical encoders, and ii) the refinement stage enhances the video's details by incorporating an additional brief text and improves the resolution to 1280×720. To improve the diversity, we collect around 35 million single-shot text-video pairs and 6 billion text-image pairs to optimize the model. By this means, I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos. Through extensive experiments, we have investigated the underlying principles of I2VGen-XL and compared it with current top methods, which can demonstrate its effectiveness on diverse data. The source code and models will be publicly available at https://i2vgen-xl.github.io. This paper introduces I2VGen-XL, a cascaded diffusion model that synthesizes high-quality videos from single images. Current video synthesis models struggle with semantic accuracy, clarity, and spatio-temporal continuity due to limited aligned text-video data and the complexity of videos. I2VGen-XL uses a two-stage approach: 1) a base stage with hierarchical encoders ensures semantic coherence and content preservation at low resolution, and 2) a refinement stage enhances details and resolution using an additional text prompt. I2VGen-XL generates videos with more realistic and diverse motions than state-of-the-art methods like Gen-2 and Pika Labs. The refinement stage significantly improves spatial details, reduces noise, and enhances spatio-temporal continuity. The model shows promising generalization ability across diverse categories like human faces, cartoons, and animals. Generating natural and diverse human body motions remains challenging. The model is currently limited to generating short, single-shot videos. video synthesis, diffusion models, image-to-video generation, cascaded diffusion, high-resolution video generation
2311.03943 Report CLIP Guided Image-perceptive Prompt Learning for Image Enhancement Weiwen Chen, Qiuhong Ke, Zinuo Li Image enhancement is a significant research area in the fields of computer vision and image processing. In recent years, many learning-based methods for image enhancement have been developed, where the Look-up-table (LUT) has proven to be an effective tool. In this paper, we delve into the potential of Contrastive Language-Image Pre-Training (CLIP) Guided Prompt Learning, proposing a simple structure called CLIP-LUT for image enhancement. We found that the prior knowledge of CLIP can effectively discern the quality of degraded images, which can provide reliable guidance. To be specific, We initially learn image-perceptive prompts to distinguish between original and target images using CLIP model, in the meanwhile, we introduce a very simple network by incorporating a simple baseline to predict the weights of three different LUT as enhancement network. The obtained prompts are used to steer the enhancement network like a loss function and improve the performance of model. We demonstrate that by simply combining a straightforward method with CLIP, we can obtain satisfactory results. This paper proposes CLIP-LUT, a novel image enhancement approach that leverages CLIP's image quality discerning ability through prompt learning to guide a simple LUT-based enhancement network. This approach bridges the gap between powerful visual-language models like CLIP and low-level image enhancement tasks, offering a new avenue for leveraging CLIP's knowledge in this domain. The method learns image-perceptive prompts using CLIP to distinguish between original and enhanced images. These prompts guide a lightweight UNet-like network predicting the weights for three different LUTs, which are then combined to enhance the input image. CLIP-LUT achieves competitive results on benchmark datasets like MIT-Adobe FiveK, HDR+, and FilmSet, outperforming several existing methods in terms of PSNR, SSIM, and color accuracy. Ablation studies confirm the efficacy of learned image-perceptive prompts in guiding the enhancement process compared to using no prompts or random prompts. The simplicity of the proposed approach highlights the potential of integrating CLIP into low-level computer vision tasks for effective performance. The paper acknowledges the preliminary stage of the research and suggests exploring more effective prompt learning techniques and loss functions. Further investigations into lightweight network architectures for enhancement and extending the approach to other low-level vision tasks are proposed. image enhancement, clip, prompt learning, look-up table (lut), visual-language models
2311.03873 Report Mini but Mighty: Finetuning ViTs with Mini Adapters Imad Eddine Marouf, Enzo Tartaglione, Stéphane Lathuilière Vision Transformers (ViTs) have become one of the dominant architectures in computer vision, and pre-trained ViT models are commonly adapted to new tasks via fine-tuning. Recent works proposed several parameter-efficient transfer learning methods, such as adapters, to avoid the prohibitive training and storage cost of finetuning. In this work, we observe that adapters perform poorly when the dimension of adapters is small, and we propose MiMi, a training framework that addresses this issue. We start with large adapters which can reach high performance, and iteratively reduce their size. To enable automatic estimation of the hidden dimension of every adapter, we also introduce a new scoring function, specifically designed for adapters, that compares the neuron importance across layers. Our method outperforms existing methods in finding the best trade-off between accuracy and trained parameters across the three dataset benchmarks DomainNet, VTAB, and Multi-task, for a total of 29 datasets. The paper proposes MiMi, an iterative training framework for Vision Transformers (ViTs) that reduces the size of adapter modules for parameter-efficient transfer learning. Fine-tuning pre-trained ViTs for new tasks is computationally and storage expensive. Adapters offer a parameter-efficient alternative, but their performance degrades with small sizes. MiMi addresses this by iteratively reducing adapter dimensions while maintaining high performance. MiMi starts with large adapters and iteratively prunes neurons based on a novel importance score that considers both down-sampling and up-sampling layers. This score enables dynamic adjustment of adapter sizes and even removal if deemed unnecessary. MiMi outperforms existing parameter-efficient transfer learning methods on 29 datasets across DomainNet, VTAB, and Multi-task benchmarks. The proposed importance score for neuron selection effectively guides adapter size reduction, leading to better performance than vanilla training. MiMi demonstrates its generalizability by achieving comparable performance to full fine-tuning with significantly fewer parameters across various ViT backbones. The paper mainly focuses on image classification tasks; further investigation is needed for other vision tasks. The influence of hyperparameter ρ (amount of neuron removal) requires further exploration for optimal performance across diverse datasets. vision transformer, parameter-efficient finetuning, adapters, pruning, transfer learning
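A simplified stand-in for the adapter scoring and shrinking loop described in the MiMi entry above: score each bottleneck neuron from its down-projection row and up-projection column, keep the highest-scoring ones, and rebuild smaller layers. The exact scoring function and schedule in the paper differ; this is only a sketch.

```python
import torch
import torch.nn as nn

def adapter_neuron_scores(down: nn.Linear, up: nn.Linear) -> torch.Tensor:
    """Score each hidden neuron of a bottleneck adapter by combining the weight
    norms of its down-projection row and up-projection column; normalization
    makes the scores comparable across layers."""
    d_norm = down.weight.norm(dim=1)     # (hidden,) one row per hidden neuron
    u_norm = up.weight.norm(dim=0)       # (hidden,) one column per hidden neuron
    scores = d_norm * u_norm
    return scores / scores.sum()

def prune_adapter(down: nn.Linear, up: nn.Linear, keep: int):
    """Keep the `keep` highest-scoring hidden neurons (one shrinking step)."""
    idx = adapter_neuron_scores(down, up).topk(keep).indices
    new_down = nn.Linear(down.in_features, keep, bias=down.bias is not None)
    new_up = nn.Linear(keep, up.out_features, bias=up.bias is not None)
    with torch.no_grad():
        new_down.weight.copy_(down.weight[idx])
        new_up.weight.copy_(up.weight[:, idx])
        if down.bias is not None:
            new_down.bias.copy_(down.bias[idx])
        if up.bias is not None:
            new_up.bias.copy_(up.bias)
    return new_down, new_up

down, up = nn.Linear(768, 64), nn.Linear(64, 768)   # a large adapter to start from
down, up = prune_adapter(down, up, keep=16)          # iteratively shrink its width
```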
2311.03830 Report Reducing Spatial Fitting Error in Distillation of Denoising Diffusion Models Shengzhe Zhou, Zejian Lee, Shengyuan Zhang, Lefan Hou, Changyuan Yang, Guang Yang, Zhiyuan Yang, Lingyun Sun Denoising Diffusion models have exhibited remarkable capabilities in image generation. However, generating high-quality samples requires a large number of iterations. Knowledge distillation for diffusion models is an effective method to address this limitation with a shortened sampling process but causes degraded generative quality. Based on our analysis with bias-variance decomposition and experimental observations, we attribute the degradation to the spatial fitting error occurring in the training of both the teacher and student model. Accordingly, we propose the Spatial Fitting-Error Reduction Distillation model (SFERD). SFERD utilizes attention guidance from the teacher model and a designed semantic gradient predictor to reduce the student's fitting error. Empirically, our proposed model facilitates high-quality sample generation in a few function evaluations. We achieve an FID of 5.31 on CIFAR-10 and 9.39 on ImageNet 64×64 with only one step, outperforming existing diffusion methods. Our study provides a new perspective on diffusion distillation by highlighting the intrinsic denoising ability of models. Project link: https://github.com/Sainzerjj/SFERD. This paper proposes SFERD, a novel Spatial Fitting-Error Reduction Distillation model for Denoising Diffusion Models, to generate high-quality images in a few function evaluations. Generating high-quality samples from Denoising Diffusion Models typically requires many iterations, leading to slow sampling speed. Distillation methods address this issue but often compromise image quality. This paper aims to improve the quality of distilled diffusion models. The paper analyzes the distillation process and identifies the fitting errors in both the teacher and student models as the root cause of quality degradation. To address this, SFERD utilizes two novel components: (1) attention guidance for the teacher model to reduce error by highlighting semantically important regions, and (2) a semantic gradient predictor for the student model to enhance training by incorporating semantic information from a learned latent space. SFERD significantly outperforms existing distillation methods (PD, CD) on CIFAR-10 and ImageNet 64x64 datasets, achieving impressive FID scores with only a few sampling steps. Notably, SFERD-CD achieves single-step FID scores of 5.31 and 9.39 on CIFAR-10 and ImageNet 64x64, respectively. Applying SFERD to fine-tune pre-trained diffusion models directly also leads to improved performance. The attention guidance method, while effective, currently relies on unsupervised learning, which can lead to challenges in promptly detecting and correcting instances of incorrect self-attention direction. The student model with the semantic gradient predictor, though offering improved performance, experiences a slight increase in inference time compared to the original student model due to the extra predictor. diffusion models, knowledge distillation, image generation, attention mechanisms, semantic gradient prediction
2311.03426 Report GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values Farnoosh Javadi, Walid Ahmed, Habib Hajimolahoseini, Foozhan Ataiefard, Mohammad Hassanpour, Saina Asani, Austin Wen, Omar Mohamed Awad, Kangling Liu, Yang Liu Massive transformer-based models face several challenges, including slow and computationally intensive pre-training and over-parametrization. This paper addresses these challenges by proposing a versatile method called GQKVA, which generalizes query, key, and value grouping techniques. GQKVA is designed to speed up transformer pre-training while reducing the model size. Our experiments with various GQKVA variants highlight a clear trade-off between performance and model size, allowing for customized choices based on resource and time limitations. Our findings also indicate that the conventional multi-head attention approach is not always the best choice, as there are lighter and faster alternatives available. We tested our method on ViT, which achieved an approximate 0.3% increase in accuracy while reducing the model size by about 4% in the task of image classification. Additionally, our most aggressive model reduction experiment resulted in a reduction of approximately 15% in model size, with only around a 1% drop in accuracy. This paper introduces GQKVA, a novel method for reducing the size and speeding up the pre-training of transformer models by generalizing query, key, and value grouping techniques. Massive transformer models suffer from slow and computationally expensive pre-training, along with over-parametrization. This method addresses these challenges by enabling faster training and smaller model sizes. The study explores various GQKVA variants by grouping queries, keys, and values within the self-attention mechanism of a ViT-small model. The variants are evaluated based on accuracy, model size, and training time per sample (TPS). GKVA, a variant of GQKVA, achieves the highest accuracy while reducing model size by 4-5% compared to standard multi-head attention (MHA). Certain GQKVA variants outperform MQA in accuracy despite having the same or fewer parameters. Results reveal a linear correlation between model size/TPS and performance, indicating a trade-off that allows for customization based on resource limits. The study is limited to evaluating GQKVA on the ViT-small model due to resource constraints. Further research should explore applying GQKVA to larger transformer models to unlock greater potential speed-ups and memory savings. transformer, pre-training, model compression, attention mechanism, gqkva
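The grouping idea generalizes multi-head attention by storing fewer distinct Q/K/V projections and sharing each group across several heads. The sketch below is one member of that family (group-sharing applied independently to Q, K, and V); the group sizes and defaults are illustrative, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class GroupedQKVAttention(nn.Module):
    """Attention with independently grouped Q, K and V projections: `heads`
    attention heads are computed, but only q_groups/k_groups/v_groups distinct
    projections are stored, each shared across heads // groups heads."""
    def __init__(self, dim, heads=8, q_groups=8, k_groups=2, v_groups=2):
        super().__init__()
        assert dim % heads == 0
        assert all(heads % g == 0 for g in (q_groups, k_groups, v_groups))
        self.h, self.dh = heads, dim // heads
        self.gq, self.gk, self.gv = q_groups, k_groups, v_groups
        self.q = nn.Linear(dim, q_groups * self.dh)
        self.k = nn.Linear(dim, k_groups * self.dh)
        self.v = nn.Linear(dim, v_groups * self.dh)
        self.o = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, _ = x.shape
        def split(t, g):  # [B, N, g*dh] -> [B, heads, N, dh], repeating each group
            t = t.view(B, N, g, self.dh).transpose(1, 2)
            return t.repeat_interleave(self.h // g, dim=1)
        q = split(self.q(x), self.gq)
        k = split(self.k(x), self.gk)
        v = split(self.v(x), self.gv)
        attn = (q @ k.transpose(-2, -1)) * self.dh ** -0.5
        out = attn.softmax(dim=-1) @ v                      # [B, heads, N, dh]
        return self.o(out.transpose(1, 2).reshape(B, N, self.h * self.dh))
```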
2311.03356 Report GLaMM: Pixel Grounding Large Multimodal Model Hanoona Rasheed, Muhammad Maaz, Sahal Shaji Mullappilly, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, Fahad S. Khan Large Multimodal Models (LMMs) extend Large Language Models to the vision domain. Initial LMMs used holistic images and text prompts to generate ungrounded textual responses. Recently, region-level LMMs have been used to generate visually grounded responses. However, they are limited to only referring to a single object category at a time, require users to specify the regions, or cannot offer dense pixel-wise object grounding. In this work, we present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks. GLaMM not only grounds objects appearing in the conversations but is flexible enough to accept both textual and optional visual prompts (region of interest) as input. This empowers users to interact with the model at various levels of granularity, both in textual and visual domains. Due to the lack of standard benchmarks for the novel setting of visually Grounded Conversation Generation (GCG), we introduce a comprehensive evaluation protocol with our curated grounded conversations. Our proposed GCG task requires densely grounded concepts in natural scenes at a large-scale. To this end, we propose a densely annotated Grounding-anything Dataset (GranD) using our proposed automated annotation pipeline that encompasses 7.5M unique concepts grounded in a total of 810M regions available with segmentation masks. Besides GCG, GLaMM also performs effectively on several downstream tasks, e.g., referring expression segmentation, image and region-level captioning and vision-language conversations. Introduces GLaMM, the first large multimodal model capable of generating natural language responses with corresponding object segmentation masks, enabling visually grounded conversations. Addresses limitations of existing LMMs that lack region-specific understanding or can't provide detailed pixel-level grounding for truly interactive visual-language tasks. Develops GLaMM with a novel architecture that combines global and region encoders, an LLM, and a pixel decoder, trained end-to-end on a new densely annotated dataset (GranD). GLaMM outperforms existing LMMs on the newly proposed Grounded Conversation Generation (GCG) task. Demonstrates strong performance on various downstream tasks such as referring expression segmentation, region-level captioning, and image captioning. Introduces GranD, a large-scale dataset with 7.5M unique concepts grounded in 810M regions, created through an automated annotation pipeline for scalable data generation. Automated annotation pipeline in GranD may introduce noise in the labels, requiring further research on noise reduction techniques. Future work includes expanding GLaMM to incorporate other modalities like video and 3D data. large multimodal models, grounded conversation generation, pixel-level grounding, dense image captioning, automated dataset annotation
2311.03355 Report SegGen: Supercharging Segmentation Models with Text2Mask and Mask2Img Synthesis Hanrong Ye, Jason Kuen, Qing Liu, Zhe Lin, Brian Price, Dan Xu We propose SegGen, a highly-effective training data generation method for image segmentation, which pushes the performance limits of state-of-the-art segmentation models to a significant extent. SegGen designs and integrates two data generation strategies: MaskSyn and ImgSyn. (i) MaskSyn synthesizes new mask-image pairs via our proposed text-to-mask generation model and mask-to-image generation model, greatly improving the diversity in segmentation masks for model supervision; (ii) ImgSyn synthesizes new images based on existing masks using the mask-to-image generation model, strongly improving image diversity for model inputs. On the highly competitive ADE20K and COCO benchmarks, our data generation method markedly improves the performance of state-of-the-art segmentation models in semantic segmentation, panoptic segmentation, and instance segmentation. Notably, in terms of the ADE20K mIoU, Mask2Former R50 is largely boosted from 47.2 to 49.9 (+2.7); Mask2Former Swin-L is also significantly increased from 56.1 to 57.4 (+1.3). These promising results strongly suggest the effectiveness of our SegGen even when abundant human-annotated training data is utilized. Moreover, training with our synthetic data makes the segmentation models more robust towards unseen domains. Project website: https://seggenerator.github.io This paper proposes SegGen, a novel method for generating high-quality, diverse training data for image segmentation using text-to-mask and mask-to-image generation models. Existing segmentation datasets are limited in size, hindering model performance and generalization ability. SegGen addresses this by synthesizing large-scale, high-quality training data. SegGen leverages two generative models: 1) Text2Mask generates new segmentation masks from text prompts. 2) Mask2Img synthesizes images conditioned on these masks or human-annotated ones. Two data generation approaches are proposed: MaskSyn focuses on new mask generation, while ImgSyn creates new images from existing masks. SegGen significantly boosts the performance of state-of-the-art segmentation models (e.g., Mask2Former) on ADE20K and COCO benchmarks for semantic, panoptic, and instance segmentation. The method achieves state-of-the-art results on these benchmarks without relying on additional human-annotated data. Models trained with SegGen's synthetic data exhibit improved generalization ability, performing better on images from unseen domains (e.g., PASCAL VOC) and AI-generated images. Generating instance segmentation data from text remains a challenge due to the difficulty of inferring instance information from color maps. Further exploration is needed to optimize the scale and diversity of synthetic data for even greater performance improvements. image segmentation, data augmentation, synthetic data generation, text-to-image synthesis, mask-to-image synthesis
2311.03354 Report CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding Junyan Li, Delin Chen, Yining Hong, Zhenfang Chen, Peihao Chen, Yikang Shen, Chuang Gan A remarkable ability of human beings resides in compositional reasoning, i.e., the capacity to make "infinite use of finite means". However, current large vision-language foundation models (VLMs) fall short of such compositional abilities due to their "bag-of-words" behaviors and inability to construct words that correctly represent visual entities and the relations among the entities. To this end, we propose CoVLM, which can guide the LLM to explicitly compose visual entities and relationships among the text and dynamically communicate with the vision encoder and detection network to achieve vision-language communicative decoding. Specifically, we first devise a set of novel communication tokens for the LLM, for dynamic communication between the visual detection system and the language system. A communication token is generated by the LLM following a visual entity or a relation, to inform the detection network to propose regions that are relevant to the sentence generated so far. The proposed regions-of-interests (ROIs) are then fed back into the LLM for better language generation contingent on the relevant regions. The LLM is thus able to compose the visual entities and relationships through the communication tokens. The vision-to-language and language-to-vision communication are iteratively performed until the entire sentence is generated. Our framework seamlessly bridges the gap between visual perception and LLMs and outperforms previous VLMs by a large margin on compositional reasoning benchmarks (e.g., ~20% in HICO-DET mAP, ~14% in Cola top-1 accuracy, and ~3% on ARO top-1 accuracy). We also achieve state-of-the-art performances on traditional vision-language tasks such as referring expression comprehension and visual question answering. This paper proposes CoVLM, a novel approach that integrates detection networks into LLMs to enable dynamic interaction and compositionality over visual entities and relations for improved image understanding. Current VLMs lack compositional reasoning abilities crucial for visual understanding tasks, often exhibiting "bag-of-words" behavior and failing to represent entities and relationships accurately. CoVLM introduces communication tokens into LLMs, enabling step-by-step communication with visual components and relations. The vision module uses a detection network to propose relevant regions based on language inputs, and the LLM uses these regions for better language generation. CoVLM outperforms previous VLMs on compositional reasoning benchmarks like ARO, Cola, and HICO-DET by significant margins. It demonstrates superior performance in tasks requiring fine-grained object recognition and reasoning about relations between entities. CoVLM achieves competitive results on traditional vision-language tasks like referring expression comprehension and visual question answering. The paper doesn't delve deeply into object-attribute or spatial event compositionality. Further exploration of these aspects is crucial for future work. vision-language models, compositional reasoning, object detection, large language models, vision-language communication
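A hedged sketch of the communicative-decoding loop: generation alternates with detection whenever a communication token is emitted. The token names and the `llm.step` / `llm.append_visual` / `detector` interfaces are assumptions for illustration, not CoVLM's actual API.

```python
def communicative_decode(llm, detector, image, prompt_tokens, max_steps=64):
    """Vision-language communicative decoding loop (sketch): the LLM generates
    tokens; whenever it emits a communication token after a visual entity or
    relation, the detection network proposes regions conditioned on the text so
    far, and the ROI features are fed back before generation continues."""
    tokens = list(prompt_tokens)
    for _ in range(max_steps):
        tok = llm.step(tokens, image)           # next-token prediction
        tokens.append(tok)
        if tok in ("<visual>", "<previsual>"):  # communication tokens (names assumed)
            rois = detector(image, tokens)      # regions relevant to the sentence so far
            llm.append_visual(rois)             # language <- vision feedback
        if tok == "<eos>":
            break
    return tokens
```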
2311.03352 Report Rethinking Evaluation Metrics of Open-Vocabulary Segmentaion Hao Zhou, Tiancheng Shen, Xu Yang, Hai Huang, Xiangtai Li, Lu Qi, Ming-Hsuan Yang In this paper, we highlight a problem of evaluation metrics adopted in the open-vocabulary segmentation. That is, the evaluation process still heavily relies on closed-set metrics on zero-shot or cross-dataset pipelines without considering the similarity between predicted and ground truth categories. To tackle this issue, we first survey eleven similarity measurements between two categorical words using WordNet linguistics statistics, text embedding, and language models by comprehensive quantitative analysis and user study. Built upon those explored measurements, we designed novel evaluation metrics, namely Open mIoU, Open AP, and Open PQ, tailored for three open-vocabulary segmentation tasks. We benchmarked the proposed evaluation metrics on 12 open-vocabulary methods of three segmentation tasks. Even though the relative subjectivity of similarity distance, we demonstrate that our metrics can still well evaluate the open ability of the existing open-vocabulary segmentation methods. We hope that our work can bring with the community new thinking about how to evaluate the open ability of models. The evaluation code is released in github. This paper proposes novel evaluation metrics (Open mIoU, Open AP, Open PQ) for open-vocabulary segmentation, addressing the limitations of existing closed-set metrics by incorporating semantic similarity between predicted and ground-truth labels. Existing open-vocabulary segmentation metrics rely on closed-set metrics and don't account for semantic similarity, failing to capture the true open-world performance of models. The authors explore and compare eleven similarity measurements, preferring WordNet's path similarity. They introduce open metrics that employ class-agnostic matching and semantic similarity-based scoring for semantic, instance, and panoptic segmentation. Open metrics consistently outperform vanilla metrics, highlighting their ability to account for semantic similarity. Open metrics demonstrate sensitivity to segmentation and recognition quality, providing a more accurate evaluation. User studies confirm that the proposed Open metrics and Path Similarity align well with human judgment. The choice of similarity measurement, while preferred, remains subjective and may not be universally suitable. Future work could explore alternative similarity measures and their impact on open-vocabulary evaluation. open-vocabulary segmentation, evaluation metrics, semantic similarity, wordnet, class-agnostic matching
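The core change behind the "open" metrics is to replace binary hit/miss scoring with a semantic-similarity credit between predicted and ground-truth class names. Below is a minimal sketch using NLTK WordNet path similarity; `open_pixel_accuracy` is an illustrative similarity-weighted score, not the paper's exact Open mIoU/AP/PQ definitions.

```python
# pip install nltk; then: import nltk; nltk.download("wordnet")
import numpy as np
from nltk.corpus import wordnet as wn

def path_similarity(a: str, b: str) -> float:
    """Max WordNet path similarity over noun senses of two class names."""
    syn_a, syn_b = wn.synsets(a, pos=wn.NOUN), wn.synsets(b, pos=wn.NOUN)
    if not syn_a or not syn_b:
        return 1.0 if a == b else 0.0
    return max(sa.path_similarity(sb) or 0.0 for sa in syn_a for sb in syn_b)

def open_pixel_accuracy(pred: np.ndarray, gt: np.ndarray, classes: list) -> float:
    """Pixel accuracy where a wrong-but-semantically-close prediction earns
    partial credit (illustrative 'open' scoring, not the exact Open mIoU)."""
    sim = np.array([[path_similarity(ci, cj) for cj in classes] for ci in classes])
    valid = gt >= 0                      # ignore unlabeled pixels marked as -1
    return float(sim[pred[valid], gt[valid]].mean())
```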
2311.03335 Report Cross-Image Attention for Zero-Shot Appearance Transfer Yuval Alaluf, Daniel Garibi, Or Patashnik, Hadar Averbuch-Elor, Daniel Cohen-Or Recent advancements in text-to-image generative models have demonstrated a remarkable ability to capture a deep semantic understanding of images. In this work, we leverage this semantic knowledge to transfer the visual appearance between objects that share similar semantics but may differ significantly in shape. To achieve this, we build upon the self-attention layers of these generative models and introduce a cross-image attention mechanism that implicitly establishes semantic correspondences across images. Specifically, given a pair of images -- one depicting the target structure and the other specifying the desired appearance -- our cross-image attention combines the queries corresponding to the structure image with the keys and values of the appearance image. This operation, when applied during the denoising process, leverages the established semantic correspondences to generate an image combining the desired structure and appearance. In addition, to improve the output image quality, we harness three mechanisms that either manipulate the noisy latent codes or the model's internal representations throughout the denoising process. Importantly, our approach is zero-shot, requiring no optimization or training. Experiments show that our method is effective across a wide range of object categories and is robust to variations in shape, size, and viewpoint between the two input images. This paper introduces a zero-shot method for semantic-based appearance transfer between objects in natural images, leveraging the implicit semantic correspondences captured by pretrained text-to-image diffusion models. The method addresses the limitations of existing appearance transfer techniques that require per-domain or per-image training, enabling flexible transfer across objects with variations in shape, size, and viewpoint. The core of the method is Cross-Image Attention, which replaces the self-attention layers in the diffusion model's decoder. It mixes queries from the structure image with keys and values from the appearance image to establish semantic correspondences. Further enhancements include attention map contrasting, appearance guidance using classifier-free guidance, and AdaIN for color distribution alignment. The method effectively transfers visual appearance between semantically similar objects, even with significant shape variations. It outperforms existing techniques in qualitative comparisons, capturing finer details and preserving source structure. Quantitative evaluations and user studies confirm its superiority in appearance fidelity and overall quality. Transferring appearance between objects lacking shared semantics remains challenging. The quality of transfer relies on accurate and editable inversions of input images, which can be sensitive to inversion settings. appearance transfer, diffusion models, cross-image attention, zero-shot learning, semantic correspondences
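A minimal sketch of the cross-image attention operation: within a self-attention layer of the denoising network, queries from the structure image attend to keys and values from the appearance image. The `contrast` temperature is a simplified stand-in for the paper's attention-map contrasting step.

```python
import torch
import torch.nn.functional as F

def cross_image_attention(q_struct, k_app, v_app, contrast=1.0):
    """Replacement for a self-attention layer during denoising: queries from the
    structure image's features attend to keys/values from the appearance image's
    features, so appearance is transferred onto semantically matching locations.
    Shapes: [B, N, d]."""
    scale = q_struct.shape[-1] ** -0.5
    sim = torch.einsum("bid,bjd->bij", q_struct, k_app) * scale
    attn = F.softmax(sim / contrast, dim=-1)   # contrast < 1 sharpens the map
    return torch.einsum("bij,bjd->bid", attn, v_app)
```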
2311.03287 Report Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Linjun Zhang, James Zou, Huaxiu Yao While GPT-4V(ision) impressively models both visual and textual information simultaneously, it's hallucination behavior has not been systematically assessed. To bridge this gap, we introduce a new benchmark, namely, the Bias and Interference Challenges in Visual Language Models (Bingo). This benchmark is designed to evaluate and shed light on the two common types of hallucinations in visual language models: bias and interference. Here, bias refers to the model's tendency to hallucinate certain types of responses, possibly due to imbalance in its training data. Interference pertains to scenarios where the judgment of GPT-4V(ision) can be disrupted due to how the text prompt is phrased or how the input image is presented. We identify a notable regional bias, whereby GPT-4V(ision) is better at interpreting Western images or images with English writing compared to images from other countries or containing text in other languages. Moreover, GPT-4V(ision) is vulnerable to leading questions and is often confused when interpreting multiple images together. Popular mitigation approaches, such as self-correction and chain-of-thought reasoning, are not effective in resolving these challenges. We also identified similar biases and interference vulnerabilities with LLaVA and Bard. Our results characterize the hallucination challenges in GPT-4V(ision) and state-of-the-art visual-language models, and highlight the need for new solutions. The Bingo benchmark is available at https://github.com/gzcch/Bingo. This paper introduces Bingo, a new benchmark to analyze hallucinations in vision-language models (VLMs), focusing on GPT-4V(ision) Understanding the limitations and potential biases of VLMs like GPT-4V(ision) is crucial for improving their reliability and safety The authors curated a benchmark of 190 failure instances across various categories of bias (regional, OCR, factual) and interference (image-to-image, text-to-image), comparing GPT-4V(ision)'s performance with LLaVA and Bard. Mitigation strategies like self-correction and chain-of-thought reasoning were also explored GPT-4V(ision) exhibits significant regional bias, performing better on Western images and English text It is highly susceptible to interference, struggling to differentiate similar images and often adhering to inaccurate user claims Self-correction moderately reduces hallucinations, while chain-of-thought reasoning shows limited success The benchmark's scope is limited to specific metrics and tasks Data curation relies on human judgment, potentially introducing bias vision-language models, hallucination, bias, interference, gpt-4v
2311.03233 Report Navigating Scaling Laws: Compute Optimality in Adaptive Model Training Sotiris Anagnostidis, Gregor Bachmann, Imanol Schlag, Thomas Hofmann In recent years, the state-of-the-art in deep learning has been dominated by very large models that have been pre-trained on vast amounts of data. The paradigm is very simple: investing more computational resources (optimally) leads to better performance, and even predictably so; neural scaling laws have been derived that accurately forecast the performance of a network for a desired level of compute. This leads to the notion of a `compute-optimal' model, i.e. a model that allocates a given level of compute during training optimally to maximize performance. In this work, we extend the concept of optimality by allowing for an `adaptive' model, i.e. a model that can change its shape during training. By doing so, we can design adaptive models that optimally traverse between the underlying scaling laws and outpace their `static' counterparts, leading to a significant reduction in the required compute to reach a given target performance. We show that our approach generalizes across modalities and different shape parameters. This paper proposes an adaptive model training methodology that adjusts model shape (e.g., patch size for ViTs, context length for LLMs) during training to traverse between scaling laws, significantly reducing compute requirements for a target performance. Alleviating the exponential compute increase needed for performance improvement in deep learning, especially for large pre-trained models. By leveraging scaling laws for different model shapes, the method identifies the shape yielding the fastest performance gain at a given compute budget, enabling dynamic shape scheduling. Adaptive patch size/context length scheduling for ViTs/LLMs reduces required compute by up to 50% for a given performance. The method generalizes to other shape parameters like model width, batch size, and training objectives. Dynamically scheduled models consistently outperform static (fixed shape) models in terms of compute efficiency. The study primarily focuses on FLOPs, assuming a strong correlation with accelerator time, which might not always hold. Determining the optimal scheduler necessitates knowledge of scaling behavior for different shape parameters, potentially incurring high computational cost. deep learning, scaling laws, adaptive training, vision transformers, language models
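A toy sketch of the shape-scheduling idea: given fitted scaling laws loss(C) for each candidate shape, greedily pick the shape predicted to reduce loss the most over the next slice of compute. The power-law fits below are made-up numbers purely for illustration.

```python
import numpy as np

def best_shape(scaling_laws: dict, flops: float, d_flops: float) -> str:
    """Greedy shape scheduler: scaling_laws maps shape name -> loss(C) callable
    (C in FLOPs); return the shape with the largest predicted loss decrease
    over the next d_flops of training."""
    gains = {name: law(flops) - law(flops + d_flops)
             for name, law in scaling_laws.items()}
    return max(gains, key=gains.get)

# Made-up power-law fits loss(C) = a * C^-b + c for two ViT patch sizes.
laws = {
    "patch32": lambda c: 9.0 * c ** -0.15 + 1.0,
    "patch16": lambda c: 3.5 * c ** -0.12 + 0.8,
}
schedule = [best_shape(laws, c, 1e15) for c in np.logspace(15, 20, 6)]
print(schedule)  # in this toy, training starts coarse (patch32) and switches to patch16
```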
2311.03149 Report Asymmetric Masked Distillation for Pre-Training Small Foundation Models Zhiyu Zhao, Bingkun Huang, Sen Xing, Gangshan Wu, Yu Qiao, Limin Wang Self-supervised foundation models have shown great potential in computer vision thanks to the pre-training paradigm of masked autoencoding. Scale is a primary factor influencing the performance of these foundation models. However, these large foundation models often result in high computational cost. This paper focuses on pre-training relatively small vision transformer models that could be efficiently adapted to downstream tasks. Specifically, taking inspiration from knowledge distillation in model compression, we propose a new asymmetric masked distillation (AMD) framework for pre-training relatively small models with autoencoding. The core of AMD is to devise an asymmetric masking strategy, where the teacher model is enabled to see more context information with a lower masking ratio, while the student model is still equipped with a high masking ratio. We design customized multi-layer feature alignment between the teacher encoder and student encoder to regularize the pre-training of student MAE. To demonstrate the effectiveness and versatility of AMD, we apply it to both ImageMAE and VideoMAE for pre-training relatively small ViT models. AMD achieved 84.6% classification accuracy on IN1K using the ViT-B model. And AMD achieves 73.3% classification accuracy using the ViT-B model on the Something-in-Something V2 dataset, a 3.7% improvement over the original ViT-B model from VideoMAE. We also transfer AMD pre-trained models to downstream tasks and obtain consistent performance improvement over the original masked autoencoding. The code and models are available at https://github.com/MCG-NJU/AMD. This paper proposes Asymmetric Masked Distillation (AMD), a novel framework for pre-training smaller vision transformer models using an asymmetric masking strategy within a knowledge distillation framework. Large foundation models are computationally expensive. AMD aims to pre-train smaller models efficiently, enabling their adaptation to downstream tasks with reduced computational costs. AMD employs a teacher-student distillation framework. The teacher model (larger, pre-trained) uses a lower masking ratio, accessing more context. The student model (smaller) has a higher masking ratio, learning through pixel reconstruction and multi-layer feature alignment with the teacher. AMD achieves 73.3% accuracy on SSV2 using ViT-B, closing the gap with the larger teacher model (74.3%). AMD demonstrates robust transfer learning performance, improving accuracy on action recognition tasks like SSV2, UCF101, and HMDB51. AMD outperforms symmetric distillation methods like DMAE, highlighting the benefits of asymmetric masking and feature alignment in capturing richer context information. The optimal masking ratio for the teacher model needs careful consideration for balancing performance and computational cost. Exploring AMD with more complex architectures and larger datasets could further enhance performance. knowledge distillation, vision transformers, self-supervised learning, masked autoencoding, action recognition
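A hedged sketch of one asymmetric-masked-distillation step: the frozen teacher encodes a low-masking-ratio view, the student a high-masking-ratio subset of the same kept tokens, and the student is trained with pixel reconstruction plus feature alignment. `teacher`, `student`, `proj`, and `decoder` are assumed user-supplied modules, and the paper's multi-layer alignment is collapsed to a single layer here.

```python
import torch
import torch.nn.functional as F

def amd_losses(tokens, teacher, student, proj, decoder,
               teacher_ratio=0.5, student_ratio=0.9):
    """One asymmetric-masked-distillation step (sketch, not the released code).
    tokens: [B, N, D] patch embeddings of a clip. The teacher keeps more tokens
    (lower masking ratio); the student keeps a subset of them (higher ratio)."""
    B, N, D = tokens.shape
    order = torch.rand(B, N, device=tokens.device).argsort(dim=1)
    n_s = int(N * (1 - student_ratio))   # student keeps fewer tokens
    n_t = int(N * (1 - teacher_ratio))   # teacher keeps more (a superset by construction)
    take = lambda n: torch.gather(tokens, 1, order[:, :n, None].expand(-1, -1, D))

    with torch.no_grad():
        feat_t = teacher(take(n_t))      # [B, n_t, C]
    feat_s = student(take(n_s))          # [B, n_s, C']

    align = F.smooth_l1_loss(proj(feat_s), feat_t[:, :n_s])   # shared visible tokens
    recon = F.mse_loss(decoder(feat_s, order), tokens)        # reconstruct all patches
    return recon + align
```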
2311.03079 Report CogVLM: Visual Expert for Pretrained Language Models Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, Jie Tang We introduce CogVLM, a powerful open-source visual language foundation model. Different from the popular shallow alignment method which maps image features into the input space of language model, CogVLM bridges the gap between the frozen pretrained language model and image encoder by a trainable visual expert module in the attention and FFN layers. As a result, CogVLM enables deep fusion of vision language features without sacrificing any performance on NLP tasks. CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flicker30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and ranks the 2nd on VQAv2, OKVQA, TextVQA, COCO captioning, etc., surpassing or matching PaLI-X 55B. Codes and checkpoints are available at https://github.com/THUDM/CogVLM. This paper introduces CogVLM, an open-source visual language foundation model that deeply integrates visual and linguistic features while preserving the capabilities of a pretrained large language model. Existing methods for integrating vision into large language models either rely on shallow alignment, limiting performance, or risk catastrophic forgetting of language abilities when directly training on image-text data. CogVLM addresses these limitations. CogVLM employs a trainable visual expert module within the language model's architecture, enabling deep fusion of visual and linguistic information without modifying the original language model parameters. This approach facilitates comprehensive multimodal pretraining and fine-tuning on diverse datasets. CogVLM-17B achieves state-of-the-art results across 17 visual language benchmarks, including image captioning, visual question answering, LVLM benchmarks, and visual grounding. The visual expert module is shown to be crucial, outperforming shallow alignment methods and even surpassing models with larger language models or specialized training. Ablation studies validate the design choices of CogVLM, including the visual expert's architecture, initialization, and the use of causal attention masks for visual tokens. The model's performance in handling complex compositional reasoning or tasks requiring extensive external knowledge is not extensively evaluated. Future work can explore advanced alignment techniques like RLHF and anti-hallucination strategies to further enhance CogVLM's capabilities and address potential biases. multimodal learning, vision language models, deep fusion, visual expert, large language models
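The visual-expert idea can be illustrated as a drop-in replacement for a linear projection (QKV or FFN) that routes image-token positions through trainable expert weights while text tokens keep the frozen pretrained weights. A minimal sketch, not CogVLM's released code:

```python
import torch
import torch.nn as nn

class VisualExpertLinear(nn.Module):
    """Wraps a frozen pretrained projection: text-token positions use the frozen
    weights, image-token positions use a trainable copy (the 'visual expert')."""
    def __init__(self, frozen: nn.Linear):
        super().__init__()
        self.text_proj = frozen
        for p in self.text_proj.parameters():
            p.requires_grad_(False)
        self.image_proj = nn.Linear(frozen.in_features, frozen.out_features)
        self.image_proj.load_state_dict(frozen.state_dict())  # init expert from LM weights

    def forward(self, x: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        # x: [B, T, D]; is_image: [B, T] boolean mask of visual-token positions.
        out_text = self.text_proj(x)
        out_image = self.image_proj(x)
        return torch.where(is_image.unsqueeze(-1), out_image, out_text)
```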
2311.02848 Report Consistent4D: Consistent 360° Dynamic Object Generation from Monocular Video Yanqin Jiang, Li Zhang, Jin Gao, Weimin Hu, Yao Yao In this paper, we present Consistent4D, a novel approach for generating 4D dynamic objects from uncalibrated monocular videos. Uniquely, we cast the 360-degree dynamic object reconstruction as a 4D generation problem, eliminating the need for tedious multi-view data collection and camera calibration. This is achieved by leveraging the object-level 3D-aware image diffusion model as the primary supervision signal for training Dynamic Neural Radiance Fields (DyNeRF). Specifically, we propose a Cascade DyNeRF to facilitate stable convergence and temporal continuity under the supervision signal which is discrete along the time axis. To achieve spatial and temporal consistency, we further introduce an Interpolation-driven Consistency Loss. It is optimized by minimizing the discrepancy between rendered frames from DyNeRF and interpolated frames from a pre-trained video interpolation model. Extensive experiments show that our Consistent4D can perform competitively to prior art alternatives, opening up new possibilities for 4D dynamic object generation from monocular videos, whilst also demonstrating advantage for conventional text-to-3D generation tasks. Our project page is https://consistent4d.github.io/. Presents Consistent4D, a novel framework for generating 360° 4D dynamic objects from uncalibrated, static monocular videos using a Cascade Dynamic Neural Radiance Field (DyNeRF) optimized by a 2D image diffusion model and a novel Interpolation-driven Consistency Loss (ICL). Addresses limitations of existing 4D reconstruction methods reliant on multi-view data or restricted capture setups by enabling dynamic object generation from readily available monocular videos. Leverages a cascade DyNeRF architecture trained with Score Distillation Sampling (SDS) from a pre-trained image diffusion model. Introduces ICL to enforce spatial and temporal consistency by minimizing discrepancies between rendered and interpolated frames from a video interpolation model. A lightweight video enhancer further refines the generated output. Outperforms baseline dynamic NeRF methods in quantitative metrics (LPIPS, CLIP similarity) for novel view synthesis of dynamic objects. Demonstrates superior spatial and temporal consistency compared to methods without ICL, effectively mitigating issues like multi-face artifacts. The proposed ICL also shows promise in alleviating multi-face problems in conventional text-to-3D generation tasks. Struggles to generate accurate representations when the object's motion is overly complex or abrupt. ICL, while effective in most cases, does not completely eliminate multi-face artifacts in all text-to-3D generation scenarios. 4d generation, dynamic nerf, monocular video, score distillation sampling, interpolation-driven consistency
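A minimal sketch of the Interpolation-driven Consistency Loss: a frozen video interpolator produces a pseudo-ground-truth in-between frame from two rendered frames, and the DyNeRF's own midpoint rendering is pulled toward it. `render(cam, t)` and `interpolator(f0, f1)` are assumed callables, and the L1 penalty is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def interpolation_consistency_loss(render, interpolator, cam, t0, t1):
    """Render two frames of the dynamic NeRF at times t0 and t1 from the same
    camera, let a frozen pretrained frame interpolator predict the in-between
    frame, and penalise the difference from the NeRF's midpoint rendering."""
    f0, f1 = render(cam, t0), render(cam, t1)
    with torch.no_grad():
        f_mid_target = interpolator(f0, f1)       # treated as pseudo ground truth
    f_mid_render = render(cam, 0.5 * (t0 + t1))
    return F.l1_loss(f_mid_render, f_mid_target)
```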
2311.02826 Report InstructPix2NeRF: Instructed 3D Portrait Editing from a Single Image Jianhui Li, Shilong Liu, Zidong Liu, Yikai Wang, Kaiwen Zheng, Jinghui Xu, Jianmin Li, Jun Zhu With the success of Neural Radiance Field (NeRF) in 3D-aware portrait editing, a variety of works have achieved promising results regarding both quality and 3D consistency. However, these methods heavily rely on per-prompt optimization when handling natural language as editing instructions. Due to the lack of labeled human face 3D datasets and effective architectures, the area of human-instructed 3D-aware editing for open-world portraits in an end-to-end manner remains under-explored. To solve this problem, we propose an end-to-end diffusion-based framework termed InstructPix2NeRF, which enables instructed 3D-aware portrait editing from a single open-world image with human instructions. At its core lies a conditional latent 3D diffusion process that lifts 2D editing to 3D space by learning the correlation between the paired images' difference and the instructions via triplet data. With the help of our proposed token position randomization strategy, we could even achieve multi-semantic editing through one single pass with the portrait identity well-preserved. Besides, we further propose an identity consistency module that directly modulates the extracted identity signals into our diffusion process, which increases the multi-view 3D identity consistency. Extensive experiments verify the effectiveness of our method and show its superiority against strong baselines quantitatively and qualitatively. Source code and pre-trained models can be found on our project page: \url{https://mybabyyh.github.io/InstructPix2NeRF}. Presents InstructPix2NeRF, an end-to-end diffusion-based framework for 3D-aware portrait editing from a single image guided by human instructions. Addresses the lack of end-to-end models for instructed 3D-aware portrait editing, aiming for more user-friendly and precise control over edits. Combines a conditional latent 3D diffusion process with NeRF-based generators. Utilizes a triplet dataset (original face, edited face, instruction) and introduces token position randomization and an identity consistency module. Achieves high fidelity to human instructions while maintaining identity consistency. Enables multi-semantic editing from a single instruction, surpassing baseline methods in qualitative and quantitative evaluations. Demonstrates superior performance in user studies for instruction correspondence and identity consistency. Slight variations in color output can occur between semantically similar instructions. Fine details like eye shape and eyelashes can be further improved. 3d-aware editing, human instructions, diffusion models, nerf, portrait editing
2311.02709 Report Benchmarking a Benchmark: How Reliable is MS-COCO? Eric Zimmermann, Justin Szeto, Jerome Pasquero, Frederic Ratle Benchmark datasets are used to profile and compare algorithms across a variety of tasks, ranging from image classification to segmentation, and also play a large role in image pretraining algorithms. Emphasis is placed on results with little regard to the actual content within the dataset. It is important to question what kind of information is being learned from these datasets and what are the nuances and biases within them. In the following work, Sama-COCO, a re-annotation of MS-COCO, is used to discover potential biases by leveraging a shape analysis pipeline. A model is trained and evaluated on both datasets to examine the impact of different annotation conditions. Results demonstrate that annotation styles are important and that annotation pipelines should closely consider the task of interest. The dataset is made publicly available at https://www.sama.com/sama-coco-dataset/ . This paper presents Sama-COCO, a re-annotated version of the MS-COCO dataset focused on tighter polygons and decomposed crowd instances, to investigate potential biases in annotation styles and their impact on model performance. Benchmark datasets like MS-COCO are crucial for evaluating computer vision algorithms. However, inconsistencies and biases within these datasets can impact the reliability of performance comparisons and the development of robust models. The authors re-annotated the MS-COCO dataset with stricter guidelines, emphasizing precise polygon boundaries and individual instance labeling. They trained a Faster R-CNN model on both datasets and compared performance using standard metrics. A shape analysis pipeline, based on contour analysis and distance transforms, was employed to quantify differences between corresponding annotations in both datasets. Annotation styles significantly impact model performance, with models performing better when trained and evaluated on datasets with consistent annotation guidelines. MS-COCO exhibits a bias towards avoiding annotation around occluding objects, leading to variations in model outputs compared to Sama-COCO, which emphasizes pixel-level accuracy. Even a theoretically perfect predictor trained on one dataset might exhibit degraded performance on another due to variations in annotation styles. The analysis focuses on single polygon shapes and relies on bounding box shape consistency assumptions for matching annotations. The study primarily utilizes a Faster R-CNN model. Exploring the impact of annotation styles on other architectures could provide further insights. dataset bias, annotation quality, instance segmentation, ms-coco, sama-coco
2311.02542 Report VR-NeRF: High-Fidelity Virtualized Walkable Spaces Linning Xu, Vasu Agrawal, William Laney, Tony Garcia, Aayush Bansal, Changil Kim, Samuel Rota Bulò, Lorenzo Porzi, Peter Kontschieder, Aljaž Božič, Dahua Lin, Michael Zollhöfer, Christian Richardt We present an end-to-end system for the high-fidelity capture, model reconstruction, and real-time rendering of walkable spaces in virtual reality using neural radiance fields. To this end, we designed and built a custom multi-camera rig to densely capture walkable spaces in high fidelity and with multi-view high dynamic range images in unprecedented quality and density. We extend instant neural graphics primitives with a novel perceptual color space for learning accurate HDR appearance, and an efficient mip-mapping mechanism for level-of-detail rendering with anti-aliasing, while carefully optimizing the trade-off between quality and speed. Our multi-GPU renderer enables high-fidelity volume rendering of our neural radiance field model at the full VR resolution of dual 2K$\times$2K at 36 Hz on our custom demo machine. We demonstrate the quality of our results on our challenging high-fidelity datasets, and compare our method and datasets to existing baselines. We release our dataset on our project website. VR-NeRF: An end-to-end system for high-fidelity capture, reconstruction, and real-time rendering of walkable spaces in VR using neural radiance fields. Existing approaches for VR view synthesis are limited to either small headbox volumes or lower quality scene-scale rendering. A custom multi-camera rig ("Eyeful Tower") captures dense, high-resolution HDR images. A novel NeRF model with perceptual color space and efficient mip-mapping enables high-fidelity reconstruction and rendering. A multi-GPU renderer enables real-time VR exploration. Captures large-scale datasets with unprecedented quality and density (thousands of 50MP HDR images). Proposed VR-NeRF model outperforms baselines in visual fidelity for large-scale HDR scenes. Achieves real-time rendering (36 FPS) at VR resolution on a custom 20-GPU workstation. Limited ability to handle dynamic scene elements like moving objects or lighting. View extrapolation capabilities are limited by the capture density. neural radiance fields, virtual reality, novel view synthesis, high dynamic range imaging, multi-view capture
2311.02536 Report Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models Jingru Yi, Burak Uzkent, Oana Ignat, Zili Li, Amanmeet Garg, Xiang Yu, Linda Liu Grounding-based vision and language models have been successfully applied to low-level vision tasks, aiming to precisely locate objects referred in captions. The effectiveness of grounding representation learning heavily relies on the scale of the training dataset. Despite being a useful data enrichment strategy, data augmentation has received minimal attention in existing vision and language tasks as augmentation for image-caption pairs is non-trivial. In this study, we propose a robust phrase grounding model trained with text-conditioned and text-unconditioned data augmentations. Specifically, we apply text-conditioned color jittering and horizontal flipping to ensure semantic consistency between images and captions. To guarantee image-caption correspondence in the training samples, we modify the captions according to pre-defined keywords when applying horizontal flipping. Additionally, inspired by recent masked signal reconstruction, we propose to use pixel-level masking as a novel form of data augmentation. While we demonstrate our data augmentation method with MDETR framework, the proposed approach is applicable to common grounding-based vision and language tasks with other frameworks. Finally, we show that image encoder pretrained on large-scale image and language datasets (such as CLIP) can further improve the results. Through extensive experiments on three commonly applied datasets: Flickr30k, referring expressions and GQA, our method demonstrates advanced performance over the state-of-the-arts with various metrics. Code can be found in https://github.com/amzn/augment-the-pairs-wacv2024. The paper proposes a robust phrase grounding model trained with text-conditioned and text-unconditioned data augmentations to improve grounding-based vision and language models, particularly for object localization within captions. Data augmentation, though crucial for enriching datasets and model generalization, has been under-explored in phrase grounding tasks due to the challenge of maintaining image-caption semantic consistency. The methodology involves text-conditioned color jittering and horizontal flipping applied selectively based on caption content to ensure semantic consistency. It also uses pixel-level masking as a novel data augmentation approach. The approach is demonstrated with the MDETR framework and incorporates image encoders pre-trained on large-scale image and language datasets like CLIP. The proposed method consistently outperforms MDETR, a state-of-the-art phrase grounding model, on various benchmarks like Flickr30k, Referring Expressions, and GQA. Text-conditioned horizontal flipping, which modifies captions based on a keyword list, significantly improves performance on the Referring Expressions dataset, highlighting its effectiveness in learning complex image-caption relationships. The model exhibits better semantic understanding and robustness in phrase grounding, demonstrated by its ability to suppress redundant detections and make accurate distinctions between objects based on subtle cues. The model still faces challenges in detecting small objects that blend with the background and recognizing text within images. Future work will explore generalizing to a wider range of data augmentations to further enhance the diversity of the input space. phrase grounding, data augmentation, vision and language, object localization, mdetr
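Text-conditioned horizontal flipping is the most concrete piece to illustrate: the image and its boxes are mirrored, and directional keywords in the caption are swapped so the pair stays semantically consistent. A minimal sketch with an assumed keyword list (the paper uses its own pre-defined list):

```python
import re
from PIL import Image, ImageOps

SWAP = {"left": "right", "right": "left"}   # assumed keyword list for illustration

def flip_pair(image: Image.Image, caption: str, boxes):
    """Text-conditioned horizontal flip (sketch): mirror the image and its boxes
    and swap directional keywords in the caption so image-text semantics match.
    `boxes` are [x0, y0, x1, y1] in pixels."""
    w = image.width
    image = ImageOps.mirror(image)
    boxes = [[w - x1, y0, w - x0, y1] for x0, y0, x1, y1 in boxes]
    caption = re.sub(r"\b(left|right)\b",
                     lambda m: SWAP[m.group(1).lower()],
                     caption, flags=re.IGNORECASE)
    return image, caption, boxes
```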
2311.02343 Report Stable Diffusion Reference Only: Image Prompt and Blueprint Jointly Guided Multi-Condition Diffusion Model for Secondary Painting Hao Ai, Lu Sheng Stable Diffusion and ControlNet have achieved excellent results in the field of image generation and synthesis. However, due to the granularity and method of its control, the efficiency improvement is limited for professional artistic creations such as comics and animation production whose main work is secondary painting. In the current workflow, fixing characters and image styles often need lengthy text prompts, and even requires further training through TextualInversion, DreamBooth or other methods, which is very complicated and expensive for painters. Therefore, we present a new method in this paper, Stable Diffusion Reference Only, a images-to-image self-supervised model that uses only two types of conditional images for precise control generation to accelerate secondary painting. The first type of conditional image serves as an image prompt, supplying the necessary conceptual and color information for generation. The second type is blueprint image, which controls the visual structure of the generated image. It is natively embedded into the original UNet, eliminating the need for ControlNet. We released all the code for the module and pipeline, and trained a controllable character line art coloring model at https://github.com/aihao2000/stable-diffusion-reference-only, that achieved state-of-the-art results in this field. This verifies the effectiveness of the structure and greatly improves the production efficiency of animations, comics, and fanworks. This paper introduces Stable Diffusion Reference Only, a novel image-to-image self-supervised model for secondary painting that utilizes two conditional images: an image prompt and a blueprint image. Existing text-to-image models are inefficient for professional artistic creations like comics and animation, as they rely heavily on text prompts and require extensive training for specific styles. The model leverages a modified UNet architecture with cross-attention mechanisms to incorporate both image prompt and blueprint information. It is trained in a self-supervised manner using a dataset of anime images and CLIP similarity scores. Stable Diffusion Reference Only demonstrates state-of-the-art results in line art coloring. The model exhibits generalization capabilities, enabling style transfer between different anime characters. It outperforms existing methods like ControlNet Reference Only and IP-Adapter in terms of accuracy and generalization. The model's performance on complex backgrounds and diverse artistic styles requires further investigation. Future work could explore extending the blueprint image to other forms like sketches and poses. image generation, secondary painting, stable diffusion, image-to-image translation, anime art
2311.01813 Report FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng Li, Sishuo Chen, Xu Sun, Lu Hou Recently, open-domain text-to-video (T2V) generation models have made remarkable progress. However, the promising results are mainly shown by the qualitative cases of generated videos, while the quantitative evaluation of T2V models still faces two critical problems. Firstly, existing studies lack fine-grained evaluation of T2V models on different categories of text prompts. Although some benchmarks have categorized the prompts, their categorization either only focuses on a single aspect or fails to consider the temporal information in video generation. Secondly, it is unclear whether the automatic evaluation metrics are consistent with human standards. To address these problems, we propose FETV, a benchmark for Fine-grained Evaluation of Text-to-Video generation. FETV is multi-aspect, categorizing the prompts based on three orthogonal aspects: the major content, the attributes to control and the prompt complexity. FETV is also temporal-aware, which introduces several temporal categories tailored for video generation. Based on FETV, we conduct comprehensive manual evaluations of four representative T2V models, revealing their pros and cons on different categories of prompts from different aspects. We also extend FETV as a testbed to evaluate the reliability of automatic T2V metrics. The multi-aspect categorization of FETV enables fine-grained analysis of the metrics' reliability in different scenarios. We find that existing automatic metrics (e.g., CLIPScore and FVD) correlate poorly with human evaluation. To address this problem, we explore several solutions to improve CLIPScore and FVD, and develop two automatic metrics that exhibit significant higher correlation with humans than existing metrics. Benchmark page: https://github.com/llyx97/FETV. The paper introduces FETV, a benchmark designed for fine-grained evaluation of open-domain text-to-video generation models. Existing quantitative evaluations of T2V models lack fine-grained analysis across categories and reliable automatic metrics. FETV categorizes text prompts based on major content (spatial/temporal), attribute control (spatial/temporal), and complexity. The authors manually evaluate four representative T2V models on FETV and analyze their performance across different categories. FETV is also used as a testbed to assess the reliability of automatic metrics. Existing T2V models struggle with generating high-quality videos, particularly for categories involving actions, kinetic motions, quantity control, motion direction, and event order. Widely used automatic metrics (e.g., CLIPScore, FID, FVD) show poor correlation with human evaluation. The authors develop two new automatic metrics, FVD-UMT (video quality) and UMTScore (video-text alignment), which exhibit higher correlation with human judgment than existing metrics. The number of evaluated T2V models is limited due to the scarcity of open-sourced models. While improved, the proposed UMT-based metrics still have room for better alignment with human evaluation. text-to-video generation, benchmark, evaluation metrics, fine-grained evaluation, vision-language models
2311.01804 Report inkn'hue: Enhancing Manga Colorization from Multiple Priors with Alignment Multi-Encoder VAE Tawin Jiramahapokee Manga, a form of Japanese comics and distinct visual storytelling, has captivated readers worldwide. Traditionally presented in black and white, manga's appeal lies in its ability to convey complex narratives and emotions through intricate line art and shading. Yet, the desire to experience manga in vibrant colors has sparked the pursuit of manga colorization, a task of paramount significance for artists. However, existing methods, originally designed for line art and sketches, face challenges when applied to manga. These methods often fall short in achieving the desired results, leading to the need for specialized manga-specific solutions. Existing approaches frequently rely on a single training step or extensive manual artist intervention, which can yield less satisfactory outcomes. To address these challenges, we propose a specialized framework for manga colorization. Leveraging established models for shading and vibrant coloring, our approach aligns both using a multi-encoder VAE. This structured workflow ensures clear and colorful results, with the option to incorporate reference images and manual hints. This paper introduces a novel framework for user-guided manga colorization that leverages a multi-encoder VAE to enhance the color consistency, shading, and line art quality of manga pages, building upon existing models for shading and colorization. Existing methods for manga colorization often result in color bleeding, text clarity issues, and a lack of vibrancy. This framework aims to address these limitations and provide a streamlined approach for producing high-quality colorized manga. The framework combines a shading model and a rough colorization model, aligning their outputs using a multi-encoder VAE. This VAE is trained to correct inconsistencies and enhance the overall visual quality, while CIELAB interpolation is employed as a post-processing step to fine-tune color saturation and truthfulness. The framework effectively restores line art details lost during the initial colorization process, resulting in sharper features and improved text clarity. The multi-encoder VAE successfully corrects color outliers and inconsistencies from the rough colorization stage, producing more uniform and realistic results. User studies show a strong preference for the framework's post-processed outputs over the rough-colorized priors, highlighting its effectiveness in enhancing visual appeal. The lack of publicly available datasets for manga colorization necessitated the compilation of a training dataset from various sources, potentially introducing bias. The study primarily focused on comparing the framework's performance against its internal stages rather than external benchmarks due to the lack of user-guidance support in some existing models. manga colorization, multi-encoder vae, cielab interpolation, user-guided colorization, deep learning
2311.01797 Report On the Generalization Properties of Diffusion Models Puheng Li, Zhong Li, Huishuai Zhang, Jiang Bian Diffusion models are a class of generative models that serve to establish a stochastic transport map between an empirically observed, yet unknown, target distribution and a known prior. Despite their remarkable success in real-world applications, a theoretical understanding of their generalization capabilities remains underdeveloped. This work embarks on a comprehensive theoretical exploration of the generalization attributes of diffusion models. We establish theoretical estimates of the generalization gap that evolves in tandem with the training dynamics of score-based diffusion models, suggesting a polynomially small generalization error ($O(n^{-2/5}+m^{-4/5})$) on both the sample size $n$ and the model capacity $m$, evading the curse of dimensionality (i.e., not exponentially large in the data dimension) when early-stopped. Furthermore, we extend our quantitative analysis to a data-dependent scenario, wherein target distributions are portrayed as a succession of densities with progressively increasing distances between modes. This precisely elucidates the adverse effect of "modes shift" in ground truths on the model generalization. Moreover, these estimates are not solely theoretical constructs but have also been confirmed through numerical simulations. Our findings contribute to the rigorous understanding of diffusion models' generalization properties and provide insights that may guide practical applications. This paper provides a theoretical analysis of the generalization capability of diffusion models, showing that the generalization gap is polynomially small on sample size and model capacity with early-stopping. Understanding the generalization properties of diffusion models is crucial to address memorization, privacy, and copyright concerns arising from their impressive empirical performance and practical applications. The authors derive upper bounds of the generalization gap, measured by KL divergence, along the training dynamics, using techniques like Rademacher complexity and convex optimization analysis. The generalization error scales polynomially with sample size ($O(n^{-2/5})$) and model capacity ($O(m^{-4/5})$) when early-stopped, avoiding the curse of dimensionality. For target distributions with distant multi-modes, the generalization capability is adversely affected by the distance between modes, as illustrated by the example of Gaussian mixtures. Numerical simulations on synthetic and real-world datasets verify the theoretical findings of early-stopping generalization and the modes shift effect. The analysis focuses on a specific score network architecture (random feature model) and might not directly extend to other variants in the diffusion models family. Future work could explore extending the theoretical framework to more complex score networks, like neural tangent kernels or mean-field models. diffusion models, score-based generative models, generalization, memorization, modes shift
2311.01773 Report PDF: Point Diffusion Implicit Function for Large-scale Scene Neural Representation Yuhan Ding, Fukun Yin, Jiayuan Fan, Hui Li, Xin Chen, Wen Liu, Chongshan Lu, Gang YU, Tao Chen Recent advances in implicit neural representations have achieved impressive results by sampling and fusing individual points along sampling rays in the sampling space. However, due to the explosively growing sampling space, finely representing and synthesizing detailed textures remains a challenge for unbounded large-scale outdoor scenes. To alleviate the dilemma of using individual points to perceive the entire colossal space, we explore learning the surface distribution of the scene to provide structural priors and reduce the samplable space and propose a Point Diffusion implicit Function, PDF, for large-scale scene neural representation. The core of our method is a large-scale point cloud super-resolution diffusion module that enhances the sparse point cloud reconstructed from several training images into a dense point cloud as an explicit prior. Then in the rendering stage, only sampling points with prior points within the sampling radius are retained. That is, the sampling space is reduced from the unbounded space to the scene surface. Meanwhile, to fill in the background of the scene that cannot be provided by point clouds, the region sampling based on Mip-NeRF 360 is employed to model the background representation. Expensive experiments have demonstrated the effectiveness of our method for large-scale scene novel view synthesis, which outperforms relevant state-of-the-art baselines. This paper proposes **PDF**, a **P**oint **D**iffusion implicit **F**unction, for large-scale scene neural representation to enable more efficient and detailed novel view synthesis. Existing implicit neural representation methods struggle with large-scale outdoor scenes due to the explosively growing sampling space needed to represent details. PDF uses a two-stage approach: (1) It leverages a point diffusion model to generate a dense point cloud from a sparse point cloud reconstructed from training images, providing a surface prior. (2) It employs Point-NeRF for foreground rendering and Mip-NeRF 360 for background rendering, fusing the features for novel view synthesis. PDF outperforms state-of-the-art methods on large-scale scene datasets (OMMO, BlendedMVS) in terms of PSNR, SSIM, and LPIPS. The point diffusion module effectively captures scene surface distribution and generates dense point cloud representations. The fusion of foreground and background rendering modules leads to photorealistic results with fine details. The current approach trains a separate diffusion model for each scene, limiting efficiency. Exploring cross-scene point cloud up-sampling generalization and generalized point diffusion NeRF are interesting future directions. neural radiance fields, novel view synthesis, point cloud processing, diffusion models, large-scale scene representation
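The sampling-space reduction can be sketched as a filter that keeps only ray samples lying within a radius of the diffusion-densified point cloud. Brute-force version for illustration; a practical implementation would index the prior points with a voxel grid or k-d tree.

```python
import torch

def filter_samples_by_prior(sample_xyz: torch.Tensor, prior_xyz: torch.Tensor,
                            radius: float):
    """Keep only ray sample points within `radius` of some prior point, shrinking
    the sampling space from the whole volume to the scene surface.
    sample_xyz: [S, 3], prior_xyz: [P, 3]."""
    dist = torch.cdist(sample_xyz, prior_xyz).min(dim=1).values   # [S]
    keep = dist <= radius
    return sample_xyz[keep], keep
```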
2311.01714 Report EXIM: A Hybrid Explicit-Implicit Representation for Text-Guided 3D Shape Generation Zhengzhe Liu, Jingyu Hu, Ka-Hei Hui, Xiaojuan Qi, Daniel Cohen-Or, Chi-Wing Fu This paper presents a new text-guided technique for generating 3D shapes. The technique leverages a hybrid 3D shape representation, namely EXIM, combining the strengths of explicit and implicit representations. Specifically, the explicit stage controls the topology of the generated 3D shapes and enables local modifications, whereas the implicit stage refines the shape and paints it with plausible colors. Also, the hybrid approach separates the shape and color and generates color conditioned on shape to ensure shape-color consistency. Unlike the existing state-of-the-art methods, we achieve high-fidelity shape generation from natural-language descriptions without the need for time-consuming per-shape optimization or reliance on human-annotated texts during training or test-time optimization. Further, we demonstrate the applicability of our approach to generate indoor scenes with consistent styles using text-induced 3D shapes. Through extensive experiments, we demonstrate the compelling quality of our results and the high coherency of our generated shapes with the input texts, surpassing the performance of existing methods by a significant margin. Codes and models are released at https://github.com/liuzhengzhe/EXIM. This paper introduces EXIM, a novel hybrid explicit-implicit 3D shape representation, for high-fidelity text-guided 3D shape generation and modification. Existing methods for text-guided 3D shape generation suffer from limitations such as time-consuming optimization, unrealistic outputs, and difficulties in local shape modification. This work addresses these limitations by leveraging a hybrid shape representation. EXIM employs a two-stage pipeline. The first stage uses a 3D diffusion model in a compact wavelet domain to generate a coarse shape based on text. The second stage utilizes an implicit network to enhance details and generate color conditioned on the shape, guided by the text. EXIM generates high-fidelity 3D shapes from text descriptions, outperforming existing methods in terms of detail and realism. The hybrid representation allows for local shape modification and independent editing of color and shape based on text. The method can generate style-consistent indoor scenes by composing shapes generated from text prompts. The performance on categories trained with pseudo-annotations is limited by the quality of the image captioning model. Automatic and accurate localization of regions of interest for modification based on text remains a challenge. 3d shape generation, text-guided synthesis, hybrid representation, shape modification, indoor scene generation
2311.01462 Report Idempotent Generative Network Assaf Shocher, Amil Dravid, Yossi Gandelsman, Inbar Mosseri, Michael Rubinstein, Alexei A. Efros We propose a new approach for generative modeling based on training a neural network to be idempotent. An idempotent operator is one that can be applied sequentially without changing the result beyond the initial application, namely $f(f(z))=f(z)$. The proposed model $f$ is trained to map a source distribution (e.g, Gaussian noise) to a target distribution (e.g. realistic images) using the following objectives: (1) Instances from the target distribution should map to themselves, namely $f(x)=x$. We define the target manifold as the set of all instances that $f$ maps to themselves. (2) Instances that form the source distribution should map onto the defined target manifold. This is achieved by optimizing the idempotence term, $f(f(z))=f(z)$ which encourages the range of $f(z)$ to be on the target manifold. Under ideal assumptions such a process provably converges to the target distribution. This strategy results in a model capable of generating an output in one step, maintaining a consistent latent space, while also allowing sequential applications for refinement. Additionally, we find that by processing inputs from both target and source distributions, the model adeptly projects corrupted or modified data back to the target manifold. This work is a first step towards a ``global projector'' that enables projecting any input into a target data distribution. This paper introduces Idempotent Generative Networks (IGN), a novel generative model trained to project inputs onto a target data manifold by optimizing for idempotence (i.e., f(f(z))=f(z)). IGN aims to create a "global projector" capable of mapping various inputs, including noise, corrupted instances, and alternative data distributions, onto a desired target distribution (e.g., realistic images). IGN uses a self-adversarial training process with three objectives: (1) Reconstruction: mapping target distribution samples to themselves, (2) Idempotence: ensuring the model output, when fed back as input, remains unchanged, and (3) Tightness: preventing the model from mapping everything to the target manifold. Theoretically, under ideal conditions, IGN’s generated distribution converges to the target distribution. Experiments on MNIST and CelebA demonstrate IGN’s ability to generate images from noise, showing that sequential applications can refine generated samples. IGN exhibits out-of-distribution projection capabilities, successfully handling tasks like denoising, colorization, and sketch-to-image translation without explicit training on these tasks. Similar to GANs, IGN can suffer from mode collapse, potentially limiting the diversity of generated samples. The generated samples can appear blurry, a common issue in autoencoder-based models. generative models, idempotence, image generation, image-to-image translation, out-of-distribution projection
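Those three objectives translate into a compact training loss. The PyTorch-style sketch below only illustrates the idea as summarized above: the L1 distance, the loss weights, and the frozen network copy used to decide which application of f receives gradients are assumptions, and the smooth clamping the authors apply to the tightness term is omitted.

```python
import copy
import torch.nn.functional as F

def ign_losses(f, x, z, w_rec=20.0, w_idem=20.0, w_tight=2.5):
    """Minimal sketch of the three IGN objectives (illustrative weights)."""
    f_frozen = copy.deepcopy(f).requires_grad_(False)  # frozen copy routes gradients

    fx = f(x)                      # real samples should be fixed points: f(x) = x
    fz = f(z)                      # noise should be mapped onto the target manifold
    fz_detached = fz.detach()

    loss_rec = F.l1_loss(fx, x)
    # Idempotence: gradients flow through the *inner* application only,
    # pulling f(z) toward the manifold of fixed points.
    loss_idem = F.l1_loss(f_frozen(fz), fz)
    # Tightness: gradients flow through the *outer* application only, with the
    # sign flipped so off-manifold points are not treated as fixed points.
    loss_tight = -F.l1_loss(f(fz_detached), fz_detached)

    return w_rec * loss_rec + w_idem * loss_idem + w_tight * loss_tight
```

The frozen copy is what lets the single constraint f(f(z)) = f(z) play two roles: optimized through the inner application it pulls f(z) onto the manifold of fixed points, while the sign-flipped outer term keeps that manifold from expanding to cover everything.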
2311.01410 Report The Blessing of Randomness: SDE Beats ODE in General Diffusion-based Image Editing Shen Nie, Hanzhong Allan Guo, Cheng Lu, Yuhao Zhou, Chenyu Zheng, Chongxuan Li We present a unified probabilistic formulation for diffusion-based image editing, where a latent variable is edited in a task-specific manner and generally deviates from the corresponding marginal distribution induced by the original stochastic or ordinary differential equation (SDE or ODE). Instead, it defines a corresponding SDE or ODE for editing. In the formulation, we prove that the Kullback-Leibler divergence between the marginal distributions of the two SDEs gradually decreases while that for the ODEs remains as the time approaches zero, which shows the promise of SDE in image editing. Inspired by it, we provide the SDE counterparts for widely used ODE baselines in various tasks including inpainting and image-to-image translation, where SDE shows a consistent and substantial improvement. Moreover, we propose SDE-Drag -- a simple yet effective method built upon the SDE formulation for point-based content dragging. We build a challenging benchmark (termed DragBench) with open-set natural, art, and AI-generated images for evaluation. A user study on DragBench indicates that SDE-Drag significantly outperforms our ODE baseline, existing diffusion-based methods, and the renowned DragGAN. Our results demonstrate the superiority and versatility of SDE in image editing and push the boundary of diffusion-based editing methods. This paper introduces a unified probabilistic perspective for analyzing diffusion-based image editing and demonstrates the superiority of stochastic differential equation (SDE) formulations over the commonly used ordinary differential equation (ODE) methods. Existing diffusion-based image editing methods lack a probabilistic understanding, and ODE formulations are predominantly used due to ease of implementation. This work provides theoretical justification for SDE's advantages in editing. The paper presents a unified formulation encompassing existing methods, proves theoretically that SDEs reduce the divergence between edited and data distributions while ODEs do not, and proposes SDE-Drag, a novel SDE-based method for point-based content dragging. SDE counterparts consistently outperform ODE baselines in inpainting and image-to-image translation tasks. SDE-Drag, evaluated on a new challenging benchmark (DragBench), significantly outperforms ODE-Drag, existing diffusion-based dragging methods, and DragGAN. SDE-based methods achieve superior results without increasing computational cost. Theoretical analysis does not fully account for model approximation and discretization errors. Open-set image dragging remains a challenge with certain failure cases. diffusion model, image editing, sde, ode, image dragging
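To make the ODE-versus-SDE distinction concrete, the generalized DDIM step below can be toggled between a deterministic (ODE-like) update and a noise-injecting (SDE-like) update via the usual eta parameter. This is textbook sampler math included purely for illustration; it is not the paper's editing-specific SDE construction or its theory.

```python
import torch

def generalized_ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev, eta=0.0):
    """Standard generalized DDIM update (all inputs are tensors). eta=0 gives the
    deterministic, ODE-like step used by most editing baselines; eta=1 injects
    fresh noise at every step, i.e. an SDE-style sampler."""
    x0_pred = (x_t - (1 - alpha_bar_t).sqrt() * eps_pred) / alpha_bar_t.sqrt()
    sigma_t = eta * ((1 - alpha_bar_prev) / (1 - alpha_bar_t)).sqrt() \
                  * (1 - alpha_bar_t / alpha_bar_prev).sqrt()
    noise = torch.randn_like(x_t) if eta > 0 else torch.zeros_like(x_t)
    return (alpha_bar_prev.sqrt() * x0_pred
            + (1 - alpha_bar_prev - sigma_t ** 2).sqrt() * eps_pred
            + sigma_t * noise)
```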
2311.01373 Report Recognize Any Regions Haosen Yang, Chuofan Ma, Bin Wen, Yi Jiang, Zehuan Yuan, Xiatian Zhu Understanding the semantics of individual regions or patches within unconstrained images, such as in open-world object detection, represents a critical yet challenging task in computer vision. Building on the success of powerful image-level vision-language (ViL) foundation models like CLIP, recent efforts have sought to harness their capabilities by either training a contrastive model from scratch with an extensive collection of region-label pairs or aligning the outputs of a detection model with image-level representations of region proposals. Despite notable progress, these approaches are plagued by computationally intensive training requirements, susceptibility to data noise, and deficiency in contextual information. To address these limitations, we explore the synergistic potential of off-the-shelf foundation models, leveraging their respective strengths in localization and semantics. We introduce a novel, generic, and efficient region recognition architecture, named RegionSpot, designed to integrate position-aware localization knowledge from a localization foundation model (e.g., SAM) with semantic information extracted from a ViL model (e.g., CLIP). To fully exploit pretrained knowledge while minimizing training overhead, we keep both foundation models frozen, focusing optimization efforts solely on a lightweight attention-based knowledge integration module. Through extensive experiments in the context of open-world object recognition, our RegionSpot demonstrates significant performance improvements over prior alternatives, while also providing substantial computational savings. For instance, our model can be trained on 3 million data samples in a single day using 8 V100 GPUs. Our model outperforms GLIP by 6.5% in mean average precision (mAP), and by an even larger margin of 14.8% for more challenging and rare categories. This paper introduces RegionSpot, a novel region recognition framework that leverages frozen vision and vision-language foundation models (e.g., SAM and CLIP) for efficient region recognition. Understanding the semantics of individual regions in images is crucial for tasks like open-world object detection. Existing methods suffer from high training costs, data noise susceptibility, and contextual information loss. RegionSpot integrates position-aware tokens from a localization model (SAM) with semantic features from a ViL model (CLIP) using cross-attention. This approach allows for efficient training by keeping both foundation models frozen. RegionSpot significantly outperforms previous methods in open-world object recognition, achieving a 6.5% higher mAP than GLIP. The model demonstrates robustness to noisy region proposals, achieving state-of-the-art performance when using proposals from a ViL detector. RegionSpot exhibits superior training efficiency, requiring substantially fewer GPU hours compared to GLIP and RegionCLIP. The current implementation primarily focuses on object recognition and relies on external object proposals. Future work will explore end-to-end learning to incorporate both object localization and recognition. region recognition, open-world object detection, vision-language models, foundation models, zero-shot learning
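The lightweight knowledge-integration module amounts to a cross-attention layer between the two frozen backbones. A hypothetical sketch, with dimensions, layers, and names chosen for illustration rather than taken from the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionFusionSketch(nn.Module):
    """Sketch of the trainable fusion module: frozen SAM region tokens act as
    queries over frozen CLIP image features, and the attended region features
    are matched against CLIP text embeddings (assumed interfaces and sizes)."""
    def __init__(self, sam_dim=256, clip_dim=1024, num_heads=8):
        super().__init__()
        self.q_proj = nn.Linear(sam_dim, clip_dim)
        self.cross_attn = nn.MultiheadAttention(clip_dim, num_heads, batch_first=True)

    def forward(self, region_tokens, clip_feats, text_embeds):
        # region_tokens: (B, R, sam_dim)  position-aware tokens from frozen SAM
        # clip_feats:    (B, N, clip_dim) patch features from frozen CLIP
        # text_embeds:   (C, clip_dim)    one embedding per category prompt
        q = self.q_proj(region_tokens)
        region_sem, _ = self.cross_attn(q, clip_feats, clip_feats)
        region_sem = F.normalize(region_sem, dim=-1)
        text_embeds = F.normalize(text_embeds, dim=-1)
        return region_sem @ text_embeds.t()       # (B, R, C) region-to-category logits
```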
2311.01090 Report Infusion: Internal Diffusion for Video Inpainting Nicolas Cherel, Andrés Almansa, Yann Gousseau, Alasdair Newson Video inpainting is the task of filling a desired region in a video in a visually convincing manner. It is a very challenging task due to the high dimensionality of the signal and the temporal consistency required for obtaining convincing results. Recently, diffusion models have shown impressive results in modeling complex data distributions, including images and videos. Diffusion models remain nonetheless very expensive to train and perform inference with, which strongly restricts their application to video. We show that in the case of video inpainting, thanks to the highly auto-similar nature of videos, the training of a diffusion model can be restricted to the video to inpaint and still produce very satisfying results. This leads us to adopt an internal learning approach, which also allows for a greatly reduced network size. We call our approach "Infusion": an internal learning algorithm for video inpainting through diffusion. Due to our frugal network, we are able to propose the first video inpainting approach based purely on diffusion. Other methods require supporting elements such as optical flow estimation, which limits their performance in the case of dynamic textures for example. We introduce a new method for efficient training and inference of diffusion models in the context of internal learning. We split the diffusion process into different learning intervals which greatly simplifies the learning steps. We show qualitative and quantitative results, demonstrating that our method reaches state-of-the-art performance, in particular in the case of dynamic backgrounds and textures. Introduces "Infusion," a purely diffusion-based video inpainting approach using internal learning, enabling high-quality video inpainting by training a lightweight network on a single video. Addresses the limitations of existing video inpainting methods, which often struggle with dynamic textures and rely on computationally expensive diffusion models or optical flow estimations. Employs a 3D UNet architecture with a novel "interval training" scheme, where the network is trained on subsets of diffusion timesteps, leading to efficient training and inference. Achieves state-of-the-art performance on video inpainting tasks, particularly excelling in handling dynamic textures. Significantly outperforms competing methods in reconstructing complex dynamic textures, as evidenced by LPIPS and VFID metrics. Demonstrates superior performance in object removal scenarios, as indicated by the VFID metric. Limited temporal receptive field due to the convolutional architecture, potentially impacting long-range temporal consistency. Longer inference times compared to some deep learning-based methods. video inpainting, diffusion models, internal learning, dynamic textures, interval training
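One plausible reading of the interval-training scheme, written as a schematic training loop; the sampling utilities, interval ordering, and hyperparameters below are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def interval_training(denoiser, sample_clip, q_sample, num_steps=1000,
                      num_intervals=10, iters_per_interval=500, lr=1e-4):
    """Schematic of 'interval training': the lightweight denoiser is trained on
    one contiguous range of diffusion timesteps at a time instead of all of
    them at once. `sample_clip()` (random clips from the single video to
    inpaint) and `q_sample(x0, t, noise)` (the usual forward-noising helper)
    are assumed utilities."""
    bounds = torch.linspace(0, num_steps, num_intervals + 1).long().tolist()
    for k in reversed(range(num_intervals)):          # noisiest interval first
        opt = torch.optim.Adam(denoiser.parameters(), lr=lr)
        for _ in range(iters_per_interval):
            x0 = sample_clip()
            t = torch.randint(bounds[k], bounds[k + 1], (x0.shape[0],))
            noise = torch.randn_like(x0)
            x_t = q_sample(x0, t, noise)
            loss = F.mse_loss(denoiser(x_t, t), noise)
            opt.zero_grad()
            loss.backward()
            opt.step()
```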
2311.01015 Report Act As You Wish: Fine-Grained Control of Motion Diffusion Model with Hierarchical Semantic Graphs Peng Jin, Yang Wu, Yanbo Fan, Zhongqian Sun, Yang Wei, Li Yuan Most text-driven human motion generation methods employ sequential modeling approaches, e.g., transformer, to extract sentence-level text representations automatically and implicitly for human motion synthesis. However, these compact text representations may overemphasize the action names at the expense of other important properties and lack fine-grained details to guide the synthesis of subtly distinct motion. In this paper, we propose hierarchical semantic graphs for fine-grained control over motion generation. Specifically, we disentangle motion descriptions into hierarchical semantic graphs including three levels of motions, actions, and specifics. Such global-to-local structures facilitate a comprehensive understanding of motion description and fine-grained control of motion generation. Correspondingly, to leverage the coarse-to-fine topology of hierarchical semantic graphs, we decompose the text-to-motion diffusion process into three semantic levels, which correspond to capturing the overall motion, local actions, and action specifics. Extensive experiments on two benchmark human motion datasets, including HumanML3D and KIT, with superior performances, justify the efficacy of our method. More encouragingly, by modifying the edge weights of hierarchical semantic graphs, our method can continuously refine the generated motion, which may have a far-reaching impact on the community. Code and pre-training weights are available at https://github.com/jpthu17/GraphMotion. This paper proposes GraphMotion, which leverages hierarchical semantic graphs for fine-grained control over text-driven human motion generation. Existing methods often overemphasize action names in text descriptions and lack fine-grained control over generated motions. This work aims to address this by using a more detailed and structured text representation. The authors use semantic role parsing to represent text descriptions as hierarchical semantic graphs with three levels: motions, actions, and specifics. They then design a coarse-to-fine motion diffusion model that progressively generates motion details, guided by the graph structure. GraphMotion outperforms state-of-the-art methods on HumanML3D and KIT benchmarks, demonstrating superior controllability and motion quality. Modifying edge weights in the semantic graph enables fine-tuning of action attributes and durations. The method successfully generates plausible motion even when verbs and action names are masked from the input text. The randomness inherent to diffusion models may occasionally lead to undesirable outputs. The quality of generated motion is limited by the performance of the pre-trained motion variational autoencoder. human motion generation, text-to-motion, diffusion models, semantic graphs, fine-grained control
2311.00990 Report VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning Hong Chen, Xin Wang, Guanning Zeng, Yipeng Zhang, Yuwei Zhou, Feilin Han, Wenwu Zhu Customized text-to-video generation aims to generate text-guided videos with customized user-given subjects, which has gained increasing attention recently. However, existing works are primarily limited to generating videos for a single subject, leaving the more challenging problem of customized multi-subject text-to-video generation largely unexplored. In this paper, we fill this gap and propose a novel VideoDreamer framework. VideoDreamer can generate temporally consistent text-guided videos that faithfully preserve the visual features of the given multiple subjects. Specifically, VideoDreamer leverages the pretrained Stable Diffusion with latent-code motion dynamics and temporal cross-frame attention as the base video generator. The video generator is further customized for the given multiple subjects by the proposed Disen-Mix Finetuning and Human-in-the-Loop Re-finetuning strategy, which can tackle the attribute binding problem of multi-subject generation. We also introduce MultiStudioBench, a benchmark for evaluating customized multi-subject text-to-video generation models. Extensive experiments demonstrate the remarkable ability of VideoDreamer to generate videos with new content such as new events and backgrounds, tailored to the customized multiple subjects. Our project page is available at https://videodreamer23.github.io/. This paper introduces VideoDreamer, a novel framework for customized multi-subject text-to-video generation, capable of generating videos featuring user-specified subjects while adhering to textual prompts. This research addresses the unexplored area of customized multi-subject text-to-video generation, overcoming limitations of previous works that focused primarily on single-subject generation. VideoDreamer leverages a pretrained Stable Diffusion model enhanced with latent-code motion dynamics and temporal cross-frame attention. It employs a Disen-Mix Finetuning strategy to customize the model for multiple subjects, mitigating the attribute binding problem. An optional Human-in-the-Loop Refinement (HLR) strategy can further enhance performance. VideoDreamer demonstrates superior subject fidelity compared to baseline models, effectively preserving visual features of multiple subjects. The Disen-Mix finetuning strategy ensures high textual fidelity, preventing overfitting to subject-irrelevant information in mixed data. VideoDreamer exhibits strong temporal consistency and minimal artifacts in generated videos, surpassing limitations of existing customization methods. The reliance on a single prompt to guide all frames limits the generation of videos with dynamic backgrounds or multiple events. Temporal consistency can be further improved due to the video generator not being pretrained on text-video pairs. text-to-video generation, multi-subject customization, stable diffusion, disen-mix finetuning, human-in-the-loop refinement
2311.00941 Report Gaussian Mixture Solvers for Diffusion Models Hanzhong Guo, Cheng Lu, Fan Bao, Tianyu Pang, Shuicheng Yan, Chao Du, Chongxuan Li Recently, diffusion models have achieved great success in generative tasks. Sampling from diffusion models is equivalent to solving the reverse diffusion stochastic differential equations (SDEs) or the corresponding probability flow ordinary differential equations (ODEs). In comparison, SDE-based solvers can generate samples of higher quality and are suited for image translation tasks like stroke-based synthesis. During inference, however, existing SDE-based solvers are severely constrained by the efficiency-effectiveness dilemma. Our investigation suggests that this is because the Gaussian assumption in the reverse transition kernel is frequently violated (even in the case of simple mixture data) given a limited number of discretization steps. To overcome this limitation, we introduce a novel class of SDE-based solvers called \emph{Gaussian Mixture Solvers (GMS)} for diffusion models. Our solver estimates the first three-order moments and optimizes the parameters of a Gaussian mixture transition kernel using generalized methods of moments in each step during sampling. Empirically, our solver outperforms numerous SDE-based solvers in terms of sample quality in image generation and stroke-based synthesis in various diffusion models, which validates the motivation and effectiveness of GMS. Our code is available at https://github.com/Guohanzhong/GMS. The paper proposes a novel Gaussian Mixture Solver (GMS) for diffusion models, which employs a Gaussian mixture transition kernel in the reverse process to better approximate the true distribution and reduce discretization errors. Existing SDE-based solvers for diffusion models suffer from an efficiency-effectiveness dilemma, especially in tasks like image translation. This is because the Gaussian assumption for the reverse transition kernel often fails under limited discretization steps. GMS utilizes a noise prediction network with multiple heads to estimate high-order moments of the reverse transition kernel. It then fits a Gaussian mixture transition kernel in each sampling step using the generalized method of moments. GMS outperforms numerous SDE-based solvers in terms of sample quality on CIFAR10 and ImageNet 64x64, achieving a 4.44 FID improvement over the state-of-the-art SDE-based solver with 10 steps on CIFAR10. In stroke-based image synthesis, GMS achieves higher realism than existing SDE-based and ODE-based solvers while maintaining comparable computation cost and faithfulness. Theoretical and empirical evidence demonstrate that the true reverse transition kernel deviates from a Gaussian distribution, particularly with fewer sampling steps. GMS still requires more computation time compared to simpler SDE-based solvers, although it shows improvements with the same maximum computation cost. Like other generative models, diffusion models with GMS can potentially generate problematic fake content. diffusion models, gaussian mixture, sde solvers, image generation, stroke-based synthesis
2311.00750 Report Are These the Same Apple? Comparing Images Based on Object Intrinsics Klemen Kotar, Stephen Tian, Hong-Xing Yu, Daniel L. K. Yamins, Jiajun Wu The human visual system can effortlessly recognize an object under different extrinsic factors such as lighting, object poses, and background, yet current computer vision systems often struggle with these variations. An important step to understanding and improving artificial vision systems is to measure image similarity purely based on intrinsic object properties that define object identity. This problem has been studied in the computer vision literature as re-identification, though mostly restricted to specific object categories such as people and cars. We propose to extend it to general object categories, exploring an image similarity metric based on object intrinsics. To benchmark such measurements, we collect the Common paired objects Under differenT Extrinsics (CUTE) dataset of $18,000$ images of $180$ objects under different extrinsic factors such as lighting, poses, and imaging conditions. While existing methods such as LPIPS and CLIP scores do not measure object intrinsics well, we find that combining deep features learned from contrastive self-supervised learning with foreground filtering is a simple yet effective approach to approximating the similarity. We conduct an extensive survey of pre-trained features and foreground extraction methods to arrive at a strong baseline that best measures intrinsic object-centric image similarity among current methods. Finally, we demonstrate that our approach can aid in downstream applications such as acting as an analog for human subjects and improving generalizable re-identification. Please see our project website at https://s-tian.github.io/projects/cute/ for visualizations of the data and demos of our metric. This paper introduces a new dataset and a simple but effective method for measuring the visual similarity of general objects based solely on their intrinsic properties, aiming to mimic human perception's robustness to extrinsic factors like lighting, pose, and background. This contribution is important because current computer vision systems struggle to generalize across varying conditions, and measuring intrinsic object similarity is crucial for improving robustness and developing AI systems that understand the visual world like humans. The authors collect a new dataset called CUTE, containing 18,000 images of 180 objects under controlled and in-the-wild conditions with varying extrinsics. They propose a method called Foreground Feature Averaging (FFA), which combines foreground filtering with deep features learned from contrastive self-supervised learning (DINOv2), and benchmark its performance against existing metrics. FFA outperforms prior image similarity methods like LPIPS, SSIM, and CLIPScore, especially in challenging in-the-wild settings. FFA demonstrates better alignment with human perception in a qualitative study where participants preferred object orderings generated by FFA compared to LPIPS and CLIPScore. FFA, when combined with existing vehicle re-identification models, improves their generalization ability across different datasets. The proposed method is currently focused and evaluated on images containing single objects, limiting its applicability to more complex scenes. The CUTE dataset, while diverse, has limitations in terms of object occlusion, pose variation relative to the camera, and geographical diversity in capture conditions. object similarity, intrinsic image similarity, self-supervised learning, computer vision, object recognition
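The FFA metric itself is simple enough to sketch in a few lines; the patch-encoder and foreground-mask interfaces below are stand-ins for whatever backbone and segmenter are actually used:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ffa_similarity(img_a, img_b, patch_encoder, foreground_mask):
    """Sketch of Foreground Feature Averaging: average a self-supervised ViT's
    patch features over the estimated foreground, then compare the two pooled
    vectors with cosine similarity. `patch_encoder(img) -> (B, N, D)` (e.g. a
    DINOv2-style backbone) and `foreground_mask(img) -> (B, N)` are assumed
    interfaces, not the authors' exact code."""
    def pooled(img):
        tokens = patch_encoder(img)                      # (B, N, D) patch features
        mask = foreground_mask(img).unsqueeze(-1)        # (B, N, 1) foreground weights
        feat = (tokens * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)
        return F.normalize(feat, dim=-1)
    return (pooled(img_a) * pooled(img_b)).sum(dim=-1)   # cosine similarity per pair
```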
2311.00618 Report De-Diffusion Makes Text a Strong Cross-Modal Interface Chen Wei, Chenxi Liu, Siyuan Qiao, Zhishuai Zhang, Alan Yuille, Jiahui Yu We demonstrate text as a strong cross-modal interface. Rather than relying on deep embeddings to connect image and language as the interface representation, our approach represents an image as text, from which we enjoy the interpretability and flexibility inherent to natural language. We employ an autoencoder that uses a pre-trained text-to-image diffusion model for decoding. The encoder is trained to transform an input image into text, which is then fed into the fixed text-to-image diffusion decoder to reconstruct the original input -- a process we term De-Diffusion. Experiments validate both the precision and comprehensiveness of De-Diffusion text representing images, such that it can be readily ingested by off-the-shelf text-to-image tools and LLMs for diverse multi-modal tasks. For example, a single De-Diffusion model can generalize to provide transferable prompts for different text-to-image tools, and also achieves a new state of the art on open-ended vision-language tasks by simply prompting large language models with few-shot examples. This paper introduces De-Diffusion, a novel approach using text as a cross-modal interface for images by representing them as “scrambled captions” that are both precise and comprehensive. This approach leverages the flexibility and interpretability of natural language and bypasses the need for deep embedding adaptation in multi-modal tasks. The method uses an autoencoder architecture with a pre-trained text-to-image diffusion model as the decoder and trains the encoder to convert images into text descriptions. De-Diffusion text enables transferable prompts for different text-to-image tools, outperforming human captions in reconstruction quality. It allows off-the-shelf LLMs to perform open-ended visual question answering with state-of-the-art results in few-shot settings. De-Diffusion text facilitates multi-modal dialogue with chatbots and enables novel applications like text-based image blending. The quality of De-Diffusion text relies on the performance of the pre-trained text-to-image model used as the decoder. Further exploration of techniques to improve coherence and reduce redundancy in the generated text descriptions is needed. cross-modal interface, text representation, de-diffusion, text-to-image generation, vision-language tasks
2311.00571 Report LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing Wei-Ge Chen, Irina Spiridonova, Jianwei Yang, Jianfeng Gao, Chunyuan Li LLaVA-Interactive is a research prototype for multimodal human-AI interaction. The system can have multi-turn dialogues with human users by taking multimodal user inputs and generating multimodal responses. Importantly, LLaVA-Interactive goes beyond language prompt, where visual prompt is enabled to align human intents in the interaction. The development of LLaVA-Interactive is extremely cost-efficient as the system combines three multimodal skills of pre-built AI models without additional model training: visual chat of LLaVA, image segmentation from SEEM, as well as image generation and editing from GLIGEN. A diverse set of application scenarios is presented to demonstrate the promises of LLaVA-Interactive and to inspire future research in multimodal interactive systems. LLaVA-Interactive is an open-source research prototype system for multimodal human-AI interaction, enabling multi-turn dialogues with multimodal inputs and responses, including visual prompts. It addresses the limitations of existing LMMs like GPT-4V, which primarily focus on language-based interaction and lack visual prompting, hindering the development of open-source multimodal AI agents. LLaVA-Interactive leverages pre-built AI models without additional training, combining the visual chat capabilities of LLaVA, image segmentation from SEEM, and image generation/editing from GLIGEN. Supports flexible visual prompts like strokes, drag-and-drop, and bounding boxes for tasks involving segmentation, generation, and editing. Demonstrates enhanced user interaction and enables novel application scenarios, such as aiding photographic artists and co-creating visual scenes. Highlights the potential of composing pre-trained models for building general-purpose assistants without extensive training. Capabilities limited by the performance of individual pre-trained models. Lack of emergent skills arising from latent task composition, as it relies on the existing abilities of individual models. multimodal ai, visual prompting, human-ai interaction, image segmentation, image generation
2311.00457 Report Single-view 3D Scene Reconstruction with High-fidelity Shape and Texture Yixin Chen, Junfeng Ni, Nan Jiang, Yaowei Zhang, Yixin Zhu, Siyuan Huang Reconstructing detailed 3D scenes from single-view images remains a challenging task due to limitations in existing approaches, which primarily focus on geometric shape recovery, overlooking object appearances and fine shape details. To address these challenges, we propose a novel framework for simultaneous high-fidelity recovery of object shapes and textures from single-view images. Our approach utilizes the proposed Single-view neural implicit Shape and Radiance field (SSR) representations to leverage both explicit 3D shape supervision and volume rendering of color, depth, and surface normal images. To overcome shape-appearance ambiguity under partial observations, we introduce a two-stage learning curriculum incorporating both 3D and 2D supervisions. A distinctive feature of our framework is its ability to generate fine-grained textured meshes while seamlessly integrating rendering capabilities into the single-view 3D reconstruction model. This integration enables not only improved textured 3D object reconstruction by 27.7% and 11.6% on the 3D-FRONT and Pix3D datasets, respectively, but also supports the rendering of images from novel viewpoints. Beyond individual objects, our approach facilitates composing object-level representations into flexible scene representations, thereby enabling applications such as holistic scene understanding and 3D scene editing. We conduct extensive experiments to demonstrate the effectiveness of our method. A novel framework for reconstructing high-fidelity 3D shapes and textures from single-view images using neural implicit shape and radiance field representations. Single-view 3D reconstruction is crucial for machines to understand and interact with the 3D world, with applications in VR/AR and robotics. Existing methods often neglect object textures and struggle to capture fine shape details. The framework leverages both 3D shape supervision (SDF) and volume rendering of color, depth, and normal images. It utilizes a two-stage learning curriculum to overcome shape-appearance ambiguity under partial observations. Pixel-aligned and instance-aligned features are used for SDF and color prediction, respectively. Achieves state-of-the-art performance on 3D object reconstruction benchmarks, with significant improvement in capturing fine-grained shape details and textures. Enables rendering of color, depth, and normal images from novel viewpoints, showcasing its capability in novel view synthesis and single-view depth/normal estimation. Demonstrates potential for holistic scene understanding and 3D scene editing applications by composing object-level representations into flexible scene representations. Struggles with reconstructing objects with thin surfaces and severe occlusion. Generalization to unseen object categories remains a challenge. Future work includes integrating unsigned distance fields and incorporating large-scale 2D/3D priors for improved generalizability. single-view reconstruction, 3d object reconstruction, neural implicit representation, volume rendering, scene editing
2311.00213 Report Consistent Video-to-Video Transfer Using Synthetic Dataset Jiaxin Cheng, Tianjun Xiao, Tong He We introduce a novel and efficient approach for text-based video-to-video editing that eliminates the need for resource-intensive per-video-per-model finetuning. At the core of our approach is a synthetic paired video dataset tailored for video-to-video transfer tasks. Inspired by Instruct Pix2Pix's image transfer via editing instruction, we adapt this paradigm to the video domain. Extending the Prompt-to-Prompt to videos, we efficiently generate paired samples, each with an input video and its edited counterpart. Alongside this, we introduce the Long Video Sampling Correction during sampling, ensuring consistent long videos across batches. Our method surpasses current methods like Tune-A-Video, heralding substantial progress in text-based video-to-video editing and suggesting exciting avenues for further exploration and deployment. This paper introduces Instruct Video-to-Video, a novel diffusion-based model for text-based video-to-video editing that eliminates the need for per-video-per-model finetuning. Existing text-based video editing approaches suffer from limitations such as requiring resource-intensive per-video-per-model finetuning and demanding users to describe both the original and target video. The paper proposes a synthetic paired video dataset tailored for video-to-video transfer tasks, generated using a large language model and a video diffusion model adapted from the Prompt-to-Prompt method. It also introduces Long Video Sampling Correction (LVSC) to ensure consistency across extended video sequences. The approach eliminates the need for per-video-per-model finetuning, enabling a universal one-model-all-video transfer. It simplifies user interaction by requiring only an intuitive editing prompt. The proposed method outperforms existing techniques like Tune-A-Video in text-based video editing, as demonstrated through user studies and automated metrics. The model may struggle with videos containing objects that are difficult to detect due to size, positioning, or occlusion. Future work includes exploring the generation of longer videos and improving the model's ability to handle complex editing scenarios. video editing, diffusion models, synthetic data, prompt-to-prompt, long video generation
2311.00047 Report Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans? Yichi Zhang, Jiayi Pan, Yuchen Zhou, Rui Pan, Joyce Chai Vision-Language Models (VLMs) are trained on vast amounts of data captured by humans emulating our understanding of the world. However, known as visual illusions, human's perception of reality isn't always faithful to the physical world. This raises a key question: do VLMs have the similar kind of illusions as humans do, or do they faithfully learn to represent reality? To investigate this question, we build a dataset containing five types of visual illusions and formulate four tasks to examine visual illusions in state-of-the-art VLMs. Our findings have shown that although the overall alignment is low, larger models are closer to human perception and more susceptible to visual illusions. Our dataset and initial findings will promote a better understanding of visual illusions in humans and machines and provide a stepping stone for future computational models that can better align humans and machines in perceiving and communicating about the shared visual world. The code and data are available at https://github.com/vl-illusion/dataset. This paper introduces Grounding Visual Illusion in Language (GVIL), the first dataset for evaluating how well machines recognize and respond to visual illusions in language. Understanding how well machines perceive visual illusions, a phenomenon inherent in human vision, is crucial for improving human-machine alignment in tasks involving vision and language. The authors created GVIL, encompassing five illusion categories and four benchmark tasks: Same-Different Question Answering, Referential Question Answering, Attribute Question Answering, and Referential Localization. Four state-of-the-art vision-language models with varying sizes were evaluated. Larger models demonstrate a stronger tendency towards humanlike illusion recognition and are more likely to align with human responses under illusion contexts. While models show promising alignment in object localization under illusions, they struggle with visual question-answering tasks. The degree of alignment between machine and human responses varies across different categories of visual illusions. The current dataset size is modest, limiting the generalizability of the findings. Further research is needed to understand the discrepancy in model performance across different tasks. visual illusion, vision-language models, human-machine alignment, dataset, benchmark
2310.20700 Report SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, Ziwei Liu Recently video generation has achieved substantial progress with realistic results. Nevertheless, existing AI-generated videos are usually very short clips ("shot-level") depicting a single scene. To deliver a coherent long video ("story-level"), it is desirable to have creative transition and prediction effects across different clips. This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction. The goal is to generate high-quality long videos with smooth and creative transitions between scenes and varying lengths of shot-level videos. Specifically, we propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions. By providing the images of different scenes as inputs, combined with text-based control, our model generates transition videos that ensure coherence and visual quality. Furthermore, the model can be readily extended to various tasks such as image-to-video animation and autoregressive video prediction. To conduct a comprehensive evaluation of this new generative task, we propose three assessing criteria for smooth and creative transition: temporal consistency, semantic similarity, and video-text semantic alignment. Extensive experiments validate the effectiveness of our approach over existing methods for generative transition and prediction, enabling the creation of story-level long videos. Project page: https://vchitect.github.io/SEINE-project/ . This paper introduces SEINE, a short-to-long (S2L) video diffusion model that focuses on generating high-quality long videos with smooth and creative transitions between scenes and varying lengths of shot-level videos using a random-mask approach conditioned on text descriptions. Existing AI-generated videos are typically short clips depicting a single scene, while longer, story-level videos require creative transitions and prediction effects across different clips for coherent storytelling. SEINE employs a random-mask diffusion model that leverages text descriptions and visible conditional frames to generate unseen transition and prediction frames, enabling smooth and creative scene connections and video extension. SEINE outperforms comparison methods in generating transitions based on metrics like temporal coherence, semantic similarity, and video-text alignment. SEINE demonstrates diverse and controllable transition generation, enabling variations in transition style and camera movement control. SEINE shows promising results in long video generation through auto-regressive prediction and image-to-video animation, expanding its applicability. Limitations include the need for similarity between source and target scenes for smooth transitions and potential text-video unalignment. Future work focuses on improving text-image alignment, addressing limitations related to scene similarity, and mitigating watermark generation. video generation, diffusion models, generative transition, video prediction, text-to-video
2310.20649 Report Dynamic Batch Norm Statistics Update for Natural Robustness Shahbaz Rezaei, Mohammad Sadegh Norouzzadeh DNNs trained on natural clean samples have been shown to perform poorly on corrupted samples, such as noisy or blurry images. Various data augmentation methods have been recently proposed to improve DNN's robustness against common corruptions. Despite their success, they require computationally expensive training and cannot be applied to off-the-shelf trained models. Recently, it has been shown that updating BatchNorm (BN) statistics of an off-the-shelf model on a single corruption improves its accuracy on that corruption significantly. However, adopting the idea at inference time when the type of corruption is unknown and changing decreases the effectiveness of this method. In this paper, we harness the Fourier domain to detect the corruption type, a challenging task in the image domain. We propose a unified framework consisting of a corruption-detection model and BN statistics update that improves the corruption accuracy of any off-the-shelf trained model. We benchmark our framework on different models and datasets. Our results demonstrate about 8% and 4% accuracy improvement on CIFAR10-C and ImageNet-C, respectively. Furthermore, our framework can further improve the accuracy of state-of-the-art robust models, such as AugMix and DeepAug. This paper presents a framework to improve the robustness of pre-trained vision models against corrupted images, by dynamically updating Batch Normalization (BN) statistics based on the detected corruption type. DNNs are known to be vulnerable to image corruptions. Existing data augmentation methods to address this are computationally expensive and cannot be applied to already trained models. This work offers a computationally light-weight alternative to improve robustness of off-the-shelf models. The framework utilizes a corruption type detection model, trained on the Fourier spectrum of images, to identify the corruption present in an input image. Based on the detected corruption, the BN statistics of the pre-trained model are updated with pre-computed values specific to that corruption type, fetched from a lookup table. Achieves around 8% and 4% accuracy improvement on CIFAR10-C and ImageNet-C, respectively, compared to the base model. Outperforms the inference-time adaptation of previous BN update methods when the corruption type dynamically changes. Can be applied to existing state-of-the-art robust models, like AugMix and DeepAug, and further enhance their performance. Requires data samples from all corruption types during training to construct the corruption detection model and the BN statistics lookup table. Performance improvement is limited by the effectiveness of the BN statistics update method for the specific corruption type. robustness, image corruption, batch normalization, fourier domain, domain adaptation
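At inference time the framework reduces to three steps: compute the Fourier amplitude spectrum, classify the corruption, and copy the matching BatchNorm statistics into the frozen classifier. A hedged sketch, with the detector interface and lookup-table layout assumed for illustration:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def adapt_bn_and_predict(classifier, corruption_detector, bn_lookup, images):
    """Sketch of the framework: detect the corruption type from the Fourier
    amplitude spectrum, copy the matching pre-computed BatchNorm running
    statistics into the off-the-shelf classifier, then predict as usual.
    `corruption_detector` (a small model trained on spectra) and the layout of
    `bn_lookup` ({corruption_id: {layer_name: (mean, var)}}) are assumptions."""
    spectrum = torch.fft.fft2(images.mean(dim=1)).abs().log1p()   # (B, H, W) log-amplitude
    corruption_id = int(corruption_detector(spectrum).argmax(dim=-1)[0])
    stats = bn_lookup[corruption_id]
    for name, module in classifier.named_modules():
        if isinstance(module, nn.BatchNorm2d) and name in stats:
            mean, var = stats[name]
            module.running_mean.copy_(mean)
            module.running_var.copy_(var)
    return classifier(images)
```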
2310.20550 Report CapsFusion: Rethinking Image-Text Data at Scale Qiying Yu, Quan Sun, Xiaosong Zhang, Yufeng Cui, Fan Zhang, Yue Cao, Xinlong Wang, Jingjing Liu Large multimodal models demonstrate remarkable generalist ability to perform diverse multimodal tasks in a zero-shot manner. Large-scale web-based image-text pairs contribute fundamentally to this success, but suffer from excessive noise. Recent studies use alternative captions synthesized by captioning models and have achieved notable benchmark performance. However, our experiments reveal significant Scalability Deficiency and World Knowledge Loss issues in models trained with synthetic captions, which have been largely obscured by their initial benchmark success. Upon closer examination, we identify the root cause as the overly-simplified language structure and lack of knowledge details in existing synthetic captions. To provide higher-quality and more scalable multimodal pretraining data, we propose CapsFusion, an advanced framework that leverages large language models to consolidate and refine information from both web-based image-text pairs and synthetic captions. Extensive experiments show that CapsFusion captions exhibit remarkable all-round superiority over existing captions in terms of model performance (e.g., 18.8 and 18.3 improvements in CIDEr score on COCO and NoCaps), sample efficiency (requiring 11-16 times less computation than baselines), world knowledge depth, and scalability. These effectiveness, efficiency and scalability advantages position CapsFusion as a promising candidate for future scaling of LMM training. This paper introduces CapsFusion, a novel framework that leverages large language models (LLMs) to refine large-scale image-text data for improved training of large multimodal models (LMMs). Existing methods for generating image-text training data, such as using web-based pairs or synthetic captions, suffer from either excessive noise or a lack of real-world knowledge and scalability. CapsFusion uses a captioning model to generate synthetic captions and then employs ChatGPT to fuse them with web-based captions, extracting real-world knowledge while maintaining structure. To ensure scalability, a fine-tuned LLaMA model is used for large-scale caption fusion. CapsFusion captions significantly outperform raw, synthetic, and mixed captions in LMM training, achieving substantial improvements in CIDEr scores on multiple benchmarks. CapsFusion demonstrates superior sample efficiency, requiring 11-16 times less computation to reach similar performance levels as baseline captions. LMMs trained on CapsFusion captions exhibit richer world knowledge compared to those trained on synthetic captions, as evidenced by their ability to identify celebrities, artworks, and locations. The caption fusion process relies on heuristics and could benefit from further exploration of automatic quality control mechanisms. Future work can explore the generalization of CapsFusion to other modalities beyond image-text pairs. large multimodal models, image captioning, data augmentation, large language models, world knowledge
2310.19909 Report Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks Micah Goldblum, Hossein Souri, Renkun Ni, Manli Shu, Viraj Prabhu, Gowthami Somepalli, Prithvijit Chattopadhyay, Mark Ibrahim, Adrien Bardes, Judy Hoffman, Rama Chellappa, Andrew Gordon Wilson, Tom Goldstein Neural network based computer vision systems are typically built on a backbone, a pretrained or randomly initialized feature extractor. Several years ago, the default option was an ImageNet-trained convolutional neural network. However, the recent past has seen the emergence of countless backbones pretrained using various algorithms and datasets. While this abundance of choice has led to performance increases for a range of systems, it is difficult for practitioners to make informed decisions about which backbone to choose. Battle of the Backbones (BoB) makes this choice easier by benchmarking a diverse suite of pretrained models, including vision-language models, those trained via self-supervised learning, and the Stable Diffusion backbone, across a diverse set of computer vision tasks ranging from classification to object detection to OOD generalization and more. Furthermore, BoB sheds light on promising directions for the research community to advance computer vision by illuminating strengths and weaknesses of existing approaches through a comprehensive analysis conducted on more than 1500 training runs. While vision transformers (ViTs) and self-supervised learning (SSL) are increasingly popular, we find that convolutional neural networks pretrained in a supervised fashion on large training sets still perform best on most tasks among the models we consider. Moreover, in apples-to-apples comparisons on the same architectures and similarly sized pretraining datasets, we find that SSL backbones are highly competitive, indicating that future works should perform SSL pretraining with advanced architectures and larger pretraining datasets. We release the raw results of our experiments along with code that allows researchers to put their own backbones through the gauntlet here: https://github.com/hsouri/Battle-of-the-Backbones This paper presents "Battle of the Backbones" (BoB), a benchmark comparing diverse pretrained computer vision backbones across a wide range of tasks including classification, object detection, out-of-distribution generalization, and image retrieval. The abundance of pretrained backbone models makes it difficult for practitioners to choose the best option. BoB aims to guide practitioners and researchers by providing a comprehensive evaluation of backbones and identifying strengths and weaknesses of existing approaches. The authors benchmark publicly available pretrained models with different architectures (CNNs, ViTs, Swin Transformers, Stable Diffusion encoder), pretraining algorithms (supervised, self-supervised, vision-language), and pretraining datasets (ImageNet, LAION, CLIP dataset, depth datasets). They evaluate these backbones on a diverse set of tasks using various learning protocols (fine-tuning, linear probing, frozen backbone) and report performance using standard metrics for each task. Supervised ConvNeXt-Base, SwinV2-Base (trained on ImageNet-21k), and CLIP ViT-Base consistently rank among the top performers across various tasks and settings. Supervised pretraining generally yields superior results, largely due to being trained on larger datasets. However, self-supervised or vision-language pretrained models perform better when comparing backbones trained on similar-sized datasets. Performance across tasks is highly correlated, suggesting the possibility of developing universal backbones suitable for various computer vision tasks. The insights are limited by the specific tasks, backbones, and settings considered in the benchmark. Larger backbone models (beyond ConvNeXt-Base) were not included, potentially affecting the ranking, especially for transformers which benefit more from scale. backbone, benchmark, computer vision, self-supervised learning, vision-language models
2310.19776 Report Learn to Categorize or Categorize to Learn? Self-Coding for Generalized Category Discovery Sarah Rastegar, Hazel Doughty, Cees G. M. Snoek In the quest for unveiling novel categories at test time, we confront the inherent limitations of traditional supervised recognition models that are restricted by a predefined category set. While strides have been made in the realms of self-supervised and open-world learning towards test-time category discovery, a crucial yet often overlooked question persists: what exactly delineates a category? In this paper, we conceptualize a category through the lens of optimization, viewing it as an optimal solution to a well-defined problem. Harnessing this unique conceptualization, we propose a novel, efficient and self-supervised method capable of discovering previously unknown categories at test time. A salient feature of our approach is the assignment of minimum length category codes to individual data instances, which encapsulates the implicit category hierarchy prevalent in real-world datasets. This mechanism affords us enhanced control over category granularity, thereby equipping our model to handle fine-grained categories adeptly. Experimental evaluations, bolstered by state-of-the-art benchmark comparisons, testify to the efficacy of our solution in managing unknown categories at test time. Furthermore, we fortify our proposition with a theoretical foundation, providing proof of its optimality. Our code is available at https://github.com/SarahRastegar/InfoSieve. This paper proposes InfoSieve, a novel self-supervised method for discovering unknown categories at test time by conceptualizing 'category' as an optimization problem solution. Traditional supervised models struggle with open-world recognition due to the lack of a clear 'category' definition. This work addresses label inconsistencies, incorporates category hierarchies, and tackles open-world recognition by learning category codes instead of relying on predefined labels. The proposed method uses algorithmic and Shannon information theory to define an optimization problem for finding optimal category codes. It leverages contrastive learning to maximize mutual information between input data and binary category codes, while minimizing code length to reduce search space. A masking mechanism allows for flexible handling of category granularity. InfoSieve outperforms state-of-the-art methods in generalized category discovery on fine-grained datasets. The method demonstrates robustness to different category granularities and long-tailed data distributions. Qualitative analysis reveals the model's ability to learn an implicit category hierarchy from the data. The approach assumes an implicit hierarchical tree underlying categorization, which may not always hold true. The current implementation requires unlabeled data from unknown categories during training. generalized category discovery, novel class discovery, self-supervised learning, information theory, contrastive learning
2310.19731 Report ViR: Towards Efficient Vision Retention Backbones Ali Hatamizadeh, Michael Ranzinger, Shiyi Lan, Jose M. Alvarez, Sanja Fidler, Jan Kautz Vision Transformers (ViTs) have attracted a lot of popularity in recent years, due to their exceptional capabilities in modeling long-range spatial dependencies and scalability for large scale training. Although the training parallelism of self-attention mechanism plays an important role in retaining great performance, its quadratic complexity baffles the application of ViTs in many scenarios which demand fast inference. This effect is even more pronounced in applications in which autoregressive modeling of input features is required. In Natural Language Processing (NLP), a new stream of efforts has proposed parallelizable models with recurrent formulation that allows for efficient inference in generative applications. Inspired by this trend, we propose a new class of computer vision models, dubbed Vision Retention Networks (ViR), with dual parallel and recurrent formulations, which strike an optimal balance between fast inference and parallel training with competitive performance. In particular, ViR scales favorably for image throughput and memory consumption in tasks that require higher-resolution images due to its flexible formulation in processing large sequence lengths. The ViR is the first attempt to realize dual parallel and recurrent equivalency in a general vision backbone for recognition tasks. We have validated the effectiveness of ViR through extensive experiments with different dataset sizes and various image resolutions and achieved competitive performance. Code: https://github.com/NVlabs/ViR The paper introduces Vision Retention Networks (ViR), a novel computer vision model architecture that leverages both parallel and recurrent formulations, enabling efficient inference for tasks requiring high-resolution images. ViTs excel in capturing long-range dependencies but suffer from quadratic complexity, making them slow for high-resolution image processing. ViR addresses this limitation by introducing a recurrent formulation that enables fast inference without compromising accuracy. ViR utilizes a retention mechanism with dual parallel and recurrent representations. The recurrent mode processes tokens sequentially, reducing complexity for long sequences. A hybrid chunkwise mode combines parallel and recurrent processing for optimal performance. ViR achieves competitive performance on ImageNet-1K classification benchmarks, outperforming other ViT-based models in terms of accuracy and throughput. ViR with 2D retention demonstrates superior performance for downstream tasks like object detection and semantic segmentation on MS COCO and ADE20K datasets. ViR exhibits favorable scaling characteristics for high-resolution images, achieving higher throughput and utilizing memory more efficiently than ViTs, especially for larger batch sizes. Exploration of relative position embeddings in two dimensions for potential performance improvement. Extension of ViR to other vision tasks beyond recognition, leveraging its efficiency for high-resolution image processing. vision transformer, recurrent neural network, efficient inference, high-resolution images, computer vision
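ViR's central claim is the parallel/recurrent equivalence of retention. The sketch below illustrates that duality for a single head, with no gating, normalization, or multi-head structure; shapes and the decay value are illustrative rather than the paper's configuration.

```python
import torch

def retention_parallel(q, k, v, gamma):
    # Parallel form: decay-masked attention-like product, D[n, m] = gamma**(n-m) for m <= n.
    n = q.shape[0]
    idx = torch.arange(n)
    D = (gamma ** (idx[:, None] - idx[None, :])) * (idx[:, None] >= idx[None, :])
    return (q @ k.T * D) @ v

def retention_recurrent(q, k, v, gamma):
    # Recurrent form: S_n = gamma * S_{n-1} + k_n^T v_n, o_n = q_n S_n -- constant state per step.
    S = torch.zeros(q.shape[1], q.shape[1])
    outs = []
    for n in range(q.shape[0]):
        S = gamma * S + k[n:n + 1].T @ v[n:n + 1]
        outs.append(q[n:n + 1] @ S)
    return torch.cat(outs, dim=0)

torch.manual_seed(0)
q, k, v = torch.randn(3, 8, 16).unbind(0)   # toy sequence of 8 tokens, dim 16
assert torch.allclose(retention_parallel(q, k, v, 0.9),
                      retention_recurrent(q, k, v, 0.9), atol=1e-4)
```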
2310.19540 Report IterInv: Iterative Inversion for Pixel-Level T2I Models Chuanming Tang, Kai Wang, Joost van de Weijer Large-scale text-to-image diffusion models have been a ground-breaking development in generating convincing images following an input text prompt. The goal of image editing research is to give users control over the generated images by modifying the text prompt. Current image editing techniques predominantly hinge on DDIM inversion as a prevalent practice rooted in Latent Diffusion Models (LDM). However, the large pretrained T2I models working on the latent space suffer from losing details due to the first compression stage with an autoencoder mechanism. Instead, other mainstream T2I pipelines working on the pixel level, such as Imagen and DeepFloyd-IF, circumvent the above problem. They are commonly composed of multiple stages, typically starting with a text-to-image stage and followed by several super-resolution stages. In this pipeline, DDIM inversion fails to find the initial noise and generate the original image, given that the super-resolution diffusion models are not compatible with the DDIM technique. According to our experimental findings, iteratively concatenating the noisy image as the condition is the root of this problem. Based on this observation, we develop an iterative inversion (IterInv) technique for this category of T2I models and verify IterInv with the open-source DeepFloyd-IF model. Specifically, IterInv employs NTI for the inversion and reconstruction of the low-resolution generation stage. In stages 2 and 3, we update the latent variance at each timestep to find the deterministic inversion trace and promote the reconstruction process. By combining our method with a popular image editing method, we demonstrate the application prospects of IterInv. The code is available at https://github.com/Tchuanm/IterInv.git. This paper introduces IterInv, a novel iterative inversion technique for pixel-level Text-to-Image (T2I) diffusion models like DeepFloyd-IF, addressing the limitations of DDIM inversion in such models. Existing text-guided image editing methods rely on latent diffusion models (LDMs) that often lead to detail loss. Pixel-level T2I models offer a solution but lack effective inversion techniques for accurate real image reconstruction, hindering editing capabilities. IterInv leverages Null-Text Inversion (NTI) with classifier-free guidance and iteratively optimizes latent variance at each timestep to find a deterministic inversion trace, enabling accurate image reconstruction in the pixel space. IterInv demonstrates superior reconstruction quality compared to DDIM inversion across various stages of DeepFloyd-IF, achieving results comparable to SDXL's autoencoder. The method exhibits robustness to classifier-guidance scale variations, ensuring consistent performance. Combining IterInv with DiffEdit enables effective text-guided image editing on DeepFloyd-IF, opening possibilities for advanced editing techniques in pixel-level diffusion models. The current study focuses solely on the DeepFloyd model, limiting the generalizability of IterInv. The compatibility of IterInv with other image editing methods beyond DiffEdit remains unexplored. image inversion, image reconstruction, image editing, text-to-image, pixel diffusion
2310.19512 Report VideoCrafter1: Open Diffusion Models for High-Quality Video Generation Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, Ying Shan Video generation has increasingly gained interest in both academia and industry. Although commercial tools can generate plausible videos, there is a limited number of open-source models available for researchers and engineers. In this work, we introduce two diffusion models for high-quality video generation, namely text-to-video (T2V) and image-to-video (I2V) models. T2V models synthesize a video based on a given text input, while I2V models incorporate an additional image input. Our proposed T2V model can generate realistic and cinematic-quality videos with a resolution of $1024 \times 576$, outperforming other open-source T2V models in terms of quality. The I2V model is designed to produce videos that strictly adhere to the content of the provided reference image, preserving its content, structure, and style. This model is the first open-source I2V foundation model capable of transforming a given image into a video clip while maintaining content preservation constraints. We believe that these open-source video generation models will contribute significantly to the technological advancements within the community. This paper introduces two open-source diffusion models for video generation: a text-to-video (T2V) model and an image-to-video (I2V) model. Existing open-source video generation models have limitations in quality, resolution, and content preservation, while commercial models are not accessible for research. The T2V model extends Stable Diffusion with temporal attention and joint image-video training. The I2V model incorporates a CLIP-based image embedding branch into the T2V architecture. The T2V model generates high-quality videos (1024x576 resolution) with cinematic quality, outperforming other open-source models. The I2V model is the first open-source model capable of strictly preserving content and structure of the input image while animating it. Both models demonstrate superior performance compared to existing open-source alternatives and achieve comparable results to some commercial models. The current models are limited to 2-second video generation. Further improvements in motion quality, resolution, and success rate are needed. video generation, diffusion models, text-to-video, image-to-video, open-source
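The temporal extension of a frozen image U-Net can be sketched as a residual attention block that attends across frames at every spatial location; the block below is illustrative (single layer, arbitrary channel counts), not VideoCrafter's actual architecture.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Sketch of a temporal attention block of the kind inserted into an image
    diffusion U-Net to extend it to video: attention runs across the frame axis,
    independently for every spatial location."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                        # x: (batch, frames, channels, H, W)
        b, f, c, h, w = x.shape
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)
        normed = self.norm(tokens)
        out, _ = self.attn(normed, normed, normed)
        out = out.reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)
        return x + out                           # residual so the image weights still apply

block = TemporalAttention(dim=64)
video_feats = torch.randn(2, 8, 64, 16, 16)      # 8 frames of 16x16 feature maps
print(block(video_feats).shape)                  # torch.Size([2, 8, 64, 16, 16])
```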
2310.19464 Report Generative Neural Fields by Mixtures of Neural Implicit Functions Tackgeun You, Mijeong Kim, Jungtaek Kim, Bohyung Han We propose a novel approach to learning the generative neural fields represented by linear combinations of implicit basis networks. Our algorithm learns basis networks in the form of implicit neural representations and their coefficients in a latent space by either conducting meta-learning or adopting auto-decoding paradigms. The proposed method easily enlarges the capacity of generative neural fields by increasing the number of basis networks while maintaining the size of a network for inference to be small through their weighted model averaging. Consequently, sampling instances using the model is efficient in terms of latency and memory footprint. Moreover, we customize denoising diffusion probabilistic model for a target task to sample latent mixture coefficients, which allows our final model to generate unseen data effectively. Experiments show that our approach achieves competitive generation performance on diverse benchmarks for images, voxel data, and NeRF scenes without sophisticated designs for specific modalities and domains. This paper proposes mNIF, a novel method for learning generative neural fields using linear combinations of implicit basis networks (INRs). mNIF offers a more efficient and scalable approach to represent complex data distributions in various domains (images, voxels, NeRF scenes) compared to existing generative neural field methods. The method learns a set of basis INRs and a latent space of mixture coefficients. Two training stages are employed: (1) context adaptation via meta-learning or auto-decoding to optimize basis networks and mixture coefficients for reconstruction, and (2) task-specific generalization using a denoising diffusion probabilistic model for sampling unseen data. mNIF achieves competitive or state-of-the-art generation quality on image, voxel, and NeRF scene benchmarks. The method exhibits significantly better inference efficiency (smaller model size and faster speed) than existing methods. Analysis reveals the learned latent space captures smooth data manifold and benefits from increasing mixture components and latent dimensionality. Limited scalability beyond fine-grained datasets is observed, potentially due to the limitations of the SIREN architecture used. Future work will focus on incorporating local information and exploring alternative architectures to enhance performance on diverse datasets. generative neural fields, implicit neural representations, mixture of experts, denoising diffusion probabilistic models, meta-learning
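The weighted model averaging at the heart of mNIF can be sketched as follows: each instance is a coefficient vector over shared basis networks, and inference evaluates one small averaged MLP. Layer sizes, the SIREN frequency, and the random coefficients are placeholders.

```python
import torch

def make_basis(num_basis, in_dim=2, hidden=32, out_dim=3):
    # One weight/bias tensor per layer, with a leading basis dimension.
    shapes = [(in_dim, hidden), (hidden, hidden), (hidden, out_dim)]
    return [(torch.randn(num_basis, i, o) * 0.1, torch.zeros(num_basis, o))
            for i, o in shapes]

def mixture_mlp(coords, basis, alpha):
    # Weighted model averaging: a single small MLP whose parameters are
    # sum_k alpha_k * theta_k, evaluated on the input coordinates.
    h = coords
    for li, (W, b) in enumerate(basis):
        Wm = torch.einsum("k,kio->io", alpha, W)
        bm = torch.einsum("k,ko->o", alpha, b)
        h = h @ Wm + bm
        if li < len(basis) - 1:
            h = torch.sin(30.0 * h)              # SIREN-style periodic activation
    return h

basis = make_basis(num_basis=8)
alpha = torch.softmax(torch.randn(8), dim=0)      # per-instance mixture coefficients
rgb = mixture_mlp(torch.rand(1024, 2), basis, alpha)
print(rgb.shape)                                  # torch.Size([1024, 3])
```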
2310.19415 Report Text-to-3D with Classifier Score Distillation Xin Yu, Yuan-Chen Guo, Yangguang Li, Ding Liang, Song-Hai Zhang, Xiaojuan Qi Text-to-3D generation has made remarkable progress recently, particularly with methods based on Score Distillation Sampling (SDS) that leverages pre-trained 2D diffusion models. While the usage of classifier-free guidance is well acknowledged to be crucial for successful optimization, it is considered an auxiliary trick rather than the most essential component. In this paper, we re-evaluate the role of classifier-free guidance in score distillation and discover a surprising finding: the guidance alone is enough for effective text-to-3D generation tasks. We name this method Classifier Score Distillation (CSD), which can be interpreted as using an implicit classification model for generation. This new perspective reveals new insights for understanding existing techniques. We validate the effectiveness of CSD across a variety of text-to-3D tasks including shape generation, texture synthesis, and shape editing, achieving results superior to those of state-of-the-art methods. Our project page is https://xinyu-andy.github.io/Classifier-Score-Distillation This paper introduces Classifier Score Distillation (CSD), a novel method for text-to-3D generation that utilizes the classifier component of pre-trained 2D diffusion models, challenging the prevailing assumption that generative priors are essential for this task. This research reshapes the understanding of text-to-3D generation by demonstrating the critical role of implicit classifiers within diffusion models, potentially leading to more efficient and effective generation techniques. The authors analyze the role of classifier-free guidance in Score Distillation Sampling (SDS), revealing that the classifier score, rather than the generative prior, is the driving force behind successful optimization. They propose CSD, which leverages only the classifier score for 3D scene refinement. CSD achieves state-of-the-art results in text-to-3D generation, surpassing existing SDS-based methods in visual quality and text alignment. The study reveals that negative prompts act as dual-objective classifier scores, and introduces an annealed negative classifier score optimization strategy for improved quality and fidelity. CSD proves effective for text-guided 3D editing, allowing modifications to existing scenes while preserving desired attributes. While empirical results highlight CSD's superiority, a formal distribution-based objective function for this optimization process is yet to be defined. Applying CSD to 2D image optimization results in artifacts, suggesting potential limitations or the need for further research to bridge the gap between 2D and 3D applications. text-to-3d generation, score distillation, classifier-free guidance, diffusion models, 3d scene editing
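The difference between the usual SDS update and the classifier-only CSD update can be written in a few lines. The noise schedule, timestep weighting w(t), and renderer are omitted, and eps_model is a toy stand-in rather than a real diffusion U-Net.

```python
import torch

def sds_and_csd_directions(eps_model, x, t, cond, uncond, w=7.5):
    # Noise the current render x (the real schedule's sqrt(alpha) terms are omitted here).
    noise = torch.randn_like(x)
    x_t = x + noise
    eps_c = eps_model(x_t, t, cond)     # conditional noise prediction
    eps_u = eps_model(x_t, t, uncond)   # unconditional noise prediction
    # SDS: classifier-free-guided prediction minus the injected noise.
    grad_sds = (eps_c + w * (eps_c - eps_u)) - noise
    # CSD: keep only the implicit-classifier direction eps_c - eps_u.
    grad_csd = w * (eps_c - eps_u)
    return grad_sds, grad_csd

# Toy stand-in for a pretrained denoiser, just so the sketch executes.
eps_model = lambda x_t, t, c: 0.1 * x_t + c.mean()
x = torch.randn(1, 4, 8, 8)             # e.g. a latent render of the 3D scene
g_sds, g_csd = sds_and_csd_directions(eps_model, x, t=500,
                                      cond=torch.ones(1), uncond=torch.zeros(1))
print(g_sds.shape, g_csd.shape)
```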
2310.19248 Report IMPRESS: Evaluating the Resilience of Imperceptible Perturbations Against Unauthorized Data Usage in Diffusion-Based Generative AI Bochuan Cao, Changjiang Li, Ting Wang, Jinyuan Jia, Bo Li, Jinghui Chen Diffusion-based image generation models, such as Stable Diffusion or DALL-E 2, are able to learn from given images and generate high-quality samples following the guidance from prompts. For instance, they can be used to create artistic images that mimic the style of an artist based on his/her original artworks or to maliciously edit the original images for fake content. However, such ability also brings serious ethical issues without proper authorization from the owner of the original images. In response, several attempts have been made to protect the original images from such unauthorized data usage by adding imperceptible perturbations, which are designed to mislead the diffusion model and make it unable to properly generate new samples. In this work, we introduce a perturbation purification platform, named IMPRESS, to evaluate the effectiveness of imperceptible perturbations as a protective measure. IMPRESS is based on the key observation that imperceptible perturbations could lead to a perceptible inconsistency between the original image and the diffusion-reconstructed image, which can be used to devise a new optimization strategy for purifying the image, which may weaken the protection of the original image from unauthorized data usage (e.g., style mimicking, malicious editing). The proposed IMPRESS platform offers a comprehensive evaluation of several contemporary protection methods, and can be used as an evaluation platform for future protection methods. This paper introduces IMPRESS, a platform for evaluating the effectiveness of imperceptible perturbations in protecting images from unauthorized use in diffusion models by purifying perturbed images with consistency-based losses. This evaluation is crucial to understand the robustness of existing protection methods like GLAZE and PhotoGuard against adaptive attacks and guide the development of future protection mechanisms. IMPRESS leverages the inconsistency between a perturbed image and its diffusion-reconstructed version. It formulates an optimization problem with two losses: a similarity loss ensuring the purified image is close to the perturbed one and a consistency loss ensuring the purified image can be reconstructed by the diffusion model. IMPRESS successfully weakens the protection of GLAZE on style mimicking, increasing the accuracy of generated images mimicking protected styles to near clean-image levels (87% vs. 90.8% for CLIP classifier). IMPRESS also diminishes the effectiveness of PhotoGuard on malicious editing, leading to edited images closer to edited clean images according to PSNR and VIF-p metrics. Adaptive protection methods incorporating consistency-based losses are explored but show limited improvement, suggesting the complexity of balancing multiple objectives and potential inherent conflicts. The reliance on specific similarity metrics (e.g., LPIPS) and the effectiveness of simple post-processing techniques on malicious editing highlight potential vulnerabilities. Designing robust adaptive protection methods for IMPRESS remains challenging due to complex loss optimization and potential conflicts between protection and purification goals. image protection, diffusion models, adversarial attacks, image editing, style mimicking
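A minimal sketch of the purification loop: keep the image close to the protected input while forcing consistency with its diffusion reconstruction. The real platform uses an LPIPS-style similarity bound and a pretrained latent diffusion model; both are replaced by toy stand-ins here so the loop runs.

```python
import torch
import torch.nn.functional as F

def purify(x_protected, reconstruct, steps=200, lr=1e-2, lam=0.1):
    # Optimize a purified image that stays visually close to the protected input
    # while being consistent with its diffusion reconstruction.
    x = x_protected.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        sim_loss = (x - x_protected).pow(2).mean()      # stand-in for an LPIPS bound
        cons_loss = (x - reconstruct(x)).abs().mean()   # reconstruction consistency
        loss = sim_loss + lam * cons_loss
        opt.zero_grad(); loss.backward(); opt.step()
        x.data.clamp_(0.0, 1.0)
    return x.detach()

# Toy stand-in for encode -> denoise -> decode with a pretrained LDM (here: a blur).
reconstruct = lambda img: F.avg_pool2d(F.pad(img, (1, 1, 1, 1), mode="replicate"), 3, stride=1)
purified = purify(torch.rand(1, 3, 64, 64), reconstruct)
print(purified.shape)
```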
2310.18949 Report Customize StyleGAN with One Hand Sketch Shaocong Zhang Generating images from human sketches typically requires dedicated networks trained from scratch. In contrast, the emergence of the pre-trained Vision-Language models (e.g., CLIP) has propelled generative applications based on controlling the output imagery of existing StyleGAN models with text inputs or reference images. Parallelly, our work proposes a framework to control StyleGAN imagery with a single user sketch. In particular, we learn a conditional distribution in the latent space of a pre-trained StyleGAN model via energy-based learning and propose two novel energy functions leveraging CLIP for cross-domain semantic supervision. Once trained, our model can generate multi-modal images semantically aligned with the input sketch. Quantitative evaluations on synthesized datasets have shown that our approach improves significantly from previous methods in the one-shot regime. The superiority of our method is further underscored when experimenting with a wide range of human sketches of diverse styles and poses. Surprisingly, our models outperform the previous baseline regarding both the range of sketch inputs and image qualities despite operating with a stricter setting: with no extra training data and single sketch input. This paper proposes a novel framework to control the imagery generated by a pre-trained StyleGAN model using a single user sketch, eliminating the need for dedicated networks or training datasets. This approach aligns with the recent trend of utilizing pre-trained generative models and enables a more intuitive and flexible way for users to control image generation through sketches. The framework leverages energy-based learning to learn a conditional distribution in the latent space of the StyleGAN model. It introduces two novel energy functions based on CLIP to provide cross-domain semantic supervision, guiding the generated images to align with the input sketch. Quantitative evaluations on synthesized datasets demonstrate significant improvement over previous methods in one-shot image generation. Experiments with real human sketches show the method's robustness to diverse sketch styles and poses, outperforming the baseline in terms of image quality and adaptability. The proposed framework integrates seamlessly with other StyleGAN-based manipulations like latent space editing and natural image inversion, broadening its application in image editing. The method may struggle with sketches representing rare modes not well-represented in the source StyleGAN model's training data. Future work could explore explicit control over the degree of output realism and extend the framework to other generative models beyond StyleGAN. image generation, sketch-to-image synthesis, stylegan, clip, energy-based models
2310.18936 Report Adversarial Examples Are Not Real Features Ang Li, Yifei Wang, Yiwen Guo, Yisen Wang The existence of adversarial examples has been a mystery for years and attracted much interest. A well-known theory by Ilyas et al. (2019) explains adversarial vulnerability from a data perspective by showing that one can extract non-robust features from adversarial examples and these features alone are useful for classification. However, the explanation remains quite counter-intuitive since non-robust features are mostly noise features to humans. In this paper, we re-examine the theory from a larger context by incorporating multiple learning paradigms. Notably, we find that contrary to their good usefulness under supervised learning, non-robust features attain poor usefulness when transferred to other self-supervised learning paradigms, such as contrastive learning, masked image modeling, and diffusion models. It reveals that non-robust features are not really as useful as robust or natural features that enjoy good transferability between these paradigms. Meanwhile, for robustness, we also show that naturally trained encoders from robust features are largely non-robust under AutoAttack. Our cross-paradigm examination suggests that the non-robust features are not really useful but more like paradigm-wise shortcuts, and robust features alone might be insufficient to attain reliable model robustness. Code is available at https://github.com/PKU-ML/AdvNotRealFeatures. This paper challenges the prevailing view of adversarial examples as explained by the existence of non-robust features. It argues that these features are not truly useful but act as paradigm-specific shortcuts. Understanding the true nature of adversarial examples and their relation to data features is crucial for developing robust machine learning models. The authors propose a cross-paradigm evaluation framework, testing the usefulness and robustness of robust and non-robust features across various learning paradigms (Supervised, Contrastive, Masked Image Modeling, Diffusion). Non-robust features, while useful in supervised learning, show poor transferability and are largely useless in other self-supervised paradigms. Robust features, claimed to be sufficient for robustness, fail to provide robustness when learned with different paradigms, especially under more reliable attacks. Adversarial examples themselves show poor transferability across paradigms, suggesting a strong dependence on the learning objective. The study primarily focuses on image classification tasks, leaving its generalizability to other domains unexplored. Further investigation is needed to understand the influence of data augmentation on the robustness of models trained on robust datasets. adversarial examples, robustness, non-robust features, cross-paradigm learning, transferability
2310.18274 Report LipSim: A Provably Robust Perceptual Similarity Metric Sara Ghazanfari, Alexandre Araujo, Prashanth Krishnamurthy, Farshad Khorrami, Siddharth Garg Recent years have seen growing interest in developing and applying perceptual similarity metrics. Research has shown the superiority of perceptual metrics over pixel-wise metrics in aligning with human perception and serving as a proxy for the human visual system. On the other hand, as perceptual metrics rely on neural networks, there is a growing concern regarding their resilience, given the established vulnerability of neural networks to adversarial attacks. It is indeed logical to infer that perceptual metrics may inherit both the strengths and shortcomings of neural networks. In this work, we demonstrate the vulnerability of state-of-the-art perceptual similarity metrics based on an ensemble of ViT-based feature extractors to adversarial attacks. We then propose a framework to train a robust perceptual similarity metric called LipSim (Lipschitz Similarity Metric) with provable guarantees. By leveraging 1-Lipschitz neural networks as the backbone, LipSim provides guarded areas around each data point and certificates for all perturbations within an $\ell_2$ ball. Finally, a comprehensive set of experiments shows the performance of LipSim in terms of natural and certified scores and on the image retrieval application. The code is available at https://github.com/SaraGhazanfari/LipSim. The paper proposes LipSim, the first certifiably robust perceptual similarity metric, by leveraging 1-Lipschitz neural networks and a student-teacher training approach with DreamSim as the teacher model. Existing perceptual similarity metrics, while effective, are vulnerable to adversarial attacks, potentially compromising applications like image retrieval and copy detection. LipSim aims to address this vulnerability with provable robustness guarantees. LipSim utilizes a 1-Lipschitz feature extractor trained via knowledge distillation from DreamSim on ImageNet. It then fine-tunes the feature extractor with a hinge loss on the NIGHT dataset and projects the embeddings onto a unit hypersphere, enabling certified robustness. LipSim demonstrates higher empirical robustness compared to state-of-the-art perceptual metrics under various adversarial attacks. The paper proves theoretical guarantees for LipSim's robustness, providing certified accuracy within a specified perturbation budget. LipSim achieves good performance on image retrieval, showcasing its practical applicability for finding semantically similar images even with adversarial queries. The current implementation of LipSim is limited to 2AFC datasets and could be expanded for broader applicability. Future work could explore LipSim's performance on a wider range of applications, such as copy detection and feature inversion. perceptual similarity, certified robustness, lipschitz networks, adversarial attacks, image retrieval
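The first training stage, distilling a perceptual teacher into a student with unit-norm embeddings, looks roughly like the sketch below. The real setup uses DreamSim as the teacher and a 1-Lipschitz architecture as the student; the two linear nets here are placeholders.

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, images, opt):
    """One knowledge-distillation step: the student matches the (frozen) teacher's
    perceptual embeddings; both embeddings are projected onto the unit hypersphere."""
    with torch.no_grad():
        target = F.normalize(teacher(images), dim=-1)
    emb = F.normalize(student(images), dim=-1)
    loss = (1.0 - (emb * target).sum(dim=-1)).mean()   # mean cosine distance
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Toy stand-ins: a real run would use DreamSim as teacher and a 1-Lipschitz
# (e.g. orthogonal-layer) network as student.
teacher = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 64))
student = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 64))
opt = torch.optim.SGD(student.parameters(), lr=1e-3)
print(distill_step(student, teacher, torch.rand(8, 3, 32, 32), opt))
```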
2310.17880 Report Reconstructive Latent-Space Neural Radiance Fields for Efficient 3D Scene Representations Tristan Aumentado-Armstrong, Ashkan Mirzaei, Marcus A. Brubaker, Jonathan Kelly, Alex Levinshtein, Konstantinos G. Derpanis, Igor Gilitschenski Neural Radiance Fields (NeRFs) have proven to be powerful 3D representations, capable of high quality novel view synthesis of complex scenes. While NeRFs have been applied to graphics, vision, and robotics, problems with slow rendering speed and characteristic visual artifacts prevent adoption in many use cases. In this work, we investigate combining an autoencoder (AE) with a NeRF, in which latent features (instead of colours) are rendered and then convolutionally decoded. The resulting latent-space NeRF can produce novel views with higher quality than standard colour-space NeRFs, as the AE can correct certain visual artifacts, while rendering over three times faster. Our work is orthogonal to other techniques for improving NeRF efficiency. Further, we can control the tradeoff between efficiency and image quality by shrinking the AE architecture, achieving over 13 times faster rendering with only a small drop in performance. We hope that our approach can form the basis of an efficient, yet high-fidelity, 3D scene representation for downstream tasks, especially when retaining differentiability is useful, as in many robotics scenarios requiring continual learning. This paper introduces Reconstructive Latent-Space NeRF (ReLS-NeRF), a novel 3D scene representation that combines an autoencoder (AE) with a NeRF for faster rendering and higher visual fidelity. NeRFs, while powerful, suffer from slow rendering speeds and visual artifacts, hindering their application in robotics and other fields. This work addresses these limitations to broaden NeRF's applicability. ReLS-NeRF renders low-resolution latent features instead of colors, using an AE to decode them into high-resolution images. The model is trained in three stages: AE training, joint NeRF fitting, and decoder fine-tuning. ReLS-NeRF achieves faster rendering (over 3 times) than standard NeRFs. It improves visual quality on several metrics, including PSNR, LPIPS, and video quality metrics like DOVER. The trade-off between speed and quality can be controlled by adjusting the AE architecture. The AE introduces temporal artifacts (view inconsistencies) not captured by standard metrics. Future work includes exploring task-specific AEs and geometry-aware decoders. neural radiance fields, nerf, autoencoder, 3d scene representation, novel view synthesis
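The rendering/decoding split can be sketched with a small convolutional decoder that upsamples a low-resolution latent feature map (the NeRF's output) into an RGB image; channel counts and the 4x upsampling factor are illustrative.

```python
import torch
import torch.nn as nn

class LatentDecoder(nn.Module):
    """Sketch of the latent-space NeRF idea: the radiance field renders a
    low-resolution feature map, and a small convolutional decoder turns it into
    the full-resolution RGB image."""
    def __init__(self, latent_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, latent_map):           # (B, latent_dim, h, w) rendered by the NeRF
        return self.net(latent_map)

decoder = LatentDecoder()
low_res = torch.randn(1, 16, 64, 64)         # rendered at one quarter of the target size
print(decoder(low_res).shape)                # torch.Size([1, 3, 256, 256])
```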
2310.17527 Report Masked Space-Time Hash Encoding for Efficient Dynamic Scene Reconstruction Feng Wang, Zilong Chen, Guokang Wang, Yafei Song, Huaping Liu In this paper, we propose the Masked Space-Time Hash encoding (MSTH), a novel method for efficiently reconstructing dynamic 3D scenes from multi-view or monocular videos. Based on the observation that dynamic scenes often contain substantial static areas that result in redundancy in storage and computations, MSTH represents a dynamic scene as a weighted combination of a 3D hash encoding and a 4D hash encoding. The weights for the two components are represented by a learnable mask, which is guided by an uncertainty-based objective to reflect the spatial and temporal importance of each 3D position. With this design, our method can reduce the hash collision rate by avoiding redundant queries and modifications on static areas, making it feasible to represent a large number of space-time voxels by hash tables of small size. Besides, without the requirement to fit the large number of temporally redundant features independently, our method is easier to optimize and converges rapidly with only twenty minutes of training for a 300-frame dynamic scene. As a result, MSTH obtains consistently better results than previous methods with only 20 minutes of training time and 130 MB of memory storage. Code is available at https://github.com/masked-spacetime-hashing/msth This paper proposes Masked Space-Time Hash encoding (MSTH), a novel, efficient method for reconstructing dynamic 3D scenes from multi-view or monocular videos. Reconstructing dynamic scenes is crucial for various applications, but existing methods struggle with efficiency, memory usage, and rendering quality. MSTH uses a weighted combination of 3D and 4D hash encodings, guided by a learnable mask reflecting spatial and temporal importance, to reduce hash collisions and improve efficiency. MSTH achieves consistently better reconstruction metrics (PSNR, DSSIM, LPIPS) than state-of-the-art methods on multiple datasets. The method requires only 20 minutes of training time, significantly faster than previous approaches. MSTH maintains a compact memory footprint of 130MB, thanks to its efficient encoding scheme. MSTH may struggle with scenes lacking detailed dynamic information, leading to artifacts. Future work includes addressing complex scenes, motion dynamics, and integrating multiple information sources for enhanced reconstruction. dynamic 3d scene reconstruction, neural radiance fields, hash encoding, uncertainty estimation, multi-view video
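A stripped-down sketch of the masked blend of static and dynamic encodings is shown below: a learnable mask weights a 3D hash table against a 4D one. It uses a single resolution, a simple multiplicative hash, and no interpolation, so it only illustrates the blending, not the paper's full encoder.

```python
import torch
import torch.nn as nn

class MaskedSpaceTimeHash(nn.Module):
    """Minimal sketch: features are a mask-weighted blend of a static 3D hash
    encoding and a dynamic 4D hash encoding."""
    def __init__(self, table_size=2**14, feat_dim=4, res=64):
        super().__init__()
        self.static = nn.Embedding(table_size, feat_dim)
        self.dynamic = nn.Embedding(table_size, feat_dim)
        self.mask_logit = nn.Embedding(table_size, 1)   # learnable spatial mask
        self.T, self.res = table_size, res
        self.primes = torch.tensor([1, 2654435761, 805459861, 3674653429])

    def hash(self, cells):                              # cells: (N, k) integer coordinates
        h = (cells.long() * self.primes[: cells.shape[1]]).sum(-1)
        return h % self.T

    def forward(self, xyz, t):                          # xyz in [0,1]^3, t in [0,1]
        cell3 = (xyz * self.res).floor()
        cell4 = torch.cat([cell3, (t * self.res).floor()], dim=-1)
        m = torch.sigmoid(self.mask_logit(self.hash(cell3)))
        return m * self.static(self.hash(cell3)) + (1 - m) * self.dynamic(self.hash(cell4))

mst = MaskedSpaceTimeHash()
feats = mst(torch.rand(1024, 3), torch.rand(1024, 1))
print(feats.shape)                                      # torch.Size([1024, 4])
```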
2310.17347 Report CADS: Unleashing the Diversity of Diffusion Models through Condition-Annealed Sampling Seyedmorteza Sadat, Jakob Buhmann, Derek Bradley, Otmar Hilliges, Romann M. Weber While conditional diffusion models are known to have good coverage of the data distribution, they still face limitations in output diversity, particularly when sampled with a high classifier-free guidance scale for optimal image quality or when trained on small datasets. We attribute this problem to the role of the conditioning signal in inference and offer an improved sampling strategy for diffusion models that can increase generation diversity, especially at high guidance scales, with minimal loss of sample quality. Our sampling strategy anneals the conditioning signal by adding scheduled, monotonically decreasing Gaussian noise to the conditioning vector during inference to balance diversity and condition alignment. Our Condition-Annealed Diffusion Sampler (CADS) can be used with any pretrained model and sampling algorithm, and we show that it boosts the diversity of diffusion models in various conditional generation tasks. Further, using an existing pretrained diffusion model, CADS achieves a new state-of-the-art FID of 1.70 and 2.31 for class-conditional ImageNet generation at 256$\times$256 and 512$\times$512 respectively. This paper introduces Condition-Annealed Diffusion Sampler (CADS), a novel sampling strategy for diffusion models to enhance generation diversity without compromising quality. Conditional diffusion models, while powerful, often lack diversity in their outputs, especially at high classifier-free guidance scales or when trained on smaller datasets. This limits their ability to fully capture the breadth of the data distribution. CADS anneals the conditioning signal by adding scheduled, monotonically decreasing Gaussian noise during inference. This disrupts the strong dependence on the conditioning signal initially and gradually restores it, promoting exploration of the data distribution while maintaining alignment with the input condition. CADS significantly boosts diversity across various tasks (class-conditional ImageNet generation, pose-to-image, text-to-image, and identity-conditioned face generation) as measured by FID, Recall, and similarity scores. CADS achieves state-of-the-art FID scores on class-conditional ImageNet generation at 256x256 and 512x512 resolutions by leveraging higher guidance scales without sacrificing diversity. The method is compatible with various diffusion samplers (DDPM, DDIM, PNDM, DPM++) and consistently improves their performance. Applying CADS to complex conditioning contexts like dense segmentation maps requires further investigation. While CADS mitigates the diversity-quality trade-off, finding the optimal annealing schedule might require task-specific tuning. diffusion models, generative modeling, diversity, classifier-free guidance, sampling strategies
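The annealing itself is only a few lines: corrupt the conditioning vector with scheduled Gaussian noise early in sampling and remove the noise as denoising proceeds. The schedule below follows the piecewise-linear form implied by the abstract; the rescaling step and the thresholds t1, t2, s are simplified placeholders.

```python
import torch

def anneal_condition(cond, t, t1=0.6, t2=0.9, s=0.1, rescale=True):
    """Condition annealing (sketch): gamma(t) = 0 for t >= t2, 1 for t <= t1,
    linear in between, with t = 1 at the start of sampling and 0 at the end."""
    gamma = min(max((t2 - t) / (t2 - t1), 0.0), 1.0)
    noisy = gamma ** 0.5 * cond + (1 - gamma) ** 0.5 * s * torch.randn_like(cond)
    if rescale:  # roughly restore the original statistics of the embedding
        noisy = (noisy - noisy.mean()) / (noisy.std() + 1e-8)
        noisy = noisy * cond.std() + cond.mean()
    return noisy

cond = torch.randn(1, 77, 768)                  # e.g. a text-encoder embedding
for t in (1.0, 0.8, 0.5, 0.1):                  # high t = early step, heavily corrupted
    print(t, (anneal_condition(cond, t) - cond).abs().mean().item())
```

Lower values of t1 and t2 keep the conditioning corrupted for longer, pushing the sampler toward more diversity at the cost of condition alignment.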
2310.17050 Report Exploring Question Decomposition for Zero-Shot VQA Zaid Khan, Vijay Kumar BG, Samuel Schulter, Manmohan Chandraker, Yun Fu Visual question answering (VQA) has traditionally been treated as a single-step task where each question receives the same amount of effort, unlike natural human question-answering strategies. We explore a question decomposition strategy for VQA to overcome this limitation. We probe the ability of recently developed large vision-language models to use human-written decompositions and produce their own decompositions of visual questions, finding they are capable of learning both tasks from demonstrations alone. However, we show that naive application of model-written decompositions can hurt performance. We introduce a model-driven selective decomposition approach for second-guessing predictions and correcting errors, and validate its effectiveness on eight VQA tasks across three domains, showing consistent improvements in accuracy, including improvements of >20% on medical VQA datasets and boosting the zero-shot performance of BLIP-2 above chance on a VQA reformulation of the challenging Winoground task. Project Site: https://zaidkhan.me/decomposition-0shot-vqa/ This paper explores question decomposition as a strategy for zero-shot visual question answering (VQA) with large vision-language models (VLMs), enabling them to approach reasoning-heavy VQA as a two-step process. Traditional VQA treats all questions as single-step tasks, unlike natural human question-answering strategies where more complex questions receive more effort. This work aims to address this limitation by introducing a question decomposition strategy for VQA. The authors first probe the ability of large VLMs to use human-written and then model-generated question decompositions in a zero-shot setting. They then introduce a model-driven 'selective decomposition' approach to address limitations of naive decomposition, evaluating its effectiveness on eight VQA tasks across three domains. Large VLMs can effectively use human-written decompositions to improve VQA accuracy without explicit training and do not merely exploit surface-level statistics. Generative, instruction-tuned language models can produce effective decompositions zero-shot without task-specific training. Selective decomposition, which applies decomposition only when the model is uncertain about its initial answer, consistently improves VQA accuracy across datasets and domains, with significant gains on medical VQA datasets and the challenging Winoground task. The study primarily considers two-step decomposition approaches. Exploring multi-step approaches for more complex reasoning remains a future direction. While the paper focuses on in-context learning, investigating the benefits of explicitly training models to produce and consume decompositions is left for future work. visual question answering, question decomposition, zero-shot learning, vision-language models, selective prediction
2310.16951 Report The Teenager's Problem: Efficient Garment Decluttering With Grasp Optimization Aviv Adler, Ayah Ahmad, Shengyin Wang, Wisdom C. Agboh, Edith Llontop, Tianshuang Qiu, Jeffrey Ichnowski, Mehmet Dogar, Thomas Kollar, Richard Cheng, Ken Goldberg This paper addresses the ''Teenager's Problem'': efficiently removing scattered garments from a planar surface. As grasping and transporting individual garments is highly inefficient, we propose analytical policies to select grasp locations for multiple garments using an overhead camera. Two classes of methods are considered: depth-based, which use overhead depth data to find efficient grasps, and segment-based, which use segmentation on the RGB overhead image (without requiring any depth data); grasp efficiency is measured by Objects per Transport, which denotes the average number of objects removed per trip to the laundry basket. Experiments suggest that both depth- and segment-based methods easily reduce Objects per Transport (OpT) by $20\%$; furthermore, these approaches complement each other, with combined hybrid methods yielding improvements of $34\%$. Finally, a method employing consolidation (with segmentation) is considered, which manipulates the garments on the work surface to increase OpT; this yields an improvement of $67\%$ over the baseline, though at a cost of additional physical actions. This paper introduces the "Teenager's Problem" - efficient decluttering of garments from a surface, proposing depth-based, segment-based, and hybrid methods to optimize grasp locations for removing multiple garments simultaneously. Efficient garment manipulation is important in various domains like hotels, retail, and manufacturing, where current individual garment grasping methods are inefficient. The paper evaluates different grasp planning methods, including depth-based (highest point, max volume), segment-based (using segmentation to grasp multiple garments), hybrid (combining depth and segmentation), and a baseline random grasping method. These methods are tested with a real robot to compare their efficiency in clearing a workspace of scattered garments. Both depth-based and segment-based methods individually increase grasping efficiency (Objects per Transport) by 20%. Hybrid methods, combining depth and segmentation, yield even larger improvements, reaching up to 34%. Incorporating consolidation actions (rearranging garments within the workspace) with segmentation achieves a 67% improvement but requires additional physical actions. The methods rely on accurate separation of garments from the background using color, which may not generalize well to different setups. The grasps use a fixed height and vertical orientation, potentially limiting efficiency. Future work could explore optimizing grasp height and angle. robotics, garment manipulation, decluttering, grasp planning, image segmentation
2310.16858 Report 4D-Editor: Interactive Object-level Editing in Dynamic Neural Radiance Fields via Semantic Distillation Dadong Jiang, Zhihui Ke, Xiaobo Zhou, Xidong Shi This paper targets interactive object-level editing (e.g., deletion, recoloring, transformation, composition) in dynamic scenes. Recently, some methods aiming for flexible editing static scenes represented by neural radiance field (NeRF) have shown impressive synthesis quality, while similar capabilities in time-variant dynamic scenes remain limited. To solve this problem, we propose 4D-Editor, an interactive semantic-driven editing framework, allowing editing multiple objects in a dynamic NeRF with user strokes on a single frame. We propose an extension to the original dynamic NeRF by incorporating a hybrid semantic feature distillation to maintain spatial-temporal consistency after editing. In addition, we design Recursive Selection Refinement that significantly boosts object segmentation accuracy within a dynamic NeRF to aid the editing process. Moreover, we develop Multi-view Reprojection Inpainting to fill holes caused by incomplete scene capture after editing. Extensive experiments and editing examples on real-world demonstrate that 4D-Editor achieves photo-realistic editing on dynamic NeRFs. Project page: https://patrickddj.github.io/4D-Editor 4D-Editor, an interactive object-level editing framework for dynamic neural radiance fields (NeRFs), allows users to edit multiple objects with strokes on a single reference frame, propagating modifications throughout the entire dynamic NeRF. Existing NeRF editing methods are limited to static scenes or lack object-level control in dynamic scenes. 4D-Editor addresses this gap by enabling interactive and precise object editing in dynamic NeRFs, crucial for applications like VR/AR and animation. 4D-Editor utilizes hybrid semantic feature distillation from a pre-trained DINO model to guide object segmentation. It introduces Recursive Selection Refinement for accurate object selection and Multi-view Reprojection Inpainting to fill holes caused by object removal. 4D-Editor achieves precise object-level editing in dynamic NeRFs with user-friendly strokes, demonstrated on challenging datasets. Recursive Selection Refinement significantly improves object segmentation accuracy compared to traditional methods. Multi-view Reprojection Inpainting effectively fills holes after object removal, preserving spatial-temporal consistency. Removing shadows of moving objects remains challenging. Scene inpainting might exhibit spatial-temporal inconsistencies in some cases, requiring further investigation. neural radiance fields, dynamic scene editing, interactive editing, semantic distillation, 4d object segmentation
2310.16825 Report CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images Aaron Gokaslan, A. Feder Cooper, Jasmine Collins, Landan Seguin, Austin Jacobson, Mihir Patel, Jonathan Frankle, Cory Stephenson, Volodymyr Kuleshov We assemble a dataset of Creative-Commons-licensed (CC) images, which we use to train a set of open diffusion models that are qualitatively competitive with Stable Diffusion 2 (SD2). This task presents two challenges: (1) high-resolution CC images lack the captions necessary to train text-to-image generative models; (2) CC images are relatively scarce. In turn, to address these challenges, we use an intuitive transfer learning technique to produce a set of high-quality synthetic captions paired with curated CC images. We then develop a data- and compute-efficient training recipe that requires as little as 3% of the LAION-2B data needed to train existing SD2 models, but obtains comparable quality. These results indicate that we have a sufficient number of CC images (~70 million) for training high-quality models. Our training recipe also implements a variety of optimizations that achieve ~3X training speed-ups, enabling rapid model iteration. We leverage this recipe to train several high-quality text-to-image models, which we dub the CommonCanvas family. Our largest model achieves comparable performance to SD2 on a human evaluation, despite being trained on our CC dataset that is significantly smaller than LAION and using synthetic captions for training. We release our models, data, and code at https://github.com/mosaicml/diffusion/blob/main/assets/common-canvas.md This paper introduces CommonCanvas, a suite of text-to-image latent diffusion models trained solely on Creative Commons images and synthetically generated captions. This work addresses copyright and reproducibility concerns associated with training diffusion models on web-scraped data (like LAION). The authors curate a dataset of CC images and use a pre-trained BLIP-2 model to generate captions for these images. They also develop efficient training techniques that allow them to train high-quality models with significantly less data. Training diffusion models on less than 3% of the data used to train Stable Diffusion 2 (SD2) yields comparable performance on standard metrics. Synthetic captions can be as effective as human-generated captions for training diffusion models. CommonCanvas models, despite being trained on a smaller dataset with synthetic captions, achieve comparable performance to SD2 on human evaluations. The CC image dataset used is smaller and potentially less diverse than web-scraped datasets. The reliance on a pre-trained BLIP-2 model for captions introduces potential biases. diffusion models, copyright, synthetic data, image captioning, data efficiency
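Producing a synthetic caption for a curated CC image with an off-the-shelf BLIP-2 checkpoint might look like the sketch below; the checkpoint name and file path are assumptions, and the paper's captioning and curation pipeline is more involved than a single generate call.

```python
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Checkpoint name is an assumption; the paper trains its own captioning setup.
name = "Salesforce/blip2-opt-2.7b"
processor = Blip2Processor.from_pretrained(name)
model = Blip2ForConditionalGeneration.from_pretrained(name)

def synthetic_caption(path):
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=40)
    return processor.batch_decode(out, skip_special_tokens=True)[0].strip()

# Each (image, synthetic caption) pair then becomes a training example for the
# text-to-image model; the path below is a placeholder.
print(synthetic_caption("cc_images/000001.jpg"))
```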
2310.16818 Report DreamCraft3D: Hierarchical 3D Generation with Bootstrapped Diffusion Prior Jingxiang Sun, Bo Zhang, Ruizhi Shao, Lizhen Wang, Wen Liu, Zhenda Xie, Yebin Liu We present DreamCraft3D, a hierarchical 3D content generation method that produces high-fidelity and coherent 3D objects. We tackle the problem by leveraging a 2D reference image to guide the stages of geometry sculpting and texture boosting. A central focus of this work is to address the consistency issue that existing works encounter. To sculpt geometries that render coherently, we perform score distillation sampling via a view-dependent diffusion model. This 3D prior, alongside several training strategies, prioritizes the geometry consistency but compromises the texture fidelity. We further propose Bootstrapped Score Distillation to specifically boost the texture. We train a personalized diffusion model, Dreambooth, on the augmented renderings of the scene, imbuing it with 3D knowledge of the scene being optimized. The score distillation from this 3D-aware diffusion prior provides view-consistent guidance for the scene. Notably, through an alternating optimization of the diffusion prior and 3D scene representation, we achieve mutually reinforcing improvements: the optimized 3D scene aids in training the scene-specific diffusion model, which offers increasingly view-consistent guidance for 3D optimization. The optimization is thus bootstrapped and leads to substantial texture boosting. With tailored 3D priors throughout the hierarchical generation, DreamCraft3D generates coherent 3D objects with photorealistic renderings, advancing the state-of-the-art in 3D content generation. Code available at https://github.com/deepseek-ai/DreamCraft3D. DreamCraft3D, a hierarchical 3D content generation method that produces high-fidelity and coherent 3D objects by leveraging a 2D reference image to guide geometry sculpting and texture boosting. Addresses the limitations of existing 3D content generation methods that struggle to create complex objects with consistent geometry and textures. A hierarchical pipeline with two main stages: (1) Geometry sculpting: Uses a view-conditioned diffusion model and progressive view training to create detailed, consistent geometry from a 2D reference image. (2) Texture boosting: Employs a bootstrapped score distillation (BSD) approach that iteratively refines the 3D texture by jointly optimizing the 3D representation and a personalized DreamBooth diffusion model. Generates creative 3D assets with intricate geometric structures and realistic textures rendered coherently in 360 degrees. Outperforms existing text-to-3D and image-to-3D methods in terms of texture quality, geometric consistency, and overall visual fidelity. Demonstrates superior performance in user studies, with a strong preference for DreamCraft3D-generated models. Occasionally incorporates frontal-view details into textures due to depth ambiguity. Does not explicitly separate material and lighting information from the 2D reference image. 3d content generation, diffusion models, dreambooth, texture synthesis, view consistency
2310.16656 Report A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation Eyal Segalis, Dani Valevski, Danny Lumen, Yossi Matias, Yaniv Leviathan Text-to-image diffusion models achieved a remarkable leap in capabilities over the last few years, enabling high-quality and diverse synthesis of images from a textual prompt. However, even the most advanced models often struggle to precisely follow all of the directions in their prompts. The vast majority of these models are trained on datasets consisting of (image, caption) pairs where the images often come from the web, and the captions are their HTML alternate text. A notable example is the LAION dataset, used by Stable Diffusion and other models. In this work we observe that these captions are often of low quality, and argue that this significantly affects the model's capability to understand nuanced semantics in the textual prompts. We show that by relabeling the corpus with a specialized automatic captioning model and training a text-to-image model on the recaptioned dataset, the model benefits substantially across the board. First, in overall image quality: e.g. FID 14.84 vs. the baseline of 17.87, and 64.3% improvement in faithful image generation according to human evaluation. Second, in semantic alignment, e.g. semantic object accuracy 84.34 vs. 78.90, counting alignment errors 1.32 vs. 1.44 and positional alignment 62.42 vs. 57.60. We analyze various ways to relabel the corpus and provide evidence that this technique, which we call RECAP, both reduces the train-inference discrepancy and provides the model with more information per example, increasing sample efficiency and allowing the model to better understand the relations between captions and images. This paper introduces RECAP, a method that improves text-to-image models by training them on synthetically generated captions. It involves fine-tuning an automatic captioning system and using it to generate more detailed and contextually relevant captions for the training images. Existing text-to-image models often struggle to accurately follow nuanced prompts because they're trained on datasets with low-quality captions (e.g., HTML Alttext). This method aims to address this limitation and improve the models' fidelity and semantic understanding. The method has 3 steps: 1) Fine-tune an image-to-text model (PaLI) on human-annotated captions to generate detailed descriptions. 2) Use the fine-tuned model to re-caption the image training dataset. 3) Fine-tune a text-to-image model (Stable Diffusion) on the dataset with the new captions. Significantly improved image quality metrics, with FID improving from 17.87 to 14.84. Improved semantic alignment between generated images and prompts, demonstrated by increased object accuracy, counting alignment, and positional alignment scores. Human evaluation showed 64.3% relative improvement in generating images successfully following the prompts. The study primarily focuses on fine-tuning a pre-trained model; exploring the impact of training from scratch with RECAP captions is left for future work. The impact of RECAP on larger models and datasets is yet to be explored. text-to-image generation, image captioning, synthetic data, semantic alignment, diffusion models
2310.16400 Report Fuse Your Latents: Video Editing with Multi-source Latent Diffusion Models Tianyi Lu, Xing Zhang, Jiaxi Gu, Hang Xu, Renjing Pei, Songcen Xu, Zuxuan Wu Latent Diffusion Models (LDMs) are renowned for their powerful capabilities in image and video synthesis. Yet, video editing methods suffer from insufficient pre-training data or video-by-video re-training cost. In addressing this gap, we propose FLDM (Fused Latent Diffusion Model), a training-free framework to achieve text-guided video editing by applying off-the-shelf image editing methods in video LDMs. Specifically, FLDM fuses latents from an image LDM and a video LDM during the denoising process. In this way, temporal consistency can be kept with the video LDM while the high fidelity of the image LDM can also be exploited. Meanwhile, FLDM possesses high flexibility since both the image LDM and the video LDM can be replaced, so advanced image editing methods such as InstructPix2Pix and ControlNet can be exploited. To the best of our knowledge, FLDM is the first method to adapt off-the-shelf image editing methods into video LDMs for video editing. Extensive quantitative and qualitative experiments demonstrate that FLDM can improve the textual alignment and temporal consistency of edited videos. This paper proposes FLDM (Fused Latent Diffusion Model), a training-free framework for text-guided video editing using off-the-shelf image editing methods within video LDMs. Existing video editing methods are limited by insufficient pre-training data or require costly video-by-video retraining. This work addresses these limitations by leveraging the strengths of both image and video LDMs. FLDM fuses latent representations from an image LDM and a video LDM during the denoising process. This allows for control over the balance between temporal consistency (from the video LDM) and editing fidelity (from the image LDM). FLDM improves the textual alignment and temporal consistency of edited videos compared to using image or video LDMs alone. The method is flexible and can be used with different off-the-shelf image editing techniques, such as InstructPix2Pix and ControlNet. FLDM demonstrates the complementary nature of image and video LDMs in achieving high-quality video editing. The paper uses a re-implemented video diffusion model due to the lack of publicly available high-quality pre-trained models. Further exploration is needed to optimize the fusion strategy and apply it to other video editing tasks. video editing, latent diffusion models, text-guided editing, temporal consistency, multi-source fusion
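The fusion step reduces to a weighted average of the two branches' latents at each denoising step, after which both branches continue from the fused latent. The samplers below are toy stand-ins, and the weight lam is a free parameter rather than the paper's schedule.

```python
import torch

def fused_editing_step(video_step, image_step, z_video, z_image, t, lam):
    """One fused denoising step (sketch): each branch takes its own step, then
    the latents are blended with weight `lam` on the video branch, and the fused
    latent is fed back to both branches."""
    z_v = video_step(z_video, t)                    # temporally-aware branch
    b, f, c, h, w = z_image.shape
    z_i = image_step(z_image.reshape(b * f, c, h, w), t).reshape(b, f, c, h, w)
    z_fused = lam * z_v + (1.0 - lam) * z_i
    return z_fused, z_fused

# Toy stand-ins for the two samplers so the sketch runs end to end.
video_step = lambda z, t: 0.9 * z
image_step = lambda z, t: 0.8 * z
z = torch.randn(1, 8, 4, 32, 32)                    # latents for 8 frames
zv, zi = z.clone(), z.clone()
for t in reversed(range(0, 1000, 100)):
    zv, zi = fused_editing_step(video_step, image_step, zv, zi, t, lam=0.6)
print(zv.shape)
```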
2310.16383 Report Open-NeRF: Towards Open Vocabulary NeRF Decomposition Hao Zhang, Fang Li, Narendra Ahuja In this paper, we address the challenge of decomposing Neural Radiance Fields (NeRF) into objects from an open vocabulary, a critical task for object manipulation in 3D reconstruction and view synthesis. Current techniques for NeRF decomposition involve a trade-off between the flexibility of processing open-vocabulary queries and the accuracy of 3D segmentation. We present Open-vocabulary Embedded Neural Radiance Fields (Open-NeRF), which leverage large-scale, off-the-shelf segmentation models like the Segment Anything Model (SAM) and introduce an integrate-and-distill paradigm with hierarchical embeddings to achieve both the flexibility of open-vocabulary querying and 3D segmentation accuracy. Open-NeRF first utilizes large-scale foundation models to generate hierarchical 2D mask proposals from varying viewpoints. These proposals are then aligned via tracking approaches and integrated within the 3D space and subsequently distilled into the 3D field. This process ensures consistent recognition and granularity of objects from different viewpoints, even in challenging scenarios involving occlusion and indistinct features. Our experimental results show that the proposed Open-NeRF outperforms state-of-the-art methods such as LERF and FFD in open-vocabulary scenarios. Open-NeRF offers a promising solution to NeRF decomposition, guided by open-vocabulary queries, enabling novel applications in robotics and vision-language interaction in open-world 3D scenes. Open-NeRF decomposes Neural Radiance Fields (NeRF) into objects from an open vocabulary using an integrate-and-distill paradigm with hierarchical embeddings. NeRF decomposition is crucial for object manipulation in 3D reconstruction and view synthesis, but existing methods struggle with the trade-off between handling open-vocabulary queries and accurate 3D segmentation. Open-NeRF utilizes large-scale foundation models (SAM, OpenCLIP) to generate and align 2D mask proposals from multiple viewpoints, integrating them in 3D space and distilling them into the 3D field. It also employs hierarchical embeddings for handling queries at different scales (object, part, background). Open-NeRF outperforms state-of-the-art methods (LERF, FFD) in open-vocabulary scenarios. It accurately segments both common and novel objects regardless of viewpoint. It enables flexible object manipulation based on various attributes like product name, brand, color, and text. The performance of Open-NeRF is limited by the capabilities of the foundational models (SAM, OpenCLIP). Future work could explore more robust methods for handling background regions. nerf, 3d scene understanding, open vocabulary, segmentation, vision-language models
2310.16167 Report iNVS: Repurposing Diffusion Inpainters for Novel View Synthesis Yash Kant, Aliaksandr Siarohin, Michael Vasilkovsky, Riza Alp Guler, Jian Ren, Sergey Tulyakov, Igor Gilitschenski We present a method for generating consistent novel views from a single source image. Our approach focuses on maximizing the reuse of visible pixels from the source image. To achieve this, we use a monocular depth estimator that transfers visible pixels from the source view to the target view. Starting from a pre-trained 2D inpainting diffusion model, we train our method on the large-scale Objaverse dataset to learn 3D object priors. While training we use a novel masking mechanism based on epipolar lines to further improve the quality of our approach. This allows our framework to perform zero-shot novel view synthesis on a variety of objects. We evaluate the zero-shot abilities of our framework on three challenging datasets: Google Scanned Objects, Ray Traced Multiview, and Common Objects in 3D. See our webpage for more details: https://yashkant.github.io/invs/ The paper introduces iNVS, a novel method for synthesizing new views of an object from a single source image by leveraging a pretrained 2D inpainting diffusion model and maximizing the reuse of visible pixels through depth-based warping. Generating high-fidelity novel views from a single image is crucial for various applications but remains challenging due to the need to infer 3D geometry from limited information. Existing methods often struggle with consistency, quality, or generalization. iNVS uses a monocular depth estimator to warp visible pixels from the source to the target view. It then employs an inpainting diffusion model, finetuned on the Objaverse dataset, to recover missing regions, guided by an epipolar mask that identifies newly visible areas. iNVS outperforms baseline methods on PSNR and achieves comparable LPIPS scores, indicating good noise reduction and perceptual similarity. The method excels at preserving text and fine details from the source image due to its pixel reuse strategy. While iNVS demonstrates strong performance, it can exhibit limitations in accurately reconstructing object shapes due to reliance on monocular depth estimation, leading to lower SSIM scores. The method's reliance on monocular depth estimation can lead to structural inconsistencies, particularly in regions with significant viewpoint changes. Future work could explore auto-regressive schemes for novel view generation to address limitations in generating consistent textures in unseen regions. novel view synthesis, diffusion models, inpainting, epipolar geometry, single image
2310.16044 Report Stanford-ORB: A Real-World 3D Object Inverse Rendering Benchmark Zhengfei Kuang, Yunzhi Zhang, Hong-Xing Yu, Samir Agarwala, Shangzhe Wu, Jiajun Wu We introduce Stanford-ORB, a new real-world 3D Object inverse Rendering Benchmark. Recent advances in inverse rendering have enabled a wide range of real-world applications in 3D content generation, moving rapidly from research and commercial use cases to consumer devices. While the results continue to improve, there is no real-world benchmark that can quantitatively assess and compare the performance of various inverse rendering methods. Existing real-world datasets typically only consist of the shape and multi-view images of objects, which are not sufficient for evaluating the quality of material recovery and object relighting. Methods capable of recovering material and lighting often resort to synthetic data for quantitative evaluation, which on the other hand does not guarantee generalization to complex real-world environments. We introduce a new dataset of real-world objects captured under a variety of natural scenes with ground-truth 3D scans, multi-view images, and environment lighting. Using this dataset, we establish the first comprehensive real-world evaluation benchmark for object inverse rendering tasks from in-the-wild scenes, and compare the performance of various existing methods. This paper introduces Stanford-ORB, a novel real-world 3D object inverse rendering benchmark designed to address the lack of standardized evaluation for inverse rendering methods in complex, real-world settings. Accurately evaluating inverse rendering methods is crucial as their applications in 3D content creation and robotics expand. However, current benchmarks often rely on synthetic data, limiting generalization to real-world scenarios. The authors created a dataset of 14 objects captured in 7 diverse real-world scenes, including ground-truth 3D scans, multi-view images, and environment lighting. They established three evaluation benchmarks: geometry estimation, novel scene relighting, and novel view synthesis. IDR excels in geometry reconstruction and novel view synthesis, outperforming NeRF and its variants. NVDiffRecMC demonstrates superior performance in novel scene relighting compared to other inverse rendering methods. Methods using ground-truth shape and material information significantly outperform those relying solely on learned priors, highlighting areas for future research. The dataset is currently limited to non-translucent objects and faces difficulties capturing thin, deformable objects. Future work includes expanding the dataset with more diverse objects, incorporating multi-object scenes, and capturing complete environment maps. inverse rendering, benchmarking, 3d reconstruction, relighting, novel view synthesis
2310.16002 Report Integrating View Conditions for Image Synthesis Jinbin Bai, Zhen Dong, Aosong Feng, Xiao Zhang, Tian Ye, Kaicheng Zhou In the field of image processing, applying intricate semantic modifications within existing images remains an enduring challenge. This paper introduces a pioneering framework that integrates viewpoint information to enhance the control of image editing tasks, especially for interior design scenes. By surveying existing object editing methodologies, we distill three essential criteria -- consistency, controllability, and harmony -- that should be met for an image editing method. In contrast to previous approaches, our framework takes the lead in satisfying all three requirements for addressing the challenge of image synthesis. Through comprehensive experiments, encompassing both quantitative assessments and qualitative comparisons with contemporary state-of-the-art methods, we present compelling evidence of our framework's superior performance across multiple dimensions. This work establishes a promising avenue for advancing image synthesis techniques and empowering precise object modifications while preserving the visual coherence of the entire composition. This paper introduces a novel image editing framework that leverages viewpoint information to enhance control over object manipulation in images, particularly for interior design scenes. Existing image editing methods struggle to simultaneously achieve consistency in object appearance, controllability over object pose and position, and harmonious integration with the scene. This framework addresses these limitations. The framework combines several components: 1) an LLM planner to extract object and pose information from user prompts, 2) pose estimation and synthesis modules for generating target objects with desired viewpoints, and 3) a personalized diffusion model with ControlNets for harmoniously integrating the synthesized object into the scene. The framework outperforms state-of-the-art reference-based image synthesis methods in terms of consistency, harmony, and controllability, as demonstrated through qualitative comparisons and human evaluations. Ablation studies confirm the necessity of each component, highlighting the importance of view conditions for accurate object synthesis. The framework demonstrates robustness to slight errors in view condition specifications. Future work aims to develop an end-to-end solution by integrating view control directly within the latent space of the diffusion model, improving efficiency. The current implementation relies on explicit pose estimation and synthesis, which can be further streamlined. image editing, view control, pose synthesis, diffusion models, interior design
2310.15747 Report Large Language Models are Temporal and Causal Reasoners for Video Question Answering Dohwan Ko, Ji Soo Lee, Wooyoung Kang, Byungseok Roh, Hyunwoo J. Kim Large Language Models (LLMs) have shown remarkable performances on a wide range of natural language understanding and generation tasks. We observe that the LLMs provide effective priors in exploiting linguistic shortcuts for temporal and causal reasoning in Video Question Answering (VideoQA). However, such priors often cause suboptimal results on VideoQA by leading the model to over-rely on questions, i.e., linguistic bias, while ignoring visual content. This is also known as 'ungrounded guesses' or 'hallucinations'. To address this problem while leveraging LLMs' prior on VideoQA, we propose a novel framework, Flipped-VQA, encouraging the model to predict all the combinations of the (V, Q, A) triplet by flipping the source pair and the target label to understand their complex relationships, i.e., predict A, Q, and V given VQ, VA, and QA pairs, respectively. In this paper, we develop LLaMA-VQA by applying Flipped-VQA to LLaMA, and it outperforms both LLMs-based and non-LLMs-based models on five challenging VideoQA benchmarks. Furthermore, our Flipped-VQA is a general framework that is applicable to various LLMs (OPT and GPT-J) and consistently improves their performances. We empirically demonstrate that Flipped-VQA not only enhances the exploitation of linguistic shortcuts but also mitigates the linguistic bias, which causes incorrect answers over-relying on the question. Code is available at https://github.com/mlvlab/Flipped-VQA. This paper investigates the temporal and causal reasoning abilities of Large Language Models (LLMs) on Video Question Answering (VideoQA) and proposes Flipped-VQA, a novel framework that leverages LLMs' knowledge for this task. Challenging VideoQA benchmarks require understanding of temporal and causal relationships, and LLMs, pretrained on massive text data, inherently possess such reasoning abilities. However, they can be prone to linguistic bias, relying heavily on questions while ignoring visual content. Flipped-VQA consists of three objectives: 1) VQ -> A (main task: predicting answer from video and question), 2) VA -> Q (predicting question from video and answer), and 3) QA -> V (predicting video from question and answer). This encourages understanding the complex relationships within the VQA triplet. Larger LLMs exhibit significantly better performance on temporal and causal VideoQA questions, highlighting the importance of their pretrained knowledge. Flipped-VQA significantly improves the performance of various LLMs (LLaMA, OPT, GPT-J) on five challenging VideoQA datasets, surpassing previous state-of-the-art models. Extensive analyses demonstrate that Flipped-VQA effectively mitigates linguistic bias by encouraging the model to utilize visual content more effectively, while still leveraging linguistic shortcuts when beneficial. The framework's applicability to encoder-decoder LLMs with objectives beyond next-token prediction requires further exploration. Despite using a small number of trainable parameters, the reliance on large backbone LLMs results in significant memory usage. video question answering, large language models, temporal and causal reasoning, linguistic bias mitigation, multi-modal understanding
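A minimal sketch of the three Flipped-VQA objectives as a summed next-token-prediction loss; the `lm(prefix=..., target=...)` interface and the treatment of video features as predictable tokens are simplifying assumptions for illustration, not the released code's API:

```python
import torch

def flipped_vqa_loss(lm, video_tokens, question_tokens, answer_tokens):
    """Sum of the three Flipped-VQA objectives: predict A from (V, Q),
    Q from (V, A), and V from (Q, A). `lm` is assumed to return a scalar
    next-token-prediction loss on the target span given a prefix."""
    loss_vq_a = lm(prefix=torch.cat([video_tokens, question_tokens], dim=1),
                   target=answer_tokens)    # main VideoQA task
    loss_va_q = lm(prefix=torch.cat([video_tokens, answer_tokens], dim=1),
                   target=question_tokens)  # flipped: generate the question
    loss_qa_v = lm(prefix=torch.cat([question_tokens, answer_tokens], dim=1),
                   target=video_tokens)     # flipped: predict the video tokens
    return loss_vq_a + loss_va_q + loss_qa_v
```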
2310.15308 Report SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding Haoxiang Wang, Pavan Kumar Anasosalu Vasu, Fartash Faghri, Raviteja Vemulapalli, Mehrdad Farajtabar, Sachin Mehta, Mohammad Rastegari, Oncel Tuzel, Hadi Pouransari The landscape of publicly available vision foundation models (VFMs), such as CLIP and Segment Anything Model (SAM), is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their pre-training objectives. For instance, CLIP excels in semantic understanding, while SAM specializes in spatial understanding for segmentation. In this work, we introduce a simple recipe to efficiently merge VFMs into a unified model that absorbs their expertise. Our method integrates techniques of multi-task learning, continual learning, and distillation. Further, it demands significantly less computational cost compared to traditional multi-task training from scratch, and it only needs a small fraction of the pre-training datasets that were initially used to train individual models. By applying our method to SAM and CLIP, we obtain SAM-CLIP: a unified model that combines the capabilities of SAM and CLIP into a single vision transformer. Compared with deploying SAM and CLIP independently, our merged model, SAM-CLIP, reduces storage and compute costs for inference, making it well-suited for edge device applications. We show that SAM-CLIP not only retains the foundational strengths of SAM and CLIP, but also introduces synergistic functionalities, notably in zero-shot semantic segmentation, where SAM-CLIP establishes new state-of-the-art results on 5 benchmarks. It outperforms previous models that are specifically designed for this task by a large margin, including +6.8% and +5.9% mean IoU improvement on Pascal-VOC and COCO-Stuff datasets, respectively. This paper introduces a method for efficiently merging pre-trained Vision Foundation Models (VFMs) into a unified model, combining their expertise without requiring extensive training from scratch. Maintaining separate VFMs for different tasks is inefficient, while traditional multi-task learning is computationally expensive. This work offers a middle ground by efficiently merging VFMs with minimal training. The method treats merging as a continual learning problem, using multi-task distillation and a small replay dataset to transfer knowledge from an auxiliary VFM to a base VFM while mitigating forgetting. The merged model, combining SAM and CLIP (called SAM-CLIP), retains the zero-shot capabilities of both original models (instance segmentation and image classification). SAM-CLIP exhibits stronger representation learning abilities compared to individual SAM and CLIP models. SAM-CLIP demonstrates emergent capability in zero-shot semantic segmentation, achieving state-of-the-art results on 5 benchmarks. The merged model might inherit limitations (e.g., biases in data distribution) from the original VFMs. The merged model requires an additional head for the auxiliary model, increasing the overall size. vision foundation models, model merging, knowledge distillation, continual learning, zero-shot learning
2310.15169 Report FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, Ziwei Liu With the availability of large-scale video datasets and the advances of diffusion models, text-driven video generation has achieved substantial progress. However, existing video generation models are typically trained on a limited number of frames, resulting in the inability to generate high-fidelity long videos during inference. Furthermore, these models only support single-text conditions, whereas real-life scenarios often require multi-text conditions as the video content changes over time. To tackle these challenges, this study explores the potential of extending the text-driven capability to generate longer videos conditioned on multiple texts. 1) We first analyze the impact of initial noise in video diffusion models. Then building upon the observation of noise, we propose FreeNoise, a tuning-free and time-efficient paradigm to enhance the generative capabilities of pretrained video diffusion models while preserving content consistency. Specifically, instead of initializing noises for all frames, we reschedule a sequence of noises for long-range correlation and perform temporal attention over them by window-based function. 2) Additionally, we design a novel motion injection method to support the generation of videos conditioned on multiple text prompts. Extensive experiments validate the superiority of our paradigm in extending the generative capabilities of video diffusion models. It is noteworthy that compared with the previous best-performing method which brought about 255% extra time cost, our method incurs only negligible time cost of approximately 17%. Generated video samples are available at our website: http://haonanqiu.com/projects/FreeNoise.html. This paper proposes FreeNoise, a tuning-free and time-efficient paradigm to enhance the generative capabilities of pre-trained video diffusion models for longer and multi-prompt video generation. Existing video generation models are limited in their ability to generate high-fidelity long videos and often only support single-text conditions, hindering their applicability to real-life scenarios. The proposed FreeNoise method leverages noise rescheduling with local shuffling and window-based attention fusion to enable longer video generation while maintaining content consistency. It also introduces a motion injection strategy for multi-prompt video generation by modulating the influence of text prompts during the denoising process. FreeNoise outperforms previous methods in generating longer videos with better content consistency and visual quality, as evidenced by quantitative metrics (FVD, KVD, CLIP-SIM) and user studies. The proposed motion injection method effectively enables multi-prompt video generation with smooth transitions and coherent motion continuity. Compared to previous best methods, FreeNoise incurs significantly less computational overhead during inference (17% vs. 255%). The weakening effect of repeated locally shuffled noises might limit the introduction of new content as video length increases. The performance of FreeNoise is constrained by the base model's ability to handle videos with significant subject movement. video generation, diffusion models, long video generation, multi-prompt video generation, content consistency
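A minimal sketch of the noise-rescheduling step described above, assuming the pretrained video model was trained on clips of `base_len` frames; the window-based attention fusion and the motion-injection strategy for multi-prompt generation are omitted:

```python
import torch

def reschedule_noise(num_frames, base_len, latent_shape, generator=None):
    """Noise rescheduling in the spirit of FreeNoise: sample noise for a short
    clip of `base_len` frames, then extend to `num_frames` by reusing those
    noises with local shuffling, so distant frames stay correlated instead of
    being drawn independently."""
    base = torch.randn(base_len, *latent_shape, generator=generator)
    chunks = [base]
    while sum(c.shape[0] for c in chunks) < num_frames:
        perm = torch.randperm(base_len, generator=generator)  # locally shuffle frame order
        chunks.append(base[perm])                             # reuse the same noise tensors
    return torch.cat(chunks, dim=0)[:num_frames]
```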
2310.15160 Report FreeMask: Synthetic Images with Dense Annotations Make Stronger Segmentation Models Lihe Yang, Xiaogang Xu, Bingyi Kang, Yinghuan Shi, Hengshuang Zhao Semantic segmentation has witnessed tremendous progress due to the proposal of various advanced network architectures. However, they are extremely hungry for delicate annotations to train, and the acquisition is laborious and unaffordable. Therefore, we present FreeMask in this work, which resorts to synthetic images from generative models to ease the burden of both data collection and annotation procedures. Concretely, we first synthesize abundant training images conditioned on the semantic masks provided by realistic datasets. This yields extra well-aligned image-mask training pairs for semantic segmentation models. We surprisingly observe that, solely trained with synthetic images, we already achieve comparable performance with real ones (e.g., 48.3 vs. 48.5 mIoU on ADE20K, and 49.3 vs. 50.5 on COCO-Stuff). Then, we investigate the role of synthetic images by joint training with real images, or pre-training for real images. Meantime, we design a robust filtering principle to suppress incorrectly synthesized regions. In addition, we propose to inequally treat different semantic masks to prioritize those harder ones and sample more corresponding synthetic images for them. As a result, either jointly trained or pre-trained with our filtered and re-sampled synthesized images, segmentation models can be greatly enhanced, e.g., from 48.7 to 52.0 on ADE20K. Code is available at https://github.com/LiheYoung/FreeMask. This paper presents FreeMask, a novel method to enhance fully-supervised semantic segmentation by leveraging synthetic images generated from semantic masks. Collecting and annotating real images for semantic segmentation is laborious and expensive. This work explores using synthetic data from generative models to address this challenge. The authors use FreestyleNet, a mask-to-image synthesis model, to generate synthetic images from real semantic masks. They propose two strategies: 1) Filtering noisy synthetic regions based on class-level loss analysis, and 2) Re-sampling synthetic images based on mask-level hardness to prioritize challenging layouts. Training solely on synthetic images achieves comparable performance to training on real images (e.g., 48.3 vs 48.5 mIoU on ADE20K). Jointly training on real and synthetic images significantly improves performance over using real images alone (e.g., 48.7 to 52.0 mIoU on ADE20K). Pre-training on synthetic images and fine-tuning on real images also leads to substantial improvements. Generating synthetic images can be time-consuming. The proposed method's effectiveness in more complex real-world scenarios requires further investigation. semantic segmentation, synthetic data, generative models, image synthesis, data augmentation
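A minimal sketch of the filtering principle described above: synthetic pixels whose loss is far above the running mean loss of their class are ignored during training. The tolerance value and the `class_mean_loss` bookkeeping are illustrative assumptions, not the paper's exact implementation:

```python
import torch

def filter_synthetic_pixels(pixel_loss, labels, class_mean_loss, tolerance=1.25,
                            ignore_index=255):
    """Suppress likely mis-synthesized regions: a synthetic pixel of class c is
    ignored when its loss exceeds `tolerance` times the mean loss of class c.
    `pixel_loss` and `labels` are (H, W) tensors for one synthetic image;
    `class_mean_loss` maps class id -> mean loss tracked over real images."""
    filtered = labels.clone()
    for c, mean_c in class_mean_loss.items():
        noisy = (labels == c) & (pixel_loss > tolerance * mean_c)
        filtered[noisy] = ignore_index  # excluded from the segmentation loss
    return filtered
```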
2310.15111 Report Matryoshka Diffusion Models Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Josh Susskind, Navdeep Jaitly Diffusion models are the de facto approach for generating high-quality images and videos, but learning high-dimensional models remains a formidable task due to computational and optimization challenges. Existing methods often resort to training cascaded models in pixel space or using a downsampled latent space of a separately trained auto-encoder. In this paper, we introduce Matryoshka Diffusion Models (MDM), an end-to-end framework for high-resolution image and video synthesis. We propose a diffusion process that denoises inputs at multiple resolutions jointly and uses a NestedUNet architecture where features and parameters for small-scale inputs are nested within those of large scales. In addition, MDM enables a progressive training schedule from lower to higher resolutions, which leads to significant improvements in optimization for high-resolution generation. We demonstrate the effectiveness of our approach on various benchmarks, including class-conditioned image generation, high-resolution text-to-image, and text-to-video applications. Remarkably, we can train a single pixel-space model at resolutions of up to 1024x1024 pixels, demonstrating strong zero-shot generalization using the CC12M dataset, which contains only 12 million images. This paper introduces Matryoshka Diffusion Models (MDM), an end-to-end diffusion model framework for high-resolution image and video synthesis that addresses the computational and optimization challenges of traditional high-dimensional models. Scaling diffusion models to high resolutions for complex generation tasks like text-to-image synthesis is challenging. Existing methods rely on cascaded or latent approaches, which complicate training, inference, and can limit generation quality. MDM uses a multi-resolution diffusion process in an extended space, jointly denoising inputs at multiple resolutions using a NestedUNet architecture. It employs a progressive training schedule, starting from lower resolutions and gradually adding higher resolutions. Joint multi-resolution denoising and a nested architecture lead to faster convergence and better quality compared to single-resolution diffusion. Progressive training significantly speeds up the training process for high-resolution models, outperforming cascaded diffusion baselines. MDM achieves high performance in text-to-image generation up to 1024x1024 resolution on a relatively small dataset (CC12M), demonstrating strong zero-shot generalization. The paper primarily explores a limited set of architectures, leaving room for further improvements in weight sharing and parameter distribution across resolutions. While compared to Latent Diffusion Models (LDM), a more thorough investigation of combining MDM with autoencoder-based approaches is left as future work. diffusion models, high-resolution synthesis, text-to-image generation, text-to-video generation, multi-resolution modeling
2310.15110 Report Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model Ruoxi Shi, Hansheng Chen, Zhuoyang Zhang, Minghua Liu, Chao Xu, Xinyue Wei, Linghao Chen, Chong Zeng, Hao Su We report Zero123++, an image-conditioned diffusion model for generating 3D-consistent multi-view images from a single input view. To take full advantage of pretrained 2D generative priors, we develop various conditioning and training schemes to minimize the effort of finetuning from off-the-shelf image diffusion models such as Stable Diffusion. Zero123++ excels in producing high-quality, consistent multi-view images from a single image, overcoming common issues like texture degradation and geometric misalignment. Furthermore, we showcase the feasibility of training a ControlNet on Zero123++ for enhanced control over the generation process. The code is available at https://github.com/SUDO-AI-3D/zero123plus. Introduces Zero123++, an image-conditioned diffusion model that generates consistent multi-view images from a single input view, by finetuning Stable Diffusion with novel conditioning and training schemes. Addresses limitations of previous methods like Zero-1-to-3 in achieving 3D consistency in generated multi-view images, aiming to bridge the gap with true 3D scene representation. Utilizes a multi-view tiling strategy, leverages Stable Diffusion's local and global conditioning mechanisms, adopts a linear noise schedule, and implements a phased training approach for optimal prior utilization. Generates high-quality, consistent multi-view images from single inputs, outperforming previous methods in visual fidelity and consistency. Demonstrates strong generalization ability, effectively handling real photos, AI-generated images, and 2D illustrations. Presents a depth-controlled version using ControlNet, enabling geometry-guided generation with superior consistency (LPIPS of 0.086). Current model trained on a medium-scale dataset (Objaverse), potentially limiting its representational capacity. Exploration of two-stage refiner models and further dataset scaling are planned to enhance detail and generalization. multi-view generation, diffusion models, 3d consistency, image conditioning, controlnet
2310.15008 Report Wonder3D: Single Image to 3D using Cross-Domain Diffusion Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, Wenping Wang In this work, we introduce Wonder3D, a novel method for efficiently generating high-fidelity textured meshes from single-view images. Recent methods based on Score Distillation Sampling (SDS) have shown the potential to recover 3D geometry from 2D diffusion priors, but they typically suffer from time-consuming per-shape optimization and inconsistent geometry. In contrast, certain works directly produce 3D information via fast network inferences, but their results are often of low quality and lack geometric details. To holistically improve the quality, consistency, and efficiency of image-to-3D tasks, we propose a cross-domain diffusion model that generates multi-view normal maps and the corresponding color images. To ensure consistency, we employ a multi-view cross-domain attention mechanism that facilitates information exchange across views and modalities. Lastly, we introduce a geometry-aware normal fusion algorithm that extracts high-quality surfaces from the multi-view 2D representations. Our extensive evaluations demonstrate that our method achieves high-quality reconstruction results, robust generalization, and reasonably good efficiency compared to prior works. Wonder3D, a novel method for efficiently generating high-fidelity textured meshes from single-view images. Existing methods for single-view 3D reconstruction either suffer from time-consuming optimization, inconsistent geometry, or limited generalizability. This paper aims to address these limitations. Wonder3D leverages a cross-domain diffusion model to generate consistent multi-view normal maps and color images. It then employs a geometry-aware normal fusion algorithm to extract high-quality surfaces from the 2D representations. Wonder3D achieves high-quality reconstruction results with fine-grained details. The method demonstrates robust generalization across diverse image styles. Wonder3D offers good efficiency, reconstructing textured meshes in just 2 minutes. The limited number of views (six) poses challenges for reconstructing objects with thin structures or severe occlusions. Scaling up to more views requires addressing increased computational demands during training. 3d reconstruction, single-view reconstruction, diffusion models, normal fusion, cross-domain learning
2310.14942 Report Domain Watermark: Effective and Harmless Dataset Copyright Protection is Closed at Hand Junfeng Guo, Yiming Li, Lixu Wang, Shu-Tao Xia, Heng Huang, Cong Liu, Bo Li The prosperity of deep neural networks (DNNs) has benefited largely from open-source datasets, based on which users can evaluate and improve their methods. In this paper, we revisit backdoor-based dataset ownership verification (DOV), which is currently the only feasible approach to protect the copyright of open-source datasets. We reveal that these methods are fundamentally harmful given that they could introduce malicious misclassification behaviors to watermarked DNNs by the adversaries. In this paper, we design DOV from another perspective by making watermarked models (trained on the protected dataset) correctly classify some 'hard' samples that will be misclassified by the benign model. Our method is inspired by the generalization property of DNNs, where we find a hardly-generalized domain for the original dataset (as its domain watermark). It can be easily learned with the protected dataset containing modified samples. Specifically, we formulate the domain generation as a bi-level optimization and propose to optimize a set of visually-indistinguishable clean-label modified data with similar effects to domain-watermarked samples from the hardly-generalized domain to ensure watermark stealthiness. We also design a hypothesis-test-guided ownership verification via our domain watermark and provide the theoretical analyses of our method. Extensive experiments on three benchmark datasets are conducted, which verify the effectiveness of our method and its resistance to potential adaptive methods. The code for reproducing main experiments is available at https://github.com/JunfengGo/Domain-Watermark. This paper revisits dataset ownership verification (DOV), reveals the harm of backdoor-based methods, and proposes a harmless DOV approach using a 'domain watermark.' Protecting the copyright of open-source datasets is crucial, but existing backdoor-based DOV methods introduce security risks. The authors find a 'hardly-generalized domain' for the original dataset, train a model on modified samples from this domain, and use prediction differences for harmless verification. The domain watermark achieves high benign accuracy and verification success rates. It is resistant to adaptive methods like fine-tuning and model pruning. The method successfully distinguishes between models trained on the protected dataset and those trained independently. The verification success rate is restricted by the benign accuracy. Future work will explore lower watermarking rates and resistance to more adaptive methods. dataset ownership verification, domain watermark, harmless verification, copyright protection, deep neural networks
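A minimal sketch of hypothesis-test-guided ownership verification in this spirit: compare how often a suspect model and an independently trained benign model classify the domain-watermarked 'hard' samples correctly. The paired one-sided t-test and the 0/1-correctness inputs are illustrative choices, not necessarily the exact statistic used in the paper:

```python
import numpy as np
from scipy import stats

def verify_ownership(suspect_correct, benign_correct, alpha=0.05):
    """A model trained on the protected dataset should classify the 'hard'
    domain-watermarked samples correctly far more often than a benign model
    trained on other data. Inputs are 0/1 arrays over the same samples."""
    suspect_correct = np.asarray(suspect_correct, dtype=float)
    benign_correct = np.asarray(benign_correct, dtype=float)
    # One-sided paired t-test: H1 says the suspect's accuracy is higher.
    t_stat, p_value = stats.ttest_rel(suspect_correct, benign_correct,
                                      alternative="greater")
    return p_value < alpha, p_value
```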
2310.14532 Report Practical Deep Dispersed Watermarking with Synchronization and Fusion Hengchang Guo, Qilong Zhang, Junwei Luo, Feng Guo, Wenbin Zhang, Xiaodong Su, Minglei Li Deep learning based blind watermarking works have gradually emerged and achieved impressive performance. However, previous deep watermarking studies mainly focus on fixed low-resolution images while paying less attention to arbitrary resolution images, especially the high-resolution images that are widespread nowadays. Moreover, most works usually demonstrate robustness against typical non-geometric attacks (e.g., JPEG compression) but ignore common geometric attacks (e.g., Rotate) and more challenging combined attacks. To overcome the above limitations, we propose a practical deep Dispersed Watermarking with Synchronization and Fusion, called DWSF. Specifically, given an arbitrary-resolution cover image, we adopt a dispersed embedding scheme which sparsely and randomly selects several fixed small-size cover blocks to embed a consistent watermark message by a well-trained encoder. In the extraction stage, we first design a watermark synchronization module to locate and rectify the encoded blocks in the noised watermarked image. We then utilize a decoder to obtain messages embedded in these blocks, and propose a message fusion strategy based on similarity to make full use of the consistency among messages, thus determining a reliable message. Extensive experiments conducted on different datasets convincingly demonstrate the effectiveness of our proposed DWSF. Compared with state-of-the-art approaches, our blind watermarking achieves better performance: it improves the bit accuracy by 5.28% and 5.93% on average against single and combined attacks, respectively, and shows less file size increment and better visual quality. Our code is available at https://github.com/bytedance/DWSF. This paper proposes DWSF, a practical deep blind watermarking framework for arbitrary-resolution images, addressing the limitations of existing methods in handling high-resolution images and complex attacks. Existing deep watermarking methods struggle with high-resolution images common in real-world scenarios and lack robustness against complex, combined attacks. DWSF uses a dispersed embedding scheme to embed a consistent watermark message into randomly selected small image blocks. It then employs a watermark synchronization module to locate and rectify encoded blocks, even under geometric distortions. Finally, a message fusion strategy leverages message consistency for a reliable final watermark. DWSF achieves significantly higher visual quality (PSNR) and lower file size increment compared to state-of-the-art methods. DWSF demonstrates superior robustness against a wide range of single and combined attacks, consistently achieving over 98% bit accuracy. DWSF shows practical value with high bit check accuracy, indicating its ability to correctly decode the entire watermark message in realistic scenarios. The current implementation of DWSF primarily focuses on image watermarking, with potential extensions to other media like videos left for future work. Exploring more sophisticated message fusion techniques and further improving the efficiency of the watermark synchronization module are promising directions for future research. robust blind watermarking, deep learning, dispersed embedding, watermark synchronization, message fusion
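A minimal sketch of fusing the messages decoded from the dispersed blocks; DWSF's fusion is similarity-based, which is approximated here by a similarity-weighted majority vote over bits:

```python
import numpy as np

def fuse_messages(decoded_bits):
    """Fuse watermark messages decoded from several dispersed blocks.
    `decoded_bits` has shape (num_blocks, message_len) with values in {0, 1}.
    Each block is weighted by its agreement with the bit-wise consensus,
    approximating the paper's similarity-based fusion."""
    decoded_bits = np.asarray(decoded_bits, dtype=float)
    consensus = (decoded_bits.mean(axis=0) >= 0.5).astype(float)
    # Weight each block by the fraction of bits that match the consensus message.
    weights = (decoded_bits == consensus).mean(axis=1)
    fused = (np.average(decoded_bits, axis=0, weights=weights) >= 0.5).astype(int)
    return fused
```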
2310.14487 Report VQ-NeRF: Vector Quantization Enhances Implicit Neural Representations Yiying Yang, Wen Liu, Fukun Yin, Xin Chen, Gang Yu, Jiayuan Fan, Tao Chen Recent advancements in implicit neural representations have contributed to high-fidelity surface reconstruction and photorealistic novel view synthesis. However, the computational complexity inherent in these methodologies presents a substantial impediment, constraining the attainable frame rates and resolutions in practical applications. In response to this predicament, we propose VQ-NeRF, an effective and efficient pipeline for enhancing implicit neural representations via vector quantization. The essence of our method involves reducing the sampling space of NeRF to a lower resolution and subsequently reinstating it to the original size utilizing a pre-trained VAE decoder, thereby effectively mitigating the sampling time bottleneck encountered during rendering. Although the codebook furnishes representative features, reconstructing fine texture details of the scene remains challenging due to high compression rates. To overcome this constraint, we design an innovative multi-scale NeRF sampling scheme that concurrently optimizes the NeRF model at both compressed and original scales to enhance the network's ability to preserve fine details. Furthermore, we incorporate a semantic loss function to improve the geometric fidelity and semantic coherence of our 3D reconstructions. Extensive experiments demonstrate the effectiveness of our model in achieving the optimal trade-off between rendering quality and efficiency. Evaluation on the DTU, BlendMVS, and H3DS datasets confirms the superior performance of our approach. Presents VQ-NeRF, a novel framework that leverages Vector Quantization (VQ) to enhance Neural Radiance Fields (NeRF) for efficient and high-quality 3D surface representation. Addresses the computational bottleneck in traditional NeRF methods, which limits their practical applications in terms of achievable frame rates and resolutions. Reduces the sampling space of NeRF using a pre-trained VQ-VAE decoder and introduces a multi-scale semantic consistency module to recover texture details and ensure realism in rendered images. Achieves optimal trade-off between rendering quality and efficiency, outperforming baselines like NeRF, VolSDF, and Coco-INR. Significantly reduces rendering time (more than 10 times faster) compared to state-of-the-art methods while maintaining high visual fidelity. Demonstrates superior performance in quantitative metrics (PSNR, SSIM, LPIPS) and qualitative comparisons on DTU, BlendedMVS, and H3DS datasets. VQ-NeRF requires significant training time due to scene-specific optimization. Future work will explore general representations for different scenes to improve generalization and reduce training time. neural radiance fields, vector quantization, 3d surface reconstruction, novel view synthesis, vq-vae
2310.14189 Report Improved Techniques for Training Consistency Models Yang Song, Prafulla Dhariwal Consistency models are a nascent family of generative models that can sample high quality data in one step without the need for adversarial training. Current consistency models achieve optimal sample quality by distilling from pre-trained diffusion models and employing learned metrics such as LPIPS. However, distillation limits the quality of consistency models to that of the pre-trained diffusion model, and LPIPS causes undesirable bias in evaluation. To tackle these challenges, we present improved techniques for consistency training, where consistency models learn directly from data without distillation. We delve into the theory behind consistency training and identify a previously overlooked flaw, which we address by eliminating Exponential Moving Average from the teacher consistency model. To replace learned metrics like LPIPS, we adopt Pseudo-Huber losses from robust statistics. Additionally, we introduce a lognormal noise schedule for the consistency training objective, and propose to double total discretization steps every set number of training iterations. Combined with better hyperparameter tuning, these modifications enable consistency models to achieve FID scores of 2.51 and 3.25 on CIFAR-10 and ImageNet 64x64 respectively in a single sampling step. These scores mark a 3.5x and 4x improvement compared to prior consistency training approaches. Through two-step sampling, we further reduce FID scores to 2.24 and 2.77 on these two datasets, surpassing those obtained via distillation in both one-step and two-step settings, while narrowing the gap between consistency models and other state-of-the-art generative models. This paper introduces improved consistency training (iCT) techniques for consistency models, a new class of generative models that produce high-quality samples in one step without adversarial training, achieving state-of-the-art results without relying on pre-trained diffusion models or learned metrics like LPIPS. Consistency training (CT) allows consistency models to learn directly from data, making them a distinct family of generative models. Previous CT methods were outperformed by distillation-based methods and relied on learned metrics, limiting their potential and introducing bias. The paper analyzes and improves CT by: 1) optimizing weighting functions, noise embeddings, and dropout, 2) removing Exponential Moving Average from the teacher network, 3) adopting Pseudo-Huber losses instead of LPIPS, 4) introducing an improved curriculum for total discretization steps, and 5) proposing a new noise schedule based on lognormal distributions. iCT achieves FID scores of 2.51 and 3.25 on CIFAR-10 and ImageNet 64x64 in one step, surpassing distillation-based methods and representing 3.5x and 4x improvements over prior CT methods. Two-step iCT achieves FIDs of 2.24 and 2.77 on CIFAR-10 and ImageNet 64x64, exceeding distillation-based methods in both one-step and two-step settings. iCT demonstrates comparable or superior performance to top-tier diffusion models and GANs, showcasing its potential as a new independent family of generative models. The study primarily focuses on CIFAR-10 and ImageNet 64x64; further investigation is needed to validate effectiveness on higher resolution datasets. While iCT significantly reduces the computational overhead of distillation, it still requires careful hyperparameter tuning, especially for the Pseudo-Huber loss. generative models, consistency models, consistency training, image generation, deep learning
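A minimal sketch of two of the changes highlighted above: the Pseudo-Huber metric replacing LPIPS, and the removal of the EMA teacher (the target comes from the same network under stop-gradient). The constant `c` (tied to data dimensionality in the paper) and the weighting are illustrative, and `model(x, sigma)` stands for the parameterized consistency function:

```python
import torch

def pseudo_huber(x, y, c=0.00054):
    """Pseudo-Huber metric used in place of LPIPS: sqrt(||x - y||^2 + c^2) - c,
    computed per sample (the default c here is illustrative)."""
    diff = (x - y).flatten(start_dim=1)
    return torch.sqrt(diff.pow(2).sum(dim=1) + c * c) - c

def consistency_training_loss(model, x0, sigma_n, sigma_np1, weight=1.0):
    """One consistency-training step: the output at the larger noise level is
    pulled toward the same network's output at the adjacent smaller noise level
    on the same noise draw -- no EMA teacher, only a stop-gradient."""
    noise = torch.randn_like(x0)
    student = model(x0 + sigma_np1 * noise, sigma_np1)
    with torch.no_grad():
        target = model(x0 + sigma_n * noise, sigma_n)  # teacher shares the student's weights
    return (weight * pseudo_huber(student, target)).mean()
```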
2310.14108 Report CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement Mohammadreza Salehi, Mehrdad Farajtabar, Maxwell Horton, Fartash Faghri, Hadi Pouransari, Raviteja Vemulapalli, Oncel Tuzel, Ali Farhadi, Mohammad Rastegari, Sachin Mehta Contrastive language image pretraining (CLIP) is a standard method for training vision-language models. While CLIP is scalable, promptable, and robust to distribution shifts on image classification tasks, it lacks object localization capabilities. This paper studies the following question: Can we augment CLIP training with task-specific vision models from model zoos to improve its visual representations? Towards this end, we leverage open-source task-specific vision models to generate pseudo-labels for an uncurated and noisy image-text dataset. Subsequently, we train CLIP models on these pseudo-labels in addition to the contrastive training on image and text pairs. This simple setup shows substantial improvements of up to 16.3% across different vision tasks, including segmentation, detection, depth estimation, and surface normal estimation. Importantly, these enhancements are achieved without compromising CLIP's existing capabilities, including its proficiency in promptable zero-shot classification. This paper proposes CLIP trained with Experts (CLIPTe), a method to improve CLIP's visual representations by leveraging task-specific vision models from model zoos to generate pseudo-labels for training on an uncurated and noisy image-text dataset. While CLIP excels in image classification, it lacks object localization capabilities. CLIPTe aims to bridge this gap by enhancing CLIP's visual representations without compromising its existing strengths. CLIPTe generates pseudo-labels for an uncurated image-text dataset using open-source task-specific vision models (experts) for segmentation, depth estimation, and surface normal estimation. It then trains CLIP models on these pseudo-labels along with the standard contrastive training on image-text pairs. CLIPTe significantly improves CLIP's performance on various vision tasks, including segmentation, detection, depth estimation, and surface normal estimation, with up to 16.3% improvement in probing accuracy. The method exhibits positive transfer of representations to downstream tasks, indicating its ability to generalize learned knowledge. Importantly, CLIPTe preserves CLIP's inherent strengths, including zero-shot classification capabilities, ensuring its versatility across different vision domains. The paper mainly focuses on finetuning pre-trained CLIP models on CC3M, leaving exploration with larger datasets and diverse experts for future work. Further investigation into the impact of pseudo-label quality and noise on CLIPTe's performance is crucial. clip, vision-language models, pseudo-supervision, multi-task learning, object localization
2310.13772 Report TexFusion: Synthesizing 3D Textures with Text-Guided Image Diffusion Models Tianshi Cao, Karsten Kreis, Sanja Fidler, Nicholas Sharp, Kangxue Yin We present TexFusion (Texture Diffusion), a new method to synthesize textures for given 3D geometries, using large-scale text-guided image diffusion models. In contrast to recent works that leverage 2D text-to-image diffusion models to distill 3D objects using a slow and fragile optimization process, TexFusion introduces a new 3D-consistent generation technique specifically designed for texture synthesis that employs regular diffusion model sampling on different 2D rendered views. Specifically, we leverage latent diffusion models, apply the diffusion model's denoiser on a set of 2D renders of the 3D object, and aggregate the different denoising predictions on a shared latent texture map. Final output RGB textures are produced by optimizing an intermediate neural color field on the decodings of 2D renders of the latent texture. We thoroughly validate TexFusion and show that we can efficiently generate diverse, high quality and globally coherent textures. We achieve state-of-the-art text-guided texture synthesis performance using only image diffusion models, while avoiding the pitfalls of previous distillation-based methods. The text-conditioning offers detailed control and we also do not rely on any ground truth 3D textures for training. This makes our method versatile and applicable to a broad range of geometry and texture types. We hope that TexFusion will advance AI-based texturing of 3D assets for applications in virtual reality, game design, simulation, and more. TexFusion is a novel method for synthesizing high-quality, globally coherent 3D textures on given meshes, guided by text prompts. Existing methods for text-driven 3D texture synthesis either lack global coherence or rely on slow and unstable optimization processes. TexFusion leverages latent diffusion models and introduces the Sequential Interlaced Multiview Sampler (SIMS), which interlaces denoising iterations with texture map aggregation across multiple camera views. TexFusion generates textures with natural color tones and fewer artifacts compared to the state-of-the-art TEXTure method. User studies show preference for TexFusion results in terms of natural color, detail, cleanliness, and alignment with prompts. The method is significantly faster than previous optimization-based techniques, achieving comparable speed to TEXTure. The sharpness of the generated textures is not yet ideal and could be further improved. Texture generation is not real-time, limiting its applicability in interactive settings. 3d texture synthesis, text-guided generation, diffusion models, multi-view consistency, latent space
2310.13730 Report Localizing and Editing Knowledge in Text-to-Image Generative Models Samyadeep Basu, Nanxuan Zhao, Vlad Morariu, Soheil Feizi, Varun Manjunatha Text-to-Image Diffusion Models such as Stable-Diffusion and Imagen have achieved unprecedented quality of photorealism with state-of-the-art FID scores on MS-COCO and other generation benchmarks. Given a caption, image generation requires fine-grained knowledge about attributes such as object structure, style, and viewpoint amongst others. Where does this information reside in text-to-image generative models? In our paper, we tackle this question and understand how knowledge corresponding to distinct visual attributes is stored in large-scale text-to-image diffusion models. We adapt Causal Mediation Analysis for text-to-image models and trace knowledge about distinct visual attributes to various (causal) components in the (i) UNet and (ii) text-encoder of the diffusion model. In particular, we show that unlike generative large-language models, knowledge about different attributes is not localized in isolated components, but is instead distributed amongst a set of components in the conditional UNet. These sets of components are often distinct for different visual attributes. Remarkably, we find that the CLIP text-encoder in public text-to-image models such as Stable-Diffusion contains only one causal state across different visual attributes, and this is the first self-attention layer corresponding to the last subject token of the attribute in the caption. This is in stark contrast to the causal states in other language models which are often the mid-MLP layers. Based on this observation of only one causal state in the text-encoder, we introduce a fast, data-free model editing method Diff-QuickFix which can effectively edit concepts in text-to-image models. DiffQuickFix can edit (ablate) concepts in under a second with a closed-form update, providing a significant 1000x speedup and comparable editing performance to existing fine-tuning based editing methods. This paper investigates how knowledge about different visual attributes is stored in large-scale text-to-image diffusion models (e.g., Stable Diffusion). Understanding where this knowledge resides is crucial for interpreting these models and enabling controlled edits. The authors adapt Causal Mediation Analysis to trace knowledge about visual attributes to specific components in the UNet and the text-encoder. Knowledge is distributed across the UNet with different distributions for distinct attributes, unlike in large language models where it is localized. The CLIP text-encoder exhibits a single causal state for all visual attributes: the first self-attention layer corresponding to the last subject token. This localized causal state in the text-encoder allows for efficient model editing. The study primarily focuses on Stable Diffusion, limiting the generalizability of the findings to other model architectures. Further investigation into individual components within each layer (e.g., neurons) is left for future work. text-to-image synthesis, diffusion models, interpretability, causal mediation analysis, model editing
2310.13545 Report ScaleLong: Towards More Stable Training of Diffusion Model via Scaling Network Long Skip Connection Zhongzhan Huang, Pan Zhou, Shuicheng Yan, Liang Lin In diffusion models, UNet is the most popular network backbone, since its long skip connections (LSCs) between distant network blocks can aggregate long-range information and alleviate vanishing gradients. Unfortunately, UNet often suffers from unstable training in diffusion models, which can be alleviated by scaling its LSC coefficients smaller. However, a theoretical understanding of the instability of UNet in diffusion models, and of the performance improvement brought by LSC scaling, has been absent. To solve this issue, we theoretically show that the coefficients of LSCs in UNet have a large effect on the stability of forward and backward propagation and on the robustness of UNet. Specifically, the hidden features and gradients of UNet at any layer can oscillate, and their oscillation ranges are actually large, which explains the instability of UNet training. Moreover, UNet is also provably sensitive to perturbed input, and predicts an output distant from the desired output, yielding an oscillatory loss and thus oscillatory gradients. Besides, we also observe theoretical benefits of LSC coefficient scaling in the stability of hidden features and gradients and in robustness. Finally, inspired by our theory, we propose an effective coefficient scaling framework, ScaleLong, that scales the coefficients of LSCs in UNet and improves the training stability of UNet. Experimental results on four famous datasets show that our methods are superior in stabilizing training and yield about 1.5x training acceleration on different diffusion models with UNet or UViT backbones. Code: https://github.com/sail-sg/ScaleLong This paper theoretically analyzes the training instability of UNet in diffusion models and proposes a framework ScaleLong with two scaling methods (CS and LS) for long skip connections to improve stability. UNet, a popular backbone in diffusion models, often suffers from unstable training. Understanding this instability and finding ways to stabilize training are crucial for improving diffusion model performance and efficiency. The authors theoretically analyze the stability of forward and backward propagation in UNet, as well as its robustness to noisy input. They derive bounds for hidden feature oscillation, gradient magnitude, and robustness error, showing the influence of long skip connection coefficients. Inspired by the theoretical analysis, they propose ScaleLong, which includes two methods: CS (constant scaling) and LS (learnable scaling). Scaling the coefficients of long skip connections can effectively stabilize UNet training in diffusion models. CS with exponentially decaying coefficients is more effective than universal scaling methods like 1/√2-scaling. LS, which learns scaling coefficients adaptively, further improves training stability and convergence speed. CS requires manual selection of the scaling coefficient within an estimated range. LS introduces a small but non-negligible number of additional parameters and computational cost. diffusion models, unet, training stability, long skip connections, coefficient scaling
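A minimal sketch of the constant scaling (CS) variant: each long skip connection is damped by an exponentially decaying coefficient before being merged with its decoder feature. The value of `kappa` and the ordering convention of the skips are illustrative assumptions:

```python
import torch

def scale_long_skips(skip_features, kappa=0.7):
    """Constant scaling (CS) in the spirit of ScaleLong: the i-th long skip
    connection is scaled by kappa**i (kappa < 1), which narrows the oscillation
    range of hidden features and gradients analyzed in the paper.
    `skip_features` is assumed ordered from the outermost (longest) skip inward."""
    return [kappa ** i * h for i, h in enumerate(skip_features)]

# Inside a UNet forward pass one might then do, e.g.:
#   skips = scale_long_skips(skips, kappa=0.7)
#   x = decoder_block(torch.cat([x, skips.pop()], dim=1))
```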
2310.13165 Report CycleNet: Rethinking Cycle Consistency in Text-Guided Diffusion for Image Manipulation Sihan Xu, Ziqiao Ma, Yidong Huang, Honglak Lee, Joyce Chai Diffusion models (DMs) have enabled breakthroughs in image synthesis tasks but lack an intuitive interface for consistent image-to-image (I2I) translation. Various methods have been explored to address this issue, including mask-based methods, attention-based methods, and image-conditioning. However, it remains a critical challenge to enable unpaired I2I translation with pre-trained DMs while maintaining satisfying consistency. This paper introduces Cyclenet, a novel but simple method that incorporates cycle consistency into DMs to regularize image manipulation. We validate Cyclenet on unpaired I2I tasks of different granularities. Besides the scene and object level translation, we additionally contribute a multi-domain I2I translation dataset to study the physical state changes of objects. Our empirical studies show that Cyclenet is superior in translation consistency and quality, and can generate high-quality images for out-of-domain distributions with a simple change of the textual prompt. Cyclenet is a practical framework, which is robust even with very limited training data (around 2k) and requires minimal computational resources (1 GPU) to train. Project homepage: https://cyclenetweb.github.io/ The paper introduces CycleNet, a novel method that incorporates cycle consistency into diffusion models (DMs) for image-to-image (I2I) translation to improve consistency in image manipulation. Consistency in image manipulation is crucial for various DM applications, especially in unpaired I2I scenarios where correspondence between source and target domain images is not guaranteed. CycleNet leverages cycle consistency regularization over the image translation cycle by introducing reconstruction loss, cycle consistency loss, and invariance loss. It utilizes a ControlNet with pre-trained Stable Diffusion as the backbone and incorporates text prompts and image conditioning to guide the translation. CycleNet demonstrates superior translation consistency and quality compared to previous approaches on scene, object, and state-level I2I tasks. It is computationally efficient, requiring only limited training data and a single GPU. CycleNet exhibits robust zero-shot I2I translation capability, generating faithful and high-quality images for out-of-domain distributions with a simple change of the textual prompt. Cycle consistency constraints can be too restrictive, leading to trade-offs between consistency and translation quality. Maintaining global consistency while making faithful local edits remains challenging for LDM-based approaches. image-to-image translation, diffusion models, cycle consistency, image manipulation, zero-shot learning
2310.13119 Report DreamSpace: Dreaming Your Room Space with Text-Driven Panoramic Texture Propagation Bangbang Yang, Wenqi Dong, Lin Ma, Wenbo Hu, Xiao Liu, Zhaopeng Cui, Yuewen Ma Diffusion-based methods have achieved prominent success in generating 2D media. However, accomplishing similar proficiencies for scene-level mesh texturing in 3D spatial applications, e.g., XR/VR, remains constrained, primarily due to the intricate nature of 3D geometry and the necessity for immersive free-viewpoint rendering. In this paper, we propose a novel indoor scene texturing framework, which delivers text-driven texture generation with enchanting details and authentic spatial coherence. The key insight is to first imagine a stylized 360° panoramic texture from the central viewpoint of the scene, and then propagate it to the remaining areas with inpainting and imitating techniques. To ensure textures that are meaningful and aligned with the scene, we develop a novel coarse-to-fine panoramic texture generation approach with dual texture alignment, which considers both the geometry and texture cues of the captured scenes. To cope with cluttered geometries during texture propagation, we design a separated strategy, which conducts texture inpainting in confidential regions and then learns an implicit imitating network to synthesize textures in occluded and tiny structural areas. Extensive experiments and the immersive VR application on real-world indoor scenes demonstrate the high quality of the generated textures and the engaging experience on VR headsets. Project webpage: https://ybbbbt.com/publication/dreamspace DreamSpace: a novel text-driven framework for generating semantically meaningful and spatially coherent scene textures for real-world indoor scenes represented as meshes, suitable for immersive VR applications. Existing methods for scene stylization either lack semantic meaning, are computationally expensive for VR, or struggle with real-world scene complexities. DreamSpace addresses these limitations by enabling text-driven, high-quality texture generation for real-world indoor scenes with immersive VR experiences. DreamSpace uses a top-down approach: 1) Generates a stylized panoramic texture from the central viewpoint using a coarse-to-fine panoramic diffusion process with dual texture alignment. 2) Propagates the texture to the rest of the scene using confidential texture inpainting for visible areas and an implicit texture imitating network for occluded/tiny areas. Generates high-resolution, semantically meaningful textures for real-world indoor scenes based on text prompts. Outperforms existing methods in terms of visual quality, image-text matching, and user preference. Enables immersive VR experiences by generating textured meshes compatible with standard rendering pipelines and HMD devices. Baked lighting in the generated textures limits custom lighting and dynamic shadows in rendering. Reliance on real-world textures and high-quality scene reconstruction as input may limit applicability. text-driven texture generation, scene stylization, panoramic diffusion, immersive vr, 3d scene understanding
2310.13102 Report Particle Guidance: non-I.I.D. Diverse Sampling with Diffusion Models Gabriele Corso, Yilun Xu, Valentin de Bortoli, Regina Barzilay, Tommi Jaakkola In light of the widespread success of generative models, a significant amount of research has gone into speeding up their sampling time. However, generative models are often sampled multiple times to obtain a diverse set incurring a cost that is orthogonal to sampling time. We tackle the question of how to improve diversity and sample efficiency by moving beyond the common assumption of independent samples. We propose particle guidance, an extension of diffusion-based generative sampling where a joint-particle time-evolving potential enforces diversity. We analyze theoretically the joint distribution that particle guidance generates, how to learn a potential that achieves optimal diversity, and the connections with methods in other disciplines. Empirically, we test the framework both in the setting of conditional image generation, where we are able to increase diversity without affecting quality, and molecular conformer generation, where we reduce the state-of-the-art median error by 13% on average. This paper introduces "particle guidance", a novel framework that enhances the sample efficiency of diffusion models by guiding them to generate diverse sets of samples instead of independent samples. While deep generative models have achieved remarkable success, their reliance on generating numerous independent samples for diversity is computationally expensive. This work tackles the challenge of improving both diversity and sample efficiency in generative models. Particle guidance modifies the reverse diffusion process by introducing a time-evolving potential that encourages diversity among a set of particles being sampled simultaneously. Two instantiations are presented: fixed potential PG, which uses hand-crafted potentials for efficient diverse sampling without additional training, and learned potential PG, which learns potentials to achieve specific joint distributions and preserve marginal distributions. Theoretical analysis of particle guidance leads to an expression for the joint marginal distribution of the sampled process under any arbitrary guidance potential. A training objective is derived to learn a time-evolving potential that enables sampling from a desired joint distribution, ensuring optimality under given diversity constraints. Empirical evaluations on text-to-image generation and molecular conformer generation demonstrate particle guidance's effectiveness in improving diversity and sample efficiency. In text-to-image generation, it increases diversity without sacrificing quality, and in molecular conformer generation, it achieves a 13% reduction in median error compared to state-of-the-art methods. Computational overhead of particle guidance can increase with the number of particles and the complexity of the potential function. Carefully choosing the potential or guidance weight is crucial to prevent degradation of sample quality due to excessive deviation from the marginal likelihood. diffusion models, generative models, sample efficiency, diversity, particle guidance
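A toy sketch of the fixed-potential variant described above: each reverse-diffusion step adds the gradient of a hand-crafted pairwise potential that penalizes similarity among the jointly sampled particles. The RBF potential, the Euler-Maruyama-style update, and the `score_fn` placeholder are assumptions for illustration, not the authors' exact choices.

```python
import torch

def pairwise_rbf_potential(x, bandwidth=1.0):
    # x: (n_particles, dim); larger when particles are close to each other.
    sq_dists = torch.cdist(x, x).pow(2)
    return torch.exp(-sq_dists / (2 * bandwidth ** 2)).sum()

def guided_reverse_step(x, t, score_fn, step_size=0.01, guidance_weight=0.1):
    x = x.detach().requires_grad_(True)
    grad_potential = torch.autograd.grad(pairwise_rbf_potential(x), x)[0]
    with torch.no_grad():
        # Usual score-based drift minus a repulsive term that couples the particles.
        drift = score_fn(x, t) - guidance_weight * grad_potential
        x_next = x + step_size * drift + (2 * step_size) ** 0.5 * torch.randn_like(x)
    return x_next
```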
2310.12973 Report Frozen Transformers in Language Models Are Effective Visual Encoder Layers Ziqi Pang, Ziyang Xie, Yunze Man, Yu-Xiong Wang This paper reveals that large language models (LLMs), despite being trained solely on textual data, are surprisingly strong encoders for purely visual tasks in the absence of language. Even more intriguingly, this can be achieved by a simple yet previously overlooked strategy -- employing a frozen transformer block from pre-trained LLMs as a constituent encoder layer to directly process visual tokens. Our work pushes the boundaries of leveraging LLMs for computer vision tasks, significantly departing from conventional practices that typically necessitate a multi-modal vision-language setup with associated language prompts, inputs, or outputs. We demonstrate that our approach consistently enhances performance across a diverse range of tasks, encompassing pure 2D and 3D visual recognition tasks (e.g., image and point cloud classification), temporal modeling tasks (e.g., action recognition), non-semantic tasks (e.g., motion forecasting), and multi-modal tasks (e.g., 2D/3D visual question answering and image-text retrieval). Such improvements are a general phenomenon, applicable to various types of LLMs (e.g., LLaMA and OPT) and different LLM transformer blocks. We additionally propose the information filtering hypothesis to explain the effectiveness of pre-trained LLMs in visual encoding -- the pre-trained LLM transformer blocks discern informative visual tokens and further amplify their effect. This hypothesis is empirically supported by the observation that the feature activation, after training with LLM transformer blocks, exhibits a stronger focus on relevant regions. We hope that our work inspires new perspectives on utilizing LLMs and deepening our understanding of their underlying mechanisms. Code is available at https://github.com/ziqipang/LM4VisualEncoding. This paper discovers that a frozen transformer block from a pre-trained large language model (LLM) can surprisingly serve as an effective visual encoder layer, enhancing performance across various computer vision tasks even without language input. This finding challenges the conventional view of LLMs as solely language-specific models, suggesting their potential for more general representation learning across modalities. The authors insert a frozen LLM transformer block, pre-trained on text data, into existing visual encoders, keeping the LLM block frozen during training. They evaluate this approach on diverse tasks like image classification, point cloud recognition, action recognition, and visual question answering. Incorporating a frozen LLM transformer block consistently improves performance across a wide range of visual tasks, including 2D and 3D recognition, temporal modeling, and multi-modal tasks. This improvement is observed across different types of LLMs (e.g., LLaMA, OPT) and various LLM transformer blocks. The authors propose the 'information filtering' hypothesis, suggesting that pre-trained LLM transformers can identify and amplify informative visual tokens, contributing to their effectiveness in visual encoding. The paper primarily focuses on exploring the potential of frozen LLM transformers for visual encoding rather than achieving state-of-the-art results on all tasks. The information filtering hypothesis, while insightful, requires further investigation to fully understand the mechanisms by which LLMs benefit visual encoding, such as quantifying layer-wise utility and analyzing training dynamics. large language models, computer vision, visual encoding, representation learning, information filtering
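A minimal sketch of the plug-in structure described above, assuming `llm_block` is a pretrained transformer block that maps a (batch, seq, llm_dim) tensor to a tensor of the same shape (real LLM layer implementations often return tuples and need a thin wrapper); only the two linear adapters are trainable.

```python
import torch.nn as nn

class FrozenLLMVisualLayer(nn.Module):
    def __init__(self, llm_block: nn.Module, vis_dim: int, llm_dim: int):
        super().__init__()
        self.proj_in = nn.Linear(vis_dim, llm_dim)    # trainable adapter into the LLM width
        self.block = llm_block
        for p in self.block.parameters():             # the LLM transformer block stays frozen
            p.requires_grad = False
        self.proj_out = nn.Linear(llm_dim, vis_dim)   # trainable adapter back to the visual width

    def forward(self, visual_tokens):                 # visual_tokens: (batch, seq, vis_dim)
        h = self.proj_in(visual_tokens)
        h = self.block(h)
        return self.proj_out(h)
```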
2310.12474 Report Enhancing High-Resolution 3D Generation through Pixel-wise Gradient Clipping Zijie Pan, Jiachen Lu, Xiatian Zhu, Li Zhang High-resolution 3D object generation remains a challenging task primarily due to the limited availability of comprehensive annotated training data. Recent advancements have aimed to overcome this constraint by harnessing image generative models, pretrained on extensive curated web datasets, using knowledge transfer techniques like Score Distillation Sampling (SDS). Efficiently addressing the requirements of high-resolution rendering often necessitates the adoption of latent representation-based models, such as the Latent Diffusion Model (LDM). In this framework, a significant challenge arises: To compute gradients for individual image pixels, it is necessary to backpropagate gradients from the designated latent space through the frozen components of the image model, such as the VAE encoder used within LDM. However, this gradient propagation pathway has never been optimized, remaining uncontrolled during training. We find that the unregulated gradients adversely affect the 3D model's capacity in acquiring texture-related information from the image generative model, leading to poor quality appearance synthesis. To address this overarching challenge, we propose an innovative operation termed Pixel-wise Gradient Clipping (PGC) designed for seamless integration into existing 3D generative models, thereby enhancing their synthesis quality. Specifically, we control the magnitude of stochastic gradients by clipping the pixel-wise gradients efficiently, while preserving crucial texture-related gradient directions. Despite this simplicity and minimal extra cost, extensive experiments demonstrate the efficacy of our PGC in enhancing the performance of existing 3D generative models for high-resolution object rendering. This paper introduces Pixel-wise Gradient Clipping (PGC), a simple yet effective technique to enhance texture quality in high-resolution 3D object generation using Latent Diffusion Models (LDMs). Existing 3D generation methods using LDMs suffer from uncontrolled pixel-wise gradients during backpropagation through the VAE encoder, leading to poor texture synthesis, especially at high resolutions. PGC regulates the magnitudes of pixel-wise gradients by clipping them against predefined thresholds while preserving crucial texture information encoded in the gradient direction. PGC consistently enhances texture details compared to baselines, particularly when using SDXL for guidance. Integrating PGC enables the successful utilization of SDXL, which otherwise fails in high-resolution 3D generation. PGC is shown to be beneficial across various LDM-based 3D generation pipelines, including text-to-3D and image-to-3D tasks. The risk of potential biases inherited from the pre-trained text-to-image models. The effectiveness of PGC with larger multi-view diffusion models needs further investigation. 3d object generation, texture synthesis, latent diffusion models, score distillation sampling, gradient clipping
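A small sketch of the clipping operation as described: when gradients flow from the latent-space loss back to the rendered image, each pixel's gradient vector is rescaled so its norm stays below a threshold while its direction is preserved. The threshold value and the hook placement are illustrative assumptions.

```python
import torch

def clip_pixel_gradients(grad, max_norm=0.1):
    # grad: (batch, channels, H, W); clip the per-pixel norm taken over channels.
    pixel_norm = grad.norm(dim=1, keepdim=True).clamp(min=1e-12)
    scale = (max_norm / pixel_norm).clamp(max=1.0)
    return grad * scale   # direction preserved, magnitude bounded

# Register on the rendered image before it is fed to the frozen VAE encoder / LDM loss.
rendered = torch.rand(1, 3, 512, 512, requires_grad=True)
rendered.register_hook(lambda g: clip_pixel_gradients(g, max_norm=0.1))
```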
2310.12395 Report Closed-Form Diffusion Models Christopher Scarvelis, Haitz Sáez de Ocáriz Borde, Justin Solomon Score-based generative models (SGMs) sample from a target distribution by iteratively transforming noise using the score function of the perturbed target. For any finite training set, this score function can be evaluated in closed form, but the resulting SGM memorizes its training data and does not generate novel samples. In practice, one approximates the score by training a neural network via score-matching. The error in this approximation promotes generalization, but neural SGMs are costly to train and sample, and the effective regularization this error provides is not well-understood theoretically. In this work, we instead explicitly smooth the closed-form score to obtain an SGM that generates novel samples without training. We analyze our model and propose an efficient nearest-neighbor-based estimator of its score function. Using this estimator, our method achieves sampling times competitive with neural SGMs while running on consumer-grade CPUs. Introduced Smoothed Closed-Form Diffusion Models (smoothed CFDMs), training-free diffusion models that generate novel samples from finite training sets by smoothing the score function of the perturbed data distribution. Addresses the limitations of neural SGMs, such as high training costs and unclear generalization mechanisms, by providing a training-free, efficient, and theoretically grounded approach to generative modeling. Explicitly smooths the closed-form score function, derived from a finite training set, to promote generalization. Utilizes a nearest-neighbor-based estimator of the smoothed score and a reduced number of sampling steps for efficiency. Smoothing the closed-form score function promotes generalization by enabling the generation of novel samples that are convex combinations of training points. The support of the model's samples converges towards barycenters of tuples of training points as the number of sampling steps increases. Achieves competitive sample quality and significantly faster sampling times compared to neural SGMs, even on consumer-grade CPUs, by employing a nearest-neighbor-based score estimator and reduced sampling steps. Generating high-quality images requires sampling in the latent space of a pretrained autoencoder, limiting direct application to pixel-level image generation. The theoretical analysis assumes specific noise distributions (Gumbel) for characterizing the distribution of one-step samples, leaving room for exploring alternative noise distributions. generative models, diffusion models, score-based models, training-free methods, nearest neighbor search
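For intuition, the score of the perturbed empirical distribution has a simple closed form: at noise level sigma it is a softmax-weighted pull toward the training points. The sketch below also restricts the sum to the k nearest neighbors for efficiency; the paper's additional smoothing of this score (which is what enables novel samples) is omitted.

```python
import torch

def closed_form_score(x, train_data, sigma, k=32):
    # x: (dim,), train_data: (n, dim), k <= n.
    sq_dists = ((train_data - x) ** 2).sum(dim=1)
    neg_sq_k, idx = torch.topk(-sq_dists, k)               # k nearest training points
    weights = torch.softmax(neg_sq_k / (2 * sigma ** 2), dim=0)
    weighted_mean = (weights[:, None] * train_data[idx]).sum(dim=0)
    return (weighted_mean - x) / sigma ** 2                # score of the Gaussian mixture
```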
2310.12190 Report DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Wangbo Yu, Hanyuan Liu, Xintao Wang, Tien-Tsin Wong, Ying Shan Animating a still image offers an engaging visual experience. Traditional image animation techniques mainly focus on animating natural scenes with stochastic dynamics (e.g. clouds and fluid) or domain-specific motions (e.g. human hair or body motions), and thus limits their applicability to more general visual content. To overcome this limitation, we explore the synthesis of dynamic content for open-domain images, converting them into animated videos. The key idea is to utilize the motion prior of text-to-video diffusion models by incorporating the image into the generative process as guidance. Given an image, we first project it into a text-aligned rich context representation space using a query transformer, which facilitates the video model to digest the image content in a compatible fashion. However, some visual details still struggle to be preserved in the resultant videos. To supplement with more precise image information, we further feed the full image to the diffusion model by concatenating it with the initial noises. Experimental results show that our proposed method can produce visually convincing and more logical & natural motions, as well as higher conformity to the input image. Comparative evaluation demonstrates the notable superiority of our approach over existing competitors. This paper presents DynamiCrafter, a novel method for animating open-domain images by leveraging the generative capabilities of pretrained text-to-video diffusion models. Existing image animation techniques often struggle to animate open-domain images due to their reliance on specific object categories or stochastic motion patterns. This new method overcomes this limitation by utilizing the rich dynamic priors present in text-to-video diffusion models. DynamiCrafter employs a dual-stream image injection paradigm. It projects the input image into a text-aligned context representation space using a query transformer to facilitate semantic understanding. Additionally, it directly feeds the image to the diffusion model alongside the initial noise to preserve visual details. DynamiCrafter generates temporally coherent and visually convincing animations that closely adhere to the input image content. It outperforms existing open-source methods in quantitative evaluations using FVD, KVD, and a newly introduced Perceptual Input Conformity (PIC) metric. Qualitative comparisons and user studies confirm its superiority over previous approaches and demonstrate comparable performance to state-of-the-art commercial products like PikaLabs and Gen-2. DynamiCrafter's performance is limited by the capabilities of the underlying text-to-video diffusion model, particularly in terms of resolution, duration, and potential flickering artifacts. Achieving fine-grained control over specific motions remains challenging, although the paper explores text-based motion control as a promising direction. image animation, video diffusion models, text-to-video generation, generative models, open-domain animation
2310.12149 Report Object-aware Inversion and Reassembly for Image Editing Zhen Yang, Ganggui Ding, Wen Wang, Hao Chen, Bohan Zhuang, Chunhua Shen By comparing the original and target prompts, we can obtain numerous editing pairs, each comprising an object and its corresponding editing target. To allow editability while maintaining fidelity to the input image, existing editing methods typically involve a fixed number of inversion steps that project the whole input image to its noisier latent representation, followed by a denoising process guided by the target prompt. However, we find that the optimal number of inversion steps for achieving ideal editing results varies significantly among different editing pairs, owing to varying editing difficulties. Therefore, the current literature, which relies on a fixed number of inversion steps, produces sub-optimal generation quality, especially when handling multiple editing pairs in a natural image. To this end, we propose a new image editing paradigm, dubbed Object-aware Inversion and Reassembly (OIR), to enable object-level fine-grained editing. Specifically, we design a new search metric, which determines the optimal inversion steps for each editing pair, by jointly considering the editability of the target and the fidelity of the non-editing region. We use our search metric to find the optimal inversion step for each editing pair when editing an image. We then edit these editing pairs separately to avoid concept mismatch. Subsequently, we propose an additional reassembly step to seamlessly integrate the respective editing results and the non-editing region to obtain the final edited image. To systematically evaluate the effectiveness of our method, we collect two datasets called OIRBench for benchmarking single- and multi-object editing, respectively. Experiments demonstrate that our method achieves superior performance in editing object shapes, colors, materials, categories, etc., especially in multi-object editing scenarios. This paper proposes Object-aware Inversion and Reassembly (OIR), a novel text-driven image editing method using diffusion models that addresses the limitation of existing methods which use a fixed inversion step for all editing pairs in an image. Existing diffusion-based image editing methods employ a fixed inversion step for all editing pairs, neglecting the varying editing difficulties of different objects, leading to sub-optimal generation quality and concept mismatch. OIR utilizes a search metric to determine the optimal inversion step for each editing pair based on editability and fidelity. It then disassembles the image, edits each pair separately using the optimal step, and reassembles them with the non-editing region, ensuring global consistency. OIR achieves state-of-the-art performance in multi-object editing scenarios, surpassing existing methods on CLIP score and demonstrating significant qualitative improvements. The proposed search metric effectively identifies the optimal inversion step for various editing pairs, confirmed through visualizations. Ablation studies highlight the importance of disassembly and reassembly steps in OIR for achieving high-quality editing and avoiding concept mismatch. OIR incurs additional inference time for the optimal inversion step search. The effectiveness of OIR needs further validation on other editing tasks like video editing. image editing, diffusion models, text-driven editing, object-aware editing, inversion
2310.12103 Report Quality Diversity through Human Feedback Li Ding, Jenny Zhang, Jeff Clune, Lee Spector, Joel Lehman Reinforcement Learning from Human Feedback (RLHF) has shown potential in qualitative tasks where clear objectives are lacking. However, its effectiveness is not fully realized when it is conceptualized merely as a tool to optimize average human preferences, especially in generative tasks that demand diverse model responses. Meanwhile, Quality Diversity (QD) algorithms excel at identifying diverse and high-quality solutions but often rely on manually crafted diversity metrics. This paper introduces Quality Diversity through Human Feedback (QDHF), a novel approach integrating human feedback into the QD framework. QDHF infers diversity metrics from human judgments of similarity among solutions, thereby enhancing the applicability and effectiveness of QD algorithms. Our empirical studies show that QDHF significantly outperforms state-of-the-art methods in automatic diversity discovery and matches the efficacy of using manually crafted metrics for QD on standard benchmarks in robotics and reinforcement learning. Notably, in a latent space illumination task, QDHF substantially enhances the diversity in images generated by a diffusion model and was more favorably received in user studies. We conclude by analyzing QDHF's scalability and the quality of its derived diversity metrics, emphasizing its potential to improve exploration and diversity in complex, open-ended optimization tasks. Source code is available on GitHub: https://github.com/ld-ing/qdhf. This paper introduces Quality Diversity through Human Feedback (QDHF), a novel approach that integrates human feedback into the Quality Diversity (QD) framework to learn diversity metrics for enhanced optimization in complex tasks. Many generative tasks require diverse model responses, and existing QD algorithms often rely on manually crafted diversity metrics which limit their applicability. QDHF addresses this by learning diversity metrics directly from human feedback, improving exploration and diversity in complex optimization. QDHF uses latent space projection to characterize diversity and contrastive learning to align the learned diversity representation with human judgment on the similarity of solutions. A progressive training strategy is proposed to refine the diversity metrics throughout the optimization process. QDHF significantly outperforms unsupervised diversity discovery methods and matches the performance of QD with ground truth metrics in robotics and reinforcement learning benchmarks. QDHF, applied to a latent space illumination task for image generation, produces more diverse and high-quality images compared to baseline methods, as evidenced by quantitative metrics and user studies. Analysis shows a strong correlation between QDHF's performance, the sample size of human feedback, and the accuracy of the learned diversity metrics in reflecting human judgment. The preference model used in QDHF might not generalize well to unseen domains, requiring more diverse and strategically collected human feedback. Future work will focus on applying QDHF to more complex and open-ended tasks in robotics, reinforcement learning, and generative modeling. quality diversity, human feedback, contrastive learning, diversity metrics, generative ai
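A compact sketch of the metric-learning ingredient described above, assuming human feedback arrives as triplets (anchor, judged-more-similar, judged-less-similar): a latent projection is trained so that distances in the projected space agree with those judgments, and the projected coordinates then serve as the QD diversity descriptors. The linear projection and the margin value are placeholders.

```python
import torch.nn as nn
import torch.nn.functional as F

class DiversityMetric(nn.Module):
    def __init__(self, in_dim, latent_dim=2):
        super().__init__()
        self.proj = nn.Linear(in_dim, latent_dim)   # latent descriptors used by the QD archive

    def forward(self, x):
        return self.proj(x)

def triplet_loss(metric, anchor, positive, negative, margin=1.0):
    za, zp, zn = metric(anchor), metric(positive), metric(negative)
    # Pull the judged-similar pair together, push the judged-dissimilar pair apart.
    return F.relu((za - zp).norm(dim=-1) - (za - zn).norm(dim=-1) + margin).mean()
```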
2310.11868 Report To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now Yimeng Zhang, Jinghan Jia, Xin Chen, Aochuan Chen, Yihua Zhang, Jiancheng Liu, Ke Ding, Sijia Liu The recent advances in diffusion models (DMs) have revolutionized the generation of realistic and complex images. However, these models also introduce potential safety hazards, such as producing harmful content and infringing data copyrights. Despite the development of safety-driven unlearning techniques to counteract these challenges, doubts about their efficacy persist. To tackle this issue, we introduce an evaluation framework that leverages adversarial prompts to discern the trustworthiness of these safety-driven DMs after they have undergone the process of unlearning harmful concepts. Specifically, we investigated the adversarial robustness of DMs, assessed by adversarial prompts, when eliminating unwanted concepts, styles, and objects. We develop an effective and efficient adversarial prompt generation approach for DMs, termed UnlearnDiffAtk. This method capitalizes on the intrinsic classification abilities of DMs to simplify the creation of adversarial prompts, thereby eliminating the need for auxiliary classification or diffusion models.Through extensive benchmarking, we evaluate the robustness of five widely-used safety-driven unlearned DMs (i.e., DMs after unlearning undesirable concepts, styles, or objects) across a variety of tasks. Our results demonstrate the effectiveness and efficiency merits of UnlearnDiffAtk over the state-of-the-art adversarial prompt generation method and reveal the lack of robustness of current safety-driven unlearning techniques when applied to DMs. Codes are available at https://github.com/OPTML-Group/Diffusion-MU-Attack. WARNING: This paper contains model outputs that may be offensive in nature. This paper introduces a novel adversarial attack method, Diffusion-MU-Attack (DMA), to evaluate the robustness of safety-driven diffusion models (DMs) after unlearning harmful concepts, styles, and objects. Evaluating the robustness of safety-driven unlearned DMs is crucial to ensure their trustworthiness and prevent the generation of harmful content, despite efforts to remove such influence. DMA leverages the intrinsic classification abilities of DMs to efficiently generate adversarial prompts without the need for auxiliary diffusion or classification models. The method optimizes the adversarial prompt using a simplified diffusion classifier-guided approach. DMA effectively bypasses various safety-driven unlearned DMs, leading to the generation of undesirable content across concept, style, and object unlearning tasks. DMA outperforms the concurrent attack method P4D in terms of effectiveness and efficiency, especially in style and object unlearning. Current safety-driven unlearning techniques exhibit varying degrees of vulnerability to adversarial prompts, highlighting the need for more robust unlearning methods. The evaluation primarily focuses on a limited selection of unlearned DMs. Future work could explore the development of more robust unlearning techniques that can withstand adversarial attacks like DMA. Investigating the attack transferability across different diffusion model architectures and training datasets is another potential direction. diffusion models, adversarial attacks, machine unlearning, image generation, ai safety
2310.11513 Report GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment Dhruba Ghosh, Hanna Hajishirzi, Ludwig Schmidt Recent breakthroughs in diffusion models, multimodal pretraining, and efficient finetuning have led to an explosion of text-to-image generative models. Given human evaluation is expensive and difficult to scale, automated methods are critical for evaluating the increasingly large number of new models. However, most current automated evaluation metrics like FID or CLIPScore only offer a holistic measure of image quality or image-text alignment, and are unsuited for fine-grained or instance-level analysis. In this paper, we introduce GenEval, an object-focused framework to evaluate compositional image properties such as object co-occurrence, position, count, and color. We show that current object detection models can be leveraged to evaluate text-to-image models on a variety of generation tasks with strong human agreement, and that other discriminative vision models can be linked to this pipeline to further verify properties like object color. We then evaluate several open-source text-to-image models and analyze their relative generative capabilities on our benchmark. We find that recent models demonstrate significant improvement on these tasks, though they are still lacking in complex capabilities such as spatial relations and attribute binding. Finally, we demonstrate how GenEval might be used to help discover existing failure modes, in order to inform development of the next generation of text-to-image models. Our code to run the GenEval framework is publicly available at https://github.com/djghosh13/geneval. Introduces Geneval, an automated object-focused framework for evaluating compositional capabilities of text-to-image models. Automated methods are needed for evaluating the increasing number of text-to-image models, but existing metrics lack fine-grained analysis. Leverages object detection models to verify object presence, count, and position, and uses additional vision models (e.g., color classifiers) for attribute verification. Geneval achieves 83% agreement with human judgment on image correctness. Recent models like IF-XL show improvement but still struggle with spatial relations and attribute binding. Geneval's fine-grained output helps identify specific failure modes in text-to-image generation. Performance limited by the capabilities of current object detection models. Evaluation scope depends on the availability of relevant discriminative vision models. text-to-image generation, evaluation metrics, object detection, compositional reasoning, attribute binding
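A toy sketch of the object-focused check: detections from an off-the-shelf detector are compared against a structured prompt specification for presence and count (real GenEval also verifies position and color with extra discriminative models). The detection format and score threshold below are assumptions.

```python
from collections import Counter

def check_object_counts(detections, required_counts, score_threshold=0.3):
    # detections: list of (label, score, box); required_counts: {label: min_count}.
    counts = Counter(label for label, score, box in detections if score >= score_threshold)
    return all(counts[name] >= n for name, n in required_counts.items())

# Example for a prompt like "a photo of two cats and a dog".
dets = [("cat", 0.9, (0, 0, 50, 50)), ("cat", 0.8, (60, 0, 110, 50)), ("dog", 0.7, (0, 60, 50, 110))]
print(check_object_counts(dets, {"cat": 2, "dog": 1}))   # True
```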
2310.11454 Report VeRA: Vector-based Random Matrix Adaptation Dawid J. Kopiczko, Tijmen Blankevoort, Yuki M. Asano Low-rank adapation (LoRA) is a popular method that reduces the number of trainable parameters when finetuning large language models, but still faces acute storage challenges when scaling to even larger models or deploying numerous per-user or per-task adapted models. In this work, we present Vector-based Random Matrix Adaptation (VeRA), which significantly reduces the number of trainable parameters compared to LoRA, yet maintains the same performance. It achieves this by using a single pair of low-rank matrices shared across all layers and learning small scaling vectors instead. We demonstrate its effectiveness on the GLUE and E2E benchmarks, image classification tasks, and show its application in instruction-tuning of 7B and 13B language models. This paper introduces VeRA (Vector-based Random Matrix Adaptation), a new parameter-efficient finetuning method that uses significantly fewer parameters than LoRA while maintaining comparable performance. Efficient adaptation methods are crucial for scaling large language models to personalized applications and edge devices due to limited memory constraints. VeRA adapts a single pair of frozen, randomly initialized matrices shared across layers using trainable scaling vectors, unlike LoRA which trains separate low-rank matrices per layer. VeRA achieves comparable performance to LoRA on GLUE and E2E benchmarks with an order of magnitude fewer parameters. It successfully performs instruction-tuning of 7B and 13B language models with a 100x reduction in trainable parameters compared to LoRA. VeRA demonstrates comparable or better performance to LoRA on image classification tasks with Vision Transformers, using over 10x fewer parameters. The current study primarily focuses on Transformer architectures, leaving its applicability to other architectures and domains for future research. Further performance improvements could be explored by incorporating dynamic parameter allocation, initialization techniques, or regularization. parameter-efficient finetuning, large language models, lora, random projections, instruction tuning
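A minimal sketch of the adaptation scheme described above: a single pair of frozen random low-rank matrices (A, B) is shared by every adapted layer, and only two small scaling vectors per layer are trained. The class name and initialization values are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class VeRALinear(nn.Module):
    def __init__(self, base: nn.Linear, shared_A: torch.Tensor, shared_B: torch.Tensor):
        super().__init__()
        self.base = base                                   # pretrained weights, frozen
        for p in self.base.parameters():
            p.requires_grad = False
        self.register_buffer("A", shared_A)                # (r, in_features), frozen, shared
        self.register_buffer("B", shared_B)                # (out_features, r), frozen, shared
        r = shared_A.shape[0]
        self.d = nn.Parameter(torch.full((r,), 0.1))       # trainable scaling over rank dims
        self.b = nn.Parameter(torch.zeros(base.out_features))  # trainable scaling over outputs

    def forward(self, x):
        update = ((x @ self.A.t()) * self.d) @ self.B.t() * self.b
        return self.base(x) + update

# One shared (A, B) pair serves all adapted layers of the model.
r, d_model = 4, 768
A, B = torch.randn(r, d_model), torch.randn(d_model, r)
layer = VeRALinear(nn.Linear(d_model, d_model), A, B)
y = layer(torch.randn(2, d_model))
```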
2310.11448 Report 4K4D: Real-Time 4D View Synthesis at 4K Resolution Zhen Xu, Sida Peng, Haotong Lin, Guangzhao He, Jiaming Sun, Yujun Shen, Hujun Bao, Xiaowei Zhou This paper targets high-fidelity and real-time view synthesis of dynamic 3D scenes at 4K resolution. Recently, some methods on dynamic view synthesis have shown impressive rendering quality. However, their speed is still limited when rendering high-resolution images. To overcome this problem, we propose 4K4D, a 4D point cloud representation that supports hardware rasterization and enables unprecedented rendering speed. Our representation is built on a 4D feature grid so that the points are naturally regularized and can be robustly optimized. In addition, we design a novel hybrid appearance model that significantly boosts the rendering quality while preserving efficiency. Moreover, we develop a differentiable depth peeling algorithm to effectively learn the proposed model from RGB videos. Experiments show that our representation can be rendered at over 400 FPS on the DNA-Rendering dataset at 1080p resolution and 80 FPS on the ENeRF-Outdoor dataset at 4K resolution using an RTX 4090 GPU, which is 30x faster than previous methods and achieves the state-of-the-art rendering quality. Our project page is available at https://zju3dv.github.io/4k4d/. This paper presents 4K4D, a novel 4D point cloud representation designed for real-time, high-fidelity view synthesis of dynamic 3D scenes at 4K resolution. Existing dynamic view synthesis methods, though achieving impressive rendering quality, are limited in rendering speed, particularly at high resolutions, hindering their application in VR/AR and other fields. 4K4D leverages a 4D feature grid for point regularization and robust optimization. It introduces a hybrid appearance model combining a pre-computable image blending model for efficiency and a continuous spherical harmonics model for view-dependent effects. A differentiable depth peeling algorithm renders the representation, enabling hardware rasterization for speed enhancement. 4K4D achieves state-of-the-art rendering quality, outperforming competitors on benchmark datasets like DNA-Rendering and ENeRF-Outdoor. The method achieves unprecedented rendering speed, reaching over 200 FPS at 1080p and 80 FPS at 4K on an RTX 4090 GPU. 4K4D effectively compresses scene information, achieving a low storage cost of approximately 2 MB per frame, including source videos. 4K4D currently lacks the ability to establish point correspondences across frames, potentially limiting its applicability in tasks requiring temporal coherence. The storage cost scales linearly with the number of frames, presenting challenges for modeling extensive volumetric video sequences. dynamic view synthesis, neural rendering, point cloud representation, real-time rendering, 4k resolution
2310.11440 Report EvalCrafter: Benchmarking and Evaluating Large Video Generation Models Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, Ying Shan The vision and language generative models have been overgrown in recent years. For video generation, various open-sourced models and public-available services have been developed to generate high-quality videos. However, these methods often use a few metrics, e.g., FVD or IS, to evaluate the performance. We argue that it is hard to judge the large conditional generative models from the simple metrics since these models are often trained on very large datasets with multi-aspect abilities. Thus, we propose a novel framework and pipeline for exhaustively evaluating the performance of the generated videos. Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation, which is based on an analysis of real-world user data and generated with the assistance of a large language model. Then, we evaluate the state-of-the-art video generative models on our carefully designed benchmark, in terms of visual qualities, content qualities, motion qualities, and text-video alignment with 17 well-selected objective metrics. To obtain the final leaderboard of the models, we further fit a series of coefficients to align the objective metrics to the users' opinions. Based on the proposed human alignment method, our final score shows a higher correlation than simply averaging the metrics, showing the effectiveness of the proposed evaluation method. This paper proposes EvalCrafter, a comprehensive framework and benchmark for evaluating text-to-video (T2V) generation models. Existing metrics for evaluating T2V models are limited and often fail to capture important aspects such as motion quality, temporal consistency, and text-video alignment. The authors first construct a benchmark of 700 diverse prompts with detailed annotations. They then evaluate various T2V models on this benchmark using 17 objective metrics across four aspects: visual quality, text-video alignment, motion quality, and temporal consistency. Finally, they align objective metrics with user opinions obtained through user studies. Significant variations in model rankings across different evaluation aspects highlight the need for a multi-aspect evaluation approach. Users prioritize visual appeal and temporal consistency over strict text-video alignment. Current T2V models struggle with camera motion control, complex scenes, instruction following, and entity details. The current benchmark only contains 700 prompts, which might not be enough to represent the complexity of real-world scenarios. Evaluating motion quality in a general sense remains challenging. text-to-video generation, benchmarking, evaluation metrics, human alignment, large generative models
2310.10769 Report LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation Ruiqi Wu, Liangyu Chen, Tong Yang, Chunle Guo, Chongyi Li, Xiangyu Zhang With the impressive progress in diffusion-based text-to-image generation, extending such powerful generative ability to text-to-video raises enormous attention. Existing methods either require large-scale text-video pairs and a large number of training resources or learn motions that are precisely aligned with template videos. It is non-trivial to balance a trade-off between the degree of generation freedom and the resource costs for video generation. In our study, we present a few-shot-based tuning framework, LAMP, which enables text-to-image diffusion model Learn A specific Motion Pattern with 8~16 videos on a single GPU. Specifically, we design a first-frame-conditioned pipeline that uses an off-the-shelf text-to-image model for content generation so that our tuned video diffusion model mainly focuses on motion learning. The well-developed text-to-image techniques can provide visually pleasing and diverse content as generation conditions, which highly improves video quality and generation freedom. To capture the features of temporal dimension, we expand the pretrained 2D convolution layers of the T2I model to our novel temporal-spatial motion learning layers and modify the attention blocks to the temporal level. Additionally, we develop an effective inference trick, shared-noise sampling, which can improve the stability of videos with computational costs. Our method can also be flexibly applied to other tasks, e.g. real-world image animation and video editing. Extensive experiments demonstrate that LAMP can effectively learn the motion pattern on limited data and generate high-quality videos. The code and models are available at https://rq-wu.github.io/projects/LAMP. Presents LAMP, a few-shot-based tuning framework enabling text-to-image diffusion models to learn motion patterns from a small set of videos (8-16) on a single GPU. Addresses the limitations of existing text-to-video generation methods that either require large datasets and resources or lack generation freedom by relying on template videos. Introduces a first-frame-conditioned pipeline decoupling content and motion, temporal-spatial motion learning layers to capture temporal information, and a shared-noise sampling strategy for improved consistency. Achieves state-of-the-art performance on textural alignment, frame consistency, and generation diversity compared to existing methods. Demonstrates good generalization ability, generating high-quality videos with learned motion patterns applied to various scenes and styles. Successfully applies the framework to real image animation and video editing tasks. Learning complex motions can lead to an increased occurrence of failure cases. Motion of foreground objects can sometimes affect background stability. text-to-video generation, diffusion models, few-shot learning, motion pattern learning, video editing
2310.10651 Report HairCLIPv2: Unifying Hair Editing via Proxy Feature Blending Tianyi Wei, Dongdong Chen, Wenbo Zhou, Jing Liao, Weiming Zhang, Gang Hua, Nenghai Yu Hair editing has made tremendous progress in recent years. Early hair editing methods use well-drawn sketches or masks to specify the editing conditions. Even though they can enable very fine-grained local control, such interaction modes are inefficient for the editing conditions that can be easily specified by language descriptions or reference images. Thanks to the recent breakthrough of cross-modal models (e.g., CLIP), HairCLIP is the first work that enables hair editing based on text descriptions or reference images. However, such text-driven and reference-driven interaction modes make HairCLIP unable to support fine-grained controls specified by sketch or mask. In this paper, we propose HairCLIPv2, aiming to support all the aforementioned interactions with one unified framework. Simultaneously, it improves upon HairCLIP with better irrelevant attributes (e.g., identity, background) preservation and unseen text descriptions support. The key idea is to convert all the hair editing tasks into hair transfer tasks, with editing conditions converted into different proxies accordingly. The editing effects are added upon the input image by blending the corresponding proxy features within the hairstyle or hair color feature spaces. Besides the unprecedented user interaction mode support, quantitative and qualitative experiments demonstrate the superiority of HairCLIPv2 in terms of editing effects, irrelevant attribute preservation and visual naturalness. Our code is available at \url{https://github.com/wty-ustc/HairCLIPv2}. HairCLIPv2 is proposed as a unified system for hair editing that supports diverse user interactions, including text descriptions, reference images, sketches, masks, and their combinations. Previous hair editing methods were limited in the types of user input they supported, lacking the ability to handle both simple language/image-based instructions and fine-grained local controls within a single framework. The key idea is to convert all hair editing tasks into hair transfer by generating different proxy images based on user inputs. This is achieved by blending proxy features within the hairstyle or hair color feature space of StyleGAN. HairCLIPv2 demonstrates superior performance compared to existing text-driven methods, especially for out-of-domain text descriptions. It achieves comparable hair transfer results to state-of-the-art methods while offering broader interaction support. The proposed method excels in sketch-based local hair editing, outperforming existing approaches in terms of editing quality and preservation of non-edited regions. The current system focuses solely on image editing and does not extend to facial hair or video editing. Generating proxies through optimization limits real-time editing capabilities, presenting an area for future research. hair editing, hair transfer, stylegan, clip, multimodal editing
2310.10649 Report A Computational Framework for Solving Wasserstein Lagrangian Flows Kirill Neklyudov, Rob Brekelmans, Alexander Tong, Lazar Atanackovic, Qiang Liu, Alireza Makhzani The dynamical formulation of the optimal transport can be extended through various choices of the underlying geometry ($\textit{kinetic energy}$), and the regularization of density paths ($\textit{potential energy}$). These combinations yield different variational problems ($\textit{Lagrangians}$), encompassing many variations of the optimal transport problem such as the Schr\"odinger bridge, unbalanced optimal transport, and optimal transport with physical constraints, among others. In general, the optimal density path is unknown, and solving these variational problems can be computationally challenging. Leveraging the dual formulation of the Lagrangians, we propose a novel deep learning based framework approaching all of these problems from a unified perspective. Our method does not require simulating or backpropagating through the trajectories of the learned dynamics, and does not need access to optimal couplings. We showcase the versatility of the proposed framework by outperforming previous approaches for the single-cell trajectory inference, where incorporating prior knowledge into the dynamics is crucial for correct predictions. The paper introduces Wasserstein Lagrangian Flows, a deep learning framework for inferring dynamics and solving marginal interpolation problems using Lagrangian action functionals on manifolds of probability measures. This framework unifies various optimal transport problems, including Schrödinger Bridge, unbalanced optimal transport, and optimal transport with physical constraints, allowing for flexible incorporation of prior information in trajectory inference. The methodology leverages the dual formulation of Lagrangians, parameterizes cotangent vectors and distributional paths with neural networks, and optimizes a dual objective that is linear in the density. The framework outperforms previous approaches for single-cell trajectory inference. Incorporating mass teleportation into the dynamical formulation improves performance. Including a physical potential significantly enhances performance, especially when combined with the Wasserstein Fisher-Rao metric. The paper focuses on Lagrangians with linearizable dual objectives. Future work includes exploring various Lagrangian costs and extending the framework to other domains. optimal transport, lagrangian mechanics, deep learning, trajectory inference, single-cell rna sequencing
2310.10644 Report TOSS:High-quality Text-guided Novel View Synthesis from a Single Image Yukai Shi, Jianan Wang, He Cao, Boshi Tang, Xianbiao Qi, Tianyu Yang, Yukun Huang, Shilong Liu, Lei Zhang, Heung-Yeung Shum In this paper, we present TOSS, which introduces text to the task of novel view synthesis (NVS) from just a single RGB image. While Zero-1-to-3 has demonstrated impressive zero-shot open-set NVS capability, it treats NVS as a pure image-to-image translation problem. This approach suffers from the challengingly under-constrained nature of single-view NVS: the process lacks means of explicit user control and often results in implausible NVS generations. To address this limitation, TOSS uses text as high-level semantic information to constrain the NVS solution space. TOSS fine-tunes text-to-image Stable Diffusion pre-trained on large-scale text-image pairs and introduces modules specifically tailored to image and camera pose conditioning, as well as dedicated training for pose correctness and preservation of fine details. Comprehensive experiments are conducted with results showing that our proposed TOSS outperforms Zero-1-to-3 with more plausible, controllable and multiview-consistent NVS results. We further support these results with comprehensive ablations that underscore the effectiveness and potential of the introduced semantic guidance and architecture design. TOSS, a zero-shot open-set novel view synthesis (NVS) model that leverages textual descriptions to generate more plausible and controllable novel views from a single RGB image. Existing single-view NVS methods often produce implausible results due to the highly unconstrained nature of the problem. TOSS addresses this by incorporating textual guidance to constrain the solution space and provide explicit user control. TOSS adapts a text-to-image Stable Diffusion model by introducing: 1) a dense cross-attention module to condition on input image features, 2) a mechanism for incorporating camera pose information, 3) a training strategy with expert denoisers for pose accuracy and detail preservation. TOSS quantitatively outperforms baseline methods on NVS, showing higher PSNR, SSIM, and lower LPIPS and KID values on GSO and RTMV datasets. TOSS demonstrates superior qualitative results with more plausible, controllable, and multiview-consistent novel view generations. TOSS improves 3D reconstruction quality with finer details and better mesh quality compared to baselines. Current captioning models may not provide sufficiently detailed descriptions for optimal NVS. Training on synthetic datasets can lead to distribution shift; utilizing real images and videos could alleviate this. novel view synthesis, text-guided generation, diffusion models, single image 3d reconstruction, semantic guidance
2310.10642 Report Real-time Photorealistic Dynamic Scene Representation and Rendering with 4D Gaussian Splatting Zeyu Yang, Hongye Yang, Zijie Pan, Li Zhang Reconstructing dynamic 3D scenes from 2D images and generating diverse views over time is challenging due to scene complexity and temporal dynamics. Despite advancements in neural implicit models, limitations persist: (i) Inadequate Scene Structure: Existing methods struggle to reveal the spatial and temporal structure of dynamic scenes from directly learning the complex 6D plenoptic function. (ii) Scaling Deformation Modeling: Explicitly modeling scene element deformation becomes impractical for complex dynamics. To address these issues, we consider the spacetime as an entirety and propose to approximate the underlying spatio-temporal 4D volume of a dynamic scene by optimizing a collection of 4D primitives, with explicit geometry and appearance modeling. Learning to optimize the 4D primitives enables us to synthesize novel views at any desired time with our tailored rendering routine. Our model is conceptually simple, consisting of a 4D Gaussian parameterized by anisotropic ellipses that can rotate arbitrarily in space and time, as well as view-dependent and time-evolved appearance represented by the coefficient of 4D spherindrical harmonics. This approach offers simplicity, flexibility for variable-length video and end-to-end training, and efficient real-time rendering, making it suitable for capturing complex dynamic scene motions. Experiments across various benchmarks, including monocular and multi-view scenarios, demonstrate our 4DGS model's superior visual quality and efficiency. This paper presents 4D Gaussian Splatting (4DGS), a novel approach to represent and render dynamic scenes using 4D Gaussian primitives that coherently integrate space and time dimensions. Reconstructing dynamic 3D scenes from 2D images and generating diverse views over time is challenging, and existing methods struggle with inadequate scene structure representation and scaling deformation modeling. The method leverages 4D Gaussian distributions parameterized by anisotropic ellipses capable of rotation in space-time to model scene elements. Additionally, it introduces 4D Spherindrical Harmonics to represent the time evolution of view-dependent color. 4DGS outperforms state-of-the-art methods in terms of visual quality and efficiency on benchmarks like the Plenoptic Video and D-NeRF datasets. The method effectively captures underlying 3D motion without explicit supervision or regularization. 4DGS demonstrates capability in handling complex real-world dynamic scenes, including those with volumetric effects, non-Lambertian surfaces, and varying lighting. The method might struggle to capture distant background areas when initialized with point clouds from a limited time range. Reliance on initial point clouds might introduce limitations in scenarios where such information is unavailable. novel view synthesis, dynamic scene reconstruction, 4d gaussian splatting, 4d spherindrical harmonics, real-time rendering
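One concrete piece of the representation above is easy to write down: conditioning a 4D (x, y, z, t) Gaussian on a query time t yields, via the standard Gaussian conditioning formulas, a 3D Gaussian to splat plus a temporal weight. The sketch below shows only that step; the time-evolved spherindrical-harmonics appearance and the rasterizer are not shown.

```python
import torch

def condition_on_time(mean4, cov4, t):
    # mean4: (4,), cov4: (4, 4); the last coordinate is time.
    mu_s, mu_t = mean4[:3], mean4[3]
    S_ss, S_st, S_tt = cov4[:3, :3], cov4[:3, 3], cov4[3, 3]
    cond_mean = mu_s + S_st * (t - mu_t) / S_tt                  # E[xyz | t]
    cond_cov = S_ss - torch.outer(S_st, S_st) / S_tt             # Cov[xyz | t]
    temporal_weight = torch.exp(-0.5 * (t - mu_t) ** 2 / S_tt)   # unnormalized marginal in t
    return cond_mean, cond_cov, temporal_weight
```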
2310.10640 Report LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts Hanan Gani, Shariq Farooq Bhat, Muzammal Naseer, Salman Khan, Peter Wonka Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts describing complex scenes with multiple objects. While excelling in generating images from short, single-object descriptions, these models often struggle to faithfully capture all the nuanced details within longer and more elaborate textual inputs. In response, we present a novel approach leveraging Large Language Models (LLMs) to extract critical components from text prompts, including bounding box coordinates for foreground objects, detailed textual descriptions for individual objects, and a succinct background context. These components form the foundation of our layout-to-image generation model, which operates in two phases. The initial Global Scene Generation utilizes object layouts and background context to create an initial scene but often falls short in faithfully representing object characteristics as specified in the prompts. To address this limitation, we introduce an Iterative Refinement Scheme that iteratively evaluates and refines box-level content to align them with their textual descriptions, recomposing objects as needed to ensure consistency. Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models. This is further validated by a user study, underscoring the efficacy of our approach in generating coherent and detailed scenes from intricate textual inputs. This paper proposes a novel framework, called LLM Blueprint, to address the challenge of generating images from lengthy and detailed text prompts, which often pose difficulties for existing text-to-image models. Current text-to-image models, while proficient in handling short prompts, often struggle to faithfully capture all details in longer descriptions, leading to omissions and misrepresentations. The framework uses LLMs to extract a 'Scene Blueprint' from the prompt, including object layouts, descriptions, and background context. It then employs a two-phase image generation process: 1) Global Scene Generation: Creating an initial image based on the layout and background. 2) Iterative Refinement Scheme: Refining the content of each bounding box to align with its textual description, using a multi-modal guidance procedure. The proposed method achieves a significantly higher Prompt Adherence Recall (PAR) score (85%) compared to baselines like Stable Diffusion (49%), GLIGEN (57%), and LayoutGPT (69%). Qualitative analysis demonstrates the superiority of the approach in capturing all objects and their details, including accurate spatial positioning, compared to existing methods. A user study confirms that the proposed method consistently produces more coherent images that better adhere to lengthy textual descriptions than baseline approaches. The current method uses a fixed box layout during refinement, exploring dynamic box adjustments could be beneficial. The approach handles overlapping boxes by size-based sorting; investigating more advanced techniques is a potential future direction. text-to-image synthesis, large language models, diffusion models, layout-to-image generation, iterative refinement
2310.10563 Report RefConv: Re-parameterized Refocusing Convolution for Powerful ConvNets Zhicheng Cai, Xiaohan Ding, Qiu Shen, Xun Cao We propose Re-parameterized Refocusing Convolution (RefConv) as a replacement for regular convolutional layers, which is a plug-and-play module to improve the performance without any inference costs. Specifically, given a pre-trained model, RefConv applies a trainable Refocusing Transformation to the basis kernels inherited from the pre-trained model to establish connections among the parameters. For example, a depth-wise RefConv can relate the parameters of a specific channel of convolution kernel to the parameters of the other kernel, i.e., make them refocus on the other parts of the model they have never attended to, rather than focus on the input features only. From another perspective, RefConv augments the priors of existing model structures by utilizing the representations encoded in the pre-trained parameters as the priors and refocusing on them to learn novel representations, thus further enhancing the representational capacity of the pre-trained model. Experimental results validated that RefConv can improve multiple CNN-based models by a clear margin on image classification (up to 1.47% higher top-1 accuracy on ImageNet), object detection and semantic segmentation without introducing any extra inference costs or altering the original model structure. Further studies demonstrated that RefConv can reduce the redundancy of channels and smooth the loss landscape, which explains its effectiveness. This paper proposes Re-parameterized Refocusing Convolution (RefConv), a plug-and-play module that enhances the performance of convolutional layers in convolutional neural networks (CNNs) without increasing inference costs. This approach aims to improve CNN performance by augmenting the priors of existing model structures, allowing kernels to learn more diverse representations. RefConv replaces standard convolutional layers with a two-step process: 1) it uses a pre-trained convolutional kernel as basis weights, and 2) it applies a trainable Refocusing Transformation to these basis weights, creating transformed weights that are used for inference. RefConv consistently improves the performance of various CNN architectures on ImageNet, object detection, and semantic segmentation tasks. The method is shown to reduce channel redundancy, leading to more diverse and richer representations. RefConv results in a smoother loss landscape, implying better generalization abilities. The current design of Refocusing Transformation is relatively simple, relying solely on convolution. Future work could explore more advanced operations and non-linearity in the Refocusing Transformation. convolutional neural networks, re-parameterization, refocusing convolution, channel redundancy, loss landscape
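A rough sketch of the plug-in structure under one reading of the description above: the pretrained kernel is kept frozen as the "basis weights", a small trainable transformation is applied to that kernel tensor itself, and the transformed kernel is what convolves the input; after training, the transformed kernel can be computed once, so inference cost is unchanged. The residual form and the choice of a convolution as the Refocusing Transformation are illustrative assumptions (odd kernel size assumed).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefConv2d(nn.Module):
    def __init__(self, pretrained_weight: torch.Tensor, stride=1, padding=1):
        super().__init__()
        # pretrained_weight: (out_ch, in_ch, k, k), frozen basis kernel from the pretrained model.
        self.register_buffer("basis", pretrained_weight.clone())
        out_ch, in_ch, k, _ = pretrained_weight.shape
        # Trainable refocusing transformation applied to the kernel tensor itself,
        # relating each kernel's parameters to the other kernels' parameters.
        self.refocus = nn.Conv2d(in_ch, in_ch, kernel_size=k, padding=k // 2, bias=False)
        nn.init.zeros_(self.refocus.weight)   # start from the pretrained kernel exactly
        self.stride, self.padding = stride, padding

    def forward(self, x):
        w = self.basis + self.refocus(self.basis)   # transformed (refocused) kernel
        return F.conv2d(x, w, stride=self.stride, padding=self.padding)
```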
2310.10533 Report Label-efficient Segmentation via Affinity Propagation Wentong Li, Yuqian Yuan, Song Wang, Wenyu Liu, Dongqi Tang, Jian Liu, Jianke Zhu, Lei Zhang Weakly-supervised segmentation with label-efficient sparse annotations has attracted increasing research attention to reduce the cost of laborious pixel-wise labeling process, while the pairwise affinity modeling techniques play an essential role in this task. Most of the existing approaches focus on using the local appearance kernel to model the neighboring pairwise potentials. However, such a local operation fails to capture the long-range dependencies and ignores the topology of objects. In this work, we formulate the affinity modeling as an affinity propagation process, and propose a local and a global pairwise affinity terms to generate accurate soft pseudo labels. An efficient algorithm is also developed to reduce significantly the computational cost. The proposed approach can be conveniently plugged into existing segmentation networks. Experiments on three typical label-efficient segmentation tasks, i.e. box-supervised instance segmentation, point/scribble-supervised semantic segmentation and CLIP-guided semantic segmentation, demonstrate the superior performance of the proposed approach. This paper proposes Affinity Propagation (APro), a novel component for weakly-supervised segmentation that formulates the task as an affinity propagation process. Label-efficient segmentation with sparse annotations is crucial for reducing annotation costs. Existing methods using local appearance kernels for affinity modeling have limitations in capturing long-range dependencies and object topology. APro models pairwise affinity both globally and locally. It utilizes a topology-aware tree-based graph for global affinity propagation and a Gaussian kernel-based approach for local affinity propagation. An efficient algorithm reduces the computational complexity from O(N^2) to O(NlogN). APro outperforms counterparts in box-supervised instance segmentation on Pascal VOC and COCO datasets, achieving significant AP gains. It achieves state-of-the-art results in point/scribble-supervised semantic segmentation on Pascal VOC2012, surpassing previous best methods in mIoU. In CLIP-guided annotation-free semantic segmentation, APro consistently improves performance on Pascal VOC2012, Pascal Context, and COCO-Stuff datasets across various CLIP models. The method relies on image intensity and color similarities, potentially facing challenges in scenarios like motion blur and occlusions. Future work will explore integrating APro with large-scale foundation models like SAM for enhanced feature representation and performance. weakly-supervised segmentation, affinity propagation, label-efficient learning, pairwise affinity modeling, instance segmentation, semantic segmentation
2310.10513 Report Unifying Image Processing as Visual Prompting Question Answering Yihao Liu, Xiangyu Chen, Xianzheng Ma, Xintao Wang, Jiantao Zhou, Yu Qiao, Chao Dong Image processing is a fundamental task in computer vision, which aims at enhancing image quality and extracting essential features for subsequent vision applications. Traditionally, task-specific models are developed for individual tasks and designing such models requires distinct expertise. Building upon the success of large language models (LLMs) in natural language processing (NLP), there is a similar trend in computer vision, which focuses on developing large-scale models through pretraining and in-context learning. This paradigm shift reduces the reliance on task-specific models, yielding a powerful unified model to deal with various tasks. However, these advances have predominantly concentrated on high-level vision tasks, with less attention paid to low-level vision tasks. To address this issue, we propose a universal model for general image processing that covers image restoration, image enhancement, image feature extraction tasks, etc. Our proposed framework, named PromptGIP, unifies these diverse image processing tasks within a universal framework. Inspired by NLP question answering (QA) techniques, we employ a visual prompting question answering paradigm. Specifically, we treat the input-output image pair as a structured question-answer sentence, thereby reprogramming the image processing task as a prompting QA problem. PromptGIP can undertake diverse cross-domain tasks using provided visual prompts, eliminating the need for task-specific finetuning. Our methodology offers a universal and adaptive solution to general image processing. While PromptGIP has demonstrated a certain degree of out-of-domain task generalization capability, further research is expected to fully explore its more powerful emergent generalization. This paper introduces PromptGIP, a universal model for general image processing. PromptGIP can handle image restoration, enhancement, and feature extraction tasks within a unified framework. Existing image processing methods often require task-specific models and struggle to generalize across different output domains. This work aims to unify these diverse tasks under one model. PromptGIP leverages a visual prompting question answering paradigm. It treats input-output image pairs as structured question-answer sentences, effectively reprogramming image processing tasks as prompting QA problems. PromptGIP successfully handles up to 15 diverse image processing tasks with promising visual results. It outperforms baseline methods, including the original ViT and Painter, on various tasks, demonstrating the effectiveness of the proposed QA paradigm and masked training strategy. PromptGIP exhibits a certain degree of generalization on out-of-distribution tasks, showcasing its potential for broader application. The current model does not excel at generating unexpected or emergent outcomes, indicating a need for further exploration in enabling true out-of-distribution generalization. The current ViT backbone limits performance on certain tasks, suggesting that incorporating stronger backbones could be beneficial. image processing, visual prompting, question answering, in-context learning, vision transformer
2310.10343 Report ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion Jiayu Yang, Ziang Cheng, Yunfei Duan, Pan Ji, Hongdong Li Given a single image of a 3D object, this paper proposes a novel method (named ConsistNet) that is able to generate multiple images of the same object, as if they were captured from different viewpoints, while the 3D (multi-view) consistencies among those multiple generated images are effectively exploited. Central to our method is a multi-view consistency block which enables information exchange across multiple single-view diffusion processes based on the underlying multi-view geometry principles. ConsistNet is an extension to the standard latent diffusion model, and consists of two sub-modules: (a) a view aggregation module that unprojects multi-view features into global 3D volumes and infers consistency, and (b) a ray aggregation module that samples and aggregates 3D-consistent features back to each view to enforce consistency. Our approach departs from previous methods in multi-view image generation, in that it can be easily dropped into pre-trained LDMs without requiring explicit pixel correspondences or depth prediction. Experiments show that our method effectively learns 3D consistency over a frozen Zero123 backbone and can generate 16 surrounding views of the object within 40 seconds on a single A100 GPU. Our code will be made available on https://github.com/JiayuYANG/ConsistNet This paper proposes ConsistNet, a plug-in module for image diffusion models to generate multi-view consistent images by enforcing 3D consistency. 3D-consistent multi-view image generation is crucial for applications like 3D asset creation in VR/AR and video games, overcoming limitations of existing methods that struggle to maintain strict multi-view geometry consistency. The method uses multiple parallel Latent Diffusion Models (LDMs), one per viewpoint, connected by ConsistNet. This module enforces consistency through view aggregation (unprojecting features to 3D and using attention) and ray aggregation (sampling 3D features and projecting back to 2D). ConsistNet effectively learns 3D consistency when applied to a frozen Zero123 model. It generates 16 surrounding views of an object in 40 seconds on a single A100 GPU. The model demonstrates good generalization ability when evaluated on unseen data (Google Scanned Objects dataset). Quantitative evaluation metrics may not fully capture the inherent ambiguity of generating unseen views from a single image. Future work includes improving computational efficiency and developing a 3D reconstruction module for mesh generation alongside image denoising. multi-view image generation, 3d consistency, diffusion models, latent diffusion models, generative models
2310.10123 Report AutoDIR: Automatic All-in-One Image Restoration with Latent Diffusion Yitong Jiang, Zhaoyang Zhang, Tianfan Xue, Jinwei Gu In this paper, we aim to solve complex real-world image restoration situations, in which, one image may have a variety of unknown degradations. To this end, we propose an all-in-one image restoration framework with latent diffusion (AutoDIR), which can automatically detect and address multiple unknown degradations. Our framework first utilizes a Blind Image Quality Assessment Module (BIQA) to automatically detect and identify the unknown dominant image degradation type of the image. Then, an All-in-One Image Refinement (AIR) Module handles multiple kinds of degradation image restoration with the guidance of BIQA. Finally, a Structure Correction Module (SCM) is proposed to recover the image details distorted by AIR. Our comprehensive evaluation demonstrates that AutoDIR outperforms state-of-the-art approaches by achieving superior restoration results while supporting a wider range of tasks. Notably, AutoDIR is also the first method to automatically handle real-scenario images with multiple unknown degradations. Proposes AutoDIR, an all-in-one image restoration system using latent diffusion for automatic detection and restoration of images with unknown degradations. Addresses limitations of single-task image restoration methods and aims to learn a unified model capable of handling real-world images with multiple unknown degradations. Combines a Blind Image Quality Assessment (BIQA) module for degradation detection, an All-in-One Image Restoration (AIR) module based on latent diffusion for restoration, and a Structural Correction Module (SCM) for refining image details. AutoDIR outperforms state-of-the-art methods in seven image restoration tasks, including denoising, deblurring, super-resolution, low-light enhancement, dehazing, deraining, and deraindrop removal. Effectively handles images with unknown degradations, as demonstrated on Under-Display Camera and Underwater datasets. Shows promise as a foundation model for image restoration, exhibiting effective few-shot learning capabilities for new tasks like desnowing. Computational cost remains high compared to non-generative networks. Currently focuses on global image restoration, with limited capabilities for local region-based editing. image restoration, latent diffusion model, blind image quality assessment, foundation model, multi-task learning
2310.09965 Report ProteusNeRF: Fast Lightweight NeRF Editing using 3D-Aware Image Context Binglun Wang, Niladri Shekhar Dutt, Niloy J. Mitra Neural Radiance Fields (NeRFs) have recently emerged as a popular option for photo-realistic object capture due to their ability to faithfully capture high-fidelity volumetric content even from handheld video input. Although much research has been devoted to efficient optimization leading to real-time training and rendering, options for interactively editing NeRFs remain limited. We present a very simple but effective neural network architecture that is fast and efficient while maintaining a low memory footprint. This architecture can be incrementally guided through user-friendly image-based edits. Our representation allows straightforward object selection via semantic feature distillation at the training stage. More importantly, we propose a local 3D-aware image context to facilitate view-consistent image editing that can then be distilled into fine-tuned NeRFs, via geometric and appearance adjustments. We evaluate our setup on a variety of examples to demonstrate appearance and geometric edits and report 10-30x speedup over concurrent work focusing on text-guided NeRF editing. Video results can be seen on our project webpage at https://proteusnerf.github.io. Presents ProteusNeRF, a fast and lightweight framework for editing NeRF assets using traditional or generative image manipulation tools via a novel 3D-aware image context. NeRF editing remains limited despite advancements in real-time training and rendering, creating a need for intuitive, expressive, and efficient editing tools. Employs a residual tri-plane feature field for NeRF representation, enables object selection through semantic feature distillation, and utilizes a 3D-aware image context for synchronized multi-view editing. Edits are distilled back into the NeRF via geometric and appearance adjustments. Achieves 10-30x speedup over concurrent text-guided NeRF editing methods. Allows for both small appearance edits (e.g., recoloring) and larger edits involving geometry changes. Supports layered editing with a minimal memory footprint (4-36KB/edit) for appearance modifications. Limited support for large geometric changes. Inability to handle view-dependent specular effects. nerf, neural radiance fields, 3d editing, generative editing, image-based editing
2310.09912 Report Unsupervised Discovery of Interpretable Directions in h-space of Pre-trained Diffusion Models Zijian Zhang, Luping Liu, Zhijie Lin, Yichen Zhu, Zhou Zhao We propose the first unsupervised and learning-based method to identify interpretable directions in h-space of pre-trained diffusion models. Our method is derived from an existing technique that operates on the GAN latent space. Specifically, we employ a shift control module that works on h-space of pre-trained diffusion models to manipulate a sample into a shifted version of itself, followed by a reconstructor to reproduce both the type and the strength of the manipulation. By jointly optimizing them, the model will spontaneously discover disentangled and interpretable directions. To prevent the discovery of meaningless and destructive directions, we employ a discriminator to maintain the fidelity of shifted sample. Due to the iterative generative process of diffusion models, our training requires a substantial amount of GPU VRAM to store numerous intermediate tensors for back-propagating gradient. To address this issue, we propose a general VRAM-efficient training algorithm based on gradient checkpointing technique to back-propagate any gradient through the whole generative process, with acceptable occupancy of VRAM and sacrifice of training efficiency. Compared with existing related works on diffusion models, our method inherently identifies global and scalable directions, without necessitating any other complicated procedures. Extensive experiments on various datasets demonstrate the effectiveness of our method. The paper proposes the first unsupervised, learning-based method to identify interpretable directions in the h-space of pre-trained diffusion models for semantic image manipulation. Existing methods for discovering meaningful directions in the latent space of diffusion models rely on external supervision (e.g., human annotation, CLIP). This work aims to achieve similar results in a fully unsupervised manner. The method employs a shift control module and a reconstructor. The shift control module manipulates a sample into a shifted version by modifying its representation in the h-space. The reconstructor aims to reproduce the applied shift. A discriminator is introduced to maintain the fidelity of shifted samples during training. Additionally, a VRAM-efficient training algorithm based on gradient checkpointing is proposed to handle the memory-intensive training process. The method successfully discovers disentangled and interpretable directions in the h-space of pre-trained diffusion models, enabling semantic image manipulation. The proposed VRAM-efficient training algorithm significantly reduces memory consumption during training while maintaining comparable efficiency to the standard approach. Quantitative evaluations using reconstructor classification accuracy (RCA) and mean opinion score (MOS) demonstrate the effectiveness of the proposed method. The training and inference speed is limited by the multi-step iterative generation process inherent to diffusion models. The reliance on adversarial training to maintain sample fidelity may introduce complexity, and simpler alternative methods could be explored. diffusion models, unsupervised learning, semantic image manipulation, latent space, interpretable directions
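A condensed sketch of the training signal, in the spirit of the GAN-latent-direction technique the paper adapts: a shift module perturbs the bottleneck (h-space) feature along one of K learnable directions, and a reconstructor must recover which direction was applied and how strongly. The interfaces below (how h is exposed, the reconstructor's outputs) are assumptions; the fidelity discriminator and the gradient-checkpointed backpropagation through the sampler are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ShiftControl(nn.Module):
    """Holds K candidate directions in the diffusion U-Net bottleneck (h-space)."""

    def __init__(self, num_dirs: int, h_dim: int):
        super().__init__()
        self.dirs = nn.Parameter(0.01 * torch.randn(num_dirs, h_dim))

    def forward(self, h: torch.Tensor, k: torch.Tensor, eps: torch.Tensor) -> torch.Tensor:
        # Shift each feature h along its sampled direction k with strength eps.
        return h + eps.view(-1, 1) * F.normalize(self.dirs[k], dim=-1)


def discovery_loss(reconstructor, x, x_shifted, k, eps):
    # The reconstructor predicts both the direction index and its strength from the
    # (original, shifted) pair, which pushes directions to be distinct and disentangled.
    k_logits, eps_pred = reconstructor(x, x_shifted)
    return F.cross_entropy(k_logits, k) + F.l1_loss(eps_pred, eps)
```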
2310.09711 Report LOVECon: Text-driven Training-Free Long Video Editing with ControlNet Zhenyi Liao, Zhijie Deng Leveraging pre-trained conditional diffusion models for video editing without further tuning has gained increasing attention due to its promise in film production, advertising, etc. Yet, seminal works in this line fall short in generation length, temporal coherence, or fidelity to the source video. This paper aims to bridge the gap, establishing a simple and effective baseline for training-free diffusion model-based long video editing. As suggested by prior arts, we build the pipeline upon ControlNet, which excels at various image editing tasks based on text prompts. To break down the length constraints caused by limited computational memory, we split the long video into consecutive windows and develop a novel cross-window attention mechanism to ensure the consistency of global style and maximize the smoothness among windows. To achieve more accurate control, we extract the information from the source video via DDIM inversion and integrate the outcomes into the latent states of the generations. We also incorporate a video frame interpolation model to mitigate the frame-level flickering issue. Extensive empirical studies verify the superior efficacy of our method over competing baselines across scenarios, including the replacement of the attributes of foreground objects, style transfer, and background replacement. In particular, our method manages to edit videos with up to 128 frames according to user requirements. Code is available at https://github.com/zhijie-group/LOVECon. This paper introduces LOVECon, a simple yet effective training-free diffusion model-based method for long video editing. Existing training-free methods for video editing struggle with long videos, exhibiting inconsistencies in global style and local details, especially in maintaining temporal coherence and fidelity to the source video. LOVECon aims to address these limitations. LOVECon leverages pre-trained Stable Diffusion and ControlNet, incorporating DDIM inversion for source frame information. It introduces three key components: (1) Cross-window attention for inter-window consistency, (2) Latent fusion of source and edited frames for structural fidelity, and (3) Frame interpolation to mitigate flickering. LOVECon outperforms baselines in maintaining fidelity to the source video and temporal consistency, as evidenced by quantitative metrics and user studies. LOVECon demonstrates precise editing capabilities while preserving fine details, unlike some baselines that suffer from semantic leakage or blurring. LOVECon can effectively edit videos up to 128 frames, showcasing its capability for long video editing. LOVECon, relying on ControlNet, is limited in handling significant shape changes in editing. Suboptimal editing results are observed when the source video contains substantial content changes, such as large movements. video editing, diffusion models, controlnet, temporal consistency, long video generation
2310.09469 Report Towards More Accurate Diffusion Model Acceleration with A Timestep Aligner Mengfei Xia, Yujun Shen, Changsong Lei, Yu Zhou, Ran Yi, Deli Zhao, Wenping Wang, Yong-jin Liu A diffusion model, which is formulated to produce an image using thousands of denoising steps, usually suffers from a slow inference speed. Existing acceleration algorithms simplify the sampling by skipping most steps yet exhibit considerable performance degradation. By viewing the generation of diffusion models as a discretized integrating process, we argue that the quality drop is partly caused by applying an inaccurate integral direction to a timestep interval. To rectify this issue, we propose a timestep aligner that helps find a more accurate integral direction for a particular interval at the minimum cost. Specifically, at each denoising step, we replace the original parameterization by conditioning the network on a new timestep, which is obtained by aligning the sampling distribution to the real distribution. Extensive experiments show that our plug-in design can be trained efficiently and boost the inference performance of various state-of-the-art acceleration methods, especially when there are few denoising steps. For example, when using 10 denoising steps on the popular LSUN Bedroom dataset, we improve the FID of DDIM from 9.65 to 6.07, simply by adopting our method for a more appropriate set of timesteps. Code will be made publicly available. This paper proposes Time step Aligner (TSA), a plug-in method to enhance the accuracy of accelerated diffusion model sampling by re-aligning time steps. Accelerated diffusion model sampling, while reducing computation, often suffers from quality degradation due to discrepancies between real and sampling distributions. This method aims to bridge this gap for improved sampling fidelity. TSA searches for a more suitable time step (τ) for each denoising step, replacing the original time step (t) as input to the pre-trained noise prediction model. It minimizes the distance between predicted noise at the re-aligned time step and the actual noise, effectively aligning the distributions. TSA significantly boosts the performance of various acceleration methods, particularly for low function evaluation counts (NFE). The improvement is consistent across diverse datasets (CIFAR10, CelebA, LSUN Bedroom, FFHQ, ImageNet, MS-COCO) and tasks (unconditional & conditional generation). Experiments validate the theoretical claims, showing monotonic FID reduction with progressive time step re-alignment and a decrease in the distribution gap. The parallel training strategy, while significantly faster, shows slightly lower performance compared to the sequential approach. Exploration of methods to enhance the parallel training strategy's performance is a potential future direction. diffusion models, image generation, sampling acceleration, time step alignment, truncation error
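The core idea of this entry, learning a re-aligned timestep tau[i] for every accelerated sampling step i so the frozen noise predictor's output at tau[i] better matches the true noise, can be sketched as below. It assumes the predictor accepts continuous timestep inputs and that alphas_cumprod is indexed by integer step; the schedule and optimization details are placeholders, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F


def align_timesteps(eps_theta, sampler_timesteps, data_loader, alphas_cumprod,
                    iters=1000, lr=1e-2, device="cuda"):
    """Learn one re-aligned timestep tau[i] per accelerated sampling step (sketch only).

    sampler_timesteps: LongTensor with the accelerated schedule (e.g., 10 DDIM steps).
    """
    tau = torch.nn.Parameter(sampler_timesteps.float().clone().to(device))
    opt = torch.optim.Adam([tau], lr=lr)
    for _, (x0, _) in zip(range(iters), data_loader):
        x0 = x0.to(device)
        i = torch.randint(0, len(sampler_timesteps), (1,)).item()
        t = sampler_timesteps[i]
        a_bar = alphas_cumprod[t]
        noise = torch.randn_like(x0)
        x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise   # diffuse to step t
        # Condition the frozen predictor on the learnable tau[i] instead of t.
        t_in = tau[i:i + 1].expand(x0.shape[0])
        loss = F.mse_loss(eps_theta(x_t, t_in), noise)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return tau.detach()
```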
2310.09458 Report PaintHuman: Towards High-fidelity Text-to-3D Human Texturing via Denoised Score Distillation Jianhui Yu, Hao Zhu, Liming Jiang, Chen Change Loy, Weidong Cai, Wayne Wu Recent advances in zero-shot text-to-3D human generation, which employ the human model prior (e.g., SMPL) or Score Distillation Sampling (SDS) with pre-trained text-to-image diffusion models, have been groundbreaking. However, SDS may provide inaccurate gradient directions under the weak diffusion guidance, as it tends to produce over-smoothed results and generate body textures that are inconsistent with the detailed mesh geometry. Therefore, directly leveraging existing strategies for high-fidelity text-to-3D human texturing is challenging. In this work, we propose a model called PaintHuman to address the challenges from two aspects. We first propose a novel score function, Denoised Score Distillation (DSD), which directly modifies the SDS by introducing negative gradient components to iteratively correct the gradient direction and generate high-quality textures. In addition, we use the depth map as geometric guidance to ensure the texture is semantically aligned to human mesh surfaces. To guarantee the quality of rendered results, we employ geometry-aware networks to predict surface materials and render realistic human textures. Extensive experiments, benchmarked against state-of-the-art methods, validate the efficacy of our approach. Presents PaintHuman, a zero-shot text-to-human texture generation model that leverages a novel Denoised Score Distillation (DSD) method for high-quality, detailed textures aligned to input text. Addresses limitations of existing text-to-3D human texturing methods that produce over-smoothed results, inconsistent textures, and semantic misalignment with input texts. Introduces DSD, which refines gradient direction using negative image-text pairs during the diffusion process, utilizes depth signals for accurate geometry-aware texturing, and employs a differentiable SV-BRDF network for realistic material prediction and rendering. Generates high-quality human avatars with detailed textures, surpassing existing methods in visual fidelity and semantic alignment. Demonstrates the efficacy of DSD in mitigating over-smoothing issues and achieving higher CLIP scores compared to baseline models. Shows significant improvements in user study evaluations, highlighting the superior quality and text faithfulness of the generated textures. Current semantic zoom implementation relies on manual adjustment for face region; future work may explore automatic detection. Exploring the potential of DSD in broader 3D texturing tasks beyond human avatars could be a promising research direction. text-to-3d, texture generation, diffusion models, score distillation sampling, denoising score distillation
2310.09382 Report LL-VQ-VAE: Learnable Lattice Vector-Quantization For Efficient Representations Ahmed Khalil, Robert Piechocki, Raul Santos-Rodriguez In this paper we introduce learnable lattice vector quantization and demonstrate its effectiveness for learning discrete representations. Our method, termed LL-VQ-VAE, replaces the vector quantization layer in VQ-VAE with lattice-based discretization. The learnable lattice imposes a structure over all discrete embeddings, acting as a deterrent against codebook collapse, leading to high codebook utilization. Compared to VQ-VAE, our method obtains lower reconstruction errors under the same training conditions, trains in a fraction of the time, and with a constant number of parameters (equal to the embedding dimension D), making it a very scalable approach. We demonstrate these results on the FFHQ-1024 dataset and include FashionMNIST and Celeb-A. This paper introduces Learnable Lattice VQ-VAE (LL-VQ-VAE), replacing vector quantization in VQ-VAE with a learnable lattice layer for efficient latent discretization. The proposed method addresses limitations of VQ-VAE, such as codebook collapse and a trade-off between quantization quality and speed, by imposing a structured lattice representation on discrete embeddings. The LL-VQ-VAE utilizes a learnable lattice basis matrix to define the embedding space. The Babai Rounding Estimate is used for quantization, and a size loss term encourages lattice sparsity, controlling the number of effective embeddings. LL-VQ-VAE achieves lower reconstruction errors than VQ-VAE and its EMA variant on datasets like FFHQ-1024, Celeb-A, and FashionMNIST. It significantly reduces the number of trainable parameters in the quantization layer, making it more scalable. The method demonstrates faster training times compared to VQ-VAE while maintaining high quantization quality and resisting codebook collapse. The paper acknowledges the difficulty of imposing a hard upper limit on the number of embeddings due to the infinite nature of the lattice. Future work aims to explore the relationship between quantization strategies and the preservation of image properties, as well as the resilience of the learned representations to distortions. learnable lattice, vector quantization, vq-vae, discrete representation learning, codebook collapse
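A toy version of lattice quantization with the Babai rounding estimate is shown below. A full D x D basis is used purely for clarity; since the paper reports a parameter count equal to the embedding dimension D, its actual basis is presumably more constrained, and the auxiliary size loss is only hinted at.

```python
import torch
import torch.nn as nn


class LatticeQuantizer(nn.Module):
    """Toy learnable-lattice quantizer (a sketch, not the paper's implementation)."""

    def __init__(self, dim: int):
        super().__init__()
        self.basis = nn.Parameter(torch.eye(dim))  # learnable lattice basis B

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # Babai rounding estimate: express z in lattice coordinates, round to the
        # nearest integer vector, then map back through the basis.
        coords = z @ torch.linalg.inv(self.basis)
        z_q = torch.round(coords) @ self.basis
        # Straight-through estimator: gradients reach the encoder; the basis itself
        # is trained via auxiliary terms such as the sparsity-encouraging size loss.
        return z + (z_q - z).detach()

    def size_loss(self) -> torch.Tensor:
        # Penalizing large basis entries keeps the lattice compact, limiting the
        # number of effective embeddings.
        return self.basis.abs().mean()
```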
2310.09199 Report PaLI-3 Vision Language Models: Smaller, Faster, Stronger Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, Radu Soricut This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger. As part of arriving at this strong performance, we compare Vision Transformer (ViT) models pretrained using classification objectives to contrastively (SigLIP) pretrained ones. We find that, while slightly underperforming on standard image classification benchmarks, SigLIP-based PaLI shows superior performance across various multimodal benchmarks, especially on localization and visually-situated text understanding. We scale the SigLIP image encoder up to 2 billion parameters, and achieve a new state-of-the-art on multilingual cross-modal retrieval. We hope that PaLI-3, at only 5B parameters, rekindles research on fundamental pieces of complex VLMs, and could fuel a new generation of scaled-up models. This paper presents PaLI-3, a 5B parameter vision language model (VLM) that achieves state-of-the-art performance on various benchmarks, particularly excelling in visually-situated text understanding and object localization, despite being 10x smaller than previous state-of-the-art models. Smaller VLMs are important for practical applications due to easier training, deployment, environmental friendliness, and faster research cycles. The model utilizes a contrastively pretrained ViT-G image encoder (SigLIP) instead of classification pretraining, and is trained in three stages: unimodal pretraining, multimodal training with a refined dataset mixture, and resolution increase. Achieves new state-of-the-art results on visually-situated text understanding benchmarks, such as TextCaps and TextVQA. Outperforms previous models on several video QA benchmarks, despite not being pretrained on video data. Introduces a 2B parameter multilingual SigLIP model that achieves state-of-the-art on multilingual cross-modal retrieval. Similar limitations to existing VLMs in terms of potential biases and fairness issues. Further improvements in reasoning capabilities are needed, particularly for tasks like AI2D and ChartQA. vision language model, contrastive pretraining, visually-situated text understanding, object localization, multimodal learning
2310.08949 Report EasyGen: Easing Multimodal Generation with BiDiffuser and LLMs Xiangyu Zhao, Bo Liu, Qijiong Liu, Guangyuan Shi, Xiao-Ming Wu We present EasyGen, an efficient model designed to enhance multimodal understanding and generation by harnessing the capabilities of diffusion models and large language models (LLMs). Unlike existing multimodal models that predominantly depend on encoders like CLIP or ImageBind and need ample amounts of training data to bridge modalities, EasyGen leverages BiDiffuser, a bidirectional conditional diffusion model, to foster more efficient modality interactions. EasyGen achieves text generation by training a projection layer linking BiDiffuser and an LLM, and facilitates image generation by training an adapter to align the LLM's text space with the BiDiffuser's image space. Comprehensive quantitative and qualitative experiments show that EasyGen excels in data-efficient training, high-quality image generation, and extendibility, effectively addressing the challenges in multimodal generation. The source code is available at https://github.com/zxy556677/EasyGen. This paper introduces EasyGen, an end-to-end multimodal model that leverages a bidirectional conditional diffusion model (BiDiffuser) and LLMs for efficient multimodal understanding and generation. Existing multimodal models often struggle with inefficient modality interactions and limited generation capabilities beyond text. EasyGen aims to address these limitations. EasyGen utilizes BiDiffuser, a fine-tuned version of UniDiffuser, for bidirectional image-text generation. It employs a projection layer to align BiDiffuser with LLMs for text generation tasks like image captioning and VQA. For image generation, an adapter integrates semantic information from the LLM into BiDiffuser. EasyGen achieves competitive performance on image captioning and VQA tasks with high data efficiency. The model demonstrates superior image generation quality compared to other end-to-end MLLMs. EasyGen is easily extendable, accommodating advanced visual encoders or enhancing existing multimodal LLMs like LLaVA. The diffusion-based approach may lead to longer processing times for image-to-text and text-to-image generation. Future work could focus on exploring efficient sampling methods to enhance EasyGen's overall efficiency. multimodal generation, diffusion models, large language models, image captioning, visual question answering
2310.08921 Report Feature Proliferation -- the "Cancer" in StyleGAN and its Treatments Shuang Song, Yuanbang Liang, Jing Wu, Yu-Kun Lai, Yipeng Qin Despite the success of StyleGAN in image synthesis, the images it synthesizes are not always perfect and the well-known truncation trick has become a standard post-processing technique for StyleGAN to synthesize high-quality images. Although effective, it has long been noted that the truncation trick tends to reduce the diversity of synthesized images and unnecessarily sacrifices many distinct image features. To address this issue, in this paper, we first delve into the StyleGAN image synthesis mechanism and discover an important phenomenon, namely Feature Proliferation, which demonstrates how specific features reproduce with forward propagation. Then, we show how the occurrence of Feature Proliferation results in StyleGAN image artifacts. As an analogy, we refer to it as the "cancer" in StyleGAN from its proliferating and malignant nature. Finally, we propose a novel feature rescaling method that identifies and modulates risky features to mitigate feature proliferation. Thanks to our discovery of Feature Proliferation, the proposed feature rescaling method is less destructive and retains more useful image features than the truncation trick, as it is more fine-grained and works in a lower-level feature space rather than a high-level latent space. Experimental results justify the validity of our claims and the effectiveness of the proposed feature rescaling method. Our code is available at https://github.com/songc42/Feature-proliferation. This paper introduces "Feature Proliferation", a phenomenon in StyleGAN where certain features with abnormal values reproduce during forward propagation, leading to image artifacts. This phenomenon explains a cause of artifacts in StyleGAN-generated images and allows for targeted correction without sacrificing image diversity as much as the truncation trick. The authors analyze the StyleGAN architecture, particularly the weight modulation and demodulation steps, to identify how Feature Proliferation occurs. They propose a feature rescaling method that identifies and adjusts risky features early in the network. Feature Proliferation is directly linked to image artifacts in StyleGAN. Proposed feature rescaling mitigates artifacts while better preserving image features compared to the truncation trick. The method is compatible with StyleGAN latent space operations like interpolation and image editing. Current feature identification and rescaling may still remove some useful features. Future work will focus on more precise feature processing and investigating Feature Proliferation in other network architectures. stylegan, generative adversarial networks, image synthesis, feature proliferation, artifact removal
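One plausible reading of the feature-rescaling treatment, as a sketch: flag channels whose magnitudes are statistical outliers within an intermediate feature map and damp them before they can proliferate through later layers. The outlier criterion and the layer at which it is applied are assumptions, not the paper's exact procedure.

```python
import torch


def rescale_risky_features(feat: torch.Tensor, threshold: float = 3.0) -> torch.Tensor:
    """feat: (N, C, H, W) intermediate StyleGAN feature map."""
    norms = feat.flatten(2).norm(dim=2).clamp_min(1e-8)   # per-channel magnitude, (N, C)
    mean = norms.mean(dim=1, keepdim=True)
    std = norms.std(dim=1, keepdim=True)
    cap = mean + threshold * std
    # Channels with abnormally large magnitudes are the "risky" candidates for
    # proliferation; scale them back down to the cap, leave the rest untouched.
    scale = torch.where(norms > cap, cap / norms, torch.ones_like(norms))
    return feat * scale[:, :, None, None]
```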
2310.08587 Report Pseudo-Generalized Dynamic View Synthesis from a Video Xiaoming Zhao, Alex Colburn, Fangchang Ma, Miguel Angel Bautista, Joshua M. Susskind, Alexander G. Schwing Rendering scenes observed in a monocular video from novel viewpoints is a challenging problem. For static scenes the community has studied both scene-specific optimization techniques, which optimize on every test scene, and generalized techniques, which only run a deep net forward pass on a test scene. In contrast, for dynamic scenes, scene-specific optimization techniques exist, but, to our best knowledge, there is currently no generalized method for dynamic novel view synthesis from a given monocular video. To answer whether generalized dynamic novel view synthesis from monocular videos is possible today, we establish an analysis framework based on existing techniques and work toward the generalized approach. We find a pseudo-generalized process without scene-specific appearance optimization is possible, but geometrically and temporally consistent depth estimates are needed. Despite no scene-specific appearance optimization, the pseudo-generalized approach improves upon some scene-specific methods. This paper investigates the feasibility of generalizing dynamic novel view synthesis from monocular videos, aiming to reduce reliance on computationally expensive scene-specific optimization. Developing generalized methods for dynamic view synthesis from monocular videos is crucial for applications in AR/VR and robotics but remains challenging due to the task's ill-posed nature. Existing methods heavily rely on scene-specific optimization, hindering scalability. The authors establish an analysis framework inspired by scene-specific methods, separating the rendering of static and dynamic content. They adapt a pre-trained generalizable NeRF transformer (GNT) for static content and investigate the use of depth and temporal priors (optical flow and tracking) for dynamic content rendering. Complete generalization with current depth estimation or tracking methods is not yet achievable. A pseudo-generalized approach, using consistent depth estimates but avoiding scene-specific appearance optimization, outperforms several scene-specific methods on perceptual quality metrics. Consistent depth is identified as a sufficient condition for generalized dynamic novel view synthesis from monocular videos. Simple temporal aggregation using tracking methods does not yet yield satisfactory results, indicating a need for more sophisticated designs. Future work includes exploring context-aware inpainting to address artifacts such as missing foreground parts and blurry backgrounds. novel view synthesis, dynamic scenes, monocular video, generalizable methods, consistent depth
2310.08579 Report HyperHuman: Hyper-Realistic Human Generation with Latent Structural Diffusion Xian Liu, Jian Ren, Aliaksandr Siarohin, Ivan Skorokhodov, Yanyu Li, Dahua Lin, Xihui Liu, Ziwei Liu, Sergey Tulyakov Despite significant advances in large-scale text-to-image models, achieving hyper-realistic human image generation remains a desirable yet unsolved task. Existing models like Stable Diffusion and DALL-E 2 tend to generate human images with incoherent parts or unnatural poses. To tackle these challenges, our key insight is that human images are inherently structural over multiple granularities, from the coarse-level body skeleton to fine-grained spatial geometry. Therefore, capturing such correlations between the explicit appearance and latent structure in one model is essential to generate coherent and natural human images. To this end, we propose a unified framework, HyperHuman, that generates in-the-wild human images of high realism and diverse layouts. Specifically, 1) we first build a large-scale human-centric dataset, named HumanVerse, which consists of 340M images with comprehensive annotations like human pose, depth, and surface normal. 2) Next, we propose a Latent Structural Diffusion Model that simultaneously denoises the depth and surface normal along with the synthesized RGB image. Our model enforces the joint learning of image appearance, spatial relationship, and geometry in a unified network, where each branch in the model complements the others with both structural awareness and textural richness. 3) Finally, to further boost the visual quality, we propose a Structure-Guided Refiner to compose the predicted conditions for more detailed generation of higher resolution. Extensive experiments demonstrate that our framework yields the state-of-the-art performance, generating hyper-realistic human images under diverse scenarios. Project Page: https://snap-research.github.io/HyperHuman/ This paper introduces HyperHuman, a novel framework for generating highly realistic and diverse human images with controllable layouts, addressing the limitations of existing text-to-image models in accurately depicting human anatomy and poses. Generating hyper-realistic human images is crucial for various applications like image animation and virtual try-on, but existing models often produce incoherent or unnatural results. HyperHuman aims to overcome these limitations by explicitly modeling the inherent multi-level structure of human images. The approach consists of two main stages: 1) Latent Structural Diffusion Model: This model jointly denoises RGB, depth, and surface-normal maps, capturing the correlations between appearance and structure. 2) Structure-Guided Refiner: It leverages the predicted structural maps to generate high-resolution images with improved detail and fidelity. The authors also create a large-scale human-centric dataset, HumanVerse, containing 340M images with comprehensive annotations for training and evaluation. HyperHuman significantly outperforms previous state-of-the-art models in terms of image quality, pose accuracy, and text-image alignment on the MS-COCO 2014 validation human subset. The model demonstrates strong robustness to the impact of random seeds and unseen poses, as shown by the generated results. Qualitative analysis and user studies confirm that HyperHuman generates more realistic, aesthetically pleasing, and text-aligned human images compared to baseline methods. The generation of subtle details like fingers and eyes is limited by the performance of existing pose, depth, and normal estimators. Future work includes exploring deep priors for text-to-pose generation, eliminating the current reliance on body skeleton input. text-to-image generation, human image synthesis, diffusion models, controllable image generation, structure-aware generation
2310.08577 Report Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models Vishaal Udandarao, Max F. Burg, Samuel Albanie, Matthias Bethge Recent advances in the development of vision-language models (VLMs) are yielding remarkable success in recognizing visual semantic content, including impressive instances of compositional image understanding. Here, we introduce the novel task of Visual Data-Type Identification, a basic perceptual skill with implications for data curation (e.g., noisy data-removal from large datasets, domain-specific retrieval) and autonomous vision (e.g., distinguishing changing weather conditions from camera lens staining). We develop two datasets consisting of animal images altered across a diverse set of 27 visual data-types, spanning four broad categories. An extensive zero-shot evaluation of 39 VLMs, ranging from 100M to 80B parameters, shows a nuanced performance landscape. While VLMs are reasonably good at identifying certain stylistic data-types, such as cartoons and sketches, they struggle with simpler data-types arising from basic manipulations like image rotations or additive noise. Our findings reveal that (i) model scaling alone yields marginal gains for contrastively-trained models like CLIP, and (ii) there is a pronounced drop in performance for the largest auto-regressively trained VLMs like OpenFlamingo. This finding points to a blind spot in current frontier VLMs: they excel in recognizing semantic content but fail to acquire an understanding of visual data-types through scaling. By analyzing the pre-training distributions of these models and incorporating data-type information into the captions during fine-tuning, we achieve a significant enhancement in performance. By exploring this previously uncharted task, we aim to set the stage for further advancing VLMs to equip them with visual data-type understanding. Code and datasets are released at https://github.com/bethgelab/DataTypeIdentification. This paper introduces the novel task of *Visual Data-Type Identification*, where a model identifies how an image was generated (e.g., blurred, rotated) in addition to recognizing semantic content. This task is important for various applications such as data curation, filtering, and autonomous vision, where understanding image generation processes can be as crucial as recognizing semantic content. The authors create two datasets, *SyntheticTypeIdent* and *NaturalTypeIdent*, consisting of animal images altered with 27 different data-types. They benchmark 39 state-of-the-art VLMs, ranging from 100M to 80B parameters, on their ability to identify these data-types. VLMs struggle with identifying many data-types, particularly those involving simple transformations like noise addition or rotation. Scaling VLM size yields only marginal performance improvements, suggesting current models are not inherently learning to recognize data-types through increased scale alone. Performance can be significantly improved by fine-tuning VLMs with data explicitly containing data-type information. The study is limited to animal images and a specific set of data-types. Future work can explore alternative training objectives or architectures that explicitly encourage data-type representation learning in VLMs. vision-language models, data-type identification, dataset bias, model scaling, fine-tuning
2310.08541 Report Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation Zhengyuan Yang, Jianfeng Wang, Linjie Li, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Lijuan Wang We introduce "Idea to Image," a system that enables multimodal iterative self-refinement with GPT-4V(ision) for automatic image design and generation. Humans can quickly identify the characteristics of different text-to-image (T2I) models via iterative explorations. This enables them to efficiently convert their high-level generation ideas into effective T2I prompts that can produce good images. We investigate if systems based on large multimodal models (LMMs) can develop analogous multimodal self-refinement abilities that enable exploring unknown models or environments via self-refining tries. Idea2Img cyclically generates revised T2I prompts to synthesize draft images, and provides directional feedback for prompt revision, both conditioned on its memory of the probed T2I model's characteristics. The iterative self-refinement brings Idea2Img various advantages over vanilla T2I models. Notably, Idea2Img can process input ideas with interleaved image-text sequences, follow ideas with design instructions, and generate images of better semantic and visual qualities. The user preference study validates the efficacy of multimodal iterative self-refinement on automatic image design and generation. This paper introduces "Idea to Image" (Idea2Img), a system that utilizes multimodal iterative self-refinement with large multimodal models (LMMs) like GPT-4V for automatic image design and generation. The goal is to mimic the human ability to iteratively explore and understand the characteristics of text-to-image (T2I) models, thereby generating more effective prompts and higher-quality images. Idea2Img employs a cyclical process where an LMM generates text prompts, selects the best draft image from the T2I model, provides feedback on discrepancies, and refines the prompt iteratively. This process is guided by a memory module storing the history of prompts, images, and feedback. Idea2Img can handle complex user ideas containing interleaved image-text sequences and design instructions. Idea2Img consistently outperforms baseline methods and human-written prompts in user preference studies across various T2I models, including SDXL and DeepFloyd IF. Stronger T2I models benefit more from Idea2Img's iterative refinement, indicating a synergistic effect between LMM guidance and T2I capabilities. Current work focuses on image generation; future work can explore applying the framework to other multimodal tasks like GUI navigation and embodied agents. While the current system explores using a single generation model, extending it to manage and optimize the collaboration of multiple tools is a promising direction. image generation, text-to-image, multimodal learning, iterative refinement, large language models
2310.08534 Report Animating Street View Mengyi Shan, Brian Curless, Ira Kemelmacher-Shlizerman, Steve Seitz We present a system that automatically brings street view imagery to life by populating it with naturally behaving, animated pedestrians and vehicles. Our approach is to remove existing people and vehicles from the input image, insert moving objects with proper scale, angle, motion, and appearance, plan paths and traffic behavior, as well as render the scene with plausible occlusion and shadowing effects. The system achieves these by reconstructing the still image street scene, simulating crowd behavior, and rendering with consistent lighting, visibility, occlusions, and shadows. We demonstrate results on a diverse range of street scenes including regular still images and panoramas. This paper presents a system that automatically animates still street view images by populating them with naturally behaving pedestrians and vehicles, respecting the scene's geometry and illumination. This approach enhances the vividness of street view imagery without additional capture and privacy concerns, offering a more immersive experience. The system uses a three-stage pipeline: (1) Reconstruction of scene geometry, semantics, and lighting. (2) Simulation of pedestrian and vehicle behaviors. (3) Rendering of 3D assets into the scene with realistic shadows and occlusions. Realistic animation of street scenes with moving pedestrians and vehicles. Accurate shadow rendering and occlusion handling based on 3D scene understanding. Effective traffic simulation at crosswalks with dynamic pedestrian and vehicle interactions. System struggles with curved lanes, hills, and complex shadow scenarios. Limited diversity in the appearance of generated pedestrians and vehicles. image animation, scene reconstruction, crowd simulation, rendering, street view
2310.08530 Report UniPose: Detecting Any Keypoints Jie Yang, Ailing Zeng, Ruimao Zhang, Lei Zhang This work proposes a unified framework called UniPose to detect keypoints of any articulated (e.g., human and animal), rigid, and soft objects via visual or textual prompts for fine-grained vision understanding and manipulation. Keypoint is a structure-aware, pixel-level, and compact representation of any object, especially articulated objects. Existing fine-grained promptable tasks mainly focus on object instance detection and segmentation but often fail to identify fine-grained granularity and structured information of image and instance, such as eyes, leg, paw, etc. Meanwhile, prompt-based keypoint detection is still under-explored. To bridge the gap, we make the first attempt to develop an end-to-end prompt-based keypoint detection framework called UniPose to detect keypoints of any objects. As keypoint detection tasks are unified in this framework, we can leverage 13 keypoint detection datasets with 338 keypoints across 1,237 categories over 400K instances to train a generic keypoint detection model. UniPose can effectively align text-to-keypoint and image-to-keypoint due to the mutual enhancement of textual and visual prompts based on the cross-modality contrastive learning optimization objectives. Our experimental results show that UniPose has strong fine-grained localization and generalization abilities across image styles, categories, and poses. Based on UniPose as a generalist keypoint detector, we hope it could serve fine-grained visual perception, understanding, and generation. This paper introduces UniPose, a unified framework for detecting keypoints of any object (articulated, rigid, or soft) using visual or textual prompts. Existing methods are limited to specific object categories or struggle with unseen objects and keypoints. UniPose aims to overcome these limitations and enable fine-grained vision understanding and manipulation. UniPose utilizes a coarse-to-fine strategy: 1) it encodes visual and textual prompts, 2) decodes instance information (bounding boxes), and 3) decodes keypoint locations. It's trained on a unified dataset (UniKPT) combining 13 keypoint datasets across various object categories. UniPose achieves state-of-the-art results on unseen object and keypoint detection, surpassing previous methods by a large margin. It outperforms expert keypoint detection models across 12 diverse datasets, demonstrating strong generalization ability. UniPose exhibits impressive text-to-image similarity, exceeding CLIP's performance in distinguishing object categories and image styles. The performance on objects with novel topologies not included in the training data needs improvement. The model can struggle with heavily occluded keypoints or objects with indistinct visual features. keypoint detection, prompt-based learning, multi-modality learning, category-agnostic pose estimation, open-vocabulary vision
2310.08529 Report GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models Taoran Yi, Jiemin Fang, Junjie Wang, Guanjun Wu, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Qi Tian, Xinggang Wang In recent times, the generation of 3D assets from text prompts has shown impressive results. Both 2D and 3D diffusion models can help generate decent 3D objects based on prompts. 3D diffusion models have good 3D consistency, but their quality and generalization are limited as trainable 3D data is expensive and hard to obtain. 2D diffusion models enjoy strong abilities of generalization and fine generation, but 3D consistency is hard to guarantee. This paper attempts to bridge the power from the two types of diffusion models via the recent explicit and efficient 3D Gaussian splatting representation. A fast 3D object generation framework, named as GaussianDreamer, is proposed, where the 3D diffusion model provides priors for initialization and the 2D diffusion model enriches the geometry and appearance. Operations of noisy point growing and color perturbation are introduced to enhance the initialized Gaussians. Our GaussianDreamer can generate a high-quality 3D instance or 3D avatar within 15 minutes on one GPU, much faster than previous methods, while the generated instances can be directly rendered in real time. Demos and code are available at https://taoranyi.com/gaussiandreamer/. GaussianDreamer, a fast text-to-3D generation method that bridges 3D and 2D diffusion models via Gaussian Splatting, achieving both 3D consistency and rich detail. Existing methods either lack 3D consistency (2D diffusion models) or struggle with complex prompts and fine details due to limited 3D data (3D diffusion models). 1. Initialize 3D Gaussians from coarse 3D models generated by text-to-3D or text-to-motion diffusion models. 2. Enhance initialization with noisy point growing and color perturbation. 3. Optimize Gaussians using the Score Distillation Sampling loss with a 2D diffusion model. 4. Render in real time using Gaussian Splatting. Generates high-quality 3D objects and avatars with 3D consistency and fine details. Significantly faster than previous methods (15 minutes on a single GPU). Achieves real-time rendering without mesh conversion. Generated objects may have unsharp edges and unnecessary Gaussians. Limited effectiveness in generating large-scale scenes. text-to-3d, 3d generation, diffusion models, gaussian splatting, real-time rendering
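The "noisy point growing and color perturbation" initialization can be pictured with a few lines, assuming the coarse 3D prior has already been converted to a colored point cloud; the counts and noise scales below are placeholders.

```python
import torch


def grow_and_perturb(points: torch.Tensor, colors: torch.Tensor,
                     num_new: int = 10000, pos_std: float = 0.01, col_std: float = 0.05):
    """points: (P, 3), colors: (P, 3) in [0, 1] -- the coarse initialization."""
    idx = torch.randint(0, points.shape[0], (num_new,))
    # Noisy point growing: jitter copies of existing points to densify the Gaussians.
    new_pts = points[idx] + pos_std * torch.randn(num_new, 3)
    # Color perturbation: slightly vary the grown points' colors.
    new_cols = (colors[idx] + col_std * torch.randn(num_new, 3)).clamp(0.0, 1.0)
    return torch.cat([points, new_pts]), torch.cat([colors, new_cols])
```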
2310.08528 Report 4D Gaussian Splatting for Real-Time Dynamic Scene Rendering Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, Xinggang Wang Representing and rendering dynamic scenes has been an important but challenging task. Especially, to accurately model complex motions, high efficiency is usually hard to guarantee. To achieve real-time dynamic scene rendering while also enjoying high training and storage efficiency, we propose 4D Gaussian Splatting (4D-GS) as a holistic representation for dynamic scenes rather than applying 3D-GS for each individual frame. In 4D-GS, a novel explicit representation containing both 3D Gaussians and 4D neural voxels is proposed. A decomposed neural voxel encoding algorithm inspired by HexPlane is proposed to efficiently build Gaussian features from 4D neural voxels and then a lightweight MLP is applied to predict Gaussian deformations at novel timestamps. Our 4D-GS method achieves real-time rendering under high resolutions, 82 FPS at an 800x800 resolution on an RTX 3090 GPU while maintaining comparable or better quality than previous state-of-the-art methods. More demos and code are available at https://guanjunwu.github.io/4dgs/. Proposes 4D Gaussian Splatting (4D-GS), a novel approach for real-time dynamic scene rendering using an efficient Gaussian deformation field to model Gaussian motions and shape changes over time. Real-time rendering of dynamic scenes is crucial for applications like VR and AR, but accurately modeling complex motions with high efficiency is challenging. Existing methods struggle with either rendering speed or storage efficiency, especially for long input sequences. Represents scenes as 3D Gaussians and models their motion and deformation over time using a Gaussian deformation field network. This network consists of a spatial-temporal structure encoder (multi-resolution HexPlane inspired by K-Planes) to encode features of adjacent Gaussians and a lightweight multi-head decoder to predict Gaussian deformations at novel timestamps. Rendering is achieved through efficient differentiable splatting. Achieves real-time rendering on dynamic scenes, up to 82 FPS at 800x800 resolution on synthetic datasets and 30 FPS at 1352x1014 resolution on real datasets. Maintains comparable or superior rendering quality compared to state-of-the-art methods while ensuring low storage consumption and fast convergence. Demonstrates potential for 4D object tracking and editing due to its explicit representation of dynamic scenes. Modeling large motions and dramatic scene changes, especially in monocular settings, remains a challenge. Handling urban-scale reconstructions with a massive number of 3D Gaussians requires a more compact algorithm. dynamic scene rendering, 4d gaussian splatting, real-time rendering, deformation fields, neural rendering
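A heavily simplified stand-in for the Gaussian deformation field: a frequency-encoded MLP here replaces the multi-resolution HexPlane encoder, and a lightweight multi-head decoder predicts per-Gaussian position, rotation, and scale offsets at time t. It illustrates the interface only, not the paper's encoder or capacity.

```python
import math
import torch
import torch.nn as nn


class GaussianDeformationField(nn.Module):
    """Predicts per-Gaussian position/rotation/scale deltas at time t (simplified sketch)."""

    def __init__(self, hidden: int = 128, freqs: int = 6):
        super().__init__()
        self.freqs = freqs
        in_dim = 4 * 2 * freqs                         # sin/cos encoding of (x, y, z, t)
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden), nn.ReLU())
        # Lightweight multi-head decoder.
        self.d_xyz = nn.Linear(hidden, 3)
        self.d_rot = nn.Linear(hidden, 4)              # quaternion delta
        self.d_scale = nn.Linear(hidden, 3)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        bands = (2.0 ** torch.arange(self.freqs, device=x.device)) * math.pi
        ang = x[..., None] * bands                     # (N, 4, freqs)
        return torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(1)

    def forward(self, centers: torch.Tensor, t: float):
        # centers: (N, 3) canonical Gaussian centers; t: normalized timestamp in [0, 1].
        x = torch.cat([centers, torch.full_like(centers[:, :1], t)], dim=1)
        h = self.backbone(self.encode(x))
        return self.d_xyz(h), self.d_rot(h), self.d_scale(h)
```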
2310.08465 Report MotionDirector: Motion Customization of Text-to-Video Diffusion Models Rui Zhao, Yuchao Gu, Jay Zhangjie Wu, David Junhao Zhang, Jiawei Liu, Weijia Wu, Jussi Keppo, Mike Zheng Shou Large-scale pre-trained diffusion models have exhibited remarkable capabilities in diverse video generations. Given a set of video clips of the same motion concept, the task of Motion Customization is to adapt existing text-to-video diffusion models to generate videos with this motion. For example, generating a video with a car moving in a prescribed manner under specific camera movements to make a movie, or a video illustrating how a bear would lift weights to inspire creators. Adaptation methods have been developed for customizing appearance like subject or style, yet unexplored for motion. It is straightforward to extend mainstream adaption methods for motion customization, including full model tuning, parameter-efficient tuning of additional layers, and Low-Rank Adaptions (LoRAs). However, the motion concept learned by these methods is often coupled with the limited appearances in the training videos, making it difficult to generalize the customized motion to other appearances. To overcome this challenge, we propose MotionDirector, with a dual-path LoRAs architecture to decouple the learning of appearance and motion. Further, we design a novel appearance-debiased temporal loss to mitigate the influence of appearance on the temporal training objective. Experimental results show the proposed method can generate videos of diverse appearances for the customized motions. Our method also supports various downstream applications, such as the mixing of different videos with their appearance and motion respectively, and animating a single image with customized motions. Our code and model weights will be released. This paper introduces Motion Customization, a method for adapting text-to-video diffusion models to generate videos with user-specified motion concepts, learned from one or multiple reference videos, while preserving appearance diversity. While existing text-to-video models allow for appearance customization, controlling the motion generated in videos remains an open challenge. This ability is crucial for users who desire specific motion styles in their generated videos. The proposed MotionDirector uses a dual-path Low-Rank Adaptation (LoRA) architecture. The spatial path learns appearance from single frames, while the temporal path learns motion from multiple frames, decoupling the two. An appearance-debiased temporal loss further improves motion learning by mitigating the influence of appearance. MotionDirector successfully generates diverse videos with customized motions, outperforming baselines and existing controllable generation methods on two benchmarks. Human evaluations demonstrate a strong preference for MotionDirector in terms of motion fidelity and appearance diversity. The method is efficient, requiring minimal additional parameters and training time compared to full model fine-tuning. Learning complex motions involving multiple subjects remains challenging. Future work could explore decoupling motions of multiple subjects for more intricate scenarios. text-to-video generation, motion customization, diffusion models, low-rank adaptation (lora), controllable video generation
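The dual-path design trains low-rank adapters on top of frozen projections; the generic LoRA wrapper below illustrates the building block (spatial LoRAs would be trained on single frames, temporal LoRAs on frame sequences). This is a standard LoRA sketch, not the authors' code.

```python
# Generic LoRA wrapper (assumption: MotionDirector injects such adapters into
# the video diffusion model's attention layers, in separate spatial/temporal paths).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=4, alpha=1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # pretrained weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)              # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

# usage: wrap an attention projection; only the low-rank factors are trained
proj_lora = LoRALinear(nn.Linear(320, 320), rank=8)
out = proj_lora(torch.randn(2, 77, 320))
```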
2310.08442 Report Debias the Training of Diffusion Models Hu Yu, Li Shen, Jie Huang, Man Zhou, Hongsheng Li, Feng Zhao Diffusion models have demonstrated compelling generation quality by optimizing the variational lower bound through a simple denoising score matching loss. In this paper, we provide theoretical evidence that the prevailing practice of using a constant loss weight strategy in diffusion models leads to biased estimation during the training phase. Simply optimizing the denoising network to predict Gaussian noise with constant weighting may hinder precise estimations of original images. To address the issue, we propose an elegant and effective weighting strategy grounded in the theoretically unbiased principle. Moreover, we conduct a comprehensive and systematic exploration to dissect the inherent bias problem deriving from constant weighting loss from the perspectives of its existence, impact and reasons. These analyses are expected to advance our understanding and demystify the inner workings of diffusion models. Through empirical evaluation, we demonstrate that our proposed debiased estimation method significantly enhances sample quality without the reliance on complex techniques, and exhibits improved efficiency compared to the baseline method both in training and sampling processes. This paper provides theoretical and empirical evidence that the common practice of using constant loss weights in diffusion models leads to biased estimation during training, hindering image quality. It proposes a debiased loss weight strategy to address this issue. Understanding and addressing the bias in diffusion model training is crucial as it directly impacts the quality of generated samples and the model's performance. The paper theoretically analyzes the impact of constant weighting on the estimation of the original image from noisy samples. It then proposes a debiased weighting strategy that assigns higher weights to later timesteps, improving the estimation of the original image. The effectiveness of this approach is validated through experiments on multiple datasets. The proposed debiased estimation method significantly improves sample quality compared to constant weighting. The method exhibits improved efficiency in both training and sampling, achieving superior performance with fewer training iterations and sampling steps. The analysis provides insights into the existence, impact, and underlying causes of biased estimation in diffusion models. The paper focuses on the standard Gaussian noise prediction objective and does not extensively explore other training targets. Future work could investigate the impact of noise schedules on the bias problem. diffusion models, generative models, debiasing, image generation, loss function
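A hedged sketch of the core idea: swap the constant loss weight for a timestep-dependent one so later, noisier steps are weighted more. The SNR-based weight used here is an illustrative choice and may differ from the paper's exact debiased weighting.

```python
# Per-timestep loss weighting instead of a constant weight (illustrative formula).
import torch
import torch.nn.functional as F

def weighted_denoising_loss(eps_pred, eps, alphas_cumprod, t, debias=True):
    if debias:
        a_t = alphas_cumprod[t]
        w = ((1.0 - a_t) / a_t).sqrt()              # up-weights later, noisier timesteps
    else:
        w = torch.ones_like(t, dtype=eps.dtype)     # the usual constant weighting
    per_sample = F.mse_loss(eps_pred, eps, reduction="none").flatten(1).mean(1)
    return (w * per_sample).mean()

# toy usage
alphas_cumprod = torch.linspace(0.999, 0.01, 1000)
eps = torch.randn(8, 4, 32, 32)
loss = weighted_denoising_loss(eps + 0.1 * torch.randn_like(eps), eps,
                               alphas_cumprod, torch.randint(0, 1000, (8,)))
```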
2310.08094 Report SingleInsert: Inserting New Concepts from a Single Image into Text-to-Image Models for Flexible Editing Zijie Wu, Chaohui Yu, Zhen Zhu, Fan Wang, Xiang Bai Recent progress in text-to-image (T2I) models enables high-quality image generation with flexible textual control. To utilize the abundant visual priors in the off-the-shelf T2I models, a series of methods try to invert an image to proper embedding that aligns with the semantic space of the T2I model. However, these image-to-text (I2T) inversion methods typically need multiple source images containing the same concept or struggle with the imbalance between editing flexibility and visual fidelity. In this work, we point out that the critical problem lies in the foreground-background entanglement when learning an intended concept, and propose a simple and effective baseline for single-image I2T inversion, named SingleInsert. SingleInsert adopts a two-stage scheme. In the first stage, we regulate the learned embedding to concentrate on the foreground area without being associated with the irrelevant background. In the second stage, we finetune the T2I model for better visual resemblance and devise a semantic loss to prevent the language drift problem. With the proposed techniques, SingleInsert excels in single concept generation with high visual fidelity while allowing flexible editing. Additionally, SingleInsert can perform single-image novel view synthesis and multiple concepts composition without requiring joint training. To facilitate evaluation, we design an editing prompt list and introduce a metric named Editing Success Rate (ESR) for quantitative assessment of editing flexibility. Our project page is: https://jarrentwu1031.github.io/SingleInsert-web/ This paper introduces SingleInsert, a single-image image-to-text inversion method for inserting novel concepts into pre-trained text-to-image models, enabling flexible editing. Existing methods struggle to balance editing flexibility and visual fidelity when learning concepts from single images due to foreground-background entanglement. SingleInsert employs a two-stage scheme: 1) **Inversion stage:** An image encoder learns to map a source image to a textual embedding, optimized using foreground and background losses to disentangle the concept from its background. 2) **Finetuning stage:** The text-to-image model is fine-tuned alongside the encoder, using the same losses and a semantic loss to prevent language drift and preserve class-specific priors. SingleInsert achieves high visual fidelity while enabling flexible editing of the learned concept, outperforming existing single-image and multi-image inversion methods. The proposed method effectively disentangles the learned concept from the background, allowing for novel view synthesis from a single image. SingleInsert enables composition of multiple independently learned concepts without joint training. SingleInsert may struggle with rare concepts due to limited prior knowledge in the base text-to-image model. Synthesized novel viewpoints can be less accurate when the input image presents an extreme perspective of the concept. image-to-text inversion, text-to-image generation, concept learning, novel view synthesis, concept composition
2310.08092 Report Consistent123: Improve Consistency for One Image to 3D Object Synthesis Haohan Weng, Tianyu Yang, Jianan Wang, Yu Li, Tong Zhang, C. L. Philip Chen, Lei Zhang Large image diffusion models enable novel view synthesis with high quality and excellent zero-shot capability. However, such models based on image-to-image translation have no guarantee of view consistency, limiting the performance for downstream tasks like 3D reconstruction and image-to-3D generation. To empower consistency, we propose Consistent123 to synthesize novel views simultaneously by incorporating additional cross-view attention layers and the shared self-attention mechanism. The proposed attention mechanism improves the interaction across all synthesized views, as well as the alignment between the condition view and novel views. In the sampling stage, such architecture supports simultaneously generating an arbitrary number of views while training at a fixed length. We also introduce a progressive classifier-free guidance strategy to achieve the trade-off between texture and geometry for synthesized object views. Qualitative and quantitative experiments show that Consistent123 outperforms baselines in view consistency by a large margin. Furthermore, we demonstrate a significant improvement of Consistent123 on varying downstream tasks, showing its great potential in the 3D generation field. The project page is available at consistent-123.github.io. Consistent123, a novel image-to-3D model that synthesizes consistent multiple views simultaneously, is proposed. Existing image-to-image translation based diffusion models for novel view synthesis lack view consistency, hindering their use in downstream tasks like 3D reconstruction. Consistent123 incorporates cross-view attention layers and a shared self-attention mechanism into a denoising U-Net to align synthesized views. It employs progressive classifier-free guidance for a trade-off between texture and geometry. Significantly improved view consistency over baselines on Objaverse, GSO, and RTMV datasets. Supports synthesizing arbitrary numbers of views while being trained at a fixed length. Demonstrates substantial improvement in downstream tasks like 3D reconstruction and image-to-3D generation. The model requires a relatively large number of views to achieve high consistency. Further exploration is needed to optimize the trade-off between view consistency and image quality. Future work includes exploring the impact of different attention mechanisms and sampling techniques. novel view synthesis, view consistency, 3d object synthesis, diffusion models, cross-view attention
2310.07771 Report DrivingDiffusion: Layout-Guided multi-view driving scene video generation with latent diffusion model Xiaofan Li, Yifu Zhang, Xiaoqing Ye With the increasing popularity of autonomous driving based on the powerful and unified bird's-eye-view (BEV) representation, a demand for high-quality and large-scale multi-view video data with accurate annotation is urgently required. However, such large-scale multi-view data is hard to obtain due to expensive collection and annotation costs. To alleviate the problem, we propose a spatial-temporal consistent diffusion framework DrivingDiffusion, to generate realistic multi-view videos controlled by 3D layout. There are three challenges when synthesizing multi-view videos given a 3D layout: How to keep 1) cross-view consistency and 2) cross-frame consistency? 3) How to guarantee the quality of the generated instances? Our DrivingDiffusion solves the problem by cascading the multi-view single-frame image generation step, the single-view video generation step shared by multiple cameras, and post-processing that can handle long video generation. In the multi-view model, the consistency of multi-view images is ensured by information exchange between adjacent cameras. In the temporal model, we mainly query the information that needs attention in subsequent frame generation from the multi-view images of the first frame. We also introduce the local prompt to effectively improve the quality of generated instances. In post-processing, we further enhance the cross-view consistency of subsequent frames and extend the video length by employing temporal sliding window algorithm. Without any extra cost, our model can generate large-scale realistic multi-camera driving videos in complex urban scenes, fueling the downstream driving tasks. The code will be made publicly available. DrivingDiffusion, a novel spatial-temporal consistent diffusion framework, generates realistic multi-view driving videos controlled by 3D layout. High-quality, annotated multi-view video data is crucial for autonomous driving but expensive to collect. DrivingDiffusion offers a solution by generating such data, supporting BEV perception model development. The method uses a cascaded approach with a multi-view image generation model, a single-view temporal model (shared across cameras), and post-processing for consistency and length. Key components include a 3D layout controller, cross-view/frame attention, consistency loss, and local prompt for instance quality. DrivingDiffusion achieves state-of-the-art video synthesis on nuScenes, outperforming existing methods in FID and FVD metrics. The generated data, when used for augmenting training data, demonstrably improves BEV perception tasks, evidenced by increased NDS and decreased mAOE. Ablation studies confirm the importance of the consistency module and local prompt for overall quality and instance-level performance, respectively. Future work involves exploring memory-efficient end-to-end multi-view video generation. Incorporating NeRF-based approaches is planned to further enhance spatial and temporal consistency. multi-view video generation, autonomous driving, layout-guided synthesis, latent diffusion model, data augmentation
2310.07726 Report Warfare:Breaking the Watermark Protection of AI-Generated Content Guanlin Li, Yifei Chen, Jie Zhang, Jiwei Li, Shangwei Guo, Tianwei Zhang AI-Generated Content (AIGC) is gaining great popularity, with many emerging commercial services and applications. These services leverage advanced generative models, such as latent diffusion models and large language models, to generate creative content (e.g., realistic images and fluent sentences) for users. The usage of such generated content needs to be highly regulated, as the service providers need to ensure the users do not violate the usage policies (e.g., abuse for commercialization, generating and distributing unsafe content). A promising solution to achieve this goal is watermarking, which adds unique and imperceptible watermarks on the content for service verification and attribution. Numerous watermarking approaches have been proposed recently. However, in this paper, we show that an adversary can easily break these watermarking mechanisms. Specifically, we consider two possible attacks. (1) Watermark removal: the adversary can easily erase the embedded watermark from the generated content and then use it freely bypassing the regulation of the service provider. (2) Watermark forging: the adversary can create illegal content with forged watermarks from another user, causing the service provider to make wrong attributions. We propose Warfare, a unified methodology to achieve both attacks in a holistic way. The key idea is to leverage a pre-trained diffusion model for content processing and a generative adversarial network for watermark removal or forging. We evaluate Warfare on different datasets and embedding setups. The results prove that it can achieve high success rates while maintaining the quality of the generated content. Compared to existing diffusion model-based attacks, Warfare is 5,050~11,000x faster. Introduces Warfare, a novel method to break the watermark protection of AI-generated content, enabling both watermark removal and forging. Highlights a critical vulnerability in current AI-generated content protection mechanisms, emphasizing the need for more robust watermarking techniques. Employs a two-stage approach, first training a generator on watermarked images and then using it to either remove or forge watermarks based on specific bit manipulation. Warfare achieves high bit accuracy in both watermark removal (up to 99.98%) and forging (up to 99.11%). Demonstrates effectiveness across different watermarking schemes, including those embedded in latent spaces of diffusion models. Shows potential for few-shot learning, achieving significant results with limited new data. Zero-shot performance, while promising, requires further improvement. Limited evaluation on real-world datasets beyond LSUN. watermark removal, watermark forging, ai-generated content, diffusion models, adversarial attacks
2310.07704 Report Ferret: Refer and Ground Anything Anywhere at Any Granularity Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, Yinfei Yang We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions. To unify referring and grounding in the LLM paradigm, Ferret employs a novel and powerful hybrid region representation that integrates discrete coordinates and continuous features jointly to represent a region in the image. To extract the continuous features of versatile regions, we propose a spatial-aware visual sampler, adept at handling varying sparsity across different shapes. Consequently, Ferret can accept diverse region inputs, such as points, bounding boxes, and free-form shapes. To bolster the desired capability of Ferret, we curate GRIT, a comprehensive refer-and-ground instruction tuning dataset including 1.1M samples that contain rich hierarchical spatial knowledge, with 95K hard negative data to promote model robustness. The resulting model not only achieves superior performance in classical referring and grounding tasks, but also greatly outperforms existing MLLMs in region-based and localization-demanded multimodal chatting. Our evaluations also reveal a significantly improved capability of describing image details and a remarkable alleviation in object hallucination. Code and data will be available at https://github.com/apple/ml-ferret Ferret is a multimodal large language model (MLLM) that combines discrete coordinates and continuous visual features for fine-grained spatial understanding within images, enabling both referring to and grounding of open-vocabulary descriptions. Existing methods struggle to unify referring and grounding in one framework, represent versatile region types beyond bounding boxes, and ensure open-vocabulary and robust performance. Ferret addresses these limitations by integrating referring/grounding in an MLLM and supporting diverse region inputs. Ferret utilizes a hybrid region representation that combines discrete coordinates with continuous visual features extracted by a novel spatial-aware visual sampler. It is trained on GRIT, a new dataset curated for refer-and-ground instruction tuning and enhanced robustness. Ferret outperforms previous MLLMs in conventional referring and grounding tasks, including referring object classification, phrase grounding, and grounded image captioning. Ferret demonstrates superior performance in region-based multimodal chatting, excelling in tasks like referring description, referring reasoning, and grounding in conversation. Ferret exhibits strong robustness against object hallucination, significantly outperforming other MLLMs on the POPE benchmark. While Ferret supports various region inputs, its evaluation primarily focuses on bounding boxes for benchmarking purposes. Future work includes enabling Ferret to output segmentation masks for even finer-grained region localization. multimodal large language models, referring expression comprehension, visual grounding, spatial understanding, object hallucination
2310.07702 Report ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, Ying Shan In this work, we investigate the capability of generating images from pre-trained diffusion models at much higher resolutions than the training image sizes. In addition, the generated images should have arbitrary image aspect ratios. When generating images directly at a higher resolution, 1024 x 1024, with the pre-trained Stable Diffusion using training images of resolution 512 x 512, we observe persistent problems of object repetition and unreasonable object structures. Existing works for higher-resolution generation, such as attention-based and joint-diffusion approaches, cannot well address these issues. As a new perspective, we examine the structural components of the U-Net in diffusion models and identify the crucial cause as the limited perception field of convolutional kernels. Based on this key observation, we propose a simple yet effective re-dilation that can dynamically adjust the convolutional perception field during inference. We further propose the dispersed convolution and noise-damped classifier-free guidance, which can enable ultra-high-resolution image generation (e.g., 4096 x 4096). Notably, our approach does not require any training or optimization. Extensive experiments demonstrate that our approach can address the repetition issue well and achieve state-of-the-art performance on higher-resolution image synthesis, especially in texture details. Our work also suggests that a pre-trained diffusion model trained on low-resolution images can be directly used for high-resolution visual generation without further tuning, which may provide insights for future research on ultra-high-resolution image and video synthesis. This paper proposes a novel method, named ScaleCrafter, for generating high-resolution images from pre-trained diffusion models without requiring any further training or optimization. Existing text-to-image diffusion models are limited in resolution, and generating images at higher resolutions often leads to object repetition and unreasonable object structures. This method addresses these limitations, paving the way for high-resolution image synthesis using pre-trained models. The authors analyze the structural components of diffusion models and identify the limited perception field of convolutional kernels as the primary cause for object repetition. They propose a dynamic "re-dilation" technique to adjust the convolutional perception field during inference, along with dispersed convolution and noise-damped classifier-free guidance for ultra-high-resolution generation. ScaleCrafter effectively addresses the object repetition issue in higher-resolution image synthesis. The method demonstrates state-of-the-art performance on various diffusion models, including Stable Diffusion versions and a text-to-video model, achieving superior results compared to direct inference and attention-based adaptation. ScaleCrafter generates images with superior texture details compared to a pre-trained super-resolution model, highlighting the potential of leveraging pre-trained models for high-resolution synthesis. The method focuses on adapting pre-trained models and does not explore the impact of training diffusion models directly on high-resolution images. Further investigation is needed to explore the trade-off between computational cost and generation quality when applying the method to even higher resolutions. diffusion models, high-resolution image synthesis, text-to-image generation, re-dilation, perception field
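The re-dilation idea can be illustrated by temporarily scaling a pretrained convolution's dilation (and matching padding) at inference time, leaving the weights untouched. This is a toy sketch of the mechanism, not the released implementation, which applies it selectively across U-Net layers and timesteps.

```python
# Toy re-dilation: enlarge a Conv2d's dilation and padding only while active.
import torch
import torch.nn as nn
from contextlib import contextmanager

@contextmanager
def redilate(conv: nn.Conv2d, factor: int):
    old_dilation, old_padding = conv.dilation, conv.padding
    conv.dilation = (old_dilation[0] * factor, old_dilation[1] * factor)
    conv.padding = (old_padding[0] * factor, old_padding[1] * factor)
    try:
        yield conv
    finally:
        conv.dilation, conv.padding = old_dilation, old_padding

conv = nn.Conv2d(4, 4, kernel_size=3, padding=1)   # pretrained weights in practice
x = torch.randn(1, 4, 128, 128)                    # e.g. the latent of a 1024x1024 image
with redilate(conv, factor=2):
    y = conv(x)                                    # same weights, 2x larger perception field
```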
2310.07697 Report ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation Bo Peng, Xinyuan Chen, Yaohui Wang, Chaochao Lu, Yu Qiao Recent works have successfully extended large-scale text-to-image models to the video domain, producing promising results but at a high computational cost and requiring a large amount of video data. In this work, we introduce ConditionVideo, a training-free approach to text-to-video generation based on the provided condition, video, and input text, by leveraging the power of off-the-shelf text-to-image generation methods (e.g., Stable Diffusion). ConditionVideo generates realistic dynamic videos from random noise or given scene videos. Our method explicitly disentangles the motion representation into condition-guided and scenery motion components. To this end, the ConditionVideo model is designed with a UNet branch and a control branch. To improve temporal coherence, we introduce sparse bi-directional spatial-temporal attention (sBiST-Attn). The 3D control network extends the conventional 2D controlnet model, aiming to strengthen conditional generation accuracy by additionally leveraging the bi-directional frames in the temporal domain. Our method exhibits superior performance in terms of frame consistency, clip score, and conditional accuracy, outperforming other compared methods. Presents ConditionVideo, a training-free method for generating realistic and temporally consistent videos from text descriptions, guided by optional reference videos and various input conditions (e.g., pose, depth, segmentation). Addresses limitations of existing text-to-video generation methods that are computationally expensive, require large training datasets, and struggle to generate dynamic backgrounds. Leverages pre-trained text-to-image diffusion models (Stable Diffusion, ControlNet) with a novel pipeline that disentangles motion representation into condition-guided and scenery components. Introduces sparse bi-directional spatial-temporal attention (sBiST-Attn) and a 3D control branch for improved temporal consistency and conditional accuracy. Generates videos with realistic dynamic backgrounds, unlike previous training-free methods. Achieves superior temporal consistency and condition alignment compared to existing methods, as demonstrated by quantitative metrics (frame consistency, CLIP score, pose accuracy). Demonstrates the effectiveness of the proposed sBiST-Attn and 3D control branch through ablation studies. Flickering observed in videos generated with sparse conditions (e.g., pose), potentially addressed by denser control inputs and additional temporal structures in future work. Exploration of hierarchical sampling for long video generation as future work. video generation, text-to-video, diffusion models, conditional generation, training-free
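A rough, single-head sketch of a sparse bi-directional spatio-temporal attention pattern in the spirit of sBiST-Attn: each frame attends to its own tokens plus tokens from sparsely sampled past and future frames. The stride, shapes, and scaling are illustrative assumptions.

```python
# Sparse bi-directional attention over video frame tokens (illustrative shapes).
import torch

def sparse_bidirectional_attention(q, kv, frame_idx, stride=3):
    """q, kv: (F, N, D) per-frame queries and key/value tokens."""
    num_frames, _, dim = q.shape
    refs = list(range(0, num_frames, stride))             # sparse past and future frames
    context = torch.cat([kv[i] for i in refs] + [kv[frame_idx]], dim=0)   # (M, D)
    attn = torch.softmax(q[frame_idx] @ context.t() / dim ** 0.5, dim=-1)
    return attn @ context                                  # (N, D) updated tokens

q = kv = torch.randn(16, 64, 320)                          # 16 frames, 64 tokens each
out = sparse_bidirectional_attention(q, kv, frame_idx=7)
```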
2310.07653 Report Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models Zeqiang Lai, Xizhou Zhu, Jifeng Dai, Yu Qiao, Wenhai Wang The revolution of artificial intelligence content generation has been rapidly accelerated with the booming text-to-image (T2I) diffusion models. Within just two years of development, it was unprecedentedly of high-quality, diversity, and creativity that the state-of-the-art models could generate. However, a prevalent limitation persists in the effective communication with these popular T2I models, such as Stable Diffusion, using natural language descriptions. This typically makes an engaging image hard to obtain without expertise in prompt engineering with complex word compositions, magic tags, and annotations. Inspired by the recently released DALLE3 - a T2I model directly built-in ChatGPT that talks human language, we revisit the existing T2I systems endeavoring to align human intent and introduce a new task - interactive text to image (iT2I), where people can interact with LLM for interleaved high-quality image generation/edit/refinement and question answering with stronger images and text correspondences using natural language. In addressing the iT2I problem, we present a simple approach that augments LLMs for iT2I with prompting techniques and off-the-shelf T2I models. We evaluate our approach for iT2I in a variety of common-used scenarios under different LLMs, e.g., ChatGPT, LLAMA, Baichuan, and InternLM. We demonstrate that our approach could be a convenient and low-cost way to introduce the iT2I ability for any existing LLMs and any text-to-image models without any training while bringing little degradation on LLMs' inherent capabilities in, e.g., question answering and code generation. We hope this work could draw broader attention and provide inspiration for boosting user experience in human-machine interactions alongside the image quality of the next-generation T2I systems. This paper introduces the concept of interactive text-to-image (iT2I), enabling multi-turn image generation/editing through natural language conversations with large language models (LLMs). Existing text-to-image models often require expertise in prompt engineering, making it difficult for general users to obtain desired images. iT2I aims to bridge this gap by providing a more user-friendly and interactive interface using natural language. The proposed approach, Mini-DALLE3, prompts LLMs to generate intermediate textual descriptions within special tags. These descriptions are then refined and used to generate images using pre-trained text-to-image models. The system also incorporates hierarchical content consistency control and leverages off-the-shelf T2I models for multi-turn generation. Prompting LLMs for iT2I does not significantly impact their inherent abilities like question answering and code generation. Commercial LLMs (ChatGPT, GPT-4, Claude) effectively generate images with corresponding textual responses, demonstrating successful augmentation for iT2I. Mini-DALLE3 shows promise in various iT2I use cases, including single/multi-turn image generation and interactive storytelling. Evaluation on open-source LLMs shows less satisfactory results, with some struggling to generate images. Future work could focus on improving performance with open-source LLMs and exploring more sophisticated prompt refinement techniques. text-to-image generation, interactive image generation, large language models, prompt engineering, human-computer interaction
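The prompting approach boils down to asking the LLM to wrap image descriptions in special tags and parsing them out for an off-the-shelf T2I model. The tag name below is a hypothetical placeholder, not necessarily the one used by Mini-DALLE3.

```python
# Parse hypothetical <image>...</image> spans from an LLM reply and forward
# each extracted description to a text-to-image backend.
import re

def extract_image_prompts(llm_response: str):
    return re.findall(r"<image>(.*?)</image>", llm_response, flags=re.DOTALL)

response = "Sure. <image>a watercolor fox in a snowy forest</image> Want a night version?"
for prompt in extract_image_prompts(response):
    print("send to text-to-image model:", prompt)
```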
2310.07419 Report Multi-Concept T2I-Zero: Tweaking Only The Text Embeddings and Nothing Else Hazarapet Tunanyan, Dejia Xu, Shant Navasardyan, Zhangyang Wang, Humphrey Shi Recent advances in text-to-image diffusion models have enabled the photorealistic generation of images from text prompts. Despite the great progress, existing models still struggle to generate compositional multi-concept images naturally, limiting their ability to visualize human imagination. While several recent works have attempted to address this issue, they either introduce additional training or adopt guidance at inference time. In this work, we consider a more ambitious goal: natural multi-concept generation using a pre-trained diffusion model, and with almost no extra cost. To achieve this goal, we identify the limitations in the text embeddings used for the pre-trained text-to-image diffusion models. Specifically, we observe concept dominance and non-localized contribution that severely degrade multi-concept generation performance. We further design a minimal low-cost solution that overcomes the above issues by tweaking (not re-training) the text embeddings for more realistic multi-concept text-to-image generation. Our Correction by Similarities method tweaks the embedding of concepts by collecting semantic features from most similar tokens to localize the contribution. To avoid mixing features of concepts, we also apply Cross-Token Non-Maximum Suppression, which excludes the overlap of contributions from different concepts. Experiments show that our approach outperforms previous methods in text-to-image, image manipulation, and personalization tasks, despite not introducing additional training or inference costs to the diffusion steps. This paper proposes a novel zero-shot method for multi-concept text-to-image generation using pre-trained diffusion models without additional training or inference-time optimization. The method tweaks the text embeddings to address concept dominance and non-localized contribution issues. Existing text-to-image diffusion models struggle to generate compositional multi-concept images due to limitations in text embeddings. Existing solutions require additional training or inference-time guidance, leading to high computational cost. This method offers a low-cost alternative by focusing on text embedding manipulation. The method consists of two techniques: 1) **Corrections-by-Similarities**, which aggregates semantic features from similar tokens to localize contributions, and 2) **Cross-Token Non-Maximum Suppression**, which minimizes overlap in contributions from different concepts to avoid feature mixing. Outperforms existing methods in multi-concept text-to-image generation despite not introducing additional training or inference cost. Enables realistic multi-concept image manipulation by improving contribution localization. Successfully extends single-concept personalization methods to multi-concept scenarios. The effectiveness of the method heavily relies on the quality of the pre-trained text encoder. Further improvement in concept disentanglement is needed for more complex multi-concept compositions. text-to-image generation, diffusion models, multi-concept generation, text embeddings, zero-shot learning
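A hedged sketch of a correction-by-similarities style embedding tweak: the concept token is replaced by a similarity-weighted mix of its most similar prompt tokens. The top-k size and temperature are illustrative, and the paper's cross-token non-maximum suppression is omitted.

```python
# Illustrative concept-embedding correction via token similarities.
import torch
import torch.nn.functional as F

def correct_by_similarity(token_embs, concept_idx, k=5, temperature=0.1):
    """token_embs: (L, D) prompt embeddings; concept_idx: the concept token to correct."""
    query = token_embs[concept_idx]
    sims = F.cosine_similarity(query.unsqueeze(0), token_embs)   # (L,)
    top = sims.topk(k)
    weights = torch.softmax(top.values / temperature, dim=0)
    corrected = (weights.unsqueeze(1) * token_embs[top.indices]).sum(0)
    out = token_embs.clone()
    out[concept_idx] = corrected
    return out

embs = correct_by_similarity(torch.randn(77, 768), concept_idx=5)  # CLIP-style prompt embedding
```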
2310.07222 Report Uni-paint: A Unified Framework for Multimodal Image Inpainting with Pretrained Diffusion Model Shiyuan Yang, Xiaodong Chen, Jing Liao Recently, text-to-image denoising diffusion probabilistic models (DDPMs) have demonstrated impressive image generation capabilities and have also been successfully applied to image inpainting. However, in practice, users often require more control over the inpainting process beyond textual guidance, especially when they want to composite objects with customized appearance, color, shape, and layout. Unfortunately, existing diffusion-based inpainting methods are limited to single-modal guidance and require task-specific training, hindering their cross-modal scalability. To address these limitations, we propose Uni-paint, a unified framework for multimodal inpainting that offers various modes of guidance, including unconditional, text-driven, stroke-driven, exemplar-driven inpainting, as well as a combination of these modes. Furthermore, our Uni-paint is based on pretrained Stable Diffusion and does not require task-specific training on specific datasets, enabling few-shot generalizability to customized images. We have conducted extensive qualitative and quantitative evaluations that show our approach achieves comparable results to existing single-modal methods while offering multimodal inpainting capabilities not available in other methods. Code will be available at https://github.com/ysy31415/unipaint. This paper presents Uni-paint, a unified framework for multimodal image inpainting based on a pretrained diffusion model, supporting unconditional, text-driven, stroke-driven, and exemplar-driven inpainting within a single framework. Existing diffusion-based inpainting methods are limited to single-modal guidance and often require task-specific training, hindering their cross-modal scalability and generalization. Uni-paint addresses these limitations by offering flexible and versatile inpainting capabilities. The authors finetune a pretrained Stable Diffusion model unconditionally on masked images, enabling context-aware inpainting. They leverage the textual interface (cross-attention) for semantic guidance (text and exemplar) and the spatial interface (image blending) for stroke guidance. A masked attention control mechanism is introduced to restrict inpainted content within the unknown region. Uni-paint achieves comparable results to existing single-modal methods in unconditional and text-driven inpainting while not requiring large-scale training. For exemplar-driven inpainting, Uni-paint shows superior performance in capturing customized object details compared to baselines. Uni-paint effectively performs stroke-driven inpainting and allows for flexible combinations of text, stroke, and exemplar guidance. Uni-paint may struggle to harmonize large gaps between exemplar and input images, leading to unnatural stitching. Conflicting multi-modal guidance (e.g., stroke and exemplar) can pose challenges in finding a balance between different modalities. image inpainting, diffusion models, multimodal guidance, stable diffusion, few-shot learning
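The spatial (blending) interface can be sketched as the usual masked-latent blend performed at every denoising step: the known region is re-imposed from a forward-diffused copy of the source latent. The schedule and shapes below are toy values, and the masked attention control is not shown.

```python
# Masked-latent blending at one denoising step (toy schedule and shapes).
import torch

def forward_diffuse(x0, t, alphas_cumprod):
    a_t = alphas_cumprod[t]
    return a_t.sqrt() * x0 + (1.0 - a_t).sqrt() * torch.randn_like(x0)

def blend_known_region(x_t, source_latent, mask, alphas_cumprod, t):
    """mask = 1 inside the hole to inpaint, 0 where the source image is kept."""
    source_t = forward_diffuse(source_latent, t, alphas_cumprod)
    return mask * x_t + (1.0 - mask) * source_t

alphas_cumprod = torch.linspace(0.999, 0.01, 1000)
x_t = torch.randn(1, 4, 64, 64)                    # current denoising state
source = torch.randn(1, 4, 64, 64)                 # encoded source image (or stroke layer)
mask = torch.zeros(1, 1, 64, 64)
mask[..., 16:48, 16:48] = 1.0
x_t = blend_known_region(x_t, source, mask, alphas_cumprod, torch.tensor(500))
```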
2310.06968 Report ObjectComposer: Consistent Generation of Multiple Objects Without Fine-tuning Alec Helbling, Evan Montoya, Duen Horng Chau Recent text-to-image generative models can generate high-fidelity images from text prompts. However, these models struggle to consistently generate the same objects in different contexts with the same appearance. Consistent object generation is important to many downstream tasks like generating comic book illustrations with consistent characters and setting. Numerous approaches attempt to solve this problem by extending the vocabulary of diffusion models through fine-tuning. However, even lightweight fine-tuning approaches can be prohibitively expensive to run at scale and in real-time. We introduce a method called ObjectComposer for generating compositions of multiple objects that resemble user-specified images. Our approach is training-free, leveraging the abilities of preexisting models. We build upon the recent BLIP-Diffusion model, which can generate images of single objects specified by reference images. ObjectComposer enables the consistent generation of compositions containing multiple specific objects simultaneously, all without modifying the weights of the underlying models. Introduces ObjectComposer, a training-free method for generating image compositions with multiple user-specified objects using pre-existing diffusion models. Existing text-to-image models struggle with consistent object generation across different contexts, limiting their use in applications like comic book illustration. Fine-tuning approaches, while effective, are computationally expensive. Leverages BLIP-Diffusion for object generation and a vanilla diffusion model for background composition. Employs cross-attention maps to guide object placement and blends diffusion processes of individual objects and the background. Generates images containing user-specified objects while adhering to text prompts. Maintains consistent object appearance across different backgrounds and compositions. Outperforms vanilla Stable Diffusion in preserving object fidelity to reference images. Object appearance can sometimes deviate from the reference image. Relies on accurate object localization through cross-attention maps, which might not always be perfect. image generation, object composition, diffusion models, blip-diffusion, cross-attention
2310.06904 Report Mitigating stereotypical biases in text to image generative systems Piero Esposito, Parmida Atighehchian, Anastasis Germanidis, Deepti Ghadiyaram State-of-the-art generative text-to-image models are known to exhibit social biases and over-represent certain groups like people of perceived lighter skin tones and men in their outcomes. In this work, we propose a method to mitigate such biases and ensure that the outcomes are fair across different groups of people. We do this by finetuning text-to-image models on synthetic data that varies in perceived skin tones and genders constructed from diverse text prompts. These text prompts are constructed from multiplicative combinations of ethnicities, genders, professions, age groups, and so on, resulting in diverse synthetic data. Our diversity finetuned (DFT) model improves the group fairness metric by 150% for perceived skin tone and 97.7% for perceived gender. Compared to baselines, DFT models generate more people with perceived darker skin tone and more women. To foster open research, we will release all text prompts and code to generate training images. This paper proposes a method for mitigating social biases in text-to-image models by fine-tuning them on synthetically generated data diverse in perceived skin tones, genders, age groups, and professions. State-of-the-art text-to-image models often exhibit social biases, over-representing certain demographics. This work aims to address these biases and promote fairness in generated content. The authors generate diverse text prompts, synthesize images from these prompts using an off-the-shelf model (SDXL), and fine-tune existing models (Stable Diffusion and Stable Diffusion XL) on this data. The diversity fine-tuned (DFT) models significantly improve group fairness metrics for perceived skin tone (up to 150% improvement) and gender (up to 97.7% improvement). The study finds that training on a balanced distribution of perceived skin tones leads to the most diverse outputs. Subjective evaluations indicate that fine-tuning on synthetic data does not negatively impact image quality and may even improve it in some cases. The study acknowledges limitations such as inheriting issues from the model used for synthetic data generation (SDXL) and occasional reduction in photorealism. Future work includes addressing other forms of bias and exploring the application of this technique to video generation models. social bias, text-to-image generation, fairness, synthetic data, diversity
2310.06836 Report What Does Stable Diffusion Know about the 3D Scene? Guanqi Zhan, Chuanxia Zheng, Weidi Xie, Andrew Zisserman Recent advances in generative models like Stable Diffusion enable the generation of highly photo-realistic images. Our objective in this paper is to probe the diffusion network to determine to what extent it 'understands' different properties of the 3D scene depicted in an image. To this end, we make the following contributions: (i) We introduce a protocol to evaluate whether features of an off-the-shelf diffusion model encode a number of physical 'properties' of the 3D scene, by training discriminative classifiers on the features for these properties. The probes are applied on datasets of real images with annotations for the property. (ii) We apply this protocol to properties covering scene geometry, scene material, support relations, lighting, and view dependent measures. (iii) We find that features from Stable Diffusion are good for discriminative learning of a number of properties, including scene geometry, support relations, shadows and depth, but less performant for occlusion and material. (iv) We also apply the probes to other networks trained at large-scale, including DINO, CLIP and VQGAN, and find that DINOv2 has a similar performance to Stable Diffusion, while outperforming DINOv1, CLIP and VQGAN. This paper presents a protocol to evaluate the extent to which diffusion models and other large-scale image networks understand 3D scene properties. Understanding what these networks learn about 3D scenes can provide insights into their workings, enable new applications using their features, help detect synthetic images, and guide further training for improved 3D modeling. The protocol involves extracting features from different layers and timesteps of the networks, training linear classifiers to predict specific 3D scene properties from these features, and evaluating their performance on real image datasets. Stable Diffusion and DINOv2 demonstrate a good understanding of scene geometry, support relations, lighting, and depth. They are less performant in predicting material and occlusion, indicating areas for potential improvement. Stable Diffusion and DINOv2 generally outperform other large-scale networks tested, including OpenCLIP, DINOv1, and VQGAN. The study primarily focuses on linear probing, which might not fully capture the networks' capabilities. Future work could explore more complex properties, non-symmetric question formulations, and combinations of features from different layers and timesteps. 3d physical scene understanding, stable diffusion, representation learning, generative models, linear probing
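The probing protocol amounts to freezing a feature extractor and fitting a linear classifier per scene property. The sketch below uses random stand-in features rather than actual Stable Diffusion or DINO activations.

```python
# Linear probe on frozen features (random tensors stand in for real activations).
import torch
import torch.nn as nn

features = torch.randn(512, 1280)        # frozen features, one row per annotated image
labels = torch.randint(0, 2, (512,))     # binary property labels (e.g. "A supports B")

probe = nn.Linear(1280, 2)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(100):
    opt.zero_grad()
    loss_fn(probe(features), labels).backward()
    opt.step()

accuracy = (probe(features).argmax(dim=1) == labels).float().mean()
```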
2310.06389 Report Learning Stackable and Skippable LEGO Bricks for Efficient, Reconfigurable, and Variable-Resolution Diffusion Modeling Huangjie Zheng, Zhendong Wang, Jianbo Yuan, Guanghan Ning, Pengcheng He, Quanzeng You, Hongxia Yang, Mingyuan Zhou Diffusion models excel at generating photo-realistic images but come with significant computational costs in both training and sampling. While various techniques address these computational challenges, a less-explored issue is designing an efficient and adaptable network backbone for iterative refinement. Current options like U-Net and Vision Transformer often rely on resource-intensive deep networks and lack the flexibility needed for generating images at variable resolutions or with a smaller network than used in training. This study introduces LEGO bricks, which seamlessly integrate Local-feature Enrichment and Global-content Orchestration. These bricks can be stacked to create a test-time reconfigurable diffusion backbone, allowing selective skipping of bricks to reduce sampling costs and generate higher-resolution images than the training data. LEGO bricks enrich local regions with an MLP and transform them using a Transformer block while maintaining a consistent full-resolution image across all bricks. Experimental results demonstrate that LEGO bricks enhance training efficiency, expedite convergence, and facilitate variable-resolution image generation while maintaining strong generative performance. Moreover, LEGO significantly reduces sampling time compared to other methods, establishing it as a valuable enhancement for diffusion models. This paper introduces "LEGO bricks", a novel network unit for diffusion models that integrates local feature enrichment and global content orchestration. Diffusion models excel at generating realistic images but suffer from high computational costs during training and sampling. This work addresses these limitations by designing a more efficient and flexible network backbone. LEGO bricks, built upon Transformer blocks and trained on image patches, are stacked to form a reconfigurable backbone. This allows selective skipping of bricks during sampling, reducing computational cost while enabling variable-resolution image generation. LEGO significantly reduces training time and FLOPs compared to U-Net and ViT-based diffusion models while maintaining competitive FID scores. The LEGO framework enables a 60% reduction in sampling time compared to DiT without sacrificing generation quality. LEGO can generate coherent images at resolutions much higher than the training data, demonstrated by generating panoramas from models trained on ImageNet. The paper primarily focuses on progressive growth and refinement for stacking LEGO bricks, leaving other strategies unexplored. The current work doesn't explore text-guided image generation with LEGO bricks. diffusion models, image generation, generative models, transformer, efficient training
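One brick pairs a local-feature MLP with a Transformer block, and bricks are stacked with some skippable at sampling time. The dimensions, residual wiring, and skipping rule below are illustrative guesses, not the paper's exact design.

```python
# Assumed "LEGO brick": local MLP enrichment followed by a global Transformer block.
import torch
import torch.nn as nn

class LegoBrick(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.local_mlp = nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))
        self.global_block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)

    def forward(self, tokens):                     # tokens: (B, N_patches, dim)
        return self.global_block(tokens + self.local_mlp(tokens))

bricks = nn.ModuleList(LegoBrick() for _ in range(4))
x = torch.randn(2, 196, 256)
for i, brick in enumerate(bricks):
    if i % 2 == 0:                                 # toy rule for skipping bricks at test time
        x = brick(x)
```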
2310.06347 Report JointNet: Extending Text-to-Image Diffusion for Dense Distribution Modeling Jingyang Zhang, Shiwei Li, Yuanxun Lu, Tian Fang, David McKinnon, Yanghai Tsin, Long Quan, Yao Yao We introduce JointNet, a novel neural network architecture for modeling the joint distribution of images and an additional dense modality (e.g., depth maps). JointNet is extended from a pre-trained text-to-image diffusion model, where a copy of the original network is created for the new dense modality branch and is densely connected with the RGB branch. The RGB branch is locked during network fine-tuning, which enables efficient learning of the new modality distribution while maintaining the strong generalization ability of the large-scale pre-trained diffusion model. We demonstrate the effectiveness of JointNet by using RGBD diffusion as an example and through extensive experiments, showcasing its applicability in a variety of applications, including joint RGBD generation, dense depth prediction, depth-conditioned image generation, and coherent tile-based 3D panorama generation. This paper presents JointNet, a novel neural network architecture for modeling the joint distribution of images and an additional dense modality (e.g., depth maps) by extending a pre-trained text-to-image diffusion model. Existing methods for joint distribution modeling either rely on limited labeled dense image pairs or struggle to retain the generalization ability of pre-trained models. JointNet addresses these limitations, offering high-quality joint generation without sacrificing performance in the original RGB domain. JointNet creates a copy of the pre-trained diffusion network for the dense label branch and connects it densely with the RGB branch. It leverages the 'output preserving principle' to ensure smooth adaptation to the new objective by fixing the original RGB branch during training and fine-tuning only the dense label branch and connections. JointNet preserves the RGB generation quality of the base model, achieving comparable FID, IS, and CLIP similarity scores. It demonstrates comparable performance in mono-view depth estimation tasks, with results comparable to MiDaS in terms of RMSE. JointNet enables coherent tile-based joint data generation, as evidenced by its low intra-LPIPS loss and efficient generation of high-quality RGBD panoramas. The inference time of JointNet is doubled due to maintaining two branches. Directly extending JointNet to support more modalities could lead to further increases in time consumption. diffusion models, joint distribution modeling, dense prediction, rgbd generation, panorama generation
2310.06313 Report Advancing Pose-Guided Image Synthesis with Progressive Conditional Diffusion Models Fei Shen, Hu Ye, Jun Zhang, Cong Wang, Xiao Han, Wei Yang Recent work has showcased the significant potential of diffusion models in pose-guided person image synthesis. However, owing to the inconsistency in pose between the source and target images, synthesizing an image with a distinct pose, relying exclusively on the source image and target pose information, remains a formidable challenge. This paper presents Progressive Conditional Diffusion Models (PCDMs) that incrementally bridge the gap between person images under the target and source poses through three stages. Specifically, in the first stage, we design a simple prior conditional diffusion model that predicts the global features of the target image by mining the global alignment relationship between pose coordinates and image appearance. Then, the second stage establishes a dense correspondence between the source and target images using the global features from the previous stage, and an inpainting conditional diffusion model is proposed to further align and enhance the contextual features, generating a coarse-grained person image. In the third stage, we propose a refining conditional diffusion model to utilize the coarsely generated image from the previous stage as a condition, achieving texture restoration and enhancing fine-detail consistency. The three-stage PCDMs work progressively to generate the final high-quality and high-fidelity synthesized image. Both qualitative and quantitative results demonstrate the consistency and photorealism of our proposed PCDMs under challenging scenarios.The code and model will be available at https://github.com/tencent-ailab/PCDMs. This paper proposes Progressive Conditional Diffusion Models (PCDMs), a novel three-stage pipeline for pose-guided person image synthesis that incrementally bridges the gap between source and target poses. Synthesizing realistic images with distinct poses from a source image remains a significant challenge due to pose inconsistencies. PCDMs address this by progressively predicting global features, establishing dense correspondences, and refining textures. PCDMs consist of: 1) a prior conditional diffusion model predicting global target features from pose and source image, 2) an inpainting diffusion model establishing dense correspondences for a coarse image, and 3) a refining diffusion model enhancing texture and detail consistency. PCDMs outperform state-of-the-art methods on DeepFashion and Market-1501 datasets in SSIM and LPIPS, demonstrating improved image quality and realism. User studies confirm that PCDMs generate more realistic and visually appealing person images compared to existing methods. PCDMs demonstrate strong applicability in downstream tasks, significantly improving person re-identification performance on Market-1501. The use of three diffusion models increases computational costs and inference time. Future work should explore more efficient methods to reduce computational overhead without sacrificing quality. image synthesis, diffusion models, pose-guided generation, person image generation, deep learning
2310.06311 Report Improving Compositional Text-to-image Generation with Large Vision-Language Models Song Wen, Guian Fang, Renrui Zhang, Peng Gao, Hao Dong, Dimitris Metaxas Recent advancements in text-to-image models, particularly diffusion models, have shown significant promise. However, compositional text-to-image models frequently encounter difficulties in generating high-quality images that accurately align with input texts describing multiple objects, variable attributes, and intricate spatial relationships. To address this limitation, we employ large vision-language models (LVLMs) for multi-dimensional assessment of the alignment between generated images and their corresponding input texts. Utilizing this assessment, we fine-tune the diffusion model to enhance its alignment capabilities. During the inference phase, an initial image is produced using the fine-tuned diffusion model. The LVLM is then employed to pinpoint areas of misalignment in the initial image, which are subsequently corrected using the image editing algorithm until no further misalignments are detected by the LVLM. The resultant image is consequently more closely aligned with the input text. Our experimental results validate that the proposed methodology significantly improves text-image alignment in compositional image generation, particularly with respect to object number, attribute binding, spatial relationships, and aesthetic quality. This paper introduces a novel framework leveraging Large Vision-Language Models (LVLMs) to enhance the quality of compositional image generation, particularly addressing the limitations of existing models in accurately aligning images with complex textual descriptions. Compositional text-to-image models often struggle to generate images that accurately reflect the input text, particularly concerning object number, attribute binding, spatial relationships, and aesthetic quality. This work aims to improve the alignment between generated images and complex textual descriptions. The proposed method comprises three core components: (1) LVLM-based Evaluation: LVLMs assess the alignment between generated images and input texts by analyzing answers to questions formulated from the text. (2) Model Fine-tuning: Diffusion models are fine-tuned using Reward Feedback Learning (ReFL) based on the LVLM-derived evaluation metrics. (3) LVLM-guided Editing: During inference, LVLMs identify misalignments, guiding image-editing algorithms to iteratively refine the generated image until it aligns with the input text. LVLMs effectively evaluate image-text alignment by analyzing the accuracy of answers to questions derived from the input text. Fine-tuning diffusion models with LVLM-based evaluation metrics significantly improves the alignment between generated images and input texts. The LVLM-guided editing process effectively corrects misalignments in generated images, resulting in images that are more faithful to the input text. The effectiveness of the method is limited by the performance of current LVLMs. Future work will explore the use of more advanced LVLMs and image editing algorithms. text-to-image generation, compositional image generation, large vision-language models (lvlms), reward feedback learning (refl), image editing
2310.06214 Report CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding Eslam Mohamed Bakr, Mohamed Ayman, Mahmoud Ahmed, Habib Slim, Mohamed Elhoseiny 3D visual grounding is the ability to localize objects in 3D scenes conditioned by utterances. Most existing methods devote the referring head to localize the referred object directly, causing failure in complex scenarios. In addition, they do not illustrate how and why the network reaches the final decision. In this paper, we address the question: can we design an interpretable 3D visual grounding framework that has the potential to mimic the human perception system? To this end, we formulate the 3D visual grounding problem as a sequence-to-sequence (Seq2Seq) task by first predicting a chain of anchors and then the final target. Interpretability not only improves the overall performance but also helps us identify failure cases. Following the chain-of-thoughts approach enables us to decompose the referring task into interpretable intermediate steps, boosting the performance and making our framework extremely data-efficient. Moreover, our proposed framework can be easily integrated into any existing architecture. We validate our approach through comprehensive experiments on the Nr3D, Sr3D, and Scanrefer benchmarks and show consistent performance gains compared to existing methods without requiring manually annotated data. Furthermore, our proposed framework, dubbed CoT3DRef, is significantly data-efficient: on the Sr3D dataset, when trained on only 10% of the data, we match the SOTA performance obtained by training on the entire dataset. The code is available at https://eslambakr.github.io/cot3dref.github.io/. This paper presents CoT3DRef, a novel 3D visual grounding framework that formulates the task as a sequence-to-sequence problem. By predicting a chain of anchor objects before localizing the final target, CoT3DRef aims to mimic human perception and improve interpretability. Existing 3D visual grounding methods fail to provide insights into their decision-making process and struggle in complex scenarios. CoT3DRef addresses these limitations by introducing interpretability and mimicking human-like reasoning. CoT3DRef employs a Chain-of-Thoughts decoder that leverages a Pathway module to predict the logical order of anchors extracted from the input utterance. A parallel referring head first localizes anchors and the target, which are then refined by the CoT decoder in a sequential manner. CoT3DRef achieves state-of-the-art results on Nr3D, Sr3D, and ScanRefer benchmarks without requiring manually annotated data. The framework demonstrates significant data efficiency, surpassing existing methods even when trained on only 10% of the data. Visualizing attention maps provides insights into the model's reasoning process, aiding in the identification of failure cases. The pseudo-label module, while effective, limits performance gains on the Nr3D dataset due to the inherent ambiguity in free-form language. The Pathway module does not currently handle scenarios with multiple valid logical paths. 3d visual grounding, chain-of-thoughts, interpretability, data efficiency, sequence-to-sequence
2310.05986 Report The Unreasonable Effectiveness of Linear Prediction as a Perceptual Metric Daniel Severo, Lucas Theis, Johannes Ballé We show how perceptual embeddings of the visual system can be constructed at inference-time with no training data or deep neural network features. Our perceptual embeddings are solutions to a weighted least squares (WLS) problem, defined at the pixel-level, and solved at inference-time, that can capture global and local image characteristics. The distance in embedding space is used to define a perceptual similarity metric which we call LASI: Linear Autoregressive Similarity Index. Experiments on full-reference image quality assessment datasets show LASI performs competitively with learned deep feature based methods like LPIPS (Zhang et al., 2018) and PIM (Bhardwaj et al., 2020), at a similar computational cost to hand-crafted methods such as MS-SSIM (Wang et al., 2003). We found that increasing the dimensionality of the embedding space consistently reduces the WLS loss while increasing performance on perceptual tasks, at the cost of increasing the computational complexity. LASI is fully differentiable, scales cubically with the number of embedding dimensions, and can be parallelized at the pixel-level. A Maximum Differentiation (MAD) competition (Wang & Simoncelli, 2008) between LASI and LPIPS shows that both methods are capable of finding failure points for the other, suggesting these metrics can be combined. This paper introduces LASI (Linear Autoregressive Similarity Index), a data-free perceptual similarity metric that constructs image embeddings at inference time without needing training data or deep neural networks. This is important because current perceptual similarity metrics often rely on expensive training data or complex deep learning models, hindering their applicability. LASI offers a lightweight and efficient alternative. LASI leverages a weighted least squares (WLS) approach inspired by lossless compression algorithms. It learns pixel-level representations by predicting neighboring pixel values, capturing global image semantics through this self-supervised process. LASI achieves competitive performance with learned methods (LPIPS, PIM) on the BAPPS dataset for both 2-AFC and JND tasks. Increasing LASI's embedding dimensionality improves both its predictive performance and its scores on perceptual tasks. MAD competition analysis reveals that LASI and LPIPS exhibit distinct failure modes, suggesting potential for combining these metrics. The generalization ability of LASI beyond BAPPS and its applicability to larger images requires further investigation. Future work could explore the usefulness of LASI embeddings in other computer vision tasks beyond perceptual similarity. perceptual similarity, image quality assessment, data-free methods, weighted least squares, self-supervision
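To make the weighted least squares idea concrete, here is a minimal NumPy sketch of a LASI-style embedding on a small grayscale image: every pixel gets its own spatially weighted least-squares solution to a shared causal linear-prediction problem, and the stacked solutions form the embedding. The context size, Gaussian spatial weighting, and ridge term below are illustrative assumptions, not the paper's exact formulation or vectorized solver.

```python
import numpy as np

def causal_features(img, context=12):
    """Raster-scan causal regression data: each pixel with a full context is
    paired with the values of the `context` pixels preceding it."""
    flat = img.astype(np.float64).ravel()
    X = np.stack([flat[i - context:i] for i in range(context, flat.size)])
    y = flat[context:]
    return X, y

def lasi_embedding(img, context=12, sigma=8.0, ridge=1e-6):
    """Per-pixel weighted least-squares solutions used as embeddings."""
    h, w = img.shape
    X, y = causal_features(img, context)
    idx = np.arange(context, h * w)
    coords = np.stack([idx // w, idx % w], axis=1).astype(np.float64)
    emb = np.zeros((idx.size, context))
    for k, center in enumerate(coords):
        d2 = ((coords - center) ** 2).sum(axis=1)
        wgt = np.exp(-d2 / (2.0 * sigma ** 2))            # spatial WLS weights
        Xw = X * wgt[:, None]
        emb[k] = np.linalg.solve(Xw.T @ X + ridge * np.eye(context), Xw.T @ y)
    return emb

def lasi_distance(img_a, img_b, **kw):
    """Perceptual distance = mean squared difference in embedding space."""
    return float(np.mean((lasi_embedding(img_a, **kw) - lasi_embedding(img_b, **kw)) ** 2))

rng = np.random.default_rng(0)
ref = rng.random((16, 16))
noisy = np.clip(ref + 0.1 * rng.standard_normal((16, 16)), 0.0, 1.0)
print(lasi_distance(ref, ref), lasi_distance(ref, noisy))  # 0.0 vs. a positive distance
```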
2310.05922 Report FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing Yuren Cong, Mengmeng Xu, Christian Simon, Shoufa Chen, Jiawei Ren, Yanping Xie, Juan-Manuel Perez-Rua, Bodo Rosenhahn, Tao Xiang, Sen He Text-to-video editing aims to edit the visual appearance of a source video conditional on textual prompts. A major challenge in this task is to ensure that all frames in the edited video are visually consistent. Most recent works apply advanced text-to-image diffusion models to this task by inflating 2D spatial attention in the U-Net into spatio-temporal attention. Although temporal context can be added through spatio-temporal attention, it may introduce some irrelevant information for each patch and therefore cause inconsistency in the edited video. In this paper, for the first time, we introduce optical flow into the attention module in the diffusion model's U-Net to address the inconsistency issue for text-to-video editing. Our method, FLATTEN, enforces the patches on the same flow path across different frames to attend to each other in the attention module, thus improving the visual consistency in the edited videos. Additionally, our method is training-free and can be seamlessly integrated into any diffusion-based text-to-video editing methods and improve their visual consistency. Experiment results on existing text-to-video editing benchmarks show that our proposed method achieves the new state-of-the-art performance. In particular, our method excels in maintaining the visual consistency in the edited videos. Introduces FLATTEN, a flow-guided attention mechanism for text-to-video editing, that improves visual consistency by leveraging optical flow to guide attention in diffusion models. Addresses the challenge of maintaining visual consistency in edited videos, a key limitation of existing text-to-video editing methods. Inflates a pre-trained text-to-image diffusion model, integrates FLATTEN into the U-Net, uses DDIM inversion for latent noise estimation, and employs DDIM sampling with feature injection for video generation. Achieves state-of-the-art performance on text-to-video editing benchmarks, demonstrating superior visual consistency. Improves visual consistency when integrated into other diffusion-based video editing methods. Outperforms competing methods in user studies evaluating semantic alignment, visual consistency, and motion preservation. Limited ability for dramatic structure editing due to reliance on optical flow from the source video. Runtime, while comparable to other methods, has room for optimization. text-to-video editing, visual consistency, diffusion models, optical flow, attention mechanism
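A minimal sketch of the masking idea: assuming patch trajectories have already been extracted from the estimated optical flow, attention is restricted so that a patch only attends to patches on the same flow path. The trajectory extraction itself and the integration with spatial attention inside the U-Net are omitted here.

```python
import torch

def flow_guided_attention_mask(trajectory_ids):
    """FLATTEN-style attention mask: attention between two patch tokens
    (possibly from different frames) is allowed only if both lie on the same
    optical-flow trajectory.

    trajectory_ids: (T, N) integer flow-path id of every patch in every frame,
                    obtained beforehand by tracking patches with the flow.
    Returns a (T*N, T*N) boolean mask, True where attention is permitted.
    """
    ids = trajectory_ids.reshape(-1)
    return ids[:, None] == ids[None, :]

# Toy example: 3 frames x 4 patches with a static scene, so patch j keeps the
# same trajectory id in every frame and attends to its 3 temporal copies.
ids = torch.arange(4).repeat(3, 1)
mask = flow_guided_attention_mask(ids)
print(mask.shape, int(mask.sum()))   # torch.Size([12, 12]) 36
```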
2310.05917 Report Drivable Avatar Clothing: Faithful Full-Body Telepresence with Dynamic Clothing Driven by Sparse RGB-D Input Donglai Xiang, Fabian Prada, Zhe Cao, Kaiwen Guo, Chenglei Wu, Jessica Hodgins, Timur Bagautdinov Clothing is an important part of human appearance but challenging to model in photorealistic avatars. In this work we present avatars with dynamically moving loose clothing that can be faithfully driven by sparse RGB-D inputs as well as body and face motion. We propose a Neural Iterative Closest Point (N-ICP) algorithm that can efficiently track the coarse garment shape given sparse depth input. Given the coarse tracking results, the input RGB-D images are then remapped to texel-aligned features, which are fed into the drivable avatar models to faithfully reconstruct appearance details. We evaluate our method against recent image-driven synthesis baselines, and conduct a comprehensive analysis of the N-ICP algorithm. We demonstrate that our method can generalize to a novel testing environment, while preserving the ability to produce high-fidelity and faithful clothing dynamics and appearance. This paper introduces a novel framework for creating photorealistic full-body avatars with dynamic clothing, driven by sparse RGB-D input, enabling faithful telepresence. Faithfully capturing clothing dynamics is crucial for realistic avatars and telepresence, addressing the limitations of pose-driven methods that struggle to represent the nuances of loose clothing. The framework employs a two-step process: (1) Neural Iterative Closest Point (N-ICP) algorithm for coarse clothing surface tracking, and (2) Texel-conditioned clothed avatars for high-fidelity geometry and appearance reconstruction from sparse RGB-D input and N-ICP tracking results. The N-ICP algorithm demonstrates faster convergence than classical optimization solvers. The full framework outperforms various baselines, including DVA, NeRF-based methods, and sensing-based techniques, in reconstructing dynamic clothing. The method generalizes to a novel environment with different backgrounds and illumination, preserving appearance style and capturing unseen motion. The current model is person- and garment-specific. The approach cannot handle drastic clothing deformations like topology changes. telepresence, photorealistic avatars, clothing capture, neural iterative closest point, texel-conditioned avatars
2310.05916 Report Interpreting CLIP's Image Representation via Text-Based Decomposition Yossi Gandelsman, Alexei A. Efros, Jacob Steinhardt We investigate the CLIP image encoder by analyzing how individual model components affect the final representation. We decompose the image representation as a sum across individual image patches, model layers, and attention heads, and use CLIP's text representation to interpret the summands. Interpreting the attention heads, we characterize each head's role by automatically finding text representations that span its output space, which reveals property-specific roles for many heads (e.g. location or shape). Next, interpreting the image patches, we uncover an emergent spatial localization within CLIP. Finally, we use this understanding to remove spurious features from CLIP and to create a strong zero-shot image segmenter. Our results indicate that a scalable understanding of transformer models is attainable and can be used to repair and improve models. This paper investigates and interprets the internal representations of the CLIP image encoder, particularly the ViT-based variant (CLIP-ViT), by decomposing the representation into contributions from individual model components (layers, attention heads, and image patches). Understanding the inner workings of CLIP is crucial because of its wide adoption and strong performance in various downstream tasks like image classification, segmentation, and generation. However, the complex representations learned by CLIP remain largely opaque. The authors leverage the residual and attention mechanisms of the ViT architecture to decompose the image representation. They first identify that the last few attention layers contribute most significantly to the representation. Then, they propose an algorithm called TextSpan, which uses a greedy approach to find text descriptions that explain the output space of individual attention heads, revealing specialized roles like shape, color, and location for different heads. Lastly, they decompose the representation by image tokens (patches) to visualize the contribution of image regions to specific text concepts. The last few attention layers in CLIP-ViT contribute most significantly to the final image representation. TextSpan reveals that individual attention heads specialize in capturing specific image properties like shape, color, counting, and location. Decomposing the representation by image tokens yields a state-of-the-art zero-shot semantic image segmenter. The study primarily focuses on direct effects of model components, leaving the analysis of indirect effects and information flow between layers for future work. Not all attention heads have a clear semantic role assigned by TextSpan, potentially due to limitations in the initial pool of text descriptions or the collaborative nature of certain heads. clip, vision transformer (vit), explainable ai (xai), image segmentation, zero-shot learning
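The TextSpan idea can be sketched as a greedy span-finding loop: score every candidate text direction by how much variance of the head's outputs it explains, keep the best one, project it out of both the data and the remaining candidates, and repeat. This is a simplified rendition; the exact scoring rule and candidate pool in the paper may differ.

```python
import numpy as np

def textspan(head_outputs, text_embeddings, descriptions, k=5):
    """Greedy selection of text directions spanning an attention head's output space.

    head_outputs:    (n_images, d) direct contributions of one head to the image representation.
    text_embeddings: (n_texts, d)  CLIP text embeddings of candidate descriptions.
    descriptions:    list of the corresponding candidate strings.
    """
    A = head_outputs - head_outputs.mean(axis=0, keepdims=True)
    T = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    chosen = []
    for _ in range(k):
        scores = ((A @ T.T) ** 2).sum(axis=0)        # variance explained per candidate
        best = int(np.argmax(scores))
        chosen.append(descriptions[best])
        d = T[best].copy()
        A -= np.outer(A @ d, d)                      # remove the explained component
        T -= np.outer(T @ d, d)
        T /= np.maximum(np.linalg.norm(T, axis=1, keepdims=True), 1e-8)
    return chosen

rng = np.random.default_rng(0)
heads = rng.standard_normal((200, 64))               # stand-in head outputs
texts = rng.standard_normal((50, 64))                # stand-in text embeddings
print(textspan(heads, texts, [f"caption {i}" for i in range(50)], k=3))
```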
2310.05873 Report Implicit Concept Removal of Diffusion Models Zhili Liu, Kai Chen, Yifan Zhang, Jianhua Han, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung, James Kwok Text-to-image (T2I) diffusion models often inadvertently generate unwanted concepts such as watermarks and unsafe images. These concepts, termed as the "implicit concepts", could be unintentionally learned during training and then be generated uncontrollably during inference. Existing removal methods still struggle to eliminate implicit concepts primarily due to their dependency on the model's ability to recognize concepts it actually can not discern. To address this, we utilize the intrinsic geometric characteristics of implicit concepts and present the Geom-Erasing, a novel concept removal method based on geometric-driven control. Specifically, once an unwanted implicit concept is identified, we integrate the existence and geometric information of the concept into text prompts with the help of an accessible classifier or detector model. Subsequently, the model is optimized to identify and disentangle this information, which is adopted as negative prompts for generation. Moreover, we introduce Implicit Concept Dataset (ICD), a novel image-text dataset imbued with three typical implicit concepts (i.e., QR codes, watermarks, and text), reflecting real-life situations where implicit concepts are easily injected. Geom-Erasing effectively mitigates the generation of implicit concepts, achieving state-of-the-art results on the Inappropriate Image Prompts (I2P) and our challenging Implicit Concept Dataset (ICD) benchmarks. This paper introduces the Implicit Concept Dataset (ICD) and proposes Geom-Erasing, a novel method for removing implicit concepts (e.g., watermarks, unsafe content) from text-to-image diffusion models. Implicit concepts are difficult to remove with existing methods because they are unintentionally learned during training and cannot be reliably controlled with text prompts. This hinders personalized model fine-tuning and can result in the generation of undesired or even harmful content. Geom-Erasing leverages the geometric properties of implicit concepts by using a classifier or detector to identify concept existence and location within an image. This information is then integrated into the text prompts, guiding the diffusion model to learn and disentangle the implicit concept. Geom-Erasing effectively removes implicit concepts like watermarks and unsafe content from pre-trained Stable Diffusion models. It outperforms existing erasure methods in personalized fine-tuning scenarios, achieving lower FID and Implicit Concept Ratio (ICR) on ICD datasets. The method demonstrates that incorporating geometric information significantly improves the model's ability to recognize and eliminate implicit concepts. Geom-Erasing currently relies on external detectors for geometric information, which could be replaced with more general localizers in future work. Further exploration is needed to understand the impact of adding geometric information as negative prompts, which currently improves concept removal but slightly degrades image quality (FID). implicit concept removal, text-to-image diffusion models, geometric guidance, personalized fine-tuning, stable diffusion
2310.05737 Report Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, Alexander G. Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A. Ross, Lu Jiang While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images using a common token vocabulary. Equipped with this new tokenizer, we show that LLMs outperform diffusion models on standard image and video generation benchmarks including ImageNet and Kinetics. In addition, we demonstrate that our tokenizer surpasses the previously top-performing video tokenizer on two more tasks: (1) video compression comparable to the next-generation video codec (VVC) according to human evaluations, and (2) learning effective representations for action recognition tasks. MAGVIT-v2 is a novel video tokenizer that leverages lookup-free quantization and architectural advancements to tokenize images and videos using a shared vocabulary. A good visual tokenizer is crucial for language models (LMs) to excel in image and video generation, bridging the gap between pixel-based representations and discrete token-based processing inherent to LLMs. The paper introduces (1) Lookup-free quantization (LFQ) that eliminates the need for embedding lookup in VQ-VAEs, enabling learning of larger vocabularies beneficial for LMs. (2) Architectural improvements to the tokenizer, including causal 3D CNNs for joint image-video tokenization and modifications for better temporal modeling. MAGVIT-v2 significantly outperforms previous state-of-the-art video tokenizers in visual generation tasks on ImageNet and Kinetics datasets. In human rater studies, MAGVIT-v2 achieves better video compression quality than MAGVIT and HEVC, and is on par with VVC. The tokens generated by MAGVIT-v2 prove to be effective representations for video understanding, leading to improved performance on action recognition benchmarks. While MAGVIT-v2 shows promising results, further research is needed to adapt it for efficient CPU execution, aligning it with standard video codecs. Exploring the full potential of text-to-image and text-to-video generation with MAGVIT-v2 is left as future work. video tokenization, language models, visual generation, video compression, action recognition
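A minimal sketch of lookup-free quantization as described in the abstract: each latent dimension is quantized to ±1 independently, so a d-dimensional latent indexes a 2^d-entry vocabulary without any codebook lookup. The straight-through estimator below is a standard assumption; the paper's entropy regularization and tokenizer architecture are omitted.

```python
import torch

def lookup_free_quantize(latent):
    """Quantize every latent channel to {-1, +1}; the token index is the
    integer encoded by the resulting sign bits (no embedding lookup).

    latent: (..., d) continuous encoder output.
    Returns (quantized latent with straight-through gradients, token indices).
    """
    hard = torch.where(latent > 0, torch.ones_like(latent), -torch.ones_like(latent))
    quantized = latent + (hard - latent).detach()       # straight-through estimator
    bits = (latent > 0).long()
    weights = 2 ** torch.arange(latent.shape[-1], device=latent.device)
    token_index = (bits * weights).sum(dim=-1)
    return quantized, token_index

z = torch.randn(2, 4, 4, 18, requires_grad=True)        # 18 dims -> 2**18 = 262,144 tokens
q, idx = lookup_free_quantize(z)
q.sum().backward()                                      # gradients reach the encoder input
print(q.shape, idx.shape, int(idx.max()) < 2 ** 18)
```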
2310.05718 Report EdVAE: Mitigating Codebook Collapse with Evidential Discrete Variational Autoencoders Gulcin Baykal, Melih Kandemir, Gozde Unal Codebook collapse is a common problem in training deep generative models with discrete representation spaces like Vector Quantized Variational Autoencoders (VQ-VAEs). We observe that the same problem arises for the alternatively designed discrete variational autoencoders (dVAEs) whose encoder directly learns a distribution over the codebook embeddings to represent the data. We hypothesize that using the softmax function to obtain a probability distribution causes the codebook collapse by assigning overconfident probabilities to the best matching codebook elements. In this paper, we propose a novel way to incorporate evidential deep learning (EDL) instead of softmax to combat the codebook collapse problem of dVAE. We evidentially monitor the significance of attaining the probability distribution over the codebook embeddings, in contrast to softmax usage. Our experiments using various datasets show that our model, called EdVAE, mitigates codebook collapse while improving the reconstruction performance, and enhances the codebook usage compared to dVAE and VQ-VAE based models. Our code can be found at https://github.com/ituvisionlab/EdVAE. This paper proposes EdVAE, an extension of dVAE using evidential deep learning (EDL) to address codebook collapse by incorporating uncertainty awareness in codebook embedding selection. Codebook collapse, the under-utilization of codebook embeddings, is a significant problem in discrete representation learning with dVAEs, limiting their expressiveness and performance. EdVAE replaces the softmax layer in dVAE’s encoder with an evidential mechanism. It models a distribution over Categorical distributions (representing codebook selections) using a Dirichlet distribution, learning to select embeddings based on data-driven evidence. EdVAE significantly improves codebook usage (measured by perplexity) compared to dVAE and achieves comparable or better performance than state-of-the-art VQ-VAE models. The paper provides evidence for a correlation between uncertainty values and perplexity, supporting the claim that uncertainty awareness improves codebook usage. EdVAE demonstrates robust performance across various codebook designs, showing less sensitivity to codebook size and dimensionality compared to other methods. The method is primarily evaluated on small to medium-sized datasets and may require further exploration for larger, more diverse datasets like ImageNet. The $\beta$ coefficient, balancing reconstruction and KL divergence terms, requires fine-tuning due to the complexity introduced by the evidential formulation. evidential deep learning, discrete variational autoencoders, codebook collapse, generative models, uncertainty quantification
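A minimal sketch of the evidential head that replaces the softmax in this setting: non-negative evidence parameterizes a Dirichlet over codebook-selection distributions, yielding both expected selection probabilities and an uncertainty score. This follows the standard evidential-deep-learning recipe; EdVAE's full training objective (reconstruction plus the KL term weighted by $\beta$) is not shown.

```python
import torch
import torch.nn.functional as F

def evidential_codebook_posterior(logits):
    """Evidential replacement for a softmax over K codebook embeddings.

    logits: (batch, K) encoder scores.
    Evidence e = softplus(logits) >= 0 parameterizes Dirichlet(alpha = e + 1);
    the expected categorical distribution is alpha / sum(alpha), and
    K / sum(alpha) serves as an uncertainty measure (1 = no evidence at all).
    """
    evidence = F.softplus(logits)
    alpha = evidence + 1.0
    strength = alpha.sum(dim=-1, keepdim=True)
    probs = alpha / strength
    uncertainty = logits.shape[-1] / strength.squeeze(-1)
    return probs, uncertainty

logits = torch.randn(8, 512)                  # 8 latent positions, 512-entry codebook
probs, u = evidential_codebook_posterior(logits)
print(probs.sum(dim=-1))                      # each row sums to 1
print(u.min().item(), u.max().item())         # lower when evidence is concentrated
```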
2310.05654 Report No Token Left Behind: Efficient Vision Transformer via Dynamic Token Idling Xuwei Xu, Changlin Li, Yudong Chen, Xiaojun Chang, Jiajun Liu, Sen Wang Vision Transformers (ViTs) have demonstrated outstanding performance in computer vision tasks, yet their high computational complexity prevents their deployment in computing resource-constrained environments. Various token pruning techniques have been introduced to alleviate the high computational burden of ViTs by dynamically dropping image tokens. However, some undesirable pruning at early stages may result in permanent loss of image information in subsequent layers, consequently hindering model performance. To address this problem, we propose IdleViT, a dynamic token-idle-based method that achieves an excellent trade-off between performance and efficiency. Specifically, in each layer, IdleViT selects a subset of the image tokens to participate in computations while keeping the rest of the tokens idle and directly passing them to this layer's output. By allowing the idle tokens to be re-selected in the following layers, IdleViT mitigates the negative impact of improper pruning in the early stages. Furthermore, inspired by the normalized graph cut, we devise a token cut loss on the attention map as regularization to improve IdleViT's token selection ability. Our method is simple yet effective and can be extended to pyramid ViTs since no token is completely dropped. Extensive experimental results on various ViT architectures have shown that IdleViT can diminish the complexity of pretrained ViTs by up to 33% with no more than 0.2% accuracy decrease on ImageNet, after finetuning for only 30 epochs. Notably, when the keep ratio is 0.5, IdleViT outperforms the state-of-the-art EViT on DeiT-S by 0.5% higher accuracy and even faster inference speed. The source code is available in the supplementary material. Proposes IdleViT, a token-idle-based efficient ViT framework that dynamically selects tokens for computation while keeping the rest idle, allowing re-selection in later layers and mitigating information loss from early pruning. Addresses the high computational complexity of ViTs, hindering their deployment in resource-constrained environments, by achieving a better balance between performance and efficiency. Preserves unselected tokens (idle) throughout the layer, allowing re-selection. Introduces a token cut loss based on normalized graph cut theory to enhance semantic consistency in token selection. Fine-tunes pretrained ViTs with knowledge distillation and token cut loss. Reduces DeiT-S complexity by 33% with only a 0.2% accuracy drop on ImageNet. Outperforms state-of-the-art EViT on DeiT-S with 0.5% higher accuracy and faster inference speed at a 0.5 keep ratio. Improves accuracy on a pyramid ViT (Swin-Ti) compared to vanilla DynamicViT at various keep ratios. Only evaluates token selection based on class attention; other methods could be explored. Limited evaluation on pyramid ViTs; more extensive experiments are needed. vision transformer, efficient deep learning, token pruning, token idling, normalized graph cut
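A minimal sketch of the token-idling step: the tokens ranked highest by the class token's attention are processed by the layer, while the rest are copied to the output unchanged so later layers can re-select them. The block interface and keep ratio below are illustrative; the paper's full layer also involves the class token, residual connections, and the token cut loss.

```python
import torch

def idle_layer(tokens, cls_attention, block, keep_ratio=0.5):
    """Process only the selected tokens; idle tokens pass through untouched.

    tokens:        (B, N, D) patch tokens.
    cls_attention: (B, N) attention from the class token to each patch.
    block:         any module mapping (B, k, D) -> (B, k, D).
    """
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))
    keep_idx = cls_attention.topk(k, dim=1).indices               # (B, k)
    gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, D)
    selected = torch.gather(tokens, 1, gather_idx)
    updated = block(selected)                                     # compute only on selected tokens
    out = tokens.clone()                                          # idle tokens are preserved ...
    out.scatter_(1, gather_idx, updated)                          # ... and selected ones replaced
    return out

block = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU(), torch.nn.Linear(64, 64))
x, attn = torch.randn(2, 16, 64), torch.rand(2, 16)
print(idle_layer(x, attn, block).shape)                           # (2, 16, 64): no token is dropped
```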
2310.05590 Report Perceptual Artifacts Localization for Image Synthesis Tasks Lingzhi Zhang, Zhengjie Xu, Connelly Barnes, Yuqian Zhou, Qing Liu, He Zhang, Sohrab Amirghodsi, Zhe Lin, Eli Shechtman, Jianbo Shi Recent advancements in deep generative models have facilitated the creation of photo-realistic images across various tasks. However, these generated images often exhibit perceptual artifacts in specific regions, necessitating manual correction. In this study, we present a comprehensive empirical examination of Perceptual Artifacts Localization (PAL) spanning diverse image synthesis endeavors. We introduce a novel dataset comprising 10,168 generated images, each annotated with per-pixel perceptual artifact labels across ten synthesis tasks. A segmentation model, trained on our proposed dataset, effectively localizes artifacts across a range of tasks. Additionally, we illustrate its proficiency in adapting to previously unseen models using minimal training samples. We further propose an innovative zoom-in inpainting pipeline that seamlessly rectifies perceptual artifacts in the generated images. Through our experimental analyses, we elucidate several practical downstream applications, such as automated artifact rectification, non-referential image quality evaluation, and abnormal region detection in images. The dataset and code are released. This paper presents a novel dataset and a deep learning model for localizing perceptual artifacts in images generated by various AI image synthesis models. Current generative models often produce images with noticeable artifacts, requiring manual correction. This work aims to automate this process and improve the quality of generated images. The authors created a dataset of 10,168 generated images annotated with per-pixel artifact labels across ten synthesis tasks. They trained a segmentation model (using Swin-T backbone and UperNet head) on this dataset to localize artifacts. The trained model effectively locates artifacts across a range of synthesis tasks and generalizes well to unseen models with minimal fine-tuning. The authors propose a 'zoom-in' inpainting pipeline, which significantly improves the refinement of artifact regions, especially for detailed objects like faces and hands. The study demonstrates the effectiveness of using the artifact localization model for downstream tasks such as automatic artifact correction, no-reference image quality assessment, and anomaly detection in real images. The study primarily focuses on inpainting as a method for artifact correction, leaving room for exploring other task-specific refinement modules. The dataset is labeled based on a specific criterion and may not encompass the full spectrum of individual preferences concerning perceptual artifacts. image synthesis, perceptual artifacts, artifact localization, deep learning, image quality assessment
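The zoom-in inpainting pipeline can be sketched as: crop a padded box around the predicted artifact mask, upsample the crop so the inpainter works at a higher effective resolution, inpaint, then downsample and paste the result back. The `inpaint_fn` below is a placeholder for whichever inpainting model is used, and the margin and zoom values are illustrative assumptions.

```python
import numpy as np

def zoom_in_inpaint(image, artifact_mask, inpaint_fn, margin=32, zoom=2):
    """image: (H, W, 3) float array; artifact_mask: (H, W) bool array from the
    artifact-localization model; inpaint_fn: placeholder inpainter taking
    (crop, mask) and returning a filled crop of the same shape."""
    ys, xs = np.where(artifact_mask)
    if ys.size == 0:
        return image                                    # nothing to fix
    y0, y1 = max(int(ys.min()) - margin, 0), min(int(ys.max()) + margin, image.shape[0])
    x0, x1 = max(int(xs.min()) - margin, 0), min(int(xs.max()) + margin, image.shape[1])
    crop, mask = image[y0:y1, x0:x1], artifact_mask[y0:y1, x0:x1]
    big_crop = crop.repeat(zoom, axis=0).repeat(zoom, axis=1)     # nearest-neighbor zoom-in
    big_mask = mask.repeat(zoom, axis=0).repeat(zoom, axis=1)
    filled = inpaint_fn(big_crop, big_mask)
    out = image.copy()
    out[y0:y1, x0:x1] = filled[::zoom, ::zoom]                    # zoom back out and paste
    return out

# Toy usage with a trivial "inpainter" that fills masked pixels with the crop mean.
mean_fill = lambda crop, mask: np.where(mask[..., None], crop.mean((0, 1)), crop)
img = np.random.rand(128, 128, 3)
msk = np.zeros((128, 128), dtype=bool); msk[40:60, 50:70] = True
print(zoom_in_inpaint(img, msk, mean_fill).shape)
```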
2310.05375 Report IPDreamer: Appearance-Controllable 3D Object Generation with Complex Image Prompts Bohan Zeng, Shanglin Li, Yutang Feng, Ling Yang, Hong Li, Sicheng Gao, Jiaming Liu, Conghui He, Wentao Zhang, Jianzhuang Liu, Baochang Zhang, Shuicheng Yan Recent advances in 3D generation have been remarkable, with methods such as DreamFusion leveraging large-scale text-to-image diffusion-based models to supervise 3D object generation. These methods enable the synthesis of detailed and photorealistic textured objects. However, the appearance of 3D objects produced by these text-to-3D models is unpredictable, and it is hard for the single-image-to-3D methods to deal with complex images, thus posing a challenge in generating appearance-controllable 3D objects. To achieve controllable complex 3D object synthesis, we propose IPDreamer, a novel approach that incorporates image prompt adaption to extract detailed and comprehensive appearance features from complex images, which are then utilized for 3D object generation. Our results demonstrate that IPDreamer effectively generates high-quality 3D objects that are consistent with both the provided text and the appearance of complex image prompts, demonstrating its promising capability in appearance-controllable 3D object generation. Our code is available at https://github.com/zengbohan0217/IPDreamer. IPDreamer: a novel 3D object generation framework enabling controllable, high-quality 3D object creation from complex image prompts. Existing text-to-3D models lack appearance control, and single-image-to-3D methods struggle with complex images. Two-stage approach: 1) Train a coarse NeRF model from text/image. 2) Extract 3D mesh and optimize geometry and texture using Image Prompt Score Distillation (IPSD), leveraging image prompt features and a geometry prompt difference. Local Editing with Partial Images (LEPI) handles large appearance discrepancies. Effectively transfers complex image styles to 3D objects, enabling high-quality texture editing. Generates more controllable and realistic 3D objects from text prompts compared to SOTA methods. Outperforms existing methods in quantitative metrics (FID, CLIP score) and user study. Color inconsistency between generated 3D object and image prompt can occur. Further improvements in processing speed are desired. 3d object generation, image prompt adaption, score distillation sampling, texture editing, neural radiance fields
2310.05056 Report Open-Vocabulary Animal Keypoint Detection with Semantic-feature Matching Hao Zhang, Lumin Xu, Shenqi Lai, Wenqi Shao, Nanning Zheng, Ping Luo, Yu Qiao, Kaipeng Zhang Current image-based keypoint detection methods for animal (including human) bodies and faces are generally divided into fully-supervised and few-shot class-agnostic approaches. The former typically relies on laborious and time-consuming manual annotations, posing considerable challenges in expanding keypoint detection to a broader range of keypoint categories and animal species. The latter, though less dependent on extensive manual input, still requires necessary support images with annotation for reference during testing. To realize zero-shot keypoint detection without any prior annotation, we introduce the Open-Vocabulary Keypoint Detection (OVKD) task, which is innovatively designed to use text prompts for identifying arbitrary keypoints across any species. In pursuit of this goal, we have developed a novel framework named Open-Vocabulary Keypoint Detection with Semantic-feature Matching (KDSM). This framework synergistically combines vision and language models, creating an interplay between language features and local keypoint visual features. KDSM enhances its capabilities by integrating Domain Distribution Matrix Matching (DDMM) and other special modules, such as the Vision-Keypoint Relational Awareness (VKRA) module, improving the framework's generalizability and overall performance. Our comprehensive experiments demonstrate that KDSM significantly outperforms the baseline in terms of performance and achieves remarkable success in the OVKD task. Impressively, our method, operating in a zero-shot fashion, still yields results comparable to state-of-the-art few-shot species class-agnostic keypoint detection methods. We will make the source code publicly accessible. This paper introduces Open-Vocabulary Keypoint Detection (OVKD), a novel task aiming to detect arbitrary keypoints in images using text prompts, even for unseen animal species and keypoint categories. Traditional keypoint detection methods struggle with generalizing to new categories. This work addresses this limitation by leveraging language models to achieve zero-shot detection of diverse keypoints across species. The paper proposes KDSM, a framework that combines a Vision-Keypoint Relational Awareness (VKRA) module with Domain Distribution Matrix Matching (DDMM). VKRA enhances interactions between text embeddings and visual features, while DDMM clusters keypoint categories to enable efficient learning and generalization. KDSM significantly outperforms the baseline framework on the MP-78 dataset for both diverse keypoint categories and varied animal species settings. In the zero-shot setting, KDSM achieves comparable performance to state-of-the-art few-shot species class-agnostic keypoint detection methods. Ablation studies confirm the contribution of DDMM, VKRA, and the choice of pre-trained encoders to KDSM's performance. KDSM's performance in challenging scenarios with occlusion, lighting variations, and resolution changes requires further investigation. Future work could explore integrating stronger text encoders to further enhance the method's capabilities. open vocabulary, keypoint detection, zero-shot learning, vision-language models, domain distribution matrix matching
2310.04995 Report SemST: Semantically Consistent Multi-Scale Image Translation via Structure-Texture Alignment Ganning Zhao, Wenhui Cui, Suya You, C. -C. Jay Kuo Unsupervised image-to-image (I2I) translation learns cross-domain image mapping that transfers input from the source domain to output in the target domain while preserving its semantics. One challenge is that different semantic statistics in source and target domains result in content discrepancy known as semantic distortion. To address this problem, a novel I2I method that maintains semantic consistency in translation is proposed and named SemST in this work. SemST reduces semantic distortion by employing contrastive learning and aligning the structural and textural properties of input and output by maximizing their mutual information. Furthermore, a multi-scale approach is introduced to enhance translation performance, thereby enabling the applicability of SemST to domain adaptation in high-resolution images. Experiments show that SemST effectively mitigates semantic distortion and achieves state-of-the-art performance. Also, the application of SemST to domain adaptation (DA) is explored. It is demonstrated by preliminary experiments that SemST can be utilized as a beneficial pre-training for the semantic segmentation task. This paper presents SemST, a novel multi-scale image-to-image translation method that reduces semantic distortion by aligning structural and textural properties of input and output images. Semantic distortion, a common problem in cross-domain image translation, can negatively impact the performance of downstream tasks like semantic segmentation and domain adaptation. SemST leverages a multi-scale framework to capture both global context and local details, uses a texture-structure consistency loss based on mutual information to align semantic features, and employs semantics-aided hard negative sampling to enhance contrastive learning. SemST achieves state-of-the-art performance in image translation across multiple datasets, including GTA5 to Cityscapes, Cityscapes Parsing to Image, and Photo to Maps. Qualitative results demonstrate SemST's ability to effectively preserve semantic consistency and generate high-quality translated images with fewer artifacts. Experiments on domain adaptation for semantic segmentation show that using SemST-refined synthetic images during training improves performance, highlighting its potential for UDA pre-training. The selection of the TS loss weight requires careful tuning to balance input-output consistency and target domain style learning. Future work could explore the integration of semantic segmentation loss for explicit semantic guidance during image translation. image-to-image translation, semantic consistency, multi-scale learning, contrastive learning, domain adaptation
2310.04719 Report A Comprehensive Survey on Deep Neural Image Deblurring Sajjad Amrollahi Biyouki, Hoon Hwangbo Image deblurring tries to eliminate degradation elements of an image causing blurriness and improve the quality of an image for better texture and object visualization. Traditionally, prior-based optimization approaches predominated in image deblurring, but deep neural networks recently brought a major breakthrough in the field. In this paper, we comprehensively review the recent progress of the deep neural architectures in both blind and non-blind image deblurring. We outline the most popular deep neural network structures used in deblurring applications, describe their strengths and novelties, summarize performance metrics, and introduce broadly used datasets. In addition, we discuss the current challenges and research gaps in this domain and suggest potential research directions for future works. This paper presents a comprehensive review of deep neural network architectures for both blind and non-blind image deblurring, summarizing their contributions, structures, and mechanisms. Image deblurring is crucial for improving image quality in various applications, and deep learning has shown breakthroughs in this field. The paper examines various deep neural networks, including CNNs, ResNets, encoder-decoder networks, GANs, and their variations. It analyzes their architectural configurations, training loss functions, and performance on benchmark datasets. Multi-scale architectures and GANs have significantly improved blind image deblurring performance. Attention mechanisms effectively capture blur characteristics and locations. Deep learning-based image priors, especially deep image prior, have shown promise in enhancing deblurring results. Current architectures face challenges in scalability, generalizability, and handling real-world blur. Future research should focus on developing more efficient feature extraction modules, reducing architecture complexity, and creating realistic datasets. image deblurring, deep learning, convolutional neural networks, generative adversarial networks, image restoration
2310.04672 Report EasyPhoto: Your Smart AI Photo Generator Ziheng Wu, Jiaqi Xu, Xinyi Zou, Kunzhe Huang, Xing Shi, Jun Huang Stable Diffusion web UI (SD-WebUI) is a comprehensive project that provides a browser interface based on Gradio library for Stable Diffusion models. In this paper, We propose a novel WebUI plugin called EasyPhoto, which enables the generation of AI portraits. By training a digital doppelganger of a specific user ID using 5 to 20 relevant images, the finetuned model (according to the trained LoRA model) allows for the generation of AI photos using arbitrary templates. Our current implementation supports the modification of multiple persons and different photo styles. Furthermore, we allow users to generate fantastic template image with the strong SDXL model, enhancing EasyPhoto's capabilities to deliver more diverse and satisfactory results. The source code for EasyPhoto is available at: https://github.com/aigc-apps/sd-webui-EasyPhoto. We also support a webui-free version by using diffusers: https://github.com/aigc-apps/EasyPhoto. We are continuously enhancing our efforts to expand the EasyPhoto pipeline, making it suitable for any identification (not limited to just the face), and we enthusiastically welcome any intriguing ideas or suggestions. EasyPhoto, a Stable Diffusion web UI plugin for generating high-quality AI portraits by training a digital doppelganger of a user using a few input images. Existing methods for AI portrait generation often result in unrealistic lighting, identity loss, or boundary artifacts. EasyPhoto overcomes these limitations by leveraging the image-to-image capabilities of Stable Diffusion and a novel two-stage diffusion process. EasyPhoto uses a multi-stage process involving: (1) Training a LoRA model on user images, incorporating reinforcement learning for identity preservation. (2) A two-stage diffusion process with ControlNet guidance for generating realistic and identity-consistent portraits in various styles and with multiple users. EasyPhoto generates high-quality AI portraits that maintain user identity and resemble the input template style. The two-stage diffusion process effectively addresses issues of boundary artifacts and identity loss. The system supports multi-user generation and leverages SDXL for diverse and realistic template creation. Current implementation primarily focuses on face IDs; expanding to other objects is under development. Reliance on multiple ControlNet units and diffusion stages can increase computational cost. ai portrait generation, stable diffusion, lora, controlnet, digital doppelganger
2310.04414 Report CIFAR-10-Warehouse: Broad and More Realistic Testbeds in Model Generalization Analysis Xiaoxiao Sun, Xingjian Leng, Zijian Wang, Yang Yang, Zi Huang, Liang Zheng Analyzing model performance in various unseen environments is a critical research problem in the machine learning community. To study this problem, it is important to construct a testbed with out-of-distribution test sets that have broad coverage of environmental discrepancies. However, existing testbeds typically either have a small number of domains or are synthesized by image corruptions, hindering algorithm design that demonstrates real-world effectiveness. In this paper, we introduce CIFAR-10-Warehouse, consisting of 180 datasets collected by prompting image search engines and diffusion models in various ways. Generally sized between 300 and 8,000 images, the datasets contain natural images, cartoons, certain colors, or objects that do not naturally appear. With CIFAR-10-W, we aim to enhance the evaluation and deepen the understanding of two generalization tasks: domain generalization and model accuracy prediction in various out-of-distribution environments. We conduct extensive benchmarking and comparison experiments and show that CIFAR-10-W offers new and interesting insights inherent to these tasks. We also discuss other fields that would benefit from CIFAR-10-W. This paper introduces CIFAR-10-Warehouse (CIFAR-10-W), a dataset of 180 diverse domains for evaluating model generalization in out-of-distribution (OOD) settings. Existing OOD datasets are limited by a small number of domains or reliance on synthetic corruptions, hindering the development of algorithms with real-world effectiveness. CIFAR-10-W is constructed by collecting real-world images from various search engines using diverse prompts and by generating synthetic images using a diffusion model. CIFAR-10-W provides a more challenging and realistic benchmark for accuracy prediction methods compared to synthetic datasets. Domain generalization methods show limited improvement over baseline models on CIFAR-10-W, suggesting the need for more robust algorithms. Accuracy prediction methods can be applied to estimate the performance of domain generalized models on unseen target domains. CIFAR-10-W is based on CIFAR-10, which is relatively small compared to ImageNet. Individual datasets in CIFAR-10-W might not be sufficient for full training due to their relatively small size. The domain coverage of CIFAR-10-W, while broad, is not exhaustive. domain generalization, accuracy prediction, out-of-distribution generalization, dataset, benchmark
2310.03739 Report Aligning Text-to-Image Diffusion Models with Reward Backpropagation Mihir Prabhudesai, Anirudh Goyal, Deepak Pathak, Katerina Fragkiadaki Text-to-image diffusion models have recently emerged at the forefront of image generation, powered by very large-scale unsupervised or weakly supervised text-to-image training datasets. Due to their unsupervised training, controlling their behavior in downstream tasks, such as maximizing human-perceived image quality, image-text alignment, or ethical image generation, is difficult. Recent works finetune diffusion models to downstream reward functions using vanilla reinforcement learning, notorious for the high variance of the gradient estimators. In this paper, we propose AlignProp, a method that aligns diffusion models to downstream reward functions using end-to-end backpropagation of the reward gradient through the denoising process. While naive implementation of such backpropagation would require prohibitive memory resources for storing the partial derivatives of modern text-to-image models, AlignProp finetunes low-rank adapter weight modules and uses gradient checkpointing, to render its memory usage viable. We test AlignProp in finetuning diffusion models to various objectives, such as image-text semantic alignment, aesthetics, compressibility and controllability of the number of objects present, as well as their combinations. We show AlignProp achieves higher rewards in fewer training steps than alternatives, while being conceptually simpler, making it a straightforward choice for optimizing diffusion models for differentiable reward functions of interest. Code and Visualization results are available at https://align-prop.github.io/. Introduces Alignment by Backpropagation (AlignProp), a differentiable method for finetuning pretrained text-to-image diffusion models using end-to-end backpropagation of reward gradients, addressing limitations of reinforcement learning approaches. Important because aligning pretrained diffusion models with downstream objectives like aesthetics, fairness, and text-to-image alignment is crucial, and current methods are either data-hungry, computationally expensive, or inefficient. Models the denoising process as a differentiable recurrent policy and finetunes low-rank adapter weights using gradient checkpointing and randomized truncated backpropagation to reduce memory overhead and prevent over-optimization. AlignProp achieves higher rewards and better data efficiency (25x) compared to reinforcement learning baselines. It generalizes better to novel text prompts, demonstrating its ability to learn beyond training data. Human evaluations show preference for AlignProp-generated images in terms of fidelity and image-text alignment. Reliance on differentiable reward functions might lead to over-optimization for imperfect reward functions. Future work includes exploring the applicability of AlignProp to diffusion-based language models for improved alignment with human feedback. diffusion models, text-to-image generation, model alignment, reward learning, backpropagation
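A schematic sketch of reward backpropagation with truncated gradients: the denoising chain runs end to end, but the computation graph is kept only for the last few steps, and the negative reward of the decoded image is backpropagated into the finetuned (LoRA) parameters. All components below (`unet`, `step_fn`, `decode`, `reward_fn`) are hypothetical stand-ins, and gradient checkpointing is omitted.

```python
import torch

def alignprop_loss(unet, step_fn, decode, reward_fn, timesteps, latents, prompt_emb, k_backprop=5):
    """Run the sampler fully, keeping gradients only for the last
    `k_backprop` denoising steps, and return -reward for backpropagation."""
    n = len(timesteps)
    for i, t in enumerate(timesteps):
        if i < n - k_backprop:
            with torch.no_grad():                       # graph-free prefix keeps memory small
                latents = step_fn(unet(latents, t, prompt_emb), t, latents)
        else:                                           # graph kept for the final steps
            latents = step_fn(unet(latents, t, prompt_emb), t, latents)
    image = decode(latents)
    return -reward_fn(image).mean()                     # maximizing reward = minimizing this loss

# Toy usage: a tiny linear "denoiser" stands in for the LoRA-adapted U-Net.
denoiser = torch.nn.Linear(4, 4)
unet = lambda z, t, emb: denoiser(z)
step = lambda eps, t, z: z - 0.1 * eps                  # placeholder sampler update
decode = lambda z: z                                    # identity "VAE decoder"
reward = lambda img: -(img ** 2).sum(dim=-1)            # toy differentiable reward
loss = alignprop_loss(unet, step, decode, reward, list(range(20)), torch.randn(2, 4), None)
loss.backward()
print(denoiser.weight.grad is not None)                 # True: the reward gradient reached the denoiser
```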
2310.03734 Report Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency Tianhong Li, Sangnie Bhardwaj, Yonglong Tian, Han Zhang, Jarred Barber, Dina Katabi, Guillaume Lajoie, Huiwen Chang, Dilip Krishnan Current vision-language generative models rely on expansive corpora of paired image-text data to attain optimal performance and generalization capabilities. However, automatically collecting such data (e.g. via large-scale web scraping) leads to low quality and poor image-text correlation, while human annotation is more accurate but requires significant manual effort and expense. We introduce ITIT (InTegrating Image Text): an innovative training paradigm grounded in the concept of cycle consistency which allows vision-language training on unpaired image and text data. ITIT is comprised of a joint image-text encoder with disjoint image and text decoders that enable bidirectional image-to-text and text-to-image generation in a single framework. During training, ITIT leverages a small set of paired image-text data to ensure its output matches the input reasonably well in both directions. Simultaneously, the model is also trained on much larger datasets containing only images or texts. This is achieved by enforcing cycle consistency between the original unpaired samples and the cycle-generated counterparts. For instance, it generates a caption for a given input image and then uses the caption to create an output image, and enforces similarity between the input and output images. Our experiments show that ITIT with unpaired datasets exhibits similar scaling behavior as using high-quality paired data. We demonstrate image generation and captioning performance on par with state-of-the-art text-to-image and image-to-text models with orders of magnitude fewer (only 3M) paired image-text data. This paper introduces ITIT, a novel training paradigm that leverages cycle consistency to train vision-language models on unpaired image and text data. This is important because collecting large-scale paired image-text data is expensive and often results in low quality, while vast amounts of unpaired data remain underutilized. The method uses a unified image-text encoder and separate image/text decoders. It leverages a small paired dataset for initial training and then uses cycle consistency losses on unpaired data, enforcing similarity between an input image/text and a reconstructed version generated after passing through text-to-image and then image-to-text generation (or vice versa). ITIT achieves performance comparable to state-of-the-art methods on text-to-image and image-to-text benchmarks using significantly fewer paired data samples (up to 100x less). The method exhibits similar scaling behavior with unpaired data as models trained solely on paired data, highlighting its potential for leveraging large unpaired datasets. ITIT proves robust to low-quality paired data, showing significant improvements when incorporating a large, noisy paired dataset compared to baselines trained only on this data. The current implementation leads to increased training time compared to standard paired data training. Future work includes scaling ITIT to even larger unpaired datasets and exploring its effectiveness with more diverse data sources. vision-language models, cycle consistency, unpaired data, text-to-image generation, image captioning
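A minimal sketch of the two cycle-consistency losses on unpaired data, with `captioner` and `generator` as placeholders for the image-to-text and text-to-image heads that share a joint encoder in the paper (the `return_logits` flag and the toy stand-ins below are assumed interfaces, not the paper's API).

```python
import torch
import torch.nn.functional as F

def image_cycle_loss(image, captioner, generator):
    """Unpaired image -> generated caption -> regenerated image; penalize the
    difference between the input image and its cycle reconstruction."""
    caption_tokens = captioner(image)
    reconstructed = generator(caption_tokens)
    return F.mse_loss(reconstructed, image)

def text_cycle_loss(text_tokens, captioner, generator):
    """Unpaired text -> generated image -> predicted caption; penalize the
    mismatch between the predicted caption and the original tokens."""
    image = generator(text_tokens)
    logits = captioner(image, return_logits=True)        # (B, L, vocab)
    return F.cross_entropy(logits.flatten(0, 1), text_tokens.flatten())

# Toy stand-ins: a linear "captioner" and an embedding-based "generator".
vocab, L, dim = 16, 5, 8
cap_head, gen_head = torch.nn.Linear(dim, L * vocab), torch.nn.Embedding(vocab, dim)
captioner = lambda img, return_logits=False: (
    cap_head(img).view(-1, L, vocab) if return_logits
    else cap_head(img).view(-1, L, vocab).argmax(-1))
generator = lambda tokens: gen_head(tokens).mean(dim=1)  # (B, dim) "image"
img, txt = torch.randn(3, dim), torch.randint(0, vocab, (3, L))
print(image_cycle_loss(img, captioner, generator).item(),
      text_cycle_loss(txt, captioner, generator).item())
# In training, these cycle losses on image-only / text-only batches are combined
# with ordinary supervised losses on the small paired set.
```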
2310.03669 Report LumiNet: The Bright Side of Perceptual Knowledge Distillation Md. Ismail Hossain, M M Lutfe Elahi, Sameera Ramasinghe, Ali Cheraghian, Fuad Rahman, Nabeel Mohammed, Shafin Rahman In knowledge distillation literature, feature-based methods have dominated due to their ability to effectively tap into extensive teacher models. In contrast, logit-based approaches, which aim to distill 'dark knowledge' from teachers, typically exhibit inferior performance compared to feature-based methods. To bridge this gap, we present LumiNet, a novel knowledge distillation algorithm designed to enhance logit-based distillation. We introduce the concept of 'perception', aiming to calibrate logits based on the model's representation capability. This concept addresses overconfidence issues in logit-based distillation methods while also introducing a novel method to distill knowledge from the teacher. It reconstructs the logits of a sample/instance by considering relationships with other samples in the batch. LumiNet excels on benchmarks like CIFAR-100, ImageNet, and MSCOCO, outperforming leading feature-based methods, e.g., compared to KD with ResNet18 and MobileNetV2 on ImageNet, it shows improvements of 1.5% and 2.05%, respectively. The paper presents LumiNet, a novel knowledge distillation algorithm that generates new representations for instances/samples, addressing the overconfidence issue in logit-based distillation and improving performance. Logit-based knowledge distillation, while potentially efficient, often lags behind feature-based methods due to overconfidence issues and limitations in capturing knowledge granularity. LumiNet aims to bridge this gap. LumiNet introduces the concept of 'perception,' leveraging mean and variance statistics of logits within a batch to generate a new representation for each instance. This approach enhances logit granularity and mitigates the overconfidence issue. LumiNet outperforms state-of-the-art knowledge distillation methods on CIFAR-100, ImageNet, and MS-COCO datasets for image recognition and object detection tasks. The method demonstrates consistent improvement across various architectures, including ResNet, VGG, ShuffleNet, MobileNet, WRN, and Faster-RCNN-FPN. LumiNet achieves superior accuracy while maintaining efficiency comparable to traditional knowledge distillation methods. While LumiNet shows promising results, exploring its effectiveness with larger batch sizes and diverse datasets could further strengthen its applicability. Further research can investigate the integration of feature-based methods with LumiNet to potentially enhance performance even further. knowledge distillation, deep learning, logit-based distillation, overconfidence, perception
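A minimal sketch of the 'perception' recalibration and the resulting distillation loss: logits are standardized with per-class batch statistics before the usual temperature-scaled KL between teacher and student. The temperature and exact normalization below are illustrative assumptions rather than the paper's precise formulation.

```python
import torch
import torch.nn.functional as F

def perception(logits, eps=1e-6):
    """Recalibrate each sample's logits with batch-level class statistics, so
    the representation reflects relationships with other samples in the batch."""
    mean = logits.mean(dim=0, keepdim=True)
    var = logits.var(dim=0, unbiased=False, keepdim=True)
    return (logits - mean) / torch.sqrt(var + eps)

def luminet_kd_loss(student_logits, teacher_logits, tau=4.0):
    """Temperature-scaled KL divergence between teacher and student perceptions."""
    log_p_s = F.log_softmax(perception(student_logits) / tau, dim=1)
    p_t = F.softmax(perception(teacher_logits) / tau, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * tau ** 2

student, teacher = torch.randn(32, 100), torch.randn(32, 100)   # batch of 32, 100 classes
print(luminet_kd_loss(student, teacher).item())
```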
2310.03502 Report Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion Anton Razzhigaev, Arseniy Shakhmatov, Anastasia Maltseva, Vladimir Arkhipkin, Igor Pavlov, Ilya Ryabov, Angelina Kuts, Alexander Panchenko, Andrey Kuznetsov, Denis Dimitrov Text-to-image generation is a significant domain in modern computer vision and has achieved substantial improvements through the evolution of generative architectures. Among these, there are diffusion-based models that have demonstrated essential quality enhancements. These models are generally split into two categories: pixel-level and latent-level approaches. We present Kandinsky, a novel exploration of latent diffusion architecture, combining the principles of the image prior models with latent diffusion techniques. The image prior model is trained separately to map text embeddings to image embeddings of CLIP. Another distinct feature of the proposed model is the modified MoVQ implementation, which serves as the image autoencoder component. Overall, the designed model contains 3.3B parameters. We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variations generation, and text-guided inpainting/outpainting. Additionally, we released the source code and checkpoints for the Kandinsky models. Experimental evaluations demonstrate a FID score of 8.03 on the COCO-30K dataset, marking our model as the top open-source performer in terms of measurable image generation quality. This paper introduces Kandinsky, a novel text-to-image latent diffusion model that combines image prior and latent diffusion techniques. Kandinsky achieves state-of-the-art image generation quality among open-source models, evidenced by a high FID score and strong human evaluation results. The model utilizes CLIP embeddings for both text and images, employs a transformer-based image prior, a modified MoVQ autoencoder, and a UNet for latent diffusion. Kandinsky achieves a FID score of 8.03 on the COCO-30K dataset, outperforming other open-source models. A linear mapping between text and image embedding spaces proves surprisingly effective for image generation. The model supports diverse generation modes like text-to-image, image fusion, and inpainting/outpainting. Further improvements in semantic coherence between text prompts and generated images are needed. Continued research on mitigating potential biases and preventing the generation of harmful content is crucial. text-to-image synthesis, latent diffusion model, image prior, clip embeddings, movq autoencoder
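The "surprisingly effective linear mapping" result can be illustrated in a few lines: fit a ridge-regularized least-squares map from CLIP text embeddings to CLIP image embeddings on paired data and use the predicted image embedding as the prior. The embedding dimensions and ridge strength below are arbitrary stand-ins, not the model's actual configuration.

```python
import numpy as np

def fit_linear_image_prior(text_emb, image_emb, ridge=1e-3):
    """Closed-form ridge regression from text embeddings (N, d_t) to paired
    image embeddings (N, d_i); returns the (d_t, d_i) mapping matrix."""
    d_t = text_emb.shape[1]
    gram = text_emb.T @ text_emb + ridge * np.eye(d_t)
    return np.linalg.solve(gram, text_emb.T @ image_emb)

rng = np.random.default_rng(0)
txt = rng.standard_normal((1000, 640))        # stand-in CLIP text embeddings
img = rng.standard_normal((1000, 768))        # stand-in CLIP image embeddings
W = fit_linear_image_prior(txt, img)
prior_image_embedding = txt[:1] @ W           # predicted image embedding for one prompt
print(W.shape, prior_image_embedding.shape)   # (640, 768) (1, 768)
```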
2310.03337 Report Denoising Diffusion Step-aware Models Shuai Yang, Yukang Chen, Luozhou Wang, Shu Liu, Yingcong Chen Denoising Diffusion Probabilistic Models (DDPMs) have garnered popularity for data generation across various domains. However, a significant bottleneck is the necessity for whole-network computation during every step of the generative process, leading to high computational overheads. This paper presents a novel framework, Denoising Diffusion Step-aware Models (DDSM), to address this challenge. Unlike conventional approaches, DDSM employs a spectrum of neural networks whose sizes are adapted according to the importance of each generative step, as determined through evolutionary search. This step-wise network variation effectively circumvents redundant computational efforts, particularly in less critical steps, thereby enhancing the efficiency of the diffusion model. Furthermore, the step-aware design can be seamlessly integrated with other efficiency-geared diffusion models such as DDIMs and latent diffusion, thus broadening the scope of computational savings. Empirical evaluations demonstrate that DDSM achieves computational savings of 49% for CIFAR-10, 61% for CelebA-HQ, 59% for LSUN-bedroom, 71% for AFHQ, and 76% for ImageNet, all without compromising the generation quality. This paper introduces Denoising Diffusion Step-aware Models (DDSM), a novel framework that accelerates Denoising Diffusion Probabilistic Models (DDPMs) by employing different-sized neural networks for different generative steps based on their importance. DDPMs, while effective for data generation, suffer from high computational overhead due to the need for whole-network computation at each step of the generative process. DDSM addresses this challenge by optimizing resource allocation across steps. DDSM utilizes a slimmable neural network trained to be executable at various sizes. An evolutionary search algorithm then identifies the optimal model size for each generative step, minimizing computational cost without sacrificing generation quality. DDSM achieves computational savings of 49% for CIFAR-10, 61% for CelebA-HQ, 59% for LSUN-bedroom, 71% for AFHQ, and 76% for ImageNet without compromising image generation quality. The optimal step-aware strategy varies significantly across datasets, highlighting the importance of dataset-specific optimization. DDSM is compatible with other diffusion model acceleration techniques, like DDIM and latent diffusion, allowing for further efficiency improvements. The current search algorithm, while effective, introduces additional computational overhead during the training process. Further investigation is needed to understand the theoretical underpinnings of step importance in diffusion models. denoising diffusion probabilistic models, generative models, model compression, network pruning, evolutionary search
2310.03324 Report Investigating the Limitation of CLIP Models: The Worst-Performing Categories Jie-Jing Shao, Jiang-Xin Shi, Xiao-Wen Yang, Lan-Zhe Guo, Yu-Feng Li Contrastive Language-Image Pre-training (CLIP) provides a foundation model by integrating natural language into visual concepts, enabling zero-shot recognition on downstream tasks. It is usually expected that satisfactory overall accuracy can be achieved across numerous domains through well-designed textual prompts. However, we found that their performance in the worst categories is significantly inferior to the overall performance. For example, on ImageNet, there are a total of 10 categories with class-wise accuracy as low as 0%, even though the overall performance has achieved 64.1%. This phenomenon reveals the potential risks associated with using CLIP models, particularly in risk-sensitive applications where specific categories hold significant importance. To address this issue, we investigate the alignment between the two modalities in the CLIP model and propose the Class-wise Matching Margin (CMM) to measure the inference confusion. CMM can effectively identify the worst-performing categories and estimate the potential performance of the candidate prompts. We further query large language models to enrich descriptions of worst-performing categories and build a weighted ensemble to highlight the efficient prompts. Experimental results clearly verify the effectiveness of our proposal, where the accuracy on the worst-10 categories on ImageNet is boosted to 5.2%, without manual prompt engineering, laborious optimization, or access to labeled validation data. The paper proposes CPE, a method to improve the performance of Contrastive Language-Image Pre-training (CLIP) models on worst-performing categories, which are often overlooked when focusing on overall accuracy. CLIP models often exhibit significantly inferior performance in specific categories despite good overall accuracy. This poses potential risks for real-world applications, especially in risk-sensitive domains where performance in certain categories is crucial. The authors introduce Class-wise Matching Margin (CMM) to measure inference confusion and identify worst-performing categories. They use CMM to select effective prompt templates and enrich descriptions of worst-performing categories using large language models (LLMs), leading to a weighted prompt ensemble method. CPE consistently boosts the accuracy of worst-performing categories across various benchmark datasets. On ImageNet, the accuracy of the worst-10 categories is boosted from 0% to 5.2%. CPE achieves comparable overall accuracy to state-of-the-art methods while significantly improving worst-category performance, demonstrating that both can be achieved simultaneously. CPE's reliance on pseudo-labels for CMM calculation might introduce errors, especially on challenging datasets. Further exploration is needed to optimize the prompt selection process and the number of categories to enrich descriptions for, potentially improving CPE's effectiveness further. The work uses a predefined template pool from CLIP. Future work could explore automatic generation and selection of templates. clip, zero-shot learning, worst-case performance, prompt engineering, large language models
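A small sketch of how a class-wise matching margin could be computed from a CLIP image-to-prompt similarity matrix, using pseudo-labels as the limitation above mentions. This is one plausible reading of CMM for illustration; the paper's exact definition may differ.

```python
import torch

def class_wise_matching_margin(sim, num_classes):
    """sim: (N, C) image-to-class-prompt similarities from CLIP.
    Returns one margin per class; small values flag likely confused classes.
    (One plausible reading of CMM; the paper's exact definition may differ.)"""
    pseudo = sim.argmax(dim=1)               # pseudo-labels, no ground truth needed
    top2 = sim.topk(2, dim=1).values         # best and runner-up matching scores
    margin = top2[:, 0] - top2[:, 1]         # per-image confusion margin
    cmm = torch.full((num_classes,), float("nan"))
    for c in range(num_classes):
        mask = pseudo == c
        if mask.any():
            cmm[c] = margin[mask].mean()
    return cmm

sim = torch.randn(500, 10)                   # e.g. similarities for 500 images, 10 classes
cmm = class_wise_matching_margin(sim, 10)    # low values indicate worst-performing candidates
```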
2310.03291 Report Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction Yiren Jian, Tingkai Liu, Yunzhe Tao, Chunhui Zhang, Soroush Vosoughi, Hongxia Yang In this paper, we introduce EVL_Gen, a streamlined framework designed for the pre-training of visually conditioned language generation models with high computational demands, utilizing frozen pre-trained large language models (LLMs). The conventional approach in vision-language pre-training (VLP) typically involves a two-stage optimization process: an initial resource-intensive phase dedicated to general-purpose vision-language representation learning, focused on extracting and consolidating relevant visual features. This is followed by a subsequent phase that emphasizes end-to-end alignment between visual and linguistic modalities. Our novel one-stage, single-loss framework bypasses the computationally demanding first training stage by gradually merging similar visual tokens during training, while avoiding model collapse caused by single-stage training of BLIP-2 type models. The gradual merging process effectively condenses visual information while preserving semantic richness, resulting in rapid convergence without compromising performance. Our experimental findings demonstrate that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance. Furthermore, we illustrate that our models significantly narrow the performance gap to current vision-language models using only 1/10 of the data. Finally, we showcase how our image-text models can seamlessly adapt to video-conditioned language generation tasks through novel soft attentive temporal token contextualizing modules. Code is available at https://github.com/yiren-jian/EVLGen. The paper proposes EVL_Gen, a streamlined framework for pre-training visually conditioned language generation models using frozen pre-trained large language models (LLMs). EVL_Gen utilizes a novel token merging transformer, TomeFormer, as an efficient vision-language connector, reducing computational cost and training time. Existing vision-language pre-training (VLP) methods, while effective, are computationally expensive, limiting research and exploration of different model configurations. EVL_Gen addresses this challenge by enabling faster and more efficient VLP with comparable or better performance. EVL_Gen replaces the resource-intensive two-stage training process of previous methods with a single-stage, single-loss framework. It employs TomeFormer to merge similar visual tokens during training, compressing visual information while preserving semantic richness, leading to faster convergence. EVL_Gen achieves competitive performance to BLIP-2 on various image-text benchmarks while using significantly less training time (5x faster) and data. The proposed temporal token contextualizing module effectively adapts EVL_Gen for video-language tasks, achieving strong performance on video captioning benchmarks. TomeFormer effectively compresses visual information into semantically rich tokens, simplifying the training process and enabling single-stage optimization. The fixed token merging rate (r) in TomeFormer may not be optimal for all images/videos. TomeFormer lacks the ability for text-specific selection of visual features, potentially limiting performance in tasks like VQA. vision-language pre-training, large language models, token merging, vision-language generation, video captioning
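The core operation described above, merging similar visual tokens at a fixed rate r, can be illustrated with a deliberately simplified greedy variant: at each step the two most cosine-similar tokens are averaged. This is a stand-in sketch, not ToMe's bipartite matching or TomeFormer's actual implementation.

```python
import torch
import torch.nn.functional as F

def merge_tokens_once(x):
    """Merge the two most cosine-similar tokens by averaging them.
    x: (num_tokens, dim). Returns (num_tokens - 1, dim)."""
    sim = F.normalize(x, dim=-1) @ F.normalize(x, dim=-1).T
    sim.fill_diagonal_(-float("inf"))          # ignore self-similarity
    i, j = divmod(int(sim.argmax()), sim.shape[1])
    merged = (x[i] + x[j]) / 2
    keep = [k for k in range(x.shape[0]) if k not in (i, j)]
    return torch.cat([x[keep], merged.unsqueeze(0)], dim=0)

def merge_tokens(x, r):
    """Reduce the token count by r merges (a fixed merging rate, as in the paper)."""
    for _ in range(r):
        x = merge_tokens_once(x)
    return x

visual_tokens = torch.randn(257, 1024)         # e.g. ViT patch tokens (placeholder shape)
compressed = merge_tokens(visual_tokens, r=16) # -> 241 tokens fed to the frozen LLM
```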
2310.03270 Report EfficientDM: Efficient Quantization-Aware Fine-Tuning of Low-Bit Diffusion Models Yefei He, Jing Liu, Weijia Wu, Hong Zhou, Bohan Zhuang Diffusion models have demonstrated remarkable capabilities in image synthesis and related generative tasks. Nevertheless, their practicality for real-world applications is constrained by substantial computational costs and latency issues. Quantization is a dominant way to compress and accelerate diffusion models, where post-training quantization (PTQ) and quantization-aware training (QAT) are two main approaches, each bearing its own properties. While PTQ exhibits efficiency in terms of both time and data usage, it may lead to diminished performance in low bit-width. On the other hand, QAT can alleviate performance degradation but comes with substantial demands on computational and data resources. In this paper, we introduce a data-free and parameter-efficient fine-tuning framework for low-bit diffusion models, dubbed EfficientDM, to achieve QAT-level performance with PTQ-like efficiency. Specifically, we propose a quantization-aware variant of the low-rank adapter (QALoRA) that can be merged with model weights and jointly quantized to low bit-width. The fine-tuning process distills the denoising capabilities of the full-precision model into its quantized counterpart, eliminating the requirement for training data. We also introduce scale-aware optimization and temporal learned step-size quantization to further enhance performance. Extensive experimental results demonstrate that our method significantly outperforms previous PTQ-based diffusion models while maintaining similar time and data efficiency. Specifically, there is only a 0.05 sFID increase when quantizing both weights and activations of LDM-4 to 4-bit on ImageNet 256x256. Compared to QAT-based methods, our EfficientDM also boasts a 16.2x faster quantization speed with comparable generation quality. Code is available at https://github.com/ThisisBillhe/EfficientDM. This paper proposes EfficientDM, a data-free and parameter-efficient fine-tuning framework for low-bit diffusion models, aiming to achieve quantization-aware training (QAT) performance with post-training quantization (PTQ) efficiency. Diffusion models, while powerful for image generation, suffer from high computational cost and latency. Quantization effectively addresses these issues but struggles with significant performance degradation at low bit-widths, especially for PTQ methods. The paper introduces a quantization-aware low-rank adapter (QALoRA), enabling joint quantization of adapter weights and model weights. This facilitates data-free fine-tuning by minimizing the MSE between estimated noises from full-precision and quantized models. The authors further propose scale-aware LoRA optimization to handle variations in weight quantization scales across layers and temporal activation LSQ (TALSQ) to tackle variations in activation distributions across time steps. EfficientDM achieves state-of-the-art performance for low-bit quantization of diffusion models on CIFAR-10, LSUN, and ImageNet datasets, outperforming previous PTQ-based methods while maintaining similar time and data efficiency. The method enables quantization of LDM-4 model weights to 2-bit for the first time with marginal performance loss. Ablation studies demonstrate the effectiveness of each proposed component, including QALoRA, scale-aware LoRA optimization, and TALSQ. Despite achieving QAT-level performance with PTQ-like data and time efficiency, EfficientDM still requires more GPU memory than pure PTQ methods, especially for large diffusion models. Exploration of efficient diffusion models for video or 3D generation remains an open area for future work. diffusion models, model quantization, low-bit quantization, data-free fine-tuning, parameter-efficient fine-tuning
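The QALoRA idea above, folding the low-rank update into the frozen weight and quantizing the merged result so that only the adapter receives gradients, can be sketched as below. The bit-width handling, straight-through estimator, and shapes are illustrative assumptions, not the paper's implementation (which also learns step sizes per timestep).

```python
import torch

def fake_quantize(w, n_bits=4):
    """Uniform symmetric fake quantization with a straight-through estimator
    (illustrative; EfficientDM learns its step sizes rather than using abs-max)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max().detach() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (q - w).detach()               # gradients pass straight through to A, B

def qalora_forward(x, W, A, B, n_bits=4):
    """QALoRA-style forward: merge the low-rank update into the frozen weight,
    then quantize the merged weight jointly. Shapes: W (out, in), A (r, in), B (out, r)."""
    W_q = fake_quantize(W + B @ A, n_bits)     # LoRA update folded in before quantization
    return x @ W_q.T

# toy usage; the distillation target would be the full-precision model's noise estimate
W = torch.randn(512, 512)                               # frozen pre-trained weight
A = (torch.randn(8, 512) * 0.01).requires_grad_()        # LoRA down-projection (rank 8)
B = torch.zeros(512, 8, requires_grad=True)              # LoRA up-projection
y = qalora_forward(torch.randn(4, 512), W, A, B)
```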
2310.03020 Report Consistent-1-to-3: Consistent Image to 3D View Synthesis via Geometry-aware Diffusion Models Jianglong Ye, Peng Wang, Kejie Li, Yichun Shi, Heng Wang Zero-shot novel view synthesis (NVS) from a single image is an essential problem in 3D object understanding. While recent approaches that leverage pre-trained generative models can synthesize high-quality novel views from in-the-wild inputs, they still struggle to maintain 3D consistency across different views. In this paper, we present Consistent-1-to-3, which is a generative framework that significantly mitigates this issue. Specifically, we decompose the NVS task into two stages: (i) transforming observed regions to a novel view, and (ii) hallucinating unseen regions. We design a scene representation transformer and view-conditioned diffusion model for performing these two stages respectively. Inside the models, to enforce 3D consistency, we propose to employ epipolar-guided attention to incorporate geometry constraints, and multi-view attention to better aggregate multi-view information. Finally, we design a hierarchy generation paradigm to generate long sequences of consistent views, allowing a full 360-degree observation of the provided object image. Qualitative and quantitative evaluation over multiple datasets demonstrates the effectiveness of the proposed mechanisms against state-of-the-art approaches. Our project page is at https://jianglongye.com/consistent123/ Introduces Consistent-1-to-3, a novel framework that generates consistent novel views of objects from any viewpoint given a single image. Zero-shot novel view synthesis is essential for 3D object understanding with applications in AR/VR, robotics, and content creation, but existing methods struggle to maintain 3D consistency across different views. Decomposes the task into two stages: using a Scene Representation Transformer (SRT) for photometric warping to novel views and a view-conditioned diffusion model to hallucinate unseen regions. Employs epipolar-guided and multi-view attention for 3D consistency and a hierarchical generation paradigm for long sequences of consistent views. Significantly improves geometric consistency compared to previous state-of-the-art methods. Achieves superior performance on Objaverse and Google Scanned Objects datasets in terms of PSNR, SSIM, LPIPS, and flow warping error. Demonstrates the effectiveness of each component through ablation studies. Trade-off between fidelity and consistency when using multi-view attention and hierarchical generation. Future work includes incorporating better geometry constraints and representations. novel view synthesis, 3d consistency, diffusion models, scene representation transformer, epipolar geometry
2310.03015 Report Efficient-3DiM: Learning a Generalizable Single-image Novel-view Synthesizer in One Day Yifan Jiang, Hao Tang, Jen-Hao Rick Chang, Liangchen Song, Zhangyang Wang, Liangliang Cao The task of novel view synthesis aims to generate unseen perspectives of an object or scene from a limited set of input images. Nevertheless, synthesizing novel views from a single image still remains a significant challenge in the realm of computer vision. Previous approaches tackle this problem by adopting mesh prediction, multi-plane image construction, or more advanced techniques such as neural radiance fields. Recently, a pre-trained diffusion model that is specifically designed for 2D image synthesis has demonstrated its capability in producing photorealistic novel views, if sufficiently optimized on a 3D finetuning task. Although the fidelity and generalizability are greatly improved, training such a powerful diffusion model requires a vast volume of training data and model parameters, resulting in a notoriously long time and high computational costs. To tackle this issue, we propose Efficient-3DiM, a simple but effective framework to learn a single-image novel-view synthesizer. Motivated by our in-depth analysis of the inference process of diffusion models, we propose several pragmatic strategies to reduce the training overhead to a manageable scale, including a crafted timestep sampling strategy, a superior 3D feature extractor, and an enhanced training scheme. When combined, our framework is able to reduce the total training time from 10 days to less than 1 day, significantly accelerating the training process under the same computational platform (one instance with 8 Nvidia A100 GPUs). Comprehensive experiments are conducted to demonstrate the efficiency and generalizability of our proposed method. Efficient-3DiM, an efficient diffusion model framework for single-image novel view synthesis, significantly reducing training time without sacrificing performance. Training diffusion models for novel view synthesis is computationally expensive and time-consuming, hindering research progress. This work aims to improve training efficiency and make it more accessible. The paper introduces three key strategies: 1) a revised timestep sampling method using Gaussian distribution to prioritize important training segments, 2) integration of a self-supervised Vision Transformer (DINO-v2) for superior 3D feature extraction and amalgamation, and 3) an enhanced training paradigm with mixed-precision training and other optimizations. Gaussian sampling for timesteps proves more effective than uniform sampling, especially prioritizing later stages. DINO-v2 encoder outperforms CLIP encoder in capturing 3D features, leading to better performance when combined with multi-scale feature amalgamation. Efficient-3DiM achieves a 14x speedup compared to the baseline Zero 1-to-3 method, reducing training time from 10 days to less than 1 day on the same hardware. Multi-view consistency can be further improved for even better fidelity. Exploration of applying the framework to more complex and realistic datasets beyond synthetic objects. novel view synthesis, diffusion models, efficient training, dino-v2, single-image 3d reconstruction
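The Gaussian timestep sampling strategy mentioned above can be dropped into a standard diffusion training loop in place of uniform sampling. The mean and standard deviation below are illustrative placeholders, not the paper's values; the paper finds that emphasizing particular stages of the noise schedule matters.

```python
import torch

def sample_timesteps(batch_size, num_steps=1000, mean=0.7, std=0.2):
    """Draw diffusion timesteps from a Gaussian in normalized [0, 1) time instead of
    uniformly, concentrating training on a chosen band of the noise schedule.
    mean/std here are illustrative, not the values used in the paper."""
    u = torch.randn(batch_size) * std + mean
    u = u.clamp(0.0, 1.0 - 1e-6)
    return (u * num_steps).long()

t = sample_timesteps(64)   # use in place of torch.randint(0, 1000, (64,))
```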
2310.02992 Report Kosmos-G: Generating Images in Context with Multimodal Large Language Models Xichen Pan, Li Dong, Shaohan Huang, Zhiliang Peng, Wenhu Chen, Furu Wei Recent advancements in subject-driven image generation have made significant strides. However, current methods still fall short in diverse application scenarios, as they require test-time tuning and cannot accept interleaved multi-image and text input. These limitations keep them far from the ultimate goal of "image as a foreign language in image generation." This paper presents Kosmos-G, a model that leverages the advanced multimodal perception capabilities of Multimodal Large Language Models (MLLMs) to tackle the aforementioned challenge. Our approach aligns the output space of MLLM with CLIP using the textual modality as an anchor and performs compositional instruction tuning on curated data. Kosmos-G demonstrates an impressive capability of zero-shot subject-driven generation with interleaved multi-image and text input. Notably, the score distillation instruction tuning requires no modifications to the image decoder. This allows for a seamless substitution of CLIP and effortless integration with a myriad of U-Net techniques ranging from fine-grained controls to personalized image decoder variants. We posit Kosmos-G as an initial attempt towards the goal of "image as a foreign language in image generation." The code can be found at https://aka.ms/Kosmos-G Introduces Kosmos-G, a novel model leveraging Multimodal Large Language Models (MLLMs) for zero-shot subject-driven image generation with interleaved multi-image and text input. Addresses limitations of current subject-driven image generation methods that require test-time tuning and struggle with complex, multi-entity scenarios, bringing us closer to “image as a foreign language in image generation.” Employs a three-stage 'align before instruct' training strategy: 1) Multimodal Language Modeling, 2) Image Decoder Aligning, and 3) Instruction Tuning with a compositional generation task on a curated multimodal dataset. Achieves impressive zero-shot generation results across diverse settings, including re-contextualization, stylization, modification, and accessory incorporation. Outperforms or shows comparable performance to existing fine-tuning and test-time tuning free methods in single-entity subject-driven generation. Seamlessly integrates with existing U-Net techniques like ControlNet and LoRA, unlocking a variety of novel applications. Single-image input during evaluation for DreamBench, potentially limiting performance compared to methods using multiple images. Further improvement possible by exploring different data paradigms and refining the alignment process, particularly for prompts with prefixes. image generation, subject-driven generation, multimodal large language model, zero-shot learning, vision-language alignment
2310.02596 Report SweetDreamer: Aligning Geometric Priors in 2D Diffusion for Consistent Text-to-3D Weiyu Li, Rui Chen, Xuelin Chen, Ping Tan It is inherently ambiguous to lift 2D results from pre-trained diffusion models to a 3D world for text-to-3D generation. 2D diffusion models solely learn view-agnostic priors and thus lack 3D knowledge during the lifting, leading to the multi-view inconsistency problem. We find that this problem primarily stems from geometric inconsistency, and avoiding misplaced geometric structures substantially mitigates the problem in the final outputs. Therefore, we improve the consistency by aligning the 2D geometric priors in diffusion models with well-defined 3D shapes during the lifting, addressing the vast majority of the problem. This is achieved by fine-tuning the 2D diffusion model to be viewpoint-aware and to produce view-specific coordinate maps of canonically oriented 3D objects. In our process, only coarse 3D information is used for aligning. This "coarse" alignment not only resolves the multi-view inconsistency in geometries but also retains the ability in 2D diffusion models to generate detailed and diversified high-quality objects unseen in the 3D datasets. Furthermore, our aligned geometric priors (AGP) are generic and can be seamlessly integrated into various state-of-the-art pipelines, obtaining high generalizability in terms of unseen shapes and visual appearance while greatly alleviating the multi-view inconsistency problem. Our method represents a new state-of-the-art performance with an 85+% consistency rate by human evaluation, while many previous methods are around 30%. Our project page is https://sweetdreamer3d.github.io/ This paper proposes Aligned Geometric Priors (AGP) to address the multi-view inconsistency problem in text-to-3D generation by aligning 2D geometric priors in diffusion models with well-defined 3D shapes. Lifting 2D diffusion results to 3D is inherently ambiguous due to the lack of 3D knowledge, leading to inconsistent 3D structures across different views. This work tackles the primary cause - geometric inconsistency. A pre-trained 2D diffusion model is fine-tuned to generate viewpoint-conditioned canonical coordinate maps from a 3D dataset. This aligns the geometric priors in 2D diffusion with consistent 3D geometry, resulting in AGP. AGP significantly improves multi-view consistency in text-to-3D generation, achieving over 85% consistency rate compared to around 30% in previous methods. The 'coarse' alignment using only coarse geometry information preserves the generalizability of 2D diffusion models, enabling diverse and high-quality 3D object generation. AGP is generically applicable and can be seamlessly integrated into various text-to-3D pipelines using different 3D representations like DMTet and NeRF. The paper primarily focuses on geometric consistency and doesn't directly address appearance inconsistencies, which could be a future work direction. There's a potential risk of degrading the generalizability of geometric priors during AGP training. Investigating regularization constraints could be beneficial. text-to-3d generation, multi-view consistency, diffusion models, geometric priors, 3d shape representation
2310.02279 Report Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, Stefano Ermon Consistency Models (CM) (Song et al., 2023) accelerate score-based diffusion model sampling at the cost of sample quality but lack a natural way to trade-off quality for speed. To address this limitation, we propose Consistency Trajectory Model (CTM), a generalization encompassing CM and score-based models as special cases. CTM trains a single neural network that can -- in a single forward pass -- output scores (i.e., gradients of log-density) and enables unrestricted traversal between any initial and final time along the Probability Flow Ordinary Differential Equation (ODE) in a diffusion process. CTM enables the efficient combination of adversarial training and denoising score matching loss to enhance performance and achieves new state-of-the-art FIDs for single-step diffusion model sampling on CIFAR-10 (FID 1.73) and ImageNet at 64x64 resolution (FID 1.92). CTM also enables a new family of sampling schemes, both deterministic and stochastic, involving long jumps along the ODE solution trajectories. It consistently improves sample quality as computational budgets increase, avoiding the degradation seen in CM. Furthermore, unlike CM, CTM's access to the score function can streamline the adoption of established controllable/conditional generation methods from the diffusion community. This access also enables the computation of likelihood. The code is available at https://github.com/sony/ctm. The paper introduces Consistency Trajectory Model (CTM), a new generative model unifying score-based and distillation models, which enables unrestricted time traversal along the Probability Flow ODE trajectory. CTM addresses limitations of current diffusion models: 1) discretization errors in score-based sampling and 2) lack of speed-quality trade-off in distillation sampling. It allows for more efficient and flexible sampling with improved generation quality. CTM trains a neural network to predict jumps along the PF ODE trajectory using a novel 'soft consistency matching' distillation loss. Additionally, it incorporates denoising score matching and adversarial losses to enhance student model training. CTM achieves state-of-the-art FID scores for single-step diffusion model sampling on CIFAR-10 and ImageNet 64x64. A new sampling scheme called 'γ-sampling' allows for deterministic and stochastic sampling, providing control over sample variance. CTM surpasses its teacher model in both density estimation and image generation quality. The current CTM implementation relies on discrete timesteps for training, limiting its theoretical potential for continuous time traversal. Further investigation is needed to explore potential applications of CTM's trajectory control capabilities in downstream tasks like inpainting and colorization. generative models, diffusion models, score-based models, distillation models, consistency models
2310.02239 Report MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens Kaizhi Zheng, Xuehai He, Xin Eric Wang The effectiveness of Multimodal Large Language Models (MLLMs) demonstrates a profound capability in multimodal understanding. However, the simultaneous generation of images with coherent texts is still underdeveloped. Addressing this, we introduce a novel interleaved vision-and-language generation method, centered around the concept of 'generative vokens'. These vokens serve as pivotal elements contributing to coherent image-text outputs. Our method is marked by a unique two-stage training strategy for description-free multimodal generation, which does not necessitate extensive descriptions of images. We integrate classifier-free guidance to enhance the alignment of generated images and texts, ensuring more seamless and contextually relevant multimodal interactions. Our model, MiniGPT-5, exhibits substantial improvement over the baseline models on multimodal generation datasets, including MMDialog and VIST. The human evaluation shows MiniGPT-5 is better than the baseline model in more than 56% of cases for multimodal generation, highlighting its efficacy across diverse benchmarks. The paper introduces MiniGPT-5, a novel interleaved vision-and-language generation method using "generative vokens" to bridge the gap between text and image generation in LLMs, enabling coherent image-text outputs. Existing Multimodal Large Language Models (MLLMs) excel in understanding but struggle with coherent multimodal output generation, particularly in tasks requiring integrated vision and language handling. The method employs a two-stage training strategy: (1) Description-free pretraining aligns visual features with text-image pairs. (2) Fine-tuning focuses on interleaved vision-and-language generation using multimodal datasets. Classifier-free guidance enhances image-text coherence during training. MiniGPT-5 shows superior performance over baseline models on multimodal generation datasets, including MMDialog and VIST. Human evaluation indicates MiniGPT-5 generates better text narrations (55%), superior image quality (53%), and more coherent multimodal outputs (56%). MiniGPT-5 effectively leverages long-horizontal multimodal inputs, outperforming baselines in generating contextually relevant images and text. Maintaining object texture consistency in generated images remains challenging. Further improvements in generated image quality are possible. multimodal generation, large language models, generative vokens, vision-and-language, classifier-free guidance
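Classifier-free guidance, which the row above uses to tighten image-text alignment, combines a conditional and an unconditional noise prediction at sampling time with the standard formula below; the guidance scale and tensor shapes are placeholders.

```python
import torch

def classifier_free_guidance(eps_cond, eps_uncond, guidance_scale=7.5):
    """Standard classifier-free guidance: push the conditional prediction
    away from the unconditional one by the guidance scale."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# toy usage with placeholder noise predictions from a diffusion U-Net
eps_c = torch.randn(1, 4, 64, 64)   # conditioned (e.g. on the generative vokens)
eps_u = torch.randn(1, 4, 64, 64)   # unconditional (conditioning dropped)
eps = classifier_free_guidance(eps_c, eps_u)
```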
2310.01830 Report AI-Generated Images as Data Source: The Dawn of Synthetic Era Zuhao Yang, Fangneng Zhan, Kunhao Liu, Muyu Xu, Shijian Lu The advancement of visual intelligence is intrinsically tethered to the availability of large-scale data. In parallel, generative Artificial Intelligence (AI) has unlocked the potential to create synthetic images that closely resemble real-world photographs. This prompts a compelling inquiry: how much visual intelligence could benefit from the advance of generative AI? This paper explores the innovative concept of harnessing these AI-generated images as new data sources, reshaping traditional modeling paradigms in visual intelligence. In contrast to real data, AI-generated data exhibit remarkable advantages, including unmatched abundance and scalability, the rapid generation of vast datasets, and the effortless simulation of edge cases. Built on the success of generative AI models, we examine the potential of their generated data in a range of applications, from training machine learning models to simulating scenarios for computational modeling, testing, and validation. We probe the technological foundations that support this groundbreaking use of generative AI, engaging in an in-depth discussion on the ethical, legal, and practical considerations that accompany this transformative paradigm shift. Through an exhaustive survey of current technologies and applications, this paper presents a comprehensive view of the synthetic era in visual intelligence. A project associated with this paper can be found at https://github.com/mwxely/AIGS . This paper surveys the emerging field of using AI-generated images as data sources (AIGS) for enhancing visual intelligence tasks. AIGS offers benefits such as generating large-scale datasets with reduced cost and privacy concerns, leading to improved performance in various computer vision tasks. The paper reviews methods like GANs, Diffusion Models, and Neural Rendering for generating synthetic images. It explores their use for data augmentation and automatic label acquisition, enabling diverse applications. Models trained solely on synthetic images show promising results, sometimes surpassing real-image training. Augmenting real datasets with synthetic images significantly boosts performance in classification, segmentation, and detection tasks. NeRF-based AIGS shows potential for 3D-aware applications like robotics and autonomous driving. Explainability of AIGS in handling corner cases and outliers needs further research. Development of more precise and robust evaluation metrics is crucial for assessing AIGS effectiveness. ai-generated images, synthetic data, generative models, neural rendering, computer vision
2310.01819 Report TP2O: Creative Text Pair-to-Object Generation using Balance Swap-Sampling Jun Li, Zedong Zhang, Jian Yang Generating creative combinatorial objects from two seemingly unrelated object texts is a challenging task in text-to-image synthesis, often hindered by a focus on emulating existing data distributions. In this paper, we develop a straightforward yet highly effective method, called balance swap-sampling. First, we propose a swapping mechanism that generates a novel combinatorial object image set by randomly exchanging intrinsic elements of two text embeddings through a cutting-edge diffusion model. Second, we introduce a balance swapping region to efficiently sample a small subset from the newly generated image set by balancing CLIP distances between the new images and their original generations, increasing the likelihood of accepting the high-quality combinations. Last, we employ a segmentation method to compare CLIP distances among the segmented components, ultimately selecting the most promising object from the sampled subset. Extensive experiments demonstrate that our approach outperforms recent SOTA T2I methods. Surprisingly, our results even rival those of human artists, such as frog-broccoli. This paper introduces BASS (BAlance Swap-Sampling), a novel approach for generating creative combinatorial objects from two distinct object text descriptions in text-to-image synthesis. Current text-to-image models often struggle to generate truly creative and novel combinations of objects, focusing instead on emulating existing data distributions. BASS leverages a swapping mechanism to interchange elements of prompt embeddings, creating novel combinations. It then employs a balance region based on CLIP distances to sample high-quality combinatorial images, further refined using the Segment Anything Model (SAM) for semantic coherence. BASS generates novel and surprising combinatorial objects, often surpassing the creativity of human artists in combining unrelated concepts. The method outperforms SOTA T2I models in generating creative combinations and demonstrates the ability to create out-of-distribution images. Evaluations using PickScore and HPSv2, trained on human preference datasets, reveal BASS's capability to generate objects with high human-preference value. The balance swap region can sometimes lead to nonsensical or chaotic image generation, requiring further investigation and refinement. Current hyperparameter settings might favor majority classes, necessitating future research into distribution tailoring for enhanced creativity across all categories. text-to-image synthesis, combinatorial creativity, diffusion models, clip, out-of-distribution generation
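The "balance swapping region" above, keeping swapped generations whose CLIP distances to the two original generations are roughly equal, can be sketched with cosine distances as below. The tolerance threshold and the use of a single embedding per original concept are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def balanced_subset(gen_emb, emb_a, emb_b, tol=0.1):
    """Keep generated images whose CLIP distances to the two original concept
    generations are roughly balanced (one plausible reading of the 'balance
    swapping region'; the threshold is illustrative)."""
    d_a = 1 - F.cosine_similarity(gen_emb, emb_a.unsqueeze(0), dim=-1)
    d_b = 1 - F.cosine_similarity(gen_emb, emb_b.unsqueeze(0), dim=-1)
    keep = (d_a - d_b).abs() < tol           # neither parent concept dominates
    return keep.nonzero(as_tuple=True)[0]

gen_emb = torch.randn(100, 512)              # CLIP embeddings of swapped generations
frog, broccoli = torch.randn(512), torch.randn(512)
idx = balanced_subset(gen_emb, frog, broccoli)   # candidates passed on to SAM-based selection
```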
2310.01779 Report HallE-Control: Controlling Object Hallucination in Large Multimodal Models Bohan Zhai, Shijia Yang, Chenfeng Xu, Sheng Shen, Kurt Keutzer, Chunyuan Li, Manling Li Current Large Multimodal Models (LMMs) achieve remarkable progress, yet there remains significant uncertainty regarding their ability to accurately apprehend visual details, that is, in performing detailed captioning. To address this, we introduce CCEval, a GPT-4 assisted evaluation method for detailed captioning. Interestingly, while LMMs demonstrate minimal object existence hallucination in existing VQA benchmarks, our proposed evaluation reveals continued susceptibility to such hallucinations. In this paper, we make the first attempt to investigate such hallucination from different aspects, including image resolution, the language decoder size, and instruction data amount, quality, granularity. Our findings underscore the unwarranted inference when the language description includes details at a finer object granularity than what the vision module can ground or verify, thus inducing hallucination. To control such hallucinations, we further attribute the reliability of captioning to contextual knowledge (involving only contextually grounded objects) and parametric knowledge (containing inferred objects by the model). Thus, we introduce HallE-Control, a controllable LMM in terms of Hallucination in object Existence. HallE-Control can condition the captioning to shift between (i) exclusively depicting contextual knowledge for grounded objects and (ii) blending it with parametric knowledge to imagine inferred objects. Our method reduces hallucination by 44% compared to LLaVA-7B and maintains the object coverage. The paper introduces HallE-Control, a novel approach for controlling object existence hallucination in large multimodal models (LMMs) trained for detailed image captioning. Existing LMMs, while proficient in tasks like VQA, often hallucinate objects in detailed captions, hindering their applicability in real-world scenarios. The authors first analyze factors influencing hallucination, identifying a key issue: misalignment between the vision encoder's grounding ability and objects mentioned in training captions. They then propose HallE-Control, which uses a control parameter and specialized datasets to distinguish between contextual (grounded) and parametric (inferred) knowledge, enabling controlled object imagination. Increasing image resolution significantly reduces hallucination by improving object grounding. Scaling the language decoder or instruction data volume alone doesn't consistently mitigate hallucination. HallE-Control reduces hallucination by 44% compared to the baseline LLaVA model while maintaining object coverage in captions. The study primarily focuses on object existence hallucination, leaving other types like attribute and relationship hallucinations for future work. Further exploration into alternative control mechanisms and their impact on different LMM architectures is necessary. large multimodal models, hallucination control, image captioning, vision-language models, object grounding
2310.01662 Report SYRAC: Synthesize, Rank, and Count Adriano D'Alessandro, Ali Mahdavi-Amiri, Ghassan Hamarneh Crowd counting is a critical task in computer vision, with several important applications. However, existing counting methods rely on labor-intensive density map annotations, necessitating the manual localization of each individual pedestrian. While recent efforts have attempted to alleviate the annotation burden through weakly or semi-supervised learning, these approaches fall short of significantly reducing the workload. We propose a novel approach to eliminate the annotation burden by leveraging latent diffusion models to generate synthetic data. However, these models struggle to reliably understand object quantities, leading to noisy annotations when prompted to produce images with a specific quantity of objects. To address this, we use latent diffusion models to create two types of synthetic data: one by removing pedestrians from real images, which generates ranked image pairs with a weak but reliable object quantity signal, and the other by generating synthetic images with a predetermined number of objects, offering a strong but noisy counting signal. Our method utilizes the ranking image pairs for pre-training and then fits a linear layer to the noisy synthetic images using these crowd quantity features. We report state-of-the-art results for unsupervised crowd counting. This paper introduces a novel unsupervised crowd counting method that leverages latent diffusion models to generate synthetic data, eliminating the need for manual annotations. Crowd counting is crucial in computer vision, but traditional methods rely on labor-intensive density map annotations. This work aims to alleviate this annotation burden by proposing an unsupervised approach. The method utilizes latent diffusion models to create two types of synthetic data: 1) Ranked image pairs with weak but reliable object quantity signal generated by removing pedestrians from real images. 2) Synthetic images with noisy count labels generated by prompting the model to produce images with a specific number of objects. A Siamese network is pre-trained on the ranked image pairs to learn crowd quantity features, followed by fine-tuning a linear layer on the noisy synthetic counting data. Achieves state-of-the-art performance on multiple crowd counting benchmarks for unsupervised crowd counting. Significantly reduces the annotation burden associated with crowd counting. Demonstrates the potential of using synthetic data for unsupervised crowd counting. The method's performance on datasets with extremely dense crowds is limited by the increased label noise in synthetic data with high object counts. Future work could explore more sophisticated prompt engineering or alternative generative models to further improve the quality of synthetic data. unsupervised crowd counting, synthetic data, latent diffusion models, crowd density estimation, computer vision
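The pre-training step described above, learning crowd-quantity features from ranked pairs where the original image must outrank its pedestrian-removed version, maps directly onto a margin ranking loss. The tiny backbone, image size, and margin below are placeholders for illustration.

```python
import torch
import torch.nn as nn

# A count regressor maps an image to a scalar crowd-quantity score (placeholder backbone).
scorer = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1))
ranking_loss = nn.MarginRankingLoss(margin=1.0)

def ranking_step(img_full, img_reduced):
    """Pre-training signal from synthetic ranked pairs: the original image should
    score higher than its pedestrian-removed counterpart."""
    s_full = scorer(img_full).squeeze(1)
    s_reduced = scorer(img_reduced).squeeze(1)
    target = torch.ones_like(s_full)          # +1 means s_full should rank above s_reduced
    return ranking_loss(s_full, s_reduced, target)

loss = ranking_step(torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64))
# After this pre-training, a linear layer on top of the learned features is fit to the
# noisy synthetic count labels, as described in the row above.
```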
2310.01596 Report ImagenHub: Standardizing the evaluation of conditional image generation models Max Ku, Tianle Li, Kai Zhang, Yujie Lu, Xingyu Fu, Wenwen Zhuang, Wenhu Chen Recently, a myriad of conditional image generation and editing models have been developed to serve different downstream tasks, including text-to-image generation, text-guided image editing, subject-driven image generation, control-guided image generation, etc. However, we observe huge inconsistencies in experimental conditions - datasets, inference, and evaluation metrics - which render fair comparisons difficult. This paper proposes ImagenHub, which is a one-stop library to standardize the inference and evaluation of all the conditional image generation models. Firstly, we define seven prominent tasks and curate high-quality evaluation datasets for them. Secondly, we build a unified inference pipeline to ensure fair comparison. Thirdly, we design two human evaluation scores, i.e., Semantic Consistency and Perceptual Quality, along with comprehensive guidelines to evaluate generated images. We train expert raters to evaluate the model outputs based on the proposed metrics. Our human evaluation achieves high inter-rater agreement, with a Krippendorff's alpha above 0.4 for 76% of the models. We comprehensively evaluated a total of around 30 models and observed three key takeaways: (1) the existing models' performance is generally unsatisfying except for Text-guided Image Generation and Subject-driven Image Generation, with 74% of models achieving an overall score lower than 0.5. (2) we examined the claims from published papers and found 83% of them hold with a few exceptions. (3) None of the existing automatic metrics has a Spearman's correlation higher than 0.2 except subject-driven image generation. Moving forward, we will continue our efforts to evaluate newly published models and update our leaderboard to keep track of the progress in conditional image generation. ImagenHub, a comprehensive library designed to standardize the inference and evaluation of conditional image generation models, including 7 prominent tasks and curated evaluation datasets. Addresses inconsistencies in datasets, inference, and evaluation metrics in existing conditional image generation models, enabling fair comparison and progress tracking. Curates standardized evaluation datasets, builds a unified inference pipeline, and defines human evaluation protocols (Semantic Consistency and Perceptual Quality) with comprehensive guidelines. Existing models perform poorly in most tasks except for Text-guided Image Generation and Subject-driven Image Generation. 83% of performance claims from published papers are consistent with ImagenHub's evaluation. Automatic metrics show weak correlation with human preference, except in Subject-driven Image Generation. Reliance on human evaluation is expensive and time-consuming. Future work includes developing generic automatic evaluation methods that better approximate human ratings. image generation, benchmarking, evaluation metrics, diffusion models, human evaluation
2310.01506 Report Direct Inversion: Boosting Diffusion-based Editing with 3 Lines of Code Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, Qiang Xu Text-guided diffusion models have revolutionized image generation and editing, offering exceptional realism and diversity. Specifically, in the context of diffusion-based editing, where a source image is edited according to a target prompt, the process commences by acquiring a noisy latent vector corresponding to the source image via the diffusion model. This vector is subsequently fed into separate source and target diffusion branches for editing. The accuracy of this inversion process significantly impacts the final editing outcome, influencing both essential content preservation of the source image and edit fidelity according to the target prompt. Prior inversion techniques aimed at finding a unified solution in both the source and target diffusion branches. However, our theoretical and empirical analyses reveal that disentangling these branches leads to a distinct separation of responsibilities for preserving essential content and ensuring edit fidelity. Building on this insight, we introduce "Direct Inversion," a novel technique achieving optimal performance of both branches with just three lines of code. To assess image editing performance, we present PIE-Bench, an editing benchmark with 700 images showcasing diverse scenes and editing types, accompanied by versatile annotations and comprehensive evaluation metrics. Compared to state-of-the-art optimization-based inversion techniques, our solution not only yields superior performance across 8 editing methods but also achieves nearly an order of speed-up. This paper introduces Direct Inversion, a simple yet effective technique for inverting diffusion models for image editing. Existing inversion techniques for diffusion-based image editing struggle to balance essential content preservation with edit fidelity, often relying on time-consuming optimization and facing error propagation issues. Direct Inversion disentangles the source and target diffusion branches, rectifying the deviation path in the source branch directly using a 3-line code modification for better content preservation while leaving the target branch untouched to maximize edit fidelity. Direct Inversion enhances essential content preservation by up to 83.2% in Structure Distance and up to 73.9% in background LPIPS compared to state-of-the-art optimization-based techniques. It improves edit fidelity by up to 8.8% in Edit Region Clip Similarity. The method achieves a significant speedup, nearly an order of magnitude faster than optimization-based inversion methods. The performance of Direct Inversion is inherently tied to the limitations of existing diffusion-based editing methods, which can be unstable and have low success rates in certain editing tasks. Future work includes extending the technique to video editing, developing editing models with higher success rates, and creating a more comprehensive metric evaluation system. image editing, diffusion models, inversion techniques, content preservation, edit fidelity
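A schematic reading of the disentangled-branch idea described above: latents stored during inversion are used to rectify only the source branch at every step, while the target branch is left free to follow the edit prompt. The control flow, the callable signatures, and the way the target branch consumes source features are all assumptions for illustration, not the authors' three lines of code.

```python
def edit_with_direct_inversion(inversion_latents, denoise_source, denoise_target):
    """inversion_latents[k]: latent stored at inversion step k (last entry = the noisiest, z_T).
    Schematic sketch of a disentangled two-branch edit loop (not the authors' implementation)."""
    z_src = z_tgt = inversion_latents[-1]
    for k in reversed(range(len(inversion_latents) - 1)):
        z_src_pred = denoise_source(z_src)         # source branch step (content preservation)
        z_src = inversion_latents[k]               # rectification: reset to the stored latent
                                                   # so inversion drift does not propagate
        z_tgt = denoise_target(z_tgt, z_src_pred)  # target branch step (edit fidelity)
    return z_tgt

# toy usage with scalar "latents" and placeholder denoisers
lat = [float(k) for k in range(5)]
edited = edit_with_direct_inversion(lat, lambda z: 0.9 * z, lambda zt, zs: 0.9 * zt + 0.1 * zs)
```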
2310.01407 Report CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation Kangfu Mei, Mauricio Delbracio, Hossein Talebi, Zhengzhong Tu, Vishal M. Patel, Peyman Milanfar Large generative diffusion models have revolutionized text-to-image generation and offer immense potential for conditional generation tasks such as image enhancement, restoration, editing, and compositing. However, their widespread adoption is hindered by the high computational cost, which limits their real-time application. To address this challenge, we introduce a novel method dubbed CoDi, that adapts a pre-trained latent diffusion model to accept additional image conditioning inputs while significantly reducing the sampling steps required to achieve high-quality results. Our method can leverage architectures such as ControlNet to incorporate conditioning inputs without compromising the model's prior knowledge gained during large scale pre-training. Additionally, a conditional consistency loss enforces consistent predictions across diffusion steps, effectively compelling the model to generate high-quality images with conditions in a few steps. Our conditional-task learning and distillation approach outperforms previous distillation methods, achieving a new state-of-the-art in producing high-quality images with very few steps (e.g., 1-4) across multiple tasks, including super-resolution, text-guided image editing, and depth-to-image generation. This paper introduces CoDi, a novel method that adapts pre-trained latent diffusion models to accept image conditioning inputs while significantly reducing sampling steps for high-quality image generation. Large generative diffusion models are computationally expensive, limiting their real-time application. CoDi addresses this challenge by enabling rapid generation of high-quality images under various conditional settings. CoDi adapts pre-trained models with a conditional encoder and introduces a conditional consistency loss. This loss enforces consistent predictions across diffusion steps, enabling high-quality generation with few steps. CoDi outperforms previous distillation methods in visual quality and quantitative metrics across tasks like super-resolution and image editing. The method enables parameter-efficient distillation (PE-CoDi), adapting models to new tasks with minimal additional parameters. CoDi achieves comparable performance with significantly fewer sampling steps (e.g., 1-4) than models requiring 20-200 steps. The current adapter architecture introduces additional computation, which could be addressed with lightweight architectures in future work. While CoDi enhances diffusion model practicality, potential misuse for deceptive content necessitates ethical considerations. diffusion models, conditional image generation, model distillation, parameter-efficient tuning, image-to-image translation
2310.01406 Report HumanNorm: Learning Normal Diffusion Model for High-quality and Realistic 3D Human Generation Xin Huang, Ruizhi Shao, Qi Zhang, Hongwen Zhang, Ying Feng, Yebin Liu, Qing Wang Recent text-to-3D methods employing diffusion models have made significant advancements in 3D human generation. However, these approaches face challenges due to the limitations of text-to-image diffusion models, which lack an understanding of 3D structures. Consequently, these methods struggle to achieve high-quality human generation, resulting in smooth geometry and cartoon-like appearances. In this paper, we propose HumanNorm, a novel approach for high-quality and realistic 3D human generation. The main idea is to enhance the model's 2D perception of 3D geometry by learning a normal-adapted diffusion model and a normal-aligned diffusion model. The normal-adapted diffusion model can generate high-fidelity normal maps corresponding to user prompts with view-dependent and body-aware text. The normal-aligned diffusion model learns to generate color images aligned with the normal maps, thereby transforming physical geometry details into realistic appearance. Leveraging the proposed normal diffusion model, we devise a progressive geometry generation strategy and a multi-step Score Distillation Sampling (SDS) loss to enhance the performance of 3D human generation. Comprehensive experiments substantiate HumanNorm's ability to generate 3D humans with intricate geometry and realistic appearances. HumanNorm outperforms existing text-to-3D methods in both geometry and texture quality. The project page of HumanNorm is https://humannorm.github.io/. Proposes HumanNorm, a novel approach for generating high-quality, realistic 3D human models from text descriptions by leveraging normal diffusion models to enhance 2D diffusion models' understanding of 3D geometry. Existing text-to-3D human generation methods often struggle to produce high-fidelity models, resulting in smooth geometry, unrealistic textures, and artifacts. This work addresses these limitations to achieve more realistic and detailed 3D human generation. Introduces a normal-adapted diffusion model for generating detailed geometry from text prompts by learning from multi-view normal maps. Employs a normal-aligned diffusion model to generate textures aligned with the 3D geometry using normal maps as guidance. Utilizes a progressive geometry generation strategy and multi-step Score Distillation Sampling (SDS) loss to enhance performance and realism. Generates 3D humans with intricate geometric details like clothing wrinkles and realistic appearances. Quantitative evaluation shows superior performance over existing text-to-3D methods in terms of FID and CLIP score. User study confirms HumanNorm produces higher-quality 3D humans compared to state-of-the-art methods. Current implementation requires a rigged human skeleton for animation. Future work will integrate SMPL-X for direct animation and improved body details. Generated textures might exhibit shading inconsistencies. Future research will explore Physically-Based Rendering (PBR) for material estimation and relighting. 3d human generation, text-to-3d, diffusion models, normal diffusion, score distillation sampling
2310.01400 Report Sequential Data Generation with Groupwise Diffusion Process Sangyun Lee, Gayoung Lee, Hyunsu Kim, Junho Kim, Youngjung Uh We present the Groupwise Diffusion Model (GDM), which divides data into multiple groups and diffuses one group at one time interval in the forward diffusion process. GDM generates data sequentially from one group at one time interval, leading to several interesting properties. First, as an extension of diffusion models, GDM generalizes certain forms of autoregressive models and cascaded diffusion models. As a unified framework, GDM allows us to investigate design choices that have been overlooked in previous works, such as data-grouping strategy and order of generation. Furthermore, since one group of the initial noise affects only a certain group of the generated data, latent space now possesses group-wise interpretable meaning. We can further extend GDM to the frequency domain where the forward process sequentially diffuses each group of frequency components. Dividing the frequency bands of the data as groups allows the latent variables to become a hierarchical representation where individual groups encode data at different levels of abstraction. We demonstrate several applications of such representation including disentanglement of semantic attributes, image editing, and generating variations. Introduces the Groupwise Diffusion Model (GDM), which divides data into groups and diffuses one group at a time, allowing for sequential data generation and interpretable latent space. Provides a unified framework that generalizes certain autoregressive and cascaded diffusion models, offering flexibility in grouping strategies and generation order. Extends traditional diffusion models by employing a per-group noise scheduling strategy and generalizing the noise schedule function to a matrix form. GDM with specific grouping strategies encapsulates autoregressive models and cascaded diffusion models. GDM's latent space exhibits group-wise interpretability, where each latent group influences specific data elements. Extending GDM to the frequency domain (GDM-F) enables hierarchical representation learning, disentanglement of semantic attributes, and image editing applications. Sampling efficiency decreases as the number of groups increases. Further investigation is needed to determine the optimal grouping strategies and generation order for various datasets. diffusion models, generative models, interpretable latent space, hierarchical representation learning, image editing
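A toy version of the groupwise forward process described above: each group of elements stays clean before its assigned time interval, diffuses inside it, and is fully noised afterwards. The linear per-group schedule, the interpolation form, and the pixel grouping below are illustrative choices, not the paper's.

```python
import torch

def groupwise_alpha(t, group_id, num_groups):
    """Per-element signal coefficient: group g is fully clean before its interval,
    interpolates linearly inside [g/G, (g+1)/G], and is pure noise afterwards.
    (Illustrative linear schedule, not the paper's parameterization.)"""
    start = group_id.float() / num_groups
    end = (group_id.float() + 1) / num_groups
    alpha = 1 - (t - start) / (end - start)
    return alpha.clamp(0.0, 1.0)

def forward_diffuse(x0, t, group_id, num_groups):
    """x_t = alpha * x_0 + (1 - alpha) * noise, with alpha varying per group."""
    a = groupwise_alpha(torch.as_tensor(t), group_id, num_groups)
    return a * x0 + (1 - a) * torch.randn_like(x0)

x0 = torch.randn(8, 8)
group_id = (torch.arange(64) % 4).reshape(8, 8)   # 4 groups of pixels (toy grouping)
xt = forward_diffuse(x0, 0.4, group_id, num_groups=4)
```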
2310.01391 Report A Restoration Network as an Implicit Prior Yuyang Hu, Mauricio Delbracio, Peyman Milanfar, Ulugbek S. Kamilov Image denoisers have been shown to be powerful priors for solving inverse problems in imaging. In this work, we introduce a generalization of these methods that allows any image restoration network to be used as an implicit prior. The proposed method uses priors specified by deep neural networks pre-trained as general restoration operators. The method provides a principled approach for adapting state-of-the-art restoration models for other inverse problems. Our theoretical result analyzes its convergence to a stationary point of a global functional associated with the restoration operator. Numerical results show that the method using a super-resolution prior achieves state-of-the-art performance both quantitatively and qualitatively. Overall, this work offers a step forward for solving inverse problems by enabling the use of powerful pre-trained restoration models as priors. This paper introduces Deep Restoration Priors (DRP), a novel method that generalizes the use of image denoisers as priors for solving inverse problems to any image restoration network. This generalization enables the adaptation of powerful pre-trained restoration models for solving a variety of inverse problems, potentially leading to improved performance. DRP incorporates a pre-trained restoration network into an iterative optimization framework similar to plug-and-play priors (PnP), leveraging the network's implicit prior to regularize the solution. The paper provides theoretical analysis proving the convergence of DRP to a stationary point of an objective function associated with the restoration operator. DRP, when using a super-resolution network as a prior, achieves state-of-the-art performance on deblurring and super-resolution tasks, outperforming existing methods based on denoiser priors. A prior refinement strategy, inspired by the theoretical analysis, is introduced to further improve the performance of DRP. The performance of DRP is limited by the quality of the pre-trained restoration model used as a prior. The theoretical analysis currently relies on the assumption that the restoration model performs MMSE estimation, which might not always hold in practice. inverse problems, image restoration, deep learning, plug-and-play priors, super-resolution
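To make the iterative structure behind DRP concrete, here is a minimal Python sketch of a generic plug-and-play-style update that uses a restoration network as the implicit prior, in the spirit of the description above. The callables `forward_op` and `restoration_net`, the step sizes, and the same-shape degradation assumption (e.g. deblurring) are illustrative, not the paper's exact algorithm or hyperparameters.

```python
import torch

def restore_with_prior(y, forward_op, restoration_net, steps=200, gamma=1e-3, tau=0.5):
    """Minimal plug-and-play-style iteration with a restoration network as implicit prior.

    y               : observed measurement tensor (assumed same shape as the unknown image)
    forward_op      : callable A(x) modelling the degradation (assumed differentiable)
    restoration_net : frozen pre-trained restoration model R(x) used as the prior
    gamma, tau      : step size and prior strength (illustrative values)
    """
    x = y.clone().detach().requires_grad_(True)
    for _ in range(steps):
        # Gradient of the data-fidelity term 0.5 * ||A(x) - y||^2 via autograd.
        data_loss = 0.5 * (forward_op(x) - y).pow(2).sum()
        grad_data = torch.autograd.grad(data_loss, x)[0]
        with torch.no_grad():
            # Prior term: penalise the residual between x and its "restored" version.
            prior_grad = x - restoration_net(x)
            x = x - gamma * (grad_data + tau * prior_grad)
        x.requires_grad_(True)
    return x.detach()
```

Replacing `restoration_net` with a plain denoiser recovers the denoiser-prior setting that the paper generalizes.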
2310.01218 Report Making LLaMA SEE and Draw with SEED Tokenizer Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, Ying Shan The great success of Large Language Models (LLMs) has expanded the potential of multimodality, contributing to the gradual evolution of General Artificial Intelligence (AGI). A true AGI agent should not only possess the capability to perform predefined multi-tasks but also exhibit emergent abilities in an open-world context. However, despite the considerable advancements made by recent multimodal LLMs, they still fall short in effectively unifying comprehension and generation tasks, let alone open-world emergent abilities. We contend that the key to overcoming the present impasse lies in enabling text and images to be represented and processed interchangeably within a unified autoregressive Transformer. To this end, we introduce SEED, an elaborate image tokenizer that empowers LLMs with the ability to SEE and Draw at the same time. We identify two crucial design principles: (1) Image tokens should be independent of 2D physical patch positions and instead be produced with a 1D causal dependency, exhibiting intrinsic interdependence that aligns with the left-to-right autoregressive prediction mechanism in LLMs. (2) Image tokens should capture high-level semantics consistent with the degree of semantic abstraction in words, and be optimized for both discriminativeness and reconstruction during the tokenizer training phase. With SEED tokens, LLM is able to perform scalable multimodal autoregression under its original training recipe, i.e., next-word prediction. SEED-LLaMA is therefore produced by large-scale pretraining and instruction tuning on the interleaved textual and visual data, demonstrating impressive performance on a broad range of multimodal comprehension and generation tasks. More importantly, SEED-LLaMA has exhibited compositional emergent abilities such as multi-turn in-context multimodal generation, acting like your AI assistant. This paper introduces SEED, a novel image tokenizer designed to enhance Large Language Models (LLMs) with multimodal capabilities, and presents SEED-LLaMA, an MLLM built using SEED that demonstrates strong performance in multimodal comprehension, generation, and emergent abilities. Existing MLLMs struggle to effectively unify comprehension and generation tasks and lack open-world emergent abilities. This work addresses these limitations by enabling interchangeable representation and processing of text and images within a unified autoregressive Transformer framework. SEED tokenizes images into discrete codes with 1D causal dependency and high-level semantics. SEED-LLaMA is trained via large-scale multimodal pretraining and instruction tuning on interleaved textual and visual data, leveraging the next-word prediction objective of LLMs. SEED-LLaMA achieves competitive performance on various multimodal comprehension tasks, including image captioning and image/video question answering, surpassing some models using continuous visual representations. SEED-LLaMA demonstrates strong text-to-image generation capabilities, producing images that are highly correlated with given textual descriptions. SEED-LLaMA exhibits emergent abilities such as multi-turn in-context multimodal generation, including image and text generation based on user instructions, and zero-shot compositional image generation (e.g., stylized generation, image blending). The authors acknowledge that current VQA benchmarks may not be ideal for evaluating MLLMs with open-ended output as they require exact matches. Future work will explore enhancements to the SEED tokenizer and expand the pretraining data scale and model size of SEED-LLaMA. multimodal learning, large language models, image tokenization, multimodal generation, emergent abilities
2310.01110 Report Prompt-tuning latent diffusion models for inverse problems Hyungjin Chung, Jong Chul Ye, Peyman Milanfar, Mauricio Delbracio We propose a new method for solving imaging inverse problems using text-to-image latent diffusion models as general priors. Existing methods using latent diffusion models for inverse problems typically rely on simple null text prompts, which can lead to suboptimal performance. To address this limitation, we introduce a method for prompt tuning, which jointly optimizes the text embedding on-the-fly while running the reverse diffusion process. This allows us to generate images that are more faithful to the diffusion prior. In addition, we propose a method to keep the evolution of latent variables within the range space of the encoder, by projection. This helps to reduce image artifacts, a major problem when using latent diffusion models instead of pixel-based diffusion models. Our combined method, called P2L, outperforms both image- and latent-diffusion model-based inverse problem solvers on a variety of tasks, such as super-resolution, deblurring, and inpainting. This paper proposes P2L, a novel method for solving imaging inverse problems using text-to-image latent diffusion models as general priors, enhancing restoration quality by jointly optimizing text embedding during inference. Existing methods using latent diffusion models for inverse problems often rely on simple null text prompts, leading to suboptimal performance in generating high-fidelity images. The method leverages prompt tuning to optimize text embedding on-the-fly while running the reverse diffusion process. Additionally, it introduces a projection technique to constrain the evolution of latent variables within the range space of the encoder, thereby reducing image artifacts. P2L significantly outperforms both image- and latent-diffusion model-based inverse problem solvers in perceptual quality (FID, LPIPS) on tasks like super-resolution, deblurring, and inpainting. The method effectively mitigates artifacts common in latent diffusion model-based restoration, resulting in sharper and more realistic reconstructions. It demonstrates efficacy in high-resolution image restoration, achieving comparable or superior performance to computationally expensive patch-based methods. Prompt tuning, while improving performance, increases computational complexity due to additional forward/backward passes through the model, demanding future research for time-critical applications. The use of continuous text embedding optimization hinders direct interpretation of the learned text. Employing models with text decoders, like Imagen, could address this limitation. inverse problems, latent diffusion models, prompt tuning, image restoration, generative priors
2310.01107 Report Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models Hyeonho Jeong, Jong Chul Ye Recent endeavors in video editing have showcased promising results in single-attribute editing or style transfer tasks, either by training text-to-video (T2V) models on text-video data or adopting training-free methods. However, when confronted with the complexities of multi-attribute editing scenarios, they exhibit shortcomings such as omitting or overlooking intended attribute changes, modifying the wrong elements of the input video, and failing to preserve regions of the input video that should remain intact. To address this, here we present a novel grounding-guided video-to-video translation framework called Ground-A-Video for multi-attribute video editing. Ground-A-Video attains temporally consistent multi-attribute editing of input videos in a training-free manner without aforementioned shortcomings. Central to our method is the introduction of Cross-Frame Gated Attention which incorporates groundings information into the latent representations in a temporally consistent fashion, along with Modulated Cross-Attention and optical flow guided inverted latents smoothing. Extensive experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit-accuracy and frame consistency. Further results and code are available at our project page (http://ground-a-video.github.io). Ground-A-Video is a novel video editing framework that performs multi-attribute editing on input videos by leveraging both spatially-continuous and discrete conditions. Existing video editing methods struggle with multi-attribute editing scenarios, often exhibiting issues like omitting desired edits, modifying incorrect elements, or failing to preserve intended regions. Ground-A-Video utilizes a combination of: (1) Optical flow guided latent smoothing for temporal consistency. (2) Inflated ControlNet with depth maps for structural guidance. (3) Modulated Cross-Attention for consistent null-embeddings. (4) Cross-Frame Gated Attention for incorporating video groundings. Outperforms baseline methods in terms of edit-accuracy and frame consistency in qualitative and quantitative evaluations. Demonstrates successful application in video style transfer and text-to-video generation with pose control. Exhibits strong performance in preserving regions not targeted for editing, especially when combined with an inpainting strategy guided by groundings. Performance is heavily reliant on the accuracy of the video groundings; inaccurate groundings can lead to editing errors. While ControlNet enhances structural guidance, it can limit flexibility. This limitation is mitigated by adjusting the 'ControlNet Scale' parameter. video editing, multi-attribute editing, grounding, stable diffusion, controlnet
2310.01018 Report Controlling Vision-Language Models for Multi-Task Image Restoration Ziwei Luo, Fredrik K. Gustafsson, Zheng Zhao, Jens Sjölund, Thomas B. Schön Vision-language models such as CLIP have shown great impact on diverse downstream tasks for zero-shot or label-free predictions. However, when it comes to low-level vision such as image restoration their performance deteriorates dramatically due to corrupted inputs. In this paper, we present a degradation-aware vision-language model (DA-CLIP) to better transfer pretrained vision-language models to low-level vision tasks as a multi-task framework for image restoration. More specifically, DA-CLIP trains an additional controller that adapts the fixed CLIP image encoder to predict high-quality feature embeddings. By integrating the embedding into an image restoration network via cross-attention, we are able to pilot the model to learn a high-fidelity image reconstruction. The controller itself will also output a degradation feature that matches the real corruptions of the input, yielding a natural classifier for different degradation types. In addition, we construct a mixed degradation dataset with synthetic captions for DA-CLIP training. Our approach advances state-of-the-art performance on both \emph{degradation-specific} and \emph{unified} image restoration tasks, showing a promising direction of prompting image restoration with large-scale pretrained vision-language models. Our code is available at https://github.com/Algolzw/daclip-uir. DA-CLIP, a degradation-aware vision-language model, leverages pretrained VLMs for multi-task image restoration by adapting the image encoder to predict high-quality features from corrupted images and classifying degradation types. Existing VLMs struggle with low-level vision tasks like image restoration due to feature misalignment between corrupted inputs and clean captions, limiting their use in this domain. DA-CLIP introduces an Image Controller that adapts a frozen CLIP image encoder to output both high-quality content embeddings aligned with clean captions and degradation embeddings matching real corruption types. Trained with contrastive learning on a mixed degradation dataset with synthetic captions, DA-CLIP integrates into image restoration networks via cross-attention and prompt learning, enabling both degradation-specific and unified restoration. DA-CLIP improves image restoration performance on various degradation-specific tasks, setting a new state-of-the-art on deraining. In unified image restoration, DA-CLIP consistently outperforms existing methods in perceptual quality across diverse degradation types. DA-CLIP effectively classifies ten degradation types, achieving near-perfect accuracy on most and significantly improving over the original CLIP model's classification ability. The current dataset consists of images with single degradation types, limiting the model's ability to handle mixed degradations in real-world scenarios. DA-CLIP introduces additional model complexity and memory requirements compared to baseline models. image restoration, vision-language models, clip, unified image restoration, degradation classification
2310.00936 Report Trained Latent Space Navigation to Prevent Lack of Photorealism in Generated Images on Style-based Models Takumi Harada, Kazuyuki Aihara, Hiroyuki Sakai Recent studies on StyleGAN variants show promising performances for various generation tasks. In these models, latent codes have traditionally been manipulated and searched for the desired images. However, this approach sometimes suffers from a lack of photorealism in generated images due to a lack of knowledge about the geometry of the trained latent space. In this paper, we show a simple unsupervised method that provides well-trained local latent subspace, enabling latent code navigation while preserving the photorealism of the generated images. Specifically, the method identifies densely mapped latent spaces and restricts latent manipulations within the local latent subspace. Experimental results demonstrate that images generated within the local latent subspace maintain photorealism even when the latent codes are significantly and repeatedly manipulated. Moreover, experiments show that the method can be applied to latent code optimization for various types of style-based models. Our empirical evidence of the method will benefit applications in style-based models. This paper proposes 'Bounded Local Space', a method for navigating the latent space of StyleGAN-like models while preserving image photorealism, especially during large traversals. Manipulating latent codes in StyleGAN often results in unrealistic images due to leaving the distribution of training data. This method addresses this issue by restricting manipulations within a learned, photorealistic subspace. The method computes a 'Bounded Local Space' around a latent code using singular value decomposition of the Jacobian matrix of the StyleGAN mapping network. This space, defined by singular vectors and values, restricts large traversal steps to maintain photorealism. Bounded Local Space preserves photorealism even with significant and repeated latent code manipulations, as demonstrated by lower FID scores compared to baseline methods. The method is effective across various StyleGAN architectures and datasets, showing its generalizability. Bounded Local Space can be integrated into latent code optimization for tasks like aesthetic image manipulation, masked image search, and text-guided image generation, leading to more realistic results. The method's reliance on both Z and W latent spaces limits its application with inversion methods that only operate in extended W spaces. While the method mitigates unrealistic images during traversal, it does not guarantee photorealism for all manipulations, as StyleGAN itself may generate unrealistic images within its trained distribution. generative adversarial networks, stylegan, latent space manipulation, photorealism, image generation
2310.00808 Report Completing Visual Objects via Bridging Generation and Segmentation Xiang Li, Yinpeng Chen, Chung-Ching Lin, Hao Chen, Kai Hu, Rita Singh, Bhiksha Raj, Lijuan Wang, Zicheng Liu This paper presents a novel approach to object completion, with the primary goal of reconstructing a complete object from its partially visible components. Our method, named MaskComp, delineates the completion process through iterative stages of generation and segmentation. In each iteration, the object mask is provided as an additional condition to boost image generation, and, in return, the generated images can lead to a more accurate mask by fusing the segmentation of images. We demonstrate that the combination of one generation and one segmentation stage effectively functions as a mask denoiser. Through alternation between the generation and segmentation stages, the partial object mask is progressively refined, providing precise shape guidance and yielding superior object completion results. Our experiments demonstrate the superiority of MaskComp over existing approaches, e.g., ControlNet and Stable Diffusion, establishing it as an effective solution for object completion. Introduces MaskComp, a novel object completion approach that bridges conditional generation and segmentation by leveraging the observation that generated object quality is directly related to the conditioned mask quality. Object completion is challenging due to the need for seamless alignment between generated and partial objects. MaskComp addresses this by iteratively refining incomplete masks to provide comprehensive shape guidance. MaskComp utilizes an Iterative Mask Denoising (IMD) process with alternating generation and segmentation stages. The generation stage (CompNet) produces complete object images conditioned on partial objects and masks. The segmentation stage uses an off-the-shelf model (SAM) to refine masks from generated images. MaskComp significantly outperforms state-of-the-art methods like ControlNet and Stable Diffusion in FID scores and user study rankings for object completeness and realism. The quality of the conditioned mask significantly influences the quality of generated objects, with complete masks leading to the best results. MaskComp shows robustness to segmentation errors and can be potentially adapted to scenarios without ground-truth complete objects during training. The current implementation of MaskComp requires multiple diffusion processes in each IMD step, impacting inference speed. MaskComp may struggle with uncommon object poses, highlighting the need for more diverse training datasets. object completion, image generation, segmentation, mask denoising, conditional generation
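As an illustration of the iterative mask denoising idea described in the MaskComp entry above, the following Python sketch alternates a generation stage and a segmentation stage to refine a partial mask. The callables `generator` and `segmenter` stand in for the conditional completion model and an off-the-shelf segmenter such as SAM; their interfaces and the fusion-by-averaging rule are simplifying assumptions, not the paper's implementation.

```python
import torch

def iterative_mask_denoising(partial_image, partial_mask, generator, segmenter,
                             num_iters=3, num_samples=4):
    """Sketch of an alternating generation/segmentation (IMD-style) loop.

    generator(image, mask) -> a completed object image conditioned on the current mask
    segmenter(image)       -> a soft (H, W) object mask for the generated image
    """
    mask = partial_mask.float()
    for _ in range(num_iters):
        # Generation stage: sample several completions under the current mask condition.
        images = [generator(partial_image, mask) for _ in range(num_samples)]
        # Segmentation stage: segment each completion and fuse the resulting masks.
        soft_masks = torch.stack([segmenter(img) for img in images])
        mask = (soft_masks.mean(dim=0) > 0.5).float()   # refined, "denoised" mask
    return mask
```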
2310.00632 Report Win-Win: Training High-Resolution Vision Transformers from Two Windows Vincent Leroy, Jerome Revaud, Thomas Lucas, Philippe Weinzaepfel Transformers have become the standard in state-of-the-art vision architectures, achieving impressive performance on both image-level and dense pixelwise tasks. However, training vision transformers for high-resolution pixelwise tasks has a prohibitive cost. Typical solutions boil down to hierarchical architectures, fast and approximate attention, or training on low-resolution crops. This latter solution does not constrain architectural choices, but it leads to a clear performance drop when testing at resolutions significantly higher than that used for training, thus requiring ad-hoc and slow post-processing schemes. In this paper, we propose a novel strategy for efficient training and inference of high-resolution vision transformers. The key principle is to mask out most of the high-resolution inputs during training, keeping only N random windows. This allows the model to learn local interactions between tokens inside each window, and global interactions between tokens from different windows. As a result, the model can directly process the high-resolution input at test time without any special trick. We show that this strategy is effective when using relative positional embedding such as rotary embeddings. It is 4 times faster to train than a full-resolution network, and it is straightforward to use at test time compared to existing approaches. We apply this strategy to three dense prediction tasks with high-resolution data. First, we show on the task of semantic segmentation that a simple setting with 2 windows performs best, hence the name of our method: Win-Win. Second, we confirm this result on the task of monocular depth prediction. Third, we further extend it to the binocular task of optical flow, reaching state-of-the-art performance on the Spring benchmark that contains Full-HD images with an order of magnitude faster inference than the best competitor. This paper introduces Win-Win, a novel training strategy for high-resolution vision transformers that leverages a multi-window masking approach during training to significantly reduce computational cost while maintaining the ability to process full-resolution images at test time. Training vision transformers on high-resolution images for dense prediction tasks is computationally expensive, and existing solutions often involve architectural changes or lead to performance degradation at test time. This paper proposes a simple yet effective training scheme to overcome these limitations. The proposed method masks out most of the input image during training, keeping only a small number of randomly selected windows. This allows the model to learn both local and global interactions, crucial for high-resolution generalization. A temperature parameter in the attention softmax is adjusted at test time to account for the change in token distribution. Using two windows is sufficient to achieve optimal performance, leading to the name Win-Win. The method achieves comparable or better performance than full-resolution training or tiling-based approaches while being significantly faster. Win-Win obtains state-of-the-art results on the challenging Spring optical flow benchmark with Full-HD images. The method relies on relative positional embeddings and might not be suitable for architectures using absolute positional embeddings. Future work could explore extending the multi-window strategy to other vision tasks beyond dense prediction. vision transformers, high-resolution images, dense prediction, semantic segmentation, optical flow
2310.00426 Report PixArt-$α$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, Zhenguo Li The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering the fundamental innovation for the AIGC community while increasing CO2 emissions. This paper introduces PIXART-$\alpha$, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. Additionally, it supports high-resolution image synthesis up to 1024px resolution with low training cost, as shown in Figure 1 and 2. To achieve this goal, three core designs are proposed: (1) Training strategy decomposition: We devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: We incorporate cross-attention modules into Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) High-informative data: We emphasize the significance of concept density in text-image pairs and leverage a large Vision-Language model to auto-label dense pseudo-captions to assist text-image alignment learning. As a result, PIXART-$\alpha$'s training speed markedly surpasses existing large-scale T2I models, e.g., PIXART-$\alpha$ only takes 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU days), saving nearly \$300,000 (\$26,000 vs. \$320,000) and reducing 90% CO2 emissions. Moreover, compared with a larger SOTA model, RAPHAEL, our training cost is merely 1%. Extensive experiments demonstrate that PIXART-$\alpha$ excels in image quality, artistry, and semantic control. We hope PIXART-$\alpha$ will provide new insights to the AIGC community and startups to accelerate building their own high-quality yet low-cost generative models from scratch. PixArt-α is a computationally efficient Transformer-based text-to-image diffusion model that achieves competitive image generation quality with state-of-the-art models while significantly reducing training costs and CO2 emissions. Current state-of-the-art text-to-image models require immense computational resources and incur high training costs, hindering innovation and accessibility within the AIGC community. The paper introduces three core designs: (1) Decomposition of the training strategy into pixel dependency learning, text-image alignment learning, and high-resolution aesthetic image generation. (2) An efficient T2I Transformer that incorporates cross-attention modules and streamlines computation. (3) Utilization of high-informative data through an auto-labeling pipeline with LLaVA for generating precise and detailed image captions. PixArt-α achieves a FID score of 7.32 on the COCO dataset with significantly less training time (12%) and data (1.25%) compared to Stable Diffusion v1.5. It demonstrates superior performance in compositional text-to-image generation, excelling in attribute binding, object relationships, and complex compositions, as evaluated on T2I-CompBench. User study results indicate that PixArt-α outperforms existing SOTA models in terms of image quality and text-image alignment. The model exhibits limitations in accurately controlling the number of objects and representing specific details, such as human hands. Its text generation capability is limited due to the dataset's constraints on font and letter-related images. text-to-image generation, diffusion models, transformer, computational efficiency, auto-labeling
2310.00390 Report InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists Yulu Gan, Sungwoo Park, Alexander Schubert, Anthony Philippakis, Ahmed M. Alaa Recent advances in generative diffusion models have enabled text-controlled synthesis of realistic and diverse images with impressive quality. Despite these remarkable advances, the application of text-to-image generative models in computer vision for standard visual recognition tasks remains limited. The current de facto approach for these tasks is to design model architectures and loss functions that are tailored to the task at hand. In this paper, we develop a unified language interface for computer vision tasks that abstracts away task-specific design choices and enables task execution by following natural language instructions. Our approach involves casting multiple computer vision tasks as text-to-image generation problems. Here, the text represents an instruction describing the task, and the resulting image is a visually-encoded task output. To train our model, we pool commonly-used computer vision datasets covering a range of tasks, including segmentation, object detection, depth estimation, and classification. We then use a large language model to paraphrase prompt templates that convey the specific tasks to be conducted on each image, and through this process, we create a multi-modal and multi-task training dataset comprising input and output images along with annotated instructions. Following the InstructPix2Pix architecture, we apply instruction-tuning to a text-to-image diffusion model using our constructed dataset, steering its functionality from a generative model to an instruction-guided multi-task vision learner. Experiments demonstrate that our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models. Moreover, it exhibits compelling generalization capabilities to unseen data, categories, and user instructions. This paper introduces InstructCV, a unified language interface for computer vision tasks using a text-to-image generation approach, where text instructions specify the task and the generated image represents the visual output. The current approach in computer vision relies on task-specific models, lacking generalizability. InstructCV aims to bridge this gap by learning generalized representations for multiple vision tasks through a unified language interface. The paper proposes instruction-tuning a pre-trained conditional diffusion model (Stable Diffusion). They create a multi-modal, multi-task training dataset with image pairs, textual instructions, and visually encoded task outputs. This dataset is used to fine-tune the diffusion model to function as an instruction-guided multi-task vision learner. InstructCV achieves competitive results on various vision tasks, including semantic segmentation, object detection, depth estimation, and classification. It demonstrates compelling generalization ability to unseen data and categories, outperforming state-of-the-art generalist models. The model shows strong performance in adapting to new, user-written instructions. Although InstructCV reduces computational costs compared to previous generalist models, its inference speed lags behind specialized task-specific models. The model's semantic flexibility is limited by the diversity of the instruction-tuning dataset, which relies on rephrased template prompts, potentially hindering its ability to handle more nuanced instructions. computer vision, multi-task learning, diffusion models, instruction tuning, natural language interface
2310.00240 Report Learning Mask-aware CLIP Representations for Zero-Shot Segmentation Siyu Jiao, Yunchao Wei, Yaowei Wang, Yao Zhao, Humphrey Shi Recently, pre-trained vision-language models have been increasingly used to tackle the challenging zero-shot segmentation task. Typical solutions follow the paradigm of first generating mask proposals and then adopting CLIP to classify them. To maintain the CLIP's zero-shot transferability, previous practices favour to freeze CLIP during training. However, in the paper, we reveal that CLIP is insensitive to different mask proposals and tends to produce similar predictions for various mask proposals of the same image. This insensitivity results in numerous false positives when classifying mask proposals. This issue mainly relates to the fact that CLIP is trained with image-level supervision. To alleviate this issue, we propose a simple yet effective method, named Mask-aware Fine-tuning (MAFT). Specifically, Image-Proposals CLIP Encoder (IP-CLIP Encoder) is proposed to handle arbitrary numbers of image and mask proposals simultaneously. Then, mask-aware loss and self-distillation loss are designed to fine-tune IP-CLIP Encoder, ensuring CLIP is responsive to different mask proposals while not sacrificing transferability. In this way, mask-aware representations can be easily learned to make the true positives stand out. Notably, our solution can seamlessly plug into most existing methods without introducing any new parameters during the fine-tuning process. We conduct extensive experiments on the popular zero-shot benchmarks. With MAFT, the performance of the state-of-the-art methods is promoted by a large margin: 50.4% (+ 8.2%) on COCO, 81.8% (+ 3.2%) on Pascal-VOC, and 8.7% (+4.3%) on ADE20K in terms of mIoU for unseen classes. The code is available at https://github.com/jiaosiyu1999/MAFT.git. This paper identifies a critical issue in zero-shot segmentation where frozen pre-trained vision-language models (like CLIP) are insensitive to variations in mask proposals, leading to false positives. To address this, the authors propose Mask-aware Fine-tuning (MAFT) to make CLIP sensitive to different mask proposals without compromising its transferability to novel classes. Zero-shot segmentation aims to segment objects from unseen categories using text descriptions. Existing methods rely on frozen pre-trained vision-language models to classify mask proposals, but these models are often insensitive to the quality of proposals, leading to inaccurate segmentations. The authors propose MAFT, which consists of an Image-Proposal CLIP Encoder (IP-CLIP) and two losses: mask-aware loss and self-distillation loss. IP-CLIP handles images and mask proposals simultaneously, mask-aware loss aligns classification scores with IoU scores of mask proposals, and self-distillation loss preserves CLIP's transferability. MAFT significantly improves the performance of various zero-shot segmentation methods on COCO-Stuff, Pascal-VOC, and ADE20K. MAFT achieves state-of-the-art results on zero-shot segmentation benchmarks, demonstrating its effectiveness in enhancing the sensitivity of CLIP to mask proposals. MAFT also shows significant improvements in the open-vocabulary setting, outperforming previous methods on ADE20K, Pascal-Context, and Pascal-VOC datasets. The performance of MAFT is still limited by the capabilities of pre-trained vision-language models in recognizing novel classes. Future work will focus on further improving the generalization ability of the model for unseen classes. zero-shot segmentation, vision-language models, clip, mask-aware fine-tuning, transfer learning
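MAFT's mask-aware loss can be pictured as regressing per-proposal classification scores toward their IoU with the ground-truth mask, so higher-quality proposals receive higher scores. The sketch below is a simplified single-class reading of that idea, not the released MAFT code; the tensor shapes and the smooth-L1 objective are assumptions.

```python
import torch
import torch.nn.functional as F

def mask_aware_loss(class_scores, proposal_masks, gt_mask, eps=1e-6):
    """Simplified mask-aware alignment loss (illustrative, single target class).

    class_scores   : (N,) predicted score of the target class for N mask proposals
    proposal_masks : (N, H, W) binary proposal masks
    gt_mask        : (H, W) binary ground-truth mask for that class
    """
    inter = (proposal_masks * gt_mask).flatten(1).sum(-1)
    union = (proposal_masks + gt_mask).clamp(max=1).flatten(1).sum(-1)
    iou = inter / (union + eps)                       # per-proposal IoU with the ground truth
    return F.smooth_l1_loss(torch.sigmoid(class_scores), iou)
```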
2310.00161 Report Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection Dahun Kim, Anelia Angelova, Weicheng Kuo We present a new open-vocabulary detection approach based on detection-oriented image-text pretraining to bridge the gap between image-level pretraining and open-vocabulary object detection. At the pretraining phase, we replace the commonly used classification architecture with the detector architecture, which better serves the region-level recognition needs of detection by enabling the detector heads to learn from noisy image-text pairs. Using only standard contrastive loss and no pseudo-labeling, our approach is a simple yet effective extension of the contrastive learning method to learn emergent object-semantic cues. In addition, we propose a shifted-window learning approach upon window attention to make the backbone representation more robust, translation-invariant, and less biased by the window pattern. On the popular LVIS open-vocabulary detection benchmark, our approach sets a new state of the art of 40.4 mask AP$_r$ using the common ViT-L backbone, significantly outperforming the best existing approach by +6.5 mask AP$_r$ at system level. On the COCO benchmark, we achieve very competitive 40.8 novel AP without pseudo labeling or weak supervision. In addition, we evaluate our approach on the transfer detection setup, where ours outperforms the baseline significantly. Visualization reveals emerging object locality from the pretraining recipes compared to the baseline. Code and models will be publicly released. This paper presents a new open-vocabulary object detection approach that leverages detection-oriented image-text pretraining to improve the alignment between image-level pretraining and open-vocabulary object detection. Existing open-vocabulary detectors often exhibit suboptimal generalization due to the reliance on classification-based pretrained models and training detector heads from scratch on limited datasets. This work aims to bridge this gap by incorporating detection-specific knowledge into the pretraining phase. The methodology involves two key components: (1) Detection-Oriented Pretraining (DOP) utilizes a detector architecture with FPN and Faster R-CNN head during pretraining, enabling learning from noisy image-text pairs. (2) Shifted-Window Learning (SWL) enhances the robustness and translation invariance of vision transformer backbone features by rolling and combining shifted features. The proposed approach achieves a new state-of-the-art of 40.4 mask AP$_r$ on LVIS, outperforming the best existing approach by +6.5 AP$_r$. On COCO, it achieves a competitive 40.8 novel AP without using pseudo-labels or joint training. In transfer detection from LVIS to Objects365, it outperforms existing methods with comparable backbone sizes. The reliance on web-scale image-text data might inherit potential biases and stereotypes present in those datasets. Future work could focus on more rigorous fairness evaluations and explore the impact of different pretraining datasets. open-vocabulary detection, image-text pretraining, contrastive learning, vision transformers, shifted-window learning
2310.00031 Report Text-image Alignment for Diffusion-based Perception Neehar Kondapaneni, Markus Marks, Manuel Knott, Rogerio Guimaraes, Pietro Perona Diffusion models are generative models with impressive text-to-image synthesis capabilities and have spurred a new wave of creative methods for classical machine learning tasks. However, the best way to harness the perceptual knowledge of these generative models for visual tasks is still an open question. Specifically, it is unclear how to use the prompting interface when applying diffusion backbones to vision tasks. We find that automatically generated captions can improve text-image alignment and significantly enhance a model's cross-attention maps, leading to better perceptual performance. Our approach improves upon the current state-of-the-art (SOTA) in diffusion-based semantic segmentation on ADE20K and the current overall SOTA for depth estimation on NYUv2. Furthermore, our method generalizes to the cross-domain setting. We use model personalization and caption modifications to align our model to the target domain and find improvements over unaligned baselines. Our cross-domain object detection model, trained on Pascal VOC, achieves SOTA results on Watercolor2K. Our cross-domain segmentation method, trained on Cityscapes, achieves SOTA results on Dark Zurich-val and Nighttime Driving. Project page: https://www.vision.caltech.edu/tadp/. Code: https://github.com/damaggu/TADP. This paper proposes Text-Aligned Diffusion Perception (TADP), a novel method to enhance diffusion-based perception models by aligning text prompts with images using automated caption generation. It addresses the open question of how to best leverage the perceptual knowledge of diffusion models for visual tasks, particularly in improving text-image alignment for enhanced performance. The methodology involves using BLIP-2 for automated image captioning to generate aligned text prompts for diffusion models, replacing previously used averaged embedding techniques. This is extended to cross-domain tasks by incorporating target domain information into the captions, further aided by model personalization methods like Textual Inversion and DreamBooth. Automated captioning significantly improves performance in semantic segmentation, depth estimation, and object detection tasks by enhancing text-image alignment. The method demonstrates strong cross-domain generalizability, achieving state-of-the-art results on benchmarks like Cityscapes to Dark Zurich, Cityscapes to Nighttime Driving, and Pascal VOC to Watercolor2K. Analysis reveals that diffusion models are sensitive to missing objects in captions (recall) and benefit from longer, more descriptive captions. The current method relies on open-vocabulary captioning models, which can be improved by exploring closed-vocabulary, task-specific captioners. Future work can explore extending the framework for multi-domain generalization to unseen domains. diffusion models, image captioning, semantic segmentation, depth estimation, cross-domain learning
2309.17400 Report Directly Fine-Tuning Diffusion Models on Differentiable Rewards Kevin Clark, Paul Vicol, Kevin Swersky, David J Fleet We present Direct Reward Fine-Tuning (DRaFT), a simple and effective method for fine-tuning diffusion models to maximize differentiable reward functions, such as scores from human preference models. We first show that it is possible to backpropagate the reward function gradient through the full sampling procedure, and that doing so achieves strong performance on a variety of rewards, outperforming reinforcement learning-based approaches. We then propose more efficient variants of DRaFT: DRaFT-K, which truncates backpropagation to only the last K steps of sampling, and DRaFT-LV, which obtains lower-variance gradient estimates for the case when K=1. We show that our methods work well for a variety of reward functions and can be used to substantially improve the aesthetic quality of images generated by Stable Diffusion 1.4. Finally, we draw connections between our approach and prior work, providing a unifying perspective on the design space of gradient-based fine-tuning algorithms. This paper presents Direct Reward Fine-Tuning (DRaFT), a method for efficiently fine-tuning diffusion models to maximize differentiable reward functions. Fine-tuning diffusion models on reward functions like human preferences is crucial for aligning model behavior with user needs, such as generating aesthetically pleasing or fair images. DRaFT leverages backpropagation through the sampling process to compute gradients for fine-tuning on a variety of reward functions. It employs techniques like LoRA and gradient checkpointing for efficiency, and introduces variants DRaFT-K and DRaFT-LV to further improve efficiency and training stability. DRaFT outperforms reinforcement learning-based approaches in terms of sample efficiency by a large margin. DRaFT-LV, a variant of DRaFT, achieves state-of-the-art performance on the Human Preference Score v2 benchmark. DRaFT is shown to be applicable to a wide range of reward functions beyond aesthetic quality, such as image compressibility, object detection, and generating adversarial examples. The paper identifies reward hacking as a challenge where the model might overfit to the reward function and lose diversity. Future work could involve exploring more robust reward functions and investigating the theoretical properties of DRaFT in more depth. Future work could explore applying DRaFT to other diffusion model applications beyond text-to-image generation. diffusion models, reward learning, fine-tuning, image generation, human preferences
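A minimal sketch of the truncated-backpropagation idea behind DRaFT-K: run most of the sampling chain without gradient tracking and keep only the last K steps in the graph before backpropagating the reward. Here `sample_step` is an assumed sampler interface (not a specific library API), and LoRA parameterization, classifier-free guidance, and latent decoding are omitted for brevity.

```python
import torch

def draft_k_update(unet, sample_step, reward_fn, text_emb, timesteps, K=1,
                   shape=(1, 4, 64, 64)):
    """Illustrative DRaFT-K style training step.

    unet(x, t, text_emb)    -> predicted noise; its trainable parameters receive gradients
    sample_step(eps, t, x)  -> sample at the previous timestep (assumed interface)
    reward_fn(x)            -> differentiable scalar reward on the final sample
    """
    x = torch.randn(shape)
    T = len(timesteps)
    # First T-K denoising steps are excluded from autograd to save memory and compute.
    with torch.no_grad():
        for t in timesteps[: T - K]:
            x = sample_step(unet(x, t, text_emb), t, x)
    # Last K steps stay in the graph, so the reward gradient reaches the parameters.
    for t in timesteps[T - K:]:
        x = sample_step(unet(x, t, text_emb), t, x)
    loss = -reward_fn(x)   # maximise the differentiable reward
    loss.backward()
    return float(loss.detach())
```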
2309.17261 Report Consistent123: One Image to Highly Consistent 3D Asset Using Case-Aware Diffusion Priors Yukang Lin, Haonan Han, Chaoqun Gong, Zunnan Xu, Yachao Zhang, Xiu Li Reconstructing 3D objects from a single image guided by pretrained diffusion models has demonstrated promising outcomes. However, due to utilizing the case-agnostic rigid strategy, their generalization ability to arbitrary cases and the 3D consistency of reconstruction are still poor. In this work, we propose Consistent123, a case-aware two-stage method for highly consistent 3D asset reconstruction from one image with both 2D and 3D diffusion priors. In the first stage, Consistent123 utilizes only 3D structural priors for sufficient geometry exploitation, with a CLIP-based case-aware adaptive detection mechanism embedded within this process. In the second stage, 2D texture priors are introduced and progressively take on a dominant guiding role, delicately sculpting the details of the 3D model. Consistent123 aligns more closely with the evolving trends in guidance requirements, adaptively providing adequate 3D geometric initialization and suitable 2D texture refinement for different objects. Consistent123 can obtain highly 3D-consistent reconstruction and exhibits strong generalization ability across various objects. Qualitative and quantitative experiments show that our method significantly outperforms state-of-the-art image-to-3D methods. See https://Consistent123.github.io for a more comprehensive exploration of our generated 3D assets. Consistent123 is a novel two-stage method for reconstructing highly consistent 3D assets from a single image, leveraging both 2D and 3D diffusion priors in a case-aware manner. Existing single-image 3D reconstruction methods suffer from limitations such as poor generalization ability, 3D inconsistency, and the 'multi-face' issue. Consistent123 addresses these challenges by adaptively combining 2D and 3D priors. Consistent123 employs a two-stage approach: (1) 3D structural initialization guided solely by 3D priors, with a CLIP-based adaptive mechanism to determine the optimal transition point to stage 2. (2) Dynamic Prior optimization, gradually integrating 2D texture priors while reducing the emphasis on 3D priors, ensuring both geometric fidelity and textural detail. Consistent123 outperforms state-of-the-art methods in terms of 3D consistency, as evidenced by higher CLIP-Similarity scores on RealFusion15 and C10 datasets. The method effectively addresses the 'multi-face' issue observed in previous works, ensuring consistent appearance across all views. Consistent123 demonstrates superior geometric and textural quality, as evidenced by quantitative metrics (PSNR, LPIPS) and a user study. The reconstruction quality in stage 1 is influenced by the input image's viewpoint due to the heavy reliance on 3D priors. Output quality is dependent on the clarity and specificity of the asset description used in stage 2, with overly brief descriptions potentially leading to inaccuracies. 3d reconstruction, diffusion models, single image, 3d consistency, dynamic prior
2309.17164 Report Retail-786k: a Large-Scale Dataset for Visual Entity Matching Bianca Lamm, Janis Keuper Entity Matching (EM) defines the task of learning to group objects by transferring semantic concepts from example groups (=entities) to unseen data. Despite the general availability of image data in the context of many EM-problems, most currently available EM-algorithms solely rely on (textual) meta data. In this paper, we introduce the first publicly available large-scale dataset for "visual entity matching", based on a production level use case in the retail domain. Using scanned advertisement leaflets, collected over several years from different European retailers, we provide a total of ~786k manually annotated, high resolution product images containing ~18k different individual retail products which are grouped into ~3k entities. The annotation of these product entities is based on a price comparison task, where each entity forms an equivalence class of comparable products. Following on a first baseline evaluation, we show that the proposed "visual entity matching" constitutes a novel learning problem which can not sufficiently be solved using standard image based classification and retrieval algorithms. Instead, novel approaches which allow to transfer example based visual equivalent classes to new data are needed to address the proposed problem. The aim of this paper is to provide a benchmark for such algorithms. Information about the dataset, evaluation code and download instructions are provided under https://www.retail-786k.org/. The paper introduces Retail-786k, the first large-scale publicly available dataset for "visual entity matching," aimed at benchmarking algorithms for grouping visually similar retail products. Entity Matching (EM) is crucial for tasks like price monitoring, but existing methods rely heavily on textual data. Retail-786k enables research on visual EM using a practical, real-world dataset. The dataset was created from a large collection of scanned advertisement leaflets, manually annotated to group over 786k product images into 3,298 entities representing comparable products. Baseline experiments were conducted using image classification and retrieval approaches. Visual entity matching presents a novel learning problem distinct from standard classification and retrieval. Existing image classification models achieved a maximum F1-score of 83.2% on the dataset. An image retrieval approach attained an R@10 score of 56%, indicating limitations in handling entity variance. The dataset currently lacks textual information (e.g., price, product description) that could be valuable for multi-modal solutions. Entity definitions are specific to price monitoring, potentially limiting generalizability to other EM tasks. entity matching, images, long-tail, retail, leaflets
2309.17128 Report HAvatar: High-fidelity Head Avatar via Facial Model Conditioned Neural Radiance Field Xiaochen Zhao, Lizhen Wang, Jingxiang Sun, Hongwen Zhang, Jinli Suo, Yebin Liu The problem of modeling an animatable 3D human head avatar under light-weight setups is of significant importance but has not been well solved. Existing 3D representations either perform well in the realism of portrait images synthesis or the accuracy of expression control, but not both. To address the problem, we introduce a novel hybrid explicit-implicit 3D representation, Facial Model Conditioned Neural Radiance Field, which integrates the expressiveness of NeRF and the prior information from the parametric template. At the core of our representation, a synthetic-renderings-based condition method is proposed to fuse the prior information from the parametric model into the implicit field without constraining its topological flexibility. Besides, based on the hybrid representation, we properly overcome the inconsistent shape issue presented in existing methods and improve the animation stability. Moreover, by adopting an overall GAN-based architecture using an image-to-image translation network, we achieve high-resolution, realistic and view-consistent synthesis of dynamic head appearance. Experiments demonstrate that our method can achieve state-of-the-art performance for 3D head avatar animation compared with previous methods. This paper proposes a novel facial model conditioned Neural Radiance Field for high-fidelity and controllable 3D head avatar animation using monocular or sparse-view videos. Modeling animatable 3D human head avatars with realistic appearance and accurate expression control under lightweight setups is crucial for various applications but remains a challenge. The method integrates a parametric facial model with NeRF. It leverages synthetic renderings of the model to condition feature volume generation, enabling flexible topology and precise control. Learnable embeddings modulate feature generation to address shape inconsistency, and an image-to-image translation network enhances realism. Achieves state-of-the-art performance for 3D head avatar animation with realistic appearance and accurate expression control. Addresses the inconsistent shape issue present in existing NeRF-based avatar modeling methods and significantly improves animation stability. Enables high-resolution, photo-realistic, and view-consistent synthesis of dynamic head appearances. The proxy shapes generated by the method are not as accurate as some surface-based methods. Handling out-of-distribution head poses and extreme expressions remains challenging. head avatar, image synthesis, neural radiance field, parametric facial model, image-to-image translation
2309.17102 Report Guiding Instruction-based Image Editing via Multimodal Large Language Models Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, Zhe Gan Instruction-based image editing improves the controllability and flexibility of image manipulation via natural commands without elaborate descriptions or regional masks. However, human instructions are sometimes too brief for current methods to capture and follow. Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation via LMs. We investigate how MLLMs facilitate edit instructions and present MLLM-Guided Image Editing (MGIE). MGIE learns to derive expressive instructions and provides explicit guidance. The editing model jointly captures this visual imagination and performs manipulation through end-to-end training. We evaluate various aspects of Photoshop-style modification, global photo optimization, and local editing. Extensive experimental results demonstrate that expressive instructions are crucial to instruction-based image editing, and our MGIE can lead to a notable improvement in automatic metrics and human evaluation while maintaining competitive inference efficiency. This paper introduces MLLM-Guided Image Editing (MGIE), which leverages Multimodal Large Language Models (MLLMs) to improve instruction-based image editing by deriving more expressive and detailed instructions. Human instructions for image editing are often too brief and ambiguous for existing methods to understand fully. MGIE addresses this by utilizing the cross-modal understanding and visual awareness of MLLMs to generate more effective guidance. MGIE employs an MLLM to generate concise, expressive instructions from initial, brief instructions and the input image. These instructions, along with visual tokens, guide a diffusion model to perform the desired edits. The MLLM and diffusion model are jointly trained end-to-end. Expressive instructions significantly enhance image editing performance compared to using only brief instructions. Visual awareness is crucial for deriving effective expressive instructions, leading to superior results over language-only methods. MGIE achieves state-of-the-art performance on various editing tasks, including Photoshop-style modification, global photo optimization, and local object alteration, while maintaining competitive inference efficiency. MGIE may inherit potential biases present in the pre-trained foundation models (LLaVA and StableDiffusion). Complex compositional commands or those requiring precise numerical or spatial understanding remain challenging. image editing, instruction-based editing, multimodal large language models, diffusion models, expressive instructions
2309.17074 Report DeeDiff: Dynamic Uncertainty-Aware Early Exiting for Accelerating Diffusion Model Generation Shengkun Tang, Yaqing Wang, Caiwen Ding, Yi Liang, Yao Li, Dongkuan Xu Diffusion models achieve great success in generating diverse and high-fidelity images. The performance improvements come with low generation speed per image, which hinders the application diffusion models in real-time scenarios. While some certain predictions benefit from the full computation of the model in each sample iteration, not every iteration requires the same amount of computation, potentially leading to computation waste. In this work, we propose DeeDiff, an early exiting framework that adaptively allocates computation resources in each sampling step to improve the generation efficiency of diffusion models. Specifically, we introduce a timestep-aware uncertainty estimation module (UEM) for diffusion models which is attached to each intermediate layer to estimate the prediction uncertainty of each layer. The uncertainty is regarded as the signal to decide if the inference terminates. Moreover, we propose uncertainty-aware layer-wise loss to fill the performance gap between full models and early-exited models. With such loss strategy, our model is able to obtain comparable results as full-layer models. Extensive experiments of class-conditional, unconditional, and text-guided generation on several datasets show that our method achieves state-of-the-art performance and efficiency trade-off compared with existing early exiting methods on diffusion models. More importantly, our method even brings extra benefits to baseline models and obtains better performance on CIFAR-10 and Celeb-A datasets. Full code and model are released for reproduction. DeeDiff, a novel early exiting framework designed to accelerate the inference speed of diffusion models for image generation. Diffusion models, while powerful in generating high-quality images, suffer from slow generation speed due to the multi-step inference process, hindering their application in real-time scenarios. DeeDiff introduces a timestep-aware uncertainty estimation module (UEM) to estimate the prediction uncertainty of each layer in the diffusion model. This uncertainty guides the early exiting decisions, adaptively allocating computational resources based on the difficulty of the generation step. Additionally, an uncertainty-aware layer-wise loss function is proposed to minimize the performance gap between the full model and the early-exited model. DeeDiff achieves state-of-the-art performance and efficiency compared to existing early exiting methods on diffusion models, reducing inference time by up to 40% with minimal performance drop. The proposed uncertainty-aware layer-wise loss strategy not only improves the efficiency of early exiting but also enhances the performance of the baseline diffusion models, even without early exiting. DeeDiff demonstrates compatibility with different diffusion models (CNN-based and Transformer-based) and acceleration methods like DPM-Solver, highlighting its generalizability and potential for broader applications. While DeeDiff achieves a favorable performance-efficiency trade-off, further improvement is needed to maintain high image quality (low FID) at higher efficiency levels (over 60% layer reduction). The current implementation of DeeDiff focuses on early exiting based on the depth (number of layers) of the diffusion model. 
Exploring adaptive width (adaptive sampling steps) remains an open avenue for future research. diffusion models, early exiting, image generation, uncertainty estimation, inference acceleration
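To make the uncertainty-gated early exit concrete, the following Python sketch shows a denoiser whose inference loop stops once a per-layer uncertainty head is confident enough; the layer structure, the UEM heads, and the exit threshold are illustrative assumptions, not DeeDiff's actual architecture.

import torch
import torch.nn as nn

class EarlyExitDenoiser(nn.Module):
    """Toy denoiser whose intermediate layers can terminate inference early
    when an attached uncertainty head (UEM) is confident enough."""

    def __init__(self, dim=64, depth=8, threshold=0.1):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))
        # One lightweight uncertainty estimation head per intermediate layer.
        self.uems = nn.ModuleList(nn.Linear(dim, 1) for _ in range(depth))
        self.out = nn.Linear(dim, dim)
        self.threshold = threshold  # exit when predicted uncertainty drops below this

    def forward(self, x, t_embed):
        h = x + t_embed  # timestep conditioning (simplified)
        for layer, uem in zip(self.layers, self.uems):
            h = torch.relu(layer(h))
            uncertainty = torch.sigmoid(uem(h)).mean()  # single scalar over the batch (simplified)
            if uncertainty < self.threshold:            # confident enough -> stop early
                break
        return self.out(h)

x = torch.randn(4, 64)
t_embed = torch.randn(4, 64)
eps_pred = EarlyExitDenoiser()(x, t_embed)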
2309.16992 Report Segment Anything Model is a Good Teacher for Local Feature Learning Jingqian Wu, Rongtao Xu, Zach Wood-Doughty, Changwei Wang, Shibiao Xu, Edmund Lam Local feature detection and description play an important role in many computer vision tasks, which are designed to detect and describe keypoints in "any scene" and "any downstream task". Data-driven local feature learning methods need to rely on pixel-level correspondence for training, which is challenging to acquire at scale, thus hindering further improvements in performance. In this paper, we propose SAMFeat to introduce SAM (segment anything model), a foundation model trained on 11 million images, as a teacher to guide local feature learning and thus inspire higher performance on limited datasets. To do so, first, we construct an auxiliary task of Pixel Semantic Relational Distillation (PSRD), which distills feature relations with category-agnostic semantic information learned by the SAM encoder into a local feature learning network, to improve local feature description using semantic discrimination. Second, we develop a technique called Weakly Supervised Contrastive Learning Based on Semantic Grouping (WSC), which utilizes semantic groupings derived from SAM as weakly supervised signals, to optimize the metric space of local descriptors. Third, we design an Edge Attention Guidance (EAG) to further improve the accuracy of local feature detection and description by prompting the network to pay more attention to the edge region guided by SAM. SAMFeat's performance on various tasks such as image matching on HPatches, and long-term visual localization on Aachen Day-Night showcases its superiority over previous local features. The released code is available at https://github.com/vignywang/SAMFeat. SAMFeat, a novel local feature learning method that leverages the Segment Anything Model (SAM) as a teacher to enhance performance. Existing data-driven local feature learning methods rely heavily on pixel-level correspondence, neglecting semantic information crucial for robust feature description. SAMFeat bridges this gap by incorporating SAM's rich semantic understanding. SAMFeat utilizes three key strategies: 1) Pixel Semantic Relational Distillation (PSRD) to distill category-agnostic semantic relations from SAM, 2) Weakly Supervised Contrastive Learning Based on Semantic Grouping (WSC) to optimize descriptor space using SAM-derived semantic groupings, and 3) Edge Attention Guidance (EAG) to prioritize edge regions identified by SAM for enhanced detection and description. SAMFeat achieves state-of-the-art results on the HPatches image matching benchmark, demonstrating superior accuracy across various thresholds. In long-term visual localization tasks on the Aachen Day-Night dataset, SAMFeat exhibits competitive performance against methods specifically designed for localization, highlighting its strong generalization capabilities. Ablation studies validate the contribution of each individual component (PSRD, WSC, EAG) to the overall performance improvement. Exploration of alternative visual foundation models (e.g., DINO, SEEM) as potential teachers is left for future work. The impact of different weighting schemes for individual loss functions in SAMFeat's total loss could be further investigated. local feature learning, segment anything model, semantic segmentation, image matching, visual localization
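A minimal sketch of the relation-distillation idea (PSRD) described above: match pairwise cosine-similarity matrices of student local features against those of frozen SAM encoder features. The tensor shapes and the MSE objective are assumptions made for illustration, not the paper's exact loss.

import torch
import torch.nn.functional as F

def relation_distillation_loss(student_feats, teacher_feats):
    """Hypothetical pixel-relation distillation: match pairwise cosine-similarity
    matrices of student local features and (frozen) SAM encoder features.
    Shapes: (B, N, C_s) and (B, N, C_t); N = number of spatial locations."""
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    rel_s = s @ s.transpose(1, 2)   # (B, N, N) student relation matrix
    rel_t = t @ t.transpose(1, 2)   # (B, N, N) teacher relation matrix
    return F.mse_loss(rel_s, rel_t)

student = torch.randn(2, 256, 128, requires_grad=True)
teacher = torch.randn(2, 256, 256)   # SAM features would be frozen in practice
loss = relation_distillation_loss(student, teacher)
loss.backward()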
2309.16948 Report Denoising Diffusion Bridge Models Linqi Zhou, Aaron Lou, Samar Khanna, Stefano Ermon Diffusion models are powerful generative models that map noise to data using stochastic processes. However, for many applications such as image editing, the model input comes from a distribution that is not random noise. As such, diffusion models must rely on cumbersome methods like guidance or projected sampling to incorporate this information in the generative process. In our work, we propose Denoising Diffusion Bridge Models (DDBMs), a natural alternative to this paradigm based on diffusion bridges, a family of processes that interpolate between two paired distributions given as endpoints. Our method learns the score of the diffusion bridge from data and maps from one endpoint distribution to the other by solving a (stochastic) differential equation based on the learned score. Our method naturally unifies several classes of generative models, such as score-based diffusion models and OT-Flow-Matching, allowing us to adapt existing design and architectural choices to our more general problem. Empirically, we apply DDBMs to challenging image datasets in both pixel and latent space. On standard image translation problems, DDBMs achieve significant improvement over baseline methods, and, when we reduce the problem to image generation by setting the source distribution to random noise, DDBMs achieve comparable FID scores to state-of-the-art methods despite being built for a more general task. This paper proposes Denoising Diffusion Bridge Models (DDBMs), a novel framework for distribution translation by building a stochastic bridge between paired samples with tractable marginal distributions. Standard diffusion models are limited to mapping to simple Gaussian distributions, making them ill-suited for tasks like image translation that require mapping between arbitrary distributions. The authors leverage diffusion bridges, stochastic processes that interpolate between paired distributions, and learn the score of the diffusion bridge by matching against a tractable closed-form score. DDBMs achieve strong performance in image-to-image translation tasks, outperforming baseline methods on metrics like FID and LPIPS. When applied to unconditional image generation, DDBMs achieve comparable FID scores to state-of-the-art diffusion models. The proposed preconditioning and hybrid sampler are shown to be crucial for the empirical success of DDBMs. The predict-x parameterization, while effective for pixel-space generation, may be less suitable for latent-space translation. Future work includes exploring alternative parameterizations and extending DDBMs to handle more complex data modalities beyond images. diffusion models, generative models, image translation, diffusion bridges, score matching
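As a simple point of reference for the bridge construction described above, the sketch below samples from a Brownian bridge pinned at paired endpoints x_0 and x_T. This is only the simplest special case; DDBMs generalize to VP/VE-style bridges and learn the bridge score rather than assuming Brownian dynamics.

import torch

def brownian_bridge_sample(x0, xT, t, T=1.0):
    """Sample x_t ~ N(x0 + (t/T)(xT - x0), (t (T - t) / T) I), the marginal of a
    Brownian bridge pinned at x0 (time 0) and xT (time T)."""
    mean = x0 + (t / T) * (xT - x0)
    var = t * (T - t) / T
    return mean + var ** 0.5 * torch.randn_like(x0)

x0 = torch.randn(1, 3, 64, 64)   # e.g. a target-domain image
xT = torch.randn(1, 3, 64, 64)   # e.g. the paired source-domain image
x_half = brownian_bridge_sample(x0, xT, t=0.5)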
2309.16738 Report ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens Yangyang Guo, Haoyu Zhang, Yongkang Wong, Liqiang Nie, Mohan Kankanhalli Learning a versatile language-image model is computationally prohibitive under a limited computing budget. This paper delves into efficient language-image pre-training, an area that has received relatively little attention despite its importance in reducing computational cost and footprint. To that end, we propose a vision token pruning and merging method, ELIP, to remove less influential tokens based on the supervision of language outputs. Our method is designed with several strengths, such as being computation-efficient, memory-efficient, and trainable-parameter-free, and is distinguished from previous vision-only token pruning approaches by its alignment with task objectives. We implement this method in a progressively pruning manner using several sequential blocks. To evaluate its generalization performance, we apply ELIP to three commonly used language-image pre-training models and utilize public image-caption pairs with 4M images for pre-training. Our experiments demonstrate that with the removal of ~30% of vision tokens across 12 ViT layers, ELIP maintains performance comparable to baselines (~0.32 accuracy drop on average) over various downstream tasks including cross-modal retrieval, VQA, image captioning, etc. In addition, the GPU resources spared by ELIP allow us to scale up with larger batch sizes, thereby accelerating model pre-training and even sometimes enhancing downstream model performance. This paper presents ELIP, a vision token pruning and merging method for efficient language-image pre-training, which removes less influential vision tokens based on the supervision of language outputs. Learning versatile language-image models is computationally expensive. ELIP aims to improve efficiency by reducing the computational cost and footprint during pre-training. ELIP progressively prunes and merges less important vision tokens in a multi-stage manner, guided by the fusion of image and text [CLS] token features. It divides the ViT encoder into four blocks, with increasing pruning ratios for deeper blocks. ELIP achieves comparable performance to baseline models on various downstream tasks (e.g., retrieval, VQA) while removing ~30% of vision tokens. The reduced computational cost allows for scaling up pre-training with larger batch sizes, leading to faster training and sometimes even improved downstream performance. ELIP effectively removes less important background information while preserving salient object features, as demonstrated through visualizations. The pruning ratio in ELIP is fixed and could benefit from an adaptive approach based on image complexity. Further efficiency improvements can be explored by integrating ELIP with other techniques like mixed-precision training. vision-language pre-training, efficient deep learning, vision token pruning, multi-modal learning, vision transformer
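The token-selection step described above can be sketched as follows: tokens are scored by similarity to a fused [CLS] feature, the top tokens are kept, and the pruned ones are merged into a single token. The single-block form and the keep_ratio parameter are illustrative simplifications, not ELIP's released implementation.

import torch

def prune_and_merge(vision_tokens, fused_cls, keep_ratio=0.7):
    """Score each vision token by similarity to a fused image/text [CLS] feature,
    keep the top tokens, and merge the pruned ones into a single token.
    vision_tokens: (B, N, C); fused_cls: (B, C). Illustrative only."""
    scores = torch.einsum("bnc,bc->bn", vision_tokens, fused_cls)      # (B, N)
    k = int(vision_tokens.shape[1] * keep_ratio)
    keep_idx = scores.topk(k, dim=1).indices                            # (B, k)
    drop_idx = scores.topk(vision_tokens.shape[1] - k, dim=1, largest=False).indices
    gather = lambda idx: torch.gather(
        vision_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, vision_tokens.shape[-1]))
    kept = gather(keep_idx)
    merged = gather(drop_idx).mean(dim=1, keepdim=True)                 # (B, 1, C)
    return torch.cat([kept, merged], dim=1)                             # (B, k+1, C)

tokens = torch.randn(2, 196, 768)
cls = torch.randn(2, 768)
out = prune_and_merge(tokens, cls)   # (2, 138, 768) with keep_ratio=0.7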
2309.16671 Report Demystifying CLIP Data Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced research and applications in computer vision, fueling modern recognition systems and generative models. We believe that the main ingredient to the success of CLIP is its data and not the model architecture or pre-training objective. However, CLIP only provides very limited information about its data and how it has been collected, leading to works that aim to reproduce CLIP's data by filtering with its model parameters. In this work, we intend to reveal CLIP's data curation approach and in our pursuit of making it open to the community introduce Metadata-Curated Language-Image Pre-training (MetaCLIP). MetaCLIP takes a raw data pool and metadata (derived from CLIP's concepts) and yields a balanced subset over the metadata distribution. Our experimental study rigorously isolates the model and training settings, concentrating solely on data. MetaCLIP applied to CommonCrawl with 400M image-text data pairs outperforms CLIP's data on multiple standard benchmarks. In zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy, surpassing CLIP's 68.3% on ViT-B models. Scaling to 1B data, while maintaining the same training budget, attains 72.4%. Our observations hold across various model sizes, exemplified by ViT-H achieving 80.5%, without any bells-and-whistles. Curation code and training data distribution on metadata is made available at https://github.com/facebookresearch/MetaCLIP. The paper introduces MetaCLIP, a method to reveal CLIP's data curation process by leveraging metadata (derived from CLIP's visual concepts) to create a balanced training dataset from a raw data pool. CLIP's training data is key to its success, but it is not publicly available, hindering reproducibility and further research. Existing attempts to replicate CLIP's data rely on filtering with the CLIP model, potentially introducing biases. The authors reconstruct CLIP's metadata, perform sub-string matching on a raw data pool (CommonCrawl), and then balance the distribution of data points over the metadata. MetaCLIP applied to CommonCrawl with 400M image-text pairs outperforms CLIP's data on 26 standard benchmarks. In zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy, surpassing CLIP's 68.3% on ViT-B models. Scaling MetaCLIP to 2.5B data points, while maintaining the same training budget, reaches 79.2% accuracy on ViT-L/14. The paper only uses CommonCrawl as a data source, and other sources might yield different results. Future work could explore more sophisticated methods for balancing data distribution. clip, data curation, vision-language pre-training, metadata, zero-shot learning
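A toy version of the curation recipe summarized above (sub-string matching against metadata entries, then per-entry balancing) is sketched below; the cap value, the curate function name, and the data structures are illustrative and do not reflect the released pipeline.

from collections import defaultdict
import random

def curate(pairs, metadata, cap=20_000):
    """Toy metadata-balanced curation: keep an image-text pair if its caption
    contains a metadata entry, and cap how many pairs any single entry can
    contribute. The cap is illustrative; the paper balances over ~500k entries."""
    buckets = defaultdict(list)
    for image_url, caption in pairs:
        text = caption.lower()
        for entry in metadata:                 # naive sub-string matching
            if entry in text:
                buckets[entry].append((image_url, caption))
    curated = []
    for entry, matched in buckets.items():
        random.shuffle(matched)
        curated.extend(matched[:cap])          # tail entries kept whole, head entries capped
    return curated

metadata = ["dog", "sunset", "bicycle"]
pairs = [("u1", "A dog at sunset"), ("u2", "Stock photo 123"), ("u3", "Red bicycle")]
print(curate(pairs, metadata, cap=2))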
2309.16653 Report DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, Gang Zeng Recent advances in 3D content creation mostly leverage optimization-based 3D generation via score distillation sampling (SDS). Though promising results have been exhibited, these methods often suffer from slow per-sample optimization, limiting their practical usage. In this paper, we propose DreamGaussian, a novel 3D content generation framework that achieves both efficiency and quality simultaneously. Our key insight is to design a generative 3D Gaussian Splatting model with companioned mesh extraction and texture refinement in UV space. In contrast to the occupancy pruning used in Neural Radiance Fields, we demonstrate that the progressive densification of 3D Gaussians converges significantly faster for 3D generative tasks. To further enhance the texture quality and facilitate downstream applications, we introduce an efficient algorithm to convert 3D Gaussians into textured meshes and apply a fine-tuning stage to refine the details. Extensive experiments demonstrate the superior efficiency and competitive generation quality of our proposed approach. Notably, DreamGaussian produces high-quality textured meshes in just 2 minutes from a single-view image, achieving approximately 10 times acceleration compared to existing methods. DreamGaussian, a novel 3D content generation framework that achieves fast and high-quality 3D generation by adapting 3D Gaussian Splatting into generative settings with companioned mesh extraction and texture refinement. Existing optimization-based 3D generation methods suffer from slow per-sample optimization, limiting their practical usage. 1. Adapt 3D Gaussian Splatting into generative settings for efficient 3D content creation. 2. Design an efficient mesh extraction algorithm from 3D Gaussians. 3. Propose a UV-space texture refinement stage to further enhance the generation quality. DreamGaussian significantly reduces the generation time of optimization-based 2D lifting methods for 3D content creation. DreamGaussian achieves approximately 10 times acceleration compared to existing methods, producing high-quality textured meshes in just 2 minutes from a single-view image. The proposed mesh extraction algorithm and UV-space texture refinement stage effectively enhance the generation quality of 3D content. The generated models may exhibit limitations such as the multi-face Janus problem, over-saturated texture, and baked lighting. The back-view texture generated in image-to-3D results may appear blurry. 3d generation, gaussian splatting, score distillation sampling, mesh extraction, texture refinement
2309.16633 Report Mixup Your Own Pairs Yilei Wu, Zijian Dong, Chongyao Chen, Wangchunshu Zhou, Juan Helen Zhou In representation learning, regression has traditionally received less attention than classification. Directly applying representation learning techniques designed for classification to regression often results in fragmented representations in the latent space, yielding sub-optimal performance. In this paper, we argue that the potential of contrastive learning for regression has been overshadowed due to the neglect of two crucial aspects: ordinality-awareness and hardness. To address these challenges, we advocate "mixup your own contrastive pairs for supervised contrastive regression", instead of relying solely on real/augmented samples. Specifically, we propose Supervised Contrastive Learning for Regression with Mixup (SupReMix). It takes anchor-inclusive mixtures (mixup of the anchor and a distinct negative sample) as hard negative pairs and anchor-exclusive mixtures (mixup of two distinct negative samples) as hard positive pairs at the embedding level. This strategy formulates harder contrastive pairs by integrating richer ordinal information. Through extensive experiments on six regression datasets including 2D images, volumetric images, text, tabular data, and time-series signals, coupled with theoretical analysis, we demonstrate that SupReMix pre-training fosters continuous ordered representations of regression data, resulting in significant improvement in regression performance. Furthermore, SupReMix is superior to other approaches in a range of regression challenges including transfer learning, imbalanced training data, and scenarios with fewer training samples. The paper proposes SupReMix, a supervised contrastive learning framework for regression that leverages mixup to generate hard positive and negative pairs, improving representation learning for regression tasks by considering ordinality and hardness. Directly applying contrastive learning methods designed for classification to regression often leads to suboptimal performance due to neglecting the inherent ordinal nature of regression data and the importance of hard contrastive pairs. SupReMix creates hard negative pairs using anchor-inclusive mixup (anchor and a negative sample) and hard positive pairs via anchor-exclusive mixup (two negative samples with a combined label equal to the anchor's). It also introduces distance magnifying weights for negative pairs based on label distance. SupReMix consistently outperforms baseline methods, including vanilla deep regression and other supervised contrastive learning frameworks, across six datasets with various input modalities (text, 2D/3D images, tabular data, time series). SupReMix significantly improves performance on imbalanced regression, transfer learning, and scenarios with reduced training data, demonstrating its robustness and data efficiency. Ablation studies confirm the effectiveness of each proposed component (hard negative/positive pairs, distance magnifying weights) in boosting the performance. The choice of hyperparameters, such as the mixup Beta distribution and the window size for hard positive pair generation, requires careful tuning. Future work could explore alternative methods for generating hard contrastive pairs or investigate the effectiveness of SupReMix on other regression tasks and domains. contrastive learning, regression, representation learning, mixup, ordinality-awareness
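The two mixup constructions described above can be sketched at the embedding level as follows. The mixing coefficients and the requirement that the anchor label lie between the two negative labels are assumptions of this illustration, not the paper's exact formulation.

import torch

def mixup_pairs(z_anchor, y_anchor, z_neg1, y_neg1, z_neg2, y_neg2):
    """Embedding-level mixup as sketched above (coefficients are assumptions).
    Hard negative: mix the anchor with a distinct negative.
    Hard positive: mix two negatives whose combined label equals the anchor's."""
    lam = torch.rand(())                                    # Beta sampling in the paper
    hard_neg = lam * z_anchor + (1 - lam) * z_neg1           # anchor-inclusive mixture

    # Choose the weight so that w*y_neg1 + (1-w)*y_neg2 == y_anchor (requires
    # y_anchor to lie between the two negative labels).
    w = (y_anchor - y_neg2) / (y_neg1 - y_neg2)
    hard_pos = w * z_neg1 + (1 - w) * z_neg2                 # anchor-exclusive mixture
    return hard_neg, hard_pos

z_a, z_n1, z_n2 = torch.randn(3, 128).unbind(0)
hn, hp = mixup_pairs(z_a, 30.0, z_n1, 20.0, z_n2, 50.0)      # labels, e.g. ages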
2309.16608 Report KV Inversion: KV Embeddings Learning for Text-Conditioned Real Image Action Editing Jiancheng Huang, Yifan Liu, Jin Qin, Shifeng Chen Text-conditioned image editing is a recently emerged and highly practical task, and its potential is immeasurable. However, most of the concurrent methods are unable to perform action editing, i.e. they can not produce results that conform to the action semantics of the editing prompt and preserve the content of the original image. To solve the problem of action editing, we propose KV Inversion, a method that can achieve satisfactory reconstruction performance and action editing, which can solve two major problems: 1) the edited result can match the corresponding action, and 2) the edited object can retain the texture and identity of the original real image. In addition, our method does not require training the Stable Diffusion model itself, nor does it require scanning a large-scale dataset to perform time-consuming training. Presents KV Inversion, a training-free method for text-conditioned action editing of real images using stable diffusion, focusing on preserving object identity and texture while enabling action modifications. Addresses limitations of existing image editing techniques that struggle to perform action editing on real images while maintaining fidelity to the original object's appearance. Introduces Content Preserving self-attention (CP-attn) which learns Key and Value embeddings during a tuning stage to better preserve source image content. During editing, these learned embeddings, combined with the target text prompt, guide the generation of the edited image. Achieves superior action editing results on real images compared to concurrent methods, successfully modifying actions while retaining object identity and texture. Demonstrates consistent performance across different domains, including natural images and anime-style images. Provides a more efficient alternative by eliminating the need for finetuning the diffusion model or training on extensive datasets. Editing results may be unsatisfactory if the prompted action drastically conflicts with the original image pose. Reliance on user-provided prompts without additional guidance (e.g., skeleton maps) can limit control over complex action editing. real image editing, diffusion model, text-to-image generation, action editing, content preserving
2309.16588 Report Vision Transformers Need Registers Timothée Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski Transformers have recently emerged as a powerful tool for learning visual representations. In this paper, we identify and characterize artifacts in feature maps of both supervised and self-supervised ViT networks. The artifacts correspond to high-norm tokens appearing during inference primarily in low-informative background areas of images, that are repurposed for internal computations. We propose a simple yet effective solution based on providing additional tokens to the input sequence of the Vision Transformer to fill that role. We show that this solution fixes that problem entirely for both supervised and self-supervised models, sets a new state of the art for self-supervised visual models on dense visual prediction tasks, enables object discovery methods with larger models, and most importantly leads to smoother feature maps and attention maps for downstream visual processing. The paper identifies artifacts in feature maps of Vision Transformer (ViT) networks, both supervised and self-supervised, and proposes a solution by adding register tokens to absorb these artifacts. Artifacts, corresponding to high-norm tokens in low-informative areas, negatively impact feature map smoothness and hinder object discovery methods. Addressing this issue improves performance in dense prediction tasks and enables object discovery with larger models. The authors analyze the artifacts, characterize them as high-norm tokens, and observe their emergence during training. They then propose adding register tokens to the input sequence, allowing the model to use them for internal computations instead of repurposing patch tokens. Adding register tokens effectively eliminates high-norm outlier tokens. Models trained with registers show improved performance in dense prediction tasks like semantic segmentation and depth estimation. Object discovery methods, previously incompatible with newer ViT models, become viable and show significant performance improvement when using models trained with registers. The study focuses on DINOv2 and may need further investigation for generalizability to other self-supervised methods. Future work includes exploring regularization techniques for register tokens and investigating their potential for multi-modal tasks. vision transformers, artifacts, self-supervised learning, object discovery, register tokens
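A minimal sketch of the register mechanism described above: a few extra learnable tokens are appended to the input sequence and simply discarded at the output. The dimensions, depth, and the choice not to give registers positional embeddings are illustrative assumptions, not the DINOv2 training setup.

import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    """Minimal illustration: append learnable register tokens to the input
    sequence and drop them at the output. Dimensions are arbitrary."""

    def __init__(self, dim=768, depth=4, num_registers=4, num_patches=196):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))
        self.pos = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.num_registers = num_registers

    def forward(self, patch_tokens):                     # (B, N, dim)
        b = patch_tokens.shape[0]
        x = torch.cat([self.cls.expand(b, -1, -1), patch_tokens], dim=1) + self.pos
        x = torch.cat([x, self.registers.expand(b, -1, -1)], dim=1)  # registers get no pos. emb.
        x = self.encoder(x)
        x = x[:, : -self.num_registers]                  # registers are discarded after the encoder
        return x[:, 0], x[:, 1:]                         # CLS token, patch tokens

cls_out, patches_out = ViTWithRegisters()(torch.randn(2, 196, 768))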
2309.16553 Report MatrixCity: A Large-scale City Dataset for City-scale Neural Rendering and Beyond Yixuan Li, Lihan Jiang, Linning Xu, Yuanbo Xiangli, Zhenzhi Wang, Dahua Lin, Bo Dai Neural radiance fields (NeRF) and its subsequent variants have led to remarkable progress in neural rendering. While most of recent neural rendering works focus on objects and small-scale scenes, developing neural rendering methods for city-scale scenes is of great potential in many real-world applications. However, this line of research is impeded by the absence of a comprehensive and high-quality dataset, yet collecting such a dataset over real city-scale scenes is costly, sensitive, and technically difficult. To this end, we build a large-scale, comprehensive, and high-quality synthetic dataset for city-scale neural rendering researches. Leveraging the Unreal Engine 5 City Sample project, we develop a pipeline to easily collect aerial and street city views, accompanied by ground-truth camera poses and a range of additional data modalities. Flexible controls over environmental factors like light, weather, human and car crowd are also available in our pipeline, supporting the need of various tasks covering city-scale neural rendering and beyond. The resulting pilot dataset, MatrixCity, contains 67k aerial images and 452k street images from two city maps of total size $28km^2$. On top of MatrixCity, a thorough benchmark is also conducted, which not only reveals unique challenges of the task of city-scale neural rendering, but also highlights potential improvements for future works. The dataset and code will be publicly available at our project page: https://city-super.github.io/matrixcity/. This paper introduces MatrixCity, a large-scale, high-quality synthetic dataset designed for city-scale neural rendering research. Existing datasets for neural rendering are inadequate for city-scale scenes due to limited size, diversity, controllability, and availability. MatrixCity aims to bridge this gap and facilitate research in this area. The authors leveraged Unreal Engine 5 to create MatrixCity, developing a plugin for automatic data collection and incorporating diverse urban elements, controllable environments, and multiple ground-truth properties (depth, normal, reflectance). Modeling high-rise areas in aerial data is more challenging than low-rise areas due to complex occlusions. Street-level data, being richer in detail, poses greater challenges for model capacity compared to aerial data. Fusing aerial and street-level data naively degrades performance due to significant differences in detail and viewpoint. The current dataset focuses on static scenes; incorporating dynamic elements like moving objects and lighting changes more realistically is left for future work. Exploring advanced algorithms to effectively fuse multi-view data with varying levels of detail is crucial for future research. neural rendering, city-scale 3d reconstruction, synthetic dataset, unreal engine 5, nerf
2309.16421 Report Distilling ODE Solvers of Diffusion Models into Smaller Steps Sanghwan Kim, Hao Tang, Fisher Yu Diffusion models have recently gained prominence as a novel category of generative models. Despite their success, these models face a notable drawback in terms of slow sampling speeds, requiring a high number of function evaluations (NFE) on the order of hundreds or thousands. In response, both learning-free and learning-based sampling strategies have been explored to expedite the sampling process. Learning-free sampling employs various ordinary differential equation (ODE) solvers based on the formulation of diffusion ODEs. However, it encounters challenges in faithfully tracking the true sampling trajectory, particularly for small NFE. Conversely, learning-based sampling methods, such as knowledge distillation, demand extensive additional training, limiting their practical applicability. To overcome these limitations, we introduce Distilled-ODE solvers (D-ODE solvers), a straightforward distillation approach grounded in ODE solver formulations. Our method seamlessly integrates the strengths of both learning-free and learning-based sampling. D-ODE solvers are constructed by introducing a single parameter adjustment to existing ODE solvers. Furthermore, we optimize D-ODE solvers with smaller steps using knowledge distillation from ODE solvers with larger steps across a batch of samples. Comprehensive experiments demonstrate the superior performance of D-ODE solvers compared to existing ODE solvers, including DDIM, PNDM, DPM-Solver, DEIS, and EDM, particularly in scenarios with fewer NFE. Notably, our method incurs negligible computational overhead compared to previous distillation techniques, facilitating straightforward and rapid integration with existing samplers. Qualitative analysis reveals that D-ODE solvers not only enhance image quality but also faithfully follow the target ODE trajectory. This paper introduces Distilled-ODE solvers (D-ODE solvers), a novel distillation method to enhance the efficiency of diffusion model sampling by optimizing ODE solvers with minimal additional training. Diffusion models, despite their ability to generate high-quality samples, often suffer from slow sampling speeds due to the need for numerous function evaluations. Existing solutions, such as learning-free and learning-based sampling, present limitations in terms of trajectory accuracy or excessive training requirements. D-ODE solvers address these limitations by combining the strengths of both approaches. D-ODE solvers introduce a single adjustable parameter to existing ODE solvers, linearly combining current and previous denoising network outputs. This parameter is then optimized for each dataset by minimizing the difference between the outputs of D-ODE solvers with smaller steps (student) and those of ODE solvers with larger steps (teacher). The distillation process requires only one batch of samples, making it computationally efficient. D-ODE solvers consistently outperform state-of-the-art ODE solvers in terms of FID scores across various image generation benchmarks, particularly with a smaller number of function evaluations (NFE). The method's efficiency is demonstrated through significantly reduced distillation times compared to previous distillation techniques, requiring only a few CPU minutes.
Visual analysis reveals that D-ODE solvers effectively guide the sampling process toward the target data manifold, enhancing image quality while maintaining fidelity to the original ODE trajectory. The single-parameter nature of D-ODE solvers may limit their effectiveness in generating high-resolution images. Future research could explore incorporating local-specific parameters to address this limitation. diffusion models, generative models, knowledge distillation, ode solvers, fast sampling
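A sketch of the single-parameter adjustment and one-batch distillation described above; the exact form of the linear combination, the d_ode_eps helper name, and the plain MSE objective are assumptions made for illustration rather than the paper's formulation.

import torch

def d_ode_eps(eps_curr, eps_prev, lam):
    # Single learnable parameter linearly combining current and previous
    # denoiser outputs, as described above (exact form is an assumption).
    return lam * eps_curr + (1.0 - lam) * eps_prev

lam = torch.ones((), requires_grad=True)          # start at the plain ODE solver
opt = torch.optim.Adam([lam], lr=1e-2)
# Stand-ins for one batch of student outputs and large-step teacher targets.
eps_curr = torch.randn(8, 3, 32, 32)
eps_prev = torch.randn(8, 3, 32, 32)
teacher_eps = torch.randn(8, 3, 32, 32)
for _ in range(200):                              # distillation on a single batch
    loss = torch.mean((d_ode_eps(eps_curr, eps_prev, lam) - teacher_eps) ** 2)
    opt.zero_grad()
    loss.backward()
    opt.step()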
2309.16414 Report AutoCLIP: Auto-tuning Zero-Shot Classifiers for Vision-Language Models Jan Hendrik Metzen, Piyapat Saranrittichai, Chaithanya Kumar Mummadi Classifiers built upon vision-language models such as CLIP have shown remarkable zero-shot performance across a broad range of image classification tasks. Prior work has studied different ways of automatically creating descriptor sets for every class based on prompt templates, ranging from manually engineered templates, to templates obtained from a large language model, to templates built from random words and characters. Up until now, deriving zero-shot classifiers from the respective encoded class descriptors has remained nearly unchanged, i.e., classifying to the class that maximizes the cosine similarity between its averaged encoded class descriptors and the image encoding. However, weighing all class descriptors equally can be suboptimal when certain descriptors match visual clues on a given image better than others. In this work, we propose AutoCLIP, a method for auto-tuning zero-shot classifiers. AutoCLIP tunes per-image weights for each prompt template at inference time, based on statistics of class descriptor-image similarities. AutoCLIP is fully unsupervised, has very low computational overhead, and can be easily implemented in a few lines of code. We show that AutoCLIP outperforms baselines across a broad range of vision-language models, datasets, and prompt templates consistently and by up to 3 percentage points in accuracy. This paper introduces AutoCLIP, a method for auto-tuning zero-shot image classifiers built on top of vision-language models (VLMs). AutoCLIP adapts the weights of prompt templates at test time based on the similarity between the encoded image and class descriptors. Zero-shot classifiers based on VLMs rely heavily on prompt engineering. Existing methods either use fixed prompts or employ computationally expensive test-time prompt tuning. AutoCLIP offers a more efficient alternative by dynamically adjusting prompt weights for each image, enhancing the zero-shot classifier's performance. AutoCLIP leverages the embedding space of VLMs. It computes weights for prompt templates based on the similarity between encoded class descriptors and the encoded image. These weights, determined through gradient ascent on a logsumexp aggregation of similarities, are then used to compute weighted class queries for classification. AutoCLIP consistently outperforms baseline zero-shot classifiers on a wide range of datasets, VLMs, and prompt templates, particularly with larger and more diverse sets of prompts. AutoCLIP shows an average accuracy improvement of 0.45 percentage points and up to 3 percentage points in certain settings, with minimal computational overhead. Ablation studies confirm the effectiveness of the logsumexp aggregation and the robustness of AutoCLIP to the choice of the entropy reduction factor for the step size. The default entropy reduction factor, while generally robust, could be suboptimal for some datasets, suggesting further exploration. Future work could explore extending AutoCLIP to other zero-shot tasks like object detection and multi-modal prompting. zero-shot learning, vision-language models, prompt engineering, test-time adaptation, image classification
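One plausible instantiation of the weighting scheme described above is sketched below: per-template weights come from a softmax over parameters updated with gradient ascent on a logsumexp aggregation of class scores. The fixed step size stands in for the paper's entropy-reduction rule, and the exact objective is an assumption of this sketch.

import torch

def autoclip_logits(image_emb, descriptor_embs, steps=1, lr=0.1):
    """Sketch of per-image prompt weighting; the step-size rule is simplified
    to a fixed learning rate. image_emb: (D,);
    descriptor_embs: (K, C, D) = K prompt templates x C classes."""
    sims = torch.einsum("kcd,d->kc", descriptor_embs, image_emb)   # per-template class similarities
    rho = torch.zeros(descriptor_embs.shape[0], requires_grad=True)
    for _ in range(steps):
        w = torch.softmax(rho, dim=0)                 # one weight per prompt template
        class_scores = torch.einsum("k,kc->c", w, sims)
        objective = torch.logsumexp(class_scores, dim=0)   # logsumexp aggregation
        objective.backward()
        with torch.no_grad():
            rho += lr * rho.grad                      # gradient ascent on the objective
            rho.grad = None
    w = torch.softmax(rho.detach(), dim=0)
    return torch.einsum("k,kc->c", w, sims)           # final weighted class scores

D, C, K = 512, 10, 7
logits = autoclip_logits(torch.randn(D), torch.randn(K, C, D))
pred = logits.argmax()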
2309.16364 Report FG-NeRF: Flow-GAN based Probabilistic Neural Radiance Field for Independence-Assumption-Free Uncertainty Estimation Songlin Wei, Jiazhao Zhang, Yang Wang, Fanbo Xiang, Hao Su, He Wang Neural radiance fields with stochasticity have garnered significant interest by enabling the sampling of plausible radiance fields and quantifying uncertainty for downstream tasks. Existing works rely on the independence assumption of points in the radiance field or the pixels in input views to obtain tractable forms of the probability density function. However, this assumption inadvertently impacts performance when dealing with intricate geometry and texture. In this work, we propose an independence-assumption-free probabilistic neural radiance field based on Flow-GAN. By combining the generative capability of adversarial learning and the powerful expressivity of normalizing flow, our method explicitly models the density-radiance distribution of the whole scene. We represent our probabilistic NeRF as a mean-shifted probabilistic residual neural model. Our model is trained without an explicit likelihood function, thereby avoiding the independence assumption. Specifically, we downsample the training images with different strides and centers to form fixed-size patches, which are used to train the generator with patch-based adversarial learning. Through extensive experiments, our method demonstrates state-of-the-art performance by predicting lower rendering errors and more reliable uncertainty on both synthetic and real-world datasets. This paper proposes Flow-GAN NeRF (FG-NeRF), a novel probabilistic neural radiance field that leverages adversarial learning and normalizing flows to estimate uncertainty in neural scene representations without relying on independence assumptions. Estimating uncertainty in neural radiance fields is crucial for applications like robotics, autonomous driving, and human-computer interaction where understanding the confidence of predictions is essential. Existing methods often make strong independence assumptions that limit their accuracy, especially in complex scenes. FG-NeRF decomposes the radiance field into deterministic and probabilistic branches, with the latter implemented using conditional normalizing flow. It employs a GAN framework where the generator synthesizes image patches by volume rendering samples from the learned distribution, while the discriminator distinguishes them from real patches. This adversarial training scheme allows FG-NeRF to learn complex density-radiance distributions without relying on explicit likelihood computations or independence assumptions. FG-NeRF achieves state-of-the-art uncertainty estimation performance on LLFF, ScanNet, and Replica datasets, outperforming previous methods like S-NeRF and CF-NeRF. The method effectively captures intricate geometry and appearance details, resulting in high-quality uncertainty maps that highlight uncertain regions like object edges and areas with high-frequency textures. Ablation studies demonstrate the effectiveness of key components like adversarial learning and the deterministic branch in improving uncertainty estimation and rendering quality. FG-NeRF is computationally expensive, requiring significant resources for training even with acceleration techniques like multi-level hash encoding.
The rendering quality, while good, is not on par with the latest advancements in NeRF rendering, leaving room for improvement by incorporating scene priors, advanced training strategies, and novel network architectures. neural radiance fields, uncertainty estimation, generative adversarial networks, normalizing flows, scene representation
2309.16354 Report Transformer-VQ: Linear-Time Transformers via Vector Quantization Lucas D. Lingle We introduce Transformer-VQ, a decoder-only transformer computing softmax-based dense self-attention in linear time. Transformer-VQ's efficient attention is enabled by vector-quantized keys and a novel caching mechanism. In our large-scale experiments, Transformer-VQ is shown highly competitive in quality, obtaining 0.99 bpb on Enwik8, 26.6 ppl on PG-19, and 3.16 bpb on ImageNet64. In addition, the optimized implementation of Transformer-VQ is over 3x faster than a comparable quadratic-time transformer at sequence length 8k, is over 12x faster at 32k, and can scale to 131k with similar throughput. Code available: \url{https://github.com/transformer-vq/transformer_vq} Transformer-VQ, a decoder-only Transformer that uses vector-quantized keys to compute dense self-attention in linear time, is introduced. Standard Transformers have quadratic time complexity, limiting their practicality for long sequences. Efficient Transformers are crucial for long-context applications. Transformer-VQ combines vector-quantized keys, localized positional biases, and a compressive cache for efficient attention. Theorems prove the linear-time complexity and equivalence to dense attention. The model is trained with a cross-entropy loss and a codebook commitment loss. Transformer-VQ achieves 0.99 bpb on Enwik8, matching Transformer-XL with fewer parameters and a shorter cache. It achieves 26.6 ppl on PG-19, near state-of-the-art, showing the efficacy of standalone self-attention for long sequences. It sets a new state-of-the-art of 3.16 bpb on ImageNet64, generating high-fidelity samples in linear time. Overfitting was a significant issue on Enwik8, requiring careful tuning of regularization parameters. Future work includes exploring formal scaling laws, larger models, and porting to lower-level frameworks. transformer, linear-time attention, vector quantization, long-range dependencies, efficient transformers
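The key quantization at the heart of the method can be sketched as nearest-codeword assignment with a straight-through gradient. Codebook learning, the commitment loss, and the compressive cache that actually yield linear-time attention are omitted, so this is only an illustrative fragment.

import torch

def quantize_keys(keys, codebook):
    """Nearest-codeword assignment of attention keys, as described above.
    keys: (B, T, D); codebook: (S, D)."""
    d2 = (keys.unsqueeze(-2) - codebook).pow(2).sum(-1)   # (B, T, S) squared distances
    codes = d2.argmin(dim=-1)                             # (B, T) codeword indices
    quantized = codebook[codes]                           # (B, T, D) quantized keys
    # Straight-through estimator: forward uses the codewords, gradients flow to raw keys.
    return keys + (quantized - keys).detach(), codes

keys = torch.randn(2, 16, 64, requires_grad=True)
codebook = torch.randn(512, 64)
k_vq, codes = quantize_keys(keys, codebook)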
2309.16351 Report Dark Side Augmentation: Generating Diverse Night Examples for Metric Learning Albert Mohwald, Tomas Jenicek, Ondřej Chum Image retrieval methods based on CNN descriptors rely on metric learning from a large number of diverse examples of positive and negative image pairs. Domains, such as night-time images, with limited availability and variability of training data suffer from poor retrieval performance even with methods performing well on standard benchmarks. We propose to train a GAN-based synthetic-image generator, translating available day-time image examples into night images. Such a generator is used in metric learning as a form of augmentation, supplying training data to the scarce domain. Various types of generators are evaluated and analyzed. We contribute with a novel light-weight GAN architecture that enforces the consistency between the original and translated image through edge consistency. The proposed architecture also allows a simultaneous training of an edge detector that operates on both night and day images. To further increase the variability in the training examples and to maximize the generalization of the trained model, we propose a novel method of diverse anchor mining. The proposed method improves over the state-of-the-art results on a standard Tokyo 24/7 day-night retrieval benchmark while preserving the performance on Oxford and Paris datasets. This is achieved without the need of training image pairs of matching day and night images. The source code is available at https://github.com/mohwald/gandtr . This paper introduces a novel approach for training deep neural networks to generate image descriptors for day-night illumination-invariant image retrieval, utilizing synthetically generated night images instead of relying on corresponding pairs of night and day training images. This method addresses the challenge of limited availability and variability of training data for night-time images in image retrieval tasks, leading to improved performance in day-night retrieval scenarios. The proposed method utilizes a GAN-based synthetic-image generator to translate day-time images into night images for augmenting the training data. The authors propose a novel light-weight GAN architecture that enforces consistency between the original and translated image through edge consistency and enables simultaneous training of an edge detector effective on both night and day images. Additionally, a diverse anchor mining method is introduced to enhance the variability of training examples. The method surpasses previous approaches, including those using ground-truth day-night image pairs, in retrieval performance. A larger diversity of synthesized training data proves more beneficial than a smaller set of real training data. The proposed light-weight generator, utilizing edge consistency, demonstrates comparable performance to more computationally intensive generators while training significantly faster. The impact of using different edge detectors on the generator's performance requires further investigation. Exploring the potential benefits of combining the proposed method with other domain adaptation techniques is a promising direction for future research. image retrieval, generative adversarial networks, data augmentation, illumination invariance, metric learning
2309.16108 Report Channel Vision Transformers: An Image Is Worth 1 x 16 x 16 Words Yujia Bao, Srinivasan Sivanandan, Theofanis Karaletsos Vision Transformer (ViT) has emerged as a powerful architecture in the realm of modern computer vision. However, its application in certain imaging fields, such as microscopy and satellite imaging, presents unique challenges. In these domains, images often contain multiple channels, each carrying semantically distinct and independent information. Furthermore, the model must demonstrate robustness to sparsity in input channels, as they may not be densely available during training or testing. In this paper, we propose a modification to the ViT architecture that enhances reasoning across the input channels and introduce Hierarchical Channel Sampling (HCS) as an additional regularization technique to ensure robustness when only partial channels are presented during test time. Our proposed model, ChannelViT, constructs patch tokens independently from each input channel and utilizes a learnable channel embedding that is added to the patch tokens, similar to positional embeddings. We evaluate the performance of ChannelViT on ImageNet, JUMP-CP (microscopy cell imaging), and So2Sat (satellite imaging). Our results show that ChannelViT outperforms ViT on classification tasks and generalizes well, even when a subset of input channels is used during testing. Across our experiments, HCS proves to be a powerful regularizer, independent of the architecture employed, suggesting itself as a straightforward technique for robust ViT training. Lastly, we find that ChannelViT generalizes effectively even when there is limited access to all channels during training, highlighting its potential for multi-channel imaging under real-world conditions with sparse sensors. Our code is available at https://github.com/insitro/ChannelViT. This paper proposes ChannelViT, a Vision Transformer (ViT) modification for multi-channel imaging, which improves channel reasoning and handles sparse channel availability. ViTs struggle with multi-channel imaging due to semantically distinct information in each channel and potential sparsity in channel availability during training and testing. ChannelViT generates separate patch tokens per channel, uses learnable channel embeddings, and employs Hierarchical Channel Sampling (HCS) for robustness during sparse channel training. ChannelViT outperforms ViT on ImageNet, JUMP-CP (microscopy), and So2Sat (satellite) datasets. HCS significantly improves channel robustness, enabling models to generalize well to unseen channel combinations. ChannelViT demonstrates data efficiency, performing well even with limited access to all channels during training. ChannelViT's increased sequence length raises computational cost. Future work includes exploring more efficient attention mechanisms to reduce computational overhead. vision transformer, multi-channel imaging, channel robustness, hierarchical channel sampling, self-supervised learning
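Hierarchical Channel Sampling, as described above, can be sketched in two draws: first how many channels to keep, then which ones. The uniform distributions used here are an assumption of this illustration rather than the paper's exact sampling scheme.

import torch

def hierarchical_channel_sampling(image):
    """Two-step channel dropout described above: first draw how many channels to
    keep, then draw which ones. image: (C, H, W). Returns kept channels + indices."""
    c = image.shape[0]
    m = torch.randint(1, c + 1, ()).item()          # step 1: number of channels to keep
    idx = torch.randperm(c)[:m].sort().values        # step 2: uniform subset of that size
    return image[idx], idx

img = torch.randn(8, 224, 224)                       # e.g. an 8-channel cell image
subset, channels = hierarchical_channel_sampling(img)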
2309.15954 Report The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering Haichao Yu, Yu Tian, Sateesh Kumar, Linjie Yang, Heng Wang The quality of pre-training data plays a critical role in the performance of foundation models. Popular foundation models often design their own recipe for data filtering, which makes it hard to analyze and compare different data filtering approaches. DataComp is a new benchmark dedicated to evaluating different methods for data filtering. This paper describes our learning and solution when participating in the DataComp challenge. Our filtering strategy includes three stages: single-modality filtering, cross-modality filtering, and data distribution alignment. We integrate existing methods and propose new solutions, such as computing CLIP score on horizontally flipped images to mitigate the interference of scene text, using vision and language models to retrieve training samples for target downstream tasks, rebalancing the data distribution to improve the efficiency of allocating the computational budget, etc. We slice and dice our design choices, provide in-depth analysis, and discuss open questions. Our approach outperforms the best method from the DataComp paper by over 4% on the average performance of 38 tasks and by over 2% on ImageNet. This paper presents a three-stage data filtering framework for multi-modal pre-training, aiming to enhance the performance of foundation models on various vision and language tasks. Data quality is crucial for foundation model performance, and understanding effective filtering strategies is vital for improving these models. The framework consists of: (1) Single-modality filtering (image/text quality), (2) Cross-modality filtering (image-text similarity using flipped-CLIP and BLIP-ITM), and (3) Distribution alignment (dataset diversity, computational budget allocation, and downstream task alignment). The proposed approach outperforms the best method from the DataComp paper by over 4% on average across 38 tasks. Flipped-CLIP score is found to be a more effective filtering metric compared to the original CLIP score. Aligning pre-training data distribution with downstream tasks, especially for digit recognition, significantly improves performance. The generalization of data filtering methods to different data distributions requires further investigation. A better balance is needed when filtering images containing scene text to benefit tasks requiring such information. data filtering, multi-modal learning, foundation models, dataset curation, clip
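The flipped-CLIP heuristic mentioned above amounts to scoring the image-text pair on a horizontally mirrored image, so rendered scene text matches the caption less easily. The sketch below uses a placeholder encoder (a random projection) because any CLIP image encoder could be dropped in; the encoder and shapes are assumptions of this illustration.

import torch
import torch.nn.functional as F

def flipped_clip_score(image, text_emb, encode_image):
    """Image-text cosine similarity computed on a horizontally flipped image.
    `encode_image` is a placeholder for any CLIP image encoder; image: (C, H, W)."""
    flipped = torch.flip(image, dims=[-1])            # horizontal flip mirrors scene text
    img_emb = encode_image(flipped.unsqueeze(0))      # (1, D)
    return F.cosine_similarity(img_emb, text_emb.unsqueeze(0)).item()

# Usage with any encoder that maps (B, 3, H, W) -> (B, D); a random projection here.
proj = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 512))
score = flipped_clip_score(torch.rand(3, 224, 224), torch.randn(512), proj)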
2309.15842 Report Exploiting the Signal-Leak Bias in Diffusion Models Martin Nicolas Everaert, Athanasios Fitsios, Marco Bocchio, Sami Arpa, Sabine Süsstrunk, Radhakrishna Achanta There is a bias in the inference pipeline of most diffusion models. This bias arises from a signal leak whose distribution deviates from the noise distribution, creating a discrepancy between training and inference processes. We demonstrate that this signal-leak bias is particularly significant when models are tuned to a specific style, causing sub-optimal style matching. Recent research tries to avoid the signal leakage during training. We instead show how we can exploit this signal-leak bias in existing diffusion models to allow more control over the generated images. This enables us to generate images with more varied brightness, and images that better match a desired style or color. By modeling the distribution of the signal leak in the spatial frequency and pixel domains, and including a signal leak in the initial latent, we generate images that better match expected results without any additional training. This paper analyzes the signal-leak bias in diffusion models, showing it is caused by discrepancies between noise and data distributions at the last training timestep, and proposes a method to exploit it for controlling image generation. The signal-leak bias limits control over generated images, particularly affecting style adaptation and brightness diversity. This paper offers a simple solution for more controllable and varied image generation. The proposed method models the distribution of the signal leak from target images, either in the pixel or frequency domain. This distribution is then used to sample the initial latent at inference time, effectively biasing the generated image towards desired characteristics. Significantly improves style matching in models fine-tuned for specific styles without additional training. Enables style-specific image generation with non-style-specific diffusion models by leveraging the signal leak. Generates images with more diverse brightness and color variations by modeling low-frequency components of the signal leak. The proposed method relies on random sampling of the signal leak, potentially limiting control over brightness matching with the textual prompt. Certain specific styles may not be easily captured by the proposed pixel-domain model, requiring alternative modeling or fine-tuning. diffusion models, signal-leak bias, style adaptation, image generation, frequency domain analysis
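A sketch of how a signal leak can be injected into the initial latent at inference time, following the standard DDPM forward marginal x_T = sqrt(alpha_bar_T) x_0 + sqrt(1 - alpha_bar_T) eps. The per-channel Gaussian leak model, its statistics, and the alpha_bar_T value (roughly 0.0047 for Stable Diffusion's schedule) are assumptions made for illustration.

import torch

def initial_latent_with_leak(shape, alpha_bar_T, sample_leak):
    """Build the inference-time starting latent with an explicit signal-leak term.
    `sample_leak` draws from a leak distribution fitted offline (e.g. per style);
    the fitting itself is not shown here."""
    leak = sample_leak(shape)                       # low-frequency / style statistics
    noise = torch.randn(shape)
    a = torch.tensor(alpha_bar_T)
    return torch.sqrt(a) * leak + torch.sqrt(1.0 - a) * noise

# Toy leak model: per-channel Gaussian fitted to target-style latents (made-up numbers).
mean = torch.tensor([0.3, -0.1, 0.2, 0.0]).view(1, 4, 1, 1)
std = torch.tensor([0.5, 0.4, 0.6, 0.5]).view(1, 4, 1, 1)
sample_leak = lambda shape: mean + std * torch.randn(shape)
z_T = initial_latent_with_leak((1, 4, 64, 64), alpha_bar_T=0.0047, sample_leak=sample_leak)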
2309.15830 Report OrthoPlanes: A Novel Representation for Better 3D-Awareness of GANs Honglin He, Zhuoqian Yang, Shikai Li, Bo Dai, Wayne Wu We present a new method for generating realistic and view-consistent images with fine geometry from 2D image collections. Our method proposes a hybrid explicit-implicit representation called OrthoPlanes, which encodes fine-grained 3D information in feature maps that can be efficiently generated by modifying 2D StyleGANs. Compared to previous representations, our method has better scalability and expressiveness with clear and explicit information. As a result, our method can handle more challenging view-angles and synthesize articulated objects with a high spatial degree of freedom. Experiments demonstrate that our method achieves state-of-the-art results on FFHQ and SHHQ datasets, both quantitatively and qualitatively. Project page: https://orthoplanes.github.io/. Presents OrthoPlanes, a hybrid explicit-implicit 3D representation, to improve 3D awareness and geometry quality in 2D GANs, particularly for articulated objects. Existing 3D-aware GANs struggle to accurately reconstruct complex 3D shapes from 2D images, limiting realism and view-consistency, especially for non-rigid objects. Uses StyleGAN2 to generate feature maps representing sectional projections of a 3D scene onto groups of parallel planes. Location embeddings enhance spatial awareness and a lightweight MLP decodes features for volumetric rendering. Achieves state-of-the-art results on FFHQ and SHHQ datasets for 3D-aware image synthesis. Exhibits superior view-consistency, handling challenging angles better than previous methods. Demonstrates improved geometry reconstruction, especially for articulated objects like human bodies. Background artifacts persist, requiring further exploration of modeling assumptions. Inconsistencies under view variations due to the two-stage rendering process, suggesting direct RGB rendering as future work. 3d-aware gans, image synthesis, neural rendering, view consistency, 3d reconstruction
2309.15818 Report Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, Mike Zheng Shou Significant advancements have been achieved in the realm of large-scale pre-trained text-to-video Diffusion Models (VDMs). However, previous methods either rely solely on pixel-based VDMs, which come with high computational costs, or on latent-based VDMs, which often struggle with precise text-video alignment. In this paper, we are the first to propose a hybrid model, dubbed as Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation. Our model first uses pixel-based VDMs to produce a low-resolution video of strong text-video correlation. After that, we propose a novel expert translation method that employs the latent-based VDMs to further upsample the low-resolution video to high resolution. Compared to latent VDMs, Show-1 can produce high-quality videos of precise text-video alignment; Compared to pixel VDMs, Show-1 is much more efficient (GPU memory usage during inference is 15G vs 72G). We also validate our model on standard video generation benchmarks. Our code and model weights are publicly available at https://github.com/showlab/Show-1. Show-1, a hybrid text-to-video generation model that combines the strengths of pixel-based and latent-based Video Diffusion Models (VDMs). Existing pixel-based VDMs are computationally expensive while latent-based VDMs often struggle with text-video alignment. The model uses pixel-based VDMs for low-resolution keyframe and temporal interpolation generation, ensuring strong text-video correlation. It then employs a novel expert translation method based on latent-based VDMs for efficient upsampling to high resolution. Show-1 generates high-quality videos with precise text-video alignment. It achieves state-of-the-art performance on UCF-101 and MSR-VTT benchmarks. The model offers significant efficiency improvements, requiring only 15GB of GPU memory during inference compared to 72GB for pixel-based methods. The model's reliance on web data for training may lead to biases in the generated content. Future work could explore methods for mitigating bias and further improving the model's efficiency. text-to-video generation, diffusion models, video synthesis, deep learning, computer vision
2309.15807 Report Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiaofang Wang, Abhimanyu Dubey, Matthew Yu, Abhishek Kadian, Filip Radenovic, Dhruv Mahajan, Kunpeng Li, Yue Zhao, Vladan Petrovic, Mitesh Kumar Singh, Simran Motwani, Yi Wen, Yiwen Song, Roshan Sumbaly, Vignesh Ramanathan, Zijian He, Peter Vajda, Devi Parikh Training text-to-image models with web scale image-text pairs enables the generation of a wide range of visual concepts from text. However, these pre-trained models often face challenges when it comes to generating highly aesthetic images. This creates the need for aesthetic alignment post pre-training. In this paper, we propose quality-tuning to effectively guide a pre-trained model to exclusively generate highly visually appealing images, while maintaining generality across visual concepts. Our key insight is that supervised fine-tuning with a set of surprisingly small but extremely visually appealing images can significantly improve the generation quality. We pre-train a latent diffusion model on 1.1 billion image-text pairs and fine-tune it with only a few thousand carefully selected high-quality images. The resulting model, Emu, achieves a win rate of 82.9% compared with its pre-trained only counterpart. Compared to the state-of-the-art SDXLv1.0, Emu is preferred 68.4% and 71.3% of the time on visual appeal on the standard PartiPrompts and our Open User Input benchmark based on the real-world usage of text-to-image models. In addition, we show that quality-tuning is a generic approach that is also effective for other architectures, including pixel diffusion and masked generative transformer models. This paper proposes "quality-tuning", a method for enhancing the aesthetics of text-to-image models by fine-tuning them on a small, curated dataset of high-quality images. Pre-trained text-to-image models often struggle to consistently generate visually appealing images. This quality-tuning approach addresses the need for improved aesthetic alignment post pre-training. The authors pre-train a latent diffusion model on 1.1 billion image-text pairs. They then fine-tune this model using a dataset of 2000 meticulously selected, high-quality images. This dataset is curated through a combination of automatic filtering and rigorous two-stage human evaluation based on photographic principles. Quality-tuning significantly improves the visual appeal of generated images, outperforming a state-of-the-art SDXLv1.0 model. The effectiveness of quality-tuning is demonstrated even with a surprisingly small fine-tuning dataset, emphasizing the importance of quality over quantity. The approach is generalizable and shows improvements on other architectures, including pixel diffusion and masked generative transformer models. Human evaluation of aesthetics is inherently subjective and may vary based on prompts, annotators, and guidelines. Despite improved aesthetics, limitations from the pre-training stage might persist, such as difficulty generating specific objects not well-represented during pre-training. text-to-image generation, aesthetic alignment, quality-tuning, latent diffusion model, fine-tuning
2309.15664 Report Dynamic Prompt Learning: Addressing Cross-Attention Leakage for Text-Based Image Editing Kai Wang, Fei Yang, Shiqi Yang, Muhammad Atif Butt, Joost van de Weijer Large-scale text-to-image generative models have been a ground-breaking development in generative AI, with diffusion models showing their astounding ability to synthesize convincing images following an input text prompt. The goal of image editing research is to give users control over the generated images by modifying the text prompt. Current image editing techniques are susceptible to unintended modifications of regions outside the targeted area, such as on the background or on distractor objects which have some semantic or visual relationship with the targeted object. According to our experimental findings, inaccurate cross-attention maps are at the root of this problem. Based on this observation, we propose Dynamic Prompt Learning (DPL) to force cross-attention maps to focus on correct noun words in the text prompt. By updating the dynamic tokens for nouns in the textual input with the proposed leakage repairment losses, we achieve fine-grained image editing over particular objects while preventing undesired changes to other image regions. Our method DPL, based on the publicly available Stable Diffusion, is extensively evaluated on a wide range of images, and consistently obtains superior results both quantitatively (CLIP score, Structure-Dist) and qualitatively (on user-evaluation). We show improved prompt editing results for Word-Swap, Prompt Refinement, and Attention Re-weighting, especially for complex multi-object scenes. This paper proposes Dynamic Prompt Learning (DPL), a method to improve text-guided image editing in diffusion models by addressing cross-attention leakage to background and distractor objects. Current text-guided image editing techniques often produce undesired modifications outside the targeted area due to inaccurate cross-attention maps. DPL optimizes dynamic tokens for nouns in the text prompt using three losses: Disjoint Object Attention Loss, Background Leakage Loss, and Attention Balancing Loss. It leverages DDIM and Null-Text inversion for image reconstruction and editing. DPL significantly improves the accuracy of cross-attention maps compared to the baseline (Null-Text Inversion). DPL achieves superior quantitative results in image editing tasks like Word-Swap, as measured by CLIP-Score and Structure-Dist. User studies confirm that DPL leads to significantly better image editing results compared to the baseline, especially for complex multi-object scenes. The reliance on smaller cross-attention maps (16x16) limits fine-grained structure control. The method currently doesn't handle scenarios where multiple noun words in the prompt refer to the same object. image editing, diffusion models, text-to-image, cross-attention, stable diffusion
2309.15508 Report DreamCom: Finetuning Text-guided Inpainting Model for Image Composition Lingxiao Lu, Jiangtong Li, Bo Zhang, Li Niu The goal of image composition is merging a foreground object into a background image to obtain a realistic composite image. Recently, generative composition methods are built on large pretrained diffusion models, due to their unprecedented image generation ability. However, they are weak in preserving the foreground object details. Inspired by recent text-to-image generation customized for certain object, we propose DreamCom by treating image composition as text-guided image inpainting customized for certain object. Specifically, we finetune a pretrained text-guided image inpainting model based on a few reference images containing the same object, during which the text prompt contains a special token associated with this object. Then, given a new background, we can insert this object into the background with the text prompt containing the special token. In practice, the inserted object may be adversely affected by the background, so we propose masked attention mechanisms to avoid negative background interference. Experimental results on DreamEditBench and our contributed MureCom dataset show the outstanding performance of our DreamCom. This supplementary material for the DreamCom paper further explores the impact of masked self-attention in the model and provides additional experiments and comparisons. The supplementary materials provide deeper insights into the DreamCom model's effectiveness for image composition by analyzing the roles of different components and comparing it with other state-of-the-art methods. The authors conducted ablation studies on masking self-attention layers, experimented with varying numbers of reference images, and presented a visual comparison of DreamCom with baselines like DreamEdit, ObjectStitch, and PbE. Masking outer decoder self-attention layers hurts foreground-background compatibility, while masking other layers helps prevent color leakage. Increasing the number of reference images generally improves performance, with 4 or 5 images yielding similar results. DreamCom outperforms baselines by achieving better foreground-background integration in terms of pose and lighting and preserving foreground details. The impact of the number of reference images might vary depending on the background complexity. Future work could explore automatically determining the optimal number of reference images. image composition, text-guided inpainting, self-attention, reference images, dreamcom
2309.15505 Report Finite Scalar Quantization: VQ-VAE Made Simple Fabian Mentzer, David Minnen, Eirikur Agustsson, Michael Tschannen We propose to replace vector quantization (VQ) in the latent representation of VQ-VAEs with a simple scheme termed finite scalar quantization (FSQ), where we project the VAE representation down to a few dimensions (typically less than 10). Each dimension is quantized to a small set of fixed values, leading to an (implicit) codebook given by the product of these sets. By appropriately choosing the number of dimensions and values each dimension can take, we obtain the same codebook size as in VQ. On top of such discrete representations, we can train the same models that have been trained on VQ-VAE representations. For example, autoregressive and masked transformer models for image generation, multimodal generation, and dense prediction computer vision tasks. Concretely, we employ FSQ with MaskGIT for image generation, and with UViM for depth estimation, colorization, and panoptic segmentation. Despite the much simpler design of FSQ, we obtain competitive performance in all these tasks. We emphasize that FSQ does not suffer from codebook collapse and does not need the complex machinery employed in VQ (commitment losses, codebook reseeding, code splitting, entropy penalties, etc.) to learn expressive discrete representations. This paper proposes replacing vector quantization (VQ) in VQ-VAEs with finite scalar quantization (FSQ), a simpler method that projects representations to low dimensions and quantizes each dimension to a small set of values. VQ suffers from codebook collapse and requires complex techniques for optimization. FSQ simplifies the process while aiming for high codebook utilization without auxiliary losses. The authors replace VQ with FSQ in MaskGIT (for image generation) and UViM (for depth estimation, colorization, and segmentation), comparing their performance across various codebook sizes. FSQ achieves comparable performance to VQ in image generation and dense computer vision tasks. FSQ exhibits high codebook utilization (almost 100%) without auxiliary losses, unlike VQ which struggles with larger codebooks. FSQ performance scales with codebook size, leading to better reconstruction and sample quality. The study primarily focuses on MaskGIT and UViM, potentially limiting the generalizability of the findings. Further investigation into the semantic properties of FSQ representations is needed. vector quantization, vq-vae, image generation, dense prediction, representation learning
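FSQ is simple enough to sketch in a few lines. The snippet below is a minimal illustration with an assumed configuration of odd per-dimension levels (the paper also handles even level counts with an extra offset) and a straight-through estimator; it is not the authors' released implementation.

```python
import torch

def fsq_quantize(z: torch.Tensor, levels=(5, 5, 5, 5)) -> torch.Tensor:
    """Finite scalar quantization sketch (assumed odd levels, not the paper's exact config).

    z: (..., d) latent with d == len(levels). Each dimension is bounded with tanh,
    rescaled to its number of levels, and rounded to the nearest integer level.
    The implicit codebook is the product of the per-dimension sets (here 5**4 = 625 codes).
    """
    L = torch.tensor(levels, dtype=z.dtype, device=z.device)   # levels per dimension
    half = (L - 1) / 2
    bounded = torch.tanh(z) * half                             # each dim in (-half, half)
    quantized = torch.round(bounded)                           # snap to the integer grid
    # Straight-through estimator: forward pass uses the rounded values, backward pass
    # routes gradients through the continuous (bounded) branch unchanged.
    return bounded + (quantized - bounded).detach()

# toy usage: gradients flow despite the rounding step
z = torch.randn(2, 16, 4, requires_grad=True)   # batch of 16 tokens, 4 FSQ dimensions
zq = fsq_quantize(z)
zq.sum().backward()
```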
2309.15275 Report Efficient Low-rank Backpropagation for Vision Transformer Adaptation Yuedong Yang, Hung-Yueh Chiang, Guihong Li, Diana Marculescu, Radu Marculescu The increasing scale of vision transformers (ViT) has made the efficient fine-tuning of these large models for specific needs a significant challenge in various applications. This issue originates from the computationally demanding matrix multiplications required during the backpropagation process through linear layers in ViT. In this paper, we tackle this problem by proposing a new Low-rank BackPropagation via Walsh-Hadamard Transformation (LBP-WHT) method. Intuitively, LBP-WHT projects the gradient into a low-rank space and carries out backpropagation. This approach substantially reduces the computation needed for adapting ViT, as matrix multiplication in the low-rank space is far less resource-intensive. We conduct extensive experiments with different models (ViT, hybrid convolution-ViT model) on multiple datasets to demonstrate the effectiveness of our method. For instance, when adapting an EfficientFormer-L1 model on CIFAR100, our LBP-WHT achieves 10.4% higher accuracy than the state-of-the-art baseline, while requiring 9 MFLOPs less computation. As the first work to accelerate ViT adaptation with low-rank backpropagation, our LBP-WHT method is complementary to many prior efforts and can be combined with them for better performance. This paper proposes LBP-WHT, a novel low-rank backpropagation method using Walsh-Hadamard Transformation to accelerate the adaptation of Vision Transformers (ViT) for specific tasks. Adapting large-scale ViT models, especially on resource-constrained edge devices, is challenging due to the computational complexity of backpropagation through dense linear layers. LBP-WHT projects gradients into a low-rank space using WHT, performs efficient matrix multiplications in this reduced space, and finally projects the results back to the original space. This reduces computational cost while maintaining accuracy. LBP-WHT consistently outperforms LoRA-based methods in terms of both speed and accuracy across various image classification and semantic segmentation tasks. The method offers flexibility in balancing accuracy and computational cost by adjusting the rank used for low-rank projection. LBP-WHT with carefully chosen ranks can achieve accuracy comparable to or even exceeding that of full-rank backpropagation while significantly reducing computational requirements. Accuracy degradation is observed when using a very small number of ranks for full model training with LBP-WHT. Further research on improved approximation methods could potentially mitigate this issue. vision transformer, model adaptation, low-rank backpropagation, walsh-hadamard transform, efficient training
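The core idea — computing a linear layer's weight gradient in a low-rank Walsh-Hadamard space — can be sketched as follows. Projecting over the token dimension and the simple rank cutoff are illustrative simplifications of the paper's 2D patch-wise transform, not the released code.

```python
import torch

def hadamard(n: int) -> torch.Tensor:
    """Sylvester construction of an n x n Walsh-Hadamard matrix (n must be a power of 2)."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1), torch.cat([H, -H], dim=1)], dim=0)
    return H

def lowrank_weight_grad(x: torch.Tensor, grad_y: torch.Tensor, rank: int) -> torch.Tensor:
    """Approximate grad_W for a linear layer y = x @ W.T using a rank-`rank` WHT projection.

    x:      (n, in_features)  layer input (n tokens, assumed to be a power of 2)
    grad_y: (n, out_features) gradient w.r.t. the layer output
    The exact gradient is grad_y.T @ x; here both factors are first projected onto the
    first `rank` Walsh-Hadamard bases over the token dimension, so the expensive matmul
    runs in the rank-r space. At full rank the result is exact, since H @ H.T = n * I.
    """
    n = x.shape[0]
    P = hadamard(n)[:rank] / n ** 0.5          # (rank, n) projection bases
    return (P @ grad_y).T @ (P @ x)            # (out_features, in_features)

# toy check: a higher rank gives a closer match to the exact gradient
x, gy = torch.randn(64, 32), torch.randn(64, 48)
exact = gy.T @ x
approx = lowrank_weight_grad(x, gy, rank=16)
print((approx - exact).norm() / exact.norm())
```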
2309.15164 Report 3D Reconstruction with Generalizable Neural Fields using Scene Priors Yang Fu, Shalini De Mello, Xueting Li, Amey Kulkarni, Jan Kautz, Xiaolong Wang, Sifei Liu High-fidelity 3D scene reconstruction has been substantially advanced by recent progress in neural fields. However, most existing methods train a separate network from scratch for each individual scene. This is not scalable, inefficient, and unable to yield good results given limited views. While learning-based multi-view stereo methods alleviate this issue to some extent, their multi-view setting makes it less flexible to scale up and to broad applications. Instead, we introduce training generalizable Neural Fields incorporating scene Priors (NFPs). The NFP network maps any single-view RGB-D image into signed distance and radiance values. A complete scene can be reconstructed by merging individual frames in the volumetric space WITHOUT a fusion module, which provides better flexibility. The scene priors can be trained on large-scale datasets, allowing for fast adaptation to the reconstruction of a new scene with fewer views. NFP not only demonstrates SOTA scene reconstruction performance and efficiency, but it also supports single-image novel-view synthesis, which is underexplored in neural fields. More qualitative results are available at: https://oasisyang.github.io/neural-prior This paper proposes Neural Fields with scene Priors (NFPs), a novel method for fast and scalable 3D scene reconstruction that leverages single-view RGB-D images to learn generalizable scene priors. Existing neural field methods for 3D reconstruction are often scene-specific, requiring separate training for each new scene, which is time-consuming and data-intensive. NFPs address these limitations by learning generalizable priors that can be quickly adapted to novel scenes. NFPs employ a two-stage training paradigm: (1) a geometric reconstruction network learns to map depth images to local SDFs, and (2) this pretrained network serves as a geometric prior to train a color reconstruction network (texture prior) using volumetric rendering. NFPs achieve state-of-the-art scene reconstruction quality with fine geometric details and realistic textures, even with limited input views. The method exhibits significantly faster convergence speed compared to existing approaches. NFPs also enable high-quality single-image novel-view synthesis, which is underexplored in existing neural field methods. The current NFPs model requires at least sparse depth information as input and cannot be directly applied to RGB-only images. Future work could explore incorporating SfM techniques to enable the use of NFPs with RGB images. 3d scene reconstruction, neural fields, scene priors, single-view reconstruction, novel view synthesis
2309.15103 Report LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, Yuwei Guo, Tianxing Wu, Chenyang Si, Yuming Jiang, Cunjian Chen, Chen Change Loy, Bo Dai, Dahua Lin, Yu Qiao, Ziwei Liu This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task to simultaneously a) accomplish the synthesis of visually realistic and temporally coherent videos while b) preserving the strong creative generation nature of the pre-trained T2I model. To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model. Our key insights are two-fold: 1) We reveal that the incorporation of simple temporal self-attentions, coupled with rotary positional encoding, adequately captures the temporal correlations inherent in video data. 2) Additionally, we validate that the process of joint image-video fine-tuning plays a pivotal role in producing high-quality and creative outcomes. To enhance the performance of LaVie, we contribute a comprehensive and diverse video dataset named Vimeo25M, consisting of 25 million text-video pairs that prioritize quality, diversity, and aesthetic appeal. Extensive experiments demonstrate that LaVie achieves state-of-the-art performance both quantitatively and qualitatively. Furthermore, we showcase the versatility of pre-trained LaVie models in various long video generation and personalized video synthesis applications. This paper introduces LaVie, a cascaded video generation framework that leverages pre-trained text-to-image diffusion models to generate high-quality, temporally coherent videos from text descriptions. Training text-to-video models from scratch is computationally expensive. LaVie offers a more efficient approach by building upon pre-trained models while maintaining high visual quality and creative control. LaVie consists of three cascaded video latent diffusion models: a base model for generating key frames, a temporal interpolation model for smoother transitions, and a video super-resolution model for enhanced visual quality. The model is trained on a new high-quality video dataset, Vimeo25M, and uses joint image-video fine-tuning to prevent catastrophic forgetting and enhance concept compositionality. LaVie achieves state-of-the-art performance on zero-shot text-to-video generation benchmarks, outperforming existing methods in terms of visual fidelity and text-video semantic similarity. Joint image-video fine-tuning proves crucial in preventing catastrophic forgetting and enabling the transfer of concepts from the image domain to video generation. The introduction of the Vimeo25M dataset significantly contributes to generating higher-quality videos compared to using existing datasets like WebVid10M. LaVie faces challenges in generating scenes with multiple interacting subjects and struggles to synthesize realistic human hands. Future work will focus on extending LaVie's capabilities to generate longer videos with complex transitions and movie-level quality from script descriptions. text-to-video generation, diffusion models, video generation, generative ai, computer vision
2309.15091 Report VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning Han Lin, Abhay Zala, Jaemin Cho, Mohit Bansal Although recent text-to-video (T2V) generation methods have seen significant advancements, most of these works focus on producing short video clips of a single event with a single background (i.e., single-scene videos). Meanwhile, recent large language models (LLMs) have demonstrated their capability in generating layouts and programs to control downstream visual modules such as image generation models. This raises an important question: can we leverage the knowledge embedded in these LLMs for temporally consistent long video generation? In this paper, we propose VideoDirectorGPT, a novel framework for consistent multi-scene video generation that uses the knowledge of LLMs for video content planning and grounded video generation. Specifically, given a single text prompt, we first ask our video planner LLM (GPT-4) to expand it into a 'video plan', which involves generating the scene descriptions, the entities with their respective layouts, the background for each scene, and consistency groupings of the entities and backgrounds. Next, guided by this output from the video planner, our video generator, Layout2Vid, has explicit control over spatial layouts and can maintain temporal consistency of entities/backgrounds across scenes, while only trained with image-level annotations. Our experiments demonstrate that VideoDirectorGPT framework substantially improves layout and movement control in both single- and multi-scene video generation and can generate multi-scene videos with visual consistency across scenes, while achieving competitive performance with SOTAs in open-domain single-scene T2V generation. We also demonstrate that our framework can dynamically control the strength for layout guidance and can also generate videos with user-provided images. We hope our framework can inspire future work on better integrating the planning ability of LLMs into consistent long video generation. This paper proposes VideoDirectorGPT, a novel two-stage framework for generating temporally consistent long videos with multiple scenes by leveraging the planning abilities of Large Language Models (LLMs). Existing text-to-video generation methods struggle to produce long, multi-scene videos with consistent entities and backgrounds. This work addresses this challenge by incorporating LLM-based planning for improved control and consistency. The proposed VideoDirectorGPT framework consists of two stages: (1) Video Planning: An LLM (GPT-4) generates a 'video plan' containing detailed scene descriptions, entity layouts, and consistency groupings for entities/backgrounds. (2) Video Generation: A novel grounded video generation module (Layout2Vid) renders videos based on the video plan, using image/text-based layout control and ensuring entity-level temporal consistency. VideoDirectorGPT significantly outperforms baselines in object layout and movement control in single-scene video generation. The framework excels at generating multi-scene videos with impressive visual consistency across scenes. VideoDirectorGPT achieves competitive performance with state-of-the-art models on open-domain single-scene text-to-video generation benchmarks. The reliance on powerful LLM APIs for video plan generation can be expensive. The current implementation is limited by the capabilities of the underlying text-to-video generation backbone (ModelScopeT2V). text-to-video generation, multi-scene video generation, large language models, video planning, layout control
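The 'video plan' produced by the LLM is essentially a structured object. The sketch below shows one hypothetical way to represent such a plan in code; all field names and groupings are illustrative, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical schema sketching what an LLM-produced "video plan" could contain.

@dataclass
class EntityLayout:
    name: str                                        # e.g. "golden retriever"
    boxes: List[Tuple[float, float, float, float]]   # per-frame (x0, y0, x1, y1), normalized

@dataclass
class Scene:
    description: str                                 # text prompt for this scene
    background: str                                  # background description
    entities: List[EntityLayout] = field(default_factory=list)

@dataclass
class VideoPlan:
    scenes: List[Scene]
    # entities/backgrounds that share a group id are rendered with shared embeddings
    # so they stay visually consistent across scenes
    consistency_groups: Dict[str, List[str]] = field(default_factory=dict)

plan = VideoPlan(
    scenes=[Scene("a dog fetches a ball in a park", "sunny park",
                  [EntityLayout("dog", [(0.1, 0.5, 0.4, 0.9)])])],
    consistency_groups={"dog_1": ["scene0:dog", "scene1:dog"]},
)
```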
2309.14868 Report Cross-Dataset-Robust Method for Blind Real-World Image Quality Assessment Yuan Chen, Zhiliang Ma, Yang Zhao Although many effective models and real-world datasets have been presented for blind image quality assessment (BIQA), recent BIQA models usually tend to fit a specific training set. Hence, it is still difficult to accurately and robustly measure the visual quality of an arbitrary real-world image. In this paper, a robust BIQA method is designed based on three aspects, i.e., robust training strategy, large-scale real-world dataset, and powerful backbone. First, many individual models based on popular and state-of-the-art (SOTA) Swin-Transformer (SwinT) are trained on different real-world BIQA datasets respectively. Then, these biased SwinT-based models are jointly used to generate pseudo-labels, which adopt the probability of relative quality of two random images instead of a fixed quality score. A large-scale real-world image dataset with 1,000,000 image pairs and pseudo-labels is then proposed for training the final cross-dataset-robust model. Experimental results on cross-dataset tests show that the performance of the proposed method is even better than some SOTA methods that are directly trained on these datasets, thus verifying the robustness and generalization of our method. This paper proposes a cross-dataset-robust blind image quality assessment (BIQA) network and training strategy to improve the generalization ability of BIQA models in real-world scenarios. Existing BIQA models often overfit to specific training datasets, hindering their ability to accurately assess the quality of arbitrary real-world images. The method involves training multiple SwinT-IQA models on different datasets, using them to generate pseudo-labels (relative quality probabilities) for a large-scale real-world image dataset, and finally training a CDR-BIQA model on this dataset using a learning-to-rank framework. The CDR-BIQA model outperforms many state-of-the-art methods in cross-dataset testing, demonstrating its strong generalization ability. The use of relative probability as pseudo-labels and the combination of multiple biased models contribute to the robustness of the proposed method. The Swin-Transformer backbone shows superior learning ability for BIQA compared to other backbones. The performance improvement plateaus with more than 500,000 training image pairs, suggesting potential limitations in scaling beyond this point. The study primarily focuses on authentic distortions, and future work could explore robustness in synthetically distorted datasets. blind image quality assessment, cross-dataset robustness, swin-transformer, pseudo-labels, learning-to-rank
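The pseudo-labeling and learning-to-rank step can be sketched as a pairwise objective: the ensemble of biased models votes on which image of a pair looks better, and the final model is trained to match that soft probability. The snippet is a minimal illustration of the idea, not the paper's training code.

```python
import torch
import torch.nn.functional as F

def pairwise_pseudo_label(scores_a: torch.Tensor, scores_b: torch.Tensor) -> torch.Tensor:
    """Pseudo-label = fraction of ensemble members that rank image A above image B.

    scores_a, scores_b: (num_models, batch) quality scores from the biased SwinT models.
    Returns a soft probability in [0, 1] for each image pair.
    """
    return (scores_a > scores_b).float().mean(dim=0)

def ranking_loss(pred_a: torch.Tensor, pred_b: torch.Tensor, p_label: torch.Tensor) -> torch.Tensor:
    """Learning-to-rank loss: the student's sigmoid score gap should match the soft label."""
    p_pred = torch.sigmoid(pred_a - pred_b)
    return F.binary_cross_entropy(p_pred, p_label)

# toy usage with 5 ensemble models and a batch of 8 image pairs
sa, sb = torch.randn(5, 8), torch.randn(5, 8)
label = pairwise_pseudo_label(sa, sb)
loss = ranking_loss(torch.randn(8, requires_grad=True), torch.randn(8), label)
loss.backward()
```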
2309.14756 Report On quantifying and improving realism of images generated with diffusion Yunzhuo Chen, Naveed Akhtar, Nur Al Hasan Haldar, Ajmal Mian Recent advances in diffusion models have led to a quantum leap in the quality of generative visual content. However, quantification of realism of the content is still challenging. Existing evaluation metrics, such as Inception Score and Fréchet inception distance, fall short on benchmarking diffusion models due to the versatility of the generated images. Moreover, they are not designed to quantify realism of an individual image. This restricts their application in forensic image analysis, which is becoming increasingly important in the emerging era of generative models. To address that, we first propose a metric, called Image Realism Score (IRS), computed from five statistical measures of a given image. This non-learning based metric not only efficiently quantifies realism of the generated images, it is readily usable as a measure to classify a given image as real or fake. We experimentally establish the model- and data-agnostic nature of the proposed IRS by successfully detecting fake images generated by Stable Diffusion Model (SDM), Dalle2, Midjourney and BigGAN. We further leverage this attribute of our metric to minimize an IRS-augmented generative loss of SDM, and demonstrate a convenient yet considerable quality improvement of the SDM-generated content with our modification. Our efforts have also led to Gen-100 dataset, which provides 1,000 samples for 100 classes generated by four high-quality models. We will release the dataset and code. This paper proposes Image Realism Score (IRS), a non-learning based metric to quantify the realism of images, particularly those generated by diffusion models, for distinguishing them from natural images. Existing metrics like Inception Score and Fréchet Inception Distance are limited in benchmarking diffusion models due to their reliance on specific datasets or models, making them unsuitable for assessing the realism of individual images, crucial for forensics in the age of generative models. IRS combines five image statistics: Canny Edge Density, GLCM Contrast, GLCM Energy, Variance of Laplacian, and Mean Spectrum. These measures are carefully calibrated and arranged in a specific order within a pentagon-shaped geometric representation, with the pentagon's area defining the IRS value. IRS values effectively benchmark popular generative models, aligning with their known capabilities. IRS demonstrates strong performance in fake image detection across various models. Incorporating IRS into the Stable Diffusion Model's training loss significantly enhances the realism of the generated images. The effectiveness of IRS for images generated by emerging diffusion models needs further investigation. Exploring the potential of IRS in refining other generative models beyond SDMs presents a promising avenue for future research. image realism score, diffusion models, fake image detection, generative models, image forensics
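The pentagon-area aggregation of the five statistics is easy to illustrate. The sketch below assumes the five measures have already been computed and calibrated to [0, 1]; the ordering and calibration are placeholders for the paper's specific choices.

```python
import numpy as np

def image_realism_score(stats: np.ndarray) -> float:
    """Pentagon-area aggregation of five normalized image statistics (IRS-style sketch).

    stats: length-5 array, each entry assumed already calibrated to [0, 1]
    (e.g. Canny edge density, GLCM contrast, GLCM energy, variance of Laplacian,
    mean spectrum). The values are used as radii at equal 72-degree spacing and the
    area of the resulting polygon is returned.
    """
    assert stats.shape == (5,)
    angles = 2 * np.pi * np.arange(5) / 5
    x, y = stats * np.cos(angles), stats * np.sin(angles)
    # shoelace formula over the closed polygon
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

print(image_realism_score(np.array([0.8, 0.7, 0.6, 0.9, 0.75])))
```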
2309.14623 Report Text-to-Image Generation for Abstract Concepts Jiayi Liao, Xu Chen, Qiang Fu, Lun Du, Xiangnan He, Xiang Wang, Shi Han, Dongmei Zhang Recent years have witnessed the substantial progress of large-scale models across various domains, such as natural language processing and computer vision, facilitating the expression of concrete concepts. Unlike concrete concepts that are usually directly associated with physical objects, expressing abstract concepts through natural language requires considerable effort, which results from their intricate semantics and connotations. An alternative approach is to leverage images to convey rich visual information as a supplement. Nevertheless, existing Text-to-Image (T2I) models are primarily trained on concrete physical objects and tend to fail to visualize abstract concepts. Inspired by the three-layer artwork theory that identifies critical factors, intent, object and form during artistic creation, we propose a framework of Text-to-Image generation for Abstract Concepts (TIAC). The abstract concept is clarified into a clear intent with a detailed definition to avoid ambiguity. LLMs then transform it into semantic-related physical objects, and the concept-dependent form is retrieved from an LLM-extracted form pattern set. Information from these three aspects will be integrated to generate prompts for T2I models via LLM. Evaluation results from human assessments and our newly designed metric concept score demonstrate the effectiveness of our framework in creating images that can sufficiently express abstract concepts. This paper introduces TIAC, a novel framework for text-to-image generation of abstract concepts using LLMs. Existing T2I models struggle to visualize abstract concepts due to their training on concrete objects, hindering effective communication of complex ideas. TIAC leverages LLMs to: 1) clarify user intent with WordNet definitions, 2) transform abstract concepts into related objects, and 3) retrieve relevant form patterns from a prompt dataset. These elements are combined to generate effective T2I prompts. TIAC outperforms baseline methods in human evaluations, indicating its ability to generate images better representing abstract concepts. A novel metric, concept score, combining visual-semantic similarity and aesthetic score, shows higher consistency with human preferences. Case studies demonstrate TIAC's ability to generate meaningful images for various abstract concepts across different T2I models. The current implementation relies on a precise mapping of input concepts to WordNet, requiring further exploration for real-world scenarios. Future work includes investigating methods to automatically determine the optimal level of abstraction for object transformation. text-to-image generation, abstract concepts, large language models, prompt optimization, concept visualization
2309.14494 Report Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator Hanzhuo Huang, Yufan Feng, Cheng Shi, Lan Xu, Jingyi Yu, Sibei Yang Text-to-video is a rapidly growing research area that aims to generate a semantic, identical, and temporal coherence sequence of frames that accurately align with the input text prompt. This study focuses on zero-shot text-to-video generation considering the data- and cost-efficient. To generate a semantic-coherent video, exhibiting a rich portrayal of temporal semantics such as the whole process of flower blooming rather than a set of "moving images", we propose a novel Free-Bloom pipeline that harnesses large language models (LLMs) as the director to generate a semantic-coherence prompt sequence, while pre-trained latent diffusion models (LDMs) as the animator to generate the high fidelity frames. Furthermore, to ensure temporal and identical coherence while maintaining semantic coherence, we propose a series of annotative modifications to adapting LDMs in the reverse process, including joint noise sampling, step-aware attention shift, and dual-path interpolation. Without any video data and training requirements, Free-Bloom generates vivid and high-quality videos, awe-inspiring in generating complex scenes with semantic meaningful frame sequences. In addition, Free-Bloom is naturally compatible with LDMs-based extensions. Free-Bloom, a zero-shot, training-free text-to-video generator that leverages Large Language Models (LLMs) for semantic coherence and pre-trained Latent Diffusion Models (LDMs) for high-quality frame generation. Addresses the challenge of zero-shot text-to-video generation by generating videos with meaningful temporal variations and avoiding extensive data and computational requirements. A three-stage pipeline: (1) Serial Prompting: LLM generates a sequence of prompts describing frame content, (2) Video Generation: LDM generates frames using joint noise sampling and step-aware attention shift for coherence, and (3) Interpolation Empowerment: Increases frame rate via dual latent space interpolation considering both contextual and semantic information. Generates videos exhibiting semantic coherence by depicting complete events aligned with the narrative. Maintains identical coherence and temporal coherence, ensuring smooth transitions and consistent content. Outperforms existing zero-shot methods in user studies regarding fidelity and semantic representation while achieving comparable temporal coherence to trained methods. Inherits limitations from the underlying LLM and LDM models, such as difficulties with complex scenes and sensitivity to initial noise. Future work includes improving temporal consistency and combining strengths of zero-shot and trained approaches. text-to-video generation, zero-shot learning, large language models, latent diffusion models, video interpolation
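The interpolation-empowerment stage operates on diffusion latents. As a rough illustration of the kind of primitive such a step can build on, the snippet below shows a standard spherical interpolation (slerp) between two frame latents; Free-Bloom's actual dual-path scheme additionally blends contextual and semantic information, which is not reproduced here.

```python
import torch

def slerp(z0: torch.Tensor, z1: torch.Tensor, t: float, eps: float = 1e-7) -> torch.Tensor:
    """Spherical interpolation between two diffusion latents (generic sketch)."""
    a, b = z0.flatten(), z1.flatten()
    cos = torch.clamp(torch.dot(a, b) / (a.norm() * b.norm() + eps), -1 + eps, 1 - eps)
    omega = torch.acos(cos)
    out = (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)
    return out.reshape(z0.shape)

# interpolate an intermediate latent halfway between two keyframe latents
zA, zB = torch.randn(4, 64, 64), torch.randn(4, 64, 64)
z_mid = slerp(zA, zB, t=0.5)
```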
2309.14338 Report 3D Indoor Instance Segmentation in an Open-World Mohamed El Amine Boudjoghra, Salwa K. Al Khatib, Jean Lahoud, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Khan Existing 3D instance segmentation methods typically assume that all semantic classes to be segmented would be available during training and only seen categories are segmented at inference. We argue that such a closed-world assumption is restrictive and explore for the first time 3D indoor instance segmentation in an open-world setting, where the model is allowed to distinguish a set of known classes as well as identify an unknown object as unknown and then later incrementally learning the semantic category of the unknown when the corresponding category labels are available. To this end, we introduce an open-world 3D indoor instance segmentation method, where an auto-labeling scheme is employed to produce pseudo-labels during training and induce separation to separate known and unknown category labels. We further improve the pseudo-labels quality at inference by adjusting the unknown class probability based on the objectness score distribution. We also introduce carefully curated open-world splits leveraging realistic scenarios based on inherent object distribution, region-based indoor scene exploration and randomness aspect of open-world classes. Extensive experiments reveal the efficacy of the proposed contributions leading to promising open-world 3D instance segmentation performance. This paper proposes the first open-world 3D indoor instance segmentation method, enabling identification of unknown objects and incremental learning of new classes. Existing 3D instance segmentation methods rely on a closed-world assumption, limiting their applicability in real-world scenarios with numerous unseen object classes. The method utilizes an auto-labeling scheme for pseudo-label generation, contrastive clustering for class separation, and a reachability-based probability correction scheme for improved unknown object recognition. It also employs exemplar replay for incremental learning. The proposed method outperforms adapted baselines in open-world 3D instance segmentation. It effectively preserves knowledge of previously learned classes during incremental learning. Qualitative results demonstrate the method's ability to correctly identify and segment both known and unknown objects. The confidence thresholding approach, while improving known class performance, limits unknown class segmentation due to fewer pseudo-labels. Probability correction's efficacy depends on cluster characteristics and may deteriorate in imbalanced data scenarios. 3d instance segmentation, open-world learning, incremental learning, pseudo-labeling, contrastive clustering
2309.14335 Report UnitedHuman: Harnessing Multi-Source Data for High-Resolution Human Generation Jianglin Fu, Shikai Li, Yuming Jiang, Kwan-Yee Lin, Wayne Wu, Ziwei Liu Human generation has achieved significant progress. Nonetheless, existing methods still struggle to synthesize specific regions such as faces and hands. We argue that the main reason is rooted in the training data. A holistic human dataset inevitably has insufficient and low-resolution information on local parts. Therefore, we propose to use multi-source datasets with various resolution images to jointly learn a high-resolution human generative model. However, multi-source data inherently a) contains different parts that do not spatially align into a coherent human, and b) comes with different scales. To tackle these challenges, we propose an end-to-end framework, UnitedHuman, that empowers continuous GAN with the ability to effectively utilize multi-source data for high-resolution human generation. Specifically, 1) we design a Multi-Source Spatial Transformer that spatially aligns multi-source images to full-body space with a human parametric model. 2) Next, a continuous GAN is proposed with global-structural guidance and CutMix consistency. Patches from different datasets are then sampled and transformed to supervise the training of this scale-invariant generative model. Extensive experiments demonstrate that our model jointly learned from multi-source data achieves superior quality than those learned from a holistic dataset. UnitedHuman is a novel end-to-end framework that leverages multi-source datasets to generate high-resolution, full-body human images. Existing human generation methods struggle to synthesize high-fidelity images, especially for detailed regions like faces and hands, due to the limitations of holistic human datasets. The framework employs a two-stage approach: 1) Multi-source Spatial Transformer aligns body parts from different datasets using a parametric human model. 2) Continuous GAN, trained with global structural guidance and CutMix consistency, synthesizes image patches at various scales and stitches them together for the final output. UnitedHuman outperforms baseline methods (StyleGAN-Human, InsetGAN, AnyRes) in generating high-resolution human images with finer details, even when trained on a smaller dataset of high-resolution images. Quantitative evaluations (kFID, pFID) demonstrate the superiority of UnitedHuman in capturing local textures and details, particularly for hands and faces. Ablation studies confirm the efficacy of the proposed Multi-Source Spatial Transformer and the Continuous GAN in improving alignment and leveraging multi-source datasets effectively. The underlying StyleGAN3 architecture may limit the representation of high-frequency information, causing potential artifacts in upscaling. The diversity of generated poses and garments is constrained by the training data, necessitating future work on data augmentation and incorporating more varied datasets. human generation, multi-scale generation, generative adversarial networks (gans), multi-source data, human body alignment
2309.14289 Report CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free Monika Wysoczańska, Michaël Ramamonjisoa, Tomasz Trzciński, Oriane Siméoni The emergence of CLIP has opened the way for open-world image perception. The zero-shot classification capabilities of the model are impressive but are harder to use for dense tasks such as image segmentation. Several methods have proposed different modifications and learning schemes to produce dense output. Instead, we propose in this work an open-vocabulary semantic segmentation method, dubbed CLIP-DIY, which does not require any additional training or annotations, but instead leverages existing unsupervised object localization approaches. In particular, CLIP-DIY is a multi-scale approach that directly exploits CLIP classification abilities on patches of different sizes and aggregates the decision in a single map. We further guide the segmentation using foreground/background scores obtained using unsupervised object localization methods. With our method, we obtain state-of-the-art zero-shot semantic segmentation results on PASCAL VOC and perform on par with the best methods on COCO. The code is available at http://github.com/wysoczanska/clip-diy Introduces CLIP-DIY, a zero-shot open-vocabulary semantic segmentation method that leverages CLIP's classification ability and unsupervised object localization. Addresses limitations of supervised semantic segmentation methods that require expensive annotations and struggle with open vocabularies. Performs multi-scale dense inference with CLIP on image patches, aggregating predictions into a single map. Refines the map using an unsupervised foreground/background segmentation model (FOUND) for objectness guidance. Achieves state-of-the-art zero-shot open-vocabulary semantic segmentation on PASCAL VOC (+4.9 mIoU over previous best). Performs on par with the best methods on COCO Object dataset. Demonstrates effectiveness of multi-scale patch classification and objectness guidance for accurate and robust segmentation. Performance limited by accuracy of the unsupervised object localization model in complex scenes. Sensitivity to text ambiguities inherited from CLIP can lead to misclassifications. semantic segmentation, open vocabulary, zero-shot learning, clip, unsupervised object localization
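The multi-scale dense inference can be sketched as classifying patches of different sizes with a CLIP-like encoder and pasting the class probabilities back into a map. The `encode_image` callable below stands in for a CLIP image encoder, and the foreground/background reweighting with FOUND is omitted; everything here is an assumed, simplified interface rather than the released code.

```python
import torch
import torch.nn.functional as F

def dense_clip_scores(image, text_emb, encode_image, scales=(2, 4)):
    """Multi-scale patch classification sketch (CLIP-DIY style).

    image:        (3, H, W) tensor
    text_emb:     (num_classes, d) normalized text embeddings for the class prompts
    encode_image: callable mapping a (N, 3, h, w) crop batch to (N, d) normalized features
    Returns a (num_classes, H, W) score map averaged over grid scales.
    """
    _, H, W = image.shape
    maps = []
    for s in scales:                                   # split into an s x s grid of patches
        hp, wp = H // s, W // s
        grid = torch.zeros(text_emb.shape[0], H, W)
        for i in range(s):
            for j in range(s):
                crop = image[:, i * hp:(i + 1) * hp, j * wp:(j + 1) * wp].unsqueeze(0)
                feat = encode_image(crop)              # (1, d)
                probs = (feat @ text_emb.T).softmax(dim=-1)[0]   # class probabilities
                grid[:, i * hp:(i + 1) * hp, j * wp:(j + 1) * wp] = probs[:, None, None]
        maps.append(grid)
    return torch.stack(maps).mean(dim=0)

# toy run with a random "encoder" standing in for CLIP
d, C = 16, 3
fake_encoder = lambda x: F.normalize(torch.randn(x.shape[0], d), dim=-1)
scores = dense_clip_scores(torch.rand(3, 64, 64), F.normalize(torch.randn(C, d), dim=-1), fake_encoder)
print(scores.shape)   # torch.Size([3, 64, 64])
```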
2309.14207 Report Automatic Animation of Hair Blowing in Still Portrait Photos Wenpeng Xiao, Wentao Liu, Yitong Wang, Bernard Ghanem, Bing Li We propose a novel approach to animate human hair in a still portrait photo. Existing work has largely studied the animation of fluid elements such as water and fire. However, hair animation for a real image remains underexplored, which is a challenging problem, due to the high complexity of hair structure and dynamics. Considering the complexity of hair structure, we innovatively treat hair wisp extraction as an instance segmentation problem, where a hair wisp is referred to as an instance. With advanced instance segmentation networks, our method extracts meaningful and natural hair wisps. Furthermore, we propose a wisp-aware animation module that animates hair wisps with pleasing motions without noticeable artifacts. The extensive experiments show the superiority of our method. Our method provides the most pleasing and compelling viewing experience in the qualitative experiments and outperforms state-of-the-art still-image animation methods by a large margin in the quantitative evaluation. Project url: \url{https://nevergiveu.github.io/AutomaticHairBlowing/} This paper proposes a novel approach to automatically animate human hair in still portrait photos, converting them into dynamic and engaging cinemagraphs. Existing methods for still-image animation primarily focus on fluid elements and lack the ability to realistically animate hair, a crucial aspect for creating compelling portrait visuals. This work addresses this gap by enabling automatic hair animation in real-world portrait photos. The method employs a three-step process: (1) Instance-based Hair Wisp Extraction (IHWE) identifies and segments individual hair wisps using instance segmentation networks trained on a novel hair wisp dataset. (2) Hair Wisp Animation (HWA) represents each wisp with a multi-layer mesh and simulates natural motions using a physics-based mass-spring system. (3) Depth-aware frame composition ensures proper occlusion relationships between animated hair, face, and background during video generation. The proposed method outperforms state-of-the-art single-image-to-video generation techniques in both quantitative metrics like Frechet Video Distance (FVD) and Warping Error, indicating superior video quality and temporal consistency. Qualitative comparisons demonstrate the method's ability to generate more realistic and visually appealing hair animations compared to baselines, avoiding artifacts like distortions, unnatural movements, and flickering. Subjective user studies confirm that the generated videos are significantly preferred by human viewers, highlighting the effectiveness of the approach in enhancing portrait aesthetics. The current method primarily focuses on animating hair blowing in the wind and may not generalize well to other hair motions like shaking or swaying. Future work could explore incorporating user controls to fine-tune animation parameters like wind direction and intensity. hair animation, cinemagraph generation, instance segmentation, physics-based animation, portrait image animation
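The wisp-aware animation module builds on mass-spring dynamics. The toy chain below illustrates the general idea (a pinned root, spring forces between neighbouring vertices, damping, and a constant wind force); it is a simplified stand-in, not the paper's multi-layer mesh solver.

```python
import numpy as np

def simulate_wisp(n=20, steps=200, dt=0.01, k=400.0, damping=0.8, wind=(0.5, 0.0)):
    """Toy mass-spring chain standing in for one hair wisp (illustrative parameters).

    The root vertex is pinned; the rest are pulled toward their neighbours by springs,
    damped, and pushed by a constant wind force. Returns the (steps, n, 2) trajectory.
    """
    rest = 1.0 / n
    pos = np.stack([np.zeros(n), -np.linspace(0, 1, n)], axis=1)   # hang the wisp downward
    vel = np.zeros_like(pos)
    traj = []
    for _ in range(steps):
        force = np.tile(np.array(wind), (n, 1))
        for i in range(1, n):                          # spring between vertex i and i-1
            d = pos[i] - pos[i - 1]
            length = np.linalg.norm(d) + 1e-8
            f = -k * (length - rest) * d / length
            force[i] += f
            force[i - 1] -= f                          # equal and opposite reaction
        vel = damping * (vel + dt * force)
        vel[0] = 0.0                                   # root stays attached to the scalp
        pos = pos + dt * vel
        traj.append(pos.copy())
    return np.array(traj)

print(simulate_wisp().shape)   # (200, 20, 2)
```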
2309.14136 Report Masked Image Residual Learning for Scaling Deeper Vision Transformers Guoxi Huang, Hongtao Fu, Adrian G. Bors Deeper Vision Transformers (ViTs) are more challenging to train. We expose a degradation problem in deeper layers of ViT when using masked image modeling (MIM) for pre-training. To ease the training of deeper ViTs, we introduce a self-supervised learning framework called Masked Image Residual Learning (MIRL), which significantly alleviates the degradation problem, making scaling ViT along depth a promising direction for performance upgrade. We reformulate the pre-training objective for deeper layers of ViT as learning to recover the residual of the masked image. We provide extensive empirical evidence showing that deeper ViTs can be effectively optimized using MIRL and easily gain accuracy from increased depth. With the same level of computational complexity as ViT-Base and ViT-Large, we instantiate 4.5$\times$ and 2$\times$ deeper ViTs, dubbed ViT-S-54 and ViT-B-48. The deeper ViT-S-54, costing 3$\times$ less than ViT-Large, achieves performance on par with ViT-Large. ViT-B-48 achieves 86.2% top-1 accuracy on ImageNet. On one hand, deeper ViTs pre-trained with MIRL exhibit excellent generalization capabilities on downstream tasks, such as object detection and semantic segmentation. On the other hand, MIRL demonstrates high pre-training efficiency. With less pre-training time, MIRL yields competitive performance compared to other approaches. This paper identifies a degradation problem in deeper layers of Vision Transformers (ViTs) pre-trained with Masked Image Modeling (MIM) and proposes Masked Image Residual Learning (MIRL) to address it. Scaling ViTs along the depth dimension is challenging due to the degradation problem, which hinders performance improvement. This work makes deep ViTs a promising direction for performance upgrade. MIRL reformulates the pre-training objective for deeper layers of ViT as learning to recover the residual of the masked image, leveraging a multi-decoding process and shortcut connections. Deeper ViTs pre-trained with MIRL outperform shallower counterparts with similar complexity (ViT-S-54 surpasses ViT-B). MIRL enables training ViTs with significantly increased depth, achieving competitive results with less computational cost (ViT-B-48 outperforms ViT-L). MIRL shows strong generalization capabilities, improving performance on downstream tasks like object detection and semantic segmentation. A comprehensive theoretical explanation for MIRL’s effectiveness is still under exploration. Further exploration of depth scaling beyond the presented 54 blocks is needed. vision transformer, self-supervised learning, masked image modeling, image residual learning, deep learning
2309.14068 Report Soft Mixture Denoising: Beyond the Expressive Bottleneck of Diffusion Models Yangming Li, Boris van Breugel, Mihaela van der Schaar Because diffusion models have shown impressive performances in a number of tasks, such as image synthesis, there is a trend in recent works to prove (with certain assumptions) that these models have strong approximation capabilities. In this paper, we show that current diffusion models actually have an expressive bottleneck in backward denoising and some assumption made by existing theoretical guarantees is too strong. Based on this finding, we prove that diffusion models have unbounded errors in both local and global denoising. In light of our theoretical studies, we introduce soft mixture denoising (SMD), an expressive and efficient model for backward denoising. SMD not only permits diffusion models to well approximate any Gaussian mixture distributions in theory, but also is simple and efficient for implementation. Our experiments on multiple image datasets show that SMD significantly improves different types of diffusion models (e.g., DDPM), especially in the situation of few backward iterations. This paper identifies an expressive bottleneck in diffusion models' backward denoising process due to its Gaussian parameterization, limiting its ability to approximate multimodal data distributions, and proposes Soft Mixture Denoising (SMD) to address this limitation. Existing theoretical guarantees for diffusion models often assume bounded score estimation errors, which is shown to be too strong an assumption. This paper demonstrates the limitations of the Gaussian denoising paradigm and aims to provide a more expressive alternative. The paper provides theoretical proofs to demonstrate the unbounded errors in both local and global denoising in current diffusion models. It then introduces SMD, a continuous relaxation of a Gaussian mixture model for the denoising posterior, and proves its ability to accurately approximate any Gaussian mixture distribution. Current diffusion models suffer from an expressive bottleneck in backward denoising, leading to unbounded denoising errors. The assumption of bounded score estimation errors made by existing theoretical guarantees for diffusion models is too strong. SMD significantly improves the generation quality of various diffusion models, especially with few backward iterations, enabling faster sampling and reduced computational costs. The paper assumes globally optimized neural networks in its theoretical analysis of SMD. Future work includes extending SMD to other applications like text-to-image translation and speech synthesis, and integrating it into diffusion model libraries to speed up training and inference. diffusion models, generative models, denoising, gaussian mixture models, expressive bottleneck
2309.14052 Report Single Image Test-Time Adaptation for Segmentation Klara Janouskova, Tamir Shor, Chaim Baskin, Jiri Matas Test-Time Adaptation (TTA) methods improve the robustness of deep neural networks to domain shift on a variety of tasks such as image classification or segmentation. This work explores adapting segmentation models to a single unlabelled image with no other data available at test-time. In particular, this work focuses on adaptation by optimizing self-supervised losses at test-time. Multiple baselines based on different principles are evaluated under diverse conditions and a novel adversarial training is introduced for adaptation with mask refinement. Our additions to the baselines result in a 3.51% and 3.28% increase over non-adapted baselines; without these improvements, the increase would be only 1.7% and 2.16%. This paper explores adapting segmentation models to a single unlabeled image at test-time by optimizing self-supervised losses, introducing novel adversarial training for adaptation with mask refinement. Test-time adaptation (TTA) methods improve the robustness of deep neural networks to domain shift, a common problem when deploying models in real-world scenarios. The authors evaluate several TTA baselines based on entropy minimization, pseudo-labeling, mask refinement, augmentation consistency, and adversarial transformation invariance. They introduce a novel adversarial training method for mask refinement and propose new evaluation metrics to account for class imbalance and per-image performance. Optimizing Intersection over Union (IoU) loss consistently outperforms cross-entropy loss for all evaluated methods. Test-time training with pseudo-labels and mask refinement with adversarial training are identified as the overall best-performing methods. The effectiveness of TTA methods is highly dependent on domain shift type and severity, and a single set of hyperparameters may not perform well across all conditions. Single image TTA by optimizing model parameters has high computational cost. Finding optimal hyperparameters for different domain shifts remains a challenge. test-time adaptation, semantic segmentation, domain shift, adversarial training, mask refinement
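One of the stronger baselines discussed — test-time self-training on the model's own pseudo-labels under a soft-IoU objective — can be sketched as a short adaptation loop. The hyperparameters and the tiny stand-in network below are illustrative only, not the paper's setup.

```python
import torch
import torch.nn.functional as F

def soft_iou_loss(logits: torch.Tensor, target_onehot: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Differentiable (soft) IoU loss between per-pixel class probabilities and one-hot targets."""
    probs = logits.softmax(dim=1)                               # (1, C, H, W)
    inter = (probs * target_onehot).sum(dim=(2, 3))
    union = (probs + target_onehot - probs * target_onehot).sum(dim=(2, 3))
    return (1 - (inter + eps) / (union + eps)).mean()

def adapt_on_single_image(model, image, num_classes, steps=10, lr=1e-4):
    """Minimal single-image TTA sketch: self-train on hard pseudo-labels with a soft-IoU objective."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        logits = model(image)                                   # (1, C, H, W)
        with torch.no_grad():
            pseudo = logits.argmax(dim=1)                       # hard pseudo-labels
            target = F.one_hot(pseudo, num_classes).permute(0, 3, 1, 2).float()
        loss = soft_iou_loss(logits, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

# toy usage with a tiny convolutional "segmenter"
net = torch.nn.Conv2d(3, 4, kernel_size=3, padding=1)
adapt_on_single_image(net, torch.rand(1, 3, 32, 32), num_classes=4)
```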
2309.13956 Report In-Domain GAN Inversion for Faithful Reconstruction and Editability Jiapeng Zhu, Yujun Shen, Yinghao Xu, Deli Zhao, Qifeng Chen, Bolei Zhou Generative Adversarial Networks (GANs) have significantly advanced image synthesis through mapping randomly sampled latent codes to high-fidelity synthesized images. However, applying well-trained GANs to real image editing remains challenging. A common solution is to find an approximate latent code that can adequately recover the input image to edit, which is also known as GAN inversion. To invert a GAN model, prior works typically focus on reconstructing the target image at the pixel level, yet few studies are conducted on whether the inverted result can well support manipulation at the semantic level. This work fills in this gap by proposing in-domain GAN inversion, which consists of a domain-guided encoder and a domain-regularized optimizer, to regularize the inverted code in the native latent space of the pre-trained GAN model. In this way, we manage to sufficiently reuse the knowledge learned by GANs for image reconstruction, facilitating a wide range of editing applications without any retraining. We further make comprehensive analyses on the effects of the encoder structure, the starting inversion point, as well as the inversion parameter space, and observe the trade-off between the reconstruction quality and the editing property. Such a trade-off sheds light on how a GAN model represents an image with various semantics encoded in the learned latent distribution. Code, models, and demo are available at the project page: https://genforce.github.io/idinvert/. This paper proposes IDInvert, an in-domain GAN inversion method that focuses on the semantic properties of inverted latent codes to ensure editability for real image editing. Existing GAN inversion methods primarily focus on pixel-level reconstruction and overlook the semantic meaning of inverted codes, limiting their use in real-world image editing applications. IDInvert uses a domain-guided encoder trained on real images to map images to the latent space of a pre-trained GAN. This encoder acts as a regularizer during a subsequent domain-regularized optimization step, ensuring the inverted code stays within the semantic domain of the generator. IDInvert produces inverted codes that are more semantically meaningful, as demonstrated by better performance in attribute classification tasks compared to baselines. The method facilitates high-quality image editing applications like interpolation and semantic manipulation, outperforming existing techniques in visual quality and semantic consistency. The study reveals a trade-off between reconstruction quality and editability, showing that increasing reconstruction accuracy often comes at the cost of reduced editing capabilities. IDInvert's performance is limited to images that share a similar distribution with the training data. Future work could explore methods to mitigate the trade-off between reconstruction quality and editability. gan inversion, image editing, semantic editing, latent space, generative adversarial networks
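The domain-regularized optimizer can be sketched as a latent-code refinement that balances pixel reconstruction against an encoder consistency term. `G` and `E` below are stand-ins for the generator and the domain-guided encoder (the actual method additionally uses a perceptual loss); the interface is assumed, not the released code.

```python
import torch

def domain_regularized_inversion(G, E, image, w_init, steps=200, lr=0.01, lam=2.0):
    """Sketch of domain-regularized optimization: refine a latent code so the reconstruction
    matches the target while the code stays in the domain the encoder was trained on."""
    w = w_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        recon = G(w)
        loss_pix = (recon - image).pow(2).mean()          # pixel reconstruction
        loss_reg = (E(recon) - w).pow(2).mean()           # keep the code in-domain
        loss = loss_pix + lam * loss_reg
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()

# toy stand-ins for the generator and the domain-guided encoder
G = torch.nn.Linear(64, 256)
E = torch.nn.Linear(256, 64)
w = domain_regularized_inversion(G, E, torch.randn(1, 256), torch.randn(1, 64))
```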
2309.13415 Report Dream the Impossible: Outlier Imagination with Diffusion Models Xuefeng Du, Yiyou Sun, Xiaojin Zhu, Yixuan Li Utilizing auxiliary outlier datasets to regularize the machine learning model has demonstrated promise for out-of-distribution (OOD) detection and safe prediction. Due to the labor intensity in data collection and cleaning, automating outlier data generation has been a long-desired alternative. Despite the appeal, generating photo-realistic outliers in the high dimensional pixel space has been an open challenge for the field. To tackle the problem, this paper proposes a new framework DREAM-OOD, which enables imagining photo-realistic outliers by way of diffusion models, provided with only the in-distribution (ID) data and classes. Specifically, DREAM-OOD learns a text-conditioned latent space based on ID data, and then samples outliers in the low-likelihood region via the latent, which can be decoded into images by the diffusion model. Different from prior works, DREAM-OOD enables visualizing and understanding the imagined outliers, directly in the pixel space. We conduct comprehensive quantitative and qualitative studies to understand the efficacy of DREAM-OOD, and show that training with the samples generated by DREAM-OOD can benefit OOD detection performance. Code is publicly available at https://github.com/deeplearning-wisc/dream-ood. Dream-OOD is the first method to generate photo-realistic high-resolution outliers in the pixel space for improving OOD detection. Existing methods for OOD detection rely on auxiliary outlier datasets, which are labor-intensive to collect and curate. Dream-OOD addresses this limitation by automating outlier generation using diffusion models. Dream-OOD learns a text-conditioned latent space using ID data and a pre-trained diffusion model (Stable Diffusion). Outliers are then generated by: 1) sampling new embeddings in the low-likelihood regions of the latent space and 2) decoding these embeddings into images using the diffusion model. Dream-OOD significantly improves OOD detection performance on CIFAR-100 and ImageNet-100 benchmarks, outperforming existing methods including those based on GANs and latent-space outlier synthesis (VOS, NPOS). Analysis of generated images shows that Dream-OOD effectively creates a wide spectrum of outliers ranging from near-OOD to far-OOD. An extension of the method, Dream-ID, demonstrates the ability to generate in-distribution samples which improves model generalization on ImageNet, ImageNet-A and ImageNet-v2. The outlier generation process in Dream-OOD relies on the quality and diversity of the pre-trained diffusion model. Finding the optimal parameters for sampling outlier embeddings, such as variance and neighborhood size, requires careful tuning. out-of-distribution (ood) detection, diffusion models, outlier generation, data augmentation, generalization
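Sampling in the low-likelihood region of the learned latent space can be illustrated with a simple k-NN heuristic: perturb ID embeddings and keep the candidates farthest from the ID set. The parameters below are illustrative, and in the full pipeline the kept embeddings would then be decoded into images by the text-to-image model.

```python
import torch

def sample_outlier_embeddings(id_emb: torch.Tensor, num_candidates=2000, num_keep=50,
                              sigma=0.1, k=10):
    """Sketch of picking low-likelihood latents: perturb ID embeddings with Gaussian noise
    and keep candidates whose distance to their k-th nearest ID neighbour is largest."""
    n, d = id_emb.shape
    idx = torch.randint(0, n, (num_candidates,))
    candidates = id_emb[idx] + sigma * torch.randn(num_candidates, d)
    dists = torch.cdist(candidates, id_emb)                 # (num_candidates, n)
    knn_dist = dists.topk(k, largest=False).values[:, -1]   # distance to the k-th neighbour
    keep = knn_dist.topk(num_keep).indices                  # farthest from the ID manifold
    return candidates[keep]

outliers = sample_outlier_embeddings(torch.randn(500, 32))
print(outliers.shape)   # torch.Size([50, 32])
```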
2309.13274 Report GLOBER: Coherent Non-autoregressive Video Generation via GLOBal Guided Video DecodER Mingzhen Sun, Weining Wang, Zihan Qin, Jiahui Sun, Sihan Chen, Jing Liu Video generation necessitates both global coherence and local realism. This work presents a novel non-autoregressive method GLOBER, which first generates global features to obtain comprehensive global guidance and then synthesizes video frames based on the global features to generate coherent videos. Specifically, we propose a video auto-encoder, where a video encoder encodes videos into global features, and a video decoder, built on a diffusion model, decodes the global features and synthesizes video frames in a non-autoregressive manner. To achieve maximum flexibility, our video decoder perceives temporal information through normalized frame indexes, which enables it to synthesize arbitrary sub video clips with predetermined starting and ending frame indexes. Moreover, a novel adversarial loss is introduced to improve the global coherence and local realism between the synthesized video frames. Finally, we employ a diffusion-based video generator to fit the global features outputted by the video encoder for video generation. Extensive experimental results demonstrate the effectiveness and efficiency of our proposed method, and new state-of-the-art results have been achieved on multiple benchmarks. GLOBER: a novel non-autoregressive diffusion-based video generation method that prioritizes global guidance by first generating 2D global features and then synthesizing video frames based on these features, enhancing coherence and realism. Existing video generation methods struggle to balance global coherence and local realism due to computational limitations and the potentially infinite number of video frames. GLOBER addresses this by separating global guidance generation from frame-wise detail synthesis. GLOBER utilizes a video auto-encoder: an encoder compresses video keyframes into 2D global features, and a diffusion-based decoder synthesizes frames based on these features and frame indexes. A novel Coherence and Realism Adversarial (CRA) loss improves global coherence and local realism. A separate diffusion model generates novel global features for video generation. Achieves new state-of-the-art results on multiple benchmarks, including UCF-101, Sky Time-lapse, and TaiChi-HD. Significantly faster in generating video frames compared to autoregressive methods, thanks to its non-autoregressive strategy. Generates videos with enhanced coherence and realism due to the use of global features as guidance. Difficulty in processing videos with frequent scene changes. Limited exploration in open-domain video generation tasks due to computational constraints. video generation, diffusion models, global guidance, non-autoregressive, coherence and realism
2309.13196 Report ClusterFormer: Clustering As A Universal Visual Learner James C. Liang, Yiming Cui, Qifan Wang, Tong Geng, Wenguan Wang, Dongfang Liu This paper presents CLUSTERFORMER, a universal vision model that is based on the CLUSTERing paradigm with TransFORMER. It comprises two novel designs: 1. recurrent cross-attention clustering, which reformulates the cross-attention mechanism in Transformer and enables recursive updates of cluster centers to facilitate strong representation learning; and 2. feature dispatching, which uses the updated cluster centers to redistribute image features through similarity-based metrics, resulting in a transparent pipeline. This elegant design streamlines an explainable and transferable workflow, capable of tackling heterogeneous vision tasks (i.e., image classification, object detection, and image segmentation) with varying levels of clustering granularity (i.e., image-, box-, and pixel-level). Empirical results demonstrate that CLUSTERFORMER outperforms various well-known specialized architectures, achieving 83.41% top-1 acc. over ImageNet-1K for image classification, 54.2% and 47.0% mAP over MS COCO for object detection and instance segmentation, 52.4% mIoU over ADE20K for semantic segmentation, and 55.8% PQ over COCO Panoptic for panoptic segmentation. For its efficacy, we hope our work can catalyze a paradigm shift in universal models in computer vision. This paper introduces ClusterFormer, a universal vision model leveraging clustering within a Transformer architecture to excel at various vision tasks with varying levels of granularity. Inspired by the human vision system's ability to group visual information for diverse tasks, ClusterFormer aims to achieve similar versatility and performance in a single model. ClusterFormer uses a recurrent cross-attention clustering mechanism to iteratively update cluster centers based on image features. Then, it employs feature dispatching to redistribute image features based on their similarity to updated cluster centers. ClusterFormer outperforms Swin Transformer on ImageNet classification by up to 0.39% in top-1 accuracy. For object detection on MS COCO, ClusterFormer surpasses DINO by up to 1.1% mAP. ClusterFormer achieves state-of-the-art performance on instance segmentation (MS COCO), semantic segmentation (ADE20K), and panoptic segmentation (COCO Panoptic). The computational cost of ClusterFormer can be high for high-resolution images. Exploring the optimal number of clusters for different tasks and datasets is crucial for future work. universal vision model, clustering, transformer, recurrent cross-attention, feature dispatching
2309.13101 Report Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, Xiaogang Jin Implicit neural representation has paved the way for new approaches to dynamic scene reconstruction and rendering. Nonetheless, cutting-edge dynamic neural rendering methods rely heavily on these implicit representations, which frequently struggle to capture the intricate details of objects in the scene. Furthermore, implicit methods have difficulty achieving real-time rendering in general dynamic scenes, limiting their use in a variety of tasks. To address the issues, we propose a deformable 3D Gaussians Splatting method that reconstructs scenes using 3D Gaussians and learns them in canonical space with a deformation field to model monocular dynamic scenes. We also introduce an annealing smoothing training mechanism with no extra overhead, which can mitigate the impact of inaccurate poses on the smoothness of time interpolation tasks in real-world datasets. Through a differential Gaussian rasterizer, the deformable 3D Gaussians not only achieve higher rendering quality but also real-time rendering speed. Experiments show that our method outperforms existing methods significantly in terms of both rendering quality and speed, making it well-suited for tasks such as novel-view synthesis, time interpolation, and real-time rendering. This paper presents a novel method for high-fidelity monocular dynamic scene reconstruction using deformable 3D Gaussians, achieving real-time rendering and high-quality results. Existing dynamic neural rendering methods often struggle to capture intricate scene details and achieve real-time performance. This work addresses these limitations by utilizing a deformable 3D Gaussian framework. The proposed method learns 3D Gaussians in canonical space and employs a deformation field to model their variations over time. This is accomplished using a differential Gaussian rasterization pipeline and a novel annealing smoothing training mechanism. The method significantly outperforms previous approaches in terms of rendering quality and speed on both synthetic and real-world datasets. It effectively reconstructs fine details and ensures temporal smoothness, even with inaccurate pose estimations. The method achieves real-time rendering capabilities when the number of 3D Gaussians is below 250k. The method's performance is dependent on viewpoint diversity and pose estimation accuracy. Scenes with an extremely high number of 3D Gaussians can lead to increased training time and memory consumption. neural rendering, dynamic scene reconstruction, 3d gaussians, deformable models, real-time rendering
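A minimal sketch of the deformation-field component described in this entry: an MLP maps a positionally encoded canonical Gaussian center plus a time value to offsets of position, rotation, and scale. Layer widths, frequency counts, and the quaternion-offset parameterization are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

def posenc(x: torch.Tensor, n_freqs: int) -> torch.Tensor:
    """Standard NeRF-style sinusoidal positional encoding."""
    freqs = 2.0 ** torch.arange(n_freqs, device=x.device) * torch.pi
    angles = x[..., None] * freqs                       # (..., D, n_freqs)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1).flatten(-2)

class DeformField(nn.Module):
    def __init__(self, x_freqs=10, t_freqs=6, width=256):
        super().__init__()
        in_dim = 3 * 2 * x_freqs + 1 * 2 * t_freqs
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, 3 + 4 + 3))                # d_xyz, d_rotation (quat), d_scale
        self.x_freqs, self.t_freqs = x_freqs, t_freqs

    def forward(self, xyz: torch.Tensor, t: torch.Tensor):
        """xyz: (N, 3) canonical Gaussian centers; t: (N, 1) normalized time."""
        h = torch.cat([posenc(xyz, self.x_freqs), posenc(t, self.t_freqs)], dim=-1)
        out = self.mlp(h)
        # Offsets are applied to the canonical Gaussians before rasterization.
        return out[:, :3], out[:, 3:7], out[:, 7:]
```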
2309.13097 Report Zero-Shot Object Counting with Language-Vision Models Jingyi Xu, Hieu Le, Dimitris Samaras Class-agnostic object counting aims to count object instances of an arbitrary class at test time. It is challenging but also enables many potential applications. Current methods require human-annotated exemplars as inputs which are often unavailable for novel categories, especially for autonomous systems. Thus, we propose zero-shot object counting (ZSC), a new setting where only the class name is available during test time. This obviates the need for human annotators and enables automated operation. To perform ZSC, we propose finding a few object crops from the input image and use them as counting exemplars. The goal is to identify patches containing the objects of interest while also being visually representative for all instances in the image. To do this, we first construct class prototypes using large language-vision models, including CLIP and Stable Diffusion, to select the patches containing the target objects. Furthermore, we propose a ranking model that estimates the counting error of each patch to select the most suitable exemplars for counting. Experimental results on a recent class-agnostic counting dataset, FSC-147, validate the effectiveness of our method. Introduces zero-shot object counting (ZSC), counting instances of a specific class using only the class name, without exemplars. Enables automated object counting for arbitrary classes without requiring human-annotated exemplars, unlike traditional methods. Two-step approach: 1) Class-relevant patch selection using class prototypes generated from either a VAE (trained on semantic embeddings) or Stable Diffusion (conditioned on class name). 2) Optimal patch selection via an error prediction network that predicts counting error for candidate patches. Significantly reduces error rates compared to using RPN proposals directly as exemplars. Selected patches effectively represent target objects and lead to meaningful density maps. Generalizes well to other exemplar-based counting methods. Performance may be limited by the quality of generated prototypes. Further research on handling diverse object scales and occlusions. zero-shot learning, object counting, class-agnostic counting, language-vision models, stable diffusion
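A hedged sketch of the class-relevant patch selection step: score candidate crops by their similarity to the class name and keep the top ones. The actual method builds class prototypes with a VAE or Stable Diffusion and additionally ranks patches with an error-prediction network; plain CLIP text features are used here only as a simplified stand-in, and the box format is an assumption.

```python
import torch
import clip                      # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def select_patches(image: Image.Image, boxes, class_name: str, k: int = 3):
    """boxes: list of (x0, y0, x1, y1) crop candidates, e.g. from an RPN."""
    text = clip.tokenize([f"a photo of a {class_name}"]).to(device)
    crops = torch.stack([preprocess(image.crop(b)) for b in boxes]).to(device)
    with torch.no_grad():
        img_f = model.encode_image(crops)
        txt_f = model.encode_text(text)
        img_f = img_f / img_f.norm(dim=-1, keepdim=True)
        txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
        scores = (img_f @ txt_f.T).squeeze(-1)          # cosine similarity per crop
    top = scores.topk(min(k, len(boxes))).indices.tolist()
    return [boxes[i] for i in top]                      # use as counting exemplars
```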
2309.13042 Report MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation Jiahao Xie, Wei Li, Xiangtai Li, Ziwei Liu, Yew Soon Ong, Chen Change Loy We present MosaicFusion, a simple yet effective diffusion-based data augmentation approach for large vocabulary instance segmentation. Our method is training-free and does not rely on any label supervision. Two key designs enable us to employ an off-the-shelf text-to-image diffusion model as a useful dataset generator for object instances and mask annotations. First, we divide an image canvas into several regions and perform a single round of diffusion process to generate multiple instances simultaneously, conditioning on different text prompts. Second, we obtain corresponding instance masks by aggregating cross-attention maps associated with object prompts across layers and diffusion time steps, followed by simple thresholding and edge-aware refinement processing. Without bells and whistles, our MosaicFusion can produce a significant amount of synthetic labeled data for both rare and novel categories. Experimental results on the challenging LVIS long-tailed and open-vocabulary benchmarks demonstrate that MosaicFusion can significantly improve the performance of existing instance segmentation models, especially for rare and novel categories. Code will be released at https://github.com/Jiahao000/MosaicFusion. This paper proposes MosaicFusion, a training-free data augmentation pipeline using diffusion models to generate images with multiple objects and their corresponding masks for large vocabulary instance segmentation. Large vocabulary instance segmentation suffers from data scarcity, especially for rare and novel categories, limiting model performance. MosaicFusion addresses this by synthesizing labeled data for these categories. MosaicFusion leverages a text-to-image diffusion model (Stable Diffusion) with two key components: 1) Image generation: an image canvas is divided into regions, each assigned a text prompt describing a specific object. The diffusion process runs on each region in parallel to generate the final image. 2) Mask generation: cross-attention maps corresponding to object prompts are aggregated across layers and time steps, then thresholded and refined to produce instance masks. Generating multiple objects per image is more effective than single object generation. MosaicFusion consistently improves performance across different instance segmentation baselines (Mask R-CNN, CenterNet2) and backbones. It significantly boosts performance on rare and novel categories in both long-tailed and open-vocabulary settings on LVIS, showing complementarity with CLIP-based methods. The study is limited to Stable Diffusion; exploring other diffusion models could be beneficial. Current diffusion models have limited expressiveness, leading to a domain gap between synthetic and real images. Generating more complex scenes is a future direction. text-to-image diffusion models, data augmentation, long tail, open vocabulary, instance segmentation
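A minimal sketch of the mask-generation step described in this entry: average the cross-attention maps that correspond to an object's prompt tokens, collected across layers and diffusion timesteps at a common resolution, then threshold. Tensor shapes are assumptions, and the paper's edge-aware refinement is omitted.

```python
import torch

def attention_to_mask(attn_maps, token_ids, image_hw, thresh=0.4):
    """attn_maps: list of (heads, H*W, n_tokens) cross-attention tensors gathered
    over layers/timesteps at resolution (H, W); token_ids: indices of the prompt
    tokens naming the object instance."""
    H, W = image_hw
    # Average over heads, the object's tokens, and all collected maps.
    agg = torch.stack([a.mean(0)[:, token_ids].mean(-1) for a in attn_maps]).mean(0)
    agg = agg.reshape(H, W)
    agg = (agg - agg.min()) / (agg.max() - agg.min() + 1e-8)   # normalize to [0, 1]
    return agg > thresh                                         # binary instance mask
```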
2309.13038 Report Privacy Assessment on Reconstructed Images: Are Existing Evaluation Metrics Faithful to Human Perception? Xiaoxiao Sun, Nidham Gazagnadou, Vivek Sharma, Lingjuan Lyu, Hongdong Li, Liang Zheng Hand-crafted image quality metrics, such as PSNR and SSIM, are commonly used to evaluate model privacy risk under reconstruction attacks. Under these metrics, reconstructed images that are determined to resemble the original one generally indicate more privacy leakage. Images determined as overall dissimilar, on the other hand, indicate higher robustness against attack. However, there is no guarantee that these metrics well reflect human opinions, which, as a judgement for model privacy leakage, are more trustworthy. In this paper, we comprehensively study the faithfulness of these hand-crafted metrics to human perception of privacy information from the reconstructed images. On 5 datasets ranging from natural images, faces, to fine-grained classes, we use 4 existing attack methods to reconstruct images from many different classification models and, for each reconstructed image, we ask multiple human annotators to assess whether this image is recognizable. Our studies reveal that the hand-crafted metrics only have a weak correlation with the human evaluation of privacy leakage and that even these metrics themselves often contradict each other. These observations suggest risks of current metrics in the community. To address this potential risk, we propose a learning-based measure called SemSim to evaluate the Semantic Similarity between the original and reconstructed images. SemSim is trained with a standard triplet loss, using an original image as an anchor, one of its recognizable reconstructed images as a positive sample, and an unrecognizable one as a negative. By training on human annotations, SemSim exhibits a greater reflection of privacy leakage on the semantic level. We show that SemSim has a significantly higher correlation with human judgment compared with existing metrics. Moreover, this strong correlation generalizes to unseen datasets, models and attack methods. This paper investigates the faithfulness of existing hand-crafted image quality metrics (e.g., PSNR, SSIM) in evaluating the privacy risks of image classification models under reconstruction attacks, and proposes a new learning-based metric, SemSim, which demonstrates a stronger correlation with human perception of privacy leakage. Existing image quality metrics, commonly used to evaluate model privacy risks, often show inconsistency with human perception of privacy leakage from reconstructed images, indicating potential risks in privacy assessment. The authors collect human annotations on the recognizability of reconstructed images from various datasets, models, and attack methods. They then analyze the correlation between human evaluation and existing metrics. Finally, they propose SemSim, a learning-based metric trained with a triplet loss on human-annotated data to better capture semantic similarity and reflect privacy leakage. Existing image quality metrics show weak correlation with human perception of privacy leakage. SemSim exhibits a significantly stronger correlation with human judgment compared to existing metrics. SemSim generalizes well to unseen datasets, models, and attack methods. SemSim's performance might decrease when facing significant distributional shifts. The binary nature of privacy leakage in the current study could be extended to a more continuous measure in future work. privacy, reconstruction attacks, image quality assessment, human perception, semantic similarity
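A hedged sketch of how a SemSim-style metric could be trained, following the triplet setup in this entry: the anchor is an original image, the positive a human-labelled recognizable reconstruction, the negative an unrecognizable one. The backbone, embedding size, margin, and learning rate are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

encoder = models.resnet18(weights=None)
encoder.fc = nn.Linear(encoder.fc.in_features, 128)   # 128-d embedding head
triplet = nn.TripletMarginLoss(margin=1.0)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)

def train_step(anchor, positive, negative):
    """anchor/positive/negative: (N, 3, H, W) image batches."""
    loss = triplet(encoder(anchor), encoder(positive), encoder(negative))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def semsim(original, reconstruction):
    """Smaller embedding distance = more semantic similarity = more privacy leakage."""
    with torch.no_grad():
        return torch.norm(encoder(original) - encoder(reconstruction), dim=1)
```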
2309.12969 Report Detect Everything with Few Examples Xinyu Zhang, Yuting Wang, Abdeslam Boularias Few-shot object detection aims at detecting novel categories given a few example images. Recent methods focus on finetuning strategies, with complicated procedures that prohibit a wider application. In this paper, we introduce DE-ViT, a few-shot object detector without the need for finetuning. DE-ViT's novel architecture is based on a new region-propagation mechanism for localization. The propagated region masks are transformed into bounding boxes through a learnable spatial integral layer. Instead of training prototype classifiers, we propose to use prototypes to project ViT features into a subspace that is robust to overfitting on base classes. We evaluate DE-ViT on few-shot, and one-shot object detection benchmarks with Pascal VOC, COCO, and LVIS. DE-ViT establishes new state-of-the-art results on all benchmarks. Notably, for COCO, DE-ViT surpasses the few-shot SoTA by 15 mAP on 10-shot and 7.2 mAP on 30-shot and one-shot SoTA by 2.8 AP50. For LVIS, DE-ViT outperforms few-shot SoTA by 20 box APr. This paper introduces DE-ViT, a fast few-shot object detector that can detect novel objects without finetuning by leveraging the generalization power of strong pretrained ViT backbones. Existing few-shot object detection methods rely heavily on finetuning, which leads to limitations in practical use, overfitting on base classes, and a large accuracy gap between base and novel classes. DE-ViT utilizes a novel region-propagation-based localization architecture with a learnable spatial integral layer to transform predicted regions into bounding boxes. It also employs feature subspace projection using class prototypes to mitigate overfitting on base classes. DE-ViT achieves state-of-the-art results on few-shot object detection benchmarks, including Pascal VOC, COCO, and LVIS. On COCO, DE-ViT surpasses the previous SoTA LVC by 15 mAP on 10-shot and 7.2 mAP on 30-shot. On LVIS, DE-ViT outperforms the previous SoTA DiGeo by 20 box APr. The feature subspace projection creates separate features for each class, introducing inference overhead. The full potential of the region propagation network and spatial integral layer for general object detection tasks, including segmentation, is not fully explored in this work. few-shot object detection, vision transformer, region propagation, spatial integral layer, feature subspace projection
2309.12790 Report NTO3D: Neural Target Object 3D Reconstruction with Segment Anything Xiaobao Wei, Renrui Zhang, Jiarui Wu, Jiaming Liu, Ming Lu, Yandong Guo, Shanghang Zhang Neural 3D reconstruction from multi-view images has recently attracted increasing attention from the community. Existing methods normally learn a neural field for the whole scene, while it is still under-explored how to reconstruct a target object indicated by users. Considering the Segment Anything Model (SAM) has shown effectiveness in segmenting any 2D images, in this paper, we propose NTO3D, a novel high-quality Neural Target Object 3D (NTO3D) reconstruction method, which leverages the benefits of both neural field and SAM. We first propose a novel strategy to lift the multi-view 2D segmentation masks of SAM into a unified 3D occupancy field. The 3D occupancy field is then projected into 2D space and generates the new prompts for SAM. This process is iterative until convergence to separate the target object from the scene. After this, we then lift the 2D features of the SAM encoder into a 3D feature field in order to improve the reconstruction quality of the target object. NTO3D lifts the 2D masks and features of SAM into the 3D neural field for high-quality neural target object 3D reconstruction. We conduct detailed experiments on several benchmark datasets to demonstrate the advantages of our method. The code will be available at: https://github.com/ucwxb/NTO3D. This paper introduces NTO3D, a novel method for reconstructing a user-specified target object from multi-view images by leveraging the strengths of neural fields and the Segment Anything Model (SAM). Existing neural 3D reconstruction methods typically model the entire scene, neglecting the need to isolate and reconstruct specific objects. NTO3D addresses this gap by enabling user-guided, on-the-fly target object reconstruction, improving both flexibility and reconstruction quality. NTO3D operates in two stages. First, it trains a 3D occupancy field to merge multi-view 2D segmentation masks obtained from SAM, effectively isolating the target object in 3D space. Second, it refines the reconstruction by lifting SAM's 2D features into a 3D feature field, enhancing surface quality and detail. NTO3D achieves superior segmentation accuracy compared to baselines, generating high-quality multi-view consistent masks of target objects. NTO3D demonstrates substantial improvements in novel view synthesis quality, as evidenced by higher PSNR and SSIM values compared to state-of-the-art methods. The reconstructed 3D models produced by NTO3D exhibit higher fidelity, achieving lower Chamfer distances compared to existing techniques. NTO3D's performance depends on SAM's ability to segment the target object effectively; challenges arise when SAM encounters complex scenes or fails to provide accurate masks. Future work could focus on enhancing NTO3D's robustness by incorporating techniques like parameter-efficient fine-tuning for SAM to handle challenging segmentation scenarios. 3d reconstruction, neural fields, segment anything model (sam), target object segmentation, multi-view images
2309.12757 Report Masking Improves Contrastive Self-Supervised Learning for ConvNets, and Saliency Tells You Where Zhi-Yi Chin, Chieh-Ming Jiang, Ching-Chun Huang, Pin-Yu Chen, Wei-Chen Chiu While image data starts to enjoy the simple-but-effective self-supervised learning scheme built upon masking and self-reconstruction objective thanks to the introduction of tokenization procedure and vision transformer backbone, convolutional neural networks as another important and widely-adopted architecture for image data, though having contrastive-learning techniques to drive the self-supervised learning, still face the difficulty of leveraging such straightforward and general masking operation to benefit their learning process significantly. In this work, we aim to alleviate the burden of including masking operation into the contrastive-learning framework for convolutional neural networks as an extra augmentation method. In addition to the additive but unwanted edges (between masked and unmasked regions) as well as other adverse effects caused by the masking operations for ConvNets, which have been discussed by prior works, we particularly identify the potential problem where for one view in a contrastive sample-pair the randomly-sampled masking regions could be overly concentrated on important/salient objects thus resulting in misleading contrastiveness to the other view. To this end, we propose to explicitly take the saliency constraint into consideration in which the masked regions are more evenly distributed among the foreground and background for realizing the masking-based augmentation. Moreover, we introduce hard negative samples by masking larger regions of salient patches in an input image. Extensive experiments conducted on various datasets, contrastive learning mechanisms, and downstream tasks well verify the efficacy as well as the superior performance of our proposed method with respect to several state-of-the-art baselines. This paper proposes a novel saliency-guided masking augmentation method for improving contrastive self-supervised learning in convolutional neural networks. While masking and self-reconstruction have been successful in self-supervised learning for vision transformers, convolutional networks struggle to effectively utilize these techniques. This paper addresses this gap by incorporating masking as an augmentation strategy in a contrastive learning framework. The method utilizes a pretrained localization network to generate saliency maps, guiding the masking operation to distribute masked patches evenly between foreground and background. It introduces three masking strategies to mitigate parasitic edges: high-pass filtering, strong blurring, and mean filling. Additionally, it explores the creation of hard negative samples by masking large portions of salient patches. Saliency-guided masking consistently outperforms random masking and other baselines in various downstream tasks, including image classification, object detection, and instance segmentation. Masking solely the query branch of the Siamese network, motivated by variance manipulation, further improves performance. Hard negative samples, generated by masking salient patches, provide additional performance benefits. The strong blurring strategy's efficiency is limited by GPU I/O, requiring further optimization. The impact of different localization network choices on the performance requires further investigation. self-supervised learning, contrastive learning, convolutional neural networks, masking augmentation, saliency
2309.12412 Report Speeding up Resnet Architecture with Layers Targeted Low Rank Decomposition Walid Ahmed, Habib Hajimolahoseini, Austin Wen, Yang Liu Compression of a neural network can help in speeding up both the training and the inference of the network. In this research, we study applying compression using low rank decomposition on network layers. Our research demonstrates that to acquire a speed up, the compression methodology should be aware of the underlying hardware as analysis should be done to choose which layers to compress. The advantage of our approach is demonstrated via a case study of compressing ResNet50 and training on full ImageNet-ILSVRC2012. We tested on two different hardware systems Nvidia V100 and Huawei Ascend910. With hardware targeted compression, results on Ascend910 showed 5.36% training speedup and 15.79% inference speedup on Ascend310 with only 1% drop in accuracy compared to the original uncompressed model. This paper proposes a hardware-aware low-rank decomposition (LRD) framework for compressing neural networks to achieve faster training and inference. Compressing large neural networks is crucial for deployment on resource-constrained devices and for faster training and inference. The proposed method uses LRD with tensor decomposition, selectively compressing layers based on hardware and introducing compression modes to find the optimal layers for compression. It also includes final dense layer compression rate adjustment and rank quantization. Not all layers benefit from compression equally, and hardware plays a significant role in determining the effectiveness. The proposed framework achieved a 5.6% training speedup on Huawei Ascend910 and a 15.79% inference speedup on Ascend310 for ResNet50 with minimal accuracy loss. Simply reducing parameters or FLOPs does not guarantee speedup, and careful layer selection and hardware awareness are essential. The study focuses on ResNet50 and two hardware platforms; further validation on diverse architectures and hardware is needed. The search for the optimal compression mode could be automated and potentially improved by incorporating hardware-specific performance models. model compression, low-rank decomposition, hardware-aware compression, neural network speedup, resnet50
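An illustrative sketch of the kind of per-layer factorization this entry discusses: a generic truncated-SVD decomposition that replaces one convolution with two smaller ones. It is not the paper's exact procedure, and the hardware-aware layer selection, compression modes, and rank quantization are not reproduced; groups and dilation are assumed to be 1, and the example layer name is hypothetical.

```python
import torch
import torch.nn as nn

def low_rank_conv(conv: nn.Conv2d, rank: int) -> nn.Sequential:
    """Factor `conv` into (in -> rank) with the original kernel, then (rank -> out) 1x1.
    The rank controls the accuracy/speed trade-off."""
    W = conv.weight.data                        # (out, in, kh, kw)
    out_c, in_c, kh, kw = W.shape
    W2d = W.reshape(out_c, in_c * kh * kw)      # flatten to a matrix
    U, S, Vh = torch.linalg.svd(W2d, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]                # (out, rank)
    V_r = Vh[:rank]                             # (rank, in*kh*kw)

    first = nn.Conv2d(in_c, rank, (kh, kw), stride=conv.stride,
                      padding=conv.padding, bias=False)
    second = nn.Conv2d(rank, out_c, 1, bias=conv.bias is not None)
    first.weight.data = V_r.reshape(rank, in_c, kh, kw)
    second.weight.data = U_r.reshape(out_c, rank, 1, 1)
    if conv.bias is not None:
        second.bias.data = conv.bias.data.clone()
    return nn.Sequential(first, second)

# Example (hypothetical layer choice): compress one conv of a torchvision ResNet-50.
# resnet.layer3[0].conv2 = low_rank_conv(resnet.layer3[0].conv2, rank=64)
```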
2309.11955 Report A Study of Forward-Forward Algorithm for Self-Supervised Learning Jonas Brenig, Radu Timofte Self-supervised representation learning has seen remarkable progress in the last few years, with some of the recent methods being able to learn useful image representations without labels. These methods are trained using backpropagation, the de facto standard. Recently, Geoffrey Hinton proposed the forward-forward algorithm as an alternative training method. It utilizes two forward passes and a separate loss function for each layer to train the network without backpropagation. In this study, for the first time, we study the performance of forward-forward vs. backpropagation for self-supervised representation learning and provide insights into the learned representation spaces. Our benchmark employs four standard datasets, namely MNIST, F-MNIST, SVHN and CIFAR-10, and three commonly used self-supervised representation learning techniques, namely rotation, flip and jigsaw. Our main finding is that while the forward-forward algorithm performs comparably to backpropagation during (self-)supervised training, the transfer performance is significantly lagging behind in all the studied settings. This may be caused by a combination of factors, including having a loss function for each layer and the way the supervised training is realized in the forward-forward paradigm. In comparison to backpropagation, the forward-forward algorithm focuses more on the boundaries and drops part of the information unnecessary for making decisions which harms the representation learning goal. Further investigation and research are necessary to stabilize the forward-forward strategy for self-supervised learning, to work beyond the datasets and configurations demonstrated by Geoffrey Hinton. This paper presents the first study on the performance of the forward-forward algorithm, a novel alternative to backpropagation, for self-supervised representation learning. The forward-forward algorithm, while promising for its biological plausibility, requires evaluation in the context of self-supervised learning, a powerful technique for learning image representations without labels. The authors benchmark the forward-forward algorithm against traditional backpropagation on four datasets (MNIST, F-MNIST, SVHN, CIFAR-10) and three self-supervised tasks (rotation, flip, jigsaw). They analyze accuracy on the self-supervised tasks and the transfer learning performance on classification using a linear classifier. The forward-forward algorithm performs comparably to backpropagation during self-supervised pre-training. The transfer performance of forward-forward significantly lags behind backpropagation in all tested settings, indicating a difficulty in generalizing learned representations. The forward-forward algorithm appears to focus heavily on features directly relevant to the self-supervised task, potentially discarding information crucial for downstream tasks. The specific self-supervised tasks and their implementation details might influence the performance of the forward-forward algorithm. Future work should explore alternative SSL tasks better suited for the forward-forward algorithm and investigate Siamese network structures and generative extensions of the algorithm. forward-forward algorithm, self-supervised learning, representation learning, backpropagation, transfer learning
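A minimal sketch of one forward-forward layer of the kind studied here (after Hinton, 2022): each layer optimizes a local "goodness" objective on positive and negative data, with no gradients flowing between layers. The learning rate and goodness threshold are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFLayer(nn.Module):
    def __init__(self, d_in, d_out, lr=0.03, threshold=2.0):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.opt = torch.optim.Adam(self.linear.parameters(), lr=lr)
        self.threshold = threshold

    def forward(self, x):
        # Length-normalize the input so only its direction is passed upward.
        x = x / (x.norm(dim=1, keepdim=True) + 1e-8)
        return F.relu(self.linear(x))

    def train_step(self, x_pos, x_neg):
        # "Goodness" = mean squared activation; push it above the threshold for
        # positive data and below it for negative data.
        g_pos = self.forward(x_pos).pow(2).mean(dim=1)
        g_neg = self.forward(x_neg).pow(2).mean(dim=1)
        loss = F.softplus(torch.cat([self.threshold - g_pos,
                                     g_neg - self.threshold])).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        # Detach before feeding the next layer, so no gradient reaches earlier layers.
        return self.forward(x_pos).detach(), self.forward(x_neg).detach()
```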
2309.11923 Report TextCLIP: Text-Guided Face Image Generation And Manipulation Without Adversarial Training Xiaozhou You, Jian Zhang Text-guided image generation aims to generate desired images conditioned on given texts, while text-guided image manipulation refers to semantically editing parts of a given image based on specified texts. For these two similar tasks, the key point is to ensure image fidelity as well as semantic consistency. Many previous approaches require complex multi-stage generation and adversarial training, while struggling to provide a unified framework for both tasks. In this work, we propose TextCLIP, a unified framework for text-guided image generation and manipulation without adversarial training. The proposed method accepts input from images or random noise corresponding to these two different tasks, and under the condition of the specific texts, a carefully designed mapping network that exploits the powerful generative capabilities of StyleGAN and the text image representation capabilities of Contrastive Language-Image Pre-training (CLIP) generates images of up to $1024\times1024$ resolution that can currently be generated. Extensive experiments on the Multi-modal CelebA-HQ dataset have demonstrated that our proposed method outperforms existing state-of-the-art methods, both on text-guided generation tasks and manipulation tasks. TextCLIP, a unified framework for text-guided image generation and manipulation without adversarial training, is proposed. Many previous methods for text-guided image generation and manipulation rely on complex multi-stage generation and adversarial training, leading to challenges in efficiency and training difficulty. This work aims to address these limitations and provide a unified framework for both tasks. TextCLIP leverages a pretrained encoder, level-channel mapper, and a pretrained StyleGAN generator. It maps input (image or random noise) to StyleGAN's latent space, uses a level-channel mapper to encode text information, and feeds the resulting latent code to the StyleGAN generator for image synthesis. Different processing is applied based on the task (generation or manipulation). TextCLIP outperforms state-of-the-art methods in both text-guided image generation and manipulation tasks on the Multi-modal CelebA-HQ dataset. The proposed level-channel mapper effectively maps textual information to the latent space of StyleGAN for high-quality image generation. The designed loss functions ensure both image realism and semantic alignment between generated images and given texts. TextCLIP is currently limited to the face domain and needs further exploration for generalization to other domains like flowers or birds. The performance of TextCLIP is affected by the inherent limitations of StyleGAN and CLIP, such as the representation of certain attributes and potential vulnerabilities of CLIP. text-guided image generation, text-guided image manipulation, stylegan, clip, adversarial training
2309.11747 Report MarkNerf:Watermarking for Neural Radiance Field Lifeng Chen, Jia Liu, Yan Ke, Wenquan Sun, Weina Dong, Xiaozhong Pan A watermarking algorithm is proposed in this paper to address the copyright protection issue of implicit 3D models. The algorithm involves embedding watermarks into the images in the training set through an embedding network, and subsequently utilizing the NeRF model for 3D modeling. A copyright verifier is employed to generate a backdoor image by providing a secret perspective as input to the neural radiance field. Subsequently, a watermark extractor is devised using the hyperparameterization method of the neural network to extract the embedded watermark image from that perspective. In a black box scenario, if there is a suspicion that the 3D model has been used without authorization, the verifier can extract watermarks from a secret perspective to verify network copyright. Experimental results demonstrate that the proposed algorithm effectively safeguards the copyright of 3D models. Furthermore, the extracted watermarks exhibit favorable visual effects and demonstrate robust resistance against various types of noise attacks. This paper proposes MarkNerf, a novel watermarking algorithm for Neural Radiance Fields (NeRF) to address copyright protection issues in implicit 3D models. With the increasing popularity of NeRF in generating and sharing 3D content, protecting the copyright of these implicit 3D models becomes crucial. The sender trains a NeRF model on watermarked training images: watermarks are embedded into training images using an embedding network before training the NeRF model, and a secret perspective, acting as the key, is used during training. Watermark extraction employs an over-parameterized network, only revealing the watermark when presented with an image rendered from the secret perspective. The embedded watermarks exhibit high imperceptibility, ensuring minimal visual difference between watermarked and original images. The algorithm demonstrates robustness against various noise attacks, preserving watermark integrity. Watermark extraction is only successful when the image is rendered from the secret perspective, guaranteeing copyright protection. The extraction network's structure could be further improved to mitigate potential watermark extraction from adjacent views. Future work can focus on exploring different watermark embedding and extraction techniques to enhance security and robustness. neural radiance field, 3d watermarking, copyright protection, deep learning, implicit 3d models
2309.11525 Report Light Field Diffusion for Single-View Novel View Synthesis Yifeng Xiong, Haoyu Ma, Shanlin Sun, Kun Han, Hao Tang, Xiaohui Xie Single-view novel view synthesis (NVS), the task of generating images from new viewpoints based on a single reference image, is important but challenging in computer vision. Recent advancements in NVS have leveraged Denoising Diffusion Probabilistic Models (DDPMs) for their exceptional ability to produce high-fidelity images. However, current diffusion-based methods typically utilize camera pose matrices to globally and implicitly enforce 3D constraints, which can lead to inconsistencies in images generated from varying viewpoints, particularly in regions with complex textures and structures. To address these limitations, we present Light Field Diffusion (LFD), a novel conditional diffusion-based approach that transcends the conventional reliance on camera pose matrices. Starting from the camera pose matrices, LFD transforms them into light field encoding, with the same shape as the reference image, to describe the direction of each ray. By integrating light field encoding with the reference image, our method imposes local pixel-wise constraints within the diffusion process, fostering enhanced view consistency. Our approach not only involves training image LFD on the ShapeNet Car dataset but also includes fine-tuning a pre-trained latent diffusion model on the Objaverse dataset. This enables our latent LFD model to exhibit remarkable zero-shot generalization capabilities across out-of-distribution datasets like RTMV as well as in-the-wild images. Experiments demonstrate that LFD not only produces high-fidelity images but also achieves superior 3D consistency in complex regions, outperforming existing novel view synthesis methods. This paper presents Light Field Diffusion (LFD), a novel conditional diffusion-based approach for single-view novel view synthesis that utilizes light field encoding of camera poses to impose local pixel-wise constraints, enhancing view consistency. Existing diffusion-based methods for novel view synthesis rely on camera pose matrices, which provide only global and implicit 3D constraints, leading to inconsistencies in generated images from varying viewpoints, particularly in complex regions. LFD transforms camera pose matrices into light field encoding, describing the direction of each ray. This encoding is integrated with the reference and target images during the diffusion process using a U-Net architecture with cross-attention, enabling local pixel correspondences and enhanced view consistency. Both image-space and latent-space implementations are explored. LFD outperforms existing methods on Objaverse and ShapeNet datasets in terms of view consistency and image quality. The latent LFD model demonstrates zero-shot generalization, effectively synthesizing novel views for out-of-distribution datasets like RTMV and in-the-wild images. LFD with light field encoding shows superior performance compared to using camera pose matrices directly in the diffusion model. Latent LFD faces challenges with highly complex, in-the-wild images, particularly landscapes, due to its training predominantly on synthetic data. The current light field encoding does not explicitly provide depth information or details about the scene's light source. novel view synthesis, diffusion models, light field, single-view reconstruction, 3d vision
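A hedged sketch of building a per-pixel ray encoding from camera intrinsics and extrinsics, shaped like the image so it can be paired with the reference image as a pixel-wise condition, which is the general idea behind the light field encoding in this entry. Plücker-style ray coordinates are used here as one common choice and are an assumption, not necessarily the paper's exact parameterization.

```python
import torch

def light_field_encoding(K: torch.Tensor, c2w: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """K: (3, 3) intrinsics; c2w: (4, 4) camera-to-world pose.
    Returns a (6, H, W) map of ray coordinates (direction, origin x direction)."""
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=-1)   # (H, W, 3)
    dirs_cam = pix @ torch.linalg.inv(K).T                                  # back-project pixels
    dirs = dirs_cam @ c2w[:3, :3].T                                         # rotate to world frame
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origin = c2w[:3, 3].expand_as(dirs)
    moment = torch.cross(origin, dirs, dim=-1)                              # Plücker moment
    return torch.cat([dirs, moment], dim=-1).permute(2, 0, 1)               # (6, H, W)
```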
2309.11497 Report FreeU: Free Lunch in Diffusion U-Net Chenyang Si, Ziqi Huang, Yuming Jiang, Ziwei Liu In this paper, we uncover the untapped potential of diffusion U-Net, which serves as a "free lunch" that substantially improves the generation quality on the fly. We initially investigate the key contributions of the U-Net architecture to the denoising process and identify that its main backbone primarily contributes to denoising, whereas its skip connections mainly introduce high-frequency features into the decoder module, causing the network to overlook the backbone semantics. Capitalizing on this discovery, we propose a simple yet effective method-termed "FreeU" - that enhances generation quality without additional training or finetuning. Our key insight is to strategically re-weight the contributions sourced from the U-Net's skip connections and backbone feature maps, to leverage the strengths of both components of the U-Net architecture. Promising results on image and video generation tasks demonstrate that our FreeU can be readily integrated to existing diffusion models, e.g., Stable Diffusion, DreamBooth, ModelScope, Rerender and ReVersion, to improve the generation quality with only a few lines of code. All you need is to adjust two scaling factors during inference. Project page: https://chenyangsi.top/FreeU/. Proposes "FreeU," a method to improve diffusion model sample quality during inference by re-weighting feature contributions from the U-Net's backbone and skip connections. Diffusion U-Net's internal properties are under-explored, and improving sample quality usually requires computationally expensive training or fine-tuning. Analyzes the contributions of the U-Net backbone and skip connections to the denoising process, then introduces backbone and skip feature scaling factors during inference to re-balance their contributions. Amplifying backbone features improves denoising and image quality. Skip connections primarily contribute high-frequency information, and their modulation has less impact on overall quality. FreeU significantly improves image and video generation quality in Stable Diffusion, DreamBooth, ModelScope, Rerender, and ReVersion without additional training. The impact of different scaling factors and their optimal values need further investigation. The oversmoothing of textures when amplifying backbone features needs to be addressed more robustly. diffusion models, u-net, image generation, video generation, sample quality
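A minimal sketch of the FreeU idea at one decoder stage: amplify part of the backbone feature map and damp the low-frequency content of the skip feature before they are merged. The channel subset, the scaling values `b` and `s`, and the box-shaped spectral filter are illustrative; the released method applies stage-specific factors.

```python
import torch

def freeu_merge(backbone_feat: torch.Tensor, skip_feat: torch.Tensor,
                b: float = 1.2, s: float = 0.9) -> torch.Tensor:
    """backbone_feat, skip_feat: (N, C, H, W) tensors from a diffusion U-Net decoder stage.
    Amplifying the backbone strengthens denoising; attenuating the skip's low
    frequencies limits how much it drowns out the backbone semantics."""
    # Scale a subset of backbone channels (here the first half, for illustration).
    half = backbone_feat.shape[1] // 2
    scaled = backbone_feat.clone()
    scaled[:, :half] = scaled[:, :half] * b

    # Attenuate low-frequency components of the skip connection in Fourier space.
    fft = torch.fft.fftshift(torch.fft.fft2(skip_feat, dim=(-2, -1)), dim=(-2, -1))
    H, W = skip_feat.shape[-2:]
    mask = torch.ones_like(fft.real)
    ch, cw = H // 2, W // 2
    r = max(1, min(H, W) // 8)                   # size of the "low-frequency" box
    mask[..., ch - r:ch + r, cw - r:cw + r] = s
    skip_filtered = torch.fft.ifft2(torch.fft.ifftshift(fft * mask, dim=(-2, -1)),
                                    dim=(-2, -1)).real

    # U-Net decoders typically concatenate skip and backbone features channel-wise.
    return torch.cat([scaled, skip_filtered], dim=1)
```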
2309.11043 Report Score Mismatching for Generative Modeling Senmao Ye, Fei Liu We propose a new score-based model with one-step sampling. Previously, score-based models were burdened with heavy computations due to iterative sampling. For substituting the iterative process, we train a standalone generator to compress all the time steps with the gradient backpropagated from the score network. In order to produce meaningful gradients for the generator, the score network is trained to simultaneously match the real data distribution and mismatch the fake data distribution. This model has the following advantages: 1) For sampling, it generates a fake image with only one step forward. 2) For training, it only needs 10 diffusion steps. 3) Compared with consistency model, it is free of the ill-posed problem caused by consistency loss. On the popular CIFAR-10 dataset, our model outperforms Consistency Model and Denoising Score Matching, which demonstrates the potential of the framework. We further provide more examples on the MNIST and LSUN datasets. The code is available on GitHub. This paper proposes Score Mismatching (SMM), a novel score-based generative model that achieves one-step sampling by training a standalone generator to compress all diffusion time steps, guided by gradients from a score network trained to match real and mismatch fake data distributions. Score-based models typically suffer from heavy computational burdens due to iterative sampling. This work aims to accelerate this process for wider application, particularly in resource-constrained environments like mobile devices. SMM trains a generator and a score network jointly. The score network is trained to match the score of the true data distribution and mismatch the generated data distribution. The standalone generator is trained to fool the score network by generating samples close to the real data manifold. A zero-mean noise injection pipeline is used during training to eliminate noise corruption in the generated samples. SMM outperforms state-of-the-art one-step score-based models, such as Consistency Model, on CIFAR-10 image generation. Only 10 diffusion steps are needed during training, significantly less than traditional score-based models. SMM avoids the ill-posed problem encountered in Consistency Model by not relying on direct pixel-wise mapping between noisy and clean images. The performance of SMM is sensitive to the choice of noise corruption strategy and network architectures. Exploiting pre-trained score-based models for further performance improvement is challenging due to their reliance on noisy score estimation. generative models, score-based models, one-step sampling, diffusion models, adversarial training
2309.11009 Report Controllable Dynamic Appearance for Neural 3D Portraits ShahRukh Athar, Zhixin Shu, Zexiang Xu, Fujun Luan, Sai Bi, Kalyan Sunkavalli, Dimitris Samaras Recent advances in Neural Radiance Fields (NeRFs) have made it possible to reconstruct and reanimate dynamic portrait scenes with control over head-pose, facial expressions and viewing direction. However, training such models assumes photometric consistency over the deformed region e.g. the face must be evenly lit as it deforms with changing head-pose and facial expression. Such photometric consistency across frames of a video is hard to maintain, even in studio environments, thus making the created reanimatable neural portraits prone to artifacts during reanimation. In this work, we propose CoDyNeRF, a system that enables the creation of fully controllable 3D portraits in real-world capture conditions. CoDyNeRF learns to approximate illumination dependent effects via a dynamic appearance model in the canonical space that is conditioned on predicted surface normals and the facial expressions and head-pose deformations. The surface normals prediction is guided using 3DMM normals that act as a coarse prior for the normals of the human head, where direct prediction of normals is hard due to rigid and non-rigid deformations induced by head-pose and facial expression changes. Using only a smartphone-captured short video of a subject for training, we demonstrate the effectiveness of our method on free view synthesis of a portrait scene with explicit head pose and expression controls, and realistic lighting effects. The project page can be found here: http://shahrukhathar.github.io/2023/08/22/CoDyNeRF.html This paper introduces CoDyNeRF, a system that creates controllable and reanimatable 3D neural portraits from videos captured in real-world lighting conditions. Existing NeRF-based portrait animation methods often fail in realistic lighting due to the assumption of photometric consistency, leading to artifacts in relighting and shadowing. CoDyNeRF employs a dynamic canonical appearance model conditioned on surface normals, head pose, facial expressions, and other shading cues. It predicts dynamic surface normals using an MLP trained with 3DMM and scene normals as priors. CoDyNeRF realistically reproduces shadowing, shading, and specularity effects during reanimation. Quantitative evaluations demonstrate superior performance compared to state-of-the-art methods like RigNeRF and Neural Head Avatars. Ablation studies confirm the importance of dynamic appearance conditioning and accurate normal prediction. CoDyNeRF is currently subject-specific, requiring training for each individual. It does not support relighting with novel lighting conditions. neural radiance fields, 3d portrait animation, dynamic appearance modeling, surface normal prediction, realistic relighting
2309.10810 Report PGDiff: Guiding Diffusion Models for Versatile Face Restoration via Partial Guidance Peiqing Yang, Shangchen Zhou, Qingyi Tao, Chen Change Loy Exploiting pre-trained diffusion models for restoration has recently become a favored alternative to the traditional task-specific training approach. Previous works have achieved noteworthy success by limiting the solution space using explicit degradation models. However, these methods often fall short when faced with complex degradations as they generally cannot be precisely modeled. In this paper, we propose PGDiff by introducing partial guidance, a fresh perspective that is more adaptable to real-world degradations compared to existing works. Rather than specifically defining the degradation process, our approach models the desired properties, such as image structure and color statistics of high-quality images, and applies this guidance during the reverse diffusion process. These properties are readily available and make no assumptions about the degradation process. When combined with a diffusion prior, this partial guidance can deliver appealing results across a range of restoration tasks. Additionally, PGDiff can be extended to handle composite tasks by consolidating multiple high-quality image properties, achieved by integrating the guidance from respective tasks. Experimental results demonstrate that our method not only outperforms existing diffusion-prior-based approaches but also competes favorably with task-specific models. This paper introduces PGDiff, a novel approach for image restoration that leverages the generative prior of pre-trained diffusion models without relying on explicit degradation models. Existing diffusion-based restoration methods, while versatile, struggle with complex, real-world degradations due to their dependence on accurate degradation modeling. PGDiff addresses this limitation, offering a more generalizable solution. PGDiff employs 'partial guidance', which focuses on modeling desired properties of high-quality images (e.g., structure, color statistics) rather than the degradation process. This guidance, implemented using classifier guidance with dynamic adjustments, steers the diffusion model's denoising process. PGDiff effectively handles various restoration tasks, including blind face restoration, colorization, and inpainting, outperforming existing diffusion-prior-based methods. The method demonstrates strong performance on challenging cases, such as old photo restoration with scratches, by combining guidance from multiple restoration tasks. PGDiff exhibits flexibility in incorporating additional guidance, exemplified by reference-based restoration using identity features and quality enhancement with perceptual and adversarial losses. The performance of PGDiff is contingent upon the capabilities of the pre-trained diffusion model used. The current implementation primarily focuses on face restoration due to the use of a face-specific diffusion model. Extending it to broader object categories is left for future work. image restoration, diffusion models, generative prior, partial guidance, classifier guidance
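A hedged sketch of guiding a diffusion step toward a desired image property, in the spirit of the partial guidance described in this entry: predict the clean image from the current noisy sample, then nudge it down the gradient of a property loss. The guidance scale, the placement of the correction on the x0 estimate, and the example property are illustrative assumptions, not the paper's exact schedule.

```python
import torch

def guided_x0(x_t, t, eps_model, alpha_bar, property_loss, scale=0.1):
    """One guided prediction inside a DDPM reverse step.
    x_t: current noisy sample; t: integer timestep; eps_model(x_t, t) -> predicted noise;
    alpha_bar: (T,) cumulative alphas; property_loss(x0_hat) -> scalar distance to the
    desired property (structure, color statistics, ...)."""
    x_t = x_t.detach().requires_grad_(True)
    eps = eps_model(x_t, t)
    a = alpha_bar[t]
    x0_hat = (x_t - (1.0 - a).sqrt() * eps) / a.sqrt()
    grad = torch.autograd.grad(property_loss(x0_hat), x_t)[0]
    # Steer the clean-image estimate toward the target property, then hand the
    # corrected x0_hat back to the usual DDPM posterior sampling step.
    return (x0_hat - scale * grad).detach()

# Example property (hypothetical): match the low-frequency structure of a degraded
# input `y`, which makes no assumption about the degradation process itself.
# blur = lambda img: torch.nn.functional.avg_pool2d(img, 8)
# property_loss = lambda x0: torch.nn.functional.mse_loss(blur(x0), blur(y))
```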
2309.10713 Report Interpret Vision Transformers as ConvNets with Dynamic Convolutions Chong Zhou, Chen Change Loy, Bo Dai There has been a debate about the superiority between vision Transformers and ConvNets, serving as the backbone of computer vision models. Although they are usually considered as two completely different architectures, in this paper, we interpret vision Transformers as ConvNets with dynamic convolutions, which enables us to characterize existing Transformers and dynamic ConvNets in a unified framework and compare their design choices side by side. In addition, our interpretation can also guide the network design as researchers now can consider vision Transformers from the design space of ConvNets and vice versa. We demonstrate such potential through two specific studies. First, we inspect the role of softmax in vision Transformers as the activation function and find it can be replaced by commonly used ConvNets modules, such as ReLU and Layer Normalization, which results in a faster convergence rate and better performance. Second, following the design of depth-wise convolution, we create a corresponding depth-wise vision Transformer that is more efficient with comparable performance. The potential of the proposed unified interpretation is not limited to the given examples and we hope it can inspire the community and give rise to more advanced network architectures. Presents a novel interpretation of vision Transformers as ConvNets with dynamic convolutions, providing a unified framework to understand and compare these architectures. This interpretation bridges the gap between Transformers and ConvNets, enabling the transfer of design principles between them for developing more advanced architectures. The authors reformulate the self-attention mechanism in Transformers as a series of operations equivalent to static and dynamic convolutions. Softmax in vision Transformers can be effectively replaced by other normalization and activation techniques common in ConvNets, such as Layer Normalization and ReLU, leading to faster convergence and even improved performance. Inspired by depth-wise convolutions, the authors propose depth-wise vision Transformers, achieving comparable performance to standard Transformers while being more efficient. The proposed framework provides a new lens for analyzing design choices in Transformers and ConvNets side-by-side, fostering cross-architectural inspiration. The paper focuses on $1x1$ dynamic convolutions for Transformers, leaving the exploration of larger kernel sizes for future work. Further investigation is needed to explore the full potential of this unified framework, such as designing self-attention-like dynamic convolutions with strides. vision transformers, convolutional neural networks, dynamic convolutions, self-attention, network architecture design
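A hedged sketch of the softmax-replacement experiment summarized above: attention weights computed with ReLU followed by layer normalization over the key dimension instead of softmax. This is one plausible instantiation under that description; head count, scaling, and the exact normalization placement are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReLUAttention(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.dh = heads, dim // heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                   # x: (B, N, dim)
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.heads, self.dh).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]                     # each (B, heads, N, dh)
        attn = (q @ k.transpose(-2, -1)) / self.dh ** 0.5    # (B, heads, N, N)
        # softmax replaced by ReLU + LayerNorm over the key dimension
        attn = F.layer_norm(F.relu(attn), attn.shape[-1:])
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)
```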
2309.10556 Report Forgedit: Text Guided Image Editing via Learning and Forgetting Shiwen Zhang, Shuai Xiao, Weilin Huang Text-guided image editing on real or synthetic images, given only the original image itself and the target text prompt as inputs, is a very general and challenging task. It requires an editing model to estimate by itself which part of the image should be edited, and then perform either rigid or non-rigid editing while preserving the characteristics of original image. In this paper, we design a novel text-guided image editing method, named as Forgedit. First, we propose a vision-language joint optimization framework capable of reconstructing the original image in 30 seconds, much faster than previous SOTA and much less overfitting. Then we propose a novel vector projection mechanism in text embedding space of Diffusion Models, which is capable of controlling the identity similarity and editing strength separately. Finally, we discovered a general property of UNet in Diffusion Models, i.e., the UNet encoder learns space and structure, the UNet decoder learns appearance and identity. With such a property, we design forgetting mechanisms to successfully tackle the fatal and inevitable overfitting issues when fine-tuning Diffusion Models on one image, thus significantly boosting the editing capability of Diffusion Models. Our method, Forgedit, built on Stable Diffusion, achieves new state-of-the-art results on the challenging text-guided image editing benchmark: TEdBench, surpassing the previous SOTA methods such as Imagic with Imagen, in terms of both CLIP score and LPIPS score. Codes are available at https://github.com/witcherofresearch/Forgedit Proposes Forgedit, a novel text-guided image editing method using diffusion models, addressing limitations of previous optimization-based methods like slow fine-tuning and overfitting. Enables efficient and precise text-guided image editing on real or synthetic images, crucial for tasks like visual storytelling and content creation, while preserving characteristics of the original image. Employs a two-stage approach: 1) **Joint fine-tuning:** Reconstructs the original image using a vision-language joint optimization framework with a BLIP-generated source prompt, achieving faster convergence and less overfitting. 2) **Editing:** Utilizes a novel vector projection mechanism in text embedding space for controlled editing and a forgetting strategy based on the observed UNet property (encoder learns structure, decoder learns appearance) to mitigate overfitting during sampling. Achieves state-of-the-art results on TEdBench, outperforming previous SOTA methods like Imagic. Significantly faster fine-tuning than Imagic (30 seconds vs. 7 minutes). Successfully tackles overfitting issues common in optimization-based editing methods. Editing quality can be influenced by randomness in fine-tuning and sampling. Limited by the editing capabilities of the underlying diffusion model. text-guided image editing, diffusion models, overfitting, vector projection, visual storytelling
2309.10503 Report Steganography for Neural Radiance Fields by Backdooring Weina Dong, Jia Liu, Yan Ke, Lifeng Chen, Wenquan Sun, Xiaozhong Pan The utilization of implicit representation for visual data (such as images, videos, and 3D models) has recently gained significant attention in computer vision research. In this letter, we propose a novel model steganography scheme with implicit neural representation. The message sender leverages Neural Radiance Fields (NeRF) and its viewpoint synthesis capabilities by introducing a viewpoint as a key. The NeRF model generates a secret viewpoint image, which serves as a backdoor. Subsequently, we train a message extractor using overfitting to establish a one-to-one mapping between the secret message and the secret viewpoint image. The sender delivers the trained NeRF model and the message extractor to the receiver over the open channel, and the receiver utilizes the key shared by both parties to obtain the rendered image in the secret view from the NeRF model, and then obtains the secret message through the message extractor. The inherent complexity of the viewpoint information prevents attackers from stealing the secret message accurately. Experimental results demonstrate that the message extractor trained in this letter achieves high-capacity steganography with fast performance, achieving a 100\% accuracy in message extraction. Furthermore, the extensive viewpoint key space of NeRF ensures the security of the steganography scheme. This paper proposes a novel model steganography scheme utilizing Neural Radiance Fields (NeRF) and implicit neural representation. The scheme leverages the viewpoint synthesis capabilities of NeRF to create a backdoor for embedding and extracting secret messages, addressing the need for secure communication in the context of increasing use of deep learning models. The sender trains a NeRF model on a 3D scene and selects a secret viewpoint as the key. A message extractor is trained via overfitting to establish a one-to-one mapping between the secret viewpoint image and the secret message. The receiver uses the shared key and the received model to extract the message. The message extractor achieves 100% accuracy in message extraction from the secret viewpoint image. The scheme demonstrates high message capacity. Even slight deviations from the secret viewpoint render message extraction impossible, ensuring steganographic security. The scheme currently requires republishing the message extractor with each message, incurring security and efficiency drawbacks. Future work includes enhancing the message extractor with additional image processing functionalities to improve steganographic robustness. steganography, neural radiance fields (nerf), implicit neural representation, model steganography, message extractor
2309.10388 Report SideGAN: 3D-Aware Generative Model for Improved Side-View Image Synthesis Kyungmin Jo, Wonjoon Jin, Jaegul Choo, Hyunjoon Lee, Sunghyun Cho While recent 3D-aware generative models have shown photo-realistic image synthesis with multi-view consistency, the synthesized image quality degrades depending on the camera pose (e.g., a face with a blurry and noisy boundary at a side viewpoint). Such degradation is mainly caused by the difficulty of learning both pose consistency and photo-realism simultaneously from a dataset with heavily imbalanced poses. In this paper, we propose SideGAN, a novel 3D GAN training method to generate photo-realistic images irrespective of the camera pose, especially for faces of side-view angles. To ease the challenging problem of learning photo-realistic and pose-consistent image synthesis, we split the problem into two subproblems, each of which can be solved more easily. Specifically, we formulate the problem as a combination of two simple discrimination problems, one of which learns to discriminate whether a synthesized image looks real or not, and the other learns to discriminate whether a synthesized image agrees with the camera pose. Based on this, we propose a dual-branched discriminator with two discrimination branches. We also propose a pose-matching loss to learn the pose consistency of 3D GANs. In addition, we present a pose sampling strategy to increase learning opportunities for steep angles in a pose-imbalanced dataset. With extensive validation, we demonstrate that our approach enables 3D GANs to generate high-quality geometries and photo-realistic images irrespective of the camera pose. This paper proposes SideGAN, a 3D GAN training method to generate photo-realistic images irrespective of the camera pose, particularly for faces at side-view angles. Existing 3D GANs struggle to generate high-quality images at steep angles due to the difficulty of learning both pose consistency and photo-realism simultaneously from datasets with imbalanced poses (more frontal views). SideGAN splits the problem into two subproblems: real/fake image discrimination and pose-consistency discrimination. It introduces a dual-branched discriminator, a pose-matching loss, and an additional uniform pose sampling (AUPS) strategy to address the challenges. SideGAN generates high-quality images and shapes at various camera poses, outperforming baselines in terms of FID and depth accuracy. SideGAN effectively learns to synthesize realistic details (e.g., ears) even at steep angles, unlike previous methods which produce blurry results. Ablation studies demonstrate the benefits of each proposed component, including improved FID scores and more accurate 3D geometry. The model sometimes generates artifacts (black spots) behind ears, especially for animal faces. Background separation may not be perfect despite using a background network. generative adversarial networks (gans), 3d-aware image synthesis, multi-view consistency, pose-controllable image generation, neural radiance fields (nerf)
2309.10336 Report Anti-Aliased Neural Implicit Surfaces with Encoding Level of Detail Yiyu Zhuang, Qi Zhang, Ying Feng, Hao Zhu, Yao Yao, Xiaoyu Li, Yan-Pei Cao, Ying Shan, Xun Cao We present LoD-NeuS, an efficient neural representation for high-frequency geometry detail recovery and anti-aliased novel view rendering. Drawing inspiration from voxel-based representations with the level of detail (LoD), we introduce a multi-scale tri-plane-based scene representation that is capable of capturing the LoD of the signed distance function (SDF) and the space radiance. Our representation aggregates space features from a multi-convolved featurization within a conical frustum along a ray and optimizes the LoD feature volume through differentiable rendering. Additionally, we propose an error-guided sampling strategy to guide the growth of the SDF during the optimization. Both qualitative and quantitative evaluations demonstrate that our method achieves superior surface reconstruction and photorealistic view synthesis compared to state-of-the-art approaches. Presents LoD-NeuS, a novel neural implicit surface representation that leverages encoding level of detail (LoD) for high-quality surface reconstruction and anti-aliased novel view rendering. Addresses the limitations of existing neural implicit surface reconstruction methods in capturing fine-grained details and mitigating aliasing artifacts. Introduces a multi-scale tri-plane-based scene representation to capture LoD of SDF and radiance, employs multi-convolved featurization within conical frustums to approximate cone sampling, and develops an error-guided sampling strategy for SDF growth. Achieves superior surface reconstruction with finer details compared to state-of-the-art methods, particularly for objects with intricate geometries. Effectively reduces aliasing artifacts in novel view rendering by accounting for pixel size and shape through cone sampling approximation. Demonstrates computational efficiency with reduced MLP queries compared to super-sampling techniques. SDF growth refinement, while effective, has been tested on a limited number of cases due to computational constraints. Exploring the application of the proposed method to dynamic scenes with time-varying geometry and appearance. neural implicit surface, signed distance function, volume rendering, anti-aliasing, level of detail
2309.10279 Report 360° Reconstruction From a Single Image Using Space Carved Outpainting Nuri Ryu, Minsu Gong, Geonung Kim, Joo-Haeng Lee, Sunghyun Cho We introduce POP3D, a novel framework that creates a full 360°-view 3D model from a single image. POP3D resolves two prominent issues that limit the single-view reconstruction. Firstly, POP3D offers substantial generalizability to arbitrary categories, a trait that previous methods struggle to achieve. Secondly, POP3D further improves reconstruction fidelity and naturalness, a crucial aspect that concurrent works fall short of. Our approach marries the strengths of four primary components: (1) a monocular depth and normal predictor that serves to predict crucial geometric cues, (2) a space carving method capable of demarcating the potentially unseen portions of the target object, (3) a generative model pre-trained on a large-scale image dataset that can complete unseen regions of the target, and (4) a neural implicit surface reconstruction method tailored to reconstructing objects using RGB images along with monocular geometric cues. The combination of these components enables POP3D to readily generalize across various in-the-wild images and generate state-of-the-art reconstructions, outperforming similar works by a significant margin. Project page: http://cg.postech.ac.kr/research/POP3D POP3D, a novel framework for reconstructing full 360° 3D models from single images, addresses limitations in generalizability and reconstruction fidelity. Generating high-quality 3D models from single images is crucial for various applications but remains challenging due to limited generalizability and fidelity in existing methods. POP3D leverages pre-trained priors for depth, normals, and image generation to progressively outpaint unseen regions. It uses a camera schedule to capture 360° views and refines a neural implicit surface representation using the generated pseudo-ground-truth data. Outperforms existing methods in input-view reconstruction fidelity. Generates semantically similar and high-quality novel views compared to ground truth. Produces high-fidelity 3D shapes and appearances surpassing alternative approaches. Performance relies on the accuracy of off-the-shelf priors used in the pipeline. Reconstruction time can be long due to iterative nature and reliance on 3D model training. single-view 3d reconstruction, shape and appearance reconstruction, novel-view synthesis, space carving, outpainting
2309.10206 Report Image-Text Pre-Training for Logo Recognition Mark Hubenthal, Suren Kumar Open-set logo recognition is commonly solved by first detecting possible logo regions and then matching the detected parts against an ever-evolving dataset of cropped logo images. The matching model, a metric learning problem, is especially challenging for logo recognition due to the mixture of text and symbols in logos. We propose two novel contributions to improve the matching model's performance: (a) using image-text paired samples for pre-training, and (b) an improved metric learning loss function. A standard paradigm of fine-tuning ImageNet pre-trained models fails to discover the text sensitivity necessary to solve the matching problem effectively. This work demonstrates the importance of pre-training on image-text pairs, which significantly improves the performance of a visual embedder trained for the logo retrieval task, especially for more text-dominant classes. We construct a composite public logo dataset combining LogoDet3K, OpenLogo, and FlickrLogos-47 deemed OpenLogoDet3K47. We show that the same vision backbone pre-trained on image-text data, when fine-tuned on OpenLogoDet3K47, achieves $98.6\%$ recall@1, significantly improving performance over pre-training on Imagenet1K ($97.6\%$). We generalize the ProxyNCA++ loss function to propose ProxyNCAHN++ which incorporates class-specific hard negative images. The proposed method sets new state-of-the-art on five public logo datasets considered, with a $3.5\%$ zero-shot recall@1 improvement on LogoDet3K test, $4\%$ on OpenLogo, $6.5\%$ on FlickrLogos-47, $6.2\%$ on Logos In The Wild, and $0.6\%$ on BelgaLogo. The paper introduces a novel approach for open-set logo recognition by utilizing image-text pre-training for improved text sensitivity in logo matching models and proposes a new metric learning loss function, ProxyNCAHN++, for enhanced class separation. Open-set logo recognition is crucial for various applications but challenging due to the evolving nature of logo designs and the presence of text and symbols. Existing methods often struggle with text sensitivity, leading to inaccurate matching. The authors leverage image-text paired data for pre-training a vision backbone, enabling it to develop inherent OCR capabilities. They also introduce ProxyNCAHN++, a metric learning loss function that incorporates hard negative sampling to refine class boundaries in the embedding space. Image-text pre-trained models significantly outperform ImageNet pre-trained models, achieving up to a 6.5% improvement in recall@1 on various public logo datasets. The proposed ProxyNCAHN++ loss function further enhances class separation, leading to a 0.1% improvement in recall@1 on the OpenLogoDet3K47 dataset. The study demonstrates the effectiveness of image-text pre-training in improving text sensitivity for logo recognition, particularly for text-dominant logo classes. The impact of logo detector accuracy on the overall system performance requires further investigation. Challenges such as poorly aligned bounding boxes, blurry logo regions, and stylized text need to be addressed in future work. logo recognition, open-set recognition, metric learning, image-text pre-training, hard negative mining
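A hedged sketch of a ProxyNCA++-style proxy loss, the family the entry above generalizes. Embeddings and class proxies are L2-normalized and compared by squared distance with a temperature; the optional `hard_negative_mask` only gestures at the paper's class-specific hard negatives by restricting the denominator, which is an assumption on my part rather than the exact ProxyNCAHN++ formulation.

```python
import torch
import torch.nn.functional as F

def proxy_nca_loss(embeddings, labels, proxies, temperature=0.1, hard_negative_mask=None):
    z = F.normalize(embeddings, dim=-1)          # (B, D) image embeddings
    p = F.normalize(proxies, dim=-1)             # (C, D) one learnable proxy per class
    dist = torch.cdist(z, p) ** 2                # (B, C) squared Euclidean distances
    logits = -dist / temperature
    if hard_negative_mask is not None:           # (B, C) flags for class-specific hard negatives
        pos = F.one_hot(labels, num_classes=p.shape[0]).bool()
        keep = hard_negative_mask.bool() | pos   # keep the positive proxy plus flagged negatives
        logits = logits.masked_fill(~keep, float("-inf"))
    return F.cross_entropy(logits, labels)

B, C, D = 8, 47, 128
loss = proxy_nca_loss(torch.randn(B, D), torch.randint(0, C, (B,)), torch.randn(C, D))
```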
2309.09858 Report Unsupervised Open-Vocabulary Object Localization in Videos Ke Fan, Zechen Bai, Tianjun Xiao, Dominik Zietlow, Max Horn, Zixu Zhao, Carl-Johann Simon-Gabriel, Mike Zheng Shou, Francesco Locatello, Bernt Schiele, Thomas Brox, Zheng Zhang, Yanwei Fu, Tong He In this paper, we show that recent advances in video representation learning and pre-trained vision-language models allow for substantial improvements in self-supervised video object localization. We propose a method that first localizes objects in videos via a slot attention approach and then assigns text to the obtained slots. The latter is achieved by an unsupervised way to read localized semantic information from the pre-trained CLIP model. The resulting video object localization is entirely unsupervised apart from the implicit annotation contained in CLIP, and it is effectively the first unsupervised approach that yields good results on regular video benchmarks. This paper introduces the first unsupervised approach for localizing and naming objects in real-world videos, leveraging slot attention and a modified CLIP model for local feature alignment. This work is important because it bypasses the need for expensive manual annotation in video object localization, paving the way for self-supervised video understanding. The method uses a three-part pipeline: 1) Video slot learning with a self-supervised video encoder and slot attention for spatiotemporal object localization. 2) Video slot labeling by adapting CLIP for local feature alignment and assigning text labels to the localized slots. 3) Post-processing to merge slots and improve localization and labeling using both visual and semantic information. The approach achieves high-quality and spatio-temporally consistent object localization in real-world videos without any labeled training data. The proposed patch-based CLIP adaptation effectively aligns CLIP with local features for improved semantic labeling. Joint optimization using both text and image features significantly improves both localization and labeling performance, as demonstrated by quantitative results and qualitative examples. The current method struggles to differentiate between individual object instances. The resolution of patch tokens limits the precision of object localization. unsupervised learning, object localization, video understanding, slot attention, clip
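A library-agnostic sketch of the slot-labeling idea from the pipeline above: pool patch-level features (assumed already projected into the CLIP embedding space) with each slot's attention mask, then pick the text label with the highest cosine similarity. The real patch-based CLIP adaptation is more involved; tensor names here are illustrative.

```python
import torch
import torch.nn.functional as F

def label_slots(patch_feats, slot_masks, text_embeds, class_names):
    # patch_feats: (P, D) per-patch features; slot_masks: (S, P) slot-attention weights
    # text_embeds: (K, D) CLIP text embeddings for class_names (length K)
    weights = slot_masks / slot_masks.sum(dim=1, keepdim=True).clamp(min=1e-6)
    slot_feats = weights @ patch_feats                                    # (S, D) mask-pooled features
    sims = F.normalize(slot_feats, dim=-1) @ F.normalize(text_embeds, dim=-1).T
    return [class_names[i] for i in sims.argmax(dim=-1).tolist()]

P, D, S = 196, 512, 6
names = ["dog", "car", "tree", "person"]
print(label_slots(torch.rand(P, D), torch.rand(S, P), torch.randn(len(names), D), names))
```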
2309.09818 Report Grasp-Anything: Large-scale Grasp Dataset from Foundation Models An Dinh Vuong, Minh Nhat Vu, Hieu Le, Baoru Huang, Binh Huynh, Thieu Vo, Andreas Kugi, Anh Nguyen Foundation models such as ChatGPT have made significant strides in robotic tasks due to their universal representation of real-world domains. In this paper, we leverage foundation models to tackle grasp detection, a persistent challenge in robotics with broad industrial applications. Despite numerous grasp datasets, their object diversity remains limited compared to real-world figures. Fortunately, foundation models possess an extensive repository of real-world knowledge, including objects we encounter in our daily lives. As a consequence, a promising solution to the limited representation in previous grasp datasets is to harness the universal knowledge embedded in these foundation models. We present Grasp-Anything, a new large-scale grasp dataset synthesized from foundation models to implement this solution. Grasp-Anything excels in diversity and magnitude, boasting 1M samples with text descriptions and more than 3M objects, surpassing prior datasets. Empirically, we show that Grasp-Anything successfully facilitates zero-shot grasp detection on vision-based tasks and real-world robotic experiments. Our dataset and code are available at https://grasp-anything-2023.github.io. The paper introduces Grasp-Anything, a new large-scale language-driven dataset for robotic grasp detection, leveraging foundation models to overcome object diversity limitations in previous datasets. Existing grasp datasets are limited in object diversity and real-world scene representation, hindering robust generalization in grasp detection. This dataset addresses these limitations using the knowledge embedded in foundation models. The dataset is generated in a multi-stage process involving: 1) Prompt engineering with ChatGPT to create diverse scene descriptions. 2) Image synthesis using Stable Diffusion based on the generated text prompts. 3) Automatic grasp pose annotation and evaluation using a pre-trained model (RAGT-3/3) and a physics-based evaluation method. Grasp-Anything significantly surpasses previous datasets in diversity and magnitude, containing 1 million samples with text descriptions and over 3 million objects. Zero-shot grasp detection experiments demonstrate that Grasp-Anything effectively supports generalization to unseen objects and outperforms other datasets in cross-dataset transfer learning. Real-world robot experiments confirm the effectiveness of Grasp-Anything, achieving higher grasp success rates compared to models trained on other datasets. Dataset creation is time-consuming and relies on access to commercial APIs like ChatGPT. The dataset currently lacks 3D point clouds, which could enhance its applicability in robotic tasks. Future work could explore generating point clouds from the existing data. grasp detection, robotics, dataset, foundation models, zero-shot learning
2309.09724 Report Robust Geometry-Preserving Depth Estimation Using Differentiable Rendering Chi Zhang, Wei Yin, Gang Yu, Zhibin Wang, Tao Chen, Bin Fu, Joey Tianyi Zhou, Chunhua Shen In this study, we address the challenge of 3D scene structure recovery from monocular depth estimation. While traditional depth estimation methods leverage labeled datasets to directly predict absolute depth, recent advancements advocate for mix-dataset training, enhancing generalization across diverse scenes. However, such mixed dataset training yields depth predictions only up to an unknown scale and shift, hindering accurate 3D reconstructions. Existing solutions necessitate extra 3D datasets or geometry-complete depth annotations, constraints that limit their versatility. In this paper, we propose a learning framework that trains models to predict geometry-preserving depth without requiring extra data or annotations. To produce realistic 3D structures, we render novel views of the reconstructed scenes and design loss functions to promote depth estimation consistency across different views. Comprehensive experiments underscore our framework's superior generalization capabilities, surpassing existing state-of-the-art methods on several benchmark datasets without leveraging extra training information. Moreover, our innovative loss functions empower the model to autonomously recover domain-specific scale-and-shift coefficients using solely unlabeled images. This paper proposes a novel depth estimation learning framework that produces geometry-preserving depth for 3D scene recovery without requiring extra data or annotations. Existing mix-dataset training methods for depth estimation, while offering strong generalization, produce depth predictions with unknown scale and shift, hindering accurate 3D reconstruction. This paper addresses this challenge to enable robust 3D scene recovery from monocular images. The proposed framework leverages differentiable rendering to generate novel views of the 3D scene reconstructed from the predicted depth. It then enforces consistency between depth predictions of the original and rendered views using novel loss functions. The method outperforms existing geometry-preserving depth estimation techniques without using additional data or annotations. The proposed consistency loss enables self-supervised recovery of domain-specific scale and shift coefficients for pre-trained models. The framework can also estimate camera intrinsic parameters, such as focal length, by minimizing the proposed consistency losses over a range of possible values. The focal length needs to be estimated for point cloud reconstruction when not available. Future work could explore extending the framework to handle dynamic scenes. depth estimation, 3d reconstruction, differentiable rendering, self-supervised learning, multi-view consistency
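To make "recovering scale and shift" concrete: mix-dataset depth predictions are only affine-invariant, so a scalar scale s and shift t must be found before 3D reconstruction. The paper recovers these self-supervised from rendered-view consistency; the sketch below shows the simpler closed-form least-squares fit against a reference depth as a stand-in for what those coefficients mean.

```python
import numpy as np

def fit_scale_shift(pred, ref, mask=None):
    """Solve min_{s,t} || s * pred + t - ref ||^2 over valid pixels."""
    p, r = pred.reshape(-1), ref.reshape(-1)
    if mask is not None:
        m = mask.reshape(-1).astype(bool)
        p, r = p[m], r[m]
    A = np.stack([p, np.ones_like(p)], axis=1)       # (N, 2) design matrix
    (s, t), *_ = np.linalg.lstsq(A, r, rcond=None)
    return s, t

pred = np.random.rand(480, 640)                      # affine-invariant prediction
ref = 2.5 * pred + 0.3 + 0.01 * np.random.randn(480, 640)
print(fit_scale_shift(pred, ref))                    # approximately (2.5, 0.3)
```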
2309.09614 Report Gradpaint: Gradient-Guided Inpainting with Diffusion Models Asya Grechka, Guillaume Couairon, Matthieu Cord Denoising Diffusion Probabilistic Models (DDPMs) have recently achieved remarkable results in conditional and unconditional image generation. The pre-trained models can be adapted without further training to different downstream tasks, by guiding their iterative denoising process at inference time to satisfy additional constraints. For the specific task of image inpainting, the current guiding mechanism relies on copying-and-pasting the known regions from the input image at each denoising step. However, diffusion models are strongly conditioned by the initial random noise, and therefore struggle to harmonize predictions inside the inpainting mask with the real parts of the input image, often producing results with unnatural artifacts. Our method, dubbed GradPaint, steers the generation towards a globally coherent image. At each step in the denoising process, we leverage the model's "denoised image estimation" by calculating a custom loss measuring its coherence with the masked input image. Our guiding mechanism uses the gradient obtained from backpropagating this loss through the diffusion model itself. GradPaint generalizes well to diffusion models trained on various datasets, improving upon current state-of-the-art supervised and unsupervised methods. Presents GradPaint, a training-free algorithm that guides diffusion models for image inpainting by harmonizing generated content with the known context through gradient descent. Leveraging pre-trained diffusion models for inpainting without retraining is highly desirable due to their computational cost. Existing methods struggle to harmonize generated regions, leading to unnatural artifacts. Introduces a novel gradient-based update mechanism during the diffusion denoising process. Uses a custom loss combining masked MSE and an alignment loss to ensure smooth transitions between generated and real regions. Backpropagates the loss through the diffusion model itself to guide the generation at each step. Achieves state-of-the-art FID scores on FFHQ, outperforming training-based and training-free inpainting methods. Significantly improves harmonization between generated and real regions, resulting in more natural and coherent inpainted images. Generalizes well to various datasets (CelebA-HQ, FFHQ, ImageNet, Places2) and pre-trained diffusion models, including latent diffusion models (Stable Diffusion). Computational cost is higher than gradient-free baselines, although early stopping can mitigate this. May introduce unintended bias from the background context into the generated regions in certain cases. image inpainting, diffusion models, generative models, gradient-based optimization, zero-shot learning
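A schematic sketch of one gradient-guided denoising step in the spirit of the mechanism above: form the model's clean-image estimate, score its masked MSE against the known pixels, backpropagate through the network to the current latent, and nudge the latent before the usual scheduler update. `eps_model`, `alpha_bar`, and `guidance_scale` are placeholders, and GradPaint's full objective also includes an alignment term omitted here.

```python
import torch

def guided_step(x_t, t, eps_model, alpha_bar, known, mask, guidance_scale=1.0):
    """x_t: current sample; t: int timestep; alpha_bar: 1-D tensor of cumulative alphas."""
    x_t = x_t.detach().requires_grad_(True)
    eps = eps_model(x_t, t)                                   # predicted noise
    a = alpha_bar[t]
    x0_hat = (x_t - (1 - a).sqrt() * eps) / a.sqrt()          # "denoised image estimation"
    loss = ((mask * (x0_hat - known)) ** 2).mean()            # coherence with the known region
    grad, = torch.autograd.grad(loss, x_t)                    # backprop through the model itself
    return (x_t - guidance_scale * grad).detach(), eps.detach()

# tiny smoke test with a dummy noise predictor
net = torch.nn.Conv2d(3, 3, 3, padding=1)
x = torch.randn(1, 3, 32, 32)
abar = torch.linspace(0.9999, 0.01, 1000)
x_new, _ = guided_step(x, 500, lambda z, t: net(z), abar,
                       known=torch.zeros_like(x), mask=torch.ones_like(x))
```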
2309.09466 Report Progressive Text-to-Image Diffusion with Soft Latent Direction YuTeng Ye, Jiale Cai, Hang Zhou, Guanwen Li, Youjia Zhang, Zikai Song, Chenxing Gao, Junqing Yu, Wei Yang In spite of the rapidly evolving landscape of text-to-image generation, the synthesis and manipulation of multiple entities while adhering to specific relational constraints pose enduring challenges. This paper introduces an innovative progressive synthesis and editing operation that systematically incorporates entities into the target image, ensuring their adherence to spatial and relational constraints at each sequential step. Our key insight stems from the observation that while a pre-trained text-to-image diffusion model adeptly handles one or two entities, it often falters when dealing with a greater number. To address this limitation, we propose harnessing the capabilities of a Large Language Model (LLM) to decompose intricate and protracted text descriptions into coherent directives adhering to stringent formats. To facilitate the execution of directives involving distinct semantic operations-namely insertion, editing, and erasing-we formulate the Stimulus, Response, and Fusion (SRF) framework. Within this framework, latent regions are gently stimulated in alignment with each operation, followed by the fusion of the responsive latent components to achieve cohesive entity manipulation. Our proposed framework yields notable advancements in object synthesis, particularly when confronted with intricate and lengthy textual inputs. Consequently, it establishes a new benchmark for text-to-image generation tasks, further elevating the field's performance standards. This paper introduces a novel progressive text-to-image diffusion framework that uses a Large Language Model (LLM) to decompose complex text descriptions into a sequence of short prompts, enabling the synthesis and manipulation of multiple entities with specific spatial and relational constraints. Existing text-to-image generation methods struggle with synthesizing or editing images with multiple objects and complex relationships described in lengthy textual prompts. This work aims to improve the accuracy and controllability of multi-entity image generation. The proposed framework consists of three main components: 1) Text Decomposition using a fine-tuned GPT model to break down complex text into short, structured prompts. 2) Stimulus & Response, where a stimulus loss function guides the cross-attention map of a diffusion model to focus on relevant spatial regions for object generation. 3) Latent Fusion, which seamlessly blends the generated object features with the background image from the previous stage. The proposed method outperforms single-stage generation baselines in object recall and relation accuracy, demonstrating its ability to handle complex scenes with multiple objects and relationships. Compared to other progressive generation methods, this approach achieves superior image fidelity and controllability in synthesizing, editing, and erasing objects according to textual instructions. Ablation studies confirm the importance of both Stimulus & Response and Latent Fusion components in achieving accurate and consistent image generation. The text decomposition method may not be effective for all types of complex sentences, particularly those with deeply nested clauses and relationships. Future work includes improving the parsing capabilities of the GPT model for handling a wider variety of complex text descriptions. text-to-image generation, diffusion models, large language models, progressive synthesis, image manipulation
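A hedged sketch of a region-constrained "stimulus" objective on cross-attention: penalize the fraction of an entity token's attention mass that falls outside the desired region. This is one plausible form rather than the paper's exact loss, and all names are illustrative.

```python
import torch

def stimulus_loss(attn_map: torch.Tensor, region_mask: torch.Tensor) -> torch.Tensor:
    """attn_map: (H, W) cross-attention for the entity token; region_mask: (H, W) in {0, 1}."""
    attn = attn_map / attn_map.sum().clamp(min=1e-8)   # normalize to a distribution over pixels
    inside = (attn * region_mask).sum()
    return 1.0 - inside                                 # zero when all attention mass is inside

attn = torch.rand(32, 32)
mask = torch.zeros(32, 32)
mask[8:24, 8:24] = 1.0                                  # target box for the inserted entity
print(stimulus_loss(attn, mask))
```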
2309.09456 Report Object2Scene: Putting Objects in Context for Open-Vocabulary 3D Detection Chenming Zhu, Wenwei Zhang, Tai Wang, Xihui Liu, Kai Chen Point cloud-based open-vocabulary 3D object detection aims to detect 3D categories that do not have ground-truth annotations in the training set. It is extremely challenging because of the limited data and annotations (bounding boxes with class labels or text descriptions) of 3D scenes. Previous approaches leverage large-scale richly-annotated image datasets as a bridge between 3D and category semantics but require an extra alignment process between 2D images and 3D points, limiting the open-vocabulary ability of 3D detectors. Instead of leveraging 2D images, we propose Object2Scene, the first approach that leverages large-scale large-vocabulary 3D object datasets to augment existing 3D scene datasets for open-vocabulary 3D object detection. Object2Scene inserts objects from different sources into 3D scenes to enrich the vocabulary of 3D scene datasets and generates text descriptions for the newly inserted objects. We further introduce a framework that unifies 3D detection and visual grounding, named L3Det, and propose a cross-domain category-level contrastive learning approach to mitigate the domain gap between 3D objects from different datasets. Extensive experiments on existing open-vocabulary 3D object detection benchmarks show that Object2Scene obtains superior performance over existing methods. We further verify the effectiveness of Object2Scene on a new benchmark OV-ScanNet-200, by holding out all rare categories as novel categories not seen during training. This paper presents Object2Scene, a novel approach that leverages large-scale 3D object datasets to enrich existing 3D scene datasets, enabling open-vocabulary 3D object detection. Open-vocabulary 3D object detection is crucial for real-world applications but challenging due to the limited annotations in 3D scenes. This work addresses this by leveraging readily available 3D object datasets. Object2Scene inserts objects from large-vocabulary 3D object datasets into 3D scenes and generates language grounding prompts to guide model training. It introduces a unified model, L3Det, for 3D detection and grounding, incorporating cross-domain contrastive learning to mitigate domain gaps. Object2Scene achieves state-of-the-art performance, outperforming previous methods by significant margins on OV-ScanNet20 and OV-SUN RGB-D20 benchmarks. Relative location prompts, generated by Object2Scene, prove to be highly effective for open-vocabulary 3D detection. Cross-domain category-level contrastive learning effectively reduces the domain gap between inserted objects and original scenes, improving performance. The model may struggle with objects that have significant variations in point cloud distributions, such as chairs tucked under tables. Future work will explore better ways to align point cloud distributions from different sources and address challenging cases. open-vocabulary learning, 3d object detection, 3d visual grounding, point cloud processing, domain adaptation
2309.09256 Report LiDAR Data Synthesis with Denoising Diffusion Probabilistic Models Kazuto Nakashima, Ryo Kurazume Generative modeling of 3D LiDAR data is an emerging task with promising applications for autonomous mobile robots, such as scalable simulation, scene manipulation, and sparse-to-dense completion of LiDAR point clouds. While existing approaches have demonstrated the feasibility of image-based LiDAR data generation using deep generative models, they still struggle with fidelity and training stability. In this work, we present R2DM, a novel generative model for LiDAR data that can generate diverse and high-fidelity 3D scene point clouds based on the image representation of range and reflectance intensity. Our method is built upon denoising diffusion probabilistic models (DDPMs), which have shown impressive results among generative model frameworks in recent years. To effectively train DDPMs in the LiDAR domain, we first conduct an in-depth analysis of data representation, loss functions, and spatial inductive biases. Leveraging our R2DM model, we also introduce a flexible LiDAR completion pipeline based on the powerful capabilities of DDPMs. We demonstrate that our method surpasses existing methods in generating tasks on the KITTI-360 and KITTI-Raw datasets, as well as in the completion task on the KITTI-360 dataset. Our project page can be found at https://kazuto1011.github.io/r2dm. This paper presents R2DM, a novel denoising diffusion probabilistic model for generating realistic LiDAR range and reflectance images, and demonstrates its effectiveness for LiDAR point cloud completion. Generative modeling of LiDAR point clouds is crucial for applications like autonomous driving, enabling scalable simulation, scene manipulation, and sparse-to-dense completion of LiDAR data. The authors investigate various aspects of DDPM design for LiDAR data, including loss functions, data representation, and spatial inductive bias. They find that using Fourier features for positional encoding significantly improves generation quality. They also integrate R2DM with the RePaint method for LiDAR completion tasks. R2DM achieves state-of-the-art generation performance on KITTI-360 and KITTI-Raw datasets, outperforming previous GAN-based and diffusion-based methods. Fourier features as spatial inductive bias are found to be crucial for generating high-fidelity LiDAR point clouds. The proposed R2DM-based completion pipeline outperforms baseline methods on beam-level upsampling tasks, demonstrating its effectiveness for LiDAR data completion. The paper mainly focuses on relatively clean and downsampled LiDAR data, and future work could explore noise-robust training and handling full-resolution point clouds. Further investigation is needed to explore the scalability of the model and its applications to downstream perception tasks. lidar, generative model, diffusion model, point cloud completion, autonomous driving
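A sketch of Fourier-feature positional encoding for an equirectangular range image, the kind of spatial inductive bias the entry above reports as crucial. The frequency count and scale are arbitrary choices here, not the paper's settings.

```python
import numpy as np

def fourier_features(h, w, num_freqs=16, scale=10.0, seed=0):
    rng = np.random.default_rng(seed)
    B = rng.normal(scale=scale, size=(2, num_freqs))                  # random projection matrix
    ys, xs = np.meshgrid(np.linspace(0, 1, h), np.linspace(0, 1, w), indexing="ij")
    coords = np.stack([ys, xs], axis=-1)                              # (h, w, 2) pixel coordinates in [0, 1]
    proj = 2 * np.pi * coords @ B                                     # (h, w, num_freqs)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)      # (h, w, 2 * num_freqs)

feats = fourier_features(64, 1024)   # e.g. a 64-beam x 1024-column range/reflectance image
print(feats.shape)                    # (64, 1024, 32)
```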
2309.08957 Report ExBluRF: Efficient Radiance Fields for Extreme Motion Blurred Images Dongwoo Lee, Jeongtaek Oh, Jaesung Rim, Sunghyun Cho, Kyoung Mu Lee We present ExBluRF, a novel view synthesis method for extreme motion blurred images based on efficient radiance fields optimization. Our approach consists of two main components: 6-DOF camera trajectory-based motion blur formulation and voxel-based radiance fields. From extremely blurred images, we optimize the sharp radiance fields by jointly estimating the camera trajectories that generate the blurry images. In training, multiple rays along the camera trajectory are accumulated to reconstruct single blurry color, which is equivalent to the physical motion blur operation. We minimize the photo-consistency loss on blurred image space and obtain the sharp radiance fields with camera trajectories that explain the blur of all images. The joint optimization on the blurred image space demands painfully increasing computation and resources proportional to the blur size. Our method solves this problem by replacing the MLP-based framework to low-dimensional 6-DOF camera poses and voxel-based radiance fields. Compared with the existing works, our approach restores much sharper 3D scenes from challenging motion blurred views with the order of 10 times less training time and GPU memory consumption. ExBluRF, an efficient radiance field method for synthesizing sharp novel views from images with extreme motion blur. Existing NeRF-based methods struggle with extreme motion blur due to shape-radiance ambiguity and computational limitations. Models motion blur using a 6-DOF camera trajectory optimized via Bézier curves and utilizes voxel-based radiance fields for efficiency. Achieves superior deblurring and novel view synthesis compared to previous methods on both real and synthetic datasets. Significantly reduces memory consumption and training time, enabling scalability to extreme blur. Estimated camera trajectories accurately converge to ground truth trajectories without additional supervision. Potential overfitting with unnecessarily high-order Bézier curves. Reliance on accurate camera pose estimation for evaluation on real datasets with limited ground truth. neural radiance fields, motion deblurring, novel view synthesis, voxel-based radiance fields, camera trajectory estimation
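A hedged sketch of the blur formulation: evaluate a cubic Bezier curve over 6-DOF pose parameters and average the colors rendered at samples along it, mimicking the physical motion-blur integral. `render_fn` is a placeholder for the radiance-field renderer, and interpolating raw 6-vectors is a simplification of whatever pose parameterization the method actually optimizes.

```python
import numpy as np

def bezier_pose(control, t):
    """Cubic Bezier over pose parameters. control: (4, 6) array of control poses, t in [0, 1]."""
    c0, c1, c2, c3 = control
    return ((1 - t) ** 3 * c0 + 3 * (1 - t) ** 2 * t * c1
            + 3 * (1 - t) * t ** 2 * c2 + t ** 3 * c3)

def blurred_color(control, render_fn, num_samples=8):
    """Average sharp renderings along the exposure trajectory to reproduce the blurry pixel."""
    ts = np.linspace(0.0, 1.0, num_samples)
    return np.mean([render_fn(bezier_pose(control, t)) for t in ts], axis=0)

control = np.random.randn(4, 6) * 0.01
print(blurred_color(control, render_fn=lambda pose: np.tanh(pose[:3])))   # dummy renderer
```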
2309.08826 Report Dual-Camera Joint Deblurring-Denoising Shayan Shekarforoush, Amanpreet Walia, Marcus A. Brubaker, Konstantinos G. Derpanis, Alex Levinshtein Recent image enhancement methods have shown the advantages of using a pair of long and short-exposure images for low-light photography. These image modalities offer complementary strengths and weaknesses. The former yields an image that is clean but blurry due to camera or object motion, whereas the latter is sharp but noisy due to low photon count. Motivated by the fact that modern smartphones come equipped with multiple rear-facing camera sensors, we propose a novel dual-camera method for obtaining a high-quality image. Our method uses a synchronized burst of short exposure images captured by one camera and a long exposure image simultaneously captured by another. Having a synchronized short exposure burst alongside the long exposure image enables us to (i) obtain better denoising by using a burst instead of a single image, (ii) recover motion from the burst and use it for motion-aware deblurring of the long exposure image, and (iii) fuse the two results to further enhance quality. Our method is able to achieve state-of-the-art results on synthetic dual-camera images from the GoPro dataset with five times fewer training parameters compared to the next best method. We also show that our method qualitatively outperforms competing approaches on real synchronized dual-camera captures. This paper introduces a novel dual-camera method for enhancing image quality, leveraging a synchronized burst of short exposure images from one camera and a long exposure image from another. This approach overcomes limitations of single-image restoration by utilizing complementary information from both short and long exposure modalities. The method employs a flow-guided deblurring network to remove motion blur from the long exposure image based on optical flow estimated from the burst. It also incorporates a burst denoising module to produce a clean image from the short exposures. Finally, a fusion module combines features from both deblurred and denoised outputs. The method achieves state-of-the-art results on synthetic dual-camera images, surpassing previous joint deblurring-denoising approaches. It outperforms single-task baselines, demonstrating the effectiveness of combining deblurring and denoising. Qualitative evaluations on real dual-camera captures show superior performance compared to competing methods. The current implementation assumes relative rigidity between cameras, limiting its applicability in scenarios with significant camera motion. Deployment on smartphones with limited computational resources requires further optimization and engineering. image deblurring, image denoising, dual-camera systems, burst processing, optical flow
2309.08816 Report EgoObjects: A Large-Scale Egocentric Dataset for Fine-Grained Object Understanding Chenchen Zhu, Fanyi Xiao, Andres Alvarado, Yasmine Babaei, Jiabo Hu, Hichem El-Mohri, Sean Chang Culatana, Roshan Sumbaly, Zhicheng Yan Object understanding in egocentric visual data is arguably a fundamental research topic in egocentric vision. However, existing object datasets are either non-egocentric or have limitations in object categories, visual content, and annotation granularities. In this work, we introduce EgoObjects, a large-scale egocentric dataset for fine-grained object understanding. Its Pilot version contains over 9K videos collected by 250 participants from 50+ countries using 4 wearable devices, and over 650K object annotations from 368 object categories. Unlike prior datasets containing only object category labels, EgoObjects also annotates each object with an instance-level identifier, and includes over 14K unique object instances. EgoObjects was designed to capture the same object under diverse background complexities, surrounding objects, distance, lighting and camera motion. In parallel to the data collection, we conducted data annotation by developing a multi-stage federated annotation process to accommodate the growing nature of the dataset. To bootstrap the research on EgoObjects, we present a suite of 4 benchmark tasks around the egocentric object understanding, including a novel instance level- and the classical category level object detection. Moreover, we also introduce 2 novel continual learning object detection tasks. The dataset and API are available at https://github.com/facebookresearch/EgoObjects. EgoObjects, a large-scale egocentric video dataset for fine-grained object understanding, is introduced. It contains over 9K videos, 650K object annotations from 368 categories, and 14K unique object instances captured under diverse conditions (background, lighting, distance, camera motion). Existing object datasets are limited in their suitability for egocentric object understanding due to factors like non-egocentric viewpoints, limited object categories, lack of visual content variations, and coarse annotation granularities. Data is collected by participants worldwide using various wearable devices. A multi-stage federated annotation process ensures rich annotations including bounding boxes, category labels, and instance IDs. Target-aware instance detection model significantly outperforms target-agnostic baseline. Continual learning benchmarks for object detection at both instance and category levels are established. Category-level object detection on EgoObjects presents unique challenges compared to existing exocentric datasets. Current version is a pilot release representing 10% of the full dataset. Continual learning models require further research to address scalability and architecture limitations. egocentric vision, object detection, instance-level detection, continual learning, dataset
2309.08586 Report Replacing softmax with ReLU in Vision Transformers Mitchell Wortsman, Jaehoon Lee, Justin Gilmer, Simon Kornblith Previous research observed accuracy degradation when replacing the attention softmax with a point-wise activation such as ReLU. In the context of vision transformers, we find that this degradation is mitigated when dividing by sequence length. Our experiments training small to large vision transformers on ImageNet-21k indicate that ReLU-attention can approach or match the performance of softmax-attention in terms of scaling behavior as a function of compute. This paper demonstrates that replacing the softmax function in the attention mechanism of Vision Transformers with ReLU, when divided by the sequence length, can achieve comparable performance to traditional softmax attention in terms of scaling behavior with compute. The softmax operation in traditional attention is computationally expensive and difficult to parallelize. This research offers a more efficient alternative using ReLU that is easier to parallelize and may lead to faster training and inference. The authors conducted experiments replacing the softmax in the attention mechanism with ReLU divided by the sequence length. They trained Vision Transformers of various sizes on ImageNet-21k and compared their performance to models using traditional softmax attention. Additionally, they explored the effects of different sequence length scaling factors, alternative activation functions, qk-layernorm, and the addition of a gated unit. ReLU-attention, when scaled by the inverse of the sequence length, exhibits similar scaling behavior to softmax-attention in terms of compute for Vision Transformers trained on ImageNet-21k. Scaling the activation function by a factor involving sequence length is crucial for achieving high accuracy with ReLU-attention. Using qk-layernorm and adding a gated attention unit did not significantly impact the performance of ReLU-attention with sequence length scaling. The theoretical reasoning behind the effectiveness of the sequence length scaling factor remains unclear and needs further investigation. Future research could explore the performance of ReLU-attention with a learnable sequence length scaling factor and investigate other potentially more effective activation functions. vision transformer, attention mechanism, relu, softmax, sequence length scaling
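A minimal sketch of the drop-in change this paper studies: replace the softmax over keys with relu(logits) / L, where L is the sequence length. Multi-head bookkeeping and any qk-layernorm are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def relu_attention(q, k, v):
    # q, k, v: (B, L, D)
    L, D = q.shape[-2], q.shape[-1]
    logits = q @ k.transpose(-2, -1) / D ** 0.5      # (B, L, L) attention logits
    weights = F.relu(logits) / L                      # point-wise activation with 1/L scaling
    return weights @ v

q = k = v = torch.randn(2, 196, 64)
print(relu_attention(q, k, v).shape)                  # torch.Size([2, 196, 64])
```

Unlike softmax, each weight here depends only on its own logit, so the operation parallelizes trivially over the key dimension.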
2309.08523 Report Breathing New Life into 3D Assets with Generative Repainting Tianfu Wang, Menelaos Kanakis, Konrad Schindler, Luc Van Gool, Anton Obukhov Diffusion-based text-to-image models ignited immense attention from the vision community, artists, and content creators. Broad adoption of these models is due to significant improvement in the quality of generations and efficient conditioning on various modalities, not just text. However, lifting the rich generative priors of these 2D models into 3D is challenging. Recent works have proposed various pipelines powered by the entanglement of diffusion models and neural fields. We explore the power of pretrained 2D diffusion models and standard 3D neural radiance fields as independent, standalone tools and demonstrate their ability to work together in a non-learned fashion. Such modularity has the intrinsic advantage of eased partial upgrades, which became an important property in such a fast-paced domain. Our pipeline accepts any legacy renderable geometry, such as textured or untextured meshes, orchestrates the interaction between 2D generative refinement and 3D consistency enforcement tools, and outputs a painted input geometry in several formats. We conduct a large-scale study on a wide range of objects and categories from the ShapeNetSem dataset and demonstrate the advantages of our approach, both qualitatively and quantitatively. Project page: https://www.obukhov.ai/repainting_3d_assets This paper presents a novel pipeline for text-guided painting of 3D assets by leveraging pre-trained 2D image diffusion models and neural radiance fields (NeRF) in a modular and interpretable fashion. Lifting the power of 2D generative models into 3D is challenging, and existing methods often suffer from limitations like UV unwrapping artifacts and lack of modularity. This work addresses these issues by utilizing readily available tools. The pipeline iteratively generates novel views using a text- and depth-conditioned diffusion model, remaps existing views for consistency, and employs NeRF for global reconciliation. This process allows for painting complex geometries without relying on UV maps. The method achieves state-of-the-art results on the ShapeNetSem dataset, outperforming existing methods on FID and KID metrics. The modular design allows for partial upgrades as diffusion models and NeRF technology advance. The pipeline supports various input formats and can even be extended to text-to-3D generation using Point-E. The method currently assumes opaque surfaces, limiting its applicability to some object types. Future work could explore faster and more efficient view selection strategies and incorporate recent advancements in diffusion and NeRF research. generative 3d models, text-guided painting, diffusion models, neural radiance fields, 3d asset creation
2309.08273 Report A Generative Framework for Self-Supervised Facial Representation Learning Ruian He, Zhen Xing, Weimin Tan, Bo Yan Self-supervised representation learning has gained increasing attention for strong generalization ability without relying on paired datasets. However, it has not been explored sufficiently for facial representation. Self-supervised facial representation learning remains unsolved due to the coupling of facial identities, expressions, and external factors like pose and light. Prior methods primarily focus on contrastive learning and pixel-level consistency, leading to limited interpretability and suboptimal performance. In this paper, we propose LatentFace, a novel generative framework for self-supervised facial representations. We suggest that the disentangling problem can be also formulated as generative objectives in space and time, and propose the solution using a 3D-aware latent diffusion model. First, we introduce a 3D-aware autoencoder to encode face images into 3D latent embeddings. Second, we propose a novel representation diffusion model to disentangle 3D latent into facial identity and expression. Consequently, our method achieves state-of-the-art performance in facial expression recognition (FER) and face verification among self-supervised facial representation learning models. Our model achieves a 3.75\% advantage in FER accuracy on RAF-DB and 3.35\% on AffectNet compared to SOTA methods. This paper proposes LatentFace, a novel generative framework for self-supervised facial representation learning using a 3D-aware latent diffusion model to disentangle facial identity and expression. Self-supervised facial representation learning is important for its generalization ability without paired datasets, but previous methods suffer from limited interpretability and performance due to the coupling of facial identities, expressions, and external factors. The methodology involves two stages: 1) 3D Latent Autoencoding disentangles facial texture and shape from pose and illumination using a 3D-aware autoencoder. 2) Latent Space Disentangling predicts facial identity as the time-invariant component of facial features using a Representation Diffusion Model (RDM) trained on video sequences. LatentFace achieves state-of-the-art performance in facial expression recognition (FER) and face verification among self-supervised methods. The model outperforms previous SOTA methods by 3.75% in FER accuracy on RAF-DB and 3.35% on AffectNet. Qualitative results demonstrate improved disentanglement of facial identity and expression compared to previous methods. Interpreting faces with large deflection angles remains challenging due to occlusion. Potential application risks exist due to the model's ability to generate realistic facial textures and shapes. self-supervised learning, facial representation learning, diffusion models, 3d face modeling, disentanglement
2309.08009 Report Measuring the Quality of Text-to-Video Model Outputs: Metrics and Dataset Iya Chivileva, Philip Lynch, Tomas E. Ward, Alan F. Smeaton Evaluating the quality of videos generated from text-to-video (T2V) models is important if they are to produce plausible outputs that convince a viewer of their authenticity. We examine some of the metrics used in this area and highlight their limitations. The paper presents a dataset of more than 1,000 generated videos from 5 very recent T2V models on which some of those commonly used quality metrics are applied. We also include extensive human quality evaluations on those videos, allowing the relative strengths and weaknesses of metrics, including human assessment, to be compared. The contribution is an assessment of commonly used quality metrics, and a comparison of their performances and the performance of human evaluations on an open dataset of T2V videos. Our conclusion is that naturalness and semantic matching with the text prompt used to generate the T2V output are important but there is no single measure to capture these subtleties in assessing T2V model output. This paper presents a dataset of over 1,000 videos generated by 5 recent text-to-video models and uses it to compare commonly used quality metrics with human evaluations, revealing limitations in existing metrics. Evaluating the quality of text-to-video models is crucial for producing plausible outputs, but developing reliable metrics is an often-overlooked challenge. The authors generated videos from various models, computed quality metrics (including their own ensemble metric), and collected human annotations for alignment and perception. These results were compared to assess the metrics' effectiveness. Human evaluations generally align with common metrics but not always, highlighting limitations. Text2Video-Zero was the best-performing model, while Aphantasia performed the worst. Shorter prompts generally resulted in better video quality across all models. The naturalness classifier needs further training with more diverse cartoon-style videos. Future work could explore alternative metrics and ensemble approaches for comprehensive quality assessment. text-to-video models, video synthesis, evaluation metrics, human evaluation, dataset
2309.07986 Report Viewpoint Textual Inversion: Unleashing Novel View Synthesis with Pretrained 2D Diffusion Models James Burgess, Kuan-Chieh Wang, Serena Yeung Text-to-image diffusion models understand spatial relationship between objects, but do they represent the true 3D structure of the world from only 2D supervision? We demonstrate that yes, 3D knowledge is encoded in 2D image diffusion models like Stable Diffusion, and we show that this structure can be exploited for 3D vision tasks. Our method, Viewpoint Neural Textual Inversion (ViewNeTI), controls the 3D viewpoint of objects in generated images from frozen diffusion models. We train a small neural mapper to take camera viewpoint parameters and predict text encoder latents; the latents then condition the diffusion generation process to produce images with the desired camera viewpoint. ViewNeTI naturally addresses Novel View Synthesis (NVS). By leveraging the frozen diffusion model as a prior, we can solve NVS with very few input views; we can even do single-view novel view synthesis. Our single-view NVS predictions have good semantic details and photorealism compared to prior methods. Our approach is well suited for modeling the uncertainty inherent in sparse 3D vision problems because it can efficiently generate diverse samples. Our view-control mechanism is general, and can even change the camera view in images generated by user-defined prompts. The paper introduces Viewpoint Neural Textual Inversion (ViewNeTI), a method to control the 3D viewpoint of objects in images generated by frozen text-to-image diffusion models, enabling novel view synthesis from as little as a single input view. Leveraging pre-trained diffusion models for 3D vision tasks is appealing due to their large and diverse training data and their ability to model ambiguity inherent in sparse 3D data. ViewNeTI trains a small neural network (view-mapper) that takes camera parameters as input and predicts text encoder latents, conditioning the diffusion model to generate images from the desired viewpoint. The method successfully interpolates novel viewpoints when trained on a single scene with sparse viewpoints. Pre-training the view-mapper on a multi-scene dataset allows for extrapolation to novel viewpoints and generalization to new scenes, even enabling single-view novel view synthesis. ViewNeTI can be used for view-controlled text-to-image generation by prepending the view-mapper's token to user-defined text prompts. A major limitation is the potential misalignment of generated objects compared to ground truth, impacting PSNR scores. Generating precise object details remains challenging, although this is an active research area in textual inversion. novel view synthesis, textual inversion, diffusion models, 3d vision, single-view reconstruction
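A hedged sketch of a "view-mapper": a small MLP from camera parameters to a pseudo-token embedding that conditions the frozen text encoder. The dimensions and the flattened pose-plus-intrinsics input are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ViewMapper(nn.Module):
    def __init__(self, cam_dim: int = 16, token_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cam_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, token_dim),
        )

    def forward(self, cam_params: torch.Tensor) -> torch.Tensor:
        return self.net(cam_params)        # (B, token_dim) pseudo-token for the prompt

mapper = ViewMapper()                      # only this small mapper is trained; the diffusion model stays frozen
token = mapper(torch.randn(1, 16))
print(token.shape)                          # torch.Size([1, 768])
```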
2309.07920 Report Large-Vocabulary 3D Diffusion Model with Transformer Ziang Cao, Fangzhou Hong, Tong Wu, Liang Pan, Ziwei Liu Creating diverse and high-quality 3D assets with an automatic generative model is highly desirable. Despite extensive efforts on 3D generation, most existing works focus on the generation of a single category or a few categories. In this paper, we introduce a diffusion-based feed-forward framework for synthesizing massive categories of real-world 3D objects with a single generative model. Notably, there are three major challenges for this large-vocabulary 3D generation: a) the need for expressive yet efficient 3D representation; b) large diversity in geometry and texture across categories; c) complexity in the appearances of real-world objects. To this end, we propose a novel triplane-based 3D-aware Diffusion model with TransFormer, DiffTF, for handling challenges via three aspects. 1) Considering efficiency and robustness, we adopt a revised triplane representation and improve the fitting speed and accuracy. 2) To handle the drastic variations in geometry and texture, we regard the features of all 3D objects as a combination of generalized 3D knowledge and specialized 3D features. To extract generalized 3D knowledge from diverse categories, we propose a novel 3D-aware transformer with shared cross-plane attention. It learns the cross-plane relations across different planes and aggregates the generalized 3D knowledge with specialized 3D features. 3) In addition, we devise the 3D-aware encoder/decoder to enhance the generalized 3D knowledge in the encoded triplanes for handling categories with complex appearances. Extensive experiments on ShapeNet and OmniObject3D (over 200 diverse real-world categories) convincingly demonstrate that a single DiffTF model achieves state-of-the-art large-vocabulary 3D object generation performance with large diversity, rich semantics, and high quality. This paper introduces DiffTF, a novel triplane-based 3D-aware diffusion model with Transformer, for synthesizing massive categories of real-world 3D objects with a single generative model. Generating diverse and high-quality 3D assets across a large vocabulary is crucial for applications in gaming, robotics, and architecture, but existing methods struggle to maintain robustness across diverse objects. DiffTF utilizes a revised triplane representation with improved fitting speed and accuracy. It leverages a 3D-aware transformer to extract generalized 3D knowledge across various categories and integrate it with specialized 3D features of individual objects. Additionally, a 3D-aware encoder/decoder enhances 3D awareness and semantic information in triplanes. DiffTF achieves state-of-the-art performance in large-vocabulary 3D object generation on ShapeNet and OmniObject3D, surpassing GAN-based and other diffusion-based methods in both 2D image quality and 3D geometry metrics. The generated 3D objects exhibit large diversity, rich semantics, and high quality, demonstrating the effectiveness of the proposed 3D-aware modules. The method shows promising results in capturing complex geometry and textures, even for challenging categories like fruits and sculptures. The triplane fitting process, while accelerated, remains time-consuming when scaled to millions of objects. Details in generated triplanes for some complicated categories have room for improvement. 3d object generation, diffusion models, transformers, triplane representation, large vocabulary
2309.07906 Report Generative Image Dynamics Zhengqi Li, Richard Tucker, Noah Snavely, Aleksander Holynski We present an approach to modeling an image-space prior on scene motion. Our prior is learned from a collection of motion trajectories extracted from real video sequences depicting natural, oscillatory dynamics such as trees, flowers, candles, and clothes swaying in the wind. We model this dense, long-term motion prior in the Fourier domain: given a single image, our trained model uses a frequency-coordinated diffusion sampling process to predict a spectral volume, which can be converted into a motion texture that spans an entire video. Along with an image-based rendering module, these trajectories can be used for a number of downstream applications, such as turning still images into seamlessly looping videos, or allowing users to realistically interact with objects in real pictures by interpreting the spectral volumes as image-space modal bases, which approximate object dynamics. This paper introduces a novel method for modeling an image-space prior on scene motion, enabling the animation of still images with natural, oscillatory dynamics. Synthesizing realistic scene motion is crucial for visual content creation, as human perception is highly sensitive to motion. Existing methods struggle with issues like temporal inconsistency and unrealistic motion. The method leverages spectral volumes, a frequency-domain motion representation, to capture long-range pixel trajectories. A diffusion model trained on real video sequences learns to predict spectral volumes conditioned on a single input image. An image-based rendering module then animates the image using the predicted motion. The approach significantly outperforms previous single-image animation methods in terms of realism and temporal coherence. The method enables the creation of seamlessly looping videos and interactive dynamic images from a single picture. The use of spectral volumes allows for efficient representation of long-range motions and facilitates long-term temporal consistency in generated videos. The model may not accurately capture non-oscillatory or high-frequency motions. Generating motions that require large amounts of novel content can lead to visual artifacts. motion synthesis, diffusion models, image animation, spectral volumes, interactive dynamics
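For Generative Image Dynamics (2309.07906), converting a predicted spectral volume into a motion texture is essentially an inverse FFT along the temporal-frequency axis of per-pixel Fourier coefficients. The sketch below assumes a particular tensor layout and number of predicted frequencies; it is not the paper's exact parameterization.

```python
import torch

def spectral_volume_to_motion_texture(spectral: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Convert per-pixel Fourier coefficients into per-frame 2D displacements.

    spectral: complex tensor of shape (K, H, W, 2) holding K low-frequency
              coefficients of the x/y displacement of every pixel (a sketch of
              a spectral volume, not the paper's exact parameterization).
    returns:  real tensor of shape (num_frames, H, W, 2) -- a motion texture.
    """
    K, H, W, C = spectral.shape
    # Place the K predicted coefficients into a full one-sided spectrum and
    # inverse-FFT along the temporal-frequency axis.
    full = torch.zeros(num_frames // 2 + 1, H, W, C, dtype=spectral.dtype)
    full[:K] = spectral
    motion = torch.fft.irfft(full, n=num_frames, dim=0)
    return motion

# Example: 16 predicted frequencies animated into a 60-frame motion texture.
spec = torch.randn(16, 64, 64, 2, dtype=torch.complex64)
tex = spectral_volume_to_motion_texture(spec, num_frames=60)
print(tex.shape)  # torch.Size([60, 64, 64, 2])
```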
2309.07867 Report Beta Diffusion Mingyuan Zhou, Tianqi Chen, Zhendong Wang, Huangjie Zheng We introduce beta diffusion, a novel generative modeling method that integrates demasking and denoising to generate data within bounded ranges. Using scaled and shifted beta distributions, beta diffusion utilizes multiplicative transitions over time to create both forward and reverse diffusion processes, maintaining beta distributions in both the forward marginals and the reverse conditionals, given the data at any point in time. Unlike traditional diffusion-based generative models relying on additive Gaussian noise and reweighted evidence lower bounds (ELBOs), beta diffusion is multiplicative and optimized with KL-divergence upper bounds (KLUBs) derived from the convexity of the KL divergence. We demonstrate that the proposed KLUBs are more effective for optimizing beta diffusion compared to negative ELBOs, which can also be derived as the KLUBs of the same KL divergence with its two arguments swapped. The loss function of beta diffusion, expressed in terms of Bregman divergence, further supports the efficacy of KLUBs for optimization. Experimental results on both synthetic data and natural images demonstrate the unique capabilities of beta diffusion in generative modeling of range-bounded data and validate the effectiveness of KLUBs in optimizing diffusion models, thereby making them valuable additions to the family of diffusion-based generative models and the optimization techniques used to train them. This paper introduces beta diffusion, a novel generative modeling method specifically designed for range-bounded data by incorporating multiplicative noise, unlike traditional Gaussian diffusion using additive noise. The existing diffusion models mostly rely on additive Gaussian noise. This paper proposes a new diffusion model that uses multiplicative noise and is specifically designed for modeling range-bounded data. The paper proposes to use scaled and shifted beta distributions for both forward and reverse diffusion processes. It further proposes a novel objective function, Kullback--Leibler Upper Bounds (KLUBs), along with its corresponding Bregman divergence formulation for efficient optimization. KLUBs are more effective than negative ELBOs in optimizing beta diffusion. Beta diffusion demonstrates superior performance in modeling range-bounded data, including point masses, compared to Gaussian diffusion. Beta diffusion exhibits competitive performance on CIFAR-10 image generation. Training beta diffusion can be computationally expensive, similar to Gaussian diffusion. Further exploration of network architectures and hyperparameter optimization tailored for beta diffusion is needed. generative modeling, diffusion models, beta distribution, klub, range-bounded data
2309.07749 Report OmnimatteRF: Robust Omnimatte with 3D Background Modeling Geng Lin, Chen Gao, Jia-Bin Huang, Changil Kim, Yipeng Wang, Matthias Zwicker, Ayush Saraf Video matting has broad applications, from adding interesting effects to casually captured movies to assisting video production professionals. Matting with associated effects such as shadows and reflections has also attracted increasing research activity, and methods like Omnimatte have been proposed to separate dynamic foreground objects of interest into their own layers. However, prior works represent video backgrounds as 2D image layers, limiting their capacity to express more complicated scenes, thus hindering application to real-world videos. In this paper, we propose a novel video matting method, OmnimatteRF, that combines dynamic 2D foreground layers and a 3D background model. The 2D layers preserve the details of the subjects, while the 3D background robustly reconstructs scenes in real-world videos. Extensive experiments demonstrate that our method reconstructs scenes with better quality on various videos. The paper proposes OmnimatteRF, a novel video matting method combining 2D foreground layers for details with a 3D background model for robust scene reconstruction in real-world videos with parallax effects. Existing video matting methods struggle with complex scenes and parallax effects due to their 2D background representation. OmnimatteRF addresses this limitation, enabling high-quality matting in more realistic settings. OmnimatteRF utilizes a two-branch network: a foreground branch predicting RGBA layers for each object and a background branch employing a 3D radiance field. The model is trained jointly with reconstruction and regularization losses. A masked retraining step refines the background, removing artifacts. Outperforms state-of-the-art methods (Omnimatte, D^2NeRF, LNA) in background reconstruction quality on synthetic datasets with parallax effects. Demonstrates robustness and generalization to diverse real-world videos with complex scenes and camera motions. Enables cleaner background reconstruction by leveraging learned foreground masks in a retraining step. Background reconstruction can be impacted if a region is constantly shadowed. Unrelated background motions might be captured by the foreground layer due to the static nature of the background model. video matting, 3d background modeling, radiance fields, omnimatte, parallax effects
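The compositing step in OmnimatteRF (2309.07749), where dynamic 2D RGBA layers are rendered over a 3D background, reduces to standard alpha blending once the radiance field has produced a background image. A simplified sketch (layer ordering and shapes are assumptions):

```python
import torch

def composite_layers(background_rgb: torch.Tensor, layers_rgba: torch.Tensor) -> torch.Tensor:
    """Alpha-composite foreground layers over a rendered background.

    background_rgb: (H, W, 3) image rendered from the 3D background model.
    layers_rgba:    (L, H, W, 4) per-object RGBA layers from the foreground branch.
    returns:        (H, W, 3) composited frame.
    """
    out = background_rgb.clone()
    for layer in layers_rgba:               # composite each object layer in turn
        rgb, alpha = layer[..., :3], layer[..., 3:4]
        out = alpha * rgb + (1.0 - alpha) * out
    return out

bg = torch.rand(120, 160, 3)
fg = torch.rand(2, 120, 160, 4)             # two foreground object layers
frame = composite_layers(bg, fg)
print(frame.shape)  # torch.Size([120, 160, 3])
```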
2309.07499 Report Efficiently Robustify Pre-trained Models Nishant Jain, Harkirat Behl, Yogesh Singh Rawat, Vibhav Vineet A recent trend in deep learning algorithms has been towards training large scale models, having high parameter count and trained on big dataset. However, robustness of such large scale models towards real-world settings is still a less-explored topic. In this work, we first benchmark the performance of these models under different perturbations and datasets thereby representing real-world shifts, and highlight their degrading performance under these shifts. We then discuss how complete model fine-tuning based existing robustification schemes might not be a scalable option given very large scale networks and can also lead them to forget some of the desired characteristics. Finally, we propose a simple and cost-effective method to solve this problem, inspired by knowledge transfer literature. It involves robustifying smaller models, at a lower computation cost, and then use them as teachers to tune a fraction of these large scale networks, reducing the overall computational overhead. We evaluate our proposed method under various vision perturbations including ImageNet-C,R,S,A datasets and also for transfer learning, zero-shot evaluation setups on different datasets. Benchmark results show that our method is able to induce robustness to these large scale models efficiently, requiring significantly lower time and also preserves the transfer learning, zero-shot properties of the original model which none of the existing methods are able to achieve. This paper proposes a novel knowledge transfer method to efficiently induce robustness in large pre-trained vision models, preserving their original properties like clean accuracy and transfer learning capabilities. Large vision models, though achieving impressive performance on various tasks, are brittle under distribution shifts. Existing robustification methods are computationally expensive and can lead to forgetting of the original properties. The method involves robustifying a smaller teacher model using advanced augmentation techniques and then distilling this robustness to a small, tunable portion of the large student model using an uncertainty-aware knowledge distillation technique. This allows for selective utilization of clean and robust heads during inference based on input characteristics. The proposed method outperforms existing approaches on robust accuracy while maintaining comparable clean accuracy. It preserves transfer learning capabilities, unlike methods involving extensive fine-tuning. It is computationally efficient, requiring significantly lower training time compared to full fine-tuning. The paper lacks a theoretical analysis of the proposed approach. Future work could explore test-time adaptation of small models and subsequent distillation to large models. robustness, knowledge distillation, distribution shift, large pre-trained models, computer vision
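The distillation idea in 2309.07499 amounts to freezing most of the large model, attaching a small tunable robust head, and matching its predictions to a robustified teacher. The confidence-based weighting below is only a stand-in for the paper's uncertainty-aware scheme, and all module shapes are toy assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distill_robust_head(student_backbone: nn.Module,
                        robust_head: nn.Module,
                        teacher: nn.Module,
                        images: torch.Tensor,
                        temperature: float = 2.0) -> torch.Tensor:
    """One distillation loss evaluation: a small robust head on top of a frozen
    backbone learns from a (smaller) robustified teacher. The per-sample
    weighting by teacher confidence is an illustrative stand-in for the paper's
    uncertainty-aware scheme."""
    with torch.no_grad():
        feats = student_backbone(images)          # frozen large-model features
        t_logits = teacher(images)                # robust teacher predictions
    s_logits = robust_head(feats)                 # only this head is trained

    t_prob = F.softmax(t_logits / temperature, dim=-1)
    confidence = t_prob.max(dim=-1).values        # proxy for teacher certainty
    kd = F.kl_div(F.log_softmax(s_logits / temperature, dim=-1),
                  t_prob, reduction="none").sum(dim=-1)
    return (confidence * kd).mean() * temperature ** 2

# Toy usage with stand-in modules.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64)).eval()
head = nn.Linear(64, 10)
teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).eval()
loss = distill_robust_head(backbone, head, teacher, torch.randn(8, 3, 32, 32))
loss.backward()                                   # gradients flow only into the head
print(float(loss))
```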
2309.07277 Report Limitations of Face Image Generation Harrison Rosenberg, Shimaa Ahmed, Guruprasad V Ramesh, Ramya Korlakai Vinayak, Kassem Fawaz Text-to-image diffusion models have achieved widespread popularity due to their unprecedented image generation capability. In particular, their ability to synthesize and modify human faces has spurred research into using generated face images in both training data augmentation and model performance assessments. In this paper, we study the efficacy and shortcomings of generative models in the context of face generation. Utilizing a combination of qualitative and quantitative measures, including embedding-based metrics and user studies, we present a framework to audit the characteristics of generated faces conditioned on a set of social attributes. We applied our framework on faces generated through state-of-the-art text-to-image diffusion models. We identify several limitations of face image generation that include faithfulness to the text prompt, demographic disparities, and distributional shifts. Furthermore, we present an analytical model that provides insights into how training data selection contributes to the performance of generative models. This paper presents a framework to audit the characteristics of generated faces conditioned on social attributes, focusing on the efficacy and shortcomings of text-to-image diffusion models in face generation. The ability of diffusion models to synthesize and modify human faces makes them valuable for data augmentation and model performance assessments in facial recognition, necessitating an understanding of their capabilities and limitations. The study uses a data generation pipeline with Stable Diffusion and a fine-tuned Realistic Vision model, combined with SEGA for attribute manipulation. Evaluation includes quantitative metrics, face verification accuracy, and user studies assessing image quality and attribute correctness. Generated faces exhibit demographic disparities in face recognition systems and user-perceived quality, often favoring majority demographics. Quantitative metrics like CLIP-I and DINO-I show weak correlation with human perception of identity retention and transformation correctness. An analytical model demonstrates how bias in training data propagates to generated images, impacting the fidelity of synthetic datasets. Limited exploration of the Own Race Effect (ORE) in the context of generated images. Reliance on CLIP, which has its own biases and limitations in understanding nuanced facial features and cultural constructs. face generation, diffusion models, demographic bias, image quality assessment, user study
2309.07125 Report Text-Guided Generation and Editing of Compositional 3D Avatars Hao Zhang, Yao Feng, Peter Kulits, Yandong Wen, Justus Thies, Michael J. Black Our goal is to create a realistic 3D facial avatar with hair and accessories using only a text description. While this challenge has attracted significant recent interest, existing methods either lack realism, produce unrealistic shapes, or do not support editing, such as modifications to the hairstyle. We argue that existing methods are limited because they employ a monolithic modeling approach, using a single representation for the head, face, hair, and accessories. Our observation is that the hair and face, for example, have very different structural qualities that benefit from different representations. Building on this insight, we generate avatars with a compositional model, in which the head, face, and upper body are represented with traditional 3D meshes, and the hair, clothing, and accessories with neural radiance fields (NeRF). The model-based mesh representation provides a strong geometric prior for the face region, improving realism while enabling editing of the person's appearance. By using NeRFs to represent the remaining components, our method is able to model and synthesize parts with complex geometry and appearance, such as curly hair and fluffy scarves. Our novel system synthesizes these high-quality compositional avatars from text descriptions. The experimental results demonstrate that our method, Text-guided generation and Editing of Compositional Avatars (TECA), produces avatars that are more realistic than those of recent methods while being editable because of their compositional nature. For example, our TECA enables the seamless transfer of compositional features like hairstyles, scarves, and other accessories between avatars. This capability supports applications such as virtual try-on. Presents TECA, a novel method for generating realistic 3D facial avatars with hair and accessories from text descriptions using a compositional model, combining mesh-based representations for the face and body with NeRF for hair and clothing. Existing text-to-3D avatar generation methods struggle with realism, shape fidelity, and editability. TECA addresses these limitations by using distinct representations for different avatar components, resulting in higher-quality, customizable avatars. The pipeline starts by generating a face image from text using Stable Diffusion and fitting an SMPL-X model to extract 3D geometry. Texture is generated via iterative inpainting. Hair and accessories are generated using latent NeRF, optimized with SDS and guided by CLIPSeg segmentation masks. Finally, refinement is performed in RGB space using SDS and BLIP losses. Outperforms SOTA methods in visual realism and text consistency, as shown in qualitative comparisons and a perceptual study. Enables editing of individual components, such as transferring hairstyles and accessories between avatars. Demonstrates animation capabilities through SMPL-X parameter manipulation. Reliance on CLIPSeg for segmentation can lead to artifacts if segmentation is inaccurate. Limited ability to handle complex dynamics, such as realistic hair and clothing movement. 3d avatar generation, text-to-3d, neural radiance fields (nerf), compositional modeling, score distillation sampling (sds)
2309.06933 Report DreamStyler: Paint by Style Inversion with Text-to-Image Diffusion Models Namhyuk Ahn, Junsoo Lee, Chunggi Lee, Kunhee Kim, Daesik Kim, Seung-Hun Nam, Kibeom Hong Recent progresses in large-scale text-to-image models have yielded remarkable accomplishments, finding various applications in art domain. However, expressing unique characteristics of an artwork (e.g. brushwork, colortone, or composition) with text prompts alone may encounter limitations due to the inherent constraints of verbal description. To this end, we introduce DreamStyler, a novel framework designed for artistic image synthesis, proficient in both text-to-image synthesis and style transfer. DreamStyler optimizes a multi-stage textual embedding with a context-aware text prompt, resulting in prominent image quality. In addition, with content and style guidance, DreamStyler exhibits flexibility to accommodate a range of style references. Experimental results demonstrate its superior performance across multiple scenarios, suggesting its promising potential in artistic product creation. DreamStyler is a novel framework for artistic image synthesis that excels at both text-to-image synthesis and style transfer using a single style reference image. Existing methods struggle to accurately capture and apply the unique styles of artworks, often leading to a trade-off between preserving the content of text prompts and replicating artistic styles. DreamStyler introduces multi-stage textual inversion to increase style representation capacity, context-aware prompt augmentation to disentangle style from content, and style and context guidance for user control over image generation. DreamStyler demonstrates superior performance in balancing text prompt adherence and accurate style replication. It surpasses state-of-the-art methods in style transfer while preserving content structure. The multi-stage textual inversion enables novel style mixing from diverse references. The framework's applicability to abstract or highly nuanced artistic styles requires further investigation. Determining when style and context guidance are most effective remains an open question for future research. text-to-image synthesis, style transfer, artistic image generation, diffusion models, textual inversion
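DreamStyler's multi-stage textual inversion (2309.06933) assigns a different style embedding to different chunks of the denoising trajectory, so the conditioning token depends on the current timestep. A minimal sketch of that selection logic (stage count, token dimension, and timestep range are assumptions):

```python
import torch
import torch.nn as nn

class MultiStageStyleToken(nn.Module):
    """Sketch: one learnable style embedding per denoising stage; the embedding
    used to condition the diffusion model depends on the current timestep."""

    def __init__(self, num_stages: int = 6, token_dim: int = 768, max_t: int = 1000):
        super().__init__()
        self.embeddings = nn.Parameter(torch.randn(num_stages, token_dim) * 0.02)
        self.num_stages = num_stages
        self.max_t = max_t

    def forward(self, timestep: torch.Tensor) -> torch.Tensor:
        # Map timesteps (0..max_t-1) to a stage index, then pick that stage's token.
        stage = (timestep.float() / self.max_t * self.num_stages).long()
        stage = stage.clamp(0, self.num_stages - 1)
        return self.embeddings[stage]

tokens = MultiStageStyleToken()
t = torch.tensor([999, 500, 10])          # three different denoising steps
print(tokens(t).shape)                    # torch.Size([3, 768]) -- one style token each
```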
2309.06922 Report Hydra: Multi-head Low-rank Adaptation for Parameter Efficient Fine-tuning Sanghyeon Kim, Hyunmo Yang, Younghyun Kim, Youngjoon Hong, Eunbyung Park The recent surge in large-scale foundation models has spurred the development of efficient methods for adapting these models to various downstream tasks. Low-rank adaptation methods, such as LoRA, have gained significant attention due to their outstanding parameter efficiency and no additional inference latency. This paper investigates a more general form of adapter module based on the analysis that parallel and sequential adaptation branches learn novel and general features during fine-tuning, respectively. The proposed method, named Hydra, due to its multi-head computational branches, combines parallel and sequential branch to integrate capabilities, which is more expressive than existing single branch methods and enables the exploration of a broader range of optimal points in the fine-tuning process. In addition, the proposed adaptation method explicitly leverages the pre-trained weights by performing a linear combination of the pre-trained features. It allows the learned features to have better generalization performance across diverse downstream tasks. Furthermore, we perform a comprehensive analysis of the characteristics of each adaptation branch with empirical evidence. Through an extensive range of experiments, encompassing comparisons and ablation studies, we substantiate the efficiency and demonstrate the superior performance of Hydra. This comprehensive evaluation underscores the potential impact and effectiveness of Hydra in a variety of applications. Our code is available on \url{https://github.com/extremebird/Hydra} This paper proposes "Hydra," a novel adapter module for parameter-efficient fine-tuning (PEFT) that combines parallel and sequential branches for enhanced expressiveness and generalization performance. Efficiently adapting large-scale pre-trained models to downstream tasks is crucial due to their size and computational demands. Existing PEFT methods, particularly adapter-based ones, are limited to either parallel or sequential approaches, potentially missing learning opportunities. Hydra integrates the parallel feature learning of LoRA (learning novel features) with the sequential approach (leveraging pre-trained features for generalizability) using linear adapter modules to avoid inference latency. This multi-branch structure is applied to MLP blocks in transformers and evaluated on various vision and NLP tasks. Hydra consistently outperforms other PEFT methods, achieving higher accuracy on the ELEVATER benchmark and VTAB-1k benchmark. Analysis of the weight matrices and feature space visualizations reveals that parallel and sequential branches learn distinct and complementary features. Ablation studies confirm the advantage of combining branches and the MLP block as the optimal position for Hydra. While theoretically similar in complexity, Hydra's multi-branch design might lead to slight bottlenecks on GPUs compared to single-branch methods. Exploration of more sophisticated adapter modules within the Hydra framework could further enhance performance. parameter-efficient fine-tuning, adapter modules, transformer networks, few-shot learning, transfer learning
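Hydra's adapter (2309.06922) can be read as a LoRA-style parallel low-rank branch plus a sequential low-rank branch applied to the frozen layer's output. A compact sketch for a single linear layer, with rank and initialization chosen for illustration:

```python
import torch
import torch.nn as nn

class HydraLinear(nn.Module):
    """Sketch of a Hydra-style adapter around a frozen linear layer:
    a parallel low-rank branch (LoRA-like) adds novel features, and a
    sequential low-rank branch linearly recombines the pre-trained output."""

    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                     # keep pre-trained weights frozen
        d_in, d_out = base.in_features, base.out_features
        # Parallel branch: x -> down -> up, added to the frozen output.
        self.par_down = nn.Linear(d_in, rank, bias=False)
        self.par_up = nn.Linear(rank, d_out, bias=False)
        # Sequential branch: frozen output -> down -> up, added back.
        self.seq_down = nn.Linear(d_out, rank, bias=False)
        self.seq_up = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.par_up.weight)              # start as an identity adaptation
        nn.init.zeros_(self.seq_up.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.base(x)
        return h + self.par_up(self.par_down(x)) + self.seq_up(self.seq_down(h))

layer = HydraLinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 16, 768))
print(out.shape)                                        # torch.Size([2, 16, 768])
```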
2309.06895 Report MagiCapture: High-Resolution Multi-Concept Portrait Customization Junha Hyung, Jaeyo Shin, Jaegul Choo Large-scale text-to-image models including Stable Diffusion are capable of generating high-fidelity photorealistic portrait images. There is an active research area dedicated to personalizing these models, aiming to synthesize specific subjects or styles using provided sets of reference images. However, despite the plausible results from these personalization methods, they tend to produce images that often fall short of realism and are not yet on a commercially viable level. This is particularly noticeable in portrait image generation, where any unnatural artifact in human faces is easily discernible due to our inherent human bias. To address this, we introduce MagiCapture, a personalization method for integrating subject and style concepts to generate high-resolution portrait images using just a few subject and style references. For instance, given a handful of random selfies, our fine-tuned model can generate high-quality portrait images in specific styles, such as passport or profile photos. The main challenge with this task is the absence of ground truth for the composed concepts, leading to a reduction in the quality of the final output and an identity shift of the source subject. To address these issues, we present a novel Attention Refocusing loss coupled with auxiliary priors, both of which facilitate robust learning within this weakly supervised learning setting. Our pipeline also includes additional post-processing steps to ensure the creation of highly realistic outputs. MagiCapture outperforms other baselines in both quantitative and qualitative evaluations and can also be generalized to other non-human objects. This paper introduces MagiCapture, a novel multi-concept personalization method for generating high-resolution portrait images that blend subject identity and reference style from a few input images. Existing personalization methods for text-to-image models often lack realism, especially in challenging areas like portrait generation where identity preservation is crucial. MagiCapture leverages a two-phase optimization process with masked reconstruction, a novel attention refocusing loss to enhance information disentanglement, and composed prompt learning with pseudo-labels for robust style integration. MagiCapture quantitatively outperforms baselines like DreamBooth, Textual Inversion, and Custom Diffusion in identity similarity, style preservation, and aesthetic quality. Qualitative evaluations demonstrate superior image fidelity and faithful reflection of both source and reference images, as supported by a user study. The method exhibits generalization capabilities, allowing for further image manipulation using textual prompts and adaptation to non-human objects. The model may occasionally generate unrealistic body parts and shows limitations in handling diverse ethnicities and gender representations. Addressing the inherent biases of pre-trained text-to-image models within a few-shot setting poses a challenge for future work. image generation, personalization, text-to-image synthesis, diffusion models, few-shot learning
2309.06802 Report Dynamic NeRFs for Soccer Scenes Sacha Lewin, Maxime Vandegar, Thomas Hoyoux, Olivier Barnich, Gilles Louppe The long-standing problem of novel view synthesis has many applications, notably in sports broadcasting. Photorealistic novel view synthesis of soccer actions, in particular, is of enormous interest to the broadcast industry. Yet only a few industrial solutions have been proposed, and even fewer that achieve near-broadcast quality of the synthetic replays. Except for their setup of multiple static cameras around the playfield, the best proprietary systems disclose close to no information about their inner workings. Leveraging multiple static cameras for such a task indeed presents a challenge rarely tackled in the literature, for a lack of public datasets: the reconstruction of a large-scale, mostly static environment, with small, fast-moving elements. Recently, the emergence of neural radiance fields has induced stunning progress in many novel view synthesis applications, leveraging deep learning principles to produce photorealistic results in the most challenging settings. In this work, we investigate the feasibility of basing a solution to the task on dynamic NeRFs, i.e., neural models purposed to reconstruct general dynamic content. We compose synthetic soccer environments and conduct multiple experiments using them, identifying key components that help reconstruct soccer scenes with dynamic NeRFs. We show that, although this approach cannot fully meet the quality requirements for the target application, it suggests promising avenues toward a cost-efficient, automatic solution. We also make our work dataset and code publicly available, with the goal to encourage further efforts from the research community on the task of novel view synthesis for dynamic soccer scenes. For code, data, and video results, please see https://soccernerfs.isach.be. This work explores the feasibility of using dynamic Neural Radiance Fields (NeRFs) for novel view synthesis of soccer scenes, aiming to create broadcast-quality replays. Developing an automated and cost-efficient solution for generating photorealistic virtual replays of soccer actions is highly valuable for sports broadcasting. The authors compose increasingly complex synthetic soccer environments and conduct experiments using state-of-the-art dynamic NeRF models (K-Planes and NeRFPlayer) to evaluate their performance under different camera setups. Dynamic NeRFs can reconstruct detailed soccer scenes with close-up camera views. Performance significantly degrades with distant, broadcast-style camera setups, even with enhancements like ray importance sampling. While promising, general dynamic NeRFs currently fall short of broadcast-quality standards for complex soccer scene reconstruction. The study is limited to synthetic datasets due to the lack of suitable public real-world soccer datasets. Domain-specific knowledge and additional components, such as incorporating broadcast camera views, may be necessary to reach broadcast-quality results. neural radiance fields, novel view synthesis, dynamic scene reconstruction, sports broadcasting, soccer replays
2309.06714 Report MPI-Flow: Learning Realistic Optical Flow with Multiplane Images Yingping Liang, Jiaming Liu, Debing Zhang, Ying Fu The accuracy of learning-based optical flow estimation models heavily relies on the realism of the training datasets. Current approaches for generating such datasets either employ synthetic data or generate images with limited realism. However, the domain gap of these data with real-world scenes constrains the generalization of the trained model to real-world applications. To address this issue, we investigate generating realistic optical flow datasets from real-world images. Firstly, to generate highly realistic new images, we construct a layered depth representation, known as multiplane images (MPI), from single-view images. This allows us to generate novel view images that are highly realistic. To generate optical flow maps that correspond accurately to the new image, we calculate the optical flows of each plane using the camera matrix and plane depths. We then project these layered optical flows into the output optical flow map with volume rendering. Secondly, to ensure the realism of motion, we present an independent object motion module that can separate the camera and dynamic object motion in MPI. This module addresses the deficiency in MPI-based single-view methods, where optical flow is generated only by camera motion and does not account for any object movement. We additionally devise a depth-aware inpainting module to merge new images with dynamic objects and address unnatural motion occlusions. We show the superior performance of our method through extensive experiments on real-world datasets. Moreover, our approach achieves state-of-the-art performance in both unsupervised and supervised training of learning-based models. The code will be made publicly available at: \url{https://github.com/Sharpiless/MPI-Flow}. This paper presents MPI-Flow, a novel framework for generating large-scale optical flow datasets from single-view images, enhancing both image realism and motion realism for training optical flow estimation models. Existing synthetic optical flow datasets lack realism, hindering the generalization of trained models to real-world applications. This paper tackles this limitation by enabling the creation of highly realistic datasets from real-world images. The method leverages Multiplane Images (MPI) to construct layered depth representations of single-view images. It utilizes volume rendering for realistic novel view synthesis and incorporates an independent object motion module to simulate diverse motions. A depth-aware inpainting module further refines the generated images by addressing occlusions. MPI-Flow generates significantly more realistic images and optical flows compared to previous methods like Depthstillation and RealFlow. Training optical flow estimation models (specifically RAFT) on datasets generated by MPI-Flow achieves superior performance on real-world benchmarks, outperforming models trained on synthetic data or datasets from other generation methods. The proposed approach exhibits strong generalization capabilities across diverse datasets like Sintel, KITTI, and DAVIS, demonstrating its potential for advancing real-world optical flow estimation. The current implementation primarily focuses on single-object motion, limiting its applicability to scenes with complex multi-object interactions. Further exploration of advanced techniques for handling occlusions and disocclusions in dynamic scenes can further enhance the realism of generated datasets. optical flow, dataset generation, multiplane images (mpi), novel view synthesis, computer vision
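In MPI-Flow (2309.06714), projecting per-plane optical flows into one flow map follows the same over-compositing used for MPI color: each plane's flow is weighted by its alpha and the transmittance of the planes in front of it. A simplified front-to-back sketch (tensor shapes are assumptions):

```python
import torch

def composite_mpi_flow(plane_flows: torch.Tensor, plane_alphas: torch.Tensor) -> torch.Tensor:
    """Render a single optical-flow map from per-plane flows via MPI over-compositing.

    plane_flows:  (D, H, W, 2) flow induced on each of D depth planes
                  (e.g., from the camera motion and each plane's depth).
    plane_alphas: (D, H, W, 1) per-plane alpha, ordered front (0) to back (D-1).
    returns:      (H, W, 2) composited flow map.
    """
    flow = torch.zeros_like(plane_flows[0])
    transmittance = torch.ones_like(plane_alphas[0])
    for d in range(plane_flows.shape[0]):               # front-to-back accumulation
        weight = plane_alphas[d] * transmittance
        flow = flow + weight * plane_flows[d]
        transmittance = transmittance * (1.0 - plane_alphas[d])
    return flow

flows = torch.randn(32, 96, 128, 2)                     # 32 MPI planes
alphas = torch.rand(32, 96, 128, 1)
print(composite_mpi_flow(flows, alphas).shape)          # torch.Size([96, 128, 2])
```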
2309.06660 Report Generalizable Neural Fields as Partially Observed Neural Processes Jeffrey Gu, Kuan-Chieh Wang, Serena Yeung Neural fields, which represent signals as a function parameterized by a neural network, are a promising alternative to traditional discrete vector or grid-based representations. Compared to discrete representations, neural representations both scale well with increasing resolution, are continuous, and can be many-times differentiable. However, given a dataset of signals that we would like to represent, having to optimize a separate neural field for each signal is inefficient, and cannot capitalize on shared information or structures among signals. Existing generalization methods view this as a meta-learning problem and employ gradient-based meta-learning to learn an initialization which is then fine-tuned with test-time optimization, or learn hypernetworks to produce the weights of a neural field. We instead propose a new paradigm that views the large-scale training of neural representations as a part of a partially-observed neural process framework, and leverage neural process algorithms to solve this task. We demonstrate that this approach outperforms both state-of-the-art gradient-based meta-learning approaches and hypernetwork approaches. This paper introduces a new neural process-inspired framework (PONP) for the efficient training of neural fields over large datasets, addressing the challenge of learning a single neural field representation for multiple signals. Existing methods for generalizing neural fields to multiple signals, such as gradient-based meta-learning and hypernetworks, have limitations in terms of efficiency, flexibility, and scalability. This paper proposes a neural process-based approach as a more promising alternative. The paper proposes a partially-observed neural process (PONP) framework that consists of a task-specific encoder to aggregate partial observations into a representation and a decoder with a conditional neural field conditioned on this representation. The framework utilizes probabilistic inference for training and accommodates various neural process architectures and latent variable approaches. PONP significantly outperforms gradient-based meta-learning and hypernetwork methods on tasks such as 2D image regression and completion. PONP demonstrates superior performance in 2D CT reconstruction from sparse projections, exceeding the performance of Reptile and random initialization, even without test-time optimization. In the ShapeNet view synthesis task, PONP achieves comparable or better results than the state-of-the-art Transformer INR method, showcasing its effectiveness in handling complex 3D data. While PONP outperforms previous approaches, there is still room for improvement, particularly in further closing the performance gap in fully-observed settings. The choice of neural process architecture and encoder design significantly influences PONP's effectiveness, suggesting a need for further research into task-specific architectures. neural fields, neural processes, meta-learning, implicit representations, generalization
2309.06581 Report Zero-Shot Visual Classification with Guided Cropping Piyapat Saranrittichai, Mauricio Munoz, Volker Fischer, Chaithanya Kumar Mummadi Pretrained vision-language models, such as CLIP, show promising zero-shot performance across a wide variety of datasets. For closed-set classification tasks, however, there is an inherent limitation: CLIP image encoders are typically designed to extract generic image-level features that summarize superfluous or confounding information for the target tasks. This results in degradation of classification performance, especially when objects of interest cover small areas of input images. In this work, we propose CLIP with Guided Cropping (GC-CLIP), where we use an off-the-shelf zero-shot object detection model in a preprocessing step to increase focus of zero-shot classifier to the object of interest and minimize influence of extraneous image regions. We empirically show that our approach improves zero-shot classification results across architectures and datasets, favorably for small objects. The paper proposes GC-CLIP, a method to improve CLIP's zero-shot object classification performance by cropping input images around objects of interest using bounding boxes from OWL-ViT. CLIP image encoders extract generic image-level features that may include superfluous or confounding information for specific classification tasks, degrading performance, especially for small objects. GC-CLIP uses OWL-ViT to extract bounding boxes around potential objects of interest. These boxes are then used to crop the input image before passing it to the CLIP image encoder. GC-CLIP consistently improves zero-shot classification accuracy, especially for images with small objects. Test-time box augmentation further improves performance, with Multi-Margin Augmentation (MAug) generally outperforming Random Crop Augmentation (RAug). Using OWL-ViT directly as a classifier results in poor performance compared to CLIP baselines. The current box selection strategy does not dynamically weight the importance of context information. Future work could investigate methods for dynamically weighting context based on distance and semantic relationship to the target object. zero-shot learning, object classification, clip, owl-vit, guided cropping
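GC-CLIP's preprocessing (2309.06581) amounts to cropping the image around the highest-scoring open-vocabulary detection, expanded by a margin, before running the zero-shot classifier. In the sketch below, `detect_box` and `clip_classify` are hypothetical callables standing in for OWL-ViT and CLIP; only the cropping logic is illustrated.

```python
from typing import Callable, Sequence, Tuple
from PIL import Image

def guided_crop_classify(image: Image.Image,
                         class_names: Sequence[str],
                         detect_box: Callable[..., Tuple[float, float, float, float]],
                         clip_classify: Callable[..., str],
                         margin: float = 0.1) -> str:
    """Crop around the detected object (expanded by a relative margin), then
    classify the crop zero-shot. Both callables are hypothetical stand-ins."""
    x0, y0, x1, y1 = detect_box(image, class_names)   # box in pixel coordinates
    w, h = x1 - x0, y1 - y0
    x0 = max(0, x0 - margin * w)                      # keep a little context on every side
    y0 = max(0, y0 - margin * h)
    x1 = min(image.width, x1 + margin * w)
    y1 = min(image.height, y1 + margin * h)
    crop = image.crop((int(x0), int(y0), int(x1), int(y1)))
    return clip_classify(crop, class_names)

# Toy usage with dummy stand-ins for the detector and classifier.
img = Image.new("RGB", (640, 480))
label = guided_crop_classify(
    img, ["cat", "dog"],
    detect_box=lambda im, names: (200.0, 120.0, 420.0, 360.0),
    clip_classify=lambda im, names: names[0],
)
print(label)  # "cat"
```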
2309.06380 Report InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, Qiang Liu Diffusion models have revolutionized text-to-image generation with its exceptional quality and creativity. However, its multi-step sampling process is known to be slow, often requiring tens of inference steps to obtain satisfactory results. Previous attempts to improve its sampling speed and reduce computational costs through distillation have been unsuccessful in achieving a functional one-step model. In this paper, we explore a recent method called Rectified Flow, which, thus far, has only been applied to small datasets. The core of Rectified Flow lies in its \emph{reflow} procedure, which straightens the trajectories of probability flows, refines the coupling between noises and images, and facilitates the distillation process with student models. We propose a novel text-conditioned pipeline to turn Stable Diffusion (SD) into an ultra-fast one-step model, in which we find reflow plays a critical role in improving the assignment between noise and images. Leveraging our new pipeline, we create, to the best of our knowledge, the first one-step diffusion-based text-to-image generator with SD-level image quality, achieving an FID (Frechet Inception Distance) of $23.3$ on MS COCO 2017-5k, surpassing the previous state-of-the-art technique, progressive distillation, by a significant margin ($37.2$ $\rightarrow$ $23.3$ in FID). By utilizing an expanded network with 1.7B parameters, we further improve the FID to $22.4$. We call our one-step models \emph{InstaFlow}. On MS COCO 2014-30k, InstaFlow yields an FID of $13.1$ in just $0.09$ second, the best in $\leq 0.1$ second regime, outperforming the recent StyleGAN-T ($13.9$ in $0.1$ second). Notably, the training of InstaFlow only costs 199 A100 GPU days. Codes and pre-trained models are available at \url{github.com/gnobitab/InstaFlow}. This paper introduces InstaFlow, the first one-step text-to-image diffusion model based on Stable Diffusion that achieves high-quality generation. Large-scale text-to-image generation models are computationally expensive and time-consuming. InstaFlow offers a solution for ultra-fast generation with minimal quality loss. The authors propose a novel text-conditioned Rectified Flow pipeline. This pipeline leverages a "reflow" procedure to straighten the trajectories of probability flows, improving the coupling between noise and image data. This refined coupling facilitates the distillation process, leading to a high-quality one-step model. InstaFlow-0.9B achieves an FID of 23.4 on MS COCO 2017-5k in just 0.09 seconds, surpassing previous state-of-the-art distillation techniques. InstaFlow-1.7B, a larger variant, further reduces the FID to 22.4. On MS COCO 2014-30k, InstaFlow-0.9B achieves an FID of 13.1 in 0.09 seconds, outperforming StyleGAN-T in speed and quality. InstaFlow can struggle with complex compositions within text prompts. Future work includes exploring longer training durations and larger datasets for potential improvements. text-to-image generation, diffusion models, rectified flow, knowledge distillation, one-step generation
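The reflow procedure behind InstaFlow (2309.06380) can be summarized as: sample noise, generate an image with the current model, and train the student to predict the straight-line velocity between that noise-image pair. A schematic training step on toy data (the teacher sampler, network, and dimensions are placeholders, not the Stable Diffusion pipeline):

```python
import torch
import torch.nn as nn

def reflow_step(student: nn.Module, teacher_sample, batch: int, dim: int,
                optimizer: torch.optim.Optimizer) -> float:
    """One schematic reflow update: pair fresh noise with the teacher's output,
    then regress the student onto the straight-line velocity of that pair."""
    z = torch.randn(batch, dim)                    # fresh noise
    with torch.no_grad():
        x = teacher_sample(z)                      # paired sample from the current model
    t = torch.rand(batch, 1)                       # random time in [0, 1]
    xt = (1 - t) * z + t * x                       # point on the straight path z -> x
    target_v = x - z                               # constant straight-line velocity
    pred_v = student(torch.cat([xt, t], dim=1))    # student sees (x_t, t)
    loss = ((pred_v - target_v) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)

# Toy usage: 8-dimensional data, a stand-in "teacher" sampler.
student = nn.Sequential(nn.Linear(9, 64), nn.SiLU(), nn.Linear(64, 8))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
teacher = lambda z: torch.tanh(z)                  # placeholder for the pretrained sampler
print(reflow_step(student, teacher, batch=32, dim=8, optimizer=opt))
```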
2309.06323 Report SAMPLING: Scene-adaptive Hierarchical Multiplane Images Representation for Novel View Synthesis from a Single Image Xiaoyu Zhou, Zhiwei Lin, Xiaojun Shan, Yongtao Wang, Deqing Sun, Ming-Hsuan Yang Recent novel view synthesis methods obtain promising results for relatively small scenes, e.g., indoor environments and scenes with a few objects, but tend to fail for unbounded outdoor scenes with a single image as input. In this paper, we introduce SAMPLING, a Scene-adaptive Hierarchical Multiplane Images Representation for Novel View Synthesis from a Single Image based on improved multiplane images (MPI). Observing that depth distribution varies significantly for unbounded outdoor scenes, we employ an adaptive-bins strategy for MPI to arrange planes in accordance with each scene image. To represent intricate geometry and multi-scale details, we further introduce a hierarchical refinement branch, which results in high-quality synthesized novel views. Our method demonstrates considerable performance gains in synthesizing large-scale unbounded outdoor scenes using a single image on the KITTI dataset and generalizes well to the unseen Tanks and Temples dataset. The code and models will soon be made available. This paper proposes SAMPLING, a novel single-image view synthesis method for unbounded outdoor scenes based on an improved Multiplane Images (MPI) representation. Existing methods struggle to synthesize novel views of large-scale outdoor scenes from single images due to limitations in handling complex geometry and multi-scale details. SAMPLING utilizes an adaptive-bins strategy to arrange MPI planes according to each scene's depth distribution and employs a hierarchical refinement branch to capture multi-scale features. SAMPLING achieves state-of-the-art performance on the KITTI dataset for outdoor scene synthesis. The method demonstrates strong generalization ability, achieving competitive results on the indoor Tanks and Temples dataset despite being trained on outdoor scenes. Ablation studies confirm the effectiveness of the adaptive-bins strategy and the hierarchical refinement branch in improving synthesis quality. SAMPLING, based on MPI, struggles with synthesizing views far from the input view, leading to distortions. Areas with strong diffuse reflections and thin structures pose challenges for accurate scene representation. novel view synthesis, multiplane images, unbounded outdoor scenes, single image, hierarchical refinement
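The adaptive-bins idea in SAMPLING (2309.06323) replaces fixed MPI plane depths with per-image ones: a network predicts bin widths that are normalized and accumulated into plane positions between the near and far depth. A small sketch of that conversion (linear depth spacing is an assumption):

```python
import torch

def bins_to_plane_depths(bin_logits: torch.Tensor, near: float = 1.0, far: float = 80.0) -> torch.Tensor:
    """Convert predicted bin widths into per-image MPI plane depths.

    bin_logits: (B, D) unnormalized widths for D depth bins (per image).
    returns:    (B, D) plane depths, monotonically increasing in [near, far].
    """
    widths = torch.softmax(bin_logits, dim=-1)          # normalized bin widths
    edges = torch.cumsum(widths, dim=-1)                # right edge of each bin in (0, 1]
    centers = edges - 0.5 * widths                      # bin centers in (0, 1)
    return near + (far - near) * centers

logits = torch.randn(2, 32)                              # e.g. 32 planes per scene
depths = bins_to_plane_depths(logits)
print(depths.shape, bool((depths[:, 1:] > depths[:, :-1]).all()))  # (2, 32) True
```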
2309.06169 Report Elucidating the solution space of extended reverse-time SDE for diffusion models Qinpeng Cui, Xinyi Zhang, Zongqing Lu, Qingmin Liao Diffusion models (DMs) demonstrate potent image generation capabilities in various generative modeling tasks. Nevertheless, their primary limitation lies in slow sampling speed, requiring hundreds or thousands of sequential function evaluations through large neural networks to generate high-quality images. Sampling from DMs can be seen alternatively as solving corresponding stochastic differential equations (SDEs) or ordinary differential equations (ODEs). In this work, we formulate the sampling process as an extended reverse-time SDE (ER SDE), unifying prior explorations into ODEs and SDEs. Leveraging the semi-linear structure of ER SDE solutions, we offer exact solutions and arbitrarily high-order approximate solutions for VP SDE and VE SDE, respectively. Based on the solution space of the ER SDE, we yield mathematical insights elucidating the superior performance of ODE solvers over SDE solvers in terms of fast sampling. Additionally, we unveil that VP SDE solvers stand on par with their VE SDE counterparts. Finally, we devise fast and training-free samplers, ER-SDE-Solvers, achieving state-of-the-art performance across all stochastic samplers. Experimental results demonstrate achieving 3.45 FID in 20 function evaluations and 2.24 FID in 50 function evaluations on the ImageNet $64\times64$ dataset. This paper proposes ER-SDE-Solvers, a family of fast and training-free samplers for diffusion models based on an extended reverse-time SDE (ER SDE) formulation. Diffusion models excel in image generation but suffer from slow sampling speed. This work aims to improve sampling speed without retraining. The paper unifies previous ODE and SDE sampling methods into the ER SDE framework, derives exact and approximate solutions for both VP and VE SDEs, analyzes the solution space to understand sampler performance, and designs customized noise scale functions for fast sampling. Mathematically proves the superior performance of ODE solvers over SDE solvers for fast sampling due to lower discretization errors. Demonstrates that VP SDE solvers achieve comparable image quality to VE SDE solvers given consistent pretrained models. ER-SDE-Solvers achieve state-of-the-art performance among stochastic samplers, significantly accelerating generation across various datasets (e.g., 3.45 FID in 20 NFE on ImageNet 64x64). ER-SDE-Solvers focus on fast sampling and may not be suitable for accelerating likelihood evaluation in diffusion models. While achieving state-of-the-art performance among training-free stochastic samplers, ER-SDE-Solvers are still not as fast as some highly optimized GANs or flow-based models. diffusion models, fast sampling, training-free, stochastic differential equations, ordinary differential equations
2309.06023 Report Learning from History: Task-agnostic Model Contrastive Learning for Image Restoration Gang Wu, Junjun Jiang, Kui Jiang, Xianming Liu Contrastive learning has emerged as a prevailing paradigm for high-level vision tasks, which, by introducing properly negative samples, has also been exploited for low-level vision tasks to achieve a compact optimization space to account for their ill-posed nature. However, existing methods rely on manually predefined and task-oriented negatives, which often exhibit pronounced task-specific biases. To address this challenge, our paper introduces an innovative method termed 'learning from history', which dynamically generates negative samples from the target model itself. Our approach, named Model Contrastive Learning for Image Restoration (MCLIR), rejuvenates latency models as negative models, making it compatible with diverse image restoration tasks. We propose the Self-Prior guided Negative loss (SPN) to enable it. This approach significantly enhances existing models when retrained with the proposed model contrastive paradigm. The results show significant improvements in image restoration across various tasks and architectures. For example, models retrained with SPN outperform the original FFANet and DehazeFormer by 3.41 dB and 0.57 dB on the RESIDE indoor dataset for image dehazing. Similarly, they achieve notable improvements of 0.47 dB on SPA-Data over IDT for image deraining and 0.12 dB on Manga109 for a 4x scale super-resolution over lightweight SwinIR, respectively. Code and retrained models are available at https://github.com/Aitical/MCLIR. This paper proposes Model Contrastive Learning for Image Restoration (MCLIR), a novel method that dynamically generates negative samples from the target model itself for contrastive learning in image restoration tasks. Existing contrastive learning methods for image restoration rely on manually predefined negative samples, limiting their generalization capability and introducing task-specific biases. MCLIR addresses these limitations by generating adaptive negatives directly from the model. MCLIR utilizes a latency model updated with exponential moving averages (EMA) of the target model's parameters. A Self-Prior guided Negative loss (SPN) compares features from the target model's output and the latency model, guiding the target model towards a more optimal solution. MCLIR consistently improves performance across various image restoration tasks (super-resolution, dehazing, deraining, deblurring) and architectures (CNNs, Transformers). Retrained models with MCLIR outperform their original counterparts and even surpass some state-of-the-art methods. Ablation studies confirm the effectiveness of using EMA for latency model updates, the impact of negative step size, and the importance of the balancing coefficient in the loss function. The paper primarily focuses on image restoration tasks, leaving its application to other dense prediction tasks unexplored. Hyperparameter tuning was not extensively evaluated across all image restoration tasks. contrastive learning, image restoration, self-supervised learning, negative sample generation, deep learning
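The "learning from history" recipe in MCLIR (2309.06023) keeps an EMA copy of the restoration network as the negative (latency) model and penalizes the current output for staying close to that weaker self-prior while pulling it toward ground truth. The sketch below uses a plain L1 push term as a stand-in for the paper's Self-Prior guided Negative (SPN) loss; the weighting and feature space are assumptions.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def ema_update(latency: nn.Module, target: nn.Module, decay: float = 0.999) -> None:
    """Update the negative 'latency' model as an EMA of the target model."""
    for p_l, p_t in zip(latency.parameters(), target.parameters()):
        p_l.mul_(decay).add_(p_t, alpha=1.0 - decay)

def model_contrastive_loss(model: nn.Module, latency: nn.Module,
                           degraded: torch.Tensor, clean: torch.Tensor,
                           weight: float = 0.1) -> torch.Tensor:
    """Reconstruction loss pulled toward ground truth and pushed away from the
    EMA model's (self-prior) prediction -- a simplified stand-in for SPN."""
    pred = model(degraded)
    with torch.no_grad():
        negative = latency(degraded)
    pull = F.l1_loss(pred, clean)
    push = F.l1_loss(pred, negative)
    return pull - weight * push

# Toy usage with a tiny restoration network.
net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 3, 3, padding=1))
latency_net = copy.deepcopy(net)
x, y = torch.rand(2, 3, 32, 32), torch.rand(2, 3, 32, 32)
loss = model_contrastive_loss(net, latency_net, x, y)
loss.backward()
ema_update(latency_net, net)
print(float(loss))
```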
2309.05956 Report Beyond Generation: Harnessing Text to Image Models for Object Detection and Segmentation Yunhao Ge, Jiashu Xu, Brian Nlong Zhao, Neel Joshi, Laurent Itti, Vibhav Vineet We propose a new paradigm to automatically generate training data with accurate labels at scale using the text-to-image synthesis frameworks (e.g., DALL-E, Stable Diffusion, etc.). The proposed approach decouples training data generation into foreground object generation, and contextually coherent background generation. To generate foreground objects, we employ a straightforward textual template, incorporating the object class name as input prompts. This is fed into a text-to-image synthesis framework, producing various foreground images set against isolated backgrounds. A foreground-background segmentation algorithm is then used to generate foreground object masks. To generate context images, we begin by creating language descriptions of the context. This is achieved by applying an image captioning method to a small set of images representing the desired context. These textual descriptions are then transformed into a diverse array of context images via a text-to-image synthesis framework. Subsequently, we composite these with the foreground object masks produced in the initial step, utilizing a cut-and-paste method, to formulate the training data. We demonstrate the advantages of our approach on five object detection and segmentation datasets, including Pascal VOC and COCO. We found that detectors trained solely on synthetic data produced by our method achieve performance comparable to those trained on real data (Fig. 1). Moreover, a combination of real and synthetic data yields even better results. Further analysis indicates that the synthetic data distribution complements the real data distribution effectively. Additionally, we emphasize the compositional nature of our data generation approach in out-of-distribution and zero-shot data generation scenarios. We open-source our code at https://github.com/gyhandy/Text2Image-for-Detection This paper presents a novel method leveraging text-to-image synthesis models (e.g., DALL-E, Stable Diffusion) to automatically generate large-scale, accurately labeled training datasets for object detection and segmentation. This approach addresses the high cost and labor-intensive nature of acquiring large labeled datasets, essential for training modern deep learning models. The proposed pipeline generates foreground object masks and contextually coherent backgrounds separately. Foreground masks are generated by feeding object class names into a text-to-image model and segmenting the output. Backgrounds are generated by captioning a few exemplar images (or using predefined templates in zero-shot scenarios) and feeding the augmented captions to the text-to-image model. Finally, foregrounds are pasted onto backgrounds to create pseudo-labeled training data. Training object detectors solely on synthetic data generated by this method achieves performance comparable to training on real data, particularly in low-resource scenarios. Combining synthetic and real data further improves performance, indicating the synthetic data effectively complements real data. The method generalizes well to multiple object detection and segmentation datasets and benefits from the compositionality of language, enabling easy modification of generated data by editing textual descriptions. The current approach lacks control over factors like illumination, viewpoint, and object pose. The method is not directly applicable to 3D geometry tasks like 3D object pose estimation. synthetic data generation, text-to-image synthesis, object detection, instance segmentation, low-resource learning
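The final compositing step in 2309.05956 is a plain cut-and-paste: a generated foreground, its segmentation mask, and a generated context image are combined, and the pasted region becomes the detection label. A small sketch with PIL and NumPy (placement, scaling, and blending choices are simplified):

```python
import numpy as np
from PIL import Image

def paste_foreground(background: Image.Image, foreground: Image.Image,
                     mask: np.ndarray, x: int, y: int):
    """Paste a masked foreground object onto a background at (x, y) and return
    the composite plus its bounding-box label (x0, y0, x1, y1)."""
    bg = np.array(background).copy()
    fg = np.array(foreground)
    h, w = mask.shape
    region = bg[y:y + h, x:x + w]
    region[mask > 0] = fg[mask > 0]                  # copy only the object pixels
    bg[y:y + h, x:x + w] = region
    return Image.fromarray(bg), (x, y, x + w, y + h)

# Toy usage: a 64x64 object pasted into a 256x256 generated context image.
context = Image.new("RGB", (256, 256), (120, 160, 200))
obj = Image.new("RGB", (64, 64), (200, 50, 50))
obj_mask = np.zeros((64, 64), dtype=np.uint8)
obj_mask[8:56, 8:56] = 1                             # pretend segmentation mask
composite, box = paste_foreground(context, obj, obj_mask, x=100, y=80)
print(composite.size, box)                           # (256, 256) (100, 80, 164, 144)
```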
2309.05940 Report Catch You Everything Everywhere: Guarding Textual Inversion via Concept Watermarking Weitao Feng, Jiyan He, Jie Zhang, Tianwei Zhang, Wenbo Zhou, Weiming Zhang, Nenghai Yu AIGC (AI-Generated Content) has achieved tremendous success in many applications such as text-to-image tasks, where the model can generate high-quality images with diverse prompts, namely, different descriptions in natural languages. More surprisingly, the emerging personalization techniques even succeed in describing unseen concepts with only a few personal images as references, and there have been some commercial platforms for sharing the valuable personalized concept. However, such an advanced technique also introduces a severe threat, where malicious users can misuse the target concept to generate highly-realistic illegal images. Therefore, it becomes necessary for the platform to trace malicious users and hold them accountable. In this paper, we focus on guarding the most popular lightweight personalization model, ie, Textual Inversion (TI). To achieve it, we propose the novel concept watermarking, where watermark information is embedded into the target concept and then extracted from generated images based on the watermarked concept. Specifically, we jointly train a watermark encoder and a watermark decoder with the sampler in the loop. It shows great resilience to different diffusion sampling processes possibly chosen by malicious users, meanwhile preserving utility for normal use. In practice, the concept owner can upload his concept with different watermarks (ie, serial numbers) to the platform, and the platform allocates different users with different serial numbers for subsequent tracing and forensics. This paper proposes "concept watermarking," a novel method to embed watermark information into Textual Inversion embeddings for tracing the misuse of personalized AI-generated content. Sharing personalized AI concepts raises concerns about misuse for illegal commercial purposes or generating harmful content. Tracing malicious users is crucial for accountability. The method jointly trains a watermark encoder and decoder with the diffusion sampler in the loop, ensuring the watermark's robustness against different diffusion configurations and preserving the fidelity and editability of the original concept. The proposed method effectively embeds watermarks into concepts with a high success rate while preserving image fidelity and textual editability. It exhibits robustness against various distortions, including different diffusion sampling configurations, post-processing on generated images, and pre-processing on watermarked concepts. The method shows resilience against adaptive attacks like retraining concept embeddings and forgery attacks. The capacity for encoding information is limited by the number of tokens used in Textual Inversion. The training process can be computationally expensive due to the involvement of the diffusion sampling pipeline. ai-generated content, textual inversion, concept watermarking, copyright protection, diffusion models
2309.05793 Report PhotoVerse: Tuning-Free Image Customization with Text-to-Image Diffusion Models Li Chen, Mengyi Zhao, Yiheng Liu, Mingxu Ding, Yangyang Song, Shizun Wang, Xu Wang, Hao Yang, Jing Liu, Kang Du, Min Zheng Personalized text-to-image generation has emerged as a powerful and sought-after tool, empowering users to create customized images based on their specific concepts and prompts. However, existing approaches to personalization encounter multiple challenges, including long tuning times, large storage requirements, the necessity for multiple input images per identity, and limitations in preserving identity and editability. To address these obstacles, we present PhotoVerse, an innovative methodology that incorporates a dual-branch conditioning mechanism in both text and image domains, providing effective control over the image generation process. Furthermore, we introduce facial identity loss as a novel component to enhance the preservation of identity during training. Remarkably, our proposed PhotoVerse eliminates the need for test time tuning and relies solely on a single facial photo of the target identity, significantly reducing the resource cost associated with image generation. After a single training phase, our approach enables generating high-quality images within only a few seconds. Moreover, our method can produce diverse images that encompass various scenes and styles. The extensive evaluation demonstrates the superior performance of our approach, which achieves the dual objectives of preserving identity and facilitating editability. Project page: https://photoverse2d.github.io/ This paper presents PhotoVerse, a novel personalized text-to-image generation method that uses dual-branch conditioning (text and image) and a facial identity loss to preserve identity while enabling image editing. Existing personalized text-to-image generation methods suffer from long tuning times, large storage needs, and limitations in preserving identity and editability. PhotoVerse addresses these challenges. PhotoVerse uses dual-branch conditioning to project a reference image into a pseudo-word and image feature. These are injected into a fine-tuned text-to-image diffusion model (Stable Diffusion) alongside a facial identity loss during training. PhotoVerse generates personalized images in seconds using a single reference image, eliminating test-time tuning. It surpasses state-of-the-art methods in preserving identity attributes (facial features, expressions, hair) while enabling stylization and scene generation. Ablation studies highlight the importance of the dual-branch conditioning, facial identity loss, and regularization for high-quality personalized image generation. The model's performance across different ethnicities may be influenced by biases in the pre-trained model. Future work could explore further improvements in pose control and incorporating additional control mechanisms. text-to-image generation, personalization, diffusion models, identity preservation, image editing
2309.05569 Report ITI-GEN: Inclusive Text-to-Image Generation Cheng Zhang, Xuanbai Chen, Siqi Chai, Chen Henry Wu, Dmitry Lagun, Thabo Beeler, Fernando De la Torre Text-to-image generative models often reflect the biases of the training data, leading to unequal representations of underrepresented groups. This study investigates inclusive text-to-image generative models that generate images based on human-written prompts and ensure the resulting images are uniformly distributed across attributes of interest. Unfortunately, directly expressing the desired attributes in the prompt often leads to sub-optimal results due to linguistic ambiguity or model misrepresentation. Hence, this paper proposes a drastically different approach that adheres to the maxim that "a picture is worth a thousand words". We show that, for some attributes, images can represent concepts more expressively than text. For instance, categories of skin tones are typically hard to specify by text but can be easily represented by example images. Building upon these insights, we propose a novel approach, ITI-GEN, that leverages readily available reference images for Inclusive Text-to-Image GENeration. The key idea is learning a set of prompt embeddings to generate images that can effectively represent all desired attribute categories. More importantly, ITI-GEN requires no model fine-tuning, making it computationally efficient to augment existing text-to-image models. Extensive experiments demonstrate that ITI-GEN largely improves over state-of-the-art models to generate inclusive images from a prompt. Project page: https://czhang0528.github.io/iti-gen. This paper introduces ITI-GEN, a novel framework that leverages reference images to learn inclusive prompts, thus improving text-to-image generation diversity across various attributes (e.g., skin tone, age) without retraining the generative model. Existing text-to-image models often inherit biases from training data, resulting in an under-representation of certain demographics and attributes. Current bias mitigation techniques in text-to-image generation are limited by linguistic ambiguity and computational complexity. ITI-GEN uses a pre-trained CLIP model and a reference image set for each attribute. It learns inclusive token embeddings by aligning the direction of image and prompt embeddings, and employs a semantic consistency loss to preserve language semantics. By sampling from a diverse set of learned prompts, ITI-GEN generates images with a balanced representation of attributes. ITI-GEN effectively balances single binary attributes, achieving near-perfect performance on most of the 40 attributes from CelebA. It generalizes well to multiple attributes, generating diverse images with various category combinations by aggregating inclusive tokens. The approach demonstrates strong performance in handling multi-category attributes like age and skin tone, even when using synthetic images. Limitations: ITI-GEN may not be optimal for subtle facial attributes or highly entangled attribute combinations. Future Work: Explore lifelong learning capabilities for adding new attributes and investigate the application to other attribute types like 3D geometry. inclusive text-to-image generation, bias mitigation, prompt learning, clip, diversity
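The summary above says ITI-GEN learns inclusive token embeddings by aligning the direction between image embeddings with the direction between prompt embeddings in CLIP space. The snippet below is a minimal sketch of such a directional alignment term under that reading; the function name and the use of mean-pooled CLIP image embeddings per reference set are assumptions, not the paper's exact loss, and the semantic consistency term is not shown.

```python
import torch
import torch.nn.functional as F

def directional_alignment_loss(img_embs_a: torch.Tensor, img_embs_b: torch.Tensor,
                               prompt_emb_a: torch.Tensor, prompt_emb_b: torch.Tensor) -> torch.Tensor:
    # img_embs_a / img_embs_b: (N, D) CLIP image embeddings of two reference sets
    # (e.g. two categories of one attribute); prompt_emb_a / prompt_emb_b: (D,)
    # CLIP text embeddings of the corresponding learnable prompts.
    img_dir = F.normalize(img_embs_b.mean(0) - img_embs_a.mean(0), dim=-1)
    txt_dir = F.normalize(prompt_emb_b - prompt_emb_a, dim=-1)
    # Encourage the prompt-space direction to match the image-space direction.
    return 1.0 - (img_dir * txt_dir).sum()
```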
2309.05448 Report Panoptic Vision-Language Feature Fields Haoran Chen, Kenneth Blomqvist, Francesco Milano, Roland Siegwart Recently, methods have been proposed for 3D open-vocabulary semantic segmentation. Such methods are able to segment scenes into arbitrary classes based on text descriptions provided during runtime. In this paper, we propose to the best of our knowledge the first algorithm for open-vocabulary panoptic segmentation in 3D scenes. Our algorithm, Panoptic Vision-Language Feature Fields (PVLFF), learns a semantic feature field of the scene by distilling vision-language features from a pretrained 2D model, and jointly fits an instance feature field through contrastive learning using 2D instance segments on input frames. Despite not being trained on the target classes, our method achieves panoptic segmentation performance similar to the state-of-the-art closed-set 3D systems on the HyperSim, ScanNet and Replica dataset and additionally outperforms current 3D open-vocabulary systems in terms of semantic segmentation. We ablate the components of our method to demonstrate the effectiveness of our model architecture. Our code will be available at https://github.com/ethz-asl/pvlff. PVLFF, the first open-vocabulary 3D panoptic segmentation system, reconstructs scenes implicitly and enables panoptic segmentation under open-vocabulary prompts. Existing 3D panoptic segmentation methods are limited to closed-set predictions. PVLFF bridges this gap, allowing for flexible semantic queries and instance segmentation. PVLFF learns a semantic feature field by distilling vision-language embeddings and an instance feature field via contrastive learning on 2D instance proposals, all within a neural radiance field framework. PVLFF achieves comparable panoptic segmentation performance to state-of-the-art closed-set methods on HyperSim, ScanNet, and Replica datasets. PVLFF outperforms zero-shot methods in both 2D and 3D semantic segmentation on ScanNet. The learned instance features exhibit a hierarchical structure, enabling instance segmentation at different scales. The instance segmentation performance is limited by the quality and granularity of the pre-computed object-agnostic 2D instance proposals. The semantic segmentation relies on a vision-language model with a closed vocabulary, limiting its performance on unseen categories. 3d panoptic segmentation, open-vocabulary, neural radiance fields, contrastive learning, vision-language
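PVLFF's instance branch is described above as being fit with contrastive learning on 2D instance segments of the input frames. The code below sketches one plausible instantiation: a supervised-contrastive loss over per-pixel instance features rendered from the field, where pixels from the same 2D proposal are positives. The sampling scheme, temperature, and exact loss in the paper may differ; all names here are illustrative.

```python
import torch
import torch.nn.functional as F

def instance_contrastive_loss(feats: torch.Tensor, seg_ids: torch.Tensor,
                              temperature: float = 0.1) -> torch.Tensor:
    # feats: (P, D) instance features rendered at P sampled pixels of one frame.
    # seg_ids: (P,) integer labels from the frame's object-agnostic 2D instance proposals.
    feats = F.normalize(feats, dim=-1)
    logits = feats @ feats.t() / temperature                      # (P, P) pairwise similarities
    same = seg_ids.unsqueeze(0) == seg_ids.unsqueeze(1)           # pixels in the same proposal
    eye = torch.eye(len(feats), dtype=torch.bool, device=feats.device)
    pos_mask = same & ~eye
    # Log-softmax over all other pixels (self-similarity excluded from the denominator).
    log_prob = logits - torch.logsumexp(logits.masked_fill(eye, float("-inf")), dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_counts
    # Average over anchors that have at least one positive.
    return loss[pos_mask.any(dim=1)].mean()
```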
2309.05418 Report FlowIBR: Leveraging Pre-Training for Efficient Neural Image-Based Rendering of Dynamic Scenes Marcel Büsching, Josef Bengtson, David Nilsson, Mårten Björkman We introduce FlowIBR, a novel approach for efficient monocular novel view synthesis of dynamic scenes. Existing techniques already show impressive rendering quality but tend to focus on optimization within a single scene without leveraging prior knowledge, resulting in long optimization times per scene. FlowIBR circumvents this limitation by integrating a neural image-based rendering method, pre-trained on a large corpus of widely available static scenes, with a per-scene optimized scene flow field. Utilizing this flow field, we bend the camera rays to counteract the scene dynamics, thereby presenting the dynamic scene as if it were static to the rendering network. The proposed method reduces per-scene optimization time by an order of magnitude, achieving comparable rendering quality to existing methods -- all on a single consumer-grade GPU. FlowIBR, a novel view synthesis method for dynamic scenes, reduces training time by combining a pre-trained generalizable neural image-based rendering method with a per-scene optimized scene flow field. Existing methods for dynamic novel view synthesis have long training times and struggle with fast-changing scenes due to relying solely on per-scene optimization without leveraging prior knowledge. The method uses a pre-trained Generalizable NeRF Transformer (GNT) for static scenes and learns a per-scene scene flow field. This field bends camera rays to compensate for scene dynamics, allowing the static GNT to render dynamic scenes. Reduces per-scene optimization time to 1.5 hours, an order of magnitude faster than previous methods. Achieves comparable rendering quality to state-of-the-art methods on the Nvidia Dynamic Scenes Dataset. Enables training on a single consumer-grade GPU due to a dynamics-focused optimization process. Moderate rendering speed due to the general-purpose rendering backbone and multi-view projection. Long sequences can be challenging for the single scene flow network to capture. novel view synthesis, dynamic scenes, scene flow, neural rendering, image-based rendering
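FlowIBR's core trick, per the summary above, is to bend camera-ray sample points with a per-scene flow field so that a pre-trained static renderer sees a "static" scene. A stripped-down sketch of such a bending module follows; positional/time encodings and the actual GNT backbone are omitted, and the raw 4-D input (xyz plus normalized time) is an assumption for illustration.

```python
import torch
import torch.nn as nn

class SceneFlowBending(nn.Module):
    """Predict a 3D offset for each ray sample at time t and map it to the canonical
    (static) configuration before handing it to the pre-trained static renderer."""
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, points: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # points: (N, 3) samples along camera rays; t: (N, 1) normalized frame time.
        offsets = self.mlp(torch.cat([points, t], dim=-1))
        return points + offsets   # "bent" sample locations fed to the static model
```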
2309.05375 Report Toward a Deeper Understanding: RetNet Viewed through Convolution Chenghao Li, Chaoning Zhang The success of Vision Transformer (ViT) has been widely reported on a wide range of image recognition tasks. ViT can learn global dependencies superior to CNN, yet CNN's inherent locality can substitute for expensive training resources. Recently, the outstanding performance of RetNet in the field of language modeling has garnered attention, surpassing that of the Transformer with explicit local modeling, shifting researchers' focus towards Transformers in the CV field. This paper investigates the effectiveness of RetNet from a CNN perspective and presents a variant of RetNet tailored to the visual domain. Similar to RetNet, we improve ViT's local modeling by applying a weight mask on the original self-attention matrix. A straightforward way to locally adapt the self-attention matrix can be realized by an element-wise learnable weight mask (ELM), for which our preliminary experiments show promising results. However, the simple element-wise learnable weight mask not only induces a non-trivial additional parameter overhead but also increases the optimization complexity. To this end, this work proposes a novel Gaussian mixture mask (GMM) in which one mask only has two learnable parameters and can be conveniently used in any ViT variants whose attention mechanism allows the use of masks. Experimental results on multiple small datasets demonstrate the effectiveness of our proposed Gaussian mask for boosting ViTs for free (almost zero additional parameter or computation cost). Our code is publicly available at https://github.com/CatworldLee/Gaussian-Mixture-Mask-Attention. This paper proposes a Gaussian Mixture Mask (GMM) for Vision Transformers (ViTs) to enhance their local modeling capabilities, especially on small datasets, with negligible parameter and computation overhead. Transformers often struggle to match the performance of CNNs on small datasets due to their lack of inherent local inductive bias. While techniques like pre-training and explicit local modeling exist, they come with limitations. This work aims to address this by introducing a lightweight, effective, and plug-and-play module for ViTs. The paper first introduces an Element-wise Learnable Mask (ELM) added to the attention scores, revealing two key characteristics: locality (preference for nearby patches) and extroversion (reduced self-attention). Building upon these findings, they propose GMM, which uses a mixture of Gaussian functions to generate the attention mask dynamically. This approach significantly reduces the number of learnable parameters compared to ELM while achieving superior performance. GMM-ViT consistently outperforms standard ViT and even achieves comparable performance to Swin Transformer on small datasets with minimal additional parameters. GMM effectively improves the performance of deep ViTs, mitigating the accuracy drop observed with increasing depth. Visualization of attention maps shows that GMM-ViT exhibits stronger expressive power than both standard ViT and ELM-ViT. The paper mainly focuses on small-scale datasets and further investigation is needed to evaluate the effectiveness of GMM on large-scale datasets. The impact of GMM on different ViT variants is explored, but a more comprehensive analysis on a wider range of architectures is desirable. vision transformer, local modeling, gaussian mixture mask, attention mechanism, small datasets
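Since the GMM idea above is concrete (a distance-based mask with two learnable scalars per Gaussian added to the attention scores), here is a hedged sketch of how such a mask could be plugged into ViT attention. Class-token handling, per-layer sharing, and the paper's exact parameterization are not reproduced; names like `GaussianMixtureMask` are illustrative.

```python
import torch
import torch.nn as nn

class GaussianMixtureMask(nn.Module):
    def __init__(self, grid_size: int, num_gaussians: int = 3):
        super().__init__()
        # Two learnable scalars per Gaussian: amplitude and bandwidth.
        self.amplitude = nn.Parameter(torch.ones(num_gaussians))
        self.bandwidth = nn.Parameter(torch.linspace(1.0, 3.0, num_gaussians))
        # Pairwise squared distances between patch centres on the 2D grid (fixed buffer).
        ys, xs = torch.meshgrid(torch.arange(grid_size), torch.arange(grid_size), indexing="ij")
        coords = torch.stack([ys, xs], dim=-1).reshape(-1, 2).float()
        self.register_buffer("dist2", torch.cdist(coords, coords) ** 2)

    def forward(self, attn_scores: torch.Tensor) -> torch.Tensor:
        # attn_scores: (B, heads, N, N) pre-softmax scores over N = grid_size**2 patch tokens.
        mask = sum(a * torch.exp(-self.dist2 / (2.0 * b ** 2))
                   for a, b in zip(self.amplitude, self.bandwidth))
        return attn_scores + mask  # bias the scores toward nearby patches, then softmax as usual
```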
2309.05251 Report Multi3DRefer: Grounding Text Description to Multiple 3D Objects Yiming Zhang, ZeMing Gong, Angel X. Chang We introduce the task of localizing a flexible number of objects in real-world 3D scenes using natural language descriptions. Existing 3D visual grounding tasks focus on localizing a unique object given a text description. However, such a strict setting is unnatural as localizing potentially multiple objects is a common need in real-world scenarios and robotic tasks (e.g., visual navigation and object rearrangement). To address this setting we propose Multi3DRefer, generalizing the ScanRefer dataset and task. Our dataset contains 61926 descriptions of 11609 objects, where zero, single or multiple target objects are referenced by each description. We also introduce a new evaluation metric and benchmark methods from prior work to enable further investigation of multi-modal 3D scene understanding. Furthermore, we develop a better baseline leveraging 2D features from CLIP by rendering object proposals online with contrastive learning, which outperforms the state of the art on the ScanRefer benchmark. Introduces the task of localizing multiple 3D objects in real-world scenes from natural language descriptions, addressing the limitation of previous work that assumes a single target object. Enables more realistic and flexible 3D visual grounding, crucial for robotics, AR/VR, and other applications requiring interaction with 3D environments. Creates the Multi3DRefer dataset, augmenting ScanRefer with descriptions referencing zero, single, or multiple objects using ChatGPT and manual verification. Proposes M3DRef-CLIP, a CLIP-based approach with online rendering and contrastive learning, and benchmarks it against existing methods. M3DRef-CLIP outperforms state-of-the-art methods on ScanRefer and achieves competitive results on Nr3D. Training on Multi3DRefer improves performance on ScanRefer, demonstrating the dataset's value. Analysis shows that CLIP features, contrastive learning, and the Hungarian matching strategy improve performance on Multi3DRefer. Current design relies heavily on 3D object detector features, potentially limiting the understanding of global context. Exploring positional encoding to improve spatial reasoning is an area for future work. 3d visual grounding, multi-object localization, vision-language models, clip, 3d scene understanding
2309.04917 Report Editing 3D Scenes via Text Prompts without Retraining Shuangkang Fang, Yufeng Wang, Yi Yang, Yi-Hsuan Tsai, Wenrui Ding, Shuchang Zhou, Ming-Hsuan Yang Numerous diffusion models have recently been applied to image synthesis and editing. However, editing 3D scenes is still in its early stages. It poses various challenges, such as the requirement to design specific methods for different editing types, retraining new models for various 3D scenes, and the absence of convenient human interaction during editing. To tackle these issues, we introduce a text-driven editing method, termed DN2N, which allows for the direct acquisition of a NeRF model with universal editing capabilities, eliminating the requirement for retraining. Our method employs off-the-shelf text-based editing models of 2D images to modify the 3D scene images, followed by a filtering process to discard poorly edited images that disrupt 3D consistency. We then consider the remaining inconsistency as a problem of removing noise perturbation, which can be solved by generating training data with similar perturbation characteristics for training. We further propose cross-view regularization terms to help the generalized NeRF model mitigate these perturbations. Our text-driven method allows users to edit a 3D scene with their desired description, which is more friendly, intuitive, and practical than prior works. Empirical results show that our method achieves multiple editing types, including but not limited to appearance editing, weather transition, material changing, and style transfer. Most importantly, our method generalizes well with editing abilities shared among a set of model parameters without requiring a customized editing model for some specific scenes, thus inferring novel views with editing effects directly from user input. The project website is available at https://sk-fun.fun/DN2N Proposes DN2N, a text-driven 3D scene editing framework with generalization capability, eliminating the need to retrain models for different scenes or editing types. Existing 3D scene editing methods often lack user-friendliness, require retraining for different scenes or editing types, and have limited modification capabilities. Leverages off-the-shelf 2D image editing models for initial 3D editing, filters poorly edited images, and trains a generalizable NeRF model to remove inconsistencies, treating them as perturbations. Achieves diverse editing types, including appearance editing, weather transitions, object changing, and style transfer. Demonstrates superior performance in preserving image content, aligning with text descriptions, and maintaining 3D consistency compared to existing methods. Significantly reduces editing time and storage consumption by eliminating retraining for new scenes or editing types. The quality of 3D editing is constrained by the capabilities of the underlying 2D editing model. Quantitative evaluation of editing results remains a challenge, relying on subjective evaluations like user studies. 3d scene editing, text-driven editing, neural radiance fields, generalizable model, content filter
2309.04907 Report Effective Real Image Editing with Accelerated Iterative Diffusion Inversion Zhihong Pan, Riccardo Gherardi, Xiufeng Xie, Stephen Huang Despite all recent progress, it is still challenging to edit and manipulate natural images with modern generative models. When using Generative Adversarial Network (GAN), one major hurdle is in the inversion process mapping a real image to its corresponding noise vector in the latent space, since it is necessary to be able to reconstruct an image to edit its contents. Likewise for Denoising Diffusion Implicit Models (DDIM), the linearization assumption in each inversion step makes the whole deterministic inversion process unreliable. Existing approaches that have tackled the problem of inversion stability often incur significant trade-offs in computational efficiency. In this work, we propose an Accelerated Iterative Diffusion Inversion method, dubbed AIDI, that significantly improves reconstruction accuracy with minimal additional overhead in space and time complexity. By using a novel blended guidance technique, we show that effective results can be obtained on a large range of image editing tasks without large classifier-free guidance in inversion. Furthermore, when compared with other diffusion-inversion-based works, our proposed process is shown to be more robust for fast image editing in the 10- and 20-diffusion-step regimes. Presents AIDI, an accelerated iterative diffusion inversion method for enhanced real image editing with text-to-image diffusion models. Addresses the challenge of unreliable inversion in diffusion models, which limits their effectiveness for real image editing. Proposes AIDI, employing fixed-point iteration and acceleration techniques for improved inversion stability. Introduces blended guidance to apply different guidance scales for editing and inversion, enhancing editing control. AIDI significantly improves reconstruction accuracy compared to baseline methods, achieving near-exact inversion without classifier-free guidance. Enables effective image editing with as few as 10 diffusion steps, outperforming competing approaches in terms of editing quality and perceptual similarity. Proposed stochastic editing recovers from failure cases of deterministic editing, increasing editing flexibility. Detailed control of the editable area using coarse cross-attention maps requires further investigation. Improving inversion stability for large guidance scales remains an area for future research. diffusion models, image editing, diffusion inversion, text-to-image synthesis, generative models
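The AIDI summary above attributes its accuracy gain to fixed-point iteration of the DDIM inversion step (plus an acceleration scheme not shown here). Below is a hedged sketch of the plain fixed-point refinement, assuming a standard epsilon-prediction model `eps_model(x, t)` and a cumulative-alpha schedule `alpha_bar`; these names and the iteration count are placeholders rather than the paper's implementation.

```python
import torch

@torch.no_grad()
def ddim_inversion_step_fp(x_t: torch.Tensor, t: int, t_next: int,
                           eps_model, alpha_bar: torch.Tensor, num_iters: int = 3) -> torch.Tensor:
    # alpha_bar: (T,) cumulative product of alphas; eps_model(x, t) -> predicted noise.
    a_t, a_next = alpha_bar[t], alpha_bar[t_next]

    def step(eps: torch.Tensor) -> torch.Tensor:
        # Deterministic DDIM update from t to the noisier t_next, given a noise estimate.
        x0 = (x_t - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()
        return a_next.sqrt() * x0 + (1.0 - a_next).sqrt() * eps

    x_next = step(eps_model(x_t, t))            # usual linearized initial guess
    for _ in range(num_iters):                  # fixed-point refinement of the implicit step
        x_next = step(eps_model(x_next, t_next))
    return x_next
```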
2309.04887 Report SortedAP: Rethinking evaluation metrics for instance segmentation Long Chen, Yuli Wu, Johannes Stegmaier, Dorit Merhof Designing metrics for evaluating instance segmentation revolves around comprehensively considering object detection and segmentation accuracy. However, other important properties, such as sensitivity, continuity, and equality, are overlooked in the current study. In this paper, we reveal that most existing metrics have a limited resolution of segmentation quality. They are only conditionally sensitive to the change of masks or false predictions. For certain metrics, the score can change drastically in a narrow range which could provide a misleading indication of the quality gap between results. Therefore, we propose a new metric called sortedAP, which strictly decreases with both object- and pixel-level imperfections and has an uninterrupted penalization scale over the entire domain. We provide the evaluation toolkit and experiment code at https://www.github.com/looooongChen/sortedAP. The paper proposes a new evaluation metric for instance segmentation called sorted Average Precision (sortedAP) that addresses limitations in existing metrics. Current instance segmentation metrics often lack sensitivity to small changes, exhibit abrupt score changes, or treat objects unequally based on size. This can lead to misleading evaluations. The authors analyze existing metrics like mAP and PQ, highlighting their deficiencies. They then introduce sortedAP, which utilizes a "Unique Matching" method based on the Hungarian algorithm to maximize IoU matching and calculate AP over the entire IoU range. sortedAP demonstrates smooth and continuous score degradation with gradually introduced errors in simulated experiments, unlike metrics like mAP or PQ. The Unique Matching method allows for the use of IoU thresholds below 0.5 and handles object overlap effectively. sortedAP provides a more sensitive and accurate reflection of segmentation quality compared to existing metrics. The paper primarily focuses on clustered instances, and further evaluation with different instance types is suggested. Future work could explore incorporating sortedAP into existing deep learning frameworks for model training and optimization. instance segmentation, evaluation metric, sortedap, unique matching, iou
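The sortedAP entry above hinges on two concrete pieces: one-to-one "Unique Matching" via the Hungarian algorithm that maximizes IoU, and a score computed over the entire IoU threshold range. The sketch below implements the matching and a threshold-swept score; the per-threshold formula AP(t) = TP / (TP + FP + FN) is one common instance-segmentation convention and stands in for the paper's exact aggregation, which may differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def unique_matching_ious(iou: np.ndarray) -> np.ndarray:
    """One-to-one matching of predictions (rows) to ground truth (columns)
    maximizing total IoU via the Hungarian algorithm; returns matched IoUs."""
    rows, cols = linear_sum_assignment(-iou)
    return iou[rows, cols]

def sorted_ap(iou: np.ndarray, num_thresholds: int = 1000) -> float:
    """Average a per-threshold score over the full IoU range [0, 1)."""
    n_pred, n_gt = iou.shape
    matched = unique_matching_ious(iou)
    scores = []
    for t in np.linspace(0.0, 1.0, num_thresholds, endpoint=False):
        tp = int((matched > t).sum())
        denom = n_pred + n_gt - tp  # TP + FP + FN
        scores.append(tp / denom if denom > 0 else 1.0)
    return float(np.mean(scores))
```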
2309.04820 Report ABC Easy as 123: A Blind Counter for Exemplar-Free Multi-Class Class-agnostic Counting Michael A. Hobley, Victor A. Prisacariu Class-agnostic counting methods enumerate objects of an arbitrary class, providing tremendous utility in many fields. Prior works have limited usefulness as they require either a set of examples of the type to be counted or that the image contains only a single type of object. A significant factor in these shortcomings is the lack of a dataset to properly address counting in settings with more than one kind of object present. To address these issues, we propose the first Multi-class, Class-Agnostic Counting dataset (MCAC) and A Blind Counter (ABC123), a method that can count multiple types of objects simultaneously without using examples of the type during training or inference. ABC123 introduces a new paradigm where instead of requiring exemplars to guide the enumeration, examples are found after the counting stage to help a user understand the generated outputs. We show that ABC123 outperforms contemporary methods on MCAC without the requirement of human-in-the-loop annotations. We also show that this performance transfers to FSC-147, the standard class-agnostic counting dataset. This paper introduces MCAC, the first multi-class class-agnostic counting dataset, and ABC123, an exemplar-free multi-class class-agnostic counting method that outperforms existing methods in multi-class settings. Existing class-agnostic counting methods rely on exemplars or single-class images, limiting their real-world applicability. This work addresses the need for accurate counting in multi-class scenarios. ABC123 leverages a vision transformer backbone and multiple upsampling heads to regress density maps for potential object classes. It then uses a matching stage to align predictions with ground truth labels during training and employs an example discovery stage to provide interpretable visualizations. ABC123 significantly outperforms exemplar-based methods on MCAC, demonstrating its effectiveness in multi-class counting. The method generalizes well to FSC-147, a photographic counting dataset, highlighting its ability to learn from synthetic data. ABC123 often identifies 'valid-but-unknown' counts, revealing its potential for novel class discovery. The example discovery stage relies on a pre-trained segmentation method, which might limit its accuracy. Quantitative evaluation on FSC-147 is hindered by discrepancies in class definitions between MCAC and FSC, highlighting the need for further research on aligning synthetic and real-world data. class-agnostic counting, multi-class counting, exemplar-free counting, synthetic dataset, object counting
2309.04581 Report Dynamic Mesh-Aware Radiance Fields Yi-Ling Qiao, Alexander Gao, Yiran Xu, Yue Feng, Jia-Bin Huang, Ming C. Lin Embedding polygonal mesh assets within photorealistic Neural Radiance Fields (NeRF) volumes, such that they can be rendered and their dynamics simulated in a physically consistent manner with the NeRF, is under-explored from the system perspective of integrating NeRF into the traditional graphics pipeline. This paper designs a two-way coupling between mesh and NeRF during rendering and simulation. We first review the light transport equations for both mesh and NeRF, then distill them into an efficient algorithm for updating radiance and throughput along a cast ray with an arbitrary number of bounces. To resolve the discrepancy between the linear color space that the path tracer assumes and the sRGB color space that standard NeRF uses, we train NeRF with High Dynamic Range (HDR) images. We also present a strategy to estimate light sources and cast shadows on the NeRF. Finally, we consider how the hybrid surface-volumetric formulation can be efficiently integrated with a high-performance physics simulator that supports cloth, rigid and soft bodies. The full rendering and simulation system can be run on a GPU at interactive rates. We show that a hybrid system approach outperforms alternatives in visual realism for mesh insertion, because it allows realistic light transport from volumetric NeRF media onto surfaces, which affects the appearance of reflective/refractive surfaces and illumination of diffuse surfaces informed by the dynamic scene. This paper presents a hybrid graphics pipeline that integrates neural radiance fields (NeRF) and polygonal meshes for photorealistic rendering and physically-based simulation. NeRF excels at capturing photorealistic appearances, while meshes are better suited for simulation and traditional graphics pipelines. Integrating both allows leveraging their respective strengths. The method unifies light transport equations for NeRF and surface rendering, enabling seamless switching between ray marching and path tracing. It utilizes HDR NeRF for accurate lighting and estimates light sources for shadow casting. Hybrid rendering produces more realistic results than separate rendering or mesh extraction from NeRF. HDR NeRF provides more accurate lighting compared to standard NeRF, especially for indirect illumination. The system achieves interactive frame rates on a laptop GPU for real-time applications. Current implementation lacks shadow casting and illumination on NeRF points. Support for advanced rendering features like environment maps and UV textures is limited. neural radiance fields, nerf, hybrid rendering, physics simulation, hdr
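The entry above mentions a unified update of radiance and throughput along a cast ray that mixes NeRF volume samples with mesh surface hits. The sketch below shows only the simplest piece of that idea, assuming a single surface hit behind the sampled ray segment and the shapes noted in the comments; multi-bounce path tracing, HDR handling, and shadows are not modeled.

```python
import torch

def composite_volume_then_surface(sigmas: torch.Tensor, colors: torch.Tensor,
                                  deltas: torch.Tensor, hit_mask: torch.Tensor,
                                  surface_radiance: torch.Tensor) -> torch.Tensor:
    # sigmas, deltas: (R, S) densities and step sizes for S volume samples on R rays
    # colors: (R, S, 3); hit_mask: (R,) bool, True where the ray hits a mesh behind the samples
    # surface_radiance: (R, 3) radiance returned by the surface shader / path tracer
    alphas = 1.0 - torch.exp(-sigmas * deltas)                                   # (R, S)
    trans = torch.cumprod(torch.cat([torch.ones_like(alphas[:, :1]),
                                     1.0 - alphas + 1e-10], dim=-1), dim=-1)[:, :-1]
    weights = (alphas * trans).unsqueeze(-1)                                     # (R, S, 1)
    radiance = (weights * colors).sum(dim=1)                                     # volumetric contribution
    throughput = trans[:, -1] * (1.0 - alphas[:, -1])                            # light left after the volume
    radiance = radiance + hit_mask.float().unsqueeze(-1) * throughput.unsqueeze(-1) * surface_radiance
    return radiance
```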
2309.04561 Report Three Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding Ozan Unal, Christos Sakaridis, Suman Saha, Fisher Yu, Luc Van Gool 3D visual grounding is the task of localizing the object in a 3D scene which is referred by a description in natural language. With a wide range of applications ranging from autonomous indoor robotics to AR/VR, the task has recently risen in popularity. A common formulation to tackle 3D visual grounding is grounding-by-detection, where localization is done via bounding boxes. However, for real-life applications that require physical interactions, a bounding box insufficiently describes the geometry of an object. We therefore tackle the problem of dense 3D visual grounding, i.e. referral-based 3D instance segmentation. We propose a dense 3D grounding network ConcreteNet, featuring three novel stand-alone modules which aim to improve grounding performance for challenging repetitive instances, i.e. instances with distractors of the same semantic class. First, we introduce a bottom-up attentive fusion module that aims to disambiguate inter-instance relational cues, next we construct a contrastive training scheme to induce separation in the latent space, and finally we resolve view-dependent utterances via a learned global camera token. ConcreteNet ranks 1st on the challenging ScanRefer online benchmark by a considerable +9.43% accuracy at 50% IoU and has won the ICCV 3rd Workshop on Language for 3D Scenes "3D Object Localization" challenge. This paper presents ConcreteNet, a novel network for dense 3D visual grounding, which localizes objects in a 3D scene based on natural language descriptions and provides detailed 3D instance masks instead of just bounding boxes. Dense 3D visual grounding is crucial for real-world applications like robotics, AR/VR, where detailed object geometry is needed for interactions beyond simple detection. ConcreteNet uses a grounding-by-selection approach: a 3D instance segmentation backbone generates candidates, followed by a verbo-visual fusion module that selects the target object based on the language input. Three novel modules are introduced: (1) Bottom-up Attentive Fusion (BAF) for disambiguating object relations using local attention, (2) Contrastive learning for better separation of instance embeddings, and (3) a Global Camera Token (GCT) to handle view-dependent descriptions. ConcreteNet outperforms state-of-the-art methods on the ScanRefer benchmark, achieving a significant +9.43% accuracy improvement at 50% IoU with test-time augmentation. The ablation study demonstrates that each proposed module (BAF, contrastive learning, GCT) contributes to the improved performance, especially for challenging repetitive instances. The paper provides evidence that using 3D instance segmentation for grounding yields more robust localization and tighter predictions compared to 3D object detection. The paper acknowledges the challenge of determining camera positions from unlabeled datasets for learning GCT in a fully unsupervised manner. Future work could explore extending ConcreteNet to handle more complex language and interactions in 3D scenes. visual grounding, vision-language fusion, 3d vision, contrastive learning, instance segmentation
2309.04430 Report Create Your World: Lifelong Text-to-Image Diffusion Gan Sun, Wenqi Liang, Jiahua Dong, Jun Li, Zhengming Ding, Yang Cong Text-to-image generative models can produce diverse high-quality images of concepts with a text prompt, which have demonstrated excellent ability in image generation, image translation, etc. We in this work study the problem of synthesizing instantiations of a user's own concepts in a never-ending manner, i.e., create your world, where the new concepts from the user are quickly learned with a few examples. To achieve this goal, we propose a Lifelong text-to-image Diffusion Model (L2DM), which intends to overcome knowledge "catastrophic forgetting" for previously encountered concepts, and semantic "catastrophic neglecting" for one or more concepts in the text prompt. With respect to knowledge "catastrophic forgetting", our L2DM framework devises a task-aware memory enhancement module and an elastic-concept distillation module, which could respectively safeguard the knowledge of both prior concepts and each past personalized concept. When generating images with a user text prompt, the solution to semantic "catastrophic neglecting" is that a concept attention artist module can alleviate the semantic neglecting from the concept aspect, and an orthogonal attention module can reduce the semantic binding from the attribute aspect. In the end, our model can generate more faithful images across a range of continual text prompts in terms of both qualitative and quantitative metrics, when compared with related state-of-the-art models. The code will be released at https://wenqiliang.github.io/. This paper proposes L$^2$DM, a lifelong text-to-image diffusion model that continually incorporates user-specific concepts while retaining prior knowledge. Existing text-to-image models struggle to efficiently learn new concepts without forgetting previously learned ones, limiting their ability to personalize to individual users' needs. L$^2$DM addresses catastrophic forgetting with a task-aware memory enhancement (TAME) module for prior knowledge and an elastic concept distillation (ECD) module for personalized knowledge. It tackles catastrophic neglecting during multi-concept generation using a concept attention artist (CAA) module for concept-neglecting and an orthogonal attention artist (OAA) module for attribute-neglecting. L$^2$DM outperforms state-of-the-art methods in lifelong single- and multi-concept generation, achieving higher text and image alignment. It demonstrates superior anti-forgetting ability, evidenced by lower task forgetting rates. The model is computationally efficient, requiring fewer network parameters compared to other methods. L$^2$DM still faces challenges in generating complex compositions, similar to existing diffusion models. Catastrophic neglecting remains an issue when composing four or more concepts. lifelong machine learning, stable diffusion, image generation, continual learning, text-to-image synthesis
2309.04410 Report DeformToon3D: Deformable 3D Toonification from Neural Radiance Fields Junzhe Zhang, Yushi Lan, Shuai Yang, Fangzhou Hong, Quan Wang, Chai Kiat Yeo, Ziwei Liu, Chen Change Loy In this paper, we address the challenging problem of 3D toonification, which involves transferring the style of an artistic domain onto a target 3D face with stylized geometry and texture. Although fine-tuning a pre-trained 3D GAN on the artistic domain can produce reasonable performance, this strategy has limitations in the 3D domain. In particular, fine-tuning can deteriorate the original GAN latent space, which affects subsequent semantic editing, and requires independent optimization and storage for each new style, limiting flexibility and efficient deployment. To overcome these challenges, we propose DeformToon3D, an effective toonification framework tailored for hierarchical 3D GAN. Our approach decomposes 3D toonification into subproblems of geometry and texture stylization to better preserve the original latent space. Specifically, we devise a novel StyleField that predicts conditional 3D deformation to align a real-space NeRF to the style space for geometry stylization. Thanks to the StyleField formulation, which already handles geometry stylization well, texture stylization can be achieved conveniently via adaptive style mixing that injects information of the artistic domain into the decoder of the pre-trained 3D GAN. Due to the unique design, our method enables flexible style degree control and shape-texture-specific style swap. Furthermore, we achieve efficient training without any real-world 2D-3D training pairs but proxy samples synthesized from off-the-shelf 2D toonification models. This paper proposes DeformToon3D, a novel 3D toonification framework that decomposes geometry and texture stylization, preserving the pre-trained GAN latent space for compatibility with existing editing and animation tools. 3D toonification is crucial for applications like avatar creation, but existing methods relying on fine-tuning pre-trained GANs suffer from limitations in preserving the original GAN latent space, efficient style control, and storage efficiency. DeformToon3D introduces a novel StyleField module to deform the real-space NeRF to the style space for geometry stylization. It further utilizes adaptive style mixing to inject artistic domain information into the decoder for texture stylization. The method is trained on synthetic paired data, eliminating the need for real-world 2D-3D pairs. DeformToon3D achieves high-quality geometry and texture toonification over diverse styles, outperforming baselines in identity preservation and FID. The method retains the original GAN latent space, enabling compatibility with inversion, editing, and animation techniques designed for the pre-trained GAN. DeformToon3D significantly reduces storage costs by up to 98.5% compared to fine-tuning-based methods, making it suitable for deployment on resource-constrained devices. The performance of DeformToon3D relies on the quality of paired training data. Styles with limited information cues might lead to noticeable artifacts. Future work could explore introducing re-lighting during training, incorporating vision-language models for flexible style control, and integrating 3D animation pipelines. toonification, 3d gan, style transfer, nerf, deformation field
2309.04399 Report MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask Yupeng Zhou, Daquan Zhou, Zuo-Liang Zhu, Yaxing Wang, Qibin Hou, Jiashi Feng Recent advancements in diffusion models have showcased their impressive capacity to generate visually striking images. Nevertheless, ensuring a close match between the generated image and the given prompt remains a persistent challenge. In this work, we identify that a crucial factor leading to the text-image mismatch issue is the inadequate cross-modality relation learning between the prompt and the output image. To better align the prompt and image content, we advance the cross-attention with an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features. This mechanism explicitly diminishes the ambiguity in semantic information embedding from the text encoder, leading to a boost of text-to-image consistency in the synthesized images. Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models. When applied to the latent diffusion models, our MaskDiffusion can significantly improve the text-to-image consistency with negligible computation overhead compared to the original diffusion models. This paper introduces a training-free method to enhance text-to-image consistency in diffusion models by adaptively masking cross-attention maps based on prompt embeddings. Existing text-to-image diffusion models often struggle to generate images that strictly adhere to the input text prompts, particularly when dealing with long or complex descriptions. The method identifies inadequate cross-modality relation learning as a root cause for inconsistency. It then proposes a conditional mask generation algorithm which analyzes the cross-attention maps and modifies them to ensure that the relevant objects and attributes are appropriately represented in the generated image. The method significantly improves text-to-image consistency without requiring any additional training data or significantly increasing computational overhead. Evaluation using CLIP score and a user study demonstrated superior performance over other state-of-the-art techniques. Ablation studies highlighted the importance of momentum-based attention map updating and selecting the appropriate feature resolution for mask application. The method's reliance on the CLIP text encoder can be limiting, as CLIP may not always accurately interpret complex sentences. Future work aims to address the ambiguity arising from CLIP's semantic understanding and further enhance the method's capability to handle intricate prompts. diffusion models, text-to-image synthesis, cross-attention, semantic consistency, training-free methods
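MaskDiffusion's mechanism is summarized above as an adaptive mask applied to cross-attention, conditioned on the attention maps and prompt embeddings. The snippet below is a deliberately simplified, attention-map-only variant intended just to show where such a mask would act (suppressing a token's weak spatial responses and re-normalizing); it omits the prompt-embedding conditioning and the momentum-based map updating mentioned in the summary, and the threshold rule is an assumption.

```python
import torch

def adaptive_cross_attention_mask(attn: torch.Tensor, ratio: float = 0.5) -> torch.Tensor:
    # attn: (heads, N, W) cross-attention weights from N spatial locations to W text tokens.
    token_max = attn.amax(dim=1, keepdim=True)                  # (heads, 1, W) per-token peak response
    mask = (attn >= ratio * token_max).float()                  # keep only sufficiently strong responses
    masked = attn * mask
    return masked / (masked.sum(dim=-1, keepdim=True) + 1e-8)   # re-normalize over text tokens
```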
2309.04354 Report Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts Erik Daxberger, Floris Weers, Bowen Zhang, Tom Gunter, Ruoming Pang, Marcin Eichner, Michael Emmersberger, Yinfei Yang, Alexander Toshev, Xianzhi Du Sparse Mixture-of-Experts models (MoEs) have recently gained popularity due to their ability to decouple model size from inference efficiency by only activating a small subset of the model parameters for any given input token. As such, sparse MoEs have enabled unprecedented scalability, resulting in tremendous successes across domains such as natural language processing and computer vision. In this work, we instead explore the use of sparse MoEs to scale-down Vision Transformers (ViTs) to make them more attractive for resource-constrained vision applications. To this end, we propose a simplified and mobile-friendly MoE design where entire images rather than individual patches are routed to the experts. We also propose a stable MoE training procedure that uses super-class information to guide the router. We empirically show that our sparse Mobile Vision MoEs (V-MoEs) can achieve a better trade-off between performance and efficiency than the corresponding dense ViTs. For example, for the ViT-Tiny model, our Mobile V-MoE outperforms its dense counterpart by 3.39% on ImageNet-1k. For an even smaller ViT variant with only 54M FLOPs inference cost, our MoE achieves an improvement of 4.66%. This paper introduces Mobile Vision MoEs (V-MoEs), a novel approach utilizing sparse Mixture-of-Experts (MoEs) to enhance the efficiency of Vision Transformers (ViTs) for resource-limited vision applications. This work addresses the limitations of traditional dense ViT models in resource-constrained environments by leveraging sparse MoEs to decouple model size from inference cost, thereby enhancing their suitability for mobile-friendly vision tasks. The paper proposes a simplified MoE design with per-image routing and a robust training strategy employing semantic super-class guidance for expert specialization. Experiments are conducted on ImageNet-1k to evaluate the performance and efficiency trade-offs. Mobile V-MoEs consistently outperform their dense ViT counterparts across various model sizes, demonstrating superior accuracy vs. FLOPs trade-off. Optimal performance is achieved with 10 experts and 2 MoE layers, indicating a balance between model capacity and routing efficiency. Per-image routing with semantic super-class guidance proves to be an effective strategy, outperforming end-to-end learned routing and random super-class baselines. Future work includes applying the MoE design to more mobile-friendly architectures like MobileNets and exploring its effectiveness in other vision tasks. Further investigation into on-device latency measurements for a comprehensive efficiency analysis. vision transformers, mixture-of-experts, mobile vision, resource-constrained devices, image classification
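Because the Mobile V-MoE design above is described concretely (route whole images rather than individual patches, with a small number of experts), here is a hedged sketch of a per-image-routed MoE MLP block. Super-class-guided router training and the exact ViT integration are not shown; `PerImageMoE` and its arguments are illustrative, and the per-sample Python loop is for clarity rather than efficiency.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerImageMoE(nn.Module):
    def __init__(self, dim: int, hidden: int, num_experts: int = 10, k: int = 1):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])
        self.k = k

    def forward(self, tokens: torch.Tensor, cls: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) patch tokens; cls: (B, dim) per-image summary (e.g. CLS) token.
        weights = F.softmax(self.router(cls), dim=-1)           # one routing decision per image
        topw, topi = weights.topk(self.k, dim=-1)               # (B, k) chosen experts per image
        out = torch.zeros_like(tokens)
        for b in range(tokens.size(0)):                         # all tokens of an image share its experts
            for w, e in zip(topw[b], topi[b]):
                out[b] = out[b] + w * self.experts[int(e)](tokens[b])
        return out
```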
2309.04109 Report From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models Changming Xiao, Qi Yang, Feng Zhou, Changshui Zhang Diffusion models have recently revolutionized the field of text-to-image generation. The unique way of fusing text and image information contributes to their remarkable capability of generating highly text-related images. From another perspective, these generative models imply clues about the precise correlation between words and pixels. In this work, a simple but effective method is proposed to utilize the attention mechanism in the denoising network of text-to-image diffusion models. Without re-training or inference-time optimization, the semantic grounding of phrases can be attained directly. We evaluate our method on Pascal VOC 2012 and Microsoft COCO 2014 under the weakly-supervised semantic segmentation setting and our method achieves superior performance to prior methods. In addition, the acquired word-pixel correlation is found to be generalizable for the learned text embedding of customized generation methods, requiring only a few modifications. To validate our discovery, we introduce a new practical task called "personalized referring image segmentation" with a new dataset. Experiments in various situations demonstrate the advantages of our method compared to strong baselines on this task. In summary, our work reveals a novel way to extract the rich multi-modal knowledge hidden in diffusion models for segmentation. This paper proposes a novel method for open-vocabulary segmentation that utilizes the attention mechanism in off-the-shelf text-to-image diffusion models without retraining. Existing open-vocabulary segmentation methods rely on discriminative models, which may not have a thorough understanding of images. This work explores the potential of generative diffusion models for segmentation, which are believed to have a better grasp of scene-level structure. The method leverages the cross-attention and self-attention mechanisms in diffusion models. It treats self-attention as the affinity matrix of different image patches and propagates cross-attention scores accordingly to capture both unary and pairwise potentials for localization. The method achieves state-of-the-art performance on weakly-supervised semantic segmentation benchmarks like PASCAL VOC 2012 and MS COCO 2014. A new benchmark for personalized referring image segmentation, Mug19, is introduced to evaluate the model's ability to locate user-specific items. The proposed method outperforms strong baselines on Mug19, demonstrating its superior multi-modal comprehension ability. The method exhibits limitations when dealing with semantically similar objects (Cohyponym Entanglement). The model's capability to handle affordance-related text queries is limited. open-vocabulary segmentation, diffusion models, attention mechanism, personalized referring image segmentation, multi-modal learning
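The "How" above is explicit: self-attention is treated as a patch-affinity matrix and used to propagate cross-attention scores. A compact sketch of that propagation on already-extracted attention maps follows; the map shapes, the number of propagation steps, and the normalization choices are assumptions for illustration, not the paper's exact recipe.

```python
import torch

def propagate_cross_attention(cross_attn: torch.Tensor,
                              self_attn: torch.Tensor,
                              num_steps: int = 2) -> torch.Tensor:
    # cross_attn: (N, W) averaged cross-attention from N image patches to W text tokens.
    # self_attn:  (N, N) averaged self-attention among the N patches, used as patch affinity.
    affinity = self_attn / self_attn.sum(dim=-1, keepdim=True)   # row-normalize the affinity matrix
    maps = cross_attn
    for _ in range(num_steps):
        maps = affinity @ maps                                   # pairwise propagation of word scores
    return maps / (maps.amax(dim=0, keepdim=True) + 1e-8)        # per-word max normalization
```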
2309.03904 Report Exploring Sparse MoE in GANs for Text-conditioned Image Synthesis Jiapeng Zhu, Ceyuan Yang, Kecheng Zheng, Yinghao Xu, Zifan Shi, Yujun Shen Due to the difficulty in scaling up, generative adversarial networks (GANs) seem to be falling from grace on the task of text-conditioned image synthesis. Sparsely-activated mixture-of-experts (MoE) has recently been demonstrated as a valid solution to training large-scale models with limited computational resources. Inspired by such a philosophy, we present Aurora, a GAN-based text-to-image generator that employs a collection of experts to learn feature processing, together with a sparse router to help select the most suitable expert for each feature point. To faithfully decode the sampling stochasticity and the text condition to the final synthesis, our router adaptively makes its decision by taking into account the text-integrated global latent code. At 64x64 image resolution, our model trained on LAION2B-en and COYO-700M achieves 6.2 zero-shot FID on MS COCO. We release the code and checkpoints to facilitate the community for further development. Presents Aurora, a GAN-based text-to-image generator employing sparse mixture-of-experts (MoE) to enhance model capacity for efficient text-conditioned image synthesis. Addresses the limitations of GANs in scaling up for text-to-image generation, a task where diffusion models have become dominant. It offers a fast inference alternative to iterative diffusion models. Employs a sparse MoE approach with a router considering both input features and text-integrated latent code for expert selection. It uses progressive training with a reference FID indicator for stable and efficient learning. Achieves 6.2 zero-shot FID on MS COCO at 64x64 resolution. Exhibits smooth semantic transitions during text prompt interpolation. Reveals unexpected behavior in latent space interpolation, challenging the common belief of semantic continuity in GAN latent spaces. Latent space interpolation results suggest a need to better disentangle text condition and sampling stochasticity effects. Current model trained at 64x64 resolution, future work to focus on directly generating higher-resolution images. generative adversarial networks, text-to-image synthesis, sparse mixture-of-experts, progressive training, latent space interpolation
2309.03903 Report Tracking Anything with Decoupled Video Segmentation Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander Schwing, Joon-Young Lee Training data for video segmentation are expensive to annotate. This impedes extensions of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. To 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task (which is cheaper to train) and a universal temporal propagation model which is trained once and generalizes across tasks. To effectively combine these two modules, we use bi-directional propagation for (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks including large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation. Code is available at: https://hkchengrex.github.io/Tracking-Anything-with-DEVA This paper proposes DEVA, a decoupled video segmentation approach that leverages task-specific image-level segmentation and class-agnostic bi-directional temporal propagation to 'track anything' in videos. Training data for video segmentation is expensive, hindering the extension of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. DEVA addresses this by leveraging external, task-agnostic data, enabling better generalization to tasks with limited annotations. DEVA decouples video segmentation into two modules: a task-specific image segmentation model and a universal, class-agnostic temporal propagation model. It employs bi-directional propagation, including in-clip consensus and merging of image and propagated segmentations, to ensure temporal consistency and incorporate new objects. DEVA outperforms state-of-the-art end-to-end methods on large-scale video panoptic segmentation (VIPSeg) and open-world video segmentation (BURST). It also achieves competitive results on referring video segmentation (Ref-DAVIS, Ref-YouTubeVOS) and unsupervised video object segmentation (DAVIS) without end-to-end training. The approach shows significant improvements when target domain training data is scarce, particularly for rare object categories. The temporal propagation model relies on the image segmentation model to detect new objects, leading to potential delays in detection. End-to-end approaches might still be preferable when sufficient training data is available, mainly in smaller vocabulary settings. video segmentation, temporal propagation, open-world learning, large-vocabulary segmentation, tracking-by-detection
2309.03897 Report ProPainter: Improving Propagation and Transformer for Video Inpainting Shangchen Zhou, Chongyi Li, Kelvin C. K. Chan, Chen Change Loy Flow-based propagation and spatiotemporal Transformer are two mainstream mechanisms in video inpainting (VI). Despite the effectiveness of these components, they still suffer from some limitations that affect their performance. Previous propagation-based approaches are performed separately either in the image or feature domain. Global image propagation isolated from learning may cause spatial misalignment due to inaccurate optical flow. Moreover, memory or computational constraints limit the temporal range of feature propagation and video Transformer, preventing exploration of correspondence information from distant frames. To address these issues, we propose an improved framework, called ProPainter, which involves enhanced ProPagation and an efficient Transformer. Specifically, we introduce dual-domain propagation that combines the advantages of image and feature warping, exploiting global correspondences reliably. We also propose a mask-guided sparse video Transformer, which achieves high efficiency by discarding unnecessary and redundant tokens. With these components, ProPainter outperforms prior arts by a large margin of 1.46 dB in PSNR while maintaining appealing efficiency. Proposes ProPainter, a novel video inpainting framework, featuring enhanced dual-domain propagation and a highly efficient mask-guided sparse video Transformer. Addresses limitations of existing flow-based propagation and spatiotemporal Transformer methods in video inpainting, aiming for higher quality and efficiency. Combines image and feature warping for reliable global propagation, employs a recurrent network for fast flow completion, and introduces a sparse Transformer that discards redundant tokens for efficiency. Achieves superior performance with a large margin of 1.46 dB in PSNR compared to state-of-the-art methods. Demonstrates significant efficiency gains, reducing memory consumption and running time. Shows particular effectiveness on datasets with dynamic scenes and larger motions. Performance improvement is less pronounced on datasets with predominantly static scenes. Further exploration of sparse attention mechanisms for higher resolutions and longer video sequences is promising for future work. video inpainting, dual-domain propagation, sparse video transformer, flow completion, deep learning
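ProPainter's image-domain propagation, as summarized above, is built on warping known pixels along completed optical flow. The helper below sketches the basic backward warp with `torch.nn.functional.grid_sample`; flow validity/occlusion checking, the feature-domain branch, and the recurrent flow completion are omitted, and the (dx, dy) flow convention is an assumption.

```python
import torch
import torch.nn.functional as F

def flow_warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    # frame: (B, C, H, W) source frame; flow: (B, 2, H, W) per-pixel (dx, dy) offsets
    # pointing from the target frame into the source frame (backward warping).
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(frame)        # (2, H, W) in (x, y) order
    coords = grid.unsqueeze(0) + flow                            # sampling locations in the source
    # Normalize to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    norm_grid = torch.stack((coords_x, coords_y), dim=-1)        # (B, H, W, 2)
    return F.grid_sample(frame, norm_grid, align_corners=True)
```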
2309.03895 Report InstructDiffusion: A Generalist Modeling Interface for Vision Tasks Zigang Geng, Binxin Yang, Tiankai Hang, Chen Li, Shuyang Gu, Ting Zhang, Jianmin Bao, Zheng Zhang, Han Hu, Dong Chen, Baining Guo We present InstructDiffusion, a unifying and generic framework for aligning computer vision tasks with human instructions. Unlike existing approaches that integrate prior knowledge and pre-define the output space (e.g., categories and coordinates) for each vision task, we cast diverse vision tasks into a human-intuitive image-manipulating process whose output space is a flexible and interactive pixel space. Concretely, the model is built upon the diffusion process and is trained to predict pixels according to user instructions, such as encircling the man's left shoulder in red or applying a blue mask to the left car. InstructDiffusion could handle a variety of vision tasks, including understanding tasks (such as segmentation and keypoint detection) and generative tasks (such as editing and enhancement). It even exhibits the ability to handle unseen tasks and outperforms prior methods on novel datasets. This represents a significant step towards a generalist modeling interface for vision tasks, advancing artificial general intelligence in the field of computer vision. Introduces InstructDiffusion, a generalist modeling interface for vision tasks, unifying them as image generation through human-intuitive instructions. Addresses the challenge of unifying diverse vision tasks with different output formats, methodologies, and continuous input/output spaces. Leverages DDPM to handle various vision tasks as instructional image editing, trained on a dataset covering keypoint detection, segmentation, image enhancement, and editing. Achieves good performance in individual vision tasks, outperforming other generalist models in keypoint detection and referring segmentation. Demonstrates enhanced generalization ability through joint training of multiple tasks. Exhibits AGI capabilities by handling unseen tasks like image detection and classification, performing well on novel datasets. Limited by the VAE model's information loss, impacting performance in tasks like image enhancement. Future work includes exploring better unified representations and incorporating self-supervised/unsupervised learning for improved generalization. generalist model, vision tasks, instruction following, image generation, diffusion models
2309.03893 Report DiffusionEngine: Diffusion Model is Scalable Data Engine for Object Detection Manlin Zhang, Jie Wu, Yuxi Ren, Ming Li, Jie Qin, Xuefeng Xiao, Wei Liu, Rui Wang, Min Zheng, Andy J. Ma Data is the cornerstone of deep learning. This paper reveals that the recently developed Diffusion Model is a scalable data engine for object detection. Existing methods for scaling up detection-oriented data often require manual collection or generative models to obtain target images, followed by data augmentation and labeling to produce training pairs, which are costly, complex, or lacking diversity. To address these issues, we present DiffusionEngine (DE), a data scaling-up engine that provides high-quality detection-oriented training pairs in a single stage. DE consists of a pre-trained diffusion model and an effective Detection-Adapter, contributing to generating scalable, diverse and generalizable detection data in a plug-and-play manner. Detection-Adapter is learned to align the implicit semantic and location knowledge in off-the-shelf diffusion models with detection-aware signals to make better bounding-box predictions. Additionally, we contribute two datasets, i.e., COCO-DE and VOC-DE, to scale up existing detection benchmarks for facilitating follow-up research. Extensive experiments demonstrate that data scaling-up via DE can achieve significant improvements in diverse scenarios, such as various detection algorithms, self-supervised pre-training, data-sparse, label-scarce, cross-domain, and semi-supervised learning. For example, when using DE with a DINO-based adapter to scale up data, mAP is improved by 3.1% on COCO, 7.6% on VOC, and 11.5% on Clipart. This paper introduces DiffusionEngine (DE), a one-stage data engine for object detection that leverages pre-trained diffusion models to generate high-quality, scalable, diverse, and generalizable detection training data. Large-scale, high-quality training data is crucial for object detection, but traditional data collection and existing augmentation methods are costly, complex, or lack diversity. DE addresses these limitations by efficiently generating diverse and scalable training pairs. DE consists of a frozen pre-trained diffusion model and a trainable Detection-Adapter. The adapter learns to align the implicit semantic and location knowledge within diffusion models with explicit detection signals. During training, DE simulates the last diffusion step on real images to learn from existing detection datasets. During inference, DE generates new images from text prompts and uses the adapter to directly predict bounding boxes. DE consistently improves the performance of various object detection algorithms, backbones, and pre-training strategies (including self-supervised) on COCO. DE outperforms state-of-the-art data scaling techniques like Copy-Paste and DALL-E for Detection on VOC and exhibits greater data scalability. DE effectively generalizes to out-of-domain scenarios, demonstrating significant improvements in cross-domain object detection (VOC to Clipart) and semi-supervised learning. Future work could explore creating an all-in-one model for various detection tasks using task-specific adapters. Integrating ChatGPT for prompt generation and leveraging RLHF for improved alignment and quality of detection pairs are promising directions. object detection, data augmentation, diffusion models, data scaling, synthetic data
2309.03809 Report SimNP: Learning Self-Similarity Priors Between Neural Points Christopher Wewer, Eddy Ilg, Bernt Schiele, Jan Eric Lenssen Existing neural field representations for 3D object reconstruction either (1) utilize object-level representations, but suffer from low-quality details due to conditioning on a global latent code, or (2) are able to perfectly reconstruct the observations, but fail to utilize object-level prior knowledge to infer unobserved regions. We present SimNP, a method to learn category-level self-similarities, which combines the advantages of both worlds by connecting neural point radiance fields with a category-level self-similarity representation. Our contribution is two-fold. (1) We design the first neural point representation on a category level by utilizing the concept of coherent point clouds. The resulting neural point radiance fields store a high level of detail for locally supported object regions. (2) We learn how information is shared between neural points in an unconstrained and unsupervised fashion, which allows the model to derive unobserved regions of an object during the reconstruction process from given observations. We show that SimNP is able to outperform previous methods in reconstructing symmetric unseen object regions, surpassing methods that build upon category-level or pixel-aligned radiance fields, while providing semantic correspondences between instances. SimNP, a novel neural point radiance field that learns category-level self-similarities for 3D object reconstruction. Existing methods either lack detail by relying on global representations or fail to generalize to unseen regions due to overfitting observations. SimNP addresses this by combining local detail with a learned prior of object self-similarities. SimNP connects a coherent neural point radiance field to learned embeddings via bipartite attention, encoding self-similarity. During training, this attention learns to connect similar points, allowing information transfer during inference when one side is unseen. Outperforms state-of-the-art in reconstructing unseen symmetric object parts from single and two views. Learns and leverages object symmetries for improved reconstruction. Provides a disentangled representation space enabling meaningful interpolation. Assumes a canonical space with ground-truth point clouds during training, limiting applicability to in-the-wild data. Point cloud prediction, while effective, could be further improved. neural radiance fields, 3d reconstruction, self-similarity, neural points, single-view reconstruction
2309.03729 Report Phasic Content Fusing Diffusion Model with Directional Distribution Consistency for Few-Shot Model Adaption Teng Hu, Jiangning Zhang, Liang Liu, Ran Yi, Siqi Kou, Haokun Zhu, Xu Chen, Yabiao Wang, Chengjie Wang, Lizhuang Ma Training a generative model with limited number of samples is a challenging task. Current methods primarily rely on few-shot model adaption to train the network. However, in scenarios where data is extremely limited (less than 10), the generative network tends to overfit and suffers from content degradation. To address these problems, we propose a novel phasic content fusing few-shot diffusion model with directional distribution consistency loss, which targets different learning objectives at distinct training stages of the diffusion model. Specifically, we design a phasic training strategy with phasic content fusion to help our model learn content and style information when t is large, and learn local details of target domain when t is small, leading to an improvement in the capture of content, style and local details. Furthermore, we introduce a novel directional distribution consistency loss that ensures the consistency between the generated and source distributions more efficiently and stably than the prior methods, preventing our model from overfitting. Finally, we propose a cross-domain structure guidance strategy that enhances structure consistency during domain adaptation. Theoretical analysis, qualitative and quantitative experiments demonstrate the superiority of our approach in few-shot generative model adaption tasks compared to state-of-the-art methods. The source code is available at: https://github.com/sjtuplayer/few-shot-diffusion. The paper proposes a novel few-shot diffusion model incorporating a phasic content fusing module and a directional distribution consistency loss. Training generative models on limited data often results in overfitting and content degradation. Existing few-shot methods suffer from these issues, especially with extremely limited data. The method introduces a phasic training strategy with content fusion to enhance content and style capture at different denoising stages. It also proposes a directional distribution consistency loss to ensure consistent structure and prevent distribution rotation during training. Lastly, a cross-domain structure guidance strategy is used to improve structure preservation during inference. The proposed model outperforms state-of-the-art few-shot generative models in content preservation and domain adaptation. The directional distribution consistency loss effectively maintains the structure of the generated distribution and avoids rotation during training. The iterative cross-domain structure guidance strategy enhances structure consistency in domain translation. The model requires careful tuning of hyperparameters, including the phasic factor and style enhancement factor. The study primarily focuses on image generation, and future work could explore its application to other data modalities. few-shot learning, diffusion models, generative models, domain adaptation, image generation
2309.03599 Report Chasing Consistency in Text-to-3D Generation from a Single Image Yichen Ouyang, Wenhao Chai, Jiayi Ye, Dapeng Tao, Yibing Zhan, Gaoang Wang Text-to-3D generation from a single-view image is a popular but challenging task in 3D vision. Although numerous methods have been proposed, existing works still suffer from the inconsistency issues, including 1) semantic inconsistency, 2) geometric inconsistency, and 3) saturation inconsistency, resulting in distorted, overfitted, and over-saturated generations. In light of the above issues, we present Consist3D, a three-stage framework Chasing for semantic-, geometric-, and saturation-Consistent Text-to-3D generation from a single image, in which the first two stages aim to learn parameterized consistency tokens, and the last stage is for optimization. Specifically, the semantic encoding stage learns a token independent of views and estimations, promoting semantic consistency and robustness. Meanwhile, the geometric encoding stage learns another token with comprehensive geometry and reconstruction constraints under novel-view estimations, reducing overfitting and encouraging geometric consistency. Finally, the optimization stage benefits from the semantic and geometric tokens, allowing a low classifier-free guidance scale and therefore preventing oversaturation. Experimental results demonstrate that Consist3D produces more consistent, faithful, and photo-realistic 3D assets compared to previous state-of-the-art methods. Furthermore, Consist3D also allows background and object editing through text prompts. Presents Consist3D, a novel three-stage framework for consistent text-to-3D generation from a single image, addressing semantic, geometric, and saturation inconsistencies. Existing text-to-3D methods struggle with inconsistencies, resulting in distorted, overfitted, and oversaturated 3D generations. Consist3D uses a semantic encoding stage to learn a view-independent token, a geometric encoding stage to learn a token with geometric and reconstruction constraints, and a low-scale score distillation sampling stage for optimization. Generates more consistent and photo-realistic 3D assets compared to previous methods. Enables background and object editing through text prompts. Achieves high-quality 3D generation with lower classifier-free guidance scales, resulting in more natural saturation. Struggles when point cloud estimation is inaccurate. Complex background prompts may result in low-detail background generation. text-to-3d generation, 3d vision, score distillation sampling, consistency, single image 3d reconstruction
2309.03550 Report Text2Control3D: Controllable 3D Avatar Generation in Neural Radiance Fields using Geometry-Guided Text-to-Image Diffusion Model Sungwon Hwang, Junha Hyung, Jaegul Choo Recent advances in diffusion models such as ControlNet have enabled geometrically controllable, high-fidelity text-to-image generation. However, none of them addresses the question of adding such controllability to text-to-3D generation. In response, we propose Text2Control3D, a controllable text-to-3D avatar generation method whose facial expression is controllable given a monocular video casually captured with hand-held camera. Our main strategy is to construct the 3D avatar in Neural Radiance Fields (NeRF) optimized with a set of controlled viewpoint-aware images that we generate from ControlNet, whose condition input is the depth map extracted from the input video. When generating the viewpoint-aware images, we utilize cross-reference attention to inject well-controlled, referential facial expression and appearance via cross attention. We also conduct low-pass filtering of Gaussian latent of the diffusion model in order to ameliorate the viewpoint-agnostic texture problem we observed from our empirical analysis, where the viewpoint-aware images contain identical textures on identical pixel positions that are incomprehensible in 3D. Finally, to train NeRF with the images that are viewpoint-aware yet are not strictly consistent in geometry, our approach considers per-image geometric variation as a view of deformation from a shared 3D canonical space. Consequently, we construct the 3D avatar in a canonical space of deformable NeRF by learning a set of per-image deformation via deformation field table. We demonstrate the empirical results and discuss the effectiveness of our method. Text2Control3D, the first controllable text-to-3D avatar generation method that leverages a monocular video for controlling facial expressions and shapes. No existing work addresses adding geometric controllability to text-to-3D generation, despite its importance for creating controllable and expressive avatars. The method uses a depth-conditional ControlNet to generate viewpoint-aware images with controlled expressions. It introduces cross-reference attention for consistent appearance and expression across viewpoints and employs low-pass filtering of the Gaussian latent to address texture-sticking issues. Finally, it reconstructs the 3D avatar in a deformable NeRF canonical space to handle geometric inconsistencies. Generates high-fidelity 3D avatars that reflect text descriptions and source video expressions. Outperforms baselines like DreamFusion and Instruct-NeRF2NeRF in user studies and quantitative metrics. Demonstrates the effectiveness of cross-reference attention and low-pass filtering in improving controllability and visual quality. Controllability is limited by the capabilities of the key-point conditional ControlNet, particularly for less common expressions. Future work can explore improving ControlNet's controllability and expanding the method to handle a wider range of expressions and geometric controls. text-to-3d generation, controllable avatar generation, neural radiance fields, diffusion models, controlnet
2309.03549 Report Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation Jiaxi Gu, Shicong Wang, Haoyu Zhao, Tianyi Lu, Xing Zhang, Zuxuan Wu, Songcen Xu, Wei Zhang, Yu-Gang Jiang, Hang Xu Inspired by the remarkable success of Latent Diffusion Models (LDMs) for image synthesis, we study LDM for text-to-video generation, which is a formidable challenge due to the computational and memory constraints during both model training and inference. A single LDM is usually only capable of generating a very limited number of video frames. Some existing works focus on separate prediction models for generating more video frames, which suffer from additional training cost and frame-level jittering, however. In this paper, we propose a framework called "Reuse and Diffuse" dubbed VidRD to produce more frames following the frames already generated by an LDM. Conditioned on an initial video clip with a small number of frames, additional frames are iteratively generated by reusing the original latent features and following the previous diffusion process. Besides, for the autoencoder used for translation between pixel space and latent space, we inject temporal layers into its decoder and fine-tune these layers for higher temporal consistency. We also propose a set of strategies for composing video-text data that involve diverse content from multiple existing datasets including video datasets for action recognition and image-text datasets. Extensive experiments show that our method achieves good results in both quantitative and qualitative evaluations. Our project page is available at https://anonymous0x233.github.io/ReuseAndDiffuse/. This paper introduces VidRD, a text-to-video generation framework that uses Latent Diffusion Models (LDMs) to iteratively generate smooth and coherent videos from text prompts. Current text-to-video generation methods struggle to produce long, high-quality videos with consistent content. VidRD addresses this by enabling the generation of longer, smoother videos from text. VidRD leverages a pre-trained image LDM and adapts it for video by incorporating temporal layers. It uses a 'reuse and diffuse' strategy to generate videos clip-by-clip, reusing latent features and imitating the diffusion process from previous clips. It also employs novel techniques like Frame-level Noise Reversion (FNR), Past-dependent Noise Sampling (PNS), and Denoising with Staged Guidance (DSG) to ensure temporal consistency. VidRD achieves state-of-the-art results on the UCF-101 benchmark for text-to-video generation, outperforming existing methods in terms of Frechet Video Distance (FVD) and Inception Score (IS). The paper demonstrates the effectiveness of using pseudo-videos created from image-text datasets to enhance temporal consistency in generated videos. Ablation studies validate the importance of the proposed FNR, PNS, and DSG techniques for generating smooth and coherent videos. The paper acknowledges that existing metrics for evaluating video generation models may not fully capture perceptual quality and can be inconsistent with human perception. Future work could explore techniques to further improve the diversity of generated video content and address potential issues like content cycling. text-to-video generation, latent diffusion models, video synthesis, temporal consistency, iterative generation
2309.03350 Report Relay Diffusion: Unifying diffusion process across resolutions for image synthesis Jiayan Teng, Wendi Zheng, Ming Ding, Wenyi Hong, Jianqiao Wangni, Zhuoyi Yang, Jie Tang Diffusion models achieved great success in image synthesis, but still face challenges in high-resolution generation. Through the lens of discrete cosine transformation, we find the main reason is that the same noise level on a higher resolution results in a higher Signal-to-Noise Ratio in the frequency domain. In this work, we present Relay Diffusion Model (RDM), which transfers a low-resolution image or noise into an equivalent high-resolution one for diffusion model via blurring diffusion and block noise. Therefore, the diffusion process can continue seamlessly in any new resolution or model without restarting from pure noise or low-resolution conditioning. RDM achieves state-of-the-art FID on CelebA-HQ and sFID on ImageNet 256×256, surpassing previous works such as ADM, LDM and DiT by a large margin. All the codes and checkpoints are open-sourced at https://github.com/THUDM/RelayDiffusion. This paper introduces Relay Diffusion Model (RDM), a novel cascaded diffusion framework that enhances high-resolution image generation by transferring low-resolution images or noise into equivalent high-resolution representations. High-resolution image generation with diffusion models is challenging due to limitations in training efficiency and noise schedule design. Existing methods, like cascaded models, while effective, still suffer from drawbacks like time-consuming training and distribution mismatch issues. RDM leverages block noise and patch-level blurring diffusion to connect different stages of image generation. It starts diffusion from the previous stage's output instead of pure noise, mitigating the need for low-resolution conditioning and reducing training steps. RDM achieves state-of-the-art FID on CelebA-HQ 256x256, outperforming existing methods like StyleSwin with significantly fewer training iterations. On ImageNet 256x256, RDM achieves state-of-the-art sFID and competitive FID results compared to advanced techniques like MDT-XL/2, even with less training data. Ablation studies confirm the efficacy of block noise, stochastic sampling, and reduced sampling steps in enhancing RDM's performance. While RDM shows promising results, exploring better noise schedules tailored to model size and data distribution remains a future direction. Investigating the impact of longer training and more granular classifier-free guidance strategies on RDM's FID scores, particularly for ImageNet, is another area for improvement. diffusion models, image generation, high-resolution, cascaded diffusion, block noise
2309.03185 Report Bayes' Rays: Uncertainty Quantification for Neural Radiance Fields Lily Goli, Cody Reading, Silvia Sellán, Alec Jacobson, Andrea Tagliasacchi Neural Radiance Fields (NeRFs) have shown promise in applications like view synthesis and depth estimation, but learning from multiview images faces inherent uncertainties. Current methods to quantify them are either heuristic or computationally demanding. We introduce BayesRays, a post-hoc framework to evaluate uncertainty in any pre-trained NeRF without modifying the training process. Our method establishes a volumetric uncertainty field using spatial perturbations and a Bayesian Laplace approximation. We derive our algorithm statistically and show its superior performance in key metrics and applications. Additional results available at: https://bayesrays.github.io. Introduces BayesRays, a post-hoc algorithm to estimate the spatial uncertainty of any pre-trained NeRF without modifying the training process. Quantifying uncertainty in NeRF is crucial for tasks like outlier detection and next-best-view planning, especially in critical applications like autonomous driving. Simulates spatially parametrized perturbations of the radiance field and uses a Bayesian Laplace approximation to produce a volumetric uncertainty field. Calculated uncertainties are statistically meaningful and outperform previous works on key metrics like correlation to reconstructed depth error. Provides a framework for applications like removing 'floater' artifacts from NeRF, matching or improving the state-of-the-art. Uncertainty field can be rendered as an additional color channel, enabling interactive artifact removal by thresholding. Discretization of the deformation field using a uniform grid can lead to high memory cost in regions of little geometric interest. Future work may explore more complex data structures. Only quantifies epistemic uncertainty and does not capture aleatoric uncertainty caused by noise or inconsistencies between views. Combining with existing frameworks for aleatoric quantification is a potential future direction. neural radiance fields, uncertainty quantification, laplace approximation, artifact removal, depth estimation
2309.03179 Report SLiMe: Segment Like Me Aliasghar Khani, Saeid Asgari Taghanaki, Aditya Sanghi, Ali Mahdavi Amiri, Ghassan Hamarneh Significant strides have been made using large vision-language models, like Stable Diffusion (SD), for a variety of downstream tasks, including image editing, image correspondence, and 3D shape generation. Inspired by these advancements, we explore leveraging these extensive vision-language models for segmenting images at any desired granularity using as few as one annotated sample by proposing SLiMe. SLiMe frames this problem as an optimization task. Specifically, given a single training image and its segmentation mask, we first extract attention maps, including our novel "weighted accumulated self-attention map" from the SD prior. Then, using the extracted attention maps, the text embeddings of Stable Diffusion are optimized such that, each of them, learn about a single segmented region from the training image. These learned embeddings then highlight the segmented region in the attention maps, which in turn can then be used to derive the segmentation map. This enables SLiMe to segment any real-world image during inference with the granularity of the segmented region in the training image, using just one example. Moreover, leveraging additional training data when available, i.e. few-shot, improves the performance of SLiMe. We carried out a knowledge-rich set of experiments examining various design factors and showed that SLiMe outperforms other existing one-shot and few-shot segmentation methods. This paper proposes SLiMe, a one-shot image segmentation method that leverages the semantic knowledge of Stable Diffusion (SD) to segment objects and parts at user-defined granularity levels. Current image segmentation methods often require extensive annotated data or training class-specific generative models. SLiMe addresses this by utilizing a single annotated image and the pre-trained SD model to perform accurate segmentation across various object categories and granularity levels. SLiMe frames the segmentation problem as a one-shot optimization task. It first extracts cross-attention and a novel weighted accumulated self-attention (WAS) map from SD. Then, it fine-tunes SD's text embeddings to highlight segmented regions within these attention maps, guided by a single reference image and its segmentation mask. During inference, the optimized text embeddings are used to segment unseen images, preserving the granularity of the reference segmentation. SLiMe outperforms existing one- and few-shot segmentation methods, including ReGAN and SegDDPM, on PASCAL-Part and CelebAMask-HQ datasets. The method demonstrates strong generalization capabilities, segmenting unseen object categories and handling occlusions effectively. Ablation studies confirm the importance of each component in SLiMe, including the WAS-attention map, loss functions, and parameter choices. SLiMe may struggle with segmenting tiny objects due to the lower resolution of the extracted attention maps. Future work includes addressing this limitation and extending SLiMe's applicability to 3D and video segmentation. image segmentation, one-shot learning, stable diffusion, attention mechanisms, few-shot learning
2309.03160 Report ResFields: Residual Neural Fields for Spatiotemporal Signals Marko Mihajlovic, Sergey Prokudin, Marc Pollefeys, Siyu Tang Neural fields, a category of neural networks trained to represent high-frequency signals, have gained significant attention in recent years due to their impressive performance in modeling complex 3D data, such as signed distance (SDFs) or radiance fields (NeRFs), via a single multi-layer perceptron (MLP). However, despite the power and simplicity of representing signals with an MLP, these methods still face challenges when modeling large and complex temporal signals due to the limited capacity of MLPs. In this paper, we propose an effective approach to address this limitation by incorporating temporal residual layers into neural fields, dubbed ResFields. It is a novel class of networks specifically designed to effectively represent complex temporal signals. We conduct a comprehensive analysis of the properties of ResFields and propose a matrix factorization technique to reduce the number of trainable parameters and enhance generalization capabilities. Importantly, our formulation seamlessly integrates with existing MLP-based neural fields and consistently improves results across various challenging tasks: 2D video approximation, dynamic shape modeling via temporal SDFs, and dynamic NeRF reconstruction. Lastly, we demonstrate the practical utility of ResFields by showcasing its effectiveness in capturing dynamic 3D scenes from sparse RGBD cameras of a lightweight capture system. Presents ResFields, a new method for increasing the capacity of neural fields when modeling complex temporal signals without increasing the underlying MLP size, thus maintaining efficient training and inference. Addresses the limitations of neural fields in representing long, complex temporal signals due to the limited capacity of MLPs, which can hinder applications in computer graphics, vision, and robotics. Introduces temporal residual layers that add time-dependent residuals to the MLP weights. These residuals are factorized using a low-rank representation to reduce parameters and enhance generalization. ResFields consistently improve performance across various tasks: 2D video approximation, temporal 3D shape modeling, dynamic radiance field reconstruction, and scene flow learning. Smaller MLPs with ResFields often outperform larger MLPs without them, leading to faster training and lower memory requirements. A low-rank factorization of residual weights improves generalization compared to alternatives like no factorization or hypernetworks. ResFields are less beneficial for ill-posed monocular reconstruction tasks where constraints, rather than capacity, are the bottleneck. Modeling very long or evolving signals may require chunking the sequence due to limitations in the shared weight matrix capacity. neural fields, temporal signals, residual networks, low-rank factorization, dynamic scene reconstruction
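A minimal sketch of what such a temporal residual layer could look like, assuming per-frame coefficients combined with a shared low-rank basis; the class name, rank, and initialization below are illustrative and not the authors' released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResFieldLinear(nn.Module):
    """Linear layer whose weight receives a time-dependent, low-rank residual (sketch)."""
    def __init__(self, in_dim, out_dim, num_frames, rank=10):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)
        # Low-rank factorization of the residual: per-frame coefficients x shared basis.
        self.coeff = nn.Parameter(torch.zeros(num_frames, rank))
        self.basis = nn.Parameter(torch.randn(rank, out_dim * in_dim) * 1e-4)
        self.out_dim, self.in_dim = out_dim, in_dim

    def forward(self, x, frame_idx):
        # x: (batch, in_dim); frame_idx: integer time index into the sequence.
        delta = (self.coeff[frame_idx] @ self.basis).view(self.out_dim, self.in_dim)
        return F.linear(x, self.base.weight + delta, self.base.bias)

layer = ResFieldLinear(in_dim=64, out_dim=64, num_frames=100, rank=10)
out = layer(torch.randn(8, 64), frame_idx=3)  # same per-query cost as a plain Linear
```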
2309.03110 Report Do We Still Need Non-Maximum Suppression? Accurate Confidence Estimates and Implicit Duplication Modeling with IoU-Aware Calibration Johannes Gilg, Torben Teepe, Fabian Herzog, Philipp Wolters, Gerhard Rigoll Object detectors are at the heart of many semi- and fully autonomous decision systems and are poised to become even more indispensable. They are, however, still lacking in accessibility and can sometimes produce unreliable predictions. Especially concerning in this regard are the -- essentially hand-crafted -- non-maximum suppression algorithms that lead to an obfuscated prediction process and biased confidence estimates. We show that we can eliminate classic NMS-style post-processing by using IoU-aware calibration. IoU-aware calibration is a conditional Beta calibration; this makes it parallelizable with no hyper-parameters. Instead of arbitrary cutoffs or discounts, it implicitly accounts for the likelihood of each detection being a duplicate and adjusts the confidence score accordingly, resulting in empirically based precision estimates for each detection. Our extensive experiments on diverse detection architectures show that the proposed IoU-aware calibration can successfully model duplicate detections and improve calibration. Compared to the standard sequential NMS and calibration approach, our joint modeling can deliver performance gains over the best NMS-based alternative while producing consistently better-calibrated confidence predictions with less complexity. The code for all our experiments is publicly available at https://github.com/Blueblue4/IoU-AwareCalibration. This paper proposes IoU-aware calibration, a method to replace Non-Maximum Suppression (NMS) in object detection by directly modeling the probability of duplicate detections within the confidence score. NMS, a crucial step in object detection pipelines, relies on hand-crafted algorithms and often leads to biased confidence estimates and obfuscated prediction processes. IoU-aware calibration aims to address these issues by providing a data-driven approach for accurate and reliable confidence estimates. The proposed method utilizes a conditional Beta calibration, conditioned on the minimum Jaccard distance to other detections, to implicitly model the likelihood of a detection being a duplicate. This approach eliminates the need for iterative NMS calculations and enables parallelized computation. IoU-aware calibration successfully models duplicate detections and improves calibration across diverse detection architectures. It achieves performance gains comparable to or exceeding those obtained by fine-tuned NMS, demonstrating its ability to implicitly capture duplicate likelihood. The method leads to significantly better-calibrated confidence predictions than NMS-based approaches, enabling more reliable probability estimates for object detection. Performance may degrade under significant distribution shifts between calibration and deployment data. Highly crowded scenes, unseen during calibration, may lead to under-confident predictions. object detection, non-maximum suppression, confidence calibration, deep learning, computer vision
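For intuition, a hedged sketch of a Beta-style calibrator that also sees the minimum Jaccard distance to neighboring detections; the exact feature construction and the use of scikit-learn's LogisticRegression are assumptions for illustration and may differ from the paper's conditional Beta calibration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def features(conf, min_jaccard_dist, eps=1e-6):
    # Beta-calibration features (log p, -log(1-p)) plus the duplicate-awareness condition:
    # the minimum Jaccard (1 - IoU) distance of each box to any other box in the image.
    p = np.clip(conf, eps, 1 - eps)
    return np.stack([np.log(p), -np.log(1 - p), min_jaccard_dist], axis=1)

def fit_calibrator(conf, min_jaccard_dist, is_true_positive):
    # is_true_positive: 1 if the detection matches a ground-truth object, else 0.
    return LogisticRegression().fit(features(conf, min_jaccard_dist), is_true_positive)

def calibrate(model, conf, min_jaccard_dist):
    # Calibrated score: an empirical precision estimate that already discounts duplicates,
    # so no separate NMS pass is needed afterwards.
    return model.predict_proba(features(conf, min_jaccard_dist))[:, 1]
```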
2309.02999 Report Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning Sijin Chen, Hongyuan Zhu, Mingsheng Li, Xin Chen, Peng Guo, Yinjie Lei, Gang Yu, Taihao Li, Tao Chen 3D dense captioning requires a model to translate its understanding of an input 3D scene into several captions associated with different object regions. Existing methods adopt a sophisticated "detect-then-describe" pipeline, which builds explicit relation modules upon a 3D detector with numerous hand-crafted components. While these methods have achieved initial success, the cascade pipeline tends to accumulate errors because of duplicated and inaccurate box estimations and messy 3D scenes. In this paper, we first propose Vote2Cap-DETR, a simple-yet-effective transformer framework that decouples the decoding process of caption generation and object localization through parallel decoding. Moreover, we argue that object localization and description generation require different levels of scene understanding, which could be challenging for a shared set of queries to capture. To this end, we propose an advanced version, Vote2Cap-DETR++, which decouples the queries into localization and caption queries to capture task-specific features. Additionally, we introduce the iterative spatial refinement strategy to vote queries for faster convergence and better localization performance. We also insert additional spatial information to the caption head for more accurate descriptions. Without bells and whistles, extensive experiments on two commonly used datasets, ScanRefer and Nr3D, demonstrate Vote2Cap-DETR and Vote2Cap-DETR++ surpass conventional "detect-then-describe" methods by a large margin. Codes will be made available at https://github.com/ch3cook-fdu/Vote2Cap-DETR. This paper proposes Vote2Cap-DETR and Vote2Cap-DETR++, two novel transformer-based frameworks for 3D dense captioning that decouple caption generation and object localization, unlike conventional "detect-then-describe" pipelines. Existing "detect-then-describe" methods suffer from error accumulation due to serial processing and rely heavily on hand-crafted components, limiting their performance in complex 3D scenes. The models utilize a transformer encoder-decoder architecture with vote queries for object localization and a dual-clued captioner for description generation. Vote2Cap-DETR++ further decouples queries for task-specific feature extraction, introduces iterative spatial refinement for queries, and injects 3D spatial information into the caption head. Vote2Cap-DETR and Vote2Cap-DETR++ significantly outperform previous state-of-the-art methods on ScanRefer and Nr3D datasets. Vote queries with iterative spatial refinement improve object localization accuracy and convergence speed. Injecting 3D spatial information into the caption head enhances the quality and informativeness of generated descriptions. Limited caption diversity due to small text annotations and beam search. Future work includes exploring multimodal pre-training and leveraging LLMs for improved caption diversity. 3d dense captioning, vision-language, transformers, vote queries, spatial refinement
2309.02773 Report Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter Jinglong Wang, Xiawei Li, Jing Zhang, Qingyuan Xu, Qin Zhou, Qian Yu, Lu Sheng, Dong Xu The pre-trained text-image discriminative models, such as CLIP, have been explored for open-vocabulary semantic segmentation with unsatisfactory results due to the loss of crucial localization information and awareness of object shapes. Recently, there has been a growing interest in expanding the application of generative models from generation tasks to semantic segmentation. These approaches utilize generative models either for generating annotated data or extracting features to facilitate semantic segmentation. This typically involves generating a considerable amount of synthetic data or requiring additional mask annotations. To this end, we uncover the potential of generative text-to-image diffusion models (e.g., Stable Diffusion) as highly efficient open-vocabulary semantic segmenters, and introduce a novel training-free approach named DiffSegmenter. The insight is that to generate realistic objects that are semantically faithful to the input text, both the complete object shapes and the corresponding semantics are implicitly learned by diffusion models. We discover that the object shapes are characterized by the self-attention maps while the semantics are indicated through the cross-attention maps produced by the denoising U-Net, forming the basis of our segmentation results. Additionally, we carefully design effective textual prompts and a category filtering mechanism to further enhance the segmentation results. Extensive experiments on three benchmark datasets show that the proposed DiffSegmenter achieves impressive results for open-vocabulary semantic segmentation. This paper introduces DiffSegmenter, a novel training-free method for open-vocabulary semantic segmentation leveraging off-the-shelf text-to-image diffusion models. Existing open-vocabulary segmentation methods relying on discriminative models often lose crucial localization information. This work explores the potential of generative diffusion models for this task. DiffSegmenter leverages the cross-attention maps from the denoising U-Net of a diffusion model as initial segmentation scores, further refined by the self-attention maps. Textual prompts are designed to enhance semantic understanding. DiffSegmenter achieves impressive results for open-vocabulary semantic segmentation in both zero-shot and weakly-supervised settings. It outperforms most existing methods on PASCAL VOC 2012, Pascal Context, and COCO-Object datasets. The method shows potential for downstream tasks like controllable image editing. The use of latent features in Stable Diffusion might lead to the disappearance of small objects. Future work can explore other text-to-image diffusion models or larger latent feature map sizes to address this. semantic segmentation, open-vocabulary, diffusion models, generative models, attention mechanisms
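A rough sketch of the attention-fusion step described above, assuming the cross-attention map for one class token and a row-normalized self-attention matrix have already been extracted from the denoising U-Net; shapes and normalization here are illustrative, not the paper's exact procedure:

```python
import numpy as np

def refine_with_self_attention(cross_attn, self_attn):
    # cross_attn: (H*W,) relevance of each spatial position to one class/text token.
    # self_attn:  (H*W, H*W) row-normalized pixel-to-pixel affinities (object shape cue).
    refined = self_attn @ cross_attn            # propagate class evidence along the shape
    refined = refined / (refined.max() + 1e-8)  # rescale to [0, 1] as a soft mask
    return refined
```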
2309.02401 Report Prototype-based Dataset Comparison Nanne van Noord Dataset summarisation is a fruitful approach to dataset inspection. However, when applied to a single dataset the discovery of visual concepts is restricted to those most prominent. We argue that a comparative approach can expand upon this paradigm to enable richer forms of dataset inspection that go beyond the most prominent concepts. To enable dataset comparison we present a module that learns concept-level prototypes across datasets. We leverage self-supervised learning to discover these prototypes without supervision, and we demonstrate the benefits of our approach in two case-studies. Our findings show that dataset comparison extends dataset inspection and we hope to encourage more works in this direction. Code and usage instructions are available at https://github.com/Nanne/ProtoSim. This paper introduces "dataset comparison" as a novel approach for inspecting datasets and proposes a method for learning concept-level prototypes across datasets called "ProtoSim." Dataset inspection is crucial for understanding dataset content, identifying potential biases, and ensuring alignment with usage goals, especially with the increasing size of image datasets used in computer vision. ProtoSim, a module integrated into a Vision Transformer (ViT), leverages self-supervised learning, specifically the DINO loss, to discover visual concepts in datasets without relying on class labels. ProtoSim successfully identifies both dataset-specific and shared prototypes, effectively distinguishing unique and common visual concepts across datasets. The comparative approach allows for a richer understanding of dataset content, revealing subtle differences and nuances not apparent from single-dataset analysis. Case studies on ImageNet/PASS and three artwork datasets demonstrate the efficacy of dataset comparison in revealing dataset biases and highlighting unique characteristics. Interpreting the meaning of learned prototypes requires manual inspection, although visualization of attention maps can aid in this process. The choice of a pre-trained backbone, particularly one trained on a dataset under comparison like ImageNet, might influence the learned prototypes and warrants further investigation. dataset comparison, prototype learning, self-supervised learning, vision transformer, dataset inspection
2309.02270 Report SAM-Deblur: Let Segment Anything Boost Image Deblurring Siwei Li, Mingxuan Liu, Yating Zhang, Shu Chen, Haoxiang Li, Zifei Dou, Hong Chen Image deblurring is a critical task in the field of image restoration, aiming to eliminate blurring artifacts. However, the challenge of addressing non-uniform blurring leads to an ill-posed problem, which limits the generalization performance of existing deblurring models. To solve the problem, we propose a framework SAM-Deblur, integrating prior knowledge from the Segment Anything Model (SAM) into the deblurring task for the first time. In particular, SAM-Deblur is divided into three stages. First, we preprocess the blurred images, obtain segment masks via SAM, and propose a mask dropout method for training to enhance model robustness. Then, to fully leverage the structural priors generated by SAM, we propose a Mask Average Pooling (MAP) unit specifically designed to average SAM-generated segmented areas, serving as a plug-and-play component which can be seamlessly integrated into existing deblurring networks. Finally, we feed the fused features generated by the MAP Unit into the deblurring model to obtain a sharp image. Experimental results on the RealBlurJ, ReloBlur, and REDS datasets reveal that incorporating our methods improves GoPro-trained NAFNet's PSNR by 0.05, 0.96, and 7.03, respectively. Project page is available at https://hplqaq.github.io/projects/sam-deblur (GitHub: HPLQAQ/SAM-Deblur). This paper introduces SAM-Deblur, a novel framework that integrates semantic priors from the Segment Anything Model (SAM) to enhance the performance of image deblurring, particularly in addressing the challenge of non-uniform blurring. Image deblurring models often struggle with generalizability, especially when dealing with non-uniform blurring in real-world scenarios. SAM-Deblur aims to improve the generalization performance of these models by incorporating semantic information. The SAM-Deblur framework consists of three primary stages: preprocessing of blurred images and mask generation using SAM, a novel Mask Average Pooling (MAP) unit to integrate SAM priors, and finally, feeding the fused features into a deblurring model (NAFNet). A mask dropout method is also used during training to enhance model robustness. SAM-Deblur significantly improves the deblurring performance on out-of-distribution datasets like RealBlurJ, REDS, and ReloBlur. The proposed method enhances PSNR on the tested datasets and reduces the Mode Collapse Rate (MCR), indicating better generalization ability. The introduced MAP unit proves to be more effective in leveraging SAM priors compared to previously used methods, leading to superior results. The reliance on a pre-trained SAM model introduces additional computational overhead. The performance of SAM-Deblur is contingent on the quality of masks generated by SAM, which can be influenced by factors like image quality and the scale of the SAM model. Further exploration of alternative architectural designs for the MAP unit and exploring the effectiveness of SAM-Deblur in conjunction with other state-of-the-art deblurring models. image deblurring, segment anything model, out-of-distribution generalization, mask average pooling, semantic priors
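A minimal sketch of a Mask Average Pooling operation consistent with the description above, assuming non-overlapping binary SAM masks; the function name and tensor layout are illustrative rather than the released implementation:

```python
import torch

def mask_average_pooling(features, masks):
    # features: (C, H, W) feature map; masks: (N, H, W) binary SAM segments,
    # assumed here to be non-overlapping for simplicity.
    pooled = torch.zeros_like(features)
    for m in masks:
        m = m.float()
        area = m.sum().clamp(min=1.0)
        segment_mean = (features * m).sum(dim=(1, 2), keepdim=True) / area  # (C, 1, 1)
        pooled = pooled + segment_mean * m  # broadcast the mean back onto the segment
    return pooled  # structure-prior features to fuse with the deblurring network input
```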
2309.02224 Report Dense Object Grounding in 3D Scenes Wencan Huang, Daizong Liu, Wei Hu Localizing objects in 3D scenes according to the semantics of a given natural language is a fundamental yet important task in the field of multimedia understanding, which benefits various real-world applications such as robotics and autonomous driving. However, the majority of existing 3D object grounding methods are restricted to a single-sentence input describing an individual object, which cannot comprehend and reason more contextualized descriptions of multiple objects in more practical 3D cases. To this end, we introduce a new challenging task, called 3D Dense Object Grounding (3D DOG), to jointly localize multiple objects described in a more complicated paragraph rather than a single sentence. Instead of naively localizing each sentence-guided object independently, we found that dense objects described in the same paragraph are often semantically related and spatially located in a focused region of the 3D scene. To explore such semantic and spatial relationships of densely referred objects for more accurate localization, we propose a novel Stacked Transformer based framework for 3D DOG, named 3DOGSFormer. Specifically, we first devise a contextual query-driven local transformer decoder to generate initial grounding proposals for each target object. Then, we employ a proposal-guided global transformer decoder that exploits the local object features to learn their correlation for further refining initial grounding proposals. Extensive experiments on three challenging benchmarks (Nr3D, Sr3D, and ScanRefer) show that our proposed 3DOGSFormer outperforms state-of-the-art 3D single-object grounding methods and their dense-object variants by significant margins. This paper introduces 3D Dense Object Grounding (3D DOG), a new task aiming to localize multiple objects described in a paragraph within a 3D scene, and proposes 3DOGSFormer, a novel Stacked Transformer-based framework to address this task. Existing 3D object grounding methods are limited to single-sentence inputs, failing to capture the contextual semantic and spatial relationships crucial for understanding and localizing multiple objects described in a paragraph. 3DOGSFormer employs a two-phase grounding pipeline. First, a contextual query-driven local transformer decoder generates initial grounding proposals for each sentence, leveraging semantic relations within the paragraph. Second, a proposal-guided global transformer decoder refines these proposals by capturing 3D spatial relations among objects. 3D DOG methods, including the proposed 3DOGSFormer, significantly outperform 3D single-object grounding methods adapted to the dense grounding setting. Jointly modeling multiple target objects within a paragraph through contextual query generation and global reasoning leads to substantial performance improvements in 3D DOG. 3DOGSFormer's proposal-guided global transformer decoder effectively captures 3D spatial relations among objects, further enhancing grounding accuracy. The performance of 3DOGSFormer is sensitive to the number of sentences in the input paragraph, with longer descriptions generally leading to better results. Further research is needed to explore more sophisticated methods for handling complex linguistic structures and long-range dependencies in paragraph descriptions. 3d dense object grounding, query-based proposal generation, global transformer, 3d vision and language, spatial relation understanding
2309.02186 Report AniPortraitGAN: Animatable 3D Portrait Generation from 2D Image Collections Yue Wu, Sicheng Xu, Jianfeng Xiang, Fangyun Wei, Qifeng Chen, Jiaolong Yang, Xin Tong Previous animatable 3D-aware GANs for human generation have primarily focused on either the human head or full body. However, head-only videos are relatively uncommon in real life, and full body generation typically does not deal with facial expression control and still has challenges in generating high-quality results. Towards applicable video avatars, we present an animatable 3D-aware GAN that generates portrait images with controllable facial expression, head pose, and shoulder movements. It is a generative model trained on unstructured 2D image collections without using 3D or video data. For the new task, we base our method on the generative radiance manifold representation and equip it with learnable facial and head-shoulder deformations. A dual-camera rendering and adversarial learning scheme is proposed to improve the quality of the generated faces, which is critical for portrait images. A pose deformation processing network is developed to generate plausible deformations for challenging regions such as long hair. Experiments show that our method, trained on unstructured 2D images, can generate diverse and high-quality 3D portraits with desired control over different properties. This paper introduces AniPortraitGAN, the first animatable 3D-aware GAN for generating portrait images with controllable facial expressions, head poses, and shoulder movements from 2D image collections, without relying on 3D or video data. This method addresses limitations of previous 3D-aware GANs that focused solely on either heads or full bodies, aiming to create realistic and controllable virtual human portraits for applications like video conferencing. The method utilizes a generative radiance manifold representation with learnable facial and head-shoulder deformations guided by 3DMM and SMPL models. A dual-camera rendering scheme with multiple discriminators enhances face generation quality. A pose deformation processing network ensures plausible deformations, especially for challenging areas like hair. AniPortraitGAN generates high-quality, diverse 3D portraits with control over facial expressions, head poses, and shoulder movements. The dual-camera rendering significantly improves face generation quality compared to single-camera approaches. The pose deformation processing module ensures plausible and smooth deformations, particularly for hair, addressing limitations of standard skinning weight assignment methods. The model exhibits limitations in generating poses and expressions outside the training data distribution. Control over attributes like eye gaze and environment lighting is absent in the current implementation. generative adversarial networks, 3d-aware image generation, animatable avatars, portrait generation, deep learning
2309.02119 Report Hierarchical Masked 3D Diffusion Model for Video Outpainting Fanda Fan, Chaoxu Guo, Litong Gong, Biao Wang, Tiezheng Ge, Yuning Jiang, Chunjie Luo, Jianfeng Zhan Video outpainting aims to adequately complete missing areas at the edges of video frames. Compared to image outpainting, it presents an additional challenge as the model should maintain the temporal consistency of the filled area. In this paper, we introduce a masked 3D diffusion model for video outpainting. We use the technique of mask modeling to train the 3D diffusion model. This allows us to use multiple guide frames to connect the results of multiple video clip inferences, thus ensuring temporal consistency and reducing jitter between adjacent frames. Meanwhile, we extract the global frames of the video as prompts and guide the model to obtain information other than the current video clip using cross-attention. We also introduce a hybrid coarse-to-fine inference pipeline to alleviate the artifact accumulation problem. The existing coarse-to-fine pipeline only uses the infilling strategy, which brings degradation because the time interval of the sparse frames is too large. Our pipeline benefits from bidirectional learning of the mask modeling and thus can employ a hybrid strategy of infilling and interpolation when generating sparse frames. Experiments show that our method achieves state-of-the-art results in video outpainting tasks. More results and code are provided at our project page: https://fanfanda.github.io/M3DDM/. This paper introduces a novel Masked 3D Diffusion Model (M3DDM) and a hybrid coarse-to-fine inference pipeline specifically designed for video outpainting. Video outpainting requires maintaining temporal consistency across frames, a challenge unmet by existing image outpainting techniques. This work addresses the limitations of previous video outpainting methods in handling long videos and complex motions. The M3DDM is trained with a mask modeling technique that uses guide frames to enhance temporal consistency and reduce jitter. Global video clips are integrated as prompts to provide global context. The hybrid coarse-to-fine pipeline leverages infilling and interpolation for long videos to minimize artifact accumulation. The M3DDM outperforms previous methods in generating high temporal consistency and visually plausible outpainting results on DAVIS, YouTube-VOS, and a 5M E-commerce dataset. The hybrid coarse-to-fine pipeline effectively mitigates artifact accumulation in long video outpainting. The use of guide frames and global video prompts significantly improves temporal consistency and content realism. The method's reliance on a fixed VAE encoder can lead to limitations in depicting fine structures like human faces. The model's sensitivity to initial Gaussian noise during sampling may cause edge blurring in some cases. Future work can focus on improving the robustness of text generation within videos and addressing the limitations of the VAE encoder. video outpainting, diffusion model, mask modeling, coarse-to-fine, temporal consistency
2309.02049 Report Diffusion-based 3D Object Detection with Random Boxes Xin Zhou, Jinghua Hou, Tingting Yao, Dingkang Liang, Zhe Liu, Zhikang Zou, Xiaoqing Ye, Jianwei Cheng, Xiang Bai 3D object detection is an essential task for achieving autonomous driving. Existing anchor-based detection methods rely on the empirical, heuristic setting of anchors, which makes the algorithms lack elegance. In recent years, we have witnessed the rise of several generative models, among which diffusion models show great potential for learning the transformation of two distributions. Our proposed Diff3Det migrates the diffusion model to proposal generation for 3D object detection by considering the detection boxes as generative targets. During training, the object boxes diffuse from the ground truth boxes to the Gaussian distribution, and the decoder learns to reverse this noise process. In the inference stage, the model progressively refines a set of random boxes to the prediction results. We provide detailed experiments on the KITTI benchmark and achieve promising performance compared to classical anchor-based 3D detection methods. This paper presents Diff3Det, a novel 3D object detection framework that leverages diffusion models for proposal generation, eliminating the need for pre-defined anchor boxes. Existing anchor-based 3D object detection methods rely on manually set anchors, lacking elegance and potentially hindering performance. This work explores the potential of diffusion models in 3D vision for more flexible and effective proposal generation. Diff3Det employs a diffusion-guided proposal generator that corrupts ground truth boxes with Gaussian noise during training. A 3D encoder extracts point cloud features, while a decoder learns to recover the original boxes from noisy ones. The inference involves progressively refining randomly generated boxes to predictions through a reverse diffusion process. Diff3Det achieves competitive performance compared to state-of-the-art anchor-based methods on the KITTI benchmark. The proposed size correlation and dynamic time step strategies for proposal refinement demonstrate significant performance improvement. Increasing sampling steps during inference further boosts performance, particularly for hard examples, highlighting the benefit of the iterative denoising process. The method suffers from slow convergence due to the difficulty of regressing from random boxes. Future work will focus on exploring fast-converging diffusion-based 3D object detection. 3d object detection, diffusion models, proposal generation, autonomous driving, point cloud processing
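A hedged sketch of the training-time box corruption such a diffusion-based proposal generator uses, following a generic DDPM forward process with a cosine schedule; the box parameterization, padding strategy, and signal-scaling factor below are assumptions for illustration, not the paper's exact recipe:

```python
import torch

def corrupt_boxes(gt_boxes, num_proposals=128, t=500, T=1000, signal_scale=2.0):
    # gt_boxes: (N, D) normalized box parameters in [0, 1], with N <= num_proposals.
    pad = torch.rand(num_proposals - gt_boxes.shape[0], gt_boxes.shape[1])
    x0 = (torch.cat([gt_boxes, pad], dim=0) * 2 - 1) * signal_scale   # map to [-s, s]
    alpha_bar = torch.cos(torch.tensor(t / T) * torch.pi / 2) ** 2    # cosine schedule
    noise = torch.randn_like(x0)
    x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise
    return x_t, noise  # noisy proposals fed to the decoder, which learns to recover x0
```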
2309.01958 Report Empowering Low-Light Image Enhancer through Customized Learnable Priors Naishan Zheng, Man Zhou, Yanmeng Dong, Xiangyu Rui, Jie Huang, Chongyi Li, Feng Zhao Deep neural networks have achieved remarkable progress in enhancing low-light images by improving their brightness and eliminating noise. However, most existing methods construct end-to-end mapping networks heuristically, neglecting the intrinsic prior of the image enhancement task and lacking transparency and interpretability. Although some unfolding solutions have been proposed to relieve these issues, they rely on proximal operator networks that deliver ambiguous and implicit priors. In this work, we propose a paradigm for low-light image enhancement that explores the potential of customized learnable priors to improve the transparency of the deep unfolding paradigm. Motivated by the powerful feature representation capability of Masked Autoencoder (MAE), we customize MAE-based illumination and noise priors and redevelop them from two perspectives: 1) structure flow: we train the MAE from a normal-light image to its illumination properties and then embed it into the proximal operator design of the unfolding architecture; and 2) optimization flow: we train MAE from a normal-light image to its gradient representation and then employ it as a regularization term to constrain noise in the model output. These designs improve the interpretability and representation capability of the model. Extensive experiments on multiple low-light image enhancement datasets demonstrate the superiority of our proposed paradigm over state-of-the-art methods. Code is available at https://github.com/zheng980629/CUE. This paper proposes a new deep unfolding paradigm, Customized Unfolding Enhancer (CUE), for low-light image enhancement, which leverages customized learnable priors for illumination and noise. Most existing deep learning methods for low-light image enhancement lack transparency and interpretability due to heuristically constructed networks. This paper addresses this by integrating learnable priors based on intrinsic image properties. The authors use a Masked Autoencoder (MAE) to learn illumination and noise priors. The illumination prior is trained to predict illumination maps filtered by a bilateral filter, and is embedded into the unfolding architecture. The noise prior learns gradient representations and is used as a regularization term to reduce noise. CUE outperforms state-of-the-art methods on LOL and Huawei datasets in terms of PSNR, SSIM, and NIQE. The learned illumination prior enhances the transparency of the unfolding architecture and improves visual quality. The noise prior effectively reduces noise in enhanced images and also demonstrates promising results for image denoising. The performance of CUE may be further improved by exploring more sophisticated prior designs. The computational cost of CUE is relatively high compared to some lightweight methods. low-light image enhancement, deep unfolding, learnable priors, masked autoencoder, image denoising
2309.01858 Report Towards Universal Image Embeddings: A Large-Scale Dataset and Challenge for Generic Image Representations Nikolaos-Antonios Ypsilantis, Kaifeng Chen, Bingyi Cao, Mário Lipovský, Pelin Dogan-Schönberger, Grzegorz Makosa, Boris Bluntschli, Mojtaba Seyedhosseini, Ondřej Chum, André Araujo Fine-grained and instance-level recognition methods are commonly trained and evaluated on specific domains, in a model per domain scenario. Such an approach, however, is impractical in real large-scale applications. In this work, we address the problem of universal image embedding, where a single universal model is trained and used in multiple domains. First, we leverage existing domain-specific datasets to carefully construct a new large-scale public benchmark for the evaluation of universal image embeddings, with 241k query images, 1.4M index images and 2.8M training images across 8 different domains and 349k classes. We define suitable metrics, training and evaluation protocols to foster future research in this area. Second, we provide a comprehensive experimental evaluation on the new dataset, demonstrating that existing approaches and simplistic extensions lead to worse performance than an assembly of models trained for each domain separately. Finally, we conducted a public research competition on this topic, leveraging industrial datasets, which attracted the participation of more than 1k teams worldwide. This exercise generated many interesting research ideas and findings which we present in detail. Project webpage: https://cmp.felk.cvut.cz/univ_emb/ This paper introduces the Universal Embedding Dataset (UnED) for training and evaluating universal image embeddings, which aim to discriminate fine-grained objects across multiple domains. Universal image embeddings are crucial for general-purpose visual search systems, as using domain-specific models is impractical. Previous research lacked a standard large-scale dataset for this purpose. The authors construct UnED from existing public datasets, encompassing 4.1M images, 349k classes, and 8 domains. They benchmark various pre-trained models and propose universal embedding training methods based on joint and separate classifiers with different sampling strategies. DINOv2 pretraining yields the best off-the-shelf performance for universal embedding. Direct extensions of specialist training to universal embedding show promising results, approaching specialist performance on some domains. The Google Universal Image Embedding Challenge revealed the effectiveness of image-text foundation models (like CLIP) for pre-training, multi-stage finetuning, and careful training data selection. The baseline universal embedding training methods are simplistic and don't exploit domain-specific knowledge. Future work can explore more advanced training techniques and architectures specifically designed for universal embedding. image embedding, universal representation learning, fine-grained recognition, image retrieval, benchmark dataset
2309.01770 Report StyleAdapter: A Single-Pass LoRA-Free Model for Stylized Image Generation Zhouxia Wang, Xintao Wang, Liangbin Xie, Zhongang Qi, Ying Shan, Wenping Wang, Ping Luo This paper presents a LoRA-free method for stylized image generation that takes a text prompt and style reference images as inputs and produces an output image in a single pass. Unlike existing methods that rely on training a separate LoRA for each style, our method can adapt to various styles with a unified model. However, this poses two challenges: 1) the prompt loses controllability over the generated content, and 2) the output image inherits both the semantic and style features of the style reference image, compromising its content fidelity. To address these challenges, we introduce StyleAdapter, a model that comprises two components: a two-path cross-attention module (TPCA) and three decoupling strategies. These components enable our model to process the prompt and style reference features separately and reduce the strong coupling between the semantic and style information in the style references. StyleAdapter can generate high-quality images that match the content of the prompts and adopt the style of the references (even for unseen styles) in a single pass, which is more flexible and efficient than previous methods. Experiments have been conducted to demonstrate the superiority of our method over previous works. This paper presents StyleAdapter, a LoRA-free method for generating stylized images in a single pass using text prompts and style reference images. Existing methods for stylized image generation either struggle to capture detailed style from text descriptions or require computationally expensive fine-tuning for each new style. StyleAdapter leverages a two-path cross-attention module (TPCA) to process prompt and style features independently and employs three decoupling strategies to separate semantic and style information in reference images. StyleAdapter generates high-quality images consistent with prompts and style references, even for unseen styles. It outperforms existing methods in balancing content fidelity and stylization, as demonstrated by qualitative and quantitative comparisons. StyleAdapter can be integrated with controllable synthesis methods, such as T2I-adapter, for enhanced control over image generation. StyleAdapter's stylization performance may not always match LoRA, which is specifically trained for each style. The reliance on pre-trained stable diffusion might lead to the generation of unethical content. stylized image generation, lora-free, text-to-image synthesis, two-path cross-attention, semantic and style decoupling
2309.01694 Report No Data Augmentation? Alternative Regularizations for Effective Training on Small Datasets Lorenzo Brigato, Stavroula Mougiakakou Solving image classification tasks given small training datasets remains an open challenge for modern computer vision. Aggressive data augmentation and generative models are among the most straightforward approaches to overcoming the lack of data. However, the first fails to be agnostic to varying image domains, while the latter requires additional compute and careful design. In this work, we study alternative regularization strategies to push the limits of supervised learning on small image classification datasets. In particular, along with the model size and training schedule scaling, we employ a heuristic to select (semi) optimal learning rate and weight decay couples via the norm of model parameters. By training on only 1% of the original CIFAR-10 training set (i.e., 50 images per class) and testing on ciFAIR-10, a variant of the original CIFAR without duplicated images, we reach a test accuracy of 66.5%, on par with the best state-of-the-art methods. This paper introduces a simple yet effective training methodology to enhance image classification accuracy when dealing with small training datasets, achieving results comparable to state-of-the-art methods that rely on extensive data augmentation or generative models. Improving data efficiency in deep learning is crucial, particularly in domains where data is scarce or expensive to collect, such as medicine. This method offers a practical and transferable alternative to complex data augmentation techniques. The methodology involves a combination of: - A heuristic for selecting optimal learning rate and weight decay pairs based on the norm of model parameters. - Removal of momentum from the optimizer. - Scaling of model size (specifically width). - Increasing training schedule length. A simple Wide ResNet-16-1 trained with this method achieves 66.5% accuracy on ciFAIR-10, matching the performance of state-of-the-art methods that use complex augmentation strategies. The proposed hyperparameter selection method based on parameter norm proves to be a reliable predictor of generalization performance. Increasing model size and training length further boosts performance, highlighting the importance of these factors in small data regimes. The experiments were conducted on a single dataset (ciFAIR-10), so further validation on diverse datasets is necessary. While the optimal hyperparameter combination remained consistent across different model sizes, further investigation is needed to understand this observation fully. image classification, small data, data efficiency, regularization, hyperparameter optimization
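The hyperparameter heuristic described above couples learning rate and weight decay through the norm of the trained parameters. The sketch below only illustrates the mechanics of such a search (a grid over couples, momentum-free SGD, recording the final parameter norm); the actual ranking rule applied to the recorded norms is the paper's and is not reproduced here, so treat `select_lr_wd` and its arguments as hypothetical placeholders.

```python
import itertools
import torch

def final_param_norm(model):
    """L2 norm of all trainable parameters after a (short) training run."""
    return torch.sqrt(sum((p.detach() ** 2).sum() for p in model.parameters())).item()

def select_lr_wd(build_model, train_fn, lrs, wds):
    """Grid-search (lr, wd) couples and record the resulting parameter norm.
    The paper uses the norm as a cheap proxy for generalization; only the
    bookkeeping is shown here, and the ranking criterion is left to the caller."""
    records = []
    for lr, wd in itertools.product(lrs, wds):
        model = build_model()
        # Momentum is removed, matching the training recipe summarized above.
        optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                    weight_decay=wd, momentum=0.0)
        train_fn(model, optimizer)   # user-supplied training loop
        records.append(((lr, wd), final_param_norm(model)))
    return records
```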
2309.01430 Report DAT++: Spatially Dynamic Vision Transformer with Deformable Attention Zhuofan Xia, Xuran Pan, Shiji Song, Li Erran Li, Gao Huang Transformers have shown superior performance on various vision tasks. Their large receptive field endows Transformer models with higher representation power than their CNN counterparts. Nevertheless, simply enlarging the receptive field also raises several concerns. On the one hand, using dense attention in ViT leads to excessive memory and computational cost, and features can be influenced by irrelevant parts that are beyond the regions of interest. On the other hand, the handcrafted attention adopted in PVT or Swin Transformer is data agnostic and may limit the ability to model long-range relations. To solve this dilemma, we propose a novel deformable multi-head attention module, where the positions of key and value pairs in self-attention are adaptively allocated in a data-dependent way. This flexible scheme enables the proposed deformable attention to dynamically focus on relevant regions while maintaining the representation power of global attention. On this basis, we present Deformable Attention Transformer (DAT), a general vision backbone that is efficient and effective for visual recognition. We further build an enhanced version, DAT++. Extensive experiments show that our DAT++ achieves state-of-the-art results on various visual recognition benchmarks, with 85.9% ImageNet accuracy, 54.5 and 47.0 MS-COCO instance segmentation mAP, and 51.5 ADE20K semantic segmentation mIoU. This paper proposes DAT++, a hierarchical Vision Transformer utilizing a novel deformable multi-head attention module (DMHA) for dynamic and data-dependent allocation of key and value pairs in self-attention. DAT++ addresses limitations of traditional Vision Transformers, such as dense attention leading to high computational cost and handcrafted sparse attention being data-agnostic, by enabling flexible and adaptive attention to relevant image regions. DMHA employs an offset generation network to predict offsets for reference points, guiding bilinear sampling of features from important image regions to form deformed keys and values for attention computation. This allows for data-dependent sparse attention with linear space complexity. DAT++ achieves 85.9% Top-1 accuracy on ImageNet image classification. It achieves 54.5 bbox mAP and 47.0 mask mAP on MS-COCO instance segmentation. It achieves 51.5 mIoU on ADE20K semantic segmentation, demonstrating state-of-the-art performance across diverse visual recognition tasks. The performance gain of incorporating DMHA in earlier stages of the model tends to saturate. The current implementation of DMHA does not leverage techniques like EMA, LayerScale, or layer-wise learning rate decay, which could potentially lead to further improvements. vision transformer, deformable attention, dynamic neural networks, image classification, object detection
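The heart of DMHA is an offset network that shifts a grid of reference points and a bilinear sampling step that gathers features at the shifted locations to serve as deformed keys and values. Below is a condensed, hypothetical sketch of that sampling step only (single group, no relative position bias or the other DAT++ refinements); the layer shapes and the tanh-based offset scaling are assumptions chosen for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSampling(nn.Module):
    """Sketch of deformable key/value sampling (single group, simplified)."""
    def __init__(self, dim, grid_stride=2, offset_range=2.0):
        super().__init__()
        self.offset_net = nn.Sequential(
            nn.Conv2d(dim, dim, 5, stride=grid_stride, padding=2, groups=dim),
            nn.GELU(),
            nn.Conv2d(dim, 2, 1),   # (dx, dy) offset per reference point
        )
        self.offset_range = offset_range

    def forward(self, x):
        B, C, H, W = x.shape
        offsets = self.offset_net(x)                       # (B, 2, Hg, Wg)
        offsets = torch.tanh(offsets) * (self.offset_range / max(H, W))
        _, _, Hg, Wg = offsets.shape
        # Uniform reference grid in normalized [-1, 1] coordinates.
        ys = torch.linspace(-1, 1, Hg, device=x.device)
        xs = torch.linspace(-1, 1, Wg, device=x.device)
        ref_y, ref_x = torch.meshgrid(ys, xs, indexing="ij")
        ref = torch.stack((ref_x, ref_y), dim=-1)          # (Hg, Wg, 2), (x, y) order
        grid = ref.unsqueeze(0) + offsets.permute(0, 2, 3, 1)
        # Bilinear sampling of features at the deformed locations.
        return F.grid_sample(x, grid, mode="bilinear", align_corners=True)

x = torch.randn(1, 64, 32, 32)
kv = DeformableSampling(dim=64)(x)   # deformed features used as keys/values downstream
```

The sampled tensor is then flattened and fed into a standard multi-head attention computation, which is where the linear space complexity claimed above comes from: only the down-sampled deformed points participate as keys and values.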
2309.01409 Report Implicit Neural Image Stitching Minsu Kim, Jaewon Lee, Byeonghun Lee, Sunghoon Im, Kyong Hwan Jin Existing frameworks for image stitching often provide visually reasonable stitchings. However, they suffer from blurry artifacts and disparities in illumination, depth level, etc. Although the recent learning-based stitchings relax such disparities, the required methods impose sacrifice of image qualities failing to capture high-frequency details for stitched images. To address the problem, we propose a novel approach, implicit Neural Image Stitching (NIS) that extends arbitrary-scale super-resolution. Our method estimates Fourier coefficients of images for quality-enhancing warps. Then, the suggested model blends color mismatches and misalignment in the latent space and decodes the features into RGB values of stitched images. Our experiments show that our approach achieves improvement in resolving the low-definition imaging of the previous deep image stitching with favorable accelerated image-enhancing methods. Our source code is available at https://github.com/minshu-kim/NIS. This paper proposes Neural Image Stitching (NIS), an implicit neural representation method for image stitching that enhances the resolution and quality of stitched images. Existing image stitching methods often result in blurry artifacts and struggle to handle disparities in illumination and parallax errors. This work aims to improve the quality of stitched images by leveraging implicit neural representations. NIS uses a two-stage training strategy: 1) learns high-frequency details through supervised learning on synthetic data, 2) learns to blend images and reduce artifacts by minimizing a photometric seam loss on real images. The model utilizes a neural warping module to extract detail-aware features, Fourier coefficients to represent high-frequency details, and a blender to merge features from multiple images. NIS outperforms traditional stitching methods (bilinear, bicubic) and a recent deep stitching method (UDIS) in terms of PSNR and SSIM on synthetic images. On real images, NIS with fine-tuning achieves better NIQE, PIQE, and BRISQUE scores compared to other methods, including feature-based and learning-based stitching. Ablation studies show the importance of Fourier features and the effectiveness of the two-stage training strategy. NIS currently exhibits higher computational cost for very high-resolution images compared to UDIS. Future work will focus on developing a fully end-to-end deep image stitching pipeline that integrates alignment and reconstruction. image stitching, implicit neural representation, super-resolution, image blending, fourier features
2309.01369 Report Exploring Limits of Diffusion-Synthetic Training with Weakly Supervised Semantic Segmentation Ryota Yoshihashi, Yuya Otsuka, Kenji Doi, Tomohiro Tanaka, Hirokatsu Kataoka The advance of generative models for images has inspired various training techniques for image recognition utilizing synthetic images. In semantic segmentation, one promising approach is extracting pseudo-masks from attention maps in text-to-image diffusion models, which enables real-image-and-annotation-free training. However, the pioneering training method using the diffusion-synthetic images and pseudo-masks, i.e., DiffuMask, has limitations in terms of mask quality, scalability, and ranges of applicable domains. To overcome these limitations, this work introduces three techniques for diffusion-synthetic semantic segmentation training. First, reliability-aware robust training, originally used in weakly supervised learning, helps segmentation with insufficient synthetic mask quality. Second, we introduce prompt augmentation, data augmentation applied to the prompt text set to scale up and diversify training images with limited text resources. Finally, LoRA-based adaptation of Stable Diffusion enables the transfer to a distant domain, e.g., auto-driving images. Experiments on PASCAL VOC, ImageNet-S, and Cityscapes show that our method effectively closes the gap between real and synthetic training in semantic segmentation. This paper proposes Attn2mask, a real-image-and-annotation-free semantic segmentation method that leverages diffusion models for synthetic training data generation. Attn2mask addresses limitations in previous diffusion-synthetic training methods by framing the problem as weakly supervised learning, enabling accurate segmentation from potentially inaccurate generated labels. The method generates training images and pseudo-masks using Stable Diffusion and its cross-attention maps. It then employs reliability-aware robust co-training to handle inaccuracies in the pseudo-masks. Additional techniques include prompt augmentation for data diversity and LoRA-based adaptation for domain transfer. Attn2mask achieves 62.2 mIoU on PASCAL VOC without using real images or annotations, outperforming prior diffusion-synthetic methods. It demonstrates competitive performance on ImageNet-S, showcasing scalability to larger datasets and more classes. LoRA-based adaptation significantly improves performance on Cityscapes, highlighting its effectiveness for domain transfer. The performance of Attn2mask, while impressive for real-image-free training, is still lower than fully supervised or real-image-based weakly supervised methods. The method relies on the quality and biases present in the Stable Diffusion model and its training data. diffusion model, semantic segmentation, weakly supervised learning, diffusion-synthetic training, domain adaptation
2309.01141 Report VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders Xuyang Liu, Siteng Huang, Yachen Kang, Honggang Chen, Donglin Wang Large-scale text-to-image diffusion models have shown impressive capabilities for generative tasks by leveraging strong vision-language alignment from pre-training. However, most vision-language discriminative tasks require extensive fine-tuning on carefully-labeled datasets to acquire such alignment, with great cost in time and computing resources. In this work, we explore directly applying a pre-trained generative diffusion model to the challenging discriminative task of visual grounding without any fine-tuning and additional training dataset. Specifically, we propose VGDiffZero, a simple yet effective zero-shot visual grounding framework based on text-to-image diffusion models. We also design a comprehensive region-scoring method considering both global and local contexts of each isolated proposal. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg show that VGDiffZero achieves strong performance on zero-shot visual grounding. Our code is available at https://github.com/xuyang-liu16/VGDiffZero. This paper presents VGDiffZero, a zero-shot visual grounding framework that leverages the pre-trained text-to-image diffusion models without fine-tuning. Fine-tuning vision-language models for discriminative tasks like visual grounding is expensive. This paper explores using pre-trained generative diffusion models for this task in a zero-shot setting. VGDiffZero uses a pre-trained diffusion model (Stable Diffusion) and proposes a region-scoring method. It injects noise into latent representations of object proposals and uses the denoising process to assess the alignment between the proposal and text query. VGDiffZero outperforms other zero-shot visual grounding baselines on RefCOCO, RefCOCO+, and RefCOCOg datasets. Considering both global and local contexts of object proposals improves performance. Larger-scale pre-trained diffusion models lead to better visual grounding accuracy. The performance improvement from using core expressions instead of full expressions is inconsistent across datasets. Future work can explore different proposal generation methods or more efficient diffusion model architectures. visual grounding, diffusion models, zero-shot learning, vision-language models, stable diffusion
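VGDiffZero's region scoring is an instance of the general trick of using a text-conditioned denoising error of a latent diffusion model as an alignment score. Below is a hedged sketch built from Hugging Face diffusers components; the model ID, the single fixed timestep, and the handling of proposal crops are illustrative assumptions, and the paper's full method additionally combines global-context and local-context variants of each isolated proposal.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")

@torch.no_grad()
def denoising_score(image_crop, text, timestep=200):
    """image_crop: (1, 3, H, W) tensor in [-1, 1] on cuda, resized for the VAE.
    A smaller noise-prediction error suggests better alignment between the
    cropped proposal and the referring expression (simplified scoring)."""
    tokens = pipe.tokenizer(text, padding="max_length",
                            max_length=pipe.tokenizer.model_max_length,
                            truncation=True, return_tensors="pt").input_ids.to("cuda")
    text_emb = pipe.text_encoder(tokens)[0]
    # Encode the crop into the latent space.
    latents = pipe.vae.encode(image_crop).latent_dist.sample() * pipe.vae.config.scaling_factor
    # Inject noise at a fixed timestep and ask the UNet to predict it.
    noise = torch.randn_like(latents)
    t = torch.tensor([timestep], device="cuda")
    noisy = pipe.scheduler.add_noise(latents, noise, t)
    pred = pipe.unet(noisy, t, encoder_hidden_states=text_emb).sample
    return -torch.nn.functional.mse_loss(pred, noise).item()

# Grounding then reduces to picking the proposal with the highest score, e.g.
# best = max(proposal_crops, key=lambda c: denoising_score(c, "the red mug on the left"))
```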
2309.00908 Report MagicProp: Diffusion-based Video Editing via Motion-aware Appearance Propagation Hanshu Yan, Jun Hao Liew, Long Mai, Shanchuan Lin, Jiashi Feng This paper addresses the issue of modifying the visual appearance of videos while preserving their motion. A novel framework, named MagicProp, is proposed, which disentangles the video editing process into two stages: appearance editing and motion-aware appearance propagation. In the first stage, MagicProp selects a single frame from the input video and applies image-editing techniques to modify the content and/or style of the frame. The flexibility of these techniques enables the editing of arbitrary regions within the frame. In the second stage, MagicProp employs the edited frame as an appearance reference and generates the remaining frames using an autoregressive rendering approach. To achieve this, a diffusion-based conditional generation model, called PropDPM, is developed, which synthesizes the target frame by conditioning on the reference appearance, the target motion, and its previous appearance. The autoregressive editing approach ensures temporal consistency in the resulting videos. Overall, MagicProp combines the flexibility of image-editing techniques with the superior temporal consistency of autoregressive modeling, enabling flexible editing of object types and aesthetic styles in arbitrary regions of input videos while maintaining good temporal consistency across frames. Extensive experiments in various video editing scenarios demonstrate the effectiveness of MagicProp. MagicProp, a novel two-stage framework, edits video appearances while preserving motion by first editing a reference frame and then propagating the appearance to other frames based on the original motion. Existing methods struggle to balance temporal consistency and editing flexibility. MagicProp addresses this by leveraging powerful image editing techniques and autoregressive modeling for flexible editing with high temporal consistency. MagicProp uses an image diffusion model (e.g., ControlNet) for appearance editing in the first stage. Then, a novel diffusion-based conditional generation model, PropDPM, synthesizes the target video frame-by-frame, conditioned on the reference appearance, target motion (depth map), and previous frame. MagicProp enables flexible editing of object types and aesthetic styles in arbitrary regions of input videos. The method maintains good temporal consistency across frames thanks to its autoregressive approach. Zero-Terminal-SNR noise schedule and a novel appearance adaptor help to alleviate error accumulation and color shifting. The current implementation may exhibit degradation in quality for videos longer than 30 frames due to error accumulation. Future work aims to improve MagicProp's capability to handle longer videos. video editing, appearance editing, motion preservation, diffusion models, temporal consistency
2309.00828 Report When 3D Bounding-Box Meets SAM: Point Cloud Instance Segmentation with Weak-and-Noisy Supervision Qingtao Yu, Heming Du, Chen Liu, Xin Yu Learning from bounding-box annotations has shown great potential in weakly-supervised 3D point cloud instance segmentation. However, we observed that existing methods would suffer severe performance degradation with perturbed bounding box annotations. To tackle this issue, we propose a complementary image prompt-induced weakly-supervised point cloud instance segmentation (CIP-WPIS) method. CIP-WPIS leverages pretrained knowledge embedded in the 2D foundation model SAM and 3D geometric prior to achieve accurate point-wise instance labels from the bounding box annotations. Specifically, CIP-WPIS first selects image views in which 3D candidate points of an instance are fully visible. Then, we generate complementary background and foreground prompts from projections to obtain SAM 2D instance mask predictions. According to these, we assign confidence values to points indicating the likelihood of points belonging to the instance. Furthermore, we utilize 3D geometric homogeneity provided by superpoints to decide the final instance label assignments. In this fashion, we achieve high-quality 3D point-wise instance labels. Extensive experiments on both ScanNet-v2 and S3DIS benchmarks demonstrate that our method is robust against noisy 3D bounding-box annotations and achieves state-of-the-art performance. This paper presents CIP-WPIS, a method for weakly-supervised 3D point cloud instance segmentation that is robust to noisy bounding box annotations. Existing methods struggle with performance degradation when bounding box annotations are inaccurate, which is common in real-world scenarios. This method leverages readily available noisy annotations to achieve accurate instance segmentation. CIP-WPIS utilizes a greedy algorithm to select image views where instance points are visible and generates complementary image prompts for the Segment Anything Model (SAM). It then uses SAM predictions and 3D geometric constraints from superpoints to refine point-wise instance labels. CIP-WPIS achieves state-of-the-art performance on ScanNet-v2 and S3DIS datasets, even with noisy bounding boxes. The method demonstrates strong robustness to increasing noise levels in bounding box annotations. Using complementary prompts and 3D geometric consistency significantly improves labeling accuracy compared to using noisy boxes directly. Labeling accuracy, while improved, still doesn't reach the level of human annotation. The greedy view selection, while balancing performance and computation, could be further optimized. 3d point cloud, instance segmentation, weakly supervised learning, noisy annotations, segment anything model (sam)
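The 2D side of this pipeline reduces to prompting SAM with foreground points projected from an instance's 3D candidate points and complementary background points projected from other instances. A minimal sketch with the official segment_anything API is shown below; the 3D-to-2D projection, the greedy view selection, and the superpoint-based label aggregation are assumed to happen elsewhere.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

def prompt_sam(image, fg_pixels, bg_pixels):
    """image: HWC uint8 RGB array for one selected view.
    fg_pixels: (N, 2) pixel coords projected from the instance's 3D points.
    bg_pixels: (M, 2) pixel coords projected from other instances, used as
    complementary negative prompts."""
    predictor.set_image(image)
    coords = np.concatenate([fg_pixels, bg_pixels], axis=0).astype(np.float32)
    labels = np.concatenate([np.ones(len(fg_pixels)), np.zeros(len(bg_pixels))])
    masks, scores, _ = predictor.predict(
        point_coords=coords, point_labels=labels, multimask_output=False)
    return masks[0], scores[0]   # 2D instance mask and its confidence
```

The returned 2D mask is what gets back-projected to assign per-point confidence values before superpoints enforce 3D geometric consistency.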
2309.00775 Report Contrastive Feature Masking Open-Vocabulary Vision Transformer Dahun Kim, Anelia Angelova, Weicheng Kuo We present Contrastive Feature Masking Vision Transformer (CFM-ViT) - an image-text pretraining methodology that achieves simultaneous learning of image- and region-level representation for open-vocabulary object detection (OVD). Our approach combines the masked autoencoder (MAE) objective into the contrastive learning objective to improve the representation for localization tasks. Unlike standard MAE, we perform reconstruction in the joint image-text embedding space, rather than the pixel space as is customary with the classical MAE method, which causes the model to better learn region-level semantics. Moreover, we introduce Positional Embedding Dropout (PED) to address scale variation between image-text pretraining and detection finetuning by randomly dropping out the positional embeddings during pretraining. PED improves detection performance and enables the use of a frozen ViT backbone as a region classifier, preventing the forgetting of open-vocabulary knowledge during detection finetuning. On the LVIS open-vocabulary detection benchmark, CFM-ViT achieves a state-of-the-art 33.9 APr, surpassing the best approach by 7.6 points and achieving better zero-shot detection transfer. Finally, CFM-ViT acquires strong image-level representation, outperforming the state of the art on 8 out of 12 metrics on zero-shot image-text retrieval benchmarks. Proposes Contrastive Feature Masking Vision Transformer (CFM-ViT), an image-text pretraining methodology for open-vocabulary object detection (OVD) by combining masked autoencoder objectives with contrastive learning to enhance object localization. Addresses the limitations of existing VLMs, which are primarily optimized for image-level tasks and lack adequate utilization of pixel- and region-level information crucial for OVD. Introduces Contrastive Feature Masking to predict masked image regions in the joint image-text embedding space and proposes Positional Embedding Dropout (PED) to enhance region-level representation learning and enable frozen ViT encoder usage during detection. Achieves state-of-the-art 33.9 APr on LVIS OVD benchmark, surpassing previous best by 7.6 points. Demonstrates competitive novel AP on COCO without pseudo labels or weak supervision, representing the first ViT-based approach on this benchmark. Exhibits strong zero-shot transfer capabilities, outperforming previous methods on Objects365, and surpassing state-of-the-art on 8 out of 12 image-text retrieval benchmarks. Potential overfitting on benchmarks with fewer training categories when using only vanilla detection losses. Future exploration of alternative techniques to mitigate overfitting on specific benchmarks. open-vocabulary object detection, vision transformer, contrastive learning, masked image modeling, positional embedding dropout
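Positional Embedding Dropout, as described, simply removes the positional embeddings for a random subset of samples during image-text pretraining so the backbone is less tied to the pretraining resolution. A minimal sketch follows; the per-sample granularity and the 0.5 drop rate are assumptions for illustration, not the paper's reported settings.

```python
import torch
import torch.nn as nn

class PatchEmbedWithPED(nn.Module):
    """ViT token stage with Positional Embedding Dropout (sketch)."""
    def __init__(self, num_patches, dim, ped_prob=0.5):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.ped_prob = ped_prob

    def forward(self, patch_tokens):
        if self.training:
            # Keep the positional embedding only for a random subset of samples.
            keep = (torch.rand(patch_tokens.shape[0], 1, 1,
                               device=patch_tokens.device) > self.ped_prob).float()
            return patch_tokens + keep * self.pos_embed
        return patch_tokens + self.pos_embed   # standard ViT behaviour at inference

embed = PatchEmbedWithPED(num_patches=196, dim=768)
out = embed(torch.randn(4, 196, 768))
```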
2309.00616 Report OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation Zhening Huang, Xiaoyang Wu, Xi Chen, Hengshuang Zhao, Lei Zhu, Joan Lasenby Current 3D open-vocabulary scene understanding methods mostly utilize well-aligned 2D images as the bridge to learn 3D features with language. However, applying these approaches becomes challenging in scenarios where 2D images are absent. In this work, we introduce a new pipeline, namely, OpenIns3D, which requires no 2D image inputs, for 3D open-vocabulary scene understanding at the instance level. The OpenIns3D framework employs a "Mask-Snap-Lookup" scheme. The "Mask" module learns class-agnostic mask proposals in 3D point clouds. The "Snap" module generates synthetic scene-level images at multiple scales and leverages 2D vision language models to extract interesting objects. The "Lookup" module searches through the outcomes of "Snap" with the help of Mask2Pixel maps, which contain the precise correspondence between 3D masks and synthetic images, to assign category names to the proposed masks. This 2D input-free and flexible approach achieves state-of-the-art results on a wide range of indoor and outdoor datasets by a large margin. Moreover, OpenIns3D allows for effortless switching of 2D detectors without re-training. When integrated with powerful 2D open-world models such as ODISE and GroundingDINO, excellent results were observed on open-vocabulary instance segmentation. When integrated with LLM-powered 2D models like LISA, it demonstrates a remarkable capacity to process highly complex text queries which require intricate reasoning and world knowledge. Project page: https://zheninghuang.github.io/OpenIns3D/ This paper presents OpenIns3D, a novel framework for 3D open-vocabulary instance understanding that operates solely on 3D point clouds, eliminating the need for aligned 2D images. Current 3D open-vocabulary scene understanding methods heavily rely on well-aligned 2D images, limiting their applicability in real-world scenarios where such images are often unavailable. OpenIns3D employs a "Mask-Snap-Lookup" scheme: 1) **Mask:** Learns class-agnostic mask proposals from point clouds. 2) **Snap:** Generates synthetic scene-level images and leverages 2D vision-language models to detect objects. 3) **Lookup:** Assigns category names to 3D masks by searching object detections in synthetic images using Mask2Pixel maps. Achieves state-of-the-art results on indoor (S3DIS, ScanNetv2) and outdoor (STPLS3D) datasets for open-vocabulary instance segmentation and object detection. Demonstrates robustness by not requiring retraining when switching between different 2D detectors. Exhibits strong capability to comprehend complex language queries, including those requiring reasoning and world knowledge, when integrated with LLM-powered 2D models like LISA. Reliance on ground truth instance masks for training the Mask Proposal Module. Limited performance in semantic segmentation due to prioritization of mask quality over completeness. 3d open-vocabulary learning, instance segmentation, object detection, point cloud understanding, vision-language models
2309.00615 Report Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, Pheng-Ann Heng We introduce Point-Bind, a 3D multi-modality model aligning point clouds with 2D image, language, audio, and video. Guided by ImageBind, we construct a joint embedding space between 3D and multi-modalities, enabling many promising applications, e.g., any-to-3D generation, 3D embedding arithmetic, and 3D open-world understanding. On top of this, we further present Point-LLM, the first 3D large language model (LLM) following 3D multi-modal instructions. By parameter-efficient fine-tuning techniques, Point-LLM injects the semantics of Point-Bind into pre-trained LLMs, e.g., LLaMA, which requires no 3D instruction data, but exhibits superior 3D and multi-modal question-answering capacity. We hope our work may cast a light on the community for extending 3D point clouds to multi-modality applications. Code is available at https://github.com/ZiyuGuo99/Point-Bind_Point-LLM. This paper introduces Point-Bind, a 3D multi-modality model aligning point clouds with 2D images, language, audio, and video, and Point-LLM, the first 3D large language model that understands and responds to 3D and multi-modal instructions. Extending 3D point clouds to multi-modal applications is crucial for expanding the applications of 3D vision, enabling more robust and diverse 3D understanding, generation, and interaction. Point-Bind leverages ImageBind's joint embedding space and contrastive learning to align 3D point clouds with other modalities. Point-LLM is built upon Point-Bind and LLaMA, fine-tuned with vision-language data and parameter-efficient techniques. Point-Bind enables several 3D multi-modal applications like any-to-3D generation, 3D embedding arithmetic, and achieves state-of-the-art performance in 3D zero-shot classification and cross-modal retrieval. Point-LLM successfully demonstrates the ability to understand 3D point clouds and respond to 3D-related instructions in both English and Chinese. Both Point-Bind and Point-LLM show strong data efficiency, requiring no 3D instruction data for training. Future work includes aligning multi-modality with more diverse 3D data like indoor and outdoor scenes. Exploring more complex 3D instruction following tasks is another potential direction. 3d vision, multi-modality learning, point cloud, large language model, instruction following
2309.00613 Report Iterative Multi-granular Image Editing using Diffusion Models K J Joseph, Prateksha Udhayanan, Tripti Shukla, Aishwarya Agarwal, Srikrishna Karanam, Koustava Goswami, Balaji Vasan Srinivasan Recent advances in text-guided image synthesis have dramatically changed how creative professionals generate artistic and aesthetically pleasing visual assets. To fully support such creative endeavors, the process should possess the ability to: 1) iteratively edit the generations and 2) control the spatial reach of desired changes (global, local or anything in between). We formalize this pragmatic problem setting as Iterative Multi-granular Editing. While there has been substantial progress with diffusion-based models for image synthesis and editing, they are all one shot (i.e., no iterative editing capabilities) and do not naturally yield multi-granular control (i.e., covering the full spectrum of local-to-global edits). To overcome these drawbacks, we propose EMILIE: Iterative Multi-granular Image Editor. EMILIE introduces a novel latent iteration strategy, which re-purposes a pre-trained diffusion model to facilitate iterative editing. This is complemented by a gradient control operation for multi-granular control. We introduce a new benchmark dataset to evaluate our newly proposed setting. We conduct exhaustive quantitative and qualitative evaluations against recent state-of-the-art approaches adapted to our task, to bring out the mettle of EMILIE. We hope our work will attract attention to this newly identified, pragmatic problem setting. This paper introduces EMILIE, a novel diffusion-model-based framework for iterative and multi-granular image editing. Existing diffusion-based image editing methods are one-shot and do not allow user control over the spatial extent of the edits, while creators often require iterative editing and multi-granular control (local to global edits). EMILIE leverages a novel latent iteration strategy that re-purposes a pre-trained diffusion model to facilitate iterative editing, and employs gradient control to enable multi-granular control. EMILIE effectively reduces artifact accumulation during iterative edits by operating in the latent space. Gradient control through masking allows for precise localization of edits. Quantitative and qualitative evaluation on the proposed IMIEBench and EditBench datasets demonstrates EMILIE's superiority over existing methods. EMILIE struggles with negative edit instructions (e.g., undoing previous edits). Future work includes exploring disentanglement of feature representations to address limitations with negative edits and improve consistency. image editing, diffusion models, iterative editing, multi-granular control, latent space
2309.00610 Report CityDreamer: Compositional Generative Model of Unbounded 3D Cities Haozhe Xie, Zhaoxi Chen, Fangzhou Hong, Ziwei Liu 3D city generation is a desirable yet challenging task, since humans are more sensitive to structural distortions in urban environments. Additionally, generating 3D cities is more complex than 3D natural scenes since buildings, as objects of the same class, exhibit a wider range of appearances compared to the relatively consistent appearance of objects like trees in natural scenes. To address these challenges, we propose \textbf{CityDreamer}, a compositional generative model designed specifically for unbounded 3D cities. Our key insight is that 3D city generation should be a composition of different types of neural fields: 1) various building instances, and 2) background stuff, such as roads and green lands. Specifically, we adopt the bird's eye view scene representation and employ a volumetric render for both instance-oriented and stuff-oriented neural fields. The generative hash grid and periodic positional embedding are tailored as scene parameterization to suit the distinct characteristics of building instances and background stuff. Furthermore, we contribute a suite of CityGen Datasets, including OSM and GoogleEarth, which comprises a vast amount of real-world city imagery to enhance the realism of the generated 3D cities both in their layouts and appearances. CityDreamer achieves state-of-the-art performance not only in generating realistic 3D cities but also in localized editing within the generated cities. Proposes CityDreamer, a compositional generative model for creating unbounded 3D cities, which separates the generation of building instances and background stuff (roads, green lands, etc.) to handle the diversity of building appearances. Addresses the challenge of 3D city generation, which is more complex than generating natural scenes due to the wide range of appearances exhibited by buildings. This has applications in urban planning, environmental simulations, and game development. Employs a bird's eye view (BEV) scene representation and volumetric rendering for both building instances and background stuff. Utilizes a generative hash grid for background and periodic positional encoding for buildings. Leverages CityGen Datasets, including OSM and GoogleEarth, for realistic city layouts and appearances. Achieves state-of-the-art performance in generating realistic 3D cities, as evidenced by FID, KID, depth error, and camera error metrics. Outperforms baselines in user studies assessing perceptual quality, 3D realism, and view consistency. Enables localized editing of building instances within the generated cities, including style and height modifications. Limited to modeling convex geometries due to the voxel-based representation. Individual building generation during inference leads to higher computational cost. 3d city generation, generative adversarial networks, neural radiance fields, unbounded scene generation, compositional modeling
2309.00398 Report VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation Xin Li, Wenqing Chu, Ye Wu, Weihang Yuan, Fanglong Liu, Qi Zhang, Fu Li, Haocheng Feng, Errui Ding, Jingdong Wang In this paper, we present VideoGen, a text-to-video generation approach, which can generate a high-definition video with high frame fidelity and strong temporal consistency using reference-guided latent diffusion. We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt, as a reference image to guide video generation. Then, we introduce an efficient cascaded latent diffusion module conditioned on both the reference image and the text prompt, for generating latent video representations, followed by a flow-based temporal upsampling step to improve the temporal resolution. Finally, we map latent video representations into a high-definition video through an enhanced video decoder. During training, we use the first frame of a ground-truth video as the reference image for training the cascaded latent diffusion module. The main characteristics of our approach include: the reference image generated by the text-to-image model improves the visual fidelity; using it as the condition makes the diffusion model focus more on learning the video dynamics; and the video decoder is trained over unlabeled video data, thus benefiting from high-quality easily available videos. VideoGen sets a new state-of-the-art in text-to-video generation in terms of both qualitative and quantitative evaluation. See https://videogen.github.io/VideoGen/ for more samples. VideoGen, a novel text-to-video generation approach that leverages a reference-guided latent diffusion model to produce high-definition videos with strong temporal consistency and fidelity. Text-to-video generation is a challenging task, requiring high visual quality, temporally consistent motion, and handling limited video-text pair datasets. Existing methods often struggle to balance these aspects. VideoGen utilizes a pre-trained text-to-image model to generate a reference image from the text prompt. This image guides a cascaded latent video diffusion model, conditioned on the text and reference, to generate latent video representations. Flow-based temporal upsampling enhances resolution, and a video decoder trained on unlabeled video data maps latent representations to the final video. Achieves state-of-the-art results on UCF-101 and MSR-VTT benchmarks, demonstrating superior video quality and text-video alignment. Significantly improves Inception Score (IS) compared to previous methods, indicating high quality and diversity in generated videos. User studies confirm that VideoGen generates videos with better visual quality and text alignment compared to Make-A-Video and Imagen Video. While achieving competitive results, further exploration is needed to improve Frechet Video Distance (FVD) for even better distribution alignment with real videos. Fine-tuning the text-to-image model specifically for video generation could further enhance content fidelity to the target domain. text-to-video generation, latent diffusion model, reference-guided synthesis, temporal consistency, high-definition video
2309.00339 Report Robust Point Cloud Processing through Positional Embedding Jianqiao Zheng, Xueqian Li, Sameera Ramasinghe, Simon Lucey End-to-end trained per-point embeddings are an essential ingredient of any state-of-the-art 3D point cloud processing such as detection or alignment. Methods like PointNet, or the more recent point cloud transformer -- and its variants -- all employ learned per-point embeddings. Despite impressive performance, such approaches are sensitive to out-of-distribution (OOD) noise and outliers. In this paper, we explore the role of an analytical per-point embedding based on the criterion of bandwidth. The concept of bandwidth enables us to draw connections with an alternate per-point embedding -- positional embedding, particularly random Fourier features. We present compelling robust results across downstream tasks such as point cloud classification and registration with several categories of OOD noise. This paper investigates the use of untrained, analytical positional embeddings (PE) as a more robust alternative to learned per-point embeddings (PPE) in 3D point cloud processing tasks. Learned PPEs in popular architectures like PointNet and PCT are sensitive to out-of-distribution (OOD) noise and outliers, leading to significant performance degradation in practical applications. The authors theoretically connect the concept of bandwidth and spatial locality to the variance of weights in PE, specifically random Fourier features (RFF). They empirically evaluate the robustness of PE-based PointNet and PCT variants on classification and registration tasks using ModelNet40 and ModelNet40-C datasets with various OOD corruptions. PE-based embeddings show comparable performance to learned PPEs on clean data. PE-based methods significantly outperform learned PPEs on various OOD corruptions, including noise and outliers. The bandwidth of PE can be easily tuned to control its robustness to different levels of noise. PE-based methods do not show clear advantages over learned PPEs for corruptions like density changes and transformations. Future work includes investigating more specialized PE functions and data normalization techniques to address these limitations. point cloud processing, positional embedding, out-of-distribution robustness, pointnet, point cloud transformer
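The analytical per-point embedding studied here is the standard random Fourier feature map, whose frequency variance is the bandwidth knob the paper connects to robustness. A minimal sketch is given below; the feature count and sigma values are illustrative choices, not the paper's tuned settings.

```python
import math
import torch

def random_fourier_features(points, num_features=256, sigma=1.0, seed=0):
    """Map (N, 3) point coordinates to a (N, 2*num_features) embedding.
    sigma controls the bandwidth: smaller sigma gives smoother, more
    noise-robust embeddings; larger sigma preserves higher-frequency detail."""
    g = torch.Generator().manual_seed(seed)
    B = torch.randn(points.shape[-1], num_features, generator=g) * sigma
    proj = 2 * math.pi * points @ B
    return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)

pts = torch.rand(1024, 3)            # a point cloud
emb = random_fourier_features(pts)   # untrained, analytical per-point embedding
```

Since the embedding is fixed rather than learned, swapping it into a PointNet- or PCT-style pipeline only replaces the first per-point layer, and the bandwidth can be retuned without retraining the frequencies themselves.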
2309.00107 Report Unsupervised evaluation of GAN sample quality: Introducing the TTJac Score Egor Sevriugov, Ivan Oseledets Evaluation metrics are essential for assessing the performance of generative models in image synthesis. However, existing metrics often involve high memory and time consumption as they compute the distance between generated samples and real data points. In our study, the new evaluation metric called the "TTJac score" is proposed to measure the fidelity of individual synthesized images in a data-free manner. The study first establishes a theoretical approach to directly evaluate the generated sample density. Then, a method incorporating feature extractors and discrete function approximation through tensor train is introduced to effectively assess the quality of generated samples. Furthermore, the study demonstrates that this new metric can be used to improve the fidelity-variability trade-off when applying the truncation trick. The experimental results of applying the proposed metric to StyleGAN 2 and StyleGAN 2 ADA models on FFHQ, AFHQ-Wild, LSUN-Cars, and LSUN-Horse datasets are presented. The code used in this research will be made publicly available online for the research community to access and utilize. This paper introduces 'TTJac score', a novel data-free metric for evaluating the fidelity of individual synthesized images generated by GANs. Existing image synthesis evaluation metrics suffer from high memory and time consumption due to their reliance on comparing generated samples with real data points. The TTJac score leverages the generator's Jacobian to directly calculate sample density. It uses feature extractors (VGG19) to reduce Jacobian size and tensor train decomposition for efficient approximation and inference time reduction. TTJac score effectively identifies low-quality images with artifacts comparable to data-dependent metrics like Realism score. It enables a better fidelity-variability trade-off when used for truncation trick, particularly for LSUN-Car dataset. Qualitative evaluation shows TTJac score's ability to detect visual artifacts and unrealistic elements in various domains (FFHQ, AFHQ-Wild, LSUN-Cars, LSUN-Horse). Error in metric approximation limits achieving maximum precision with minimal recall. Challenges arise in domains like AFHQ-Wild where GAN models already exhibit high precision. generative adversarial networks, image synthesis, evaluation metrics, tensor train decomposition, feature density
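The "direct evaluation of generated sample density" rests on the change-of-variables formula: with a Gaussian prior, the log-density of G(z) is log N(z; 0, I) minus half the log-determinant of J transposed times J, where J is the Jacobian of the generator (optionally composed with a feature extractor). The toy sketch below computes this exactly for a small MLP; TTJac's actual contribution of making this tractable with VGG19 features and a tensor-train approximation is not shown, and the generator here is a hypothetical stand-in.

```python
import torch
from torch.autograd.functional import jacobian

generator = torch.nn.Sequential(          # toy stand-in for a GAN generator
    torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 32))

def log_density_score(z):
    """Change-of-variables log-density (up to constants) of x = G(z):
    log N(z; 0, I) - 0.5 * logdet(J^T J)."""
    J = jacobian(generator, z)                 # (out_dim, in_dim) Jacobian at z
    _, logdet = torch.linalg.slogdet(J.T @ J)
    log_prior = -0.5 * (z ** 2).sum()
    return (log_prior - 0.5 * logdet).item()

z = torch.randn(16)
print(log_density_score(z))   # higher score => denser, typically higher-fidelity region
```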
2309.00096 Report AttrSeg: Open-Vocabulary Semantic Segmentation via Attribute Decomposition-Aggregation Chaofan Ma, Yuhuan Yang, Chen Ju, Fei Zhang, Ya Zhang, Yanfeng Wang Open-vocabulary semantic segmentation is a challenging task that requires segmenting novel object categories at inference time. Recent studies have explored vision-language pre-training to handle this task, but suffer from unrealistic assumptions in practical scenarios, i.e., low-quality textual category names. For example, this paradigm assumes that new textual categories will be accurately and completely provided, and exist in lexicons during pre-training. However, exceptions often happen when encountering ambiguity for brief or incomplete names, new words that are not present in the pre-trained lexicons, and difficult-to-describe categories for users. To address these issues, this work proposes a novel attribute decomposition-aggregation framework, AttrSeg, inspired by human cognition in understanding new concepts. Specifically, in the decomposition stage, we decouple class names into diverse attribute descriptions to complement semantic contexts from multiple perspectives. Two attribute construction strategies are designed: using large language models for common categories, and involving manual labeling for human-invented categories. In the aggregation stage, we group diverse attributes into an integrated global description, to form a discriminative classifier that distinguishes the target object from others. One hierarchical aggregation architecture is further proposed to achieve multi-level aggregations, leveraging the meticulously designed clustering module. The final results are obtained by computing the similarity between aggregated attributes and image embeddings. To evaluate the effectiveness, we annotate three types of datasets with attribute descriptions, and conduct extensive experiments and ablation studies. The results show the superior performance of attribute decomposition-aggregation. This paper proposes AttrSeg, an attribute decomposition-aggregation framework for open-vocabulary semantic segmentation to address the limitations of relying solely on potentially ambiguous or unfamiliar category names. Existing open-vocabulary segmentation methods struggle with ambiguous category names, new words (neologisms), and difficult-to-describe categories, limiting their real-world practicality. The framework decomposes class names into detailed attribute descriptions, generated by LLMs or manual annotation. These attributes are then hierarchically aggregated into a global representation, enabling segmentation based on similarity with image embeddings. AttrSeg outperforms state-of-the-art methods on PASCAL-5i, COCO-20i, PASCAL VOC, and PASCAL Context, even when using only attribute descriptions. The method demonstrates strong performance on the newly introduced 'Fantastic Beasts' dataset, specifically designed to test neologisms and unnameable categories. Ablation studies validate the effectiveness of the hierarchical aggregation strategy, the importance of each component, and the model's robustness to noisy attribute inputs. The reliance on CLIP's training data may introduce biases into the model's predictions. Attribute decomposition using LLMs can also potentially introduce biases if not carefully controlled. open-vocabulary semantic segmentation, attribute decomposition-aggregation, vision-language pre-training, hierarchical aggregation, fantastic beasts dataset
2309.00035 Report FACET: Fairness in Computer Vision Evaluation Benchmark Laura Gustafson, Chloe Rolland, Nikhila Ravi, Quentin Duval, Aaron Adcock, Cheng-Yang Fu, Melissa Hall, Candace Ross Computer vision models have known performance disparities across attributes such as gender and skin tone. This means during tasks such as classification and detection, model performance differs for certain classes based on the demographics of the people in the image. These disparities have been shown to exist, but until now there has not been a unified approach to measure these differences for common use-cases of computer vision models. We present a new benchmark named FACET (FAirness in Computer Vision EvaluaTion), a large, publicly available evaluation set of 32k images for some of the most common vision tasks - image classification, object detection and segmentation. For every image in FACET, we hired expert reviewers to manually annotate person-related attributes such as perceived skin tone and hair type, manually draw bounding boxes and label fine-grained person-related classes such as disk jockey or guitarist. In addition, we use FACET to benchmark state-of-the-art vision models and present a deeper understanding of potential performance disparities and challenges across sensitive demographic attributes. With the exhaustive annotations collected, we probe models using single demographics attributes as well as multiple attributes using an intersectional approach (e.g. hair color and perceived skin tone). Our results show that classification, detection, segmentation, and visual grounding models exhibit performance disparities across demographic attributes and intersections of attributes. These harms suggest that not all people represented in datasets receive fair and equitable treatment in these vision tasks. We hope current and future results using our benchmark will contribute to fairer, more robust vision models. FACET is available publicly at https://facet.metademolab.com/ FACET is a large-scale, publicly available fairness benchmark for evaluating bias in computer vision models across a variety of tasks. Existing fairness datasets lack exhaustive demographic annotations and often only support a limited number of vision tasks, making it difficult to thoroughly analyze model fairness. FACET comprises 30,000 images with annotations for 52 person-related classes and 17 attributes, including demographic attributes (gender, skin tone, age), physical presentation (hair type, accessories), and robustness factors (lighting, occlusion). Expert annotators from diverse geographical regions manually labeled the dataset. CLIP image classification model shows disparities in performance based on perceived gender, reflecting societal biases. Faster R-CNN object detection model demonstrates lower accuracy in detecting people with darker skin tones, especially in precise localization, and this issue is exacerbated for certain hair types. Mask R-CNN exhibits similar performance disparities across gender for both person detection and segmentation tasks, though disparities are slightly larger for detection. The use of perceived attributes instead of self-identified ones introduces potential for annotator bias. The discrete nature of labels for gender and age might lead to the erasure of certain identities. fairness, computer vision, benchmark, bias, demographic attributes
2308.16911 Report PointLLM: Empowering Large Language Models to Understand Point Clouds Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, Dahua Lin The unprecedented advancements in Large Language Models (LLMs) have shown a profound impact on natural language processing but are yet to fully embrace the realm of 3D understanding. This paper introduces PointLLM, a preliminary effort to fill this gap, enabling LLMs to understand point clouds and offering a new avenue beyond 2D visual data. PointLLM understands colored object point clouds with human instructions and generates contextually appropriate responses, illustrating its grasp of point clouds and common sense. Specifically, it leverages a point cloud encoder with a powerful LLM to effectively fuse geometric, appearance, and linguistic information. We collect a novel dataset comprising 660K simple and 70K complex point-text instruction pairs to enable a two-stage training strategy: aligning latent spaces and subsequently instruction-tuning the unified model. To rigorously evaluate the perceptual and generalization capabilities of PointLLM, we establish two benchmarks: Generative 3D Object Classification and 3D Object Captioning, assessed through three different methods, including human evaluation, GPT-4/ChatGPT evaluation, and traditional metrics. Experimental results reveal PointLLM's superior performance over existing 2D and 3D baselines, with a notable achievement in human-evaluated object captioning tasks where it surpasses human annotators in over 50% of the samples. Codes, datasets, and benchmarks are available at https://github.com/OpenRobotLab/PointLLM . Introduces PointLLM, a multi-modal large language model capable of understanding colored point clouds of objects, addressing the lack of LLM integration with 3D data. Enables LLMs to move beyond 2D visual data and understand 3D structures, paving the way for applications like interactive 3D content creation and robot manipulation through natural language. Leverages a point cloud encoder and a pre-trained LLM, trained in two stages: aligning latent spaces using a novel point-text instruction dataset, and instruction-tuning the unified model for complex instruction understanding. Outperforms 2D and 3D baselines in generative 3D object classification on ModelNet40 and Objaverse datasets. Achieves superior performance in 3D object captioning, surpassing human annotators in over 50% of cases in human evaluation. Demonstrates accurate understanding of object details, including those often obscured by occlusion in 2D images. Further improvement in reducing hallucination rates to match human-level precision. Exploration of more efficient, point cloud-specific fusion mechanisms for MLLMs. large language models, point cloud understanding, 3d object recognition, multi-modal learning, generative ai
2308.16909 Report StyleInV: A Temporal Style Modulated Inversion Network for Unconditional Video Generation Yuhan Wang, Liming Jiang, Chen Change Loy Unconditional video generation is a challenging task that involves synthesizing high-quality videos that are both coherent and of extended duration. To address this challenge, researchers have used pretrained StyleGAN image generators for high-quality frame synthesis and focused on motion generator design. The motion generator is trained in an autoregressive manner using heavy 3D convolutional discriminators to ensure motion coherence during video generation. In this paper, we introduce a novel motion generator design that uses a learning-based inversion network for GAN. The encoder in our method captures rich and smooth priors from encoding images to latents, and given the latent of an initially generated frame as guidance, our method can generate smooth future latent by modulating the inversion encoder temporally. Our method enjoys the advantage of sparse training and naturally constrains the generation space of our motion generator with the inversion network guided by the initial frame, eliminating the need for heavy discriminators. Moreover, our method supports style transfer with simple fine-tuning when the encoder is paired with a pretrained StyleGAN generator. Extensive experiments conducted on various benchmarks demonstrate the superiority of our method in generating long and high-resolution videos with decent single-frame quality and temporal consistency. This paper introduces StyleInV, a novel motion generator for unconditional video generation that leverages a learning-based inversion network for GANs to generate temporally coherent motion latent codes. Unconditional video generation is challenging, especially in synthesizing high-resolution videos with long durations and coherent motion. Existing methods often struggle with motion collapse or require computationally heavy discriminators. StyleInV utilizes a GAN inversion network modulated by temporal style codes. It employs a first-frame-aware acyclic positional encoding (FFA-APE) and first-frame-aware sparse training (FFA-ST) to ensure smooth motion and accurate initial frame reconstruction. StyleInV generates high-quality, long-duration videos with superior identity preservation on human-face datasets compared to baselines like MoCoGAN-HD, DIGAN, StyleGAN-V, and Long-Video-GAN. The method supports fine-tuning-based style transfer by leveraging the pretrained StyleGAN generator, allowing for easy adaptation to different artistic styles. StyleInV enables initial-frame conditioned generation, allowing users to generate videos with specific starting content. The model's performance on datasets with global motions (e.g., SkyTimelapse) needs improvement. The two-stage training process, although efficient for hyperparameter tuning, is more computationally expensive than single-stage methods like StyleGAN-V. video generation, gan inversion, motion generation, style transfer, temporal consistency
2308.16880 Report Text2Scene: Text-driven Indoor Scene Stylization with Part-aware Details Inwoo Hwang, Hyeonwoo Kim, Young Min Kim We propose Text2Scene, a method to automatically create realistic textures for virtual scenes composed of multiple objects. Guided by a reference image and text descriptions, our pipeline adds detailed texture on labeled 3D geometries in the room such that the generated colors respect the hierarchical structure or semantic parts that are often composed of similar materials. Instead of applying flat stylization on the entire scene at a single step, we obtain weak semantic cues from geometric segmentation, which are further clarified by assigning initial colors to segmented parts. Then we add texture details for individual objects such that their projections on image space exhibit feature embedding aligned with the embedding of the input. The decomposition makes the entire pipeline tractable to a moderate amount of computation resources and memory. As our framework utilizes the existing resources of image and text embedding, it does not require dedicated datasets with high-quality textures designed by skillful artists. To the best of our knowledge, it is the first practical and scalable approach that can create detailed and realistic textures of the desired style that maintain structural context for scenes with multiple objects. Text2Scene: a novel method for automatically generating realistic textures for virtual 3D scenes composed of multiple objects, guided by a reference image and text descriptions. Creating realistic virtual scenes is crucial for various applications but current methods are either manual and not scalable or lack detail and realism. Text2Scene addresses this by automating texture creation while respecting object part boundaries and style consistency. Text2Scene uses a coarse-to-fine strategy: 1) Retrieves texture for walls, ceilings, and floors from a material library. 2) Decomposes objects into parts based on geometric features and texture similarity. 3) Assigns base colors to parts, optimizing for global scene harmony. 4) Adds detailed texture to individual objects using local neural style fields, respecting part boundaries and guided by text descriptions. Generates realistic textures with clear part boundaries, outperforming baselines in user studies. Successfully discovers part segments for diverse objects without explicit part labels. Enables efficient scene stylization, allowing object manipulation and diverse outputs from various text prompts and target images. Current pipeline handles objects separately after base color assignment, potentially limiting holistic scene understanding. Requires class labels or text descriptions per object, which could be further automated. texture synthesis, 3d scene stylization, text-to-3d, part discovery, neural style fields
2308.16825 Report Coarse-to-Fine Amodal Segmentation with Shape Prior Jianxiong Gao, Xuelin Qian, Yikai Wang, Tianjun Xiao, Tong He, Zheng Zhang, Yanwei Fu Amodal object segmentation is a challenging task that involves segmenting both visible and occluded parts of an object. In this paper, we propose a novel approach, called Coarse-to-Fine Segmentation (C2F-Seg), that addresses this problem by progressively modeling the amodal segmentation. C2F-Seg initially reduces the learning space from the pixel-level image space to the vector-quantized latent space. This enables us to better handle long-range dependencies and learn a coarse-grained amodal segment from visual features and visible segments. However, this latent space lacks detailed information about the object, which makes it difficult to provide a precise segmentation directly. To address this issue, we propose a convolution refine module to inject fine-grained information and provide a more precise amodal object segmentation based on visual features and coarse-predicted segmentation. To help the studies of amodal object segmentation, we create a synthetic amodal dataset, named as MOViD-Amodal (MOViD-A), which can be used for both image and video amodal object segmentation. We extensively evaluate our model on two benchmark datasets: KINS and COCO-A. Our empirical results demonstrate the superiority of C2F-Seg. Moreover, we exhibit the potential of our approach for video amodal object segmentation tasks on FISHBOWL and our proposed MOViD-A. Project page at: http://jianxgao.github.io/C2F-Seg. Proposes C2F-Seg, a coarse-to-fine framework for amodal segmentation that leverages shape priors learned in a latent space via transformers and refines them with a convolutional module. Amodal segmentation is challenging due to the ill-posed nature of predicting occluded regions. Shape priors can help but need to be refined with visual details. C2F-Seg first generates a coarse amodal mask from a latent representation of the visible mask and image features using a mask-and-predict transformer. This mask is then refined with a convolutional module guided by an attention mechanism derived from the coarse mask and image features. C2F-Seg achieves state-of-the-art performance on KINS and COCOA image amodal segmentation benchmarks. It also excels in video amodal segmentation, surpassing baselines on FISHBOWL and the newly proposed MOViD-A dataset. Ablation studies demonstrate the effectiveness of the convolutional refinement, attention mechanism, and iterative inference process. Reliance on pre-detected visible masks limits efficiency in multi-object scenes, aiming to explore single-point input or end-to-end detection integration in future work. Handling heavily occluded objects remains challenging; future efforts will focus on leveraging spatio-temporal priors and enforcing inter-frame consistency. amodal segmentation, shape prior, transformers, convolutional refinement, video amodal segmentation
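As a rough illustration of the coarse-to-fine idea described above, the snippet below upsamples a coarse amodal mask predicted in a reduced latent space and refines it with a small convolutional module that uses the coarse mask as a spatial attention over image features. Module names and channel sizes are placeholders, not the paper's architecture.

```python
# Minimal sketch of a coarse-to-fine refinement step in the spirit of C2F-Seg:
# a coarse mask from a latent-space predictor is refined by a convolutional
# module guided by image features. Dims and module names are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvRefiner(nn.Module):
    def __init__(self, feat_ch=64):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv2d(feat_ch + 1, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, 1, 1))

    def forward(self, coarse_mask_lowres, image_feats):
        # coarse_mask_lowres: (B, 1, h, w) predicted in a reduced/latent space
        # image_feats:        (B, C, H, W) fine-grained visual features
        coarse_up = F.interpolate(coarse_mask_lowres, size=image_feats.shape[-2:],
                                  mode='bilinear', align_corners=False)
        # the coarse mask acts as a spatial prior ("attention") over the features
        attended = image_feats * torch.sigmoid(coarse_up)
        refined_logits = self.mix(torch.cat([attended, coarse_up], dim=1))
        return torch.sigmoid(refined_logits)  # fine amodal mask

refiner = ConvRefiner()
mask = refiner(torch.randn(1, 1, 16, 16), torch.randn(1, 64, 64, 64))
print(mask.shape)  # torch.Size([1, 1, 64, 64])
```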
2308.16758 Report Towards High-Fidelity Text-Guided 3D Face Generation and Manipulation Using only Images Cuican Yu, Guansong Lu, Yihan Zeng, Jian Sun, Xiaodan Liang, Huibin Li, Zongben Xu, Songcen Xu, Wei Zhang, Hang Xu Generating 3D faces from textual descriptions has a multitude of applications, such as gaming, movies, and robotics. Recent progress has demonstrated the success of unconditional 3D face generation and text-to-3D shape generation. However, due to the limited text-3D face data pairs, text-driven 3D face generation remains an open problem. In this paper, we propose a text-guided 3D face generation method, referred to as TG-3DFace, for generating realistic 3D faces using text guidance. Specifically, we adopt an unconditional 3D face generation framework and equip it with text conditions, which learns the text-guided 3D face generation with only text-2D face data. On top of that, we propose two text-to-face cross-modal alignment techniques, including global contrastive learning and a fine-grained alignment module, to facilitate high semantic consistency between generated 3D faces and input texts. Besides, we present directional classifier guidance during the inference process, which encourages creativity for out-of-domain generations. Compared to existing methods, TG-3DFace creates more realistic and aesthetically pleasing 3D faces, boosting multi-view consistency (MVIC) by 9% over Latent3D. The rendered face images generated by TG-3DFace achieve higher FID and CLIP scores than text-to-2D face/image generation models, demonstrating our superiority in generating realistic and semantically consistent textures. This paper introduces TG-3DFace, a novel framework for generating high-fidelity 3D faces from textual descriptions using only text-2D face image pairs. Text-guided 3D face generation is highly sought after in fields like gaming and film, but is challenging due to limited text-3D face data and the need for semantic alignment between generated faces and text. TG-3DFace employs a text-conditional 3D GAN trained on text-2D face images and leverages two key techniques: global text-to-face contrastive learning for semantic consistency and fine-grained text-to-face alignment for detailed attribute control. Additionally, directional classifier guidance is used during inference to enable out-of-domain generation. TG-3DFace generates more realistic and aesthetically pleasing 3D faces, outperforming baseline Latent3D in multi-view consistency by 9%. Rendered face images from TG-3DFace achieve higher FID and CLIP scores compared to existing text-to-2D face generation models, demonstrating superior realism and semantic consistency. TG-3DFace has applications in single-view 3D face reconstruction and text-guided 3D face manipulation. Current limitations include the inability to infer identity from text, occasional asymmetry in generated faces, and limited racial diversity. Future work aims to address these limitations, improve shape quality, and enhance racial representation in the generated faces. 3d face generation, text-to-3d, text-guided synthesis, generative adversarial networks, cross-modal alignment
2308.16689 Report ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation Weihan Wang, Zhen Yang, Bin Xu, Juanzi Li, Yankui Sun Vision-language pre-training (VLP) methods have been blossoming recently, and their crucial goal is to jointly learn visual and textual features via a transformer-based architecture, demonstrating promising improvements on a variety of vision-language tasks. Prior work usually focuses on how to align visual and textual features, but strategies for improving model robustness and speeding up model convergence are left insufficiently explored. In this paper, we propose a novel method, ViLTA, comprising two components to further help the model learn fine-grained representations among image-text pairs. For Masked Language Modeling (MLM), we propose a cross-distillation method to generate soft labels to enhance the robustness of the model, which alleviates the problem of treating synonyms of masked words as negative samples in one-hot labels. For Image-Text Matching (ITM), we leverage the current language encoder to synthesize hard negatives based on the context of the language input, encouraging the model to learn high-quality representations by increasing the difficulty of the ITM task. By leveraging the above techniques, our ViLTA can achieve better performance on various vision-language tasks. Extensive experiments on benchmark datasets demonstrate the effectiveness of ViLTA and its promising potential for vision-language pre-training. This paper introduces ViLTA, a novel vision-language pre-training method that enhances model representation ability through textual augmentation. Existing VLP models suffer from limitations in MLM robustness due to one-hot labels and sub-optimal negative sample selection in ITM. ViLTA employs two key components: (1) cross-distillation for MLM using a frozen language encoder to generate soft labels and (2) synthetic hard negative generation for ITM based on the current language encoder. ViLTA achieves state-of-the-art performance on various vision-language tasks, including VQA, visual reasoning, and image captioning. Cross-distillation in MLM improves model robustness and learning efficiency. Synthetic hard negatives for ITM enhance model convergence and downstream performance. The study mainly focuses on MLM and ITM, potentially limiting performance gains in retrieval tasks compared to models incorporating ITC. Future work can explore the impact of large-scale pre-training on ViLTA's performance and investigate advanced negative sampling techniques. vision-language pre-training, textual augmentation, knowledge distillation, hard negative mining, multimodal learning
2308.16632 Report 3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation Changli Wu, Yiwei Ma, Qi Chen, Haowei Wang, Gen Luo, Jiayi Ji, Xiaoshuai Sun In 3D Referring Expression Segmentation (3D-RES), the earlier approach adopts a two-stage paradigm, extracting segmentation proposals and then matching them with referring expressions. However, this conventional paradigm encounters significant challenges, most notably in terms of the generation of lackluster initial proposals and a pronounced deceleration in inference speed. Recognizing these limitations, we introduce an innovative end-to-end Superpoint-Text Matching Network (3D-STMN) that is enriched by dependency-driven insights. One of the keystones of our model is the Superpoint-Text Matching (STM) mechanism. Unlike traditional methods that navigate through instance proposals, STM directly correlates linguistic indications with their respective superpoints, clusters of semantically related points. This architectural decision empowers our model to efficiently harness cross-modal semantic relationships, primarily leveraging densely annotated superpoint-text pairs, as opposed to the sparser instance-text pairs. In pursuit of enhancing the role of text in guiding the segmentation process, we further incorporate the Dependency-Driven Interaction (DDI) module to deepen the network's semantic comprehension of referring expressions. Using the dependency trees as a beacon, this module discerns the intricate relationships between primary terms and their associated descriptors in expressions, thereby elevating both the localization and segmentation capacities of our model. Comprehensive experiments on the ScanRefer benchmark reveal that our model not only sets new performance standards, registering an mIoU gain of 11.7 points, but also achieves a staggering enhancement in inference speed, surpassing traditional methods by 95.7 times. The code and models are available at https://github.com/sosppxo/3D-STMN. This paper proposes 3D-STMN, an efficient end-to-end Superpoint-Text Matching Network for 3D Referring Expression Segmentation (3D-RES), addressing limitations of previous two-stage methods. 3D-RES is crucial for applications like robotics and self-driving by enabling precise object identification and segmentation from language descriptions in 3D scenes. 3D-STMN leverages superpoints and a novel Superpoint-Text Matching mechanism (STM) for aligning visual and language modalities. It incorporates a Dependency-Driven Interaction (DDI) module to exploit sentence structure for improved semantic comprehension. 3D-STMN significantly outperforms prior art (TGNN) on the ScanRefer benchmark, achieving an 11.7-point improvement in mIoU. It achieves a remarkable 95.7x speedup in inference time compared to TGNN, enabling near real-time performance. Qualitative analysis demonstrates 3D-STMN's superior accuracy in localizing and segmenting target objects, even in complex scenes with similar objects. The model's performance depends on the effectiveness of individual components like BERT, Sparse 3D U-Net, and the dependency parser. Future work can explore incorporating contextual information from the 3D scene to further enhance object disambiguation. 3d referring expression segmentation, 3d visual grounding, superpoint, dependency parsing, multimodal learning
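A minimal sketch of the superpoint-text matching idea, assuming superpoint features, a pooled referring-expression embedding, and a point-to-superpoint index are already available: score each superpoint against the query and scatter scores back to points. The real model adds attention layers and the DDI module; the projection dims below are assumptions.

```python
# Illustrative sketch of superpoint-text matching (not the official 3D-STMN
# code): score each superpoint against the referring-expression embedding,
# then broadcast superpoint scores back to points to form a mask.
import torch
import torch.nn as nn

class SuperpointTextMatcher(nn.Module):
    def __init__(self, sp_dim=128, txt_dim=768, shared_dim=128):
        super().__init__()
        self.sp_proj = nn.Linear(sp_dim, shared_dim)
        self.txt_proj = nn.Linear(txt_dim, shared_dim)

    def forward(self, sp_feats, query_feat, point2sp):
        # sp_feats:   (S, sp_dim)  features of S superpoints
        # query_feat: (txt_dim,)   pooled referring-expression embedding
        # point2sp:   (N,)         superpoint index of each of N points
        sp = self.sp_proj(sp_feats)                 # (S, D)
        q = self.txt_proj(query_feat)               # (D,)
        scores = (sp @ q) / sp.shape[-1] ** 0.5     # (S,) matching logits
        point_logits = scores[point2sp]             # scatter to points
        return torch.sigmoid(point_logits)          # per-point mask

matcher = SuperpointTextMatcher()
mask = matcher(torch.randn(200, 128), torch.randn(768),
               torch.randint(0, 200, (5000,)))
print(mask.shape)  # torch.Size([5000])
```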
2308.16582 Report Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images Qingping Zheng, Yuanfan Guo, Jiankang Deng, Jianhua Han, Ying Li, Songcen Xu, Hang Xu Stable diffusion, a generative model used in text-to-image synthesis, frequently encounters resolution-induced composition problems when generating images of varying sizes. This issue primarily stems from the model being trained on pairs of single-scale images and their corresponding text descriptions. Moreover, direct training on images of unlimited sizes is unfeasible, as it would require an immense number of text-image pairs and entail substantial computational expenses. To overcome these challenges, we propose a two-stage pipeline named Any-Size-Diffusion (ASD), designed to efficiently generate well-composed images of any size, while minimizing the need for high-memory GPU resources. Specifically, the initial stage, dubbed Any Ratio Adaptability Diffusion (ARAD), leverages a selected set of images with a restricted range of ratios to optimize the text-conditional diffusion model, thereby improving its ability to adjust composition to accommodate diverse image sizes. To support the creation of images at any desired size, we further introduce a technique called Fast Seamless Tiled Diffusion (FSTD) at the subsequent stage. This method allows for the rapid enlargement of the ASD output to any high-resolution size, avoiding seaming artifacts or memory overloads. Experimental results on the LAION-COCO and MM-CelebA-HQ benchmarks demonstrate that ASD can produce well-structured images of arbitrary sizes, cutting down the inference time by 2x compared to the traditional tiled algorithm. This paper introduces Any-Size-Diffusion (ASD), a two-stage pipeline for generating high-resolution images of any size from text prompts, addressing resolution-induced composition problems in existing text-to-image synthesis models. Existing models struggle with resolution changes, leading to poor composition in generated images of varying sizes. This issue arises from training on single-scale images and poses challenges in handling the vast range of possible image sizes. ASD employs a two-stage approach: 1) Any Ratio Adaptability Diffusion (ARAD) trains on multi-aspect ratio images to generate well-composed images adaptable to different sizes. 2) Fast Seamless Tiled Diffusion (FSTD) magnifies ARAD outputs to arbitrary sizes using an implicit overlap technique for efficiency and seam artifact avoidance. ASD outperforms baseline models in quantitative metrics (FID, IS, CLIP) and qualitative comparisons, demonstrating superior composition quality. Multi-aspect ratio training in ARAD significantly improves composition consistency across different sizes compared to single-size trained models. FSTD effectively mitigates seaming artifacts while achieving comparable inference time to non-overlapping tiled sampling. The current implementation relies on a pre-defined set of aspect ratios for training. Future work will explore expanding the range of aspect ratios and optimizing the computational efficiency of FSTD for even faster inference. text-to-image synthesis, stable diffusion, multi-resolution generation, compositionality, tiled diffusion
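The sketch below shows a generic overlapping-tile pass over a large latent, averaging the overlaps to avoid visible seams. It illustrates what tiled diffusion sampling does in general; FSTD's implicit-overlap trick itself is not reproduced here, and `denoise_tile` stands in for a single UNet denoising call.

```python
# Generic overlapping-tile sketch (not FSTD itself): process a large latent in
# overlapping tiles and average the overlaps to suppress seam artifacts.
import torch

def _starts(total, tile, stride):
    # tile start offsets that cover [0, total); assumes total >= tile
    starts = list(range(0, total - tile + 1, stride))
    if starts[-1] != total - tile:
        starts.append(total - tile)
    return starts

def tiled_denoise(latent, tile=64, stride=48, denoise_tile=lambda x: x):
    # latent: (B, C, H, W); `denoise_tile` is a placeholder for one UNet call
    B, C, H, W = latent.shape
    out = torch.zeros_like(latent)
    weight = torch.zeros(1, 1, H, W)
    for top in _starts(H, tile, stride):
        for left in _starts(W, tile, stride):
            patch = latent[:, :, top:top + tile, left:left + tile]
            out[:, :, top:top + tile, left:left + tile] += denoise_tile(patch)
            weight[:, :, top:top + tile, left:left + tile] += 1.0
    return out / weight  # average overlapping regions

y = tiled_denoise(torch.randn(1, 4, 96, 128))
print(y.shape)  # torch.Size([1, 4, 96, 128])
```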
2308.16512 Report MVDream: Multi-view Diffusion for 3D Generation Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, Xiao Yang We introduce MVDream, a diffusion model that is able to generate consistent multi-view images from a given text prompt. Learning from both 2D and 3D data, a multi-view diffusion model can achieve the generalizability of 2D diffusion models and the consistency of 3D renderings. We demonstrate that such a multi-view diffusion model is implicitly a generalizable 3D prior agnostic to 3D representations. It can be applied to 3D generation via Score Distillation Sampling, significantly enhancing the consistency and stability of existing 2D-lifting methods. It can also learn new concepts from a few 2D examples, akin to DreamBooth, but for 3D generation. This paper presents MVDream, a novel multi-view diffusion model for consistent multi-view and 3D image generation from text prompts. Existing 2D-lifting methods for text-to-3D generation often produce inconsistent images across different viewpoints. MVDream addresses this by directly incorporating multi-view consistency into a diffusion-based framework. The authors achieve this by (1) adapting a pre-trained text-to-image diffusion model to generate multiple views using inflated 3D self-attention and camera embeddings, (2) training the model on a combination of 3D rendered data and a large-scale 2D image-text dataset, and (3) using the trained model as a prior for 3D generation via Score Distillation Sampling (SDS). MVDream effectively addresses the multi-view consistency issue common in 2D-lifting methods. The method demonstrates strong generalizability, generating coherent multi-view images from unseen and potentially counterfactual prompts. MVDream can be extended to a DreamBooth model for personalized 3D generation, outperforming previous state-of-the-art methods in terms of quality and detail. The current implementation is limited to a 256x256 resolution and its generalizability depends on the base model. The generated image styles can be influenced by the rendered dataset, highlighting the need for larger and more diverse 3D datasets. multi-view diffusion model, text-to-3d generation, score distillation sampling, dreambooth, multi-view consistency
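One way to picture the inflated 3D self-attention is shown below: per-view camera embeddings are added to the tokens and the view axis is folded into the token axis so that tokens from all views attend to each other. Shapes, dimensions, and the camera-injection scheme are illustrative assumptions rather than MVDream's exact modules.

```python
# Sketch of "inflating" 2D self-attention to multi-view self-attention by
# folding the view axis into the token axis. Not MVDream's exact modules.
import torch
import torch.nn as nn

class InflatedSelfAttention(nn.Module):
    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, cam_embed):
        # x:         (B, V, N, C) tokens for V views of the same scene
        # cam_embed: (B, V, C)    per-view camera embedding
        B, V, N, C = x.shape
        x = x + cam_embed.unsqueeze(2)        # inject camera pose per view
        tokens = x.reshape(B, V * N, C)       # all views share one attention
        out, _ = self.attn(tokens, tokens, tokens)
        return out.reshape(B, V, N, C)

attn = InflatedSelfAttention()
y = attn(torch.randn(2, 4, 64, 320), torch.randn(2, 4, 320))
print(y.shape)  # torch.Size([2, 4, 64, 320])
```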
2308.16510 Report Robust GAN inversion Egor Sevriugov, Ivan Oseledets Recent advancements in real image editing have been attributed to the exploration of Generative Adversarial Networks (GANs) latent space. However, the main challenge of this procedure is GAN inversion, which aims to map the image to the latent space accurately. Existing methods that work on extended latent space $W+$ are unable to achieve low distortion and high editability simultaneously. To address this issue, we propose an approach which works in native latent space $W$ and tunes the generator network to restore missing image details. We introduce a novel regularization strategy with learnable coefficients obtained by training randomized StyleGAN 2 model - WRanGAN. This method outperforms traditional approaches in terms of reconstruction quality and computational efficiency, achieving the lowest distortion with 4 times fewer parameters. Furthermore, we observe a slight improvement in the quality of constructing hyperplanes corresponding to binary image attributes. We demonstrate the effectiveness of our approach on two complex datasets: Flickr-Faces-HQ and LSUN Church. This paper introduces WRanGAN, a novel GAN inversion approach using adaptive regularization in the native latent space (W) to enhance image reconstruction quality while maintaining editability. Existing GAN inversion methods struggle to balance high-fidelity image reconstruction with preserving editability, limiting their use in real image editing applications. The method leverages a randomized StyleGAN 2 model (WRanGAN), where a subset of generator weights are randomized. This allows for learning regularization coefficients through adversarial training. During inversion, these coefficients are applied, guiding the optimization towards high-quality reconstruction without compromising model fidelity. WRanGAN achieves superior image reconstruction compared to baselines, evidenced by lower MSE and higher MS-SSIM values. It outperforms the best baseline (PTI) in reconstruction quality while being significantly faster and requiring less memory. The method retains good image editing capabilities, comparable to or slightly better than the original StyleGAN 2 model. The approach focuses on StyleGAN 2 architecture and its generalizability to other GAN architectures needs further investigation. While the method maintains good editability, further exploration of techniques for even finer-grained control is possible. gan inversion, image editing, generative adversarial networks, stylegan, regularization
2308.16481 Report Point-TTA: Test-Time Adaptation for Point Cloud Registration Using Multitask Meta-Auxiliary Learning Ahmed Hatem, Yiming Qian, Yang Wang We present Point-TTA, a novel test-time adaptation framework for point cloud registration (PCR) that improves the generalization and the performance of registration models. While learning-based approaches have achieved impressive progress, generalization to unknown testing environments remains a major challenge due to the variations in 3D scans. Existing methods typically train a generic model and the same trained model is applied on each instance during testing. This could be sub-optimal since it is difficult for the same model to handle all the variations during testing. In this paper, we propose a test-time adaptation approach for PCR. Our model can adapt to unseen distributions at test-time without requiring any prior knowledge of the test data. Concretely, we design three self-supervised auxiliary tasks that are optimized jointly with the primary PCR task. Given a test instance, we adapt our model using these auxiliary tasks and the updated model is used to perform the inference. During training, our model is trained using a meta-auxiliary learning approach, such that the adapted model via auxiliary tasks improves the accuracy of the primary task. Experimental results demonstrate the effectiveness of our approach in improving generalization of point cloud registration and outperforming other state-of-the-art approaches. This paper introduces Point-TTA, a novel test-time adaptation framework designed for point cloud registration (PCR) that enhances the generalization and performance of registration models. Generalization to unknown testing environments remains a challenge for learning-based PCR approaches due to variations in 3D scans, making a single set of model parameters sub-optimal. The method utilizes three self-supervised auxiliary tasks: point cloud reconstruction, feature learning (BYOL), and correspondence classification, to adapt the model to unseen distributions at test time. It employs a meta-auxiliary learning approach based on MAML to train the model, ensuring the adapted model improves the accuracy of the primary PCR task. Point-TTA significantly improves registration recall and reduces rotation and translation errors on the 3DMatch benchmark, outperforming state-of-the-art methods. The method exhibits strong generalization capabilities, demonstrated by significant performance improvements in cross-dataset evaluations between 3DMatch and KITTI datasets, as well as robustness on the low-overlapping 3DLoMatch dataset. Point-TTA, integrated into a multi-way registration pipeline, enhances the accuracy of 3D reconstruction scenes on the Augmented ICL-NUIM dataset, surpassing baseline methods. The paper acknowledges the potential limitation of slightly worse rotation error observed in some cases when using all three auxiliary tasks. Future work could explore extending the approach to handle dynamic scenes and incorporating additional self-supervised auxiliary tasks. point cloud registration, test-time adaptation, meta-learning, self-supervised learning, 3d vision
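A generic test-time adaptation loop of the kind described here might look like the following: copy the model, take a few gradient steps on self-supervised auxiliary losses computed on the test instance, then run the primary task with the adapted weights. The auxiliary losses and optimizer settings below are placeholders, not the paper's exact objectives or its meta-auxiliary training.

```python
# Generic test-time adaptation loop (placeholder losses, not Point-TTA's
# exact objectives): adapt a copy of the model on the test instance via
# self-supervised auxiliary tasks, then run the primary prediction.
import copy
import torch

def test_time_adapt(model, aux_losses, test_batch, steps=3, lr=1e-4):
    adapted = copy.deepcopy(model)          # keep the original weights intact
    adapted.train()
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = sum(fn(adapted, test_batch) for fn in aux_losses)
        loss.backward()
        opt.step()
    adapted.eval()
    with torch.no_grad():
        return adapted(test_batch)          # primary-task prediction

# usage with toy stand-ins
model = torch.nn.Linear(3, 3)
aux = [lambda m, b: ((m(b) - b) ** 2).mean()]   # e.g. a reconstruction proxy
pred = test_time_adapt(model, aux, torch.randn(8, 3))
print(pred.shape)  # torch.Size([8, 3])
```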
2308.16110 Report Improving Few-shot Image Generation by Structural Discrimination and Textural Modulation Mengping Yang, Zhe Wang, Wenyi Feng, Qian Zhang, Ting Xiao Few-shot image generation, which aims to produce plausible and diverse images for one category given a few images from this category, has drawn extensive attention. Existing approaches either globally interpolate different images or fuse local representations with pre-defined coefficients. However, such an intuitive combination of images/features only exploits the most relevant information for generation, leading to poor diversity and coarse-grained semantic fusion. To remedy this, this paper proposes a novel textural modulation (TexMod) mechanism to inject external semantic signals into internal local representations. Parameterized by the feedback from the discriminator, our TexMod enables more fined-grained semantic injection while maintaining the synthesis fidelity. Moreover, a global structural discriminator (StructD) is developed to explicitly guide the model to generate images with reasonable layout and outline. Furthermore, the frequency awareness of the model is reinforced by encouraging the model to distinguish frequency signals. Together with these techniques, we build a novel and effective model for few-shot image generation. The effectiveness of our model is identified by extensive experiments on three popular datasets and various settings. Besides achieving state-of-the-art synthesis performance on these datasets, our proposed techniques could be seamlessly integrated into existing models for a further performance boost. This paper proposes SDTM-GAN, a few-shot image generation model, which improves global coherence and enables fine-grained semantic fusion through structural discrimination (StructD) and textural modulation (TexMod). Few-shot image generation models struggle to achieve desirable diversity and fidelity due to limitations in semantic fusion and lack of structural guidance. TexMod injects external semantic layouts into internal textural styles using a two-stage injection mechanism. StructD uses Laplacian representations to provide global structural guidelines. A frequency discriminator encourages high-frequency signal capture. SDTM-GAN significantly improves FID and LPIPS scores on Flowers, Animal Faces, and VGGFace datasets, achieving state-of-the-art performance. The generated images show improved global coherence, fine-grained semantic details, and diversity. The proposed techniques are complementary to existing models and improve downstream classification accuracy when used for data augmentation. The model's performance might degrade on datasets with large class variances or in cross-domain generation with substantial domain gaps. Future work includes exploring data augmentation for one-shot generation, capturing more distributional information, and investigating diffusion models. few-shot image generation, textural modulation, structural discrimination, generative adversarial networks, semantic fusion
2308.15854 Report Zero-shot Inversion Process for Image Attribute Editing with Diffusion Models Zhanbo Feng, Zenan Ling, Ci Gong, Feng Zhou, Jie Li, Robert C. Qiu Denoising diffusion models have shown outstanding performance in image editing. Existing works tend to use either image-guided methods, which provide a visual reference but lack control over semantic coherence, or text-guided methods, which ensure faithfulness to text guidance but lack visual quality. To address the problem, we propose the Zero-shot Inversion Process (ZIP), a framework that injects a fusion of generated visual reference and text guidance into the semantic latent space of a frozen pre-trained diffusion model. Only using a tiny neural network, the proposed ZIP produces diverse content and attributes under the intuitive control of the text prompt. Moreover, ZIP shows remarkable robustness for both in-domain and out-of-domain attribute manipulation on real images. We perform detailed experiments on various benchmark datasets. Compared to state-of-the-art methods, ZIP produces images of equivalent quality while providing a realistic editing effect. This paper introduces Zero-shot Inversion Process (ZIP), a novel framework for realistic image attribute editing that injects a fusion of generated visual reference and text guidance into the semantic latent space of a frozen pre-trained diffusion model. Existing image editing methods using diffusion models often lack control over semantic coherence (image-guided) or suffer from low visual quality (text-guided). ZIP addresses this by combining text guidance for intuitive control and generated visual references for fine-grained visual patterns. ZIP leverages a pre-trained text-to-image diffusion model to generate a reference image from the target attribute. An attribute encoder (a small neural network) is trained to encode this reference image into features, which are then integrated into the latent space of a frozen pre-trained diffusion model (editing generator). The editing process is guided by a text prompt and optimized using a CLIP-based loss function. ZIP enables consistent and controllable editing by generating specific attributes aligned with the reference image, unlike text-guided methods like Null-text Inversion. ZIP demonstrates superior performance in both in-domain and out-of-domain attribute editing compared to image-guided methods (ILVR) and text-guided methods (Asyrp), achieving higher CLIP scores while preserving visual quality. ZIP exhibits versatility across diverse datasets (CelebA-HQ, LSUN-church, LSUN-bedroom), successfully synthesizing attributes and manipulating images based on textual semantics. ZIP currently lacks the capability to utilize a target mask for precise editing, which may lead to unintended modifications. Future work will focus on improving the accuracy of attribute acquisition from reference images, particularly for attributes with similar visual features. image editing, diffusion models, text-guided synthesis, zero-shot learning, semantic manipulation
2308.15692 Report Intriguing Properties of Diffusion Models: An Empirical Study of the Natural Attack Capability in Text-to-Image Generative Models Takami Sato, Justin Yue, Nanze Chen, Ningfei Wang, Qi Alfred Chen Denoising probabilistic diffusion models have shown breakthrough performance to generate more photo-realistic images or human-level illustrations than the prior models such as GANs. This high image-generation capability has stimulated the creation of many downstream applications in various areas. However, we find that this technology is actually a double-edged sword: We identify a new type of attack, called the Natural Denoising Diffusion (NDD) attack based on the finding that state-of-the-art deep neural network (DNN) models still hold their prediction even if we intentionally remove their robust features, which are essential to the human visual system (HVS), through text prompts. The NDD attack shows a significantly high capability to generate low-cost, model-agnostic, and transferable adversarial attacks by exploiting the natural attack capability in diffusion models. To systematically evaluate the risk of the NDD attack, we perform a large-scale empirical study with our newly created dataset, the Natural Denoising Diffusion Attack (NDDA) dataset. We evaluate the natural attack capability by answering 6 research questions. Through a user study, we find that it can achieve an 88% detection rate while being stealthy to 93% of human subjects; we also find that the non-robust features embedded by diffusion models contribute to the natural attack capability. To confirm the model-agnostic and transferable attack capability, we perform the NDD attack against the Tesla Model 3 and find that 73% of the physically printed attacks can be detected as stop signs. Our hope is that the study and dataset can help our community be aware of the risks in diffusion models and facilitate further research toward robust DNN models. This paper identifies a new security threat, the Natural Denoising Diffusion (NDD) attack, which exploits the natural attack capability of diffusion models to generate model-agnostic and transferable adversarial attacks. Diffusion models, while revolutionary for image generation, introduce new security risks by embedding imperceptible features that can fool DNN models. The authors construct the Natural Diffusion Denoising Attack (NDDA) dataset by generating images with and without robust features (shape, color, text, pattern) using various diffusion models. They evaluate the attack capability against object detectors and image classifiers and conduct a user study to assess the stealthiness of the attacks. Diffusion models exhibit a significantly higher natural attack capability compared to prior image generation models like GANs. The NDD attack can achieve a high attack success rate (88% for stop signs) while remaining stealthy to human perception (93% for stop signs). Non-robust features, imperceptible yet predictive to DNNs, play a significant role in the NDD attack's effectiveness. The study primarily focuses on three object classes, requiring further evaluation on a wider range of categories. While the research provides empirical evidence, the root causes of the natural attack capability in diffusion models require further theoretical and large-scale empirical investigation. adversarial attacks, diffusion models, computer vision, deep learning security, ndd attack
2308.15547 Report Efficient Ray Sampling for Radiance Fields Reconstruction Shilei Sun, Ming Liu, Zhongyi Fan, Yuxue Liu, Chengwei Lv, Liquan Dong, Lingqin Kong Accelerating neural radiance fields training is of substantial practical value, as the ray sampling strategy profoundly impacts network convergence. More efficient ray sampling can thus directly enhance existing NeRF models' training efficiency. We therefore propose a novel ray sampling approach for neural radiance fields that improves training efficiency while retaining photorealistic rendering results. First, we analyze the relationship between the pixel loss distribution of sampled rays and rendering quality. This reveals redundancy in the original NeRF's uniform ray sampling. Guided by this finding, we develop a sampling method leveraging pixel regions and depth boundaries. Our main idea is to sample fewer rays in training views, yet with each ray more informative for scene fitting. Sampling probability increases in pixel areas exhibiting significant color and depth variation, greatly reducing wasteful rays from other regions without sacrificing precision. Through this method, not only can the convergence of the network be accelerated, but the spatial geometry of a scene can also be perceived more accurately. Rendering outputs are enhanced, especially for texture-complex regions. Experiments demonstrate that our method significantly outperforms state-of-the-art techniques on public benchmark datasets. This paper proposes a novel ray sampling method for neural radiance fields that improves training efficiency while maintaining high-quality rendering results. Accelerating neural radiance fields training is crucial, and efficient ray sampling is key to achieving faster convergence and better utilizing resources. The proposed method leverages pixel regions and depth boundaries to guide ray sampling. It increases sampling probability in areas with significant color and depth variations, reducing redundant rays in other regions. The method significantly accelerates convergence, achieving comparable results to traditional methods in much shorter times. It improves rendering quality, particularly in regions with rich texture and complex details. The method is easily integrated into existing NeRF frameworks, consistently demonstrating improvements in speed and rendering quality. The current implementation primarily focuses on static and dynamic scenes with simple backgrounds. Future work will explore extending the method to handle complex backgrounds and further enhance its generalization capabilities. neural radiance fields, ray sampling, view synthesis, training acceleration, rendering quality
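As a hedged sketch of variation-aware ray sampling, the snippet below builds a per-pixel probability map from local color and depth variation and draws ray indices from it instead of sampling uniformly. The gradient-based score and the weighting constants are illustrative choices, not the paper's exact formulation.

```python
# Sketch of variation-aware ray sampling (illustrative, not the paper's exact
# scheme): sample more rays where color and depth change rapidly.
import torch
import torch.nn.functional as F

def sampling_probs(rgb, depth, alpha=1.0, beta=1.0, eps=1e-6):
    # rgb: (3, H, W) in [0, 1]; depth: (1, H, W)
    def grad_mag(x):
        gx = x[..., :, 1:] - x[..., :, :-1]
        gy = x[..., 1:, :] - x[..., :-1, :]
        gx = F.pad(gx, (0, 1, 0, 0)).abs().sum(0)
        gy = F.pad(gy, (0, 0, 0, 1)).abs().sum(0)
        return gx + gy                      # (H, W) total-variation proxy
    score = alpha * grad_mag(rgb) + beta * grad_mag(depth)
    probs = (score + eps).flatten()         # eps keeps flat regions reachable
    return probs / probs.sum()

def sample_rays(rgb, depth, n_rays=1024):
    probs = sampling_probs(rgb, depth)
    # flat pixel indices of the rays to train on this iteration
    return torch.multinomial(probs, n_rays, replacement=False)

idx = sample_rays(torch.rand(3, 128, 128), torch.rand(1, 128, 128))
print(idx.shape)  # torch.Size([1024])
```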
2308.15472 Report Learning Modulated Transformation in GANs Ceyuan Yang, Qihang Zhang, Yinghao Xu, Jiapeng Zhu, Yujun Shen, Bo Dai The success of style-based generators largely benefits from style modulation, which helps take care of the cross-instance variation within data. However, the instance-wise stochasticity is typically introduced via regular convolution, where kernels interact with features at some fixed locations, limiting its capacity for modeling geometric variation. To alleviate this problem, we equip the generator in generative adversarial networks (GANs) with a plug-and-play module, termed as modulated transformation module (MTM). This module predicts spatial offsets under the control of latent codes, based on which the convolution operation can be applied at variable locations for different instances, and hence offers the model an additional degree of freedom to handle geometry deformation. Extensive experiments suggest that our approach can be faithfully generalized to various generative tasks, including image generation, 3D-aware image synthesis, and video generation, and get compatible with state-of-the-art frameworks without any hyper-parameter tuning. It is noteworthy that, towards human generation on the challenging TaiChi dataset, we improve the FID of StyleGAN3 from 21.36 to 13.60, demonstrating the efficacy of learning modulated geometry transformation. This paper proposes a plug-and-play module for GAN generators called Modulated Transformation Module (MTM) to improve the handling of large geometric variations in generated content. Standard GAN generators, while effective for aligned datasets with limited geometric variance, struggle with datasets like ImageNet or videos with complex motions due to the limitations of regular convolutions in modeling diverse geometry. MTM predicts spatial offsets for each spatial location in the feature map conditioned on the latent code. These offsets allow convolution operations to be performed at variable locations, enabling the model to learn and represent diverse geometric transformations. MTM consistently improves both image and video generation quality across different datasets and baseline models, as evidenced by metrics like FID, CLIP-FD, sFID, and FVD. Applying MTM to low-resolution layers of the generator offers the best performance-efficiency trade-off. Disabling the learnable offsets in MTM after training results in a collapse of geometric variation, highlighting its role in enabling explicit deformation. The paper doesn't explore the effectiveness of MTM on other generative models like auto-regressive models or diffusion models. The impact of MTM on large-scale generative tasks such as text-to-image generation remains unexplored. generative adversarial networks, geometric variation, image generation, video generation, spatial transformation
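The snippet below approximates the idea of latent-modulated spatial offsets: offsets are predicted under the control of the latent code, the feature map is warped accordingly with `grid_sample`, and a regular convolution follows. This is a simplified stand-in for MTM, not the paper's implementation.

```python
# Simplified stand-in for a modulated transformation module: latent-dependent
# offsets warp the features before a regular convolution. Not the exact MTM.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OffsetModulatedConv(nn.Module):
    def __init__(self, ch=64, w_dim=512):
        super().__init__()
        self.to_mod = nn.Linear(w_dim, ch)                 # latent -> channel modulation
        self.offset_head = nn.Conv2d(ch, 2, 3, padding=1)  # predicts (dx, dy)
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, feat, w):
        # feat: (B, C, H, W), w: (B, w_dim) latent code
        B, C, H, W = feat.shape
        mod = self.to_mod(w).view(B, C, 1, 1)
        offsets = self.offset_head(feat * mod)             # latent-dependent offsets
        # base sampling grid in [-1, 1], perturbed by the predicted offsets
        ys = torch.linspace(-1, 1, H, device=feat.device)
        xs = torch.linspace(-1, 1, W, device=feat.device)
        gy, gx = torch.meshgrid(ys, xs, indexing='ij')
        grid = torch.stack([gx, gy], dim=-1).expand(B, H, W, 2)
        grid = grid + 0.1 * offsets.permute(0, 2, 3, 1).tanh()
        warped = F.grid_sample(feat, grid, align_corners=True)
        return self.conv(warped)

m = OffsetModulatedConv()
out = m(torch.randn(2, 64, 32, 32), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 64, 32, 32])
```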
2308.15070 Report DiffBIR: Towards Blind Image Restoration with Generative Diffusion Prior Xinqi Lin, Jingwen He, Ziyan Chen, Zhaoyang Lyu, Bo Dai, Fanghua Yu, Wanli Ouyang, Yu Qiao, Chao Dong We present DiffBIR, a general restoration pipeline that could handle different blind image restoration tasks in a unified framework. DiffBIR decouples blind image restoration problem into two stages: 1) degradation removal: removing image-independent content; 2) information regeneration: generating the lost image content. Each stage is developed independently but they work seamlessly in a cascaded manner. In the first stage, we use restoration modules to remove degradations and obtain high-fidelity restored results. For the second stage, we propose IRControlNet that leverages the generative ability of latent diffusion models to generate realistic details. Specifically, IRControlNet is trained based on specially produced condition images without distracting noisy content for stable generation performance. Moreover, we design a region-adaptive restoration guidance that can modify the denoising process during inference without model re-training, allowing users to balance realness and fidelity through a tunable guidance scale. Extensive experiments have demonstrated DiffBIR's superiority over state-of-the-art approaches for blind image super-resolution, blind face restoration and blind image denoising tasks on both synthetic and real-world datasets. The code is available at https://github.com/XPixelGroup/DiffBIR. DiffBIR, a unified two-stage blind image restoration pipeline achieving state-of-the-art performance on BSR, BFR, and BID. Existing BIR methods struggle to generalize to real-world degradations or lack the ability to generate realistic details, limiting their practicality. DiffBIR decouples BIR into degradation removal (using task-specific modules) and information regeneration (using IRControlNet, a novel generation module leveraging latent diffusion models conditioned on restored images). It also introduces region-adaptive restoration guidance for fidelity-quality trade-off during inference. DiffBIR significantly outperforms state-of-the-art methods in BSR, BFR, and BID on both synthetic and real-world datasets, demonstrating its superior generalization ability. IRControlNet, with its efficient condition encoding and feature modulation, proves crucial for high-quality image reconstruction in BIR. The region-adaptive restoration guidance allows for flexible control over fidelity and quality based on user preferences. The current implementation of DiffBIR is computationally expensive, requiring 50 sampling steps per image. The effectiveness of the two-stage pipeline on other BIR tasks requires further exploration. blind image restoration, blind super-resolution, blind face restoration, blind image denoising, latent diffusion models
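A much-simplified sketch of region-adaptive restoration guidance is given below: at each sampling step, the denoiser's current clean-image estimate is nudged toward the stage-1 restored reference, weighted per region and scaled by a user-chosen guidance strength. The exact guidance formulation in DiffBIR differs; this only illustrates the tunable realness-fidelity trade-off.

```python
# Simplified restoration-guidance step (illustrative only, not DiffBIR's
# exact formulation): pull the current clean-image estimate toward the
# stage-1 restored reference, with per-region weights and a tunable scale.
import torch

def guided_x0(x0_pred, restored_ref, region_weight, scale=0.5):
    # x0_pred:       (B, C, H, W) denoiser's current estimate of the clean image
    # restored_ref:  (B, C, H, W) output of the stage-1 degradation-removal module
    # region_weight: (B, 1, H, W) in [0, 1]; larger -> stronger pull to the reference
    # scale:         global guidance strength chosen by the user
    guidance = region_weight * (restored_ref - x0_pred)
    return x0_pred + scale * guidance

x0 = torch.randn(1, 3, 64, 64)
ref = torch.randn(1, 3, 64, 64)
w = torch.rand(1, 1, 64, 64)
print(guided_x0(x0, ref, w).shape)  # torch.Size([1, 3, 64, 64])
```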
2308.15049 Report Pose-Free Neural Radiance Fields via Implicit Pose Regularization Jiahui Zhang, Fangneng Zhan, Yingchen Yu, Kunhao Liu, Rongliang Wu, Xiaoqin Zhang, Ling Shao, Shijian Lu Pose-free neural radiance fields (NeRF) aim to train NeRF with unposed multi-view images and it has achieved very impressive success in recent years. Most existing works share the pipeline of training a coarse pose estimator with rendered images at first, followed by a joint optimization of estimated poses and neural radiance field. However, as the pose estimator is trained with only rendered images, the pose estimation is usually biased or inaccurate for real images due to the domain gap between real images and rendered images, leading to poor robustness for the pose estimation of real images and further local minima in joint optimization. We design IR-NeRF, an innovative pose-free NeRF that introduces implicit pose regularization to refine pose estimator with unposed real images and improve the robustness of the pose estimation for real images. With a collection of 2D images of a specific scene, IR-NeRF constructs a scene codebook that stores scene features and captures the scene-specific pose distribution implicitly as priors. Thus, the robustness of pose estimation can be promoted with the scene priors according to the rationale that a 2D real image can be well reconstructed from the scene codebook only when its estimated pose lies within the pose distribution. Extensive experiments show that IR-NeRF achieves superior novel view synthesis and outperforms the state-of-the-art consistently across multiple synthetic and real datasets. This paper proposes IR-NeRF, a pose-free Neural Radiance Field (NeRF) that leverages implicit pose regularization to refine pose estimation with unposed real images and improve the robustness of pose estimation. Existing pose-free NeRF methods struggle with inaccurate pose estimation due to the domain gap between rendered and real images. This leads to poor robustness and local minima during optimization. IR-NeRF addresses this challenge by incorporating implicit pose regularization. IR-NeRF constructs a scene codebook that stores scene features and implicitly captures scene-specific pose distribution. A pose-guided view reconstruction scheme then refines the pose estimator using unposed real images and a view consistency loss. IR-NeRF achieves superior novel view synthesis compared to the state-of-the-art GNeRF, as demonstrated by higher PSNR, SSIM, and lower LPIPS scores across various synthetic and real datasets. The proposed implicit pose regularization effectively improves the accuracy of camera pose estimation on real images. Ablation studies confirm the effectiveness of each component in IR-NeRF, including implicit pose regularization, scene codebook construction, and view consistency loss. The training process of IR-NeRF is computationally expensive and time-consuming. Future work can explore methods to improve training speed, potentially by using more efficient representations. neural radiance fields, novel view synthesis, pose estimation, implicit pose regularization, scene codebook
2308.14761 Report Unified Concept Editing in Diffusion Models Rohit Gandikota, Hadas Orgad, Yonatan Belinkov, Joanna Materzyńska, David Bau Text-to-image models suffer from various safety issues that may limit their suitability for deployment. Previous methods have separately addressed individual issues of bias, copyright, and offensive content in text-to-image models. However, in the real world, all of these issues appear simultaneously in the same model. We present a method that tackles all issues with a single approach. Our method, Unified Concept Editing (UCE), edits the model without training using a closed-form solution, and scales seamlessly to concurrent edits on text-conditional diffusion models. We demonstrate scalable simultaneous debiasing, style erasure, and content moderation by editing text-to-image projections, and we present extensive experiments demonstrating improved efficacy and scalability over prior work. Our code is available at https://unified.baulab.info This paper introduces Unified Concept Editing (UCE), a closed-form model editing method for addressing safety issues like bias, copyright, and offensive content in text-to-image diffusion models. Existing methods address these issues separately, while real-world models exhibit all problems concurrently. UCE offers a single, scalable solution for simultaneous editing, crucial for responsible AI deployment. UCE modifies cross-attention weights using a closed-form solution. It identifies concepts via text embeddings, then erases, debiases, or moderates them by steering outputs towards desired targets while preserving unrelated concepts. UCE effectively erases artistic styles with minimal interference on unrelated concepts, outperforming fine-tuning based methods. It debiases gender and racial representations in professions, achieving distributions closer to desired ratios than previous methods. UCE moderates sensitive content like nudity, effectively reducing its presence while better preserving image quality and text-image alignment compared to other techniques. Debiasing across multiple attributes reveals interdependencies and compounding biases, requiring joint consideration for mitigation. Excessive erasure of artistic styles, even with preservation, degrades general image generation, indicating limits to removable content. diffusion models, model editing, debiasing, content moderation, copyright
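The closed-form flavor of such an edit can be sketched as a ridge-style least-squares update of a cross-attention projection: edited concept embeddings are mapped to target values while preserved concepts keep their original outputs. The objective and regularization below are a simplification and may not match UCE's exact solution.

```python
# Sketch of a closed-form cross-attention edit in the spirit of UCE: a ridge
# least-squares solve so edited embeddings map to targets while preserved
# embeddings keep their original outputs. Simplified, not the exact UCE update.
import torch

def closed_form_edit(W, edit_keys, target_vals, preserve_keys, lam=0.1):
    # W:             (d_out, d_in) original projection (e.g. a cross-attn W_v)
    # edit_keys:     (n_e, d_in)   text embeddings of concepts to edit
    # target_vals:   (n_e, d_out)  desired outputs for those embeddings
    # preserve_keys: (n_p, d_in)   embeddings whose outputs must not change
    d_in = W.shape[1]
    A = edit_keys.T @ edit_keys + preserve_keys.T @ preserve_keys \
        + lam * torch.eye(d_in)                                  # (d_in, d_in)
    B = target_vals.T @ edit_keys + (W @ preserve_keys.T) @ preserve_keys \
        + lam * W                                                # (d_out, d_in)
    return B @ torch.linalg.inv(A)

W = torch.randn(320, 768)
W_new = closed_form_edit(W, torch.randn(4, 768), torch.randn(4, 320),
                         torch.randn(32, 768))
print(W_new.shape)  # torch.Size([320, 768])
```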
2308.14753 Report Efficient Discovery and Effective Evaluation of Visual Perceptual Similarity: A Benchmark and Beyond Oren Barkan, Tal Reiss, Jonathan Weill, Ori Katz, Roy Hirsch, Itzik Malkiel, Noam Koenigstein Visual similarities discovery (VSD) is an important task with broad e-commerce applications. Given an image of a certain object, the goal of VSD is to retrieve images of different objects with high perceptual visual similarity. Although being a highly addressed problem, the evaluation of proposed methods for VSD is often based on a proxy of an identification-retrieval task, evaluating the ability of a model to retrieve different images of the same object. We posit that evaluating VSD methods based on identification tasks is limited, and faithful evaluation must rely on expert annotations. In this paper, we introduce the first large-scale fashion visual similarity benchmark dataset, consisting of more than 110K expert-annotated image pairs. Besides this major contribution, we share insight from the challenges we faced while curating this dataset. Based on these insights, we propose a novel and efficient labeling procedure that can be applied to any dataset. Our analysis examines its limitations and inductive biases, and based on these findings, we propose metrics to mitigate those limitations. Though our primary focus lies on visual similarity, the methodologies we present have broader applications for discovering and evaluating perceptual similarity across various domains. This paper introduces the first large-scale, expert-annotated benchmark dataset for fashion visual similarity discovery (VSD), addressing limitations of identification-based evaluations. Accurate VSD evaluation is crucial for e-commerce applications, but existing methods often rely on flawed identification-based proxies. This dataset enables more reliable assessment of VSD models. The authors develop the Efficient Discovery of Similarities (EDS) method to curate the dataset. EDS leverages multiple vision models to propose candidate similar pairs, which are then verified by human experts. Expert-annotated dataset with over 110K image pairs for closed-catalog and in-the-wild fashion VSD benchmarks. Proposed ROC-AUC metric for VSD evaluation shows robustness to model bias inherent in the EDS method. Supervised finetuning for identification does not necessarily improve VSD performance, highlighting the distinction between the two tasks. EDS method, while more efficient than brute force, may not uncover all positive pairs. Future work could explore alternative training schemes for improving VSD performance beyond supervised finetuning. visual similarity, benchmark dataset, fashion, information retrieval, evaluation metrics
2308.14749 Report MagicEdit: High-Fidelity and Temporally Coherent Video Editing Jun Hao Liew, Hanshu Yan, Jianfeng Zhang, Zhongcong Xu, Jiashi Feng In this report, we present MagicEdit, a surprisingly simple yet effective solution to the text-guided video editing task. We found that high-fidelity and temporally coherent video-to-video translation can be achieved by explicitly disentangling the learning of content, structure and motion signals during training. This is in contrast to most existing methods, which attempt to jointly model both the appearance and temporal representation within a single framework, which, we argue, would lead to degradation in per-frame quality. Despite its simplicity, we show that MagicEdit supports various downstream video editing tasks, including video stylization, local editing, video-MagicMix and video outpainting. MagicEdit is a surprisingly simple yet effective solution for text-guided video editing that achieves high-fidelity and temporally coherent results by explicitly disentangling the learning of content, structure, and motion signals. Existing video editing methods often struggle with maintaining high per-frame quality and temporal consistency. MagicEdit addresses these issues with a novel training approach. MagicEdit uses a three-stage training process: (1) Train a base text-to-image diffusion model. (2) Train a structure-conditioned module while freezing the pre-trained UNet. (3) Train a motion module to enforce cross-frame consistency, also while freezing the UNet. Successfully performs video stylization with different subjects, backgrounds, and styles. Enables local editing of videos based on text prompts. Demonstrates video outpainting capabilities with various ratios and content control. Relies on the quality of existing structure extraction methods (e.g., depth, pose). Large outpainting ratios can sometimes lead to less coherent results. video editing, text-guided generation, diffusion models, temporal consistency, video outpainting
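The staged training recipe can be sketched as a simple freezing schedule: the pretrained image UNet stays frozen while the structure-conditioned module and then the motion module are optimized in their respective stages. The module classes and optimizers below are placeholders, not MagicEdit's actual components.

```python
# Sketch of the staged training/freezing schedule described for MagicEdit.
# The modules below are tiny placeholders standing in for the real networks.
import torch
import torch.nn as nn

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

base_unet = nn.Sequential(nn.Conv2d(4, 64, 3, padding=1), nn.SiLU(),
                          nn.Conv2d(64, 4, 3, padding=1))      # stand-in for a T2I UNet
structure_module = nn.Conv2d(1, 4, 3, padding=1)               # e.g. depth conditioning
motion_module = nn.Conv3d(4, 4, (3, 1, 1), padding=(1, 0, 0))  # temporal layer

# Stage 2: train the structure module, UNet frozen
set_trainable(base_unet, False)
set_trainable(structure_module, True)
opt_struct = torch.optim.AdamW(structure_module.parameters(), lr=1e-4)

# Stage 3: train the motion module, everything else frozen
set_trainable(structure_module, False)
set_trainable(motion_module, True)
opt_motion = torch.optim.AdamW(motion_module.parameters(), lr=1e-4)
print(sum(p.requires_grad for p in motion_module.parameters()))  # 2 trainable tensors
```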
2308.14748 Report MagicAvatar: Multimodal Avatar Generation and Animation Jianfeng Zhang, Hanshu Yan, Zhongcong Xu, Jiashi Feng, Jun Hao Liew This report presents MagicAvatar, a framework for multimodal video generation and animation of human avatars. Unlike most existing methods that generate avatar-centric videos directly from multimodal inputs (e.g., text prompts), MagicAvatar explicitly disentangles avatar video generation into two stages: (1) multimodal-to-motion and (2) motion-to-video generation. The first stage translates the multimodal inputs into motion/ control signals (e.g., human pose, depth, DensePose); while the second stage generates avatar-centric video guided by these motion signals. Additionally, MagicAvatar supports avatar animation by simply providing a few images of the target person. This capability enables the animation of the provided human identity according to the specific motion derived from the first stage. We demonstrate the flexibility of MagicAvatar through various applications, including text-guided and video-guided avatar generation, as well as multimodal avatar animation. MagicAvatar, a two-stage framework for multimodal avatar generation and animation, enabling the creation of avatar-centric videos from text, video, or audio inputs. Addresses the increasing demand for flexible and user-friendly avatar generation tools in virtual reality, gaming, and social media. Disentangles avatar video generation into multimodal-to-motion and motion-to-video stages; utilizes off-the-shelf models for multimodal-to-motion conversion and leverages MagicEdit for motion-to-video generation; enables identity personalization via DreamBooth for animating specific subjects. Generates realistic and temporally-coherent avatar videos from text prompts. Creates avatar videos mimicking motions from source videos. Allows animating specific subjects using various input modalities. Relies on the performance of off-the-shelf models for multimodal-to-motion generation. Limited control over fine-grained details of the generated motion. avatar generation, avatar animation, multimodal learning, text-to-video, video-to-video
2308.14740 Report Total Selfie: Generating Full-Body Selfies Bowei Chen, Brian Curless, Ira Kemelmacher-Shlizerman, Steven M. Seitz We present a method to generate full-body selfies from photographs originally taken at arm's length. Because self-captured photos are typically taken close up, they have limited field of view and exaggerated perspective that distorts facial shapes. We instead seek to generate the photo someone else would take of you from a few feet away. Our approach takes as input four selfies of your face and body, a background image, and generates a full-body selfie in a desired target pose. We introduce a novel diffusion-based approach to combine all of this information into high-quality, well-composed photos of you with the desired pose and background. Introduces "total selfie", a new type of self-captured photo that captures the entire body in a scene, and proposes a diffusion-based framework to generate it from four selfies, a background image, and a target pose. Selfies have limited field of view, distorted perspectives, and pose compositional challenges. Total selfies aim to capture full-body images as if taken by someone else, addressing these limitations. Trains a selfie-conditioned inpainting model on a synthetic dataset of selfies and full-body images. At test time, performs face undistortion, automatically selects target pose from user's photo collection, and fine-tunes the model per capture for enhanced fidelity. Generates high-quality full-body selfies with accurate poses, expressions, and clothing, even with significant pose differences between input and target. Outperforms adapted baseline methods, including Paint-By-Example, DisCo, LaDI-VTON, and DreamBooth, in both qualitative and quantitative comparisons. Demonstrates the trade-offs of using different target pose options (no condition, OpenPose skeleton, Canny Edge) for controlling pose and body shape. Shading of the generated body may not perfectly match the real photo. Struggles to accurately generate hard shadows under strong sunlight due to difficulty in inferring sun direction and scene geometry from the background image alone. total selfie, full-body selfie generation, diffusion models, image inpainting, pose control
2308.14737 Report Flexible Techniques for Differentiable Rendering with 3D Gaussians Leonid Keselman, Martial Hebert Fast, reliable shape reconstruction is an essential ingredient in many computer vision applications. Neural Radiance Fields demonstrated that photorealistic novel view synthesis is within reach, but was gated by performance requirements for fast reconstruction of real scenes and objects. Several recent approaches have built on alternative shape representations, in particular, 3D Gaussians. We develop extensions to these renderers, such as integrating differentiable optical flow, exporting watertight meshes and rendering per-ray normals. Additionally, we show how two of the recent methods are interoperable with each other. These reconstructions are quick, robust, and easily performed on GPU or CPU. For code and visual examples, see https://leonidk.github.io/fmb-plus This table presents the rendering runtimes for forward passes using a proposed method on two different datasets: Ficus and CO3D Teddy. The comparison of CPU and GPU runtimes highlights the efficiency of the method, particularly on the GPU. The method is evaluated by measuring the time taken for blending and compositing operations on both CPU and GPU. The GPU significantly outperforms the CPU for both datasets. The method demonstrates faster rendering times on the simpler CO3D Teddy dataset compared to the more complex Ficus dataset. The small differences in GPU runtimes for different datasets suggest a potential memory bottleneck in the current implementation. The study is limited to two datasets. The potential memory bottleneck requires further investigation and optimization. rendering, gpu, runtime, neural rendering, gaussian
2308.14713 Report R3D3: Dense 3D Reconstruction of Dynamic Scenes from Multiple Cameras Aron Schmied, Tobias Fischer, Martin Danelljan, Marc Pollefeys, Fisher Yu Dense 3D reconstruction and ego-motion estimation are key challenges in autonomous driving and robotics. Compared to the complex, multi-modal systems deployed today, multi-camera systems provide a simpler, low-cost alternative. However, camera-based 3D reconstruction of complex dynamic scenes has proven extremely difficult, as existing solutions often produce incomplete or incoherent results. We propose R3D3, a multi-camera system for dense 3D reconstruction and ego-motion estimation. Our approach iterates between geometric estimation that exploits spatial-temporal information from multiple cameras, and monocular depth refinement. We integrate multi-camera feature correlation and dense bundle adjustment operators that yield robust geometric depth and pose estimates. To improve reconstruction where geometric depth is unreliable, e.g. for moving objects or low-textured regions, we introduce learnable scene priors via a depth refinement network. We show that this design enables a dense, consistent 3D reconstruction of challenging, dynamic outdoor environments. Consequently, we achieve state-of-the-art dense depth prediction on the DDAD and NuScenes benchmarks. R3D3 is a novel multi-camera system for dense 3D reconstruction and ego-motion estimation that leverages both spatial and temporal information in dynamic outdoor environments. Camera-based 3D reconstruction is crucial for applications like autonomous driving and robotics, providing a simpler and low-cost alternative to multi-modal systems. However, existing solutions often fail to produce complete and consistent 3D reconstructions of complex dynamic scenes. The proposed system iterates between geometric depth estimation from multi-camera feature correspondences and monocular depth refinement. A multi-camera dense bundle adjustment operator and a multi-camera co-visibility graph are introduced to enable robust depth and pose estimation. A depth refinement network integrates monocular cues with prior geometric depth and uncertainty to improve reconstruction in challenging areas. R3D3 achieves state-of-the-art performance on the DDAD and NuScenes multi-camera depth estimation benchmarks. The multi-camera dense bundle adjustment operator significantly improves depth accuracy and pose estimation robustness compared to a naive implementation. The proposed co-visibility graph construction reduces system runtime by nearly 10x while maintaining performance. The system relies on deep neural networks with downsampling operations, potentially causing loss of high-frequency details, leading to difficulties in reconstructing thin structures. Further research is needed to explore the integration of additional sensor modalities like IMU to further improve pose estimation accuracy. 3d reconstruction, ego-motion estimation, multi-camera systems, dynamic scenes, deep learning
2308.14616 Report VoroMesh: Learning Watertight Surface Meshes with Voronoi Diagrams Nissim Maruani, Roman Klokov, Maks Ovsjanikov, Pierre Alliez, Mathieu Desbrun In stark contrast to the case of images, finding a concise, learnable discrete representation of 3D surfaces remains a challenge. In particular, while polygon meshes are arguably the most common surface representation used in geometry processing, their irregular and combinatorial structure often make them unsuitable for learning-based applications. In this work, we present VoroMesh, a novel and differentiable Voronoi-based representation of watertight 3D shape surfaces. From a set of 3D points (called generators) and their associated occupancy, we define our boundary representation through the Voronoi diagram of the generators as the subset of Voronoi faces whose two associated (equidistant) generators are of opposite occupancy: the resulting polygon mesh forms a watertight approximation of the target shape's boundary. To learn the position of the generators, we propose a novel loss function, dubbed VoroLoss, that minimizes the distance from ground truth surface samples to the closest faces of the Voronoi diagram which does not require an explicit construction of the entire Voronoi diagram. A direct optimization of the Voroloss to obtain generators on the Thingi32 dataset demonstrates the geometric efficiency of our representation compared to axiomatic meshing algorithms and recent learning-based mesh representations. We further use VoroMesh in a learning-based mesh prediction task from input SDF grids on the ABC dataset, and show comparable performance to state-of-the-art methods while guaranteeing closed output surfaces free of self-intersections. VoroMesh, a novel differentiable Voronoi-based representation for watertight 3D surface meshes, along with a new loss function, VoroLoss, that minimizes the distance from surface samples to Voronoi facets without explicitly constructing the entire Voronoi diagram. Finding a concise and learnable representation for 3D surfaces suitable for learning-based applications is challenging, and VoroMesh addresses this by providing a differentiable and efficient way to represent watertight surfaces. VoroMesh optimizes the positions of 3D points called generators to fit a target surface. The surface is then extracted as a subset of the Voronoi diagram of these generators, determined by their assigned occupancies. The VoroLoss leverages geometric properties of Voronoi diagrams to efficiently optimize generator positions. VoroMesh outperforms Marching Cubes, Dual Contouring, and two recent learning-based methods in terms of geometric fidelity when fitting to ground truth surfaces. VoroMesh is robust to noise, making it suitable for learning-based applications. In a learning-based mesh prediction task from SDF grids, VoroMesh achieves comparable performance to state-of-the-art while guaranteeing closed, non-self-intersecting output meshes. Small Voronoi faces can create surface artifacts, requiring post-processing. The initialization of generators could be improved for better efficiency. 3d shape representation, surface reconstruction, voronoi diagram, differentiable geometry, deep learning
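Since Voronoi faces lie on the bisector planes of generator pairs, a VoroLoss-style objective reduces to measuring the distance from each surface sample to such a bisector. The sketch below computes that distance differentiably with respect to the generators and optimizes a toy set of generators against sphere samples; it is a simplified reading of the idea (nearest two generators only), with tensor names and the optimization setup assumed by us rather than taken from the paper.

```python
import torch

def bisector_distance(samples, g_a, g_b):
    """Unsigned distance from each sample to the bisector plane of the generator pair (g_a, g_b).

    samples, g_a, g_b: (N, 3) tensors. The bisector plane passes through (g_a + g_b) / 2
    with normal (g_b - g_a) / ||g_b - g_a||.
    """
    midpoint = 0.5 * (g_a + g_b)
    normal = torch.nn.functional.normalize(g_b - g_a, dim=-1)
    return ((samples - midpoint) * normal).sum(dim=-1).abs()

# Toy setup: move generators so that their Voronoi bisectors pass through surface samples.
torch.manual_seed(0)
generators = torch.randn(64, 3, requires_grad=True)
surface_samples = torch.nn.functional.normalize(torch.randn(512, 3), dim=-1)  # unit sphere

opt = torch.optim.Adam([generators], lr=1e-2)
for _ in range(100):
    dists = torch.cdist(surface_samples, generators)        # (512, 64) point-to-generator distances
    idx = dists.topk(2, largest=False).indices               # two nearest generators per sample
    g_a, g_b = generators[idx[:, 0]], generators[idx[:, 1]]
    loss = bisector_distance(surface_samples, g_a, g_b).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```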
2308.14267 Report Unleash Model Potential: Bootstrapped Meta Self-supervised Learning Jingyao Wang, Zeen Song, Wenwen Qiang, Changwen Zheng The long-term goal of machine learning is to learn general visual representations from a small amount of data without supervision, mimicking three advantages of human cognition: i) no need for labels, ii) robustness to data scarcity, and iii) learning from experience. Self-supervised learning and meta-learning are two promising techniques to achieve this goal, but they both only partially capture the advantages and fail to address all the problems. Self-supervised learning struggles to overcome the drawbacks of data scarcity, while ignoring prior knowledge that can facilitate learning and generalization. Meta-learning relies on supervised information and suffers from a bottleneck of insufficient learning. To address these issues, we propose a novel Bootstrapped Meta Self-Supervised Learning (BMSSL) framework that aims to simulate the human learning process. We first analyze the close relationship between meta-learning and self-supervised learning. Based on this insight, we reconstruct tasks to leverage the strengths of both paradigms, achieving advantages i and ii. Moreover, we employ a bi-level optimization framework that alternates between solving specific tasks with a learned ability (first level) and improving this ability (second level), attaining advantage iii. To fully harness its power, we introduce a bootstrapped target based on meta-gradient to make the model its own teacher. We validate the effectiveness of our approach with comprehensive theoretical and empirical study. This paper proposes Bootstrapped Meta Self-Supervised Learning (BMSSL), a novel framework that combines self-supervised and meta-learning to learn general visual representations from limited data without supervision, mimicking human-like learning. The goal is to overcome limitations of existing self-supervised and meta-learning methods in addressing data scarcity and incorporating prior knowledge for efficient learning and generalization. BMSSL reconstructs self-supervised tasks into few-shot classification problems using data augmentation. It employs a bi-level optimization: inner loop for task-specific learning with contrastive loss and outer loop for meta-learning optimal initialization using a bootstrapped target based on meta-gradient. BMSSL achieves superior performance on standard and cross-domain few-shot classification benchmarks, outperforming previous unsupervised meta-learning methods. It exhibits competitive generalization capability compared to supervised meta-learning and self-supervised baselines. Theoretical analysis provides performance guarantees for BMSSL's task construction and bootstrapped meta-training. The evaluation primarily focuses on visual tasks, without exploring its effectiveness in other domains like reinforcement learning or language processing. Future work includes investigating its applicability to a wider range of tasks beyond classification, such as regression and generation. self-supervised learning, meta-learning, few-shot learning, representation learning, computer vision
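The bi-level alternation described above, solving tasks with the current ability in an inner loop and improving that ability in an outer loop, can be illustrated with a generic MAML-style sketch on a toy regression problem. This is not the BMSSL algorithm itself (it omits the contrastive task construction and the bootstrapped meta-gradient target); all names and hyperparameters are assumptions for illustration.

```python
import torch

# Toy model: a single scalar weight whose initialization (the "learned ability") is meta-learned.
w = torch.zeros(1, requires_grad=True)
meta_opt = torch.optim.SGD([w], lr=1e-2)
inner_lr = 0.1

def sample_task():
    """A toy task: fit y = a * x for a task-specific slope a."""
    a = torch.randn(())
    def batch(n=16):
        x = torch.randn(n)
        return x, a * x
    return batch

for step in range(200):
    meta_loss = 0.0
    for _ in range(4):                                # a small batch of tasks
        batch = sample_task()
        x_s, y_s = batch()                            # support set (first level: adapt)
        x_q, y_q = batch()                            # query set (second level: evaluate adaptation)
        inner_loss = ((w * x_s - y_s) ** 2).mean()
        (grad_w,) = torch.autograd.grad(inner_loss, w, create_graph=True)
        w_task = w - inner_lr * grad_w                # task-adapted parameters
        meta_loss = meta_loss + ((w_task * x_q - y_q) ** 2).mean()
    meta_opt.zero_grad()
    meta_loss.backward()                              # improve the shared initialization
    meta_opt.step()
```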
2308.14244 Report HoloFusion: Towards Photo-realistic 3D Generative Modeling Animesh Karnewar, Niloy J. Mitra, Andrea Vedaldi, David Novotny Diffusion-based image generators can now produce high-quality and diverse samples, but their success has yet to fully translate to 3D generation: existing diffusion methods can either generate low-resolution but 3D consistent outputs, or detailed 2D views of 3D objects but with potential structural defects and lacking view consistency or realism. We present HoloFusion, a method that combines the best of these approaches to produce high-fidelity, plausible, and diverse 3D samples while learning from a collection of multi-view 2D images only. The method first generates coarse 3D samples using a variant of the recently proposed HoloDiffusion generator. Then, it independently renders and upsamples a large number of views of the coarse 3D model, super-resolves them to add detail, and distills those into a single, high-fidelity implicit 3D representation, which also ensures view consistency of the final renders. The super-resolution network is trained as an integral part of HoloFusion, end-to-end, and the final distillation uses a new sampling scheme to capture the space of super-resolved signals. We compare our method against existing baselines, including DreamFusion, Get3D, EG3D, and HoloDiffusion, and achieve, to the best of our knowledge, the most realistic results on the challenging CO3Dv2 dataset. This paper proposes HoloFusion, a method that combines a 3D diffusion model (HoloDiffusion) with a jointly trained 2D super-resolution network to generate high-fidelity 3D radiance fields from multi-view 2D images. Current 3D generation methods struggle to achieve both high resolution and 3D consistency. HoloFusion addresses these limitations by leveraging the strengths of both 2D and 3D diffusion models. HoloFusion first generates coarse 3D models using a modified HoloDiffusion. Then, it renders multiple views, super-resolves them using a 2D diffusion model, and finally distills the super-resolved images into a single high-resolution 3D model using a novel patch-based optimization strategy. Achieves state-of-the-art results on the CO3Dv2 dataset, outperforming baselines like DreamFusion, EG3D, and Get3D in terms of realism and view consistency. Demonstrates the effectiveness of integrating 2D super-resolution with 3D diffusion for high-quality 3D generation. Proposes a novel patch-based distillation technique that improves the fusion of multiple super-resolved views into a coherent 3D model. The generation process is slow due to the distillation step. The method doesn't explicitly generate a surface representation like a mesh. 3d generation, diffusion models, super-resolution, neural radiance fields, view consistency
2308.14078 Report Sparse3D: Distilling Multiview-Consistent Diffusion for Object Reconstruction from Sparse Views Zi-Xin Zou, Weihao Cheng, Yan-Pei Cao, Shi-Sheng Huang, Ying Shan, Song-Hai Zhang Reconstructing 3D objects from extremely sparse views is a long-standing and challenging problem. While recent techniques employ image diffusion models for generating plausible images at novel viewpoints or for distilling pre-trained diffusion priors into 3D representations using score distillation sampling (SDS), these methods often struggle to simultaneously achieve high-quality, consistent, and detailed results for both novel-view synthesis (NVS) and geometry. In this work, we present Sparse3D, a novel 3D reconstruction method tailored for sparse view inputs. Our approach distills robust priors from a multiview-consistent diffusion model to refine a neural radiance field. Specifically, we employ a controller that harnesses epipolar features from input views, guiding a pre-trained diffusion model, such as Stable Diffusion, to produce novel-view images that maintain 3D consistency with the input. By tapping into 2D priors from powerful image diffusion models, our integrated model consistently delivers high-quality results, even when faced with open-world objects. To address the blurriness introduced by conventional SDS, we introduce the category-score distillation sampling (C-SDS) to enhance detail. We conduct experiments on CO3DV2 which is a multi-view dataset of real-world objects. Both quantitative and qualitative evaluations demonstrate that our approach outperforms previous state-of-the-art works on the metrics regarding NVS and geometry reconstruction. This paper presents Sparse3D, a novel 3D reconstruction method that leverages multiview-consistent diffusion models to refine neural radiance fields for high-fidelity 3D object reconstruction from sparse views. Reconstructing 3D objects from sparse view images is crucial for applications like AR/VR, but existing methods struggle to generate consistent and detailed results, especially for novel view synthesis and geometry reconstruction. The method uses epipolar features from input views to guide a pre-trained diffusion model (Stable Diffusion) to generate consistent novel views. It introduces a category-score distillation sampling (C-SDS) strategy to enhance details in the reconstructed NeRF, addressing blurriness common in existing SDS methods. Sparse3D outperforms state-of-the-art methods in novel view synthesis quality and geometry reconstruction on the CO3DV2 dataset. It exhibits superior generalization, producing high-quality results even for unseen object categories. The proposed C-SDS strategy effectively enhances details in the reconstructed NeRF compared to traditional SDS methods. Limitations include challenges with extremely partial object observations and occasional occurrences of the Janus problem. Future work may explore more efficient 3D representations or feed-forward model priors for improved computational efficiency. 3d reconstruction, sparse view synthesis, neural radiance fields, diffusion models, score distillation sampling
2308.13897 Report InsertNeRF: Instilling Generalizability into NeRF with HyperNet Modules Yanqi Bao, Tianyu Ding, Jing Huo, Wenbin Li, Yuxin Li, Yang Gao Generalizing Neural Radiance Fields (NeRF) to new scenes is a significant challenge that existing approaches struggle to address without extensive modifications to vanilla NeRF framework. We introduce InsertNeRF, a method for INStilling gEneRalizabiliTy into NeRF. By utilizing multiple plug-and-play HyperNet modules, InsertNeRF dynamically tailors NeRF's weights to specific reference scenes, transforming multi-scale sampling-aware features into scene-specific representations. This novel design allows for more accurate and efficient representations of complex appearances and geometries. Experiments show that this method not only achieves superior generalization performance but also provides a flexible pathway for integration with other NeRF-like systems, even in sparse input settings. Code will be available https://github.com/bbbbby-99/InsertNeRF. This paper introduces InsertNeRF, a novel method that instills generalizability into Neural Radiance Fields (NeRF) by using plug-and-play HyperNet modules to dynamically adapt NeRF's weights to specific scenes based on reference images. Existing methods for generalizing NeRF to new scenes require significant modifications to the original framework or rely on computationally expensive structures like transformers. InsertNeRF offers a more efficient and flexible alternative by preserving the original NeRF architecture. InsertNeRF leverages multiple HyperNet modules within the NeRF framework. These modules generate scene-specific weights based on multi-scale features extracted from reference images, aggregated using a novel multi-layer dynamic-static strategy. This strategy effectively captures scene details and models occlusions for accurate view synthesis. InsertNeRF achieves state-of-the-art generalization performance on standard benchmarks (NeRF Synthetic, LLFF, DTU) outperforming existing methods in terms of PSNR, SSIM, and LPIPS. The plug-and-play nature of HyperNet modules allows InsertNeRF to be easily integrated with other NeRF-like systems like mip-NeRF and NeRF++, demonstrating its versatility and effectiveness in various scenarios. InsertNeRF shows promising results in view synthesis with sparse inputs, suggesting its potential for applications with limited training data. The current implementation of InsertNeRF requires consistent sampling point numbers during training and evaluation, potentially limiting its rendering performance compared to methods that use different settings. Further exploration is needed to optimize InsertNeRF for sparse input scenarios, including developing fine-tuning strategies for improved results. neural radiance fields, generalizable nerf, hypernetworks, view synthesis, novel view synthesis
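The central mechanism, a HyperNet module that maps aggregated reference-scene features to the weights of a NeRF layer, can be sketched compactly. The example below predicts the weight and bias of a single linear layer from a scene descriptor; the dimensions, pooling, and module names are our own illustrative assumptions rather than InsertNeRF's actual architecture.

```python
import torch
import torch.nn as nn

class HyperLinear(nn.Module):
    """A linear layer whose weights are predicted from a scene feature vector."""

    def __init__(self, in_dim, out_dim, scene_dim):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        self.to_weight = nn.Linear(scene_dim, in_dim * out_dim)
        self.to_bias = nn.Linear(scene_dim, out_dim)

    def forward(self, x, scene_feat):
        # x: (N, in_dim) per-sample features; scene_feat: (scene_dim,) aggregated reference features
        W = self.to_weight(scene_feat).view(self.out_dim, self.in_dim)
        b = self.to_bias(scene_feat)
        return torch.relu(x @ W.t() + b)

layer = HyperLinear(in_dim=63, out_dim=128, scene_dim=256)
points = torch.randn(4096, 63)          # e.g. positionally encoded sample coordinates
scene_feat = torch.randn(256)           # e.g. pooled multi-scale features from the reference views
hidden = layer(points, scene_feat)      # (4096, 128), weights tailored to this scene
```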
2308.13812 Report Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Tat-Seng Chua Text-to-video (T2V) synthesis has gained increasing attention in the community, in which the recently emerged diffusion models (DMs) have promisingly shown stronger performance than the past approaches. While existing state-of-the-art DMs are competent to achieve high-resolution video generation, they may largely suffer from key limitations (e.g., action occurrence disorders, crude video motions) with respect to the intricate temporal dynamics modeling, one of the cruxes of video synthesis. In this work, we investigate strengthening the awareness of video dynamics for DMs, for high-quality T2V generation. Inspired by human intuition, we design an innovative dynamic scene manager (dubbed Dysen) module, which includes (step-1) extracting from input text the key actions with proper time-order arrangement, (step-2) transforming the action schedules into the dynamic scene graph (DSG) representations, and (step-3) enriching the scenes in the DSG with sufficient and reasonable details. Taking advantage of the existing powerful LLMs (e.g., ChatGPT) via in-context learning, Dysen realizes (nearly) human-level temporal dynamics understanding. Finally, the resulting video DSG with rich action scene details is encoded as fine-grained spatio-temporal features, integrated into the backbone T2V DM for video generation. Experiments on popular T2V datasets suggest that our Dysen-VDM consistently outperforms prior arts with significant margins, especially in scenarios with complex actions. Code is available at https://haofei.vip/Dysen-VDM Presents Dysen-VDM, a dynamics-aware text-to-video diffusion model that leverages LLMs for improved temporal dynamics modeling, addressing issues like action disorders and crude motions in existing methods. Existing text-to-video synthesis methods, while achieving high resolution, often overlook the crucial aspect of intricate temporal dynamics modeling, leading to unrealistic and low-quality video generation. The Dysen module extracts key actions from text, converts them into Dynamic Scene Graphs (DSGs), and enriches these DSGs with details using ChatGPT (LLM) via in-context learning. A recurrent graph Transformer encodes the enriched DSGs into fine-grained features, integrated into a backbone video diffusion model for enhanced generation. Dysen-VDM significantly outperforms prior arts on UCF-101, MSR-VTT, and ActivityNet datasets, especially in action-complex scenarios. Human evaluation confirms superior performance in action faithfulness, scene richness, and movement fluency. Ablation studies validate the contributions of the Dysen module, scene imagination, and RGTrm. LLM hallucinations can occasionally lead to scene understanding errors, impacting video quality. DSG-based scene representation may not be suitable for all video styles, such as abstract or cartoon-style content. text-to-video synthesis, diffusion models, dynamic scene graphs, large language models, temporal dynamics modeling
2308.13680 Report ACC-UNet: A Completely Convolutional UNet model for the 2020s Nabil Ibtehaz, Daisuke Kihara This decade is marked by the introduction of Vision Transformer, a radical paradigm shift in broad computer vision. A similar trend is followed in medical imaging: UNet, one of the most influential architectures, has been redesigned with transformers. Recently, the efficacy of convolutional models in vision is being reinvestigated by seminal works such as ConvNext, which elevates a ResNet to Swin Transformer level. Deriving inspiration from this, we aim to improve a purely convolutional UNet model so that it can be on par with the transformer-based models, e.g., Swin-Unet or UCTransNet. We examined several advantages of the transformer-based UNet models, primarily long-range dependencies and cross-level skip connections. We attempted to emulate them through convolution operations and thus propose ACC-UNet, a completely convolutional UNet model that brings the best of both worlds, the inherent inductive biases of convnets with the design decisions of transformers. ACC-UNet was evaluated on 5 different medical image segmentation benchmarks and consistently outperformed convnets, transformers, and their hybrids. Notably, ACC-UNet outperforms state-of-the-art models Swin-Unet and UCTransNet by $2.64 \pm 2.54\%$ and $0.45 \pm 1.61\%$ in terms of dice score, respectively, while using a fraction of their parameters ($59.26\%$ and $24.24\%$). Our codes are available at https://github.com/kiharalab/ACC-UNet. ACC-UNet, a novel fully convolutional UNet model for medical image segmentation, incorporating design principles from transformers, namely long-range dependency through hierarchical neighborhood context aggregation (HANC) and multi-level feature combination via multi-level feature compilation (MLFC) in the skip connections. Existing UNet models either rely solely on convolutions or incorporate transformers, lacking a solution that effectively integrates the strengths of both. The authors designed HANC blocks with inverted bottlenecks and hierarchical neighborhood context aggregation to mimic the long-range dependency achieved by self-attention in transformers. Additionally, they introduced MLFC blocks in the skip connections to fuse feature maps from multiple encoder levels, inspired by transformer-based UNets. ACC-UNet consistently outperformed convolutional, transformer-based, and hybrid UNet models on five different medical image segmentation benchmarks. It surpassed state-of-the-art models like Swin-Unet and UCTransNet in terms of dice score while utilizing significantly fewer parameters. Qualitative results demonstrate ACC-UNet's ability to accurately segment regions of interest, effectively capturing boundaries and distinguishing between different tissues. The reliance on concatenation operations in ACC-UNet leads to computational slowdown, which the authors aim to address in future work through optimized implementations. Further exploration of transformer-inspired innovations, such as layer normalization, GELU activation, and AdamW optimizer, is planned to further enhance the model's performance. unet, medical image segmentation, convolutional neural networks, transformers, deep learning
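The convolutional stand-in for long-range dependency, letting each position compare itself against pooled summaries of progressively larger neighborhoods, can be sketched as follows. This is a simplified block in the spirit of hierarchical neighborhood context aggregation, not the exact HANC/MLFC design; channel counts, the number of scales, and the fusion layer are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeighborhoodContextBlock(nn.Module):
    """Concatenate a feature map with pooled summaries of 2^1 ... 2^k neighborhoods, then fuse."""

    def __init__(self, channels, k=3):
        super().__init__()
        self.k = k
        self.fuse = nn.Sequential(
            nn.Conv2d(channels * (k + 1), channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.GELU(),
        )

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [x]
        for i in range(1, self.k + 1):
            pooled = F.avg_pool2d(x, kernel_size=2 ** i)               # coarser neighborhood summary
            feats.append(F.interpolate(pooled, size=(h, w), mode="nearest"))
        return self.fuse(torch.cat(feats, dim=1))                       # pointwise fusion of all scales

block = NeighborhoodContextBlock(channels=32, k=3)
x = torch.randn(1, 32, 64, 64)
y = block(x)                                                            # (1, 32, 64, 64)
```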
2308.13404 Report Relighting Neural Radiance Fields with Shadow and Highlight Hints Chong Zeng, Guojun Chen, Yue Dong, Pieter Peers, Hongzhi Wu, Xin Tong This paper presents a novel neural implicit radiance representation for free viewpoint relighting from a small set of unstructured photographs of an object lit by a moving point light source different from the view position. We express the shape as a signed distance function modeled by a multi layer perceptron. In contrast to prior relightable implicit neural representations, we do not disentangle the different reflectance components, but model both the local and global reflectance at each point by a second multi layer perceptron that, in addition to density features, the current position, the normal (from the signed distance function), view direction, and light position, also takes shadow and highlight hints to aid the network in modeling the corresponding high frequency light transport effects. These hints are provided as a suggestion, and we leave it up to the network to decide how to incorporate these in the final relit result. We demonstrate and validate our neural implicit representation on synthetic and real scenes exhibiting a wide variety of shapes, material properties, and global illumination light transport. This paper introduces a novel neural implicit radiance representation for free viewpoint relighting of objects and scenes using a small set of unstructured photographs. Existing relighting methods for neural implicit representations often require a large number of images, rely on simplified lighting or BRDF models, and have difficulty handling complex light transport effects. The method uses two MLPs: one for modeling the SDF (as in NeuS) and another for modeling radiance. It incorporates shadow and highlight hints to guide the radiance MLP in capturing high-frequency light transport effects. The model is trained jointly using an image reconstruction loss and an SDF regularization loss. The method achieves high-quality relighting with a small number of input images (~500) compared to prior work. Shadow and highlight hints are shown to be crucial for accurately reproducing these effects. The method effectively handles complex shapes, materials, and global illumination effects. The method struggles with highly specular surfaces reflecting the scene due to limitations in surface normal accuracy. Exploring alternative approaches to handle other high-frequency light transport effects beyond shadows and highlights is left for future work. relighting, free-viewpoint, neural implicit modeling, neural radiance fields, light transport hints
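The radiance network described above is essentially an MLP whose input concatenates geometry and lighting quantities with the precomputed shadow and highlight hints. A minimal sketch of that input layout follows; the feature sizes, hint encoding, and layer widths are assumptions for illustration, not the paper's exact network.

```python
import torch
import torch.nn as nn

class HintedRadianceMLP(nn.Module):
    def __init__(self, feat_dim=64, hint_dim=2, hidden=128):
        super().__init__()
        # position (3) + normal (3) + view dir (3) + light pos (3) + density features + hints
        in_dim = 3 + 3 + 3 + 3 + feat_dim + hint_dim
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),          # RGB radiance
        )

    def forward(self, pos, normal, view_dir, light_pos, density_feat, hints):
        x = torch.cat([pos, normal, view_dir, light_pos, density_feat, hints], dim=-1)
        return self.net(x)

mlp = HintedRadianceMLP()
n = 1024
rgb = mlp(torch.randn(n, 3), torch.randn(n, 3), torch.randn(n, 3),
          torch.randn(n, 3), torch.randn(n, 64),
          torch.rand(n, 2))   # hints: e.g. a shadow value and a highlight value per sample
```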
2308.13266 Report Integrating Boxes and Masks: A Multi-Object Framework for Unified Visual Tracking and Segmentation Yuanyou Xu, Zongxin Yang, Yi Yang Tracking any given object(s) spatially and temporally is a common purpose in Visual Object Tracking (VOT) and Video Object Segmentation (VOS). Joint tracking and segmentation have been attempted in some studies but they often lack full compatibility of both box and mask in initialization and prediction, and mainly focus on single-object scenarios. To address these limitations, this paper proposes a Multi-object Mask-box Integrated framework for unified Tracking and Segmentation, dubbed MITS. Firstly, the unified identification module is proposed to support both box and mask reference for initialization, where detailed object information is inferred from boxes or directly retained from masks. Additionally, a novel pinpoint box predictor is proposed for accurate multi-object box prediction, facilitating target-oriented representation learning. All target objects are processed simultaneously from encoding to propagation and decoding, as a unified pipeline for VOT and VOS. Experimental results show MITS achieves state-of-the-art performance on both VOT and VOS benchmarks. Notably, MITS surpasses the best prior VOT competitor by around 6% on the GOT-10k test set, and significantly improves the performance of box initialization on VOS benchmarks. The code is available at https://github.com/yoxu515/MITS. Presents MITS, a multi-object framework integrating boxes and masks for unified visual object tracking and segmentation. Prior works lack full compatibility of box and mask representations, and mainly focus on single-object scenarios. This work aims to unify visual object tracking and segmentation in a multi-object framework. MITS leverages a unified identification module for box/mask initialization and a pinpoint box predictor for accurate box prediction, achieving simultaneous multi-object processing in an encoding-propagation-decoding pipeline. Achieves state-of-the-art performance on VOT benchmarks (LaSOT, TrackingNet, GOT-10k) and VOS benchmark (YouTube-VOS). Surpasses the best prior VOT competitor by around 6% on GOT-10k test set. Significantly improves the performance of box initialization on VOS benchmarks. The pinpoint box predictor might not generalize well to objects with complex shapes. The model's efficiency could be further improved for real-time applications with high frame rate videos. visual object tracking, video object segmentation, multi-object tracking, deep learning, computer vision
2308.13252 Report Kissing to Find a Match: Efficient Low-Rank Permutation Representation Hannah Dröge, Zorah Lähner, Yuval Bahat, Onofre Martorell, Felix Heide, Michael Möller Permutation matrices play a key role in matching and assignment problems across the fields, especially in computer vision and robotics. However, memory for explicitly representing permutation matrices grows quadratically with the size of the problem, prohibiting large problem instances. In this work, we propose to tackle the curse of dimensionality of large permutation matrices by approximating them using low-rank matrix factorization, followed by a nonlinearity. To this end, we rely on the Kissing number theory to infer the minimal rank required for representing a permutation matrix of a given size, which is significantly smaller than the problem size. This leads to a drastic reduction in computation and memory costs, e.g., up to $3$ orders of magnitude less memory for a problem of size $n=20000$, represented using $8.4\times10^5$ elements in two small matrices instead of using a single huge matrix with $4\times 10^8$ elements. The proposed representation allows for accurate representations of large permutation matrices, which in turn enables handling large problems that would have been infeasible otherwise. We demonstrate the applicability and merits of the proposed approach through a series of experiments on a range of problems that involve predicting permutation matrices, from linear and quadratic assignment to shape matching problems. This paper introduces a memory-efficient representation for permutation matrices, especially beneficial for large-scale matching and assignment problems in computer vision and robotics. Explicitly representing permutation matrices incurs quadratic memory growth with problem size, rendering large instances infeasible. The proposed method overcomes this limitation, enabling handling of previously intractable problem sizes. The core idea is to approximate permutation matrices using low-rank matrix factorization followed by a nonlinearity (ReLU or Softmax). The Kissing number theory guides the determination of the minimal rank needed, significantly smaller than the problem size. A stochastic optimization strategy further enhances memory efficiency. The method allows accurate representation of large permutation matrices with significantly reduced memory footprint (e.g., 3 orders of magnitude reduction for n=20000). Experiments on point cloud alignment, linear/quadratic assignment problems, and shape matching demonstrate the applicability and effectiveness of the approach. The approach allows for a trade-off between accuracy and memory usage, enabling handling of high-resolution data. Stochastic learning might not be suitable for all problem formulations, such as specific QAP forms. The method might require non-trivial, problem-specific adaptations for successful application. permutation matrix, low-rank representation, kissing number, stochastic optimization, shape matching
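Concretely, the representation replaces an n x n permutation matrix with two n x d factors and a row-wise nonlinearity, where d is far smaller than n. A small sketch follows; the temperature and the softmax variant are our own choices, and d = 21 is picked only so the factor count matches the 8.4x10^5 elements quoted in the abstract.

```python
import torch

def soft_permutation(U, V, temperature=0.05):
    """Row-stochastic approximation of a permutation from low-rank factors; sharper as temperature -> 0."""
    return torch.softmax((U @ V.t()) / temperature, dim=-1)

# Small example: a 6 x 6 soft permutation from rank-3 factors.
U, V = torch.randn(6, 3), torch.randn(6, 3)
P = soft_permutation(U, V)
print(P.sum(dim=-1))            # each row sums to 1

# Memory at the scale quoted in the abstract: n = 20000, d = 21 gives 2 * n * d = 840,000 stored
# elements instead of n * n = 400,000,000 for the explicit matrix.
n, d = 20000, 21
U_big = torch.randn(n, d) / d ** 0.5
V_big = torch.randn(n, d) / d ** 0.5
rows = torch.randint(0, n, (8,))
row_probs = torch.softmax((U_big[rows] @ V_big.t()) / 0.05, dim=-1)   # evaluate only the rows you need
```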
2308.13223 Report EfficientDreamer: High-Fidelity and Robust 3D Creation via Orthogonal-view Diffusion Prior Zhipeng Hu, Minda Zhao, Chaoyi Zhao, Xinyue Liang, Lincheng Li, Zeng Zhao, Changjie Fan, Xiaowei Zhou, Xin Yu While image diffusion models have made significant progress in text-driven 3D content creation, they often fail to accurately capture the intended meaning of text prompts, especially for view information. This limitation leads to the Janus problem, where multi-faced 3D models are generated under the guidance of such diffusion models. In this paper, we propose a robust high-quality 3D content generation pipeline by exploiting orthogonal-view image guidance. First, we introduce a novel 2D diffusion model that generates an image consisting of four orthogonal-view sub-images based on the given text prompt. Then, the 3D content is created using this diffusion model. Notably, the generated orthogonal-view image provides strong geometric structure priors and thus improves 3D consistency. As a result, it effectively resolves the Janus problem and significantly enhances the quality of 3D content creation. Additionally, we present a 3D synthesis fusion network that can further improve the details of the generated 3D contents. Both quantitative and qualitative evaluations demonstrate that our method surpasses previous text-to-3D techniques. Project page: https://efficientdreamer.github.io. Presents EfficientDreamer, a method for high-fidelity and stable text-to-3D creation using orthogonal-view diffusion priors to address the Janus problem (inconsistent 3D generation from text prompts, especially view information). Existing text-to-3D methods often fail to accurately capture view instructions in text prompts, leading to inconsistent 3D models (e.g., multi-faced). This work aims to improve the stability and quality of 3D content creation. Introduces an orthogonal-view diffusion model trained on a large 3D dataset (Objaverse) to generate composite images with consistent orthogonal views. Uses this model as a prior, along with a pre-trained text-to-image diffusion model, in a two-stage coarse-to-fine optimization process to generate 3D models. A 3D synthesis fusion network dynamically balances the guidance from both diffusion models. Effectively resolves the Janus problem by enforcing 3D consistency through orthogonal-view supervision. Achieves superior 3D content quality compared to state-of-the-art methods, as demonstrated by quantitative metrics (CLIP score, FID) and user studies. Shows the benefits of a two-stage optimization process and the dynamic fusion of orthogonal-view and text-to-image diffusion priors. The scale of the 3D dataset used to train the orthogonal-view diffusion model is limited compared to text-image datasets. Future work could explore alternative view supervision strategies or incorporate more diverse 3D data. text-to-3d, diffusion models, orthogonal-view supervision, janus problem, 3d content creation
2308.13218 Report MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning Bang Yang, Fenglin Liu, Xian Wu, Yaowei Wang, Xu Sun, Yuexian Zou Supervised visual captioning models typically require a large scale of images or videos paired with descriptions in a specific language (i.e., the vision-caption pairs) for training. However, collecting and labeling large-scale datasets is time-consuming and expensive for many scenarios and languages. Therefore, sufficient labeled pairs are usually not available. To deal with the label shortage problem, we present a simple yet effective zero-shot approach MultiCapCLIP that can generate visual captions for different scenarios and languages without any labeled vision-caption pairs of downstream datasets. In the training stage, MultiCapCLIP only requires text data for input. Then it conducts two main steps: 1) retrieving concept prompts that preserve the corresponding domain knowledge of new scenarios; 2) auto-encoding the prompts to learn writing styles to output captions in a desired language. In the testing stage, MultiCapCLIP instead takes visual data as input directly to retrieve the concept prompts to generate the final visual descriptions. The extensive experiments on image and video captioning across four benchmarks and four languages (i.e., English, Chinese, German, and French) confirm the effectiveness of our approach. Compared with state-of-the-art zero-shot and weakly-supervised methods, our method achieves 4.8% and 21.5% absolute improvements in terms of BLEU@4 and CIDEr metrics. Our code is available at https://github.com/yangbang18/MultiCapCLIP. Presents MultiCapCLIP, a simple yet effective zero-shot approach for generating visual captions in different scenarios and languages without labeled vision-caption pairs. Addresses the challenge of limited labeled data in visual captioning, particularly for non-English languages, by enabling zero-shot multilingual caption generation. Utilizes a CLIP-based vision-language model with a prompt-based auto-encoder. It retrieves concept prompts preserving domain knowledge and auto-encodes them to learn writing styles for captioning. During inference, visual input is used to retrieve prompts and generate descriptions. Achieves competitive performance on zero-shot multilingual visual captioning across English and Chinese, outperforming previous methods reliant on large datasets. Significantly outperforms existing zero-shot and weakly-supervised methods in in-domain experiments. Demonstrates robustness by effectively extending to German and French image captioning. Requires independent text data for training, which might be challenging to collect for some low-resource languages. Relies on CLIP for measuring text similarity, which may not be optimal for intra-modal retrieval and could be improved with better models. zero-shot learning, visual captioning, multilingual captioning, clip, prompt-based learning
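The concept-prompt retrieval step is a nearest-neighbor lookup in a shared embedding space: embed the input (a caption during training, an image at test time), score it against a bank of concept embeddings, and keep the top-k concepts as prompts. The sketch below uses random tensors as stand-ins for CLIP features, so all shapes and names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Stand-ins for CLIP embeddings (in the real pipeline these come from the CLIP text/image encoders).
concept_bank = F.normalize(torch.randn(1000, 512), dim=-1)   # embeddings of candidate concept words
concepts = [f"concept_{i}" for i in range(1000)]

def retrieve_concepts(query_embedding, k=4):
    """Return the k concepts whose embeddings are most similar to the query."""
    q = F.normalize(query_embedding, dim=-1)
    sims = concept_bank @ q                                   # cosine similarity, since both sides are normalized
    topk = sims.topk(k).indices
    return [concepts[i] for i in topk.tolist()]

# Training: the query is a caption embedding; testing: an image embedding of the same dimension.
query = torch.randn(512)
prompt_words = retrieve_concepts(query, k=4)
print(prompt_words)   # these retrieved concepts are fed to the caption decoder as prompts
```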
2308.13175 Report GridPull: Towards Scalability in Learning Implicit Representations from 3D Point Clouds Chao Chen, Yu-Shen Liu, Zhizhong Han Learning implicit representations has been a widely used solution for surface reconstruction from 3D point clouds. The latest methods infer a distance or occupancy field by overfitting a neural network on a single point cloud. However, these methods suffer from a slow inference due to the slow convergence of neural networks and the extensive calculation of distances to surface points, which limits them to small-scale point clouds. To resolve the scalability issue in surface reconstruction, we propose GridPull to improve the efficiency of learning implicit representations from large-scale point clouds. Our novelty lies in the fast inference of a discrete distance field defined on grids without using any neural components. To remedy the loss of the continuity that neural networks would otherwise provide, we introduce a loss function to encourage continuous distances and consistent gradients in the field while pulling queries onto the surface in grid cells near the surface. We use uniform grids for a fast grid search to localize sampled queries, and organize surface points in a tree structure to speed up the calculation of distances to the surface. We do not rely on learning priors or normal supervision during optimization, and achieve superiority over the latest methods in terms of complexity and accuracy. We evaluate our method on shape and scene benchmarks, and report numerical and visual comparisons with the latest methods to justify our effectiveness and superiority. The code is available at https://github.com/chenchao15/GridPull. Proposes GridPull, a method for reconstructing surfaces from large-scale 3D point clouds by efficiently learning implicit representations without neural networks. Addresses the scalability limitations of existing neural implicit representation methods, which struggle with slow inference on large point clouds due to extensive distance calculations and slow neural network convergence. Directly infers a discrete distance field on a grid by pulling sampled queries onto the surface. Introduces a loss function encouraging continuous distances and consistent gradients to compensate for the lack of neural network continuity. Uses uniform grids and a tree structure for efficient nearest neighbor search and distance calculation. Achieves superior accuracy in surface reconstruction compared to state-of-the-art methods on benchmarks like ShapeNet, FAMOUS, SRB, Thingi10K, D-FAUST, 3DScene, SceneNet, 3D-FRONT, Matterport, and KITTI. Demonstrates significantly faster inference speed compared to neural network-based approaches, making it suitable for large-scale point clouds. Shows robustness to noise and ability to handle varying point cloud densities effectively. Current implementation uses a fixed grid resolution, which could be improved with adaptive resolution schemes. Exploring alternative distance field representations beyond grids, such as octrees or hash tables, could further enhance performance. surface reconstruction, implicit representations, 3d point clouds, distance fields, scalability
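The pulling operation at the core of this family of methods moves a query onto the zero level set using the interpolated distance and its gradient, q' = q - d(q) * grad d(q) / ||grad d(q)||. The sketch below applies that rule to a dense grid with trilinear interpolation; the grid stores an analytic sphere distance purely for illustration, and the real method additionally optimizes the grid values with the continuity and consistency terms described above.

```python
import torch
import torch.nn.functional as F

# A 64^3 grid holding distances to a sphere of radius 0.5 (stand-in for the learned distance grid).
R = 64
axis = torch.linspace(-1.0, 1.0, R)
zz, yy, xx = torch.meshgrid(axis, axis, axis, indexing="ij")                  # (D, H, W)
field = (torch.stack([xx, yy, zz], dim=-1).norm(dim=-1) - 0.5)[None, None]    # (1, 1, D, H, W)

def sample_distance(field, q):
    # grid_sample expects normalized coordinates ordered (x, y, z); q is already in [-1, 1]^3.
    g = q.view(1, -1, 1, 1, 3)
    return F.grid_sample(field, g, mode="bilinear", align_corners=True).view(-1)

queries = (torch.rand(2048, 3) * 2 - 1).requires_grad_(True)
d = sample_distance(field, queries)
(grad_d,) = torch.autograd.grad(d.sum(), queries)                             # field gradient at each query
direction = F.normalize(grad_d, dim=-1)
pulled = queries.detach() - d.detach().unsqueeze(-1) * direction              # points pulled onto the surface

print(pulled.norm(dim=-1).mean())                                             # close to 0.5, the sphere radius
```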
2308.13164 Report Diff-Retinex: Rethinking Low-light Image Enhancement with A Generative Diffusion Model Xunpeng Yi, Han Xu, Hao Zhang, Linfeng Tang, Jiayi Ma In this paper, we rethink the low-light image enhancement task and propose a physically explainable and generative diffusion model for low-light image enhancement, termed Diff-Retinex. We aim to integrate the advantages of the physical model and the generative network. Furthermore, we hope to supplement and even deduce the information missing in the low-light image through the generative network. Therefore, Diff-Retinex formulates the low-light image enhancement problem as Retinex decomposition and conditional image generation. In the Retinex decomposition, we leverage the strengths of Transformer attention and meticulously design a Retinex Transformer decomposition network (TDN) to decompose the image into illumination and reflectance maps. Then, we design multi-path generative diffusion networks to reconstruct the normal-light Retinex probability distribution and solve the various degradations in these components respectively, including dark illumination, noise, color deviation, loss of scene contents, etc. Owing to the generative diffusion model, Diff-Retinex makes the restoration of subtle low-light detail practical. Extensive experiments conducted on real-world low-light datasets qualitatively and quantitatively demonstrate the effectiveness, superiority, and generalization of the proposed method. This paper presents Diff-Retinex, a generative diffusion model for low-light image enhancement based on Retinex decomposition, aiming to recover missing information and correct color deviations. Existing LLIE methods often struggle to recover missing scene content and suffer from limitations of traditional or GAN-based generative approaches. Diff-Retinex uses a Retinex Transformer Decomposition Network (TDN) to decompose images into illumination and reflectance maps. Then, multi-path diffusion models (RDA and IDA) refine these maps by learning the distribution of normal-light components. Diff-Retinex demonstrates superior texture completion and reasoning generation for missing scene content compared to state-of-the-art methods. The method exhibits better illumination and color fidelity, resulting in more visually pleasing enhanced images. Qualitative and quantitative evaluations on LOL and VE-LOL-L datasets demonstrate the effectiveness and generalization ability of Diff-Retinex. While excelling in visual quality, Diff-Retinex may not achieve top performance on pixel-wise error metrics like PSNR. Future work could explore achieving better pixel-level accuracy with diffusion models for low-light enhancement. low-light image enhancement, diffusion models, retinex decomposition, generative models, image restoration
2308.13133 Report AccFlow: Backward Accumulation for Long-Range Optical Flow Guangyang Wu, Xiaohong Liu, Kunming Luo, Xi Liu, Qingqing Zheng, Shuaicheng Liu, Xinyang Jiang, Guangtao Zhai, Wenyi Wang Recent deep learning-based optical flow estimators have exhibited impressive performance in generating local flows between consecutive frames. However, the estimation of long-range flows between distant frames, particularly under complex object deformation and large motion occlusion, remains a challenging task. One promising solution is to accumulate local flows explicitly or implicitly to obtain the desired long-range flow. Nevertheless, the accumulation errors and flow misalignment can hinder the effectiveness of this approach. This paper proposes a novel recurrent framework called AccFlow, which recursively accumulates local flows backward using a deformable module called AccPlus. In addition, an adaptive blending module is designed along with AccPlus to alleviate the occlusion effect by backward accumulation and rectify the accumulation error. Notably, we demonstrate the superiority of backward accumulation over conventional forward accumulation, which to the best of our knowledge has not been explicitly established before. To train and evaluate the proposed AccFlow, we have constructed a large-scale high-quality dataset named CVO, which provides ground-truth optical flow labels between adjacent and distant frames. Extensive experiments validate the effectiveness of AccFlow in handling long-range optical flow estimation. Codes are available at https://github.com/mulns/AccFlow. This paper proposes AccFlow, a novel recurrent framework that leverages backward accumulation of local optical flows to estimate long-range optical flows, especially in scenarios with complex object deformation and large motion occlusion. Long-range optical flow estimation is crucial for various computer vision tasks, including video editing, action recognition, and object tracking, but remains challenging due to occlusion and accumulation errors. AccFlow uses a pretrained optical flow estimator for initial flow estimation, then recursively accumulates local flows backward in the feature domain using a deformable module called AccPlus. An adaptive blending module rectifies accumulation errors using directly estimated long-range flow as prior information. The paper also introduces a new synthetic dataset, CVO, with ground-truth long-range optical flow annotations for training and evaluation. Backward accumulation effectively alleviates occlusion compared to forward accumulation, as demonstrated by quantitative and qualitative results. Adaptive blending module significantly reduces accumulated error, particularly in non-occluded regions. AccFlow outperforms previous state-of-the-art methods on CVO and HS-Sintel benchmarks, achieving substantial EPE reduction. The current implementation of AccFlow relies on synthetic data and may require further adaptation for real-world scenarios. Exploring more sophisticated occlusion reasoning and error correction mechanisms within AccFlow could further enhance its performance. optical flow, long-range flow estimation, backward accumulation, occlusion handling, synthetic dataset
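Plain backward accumulation, the baseline that AccFlow refines with its deformable AccPlus module and adaptive blending, chains local flows by warping: F_{t->1}(x) = F_{t->t-1}(x) + F_{t-1->1}(x + F_{t->t-1}(x)). A minimal sketch of that chaining with bilinear warping follows; it shows only the accumulation rule, not the learned error rectification, and the tensor conventions are assumptions.

```python
import torch
import torch.nn.functional as F

def warp(flow_prev, flow_cur):
    """Sample flow_prev (frame t-1 -> 1) at the positions each pixel of frame t reaches via flow_cur (t -> t-1)."""
    b, _, h, w = flow_cur.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack([xs, ys], dim=0).to(flow_cur)                  # (2, H, W) pixel coordinates (x, y)
    coords = base[None] + flow_cur                                    # target positions in frame t-1
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0                           # normalize to [-1, 1] for grid_sample
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1)                              # (B, H, W, 2)
    return F.grid_sample(flow_prev, grid, mode="bilinear", align_corners=True)

def accumulate_backward(local_flows):
    """local_flows = [F_{2->1}, F_{3->2}, ..., F_{T->T-1}], each (B, 2, H, W); returns F_{T->1}."""
    long_flow = local_flows[0]
    for flow_cur in local_flows[1:]:
        long_flow = flow_cur + warp(long_flow, flow_cur)              # F_{t->1} = F_{t->t-1} + warp(F_{t-1->1})
    return long_flow

flows = [torch.randn(1, 2, 48, 64) * 2 for _ in range(5)]             # toy local flows
f_long = accumulate_backward(flows)                                    # (1, 2, 48, 64), flow from frame 6 to frame 1
```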
2308.12968 Report Scenimefy: Learning to Craft Anime Scene via Semi-Supervised Image-to-Image Translation Yuxin Jiang, Liming Jiang, Shuai Yang, Chen Change Loy Automatic high-quality rendering of anime scenes from complex real-world images is of significant practical value. The challenges of this task lie in the complexity of the scenes, the unique features of anime style, and the lack of high-quality datasets to bridge the domain gap. Despite promising attempts, previous efforts are still incompetent in achieving satisfactory results with consistent semantic preservation, evident stylization, and fine details. In this study, we propose Scenimefy, a novel semi-supervised image-to-image translation framework that addresses these challenges. Our approach guides the learning with structure-consistent pseudo paired data, simplifying the pure unsupervised setting. The pseudo data are derived uniquely from a semantic-constrained StyleGAN leveraging rich model priors like CLIP. We further apply segmentation-guided data selection to obtain high-quality pseudo supervision. A patch-wise contrastive style loss is introduced to improve stylization and fine details. Besides, we contribute a high-resolution anime scene dataset to facilitate future research. Our extensive experiments demonstrate the superiority of our method over state-of-the-art baselines in terms of both perceptual quality and quantitative performance. This paper presents Scenimefy, a novel semi-supervised image-to-image translation framework for converting real-world scenes into high-quality anime style. This work addresses the challenges of existing anime stylization methods in preserving semantic content, achieving distinct anime style, and handling fine details, particularly in complex scenes. Scenimefy leverages structure-consistent pseudo paired data generated by a semantically-constrained StyleGAN, guided by pre-trained models like CLIP and VGG. It employs a segmentation-guided data selection process for high-quality supervision and introduces a patch-wise contrastive style loss for enhanced stylization. Scenimefy outperforms state-of-the-art baselines in both visual quality and quantitative evaluations (FID). The proposed method effectively captures and transfers unique anime textures and styles, as demonstrated in comparisons. A new high-resolution anime scene dataset is introduced to facilitate further research in this area. The model may not perfectly preserve intricate tiny details like text. A small number of failure cases exist where semantically distinct objects are translated incorrectly. image-to-image translation, anime stylization, scene cartoonization, stylegan, semi-supervised learning
2308.12966 Report Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity by the meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond the conventional image description and question-answering, we implement the grounding and text-reading ability of Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models under similar model scales on a broad range of visual-centric benchmarks (e.g., image captioning, question answering, visual grounding) and different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our instruction-tuned Qwen-VL-Chat also demonstrates superiority compared to existing vision-language chatbots. Code, demo and models are available at https://github.com/QwenLM/Qwen-VL. Introduces Qwen-VL, a series of open-source large-scale vision-language models that excel in visual understanding and instruction following. Addresses limitations of existing open-source LVLMs in terms of performance, fine-grained perception, and instruction following. Employs a 3-stage training pipeline: (1) Pre-training on a massive image-text dataset, (2) Multi-task pre-training on high-quality annotated data, and (3) Instruction fine-tuning for enhanced dialogue abilities. Achieves state-of-the-art results on various vision-language benchmarks, including image captioning, visual question answering, and refer expression comprehension. Demonstrates superior performance in real-world user behavior evaluations, such as TouchStone, SEED-Bench, and MME. Exhibits strong few-shot learning capabilities, comparable to larger models. Current model size and resolution limit handling of more complex multimodal relationships. Future work focuses on incorporating additional modalities (speech, video) and enhancing multimodal generation capabilities. vision-language model, large language model, multimodal learning, instruction following, open-source
2308.12964 Report Dense Text-to-Image Generation with Attention Modulation Yunji Kim, Jiyoung Lee, Jin-Hwa Kim, Jung-Woo Ha, Jun-Yan Zhu Existing text-to-image diffusion models struggle to synthesize realistic images given dense captions, where each text prompt provides a detailed description for a specific image region. To address this, we propose DenseDiffusion, a training-free method that adapts a pre-trained text-to-image model to handle such dense captions while offering control over the scene layout. We first analyze the relationship between generated images' layouts and the pre-trained model's intermediate attention maps. Next, we develop an attention modulation method that guides objects to appear in specific regions according to layout guidance. Without requiring additional fine-tuning or datasets, we improve image generation performance given dense captions regarding both automatic and human evaluation scores. In addition, we achieve similar-quality visual results with models specifically trained with layout conditions. This paper introduces DenseDiffusion, a training-free method that allows pre-trained text-to-image models to generate realistic images from dense captions while offering control over scene layout. Existing text-to-image diffusion models struggle to accurately represent images described by dense captions, often omitting or blending objects. Additionally, controlling the layout of generated images using text prompts alone is difficult. DenseDiffusion modulates the attention maps of pre-trained models, like Stable Diffusion, based on both text and layout conditions. This is done by identifying positive and negative query-key pairs in the attention layers and adjusting their scores based on original value range and segment size. DenseDiffusion improves image generation performance on dense captions compared to other training-free methods, as measured by CLIP-Score, SOA-I score, and IoU. DenseDiffusion demonstrates superior adherence to layout conditions compared to SD-Pww, a training-free method designed for layout control. Qualitative results show DenseDiffusion achieves comparable, and in some cases better, layout control than models specifically trained on layout conditions, such as Make-a-Scene and SpaText. DenseDiffusion's performance is limited by the capacity of the base text-to-image model (e.g., Stable Diffusion) it modifies. The method struggles with fine-grained input masks due to the coarse nature of self-attention and cross-attention layers. text-to-image generation, diffusion models, attention modulation, layout control, dense captions
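For intuition, here is a minimal, self-contained sketch of layout-guided attention modulation in the spirit of DenseDiffusion: positive query-key pairs (those inside the target segment) are boosted and negative pairs suppressed before the softmax, with the adjustment scaled by the observed score range and shrunk for large segments. The tensor names and the exact scaling heuristic are illustrative assumptions, not the paper's formulation.

```python
import torch

def modulate_attention(scores: torch.Tensor,
                       pos_mask: torch.Tensor,
                       strength: float = 0.5) -> torch.Tensor:
    """Layout-guided modulation of raw attention scores (illustrative sketch).

    scores:   (num_queries, num_keys) pre-softmax attention logits.
    pos_mask: (num_queries, num_keys) 1 where query and key fall in the same
              layout segment, 0 otherwise.
    """
    # Scale the adjustment by the observed score range so the modulation
    # stays proportional to the layer's own logit magnitude.
    value_range = scores.max() - scores.min()

    # Larger segments get a weaker push; use the fraction of positive keys
    # per query as a crude segment-size term.
    seg_size = pos_mask.float().mean(dim=-1, keepdim=True)   # (num_queries, 1)

    boost = strength * value_range * (1.0 - seg_size)
    scores = scores + boost * pos_mask.float()                # strengthen positive pairs
    scores = scores - boost * (1.0 - pos_mask.float())        # weaken negative pairs
    return torch.softmax(scores, dim=-1)

# Toy usage: 4 queries over 6 keys; the first 3 keys form the target segment.
raw = torch.randn(4, 6)
mask = torch.zeros(4, 6)
mask[:, :3] = 1.0
attn = modulate_attention(raw, mask)
```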
2308.12956 Report DLIP: Distilling Language-Image Pre-training Huafeng Kuang, Jie Wu, Xiawu Zheng, Ming Li, Xuefeng Xiao, Rui Wang, Min Zheng, Rongrong Ji Vision-Language Pre-training (VLP) shows remarkable progress with the assistance of extremely heavy parameters, which challenges deployment in real applications. Knowledge distillation is well recognized as the essential procedure in model compression. However, existing knowledge distillation techniques lack an in-depth investigation and analysis of VLP, and practical guidelines for VLP-oriented distillation are still not yet explored. In this paper, we present DLIP, a simple yet efficient Distilling Language-Image Pre-training framework, through which we investigate how to distill a light VLP model. Specifically, we dissect the model distillation from multiple dimensions, such as the architecture characteristics of different modules and the information transfer of different modalities. We conduct comprehensive experiments and provide insights on distilling a light but performant VLP model. Experimental results reveal that DLIP can achieve a state-of-the-art accuracy/efficiency trade-off across diverse cross-modal tasks, e.g., image-text retrieval, image captioning and visual question answering. For example, DLIP compresses BLIP by 1.9x, from 213M to 108M parameters, while achieving comparable or better performance. Furthermore, DLIP succeeds in retaining more than 95% of the performance with 22.4% parameters and 24.8% FLOPs compared to the teacher model and accelerates inference speed by 2.7x. This paper presents DLIP, a simple yet effective distillation framework designed to train lighter Vision-Language Pre-training (VLP) models. Large VLP models are computationally expensive and challenging to deploy. DLIP addresses this by compressing these models while maintaining high performance. DLIP leverages knowledge distillation from a large teacher VLP model to a smaller student model. It investigates and analyzes various aspects of model distillation, including module architecture choices and multimodal information transfer. Image and text encoders are equally important for compression. Multimodal information transfer is more effective than unimodal information transfer for distillation. DLIP achieves state-of-the-art accuracy/efficiency trade-off, compressing BLIP by 1.9x while achieving comparable or better performance on various tasks. The study primarily focuses on fully transformer-based VLP models. Future work could explore more efficient module compression strategies. vision-language pre-training, knowledge distillation, model compression, multimodal learning, image-text retrieval
2308.12866 Report ToonTalker: Cross-Domain Face Reenactment Yuan Gong, Yong Zhang, Xiaodong Cun, Fei Yin, Yanbo Fan, Xuan Wang, Baoyuan Wu, Yujiu Yang We target cross-domain face reenactment in this paper, i.e., driving a cartoon image with the video of a real person and vice versa. Recently, many works have focused on one-shot talking face generation to drive a portrait with a real video, i.e., within-domain reenactment. Straightforwardly applying those methods to cross-domain animation will cause inaccurate expression transfer, blur effects, and even apparent artifacts due to the domain shift between cartoon and real faces. Only a few works attempt to settle cross-domain face reenactment. The most related work AnimeCeleb requires constructing a dataset with pose vector and cartoon image pairs by animating 3D characters, which makes it inapplicable anymore if no paired data is available. In this paper, we propose a novel method for cross-domain reenactment without paired data. Specifically, we propose a transformer-based framework to align the motions from different domains into a common latent space where motion transfer is conducted via latent code addition. Two domain-specific motion encoders and two learnable motion base memories are used to capture domain properties. A source query transformer and a driving one are exploited to project domain-specific motion to the canonical space. The edited motion is projected back to the domain of the source with a transformer. Moreover, since no paired data is provided, we propose a novel cross-domain training scheme using data from two domains with the designed analogy constraint. Besides, we contribute a cartoon dataset in Disney style. Extensive evaluations demonstrate the superiority of our method over competing methods. This paper presents ToonTalker, a novel transformer-based framework for cross-domain face reenactment, enabling animation of cartoon images using real human videos and vice versa. Existing face reenactment methods struggle with cross-domain animation due to the significant domain shift between cartoon and real faces, resulting in inaccurate expression transfer and artifacts. ToonTalker addresses this challenge by aligning motions from different domains in a shared latent space, eliminating the need for paired training data. ToonTalker utilizes domain-specific motion encoders and learnable motion bases to capture domain-specific motion properties. Source and driving query transformers project these motions into a canonical space where motion transfer occurs via latent code addition. A novel training scheme with an analogy constraint compensates for the lack of paired data by enforcing consistent relative motion between domains. ToonTalker outperforms state-of-the-art methods in cross-domain reenactment, demonstrating superior image quality, motion consistency, and identity preservation. The proposed analogy constraint effectively aligns motions from different domains, as evidenced by qualitative and quantitative results. ToonTalker generalizes well to animating cartoon characters generated by diffusion models, showcasing its potential for various applications. The model faces challenges in accurately handling extreme poses due to their limited presence in training data. Future work could explore incorporating techniques for handling extreme poses and further enhance the model's generalization capabilities for diverse cartoon styles. cross-domain face reenactment, motion transfer, analogy constraint, transformer, cartoon animation
2308.12605 Report APLA: Additional Perturbation for Latent Noise with Adversarial Training Enables Consistency Yupu Yao, Shangqi Deng, Zihan Cao, Harry Zhang, Liang-Jian Deng Diffusion models have exhibited promising progress in video generation. However, they often struggle to retain consistent details within local regions across frames. One underlying cause is that traditional diffusion models approximate Gaussian noise distribution by utilizing predictive noise, without fully accounting for the impact of inherent information within the input itself. Additionally, these models emphasize the distinction between predictions and references, neglecting information intrinsic to the videos. To address this limitation, inspired by the self-attention mechanism, we propose a novel text-to-video (T2V) generation network structure based on diffusion models, dubbed Additional Perturbation for Latent noise with Adversarial training (APLA). Our approach only necessitates a single video as input and builds upon pre-trained stable diffusion networks. Notably, we introduce an additional compact network, known as the Video Generation Transformer (VGT). This auxiliary component is designed to extract perturbations from the inherent information contained within the input, thereby refining inconsistent pixels during temporal predictions. We leverage a hybrid architecture of transformers and convolutions to compensate for temporal intricacies, enhancing consistency between different frames within the video. Experiments demonstrate a noticeable improvement in the consistency of the generated videos both qualitatively and quantitatively. Proposes APLA, a text-to-video generation network based on diffusion models that improves frame consistency by using a Video Generation Transformer (VGT) to extract and leverage inherent information within input videos. Existing diffusion models for video generation struggle to maintain consistent details across frames, particularly in fine-tuned models. APLA introduces VGT, an auxiliary network that extracts perturbations from input videos to refine inconsistent pixels during temporal predictions. It leverages a hybrid transformer-convolution architecture for temporal consistency. Adversarial training is used to further enhance generated video quality and consistency. APLA demonstrates noticeable improvement in generated video consistency qualitatively and quantitatively. VGT-Hyper, a variant of VGT using 3D convolutions, exhibits superior performance in reconstruction tasks. Ablation studies highlight the contribution of each component in APLA, including VGT, adversarial training, and the hyper-loss. Limited CUDA memory restricts the use of the more complex VGT-Hyper model. Excessive training epochs can lead to overfitting and reduced influence of the text prompt. text-to-video generation, diffusion models, frame consistency, video generation transformer, adversarial training
2308.12560 Report NOVA: NOvel View Augmentation for Neural Composition of Dynamic Objects Dakshit Agrawal, Jiajie Xu, Siva Karthik Mustikovela, Ioannis Gkioulekas, Ashish Shrivastava, Yuning Chai We propose a novel-view augmentation (NOVA) strategy to train NeRFs for photo-realistic 3D composition of dynamic objects in a static scene. Compared to prior work, our framework significantly reduces blending artifacts when inserting multiple dynamic objects into a 3D scene at novel views and times; achieves comparable PSNR without the need for additional ground truth modalities like optical flow; and overall provides ease, flexibility, and scalability in neural composition. Our codebase is on GitHub. Presents NOVA, a novel-view augmentation strategy for training NeRFs, enabling photo-realistic 3D composition of dynamic objects in static scenes from monocular videos. Addresses limitations in existing methods that produce blending artifacts and require additional ground truth data like optical flow. Utilizes separate NeRFs for different scene parts, employs novel-view augmentation to reduce blending artifacts, and introduces novel-view losses to ensure high image fidelity. Significantly reduces blending artifacts compared to prior work, especially when inserting multiple dynamic objects. Achieves comparable PSNR to state-of-the-art methods without requiring ground truth optical flow. Provides a flexible and scalable framework for neural composition of dynamic scenes. Current implementation assumes a static camera for capturing the scene. Future work can explore incorporating techniques to handle dynamic cameras. neural radiance fields, nerf, novel view synthesis, scene composition, dynamic scenes
2308.12538 Report Mutual-Guided Dynamic Network for Image Fusion Yuanshen Guan, Ruikang Xu, Mingde Yao, Lizhi Wang, Zhiwei Xiong Image fusion aims to generate a high-quality image from multiple images captured under varying conditions. The key problem of this task is to preserve complementary information while filtering out irrelevant information for the fused result. However, existing methods address this problem by leveraging static convolutional neural networks (CNNs), suffering two inherent limitations during feature extraction, i.e., being unable to handle spatial-variant contents and lacking guidance from multiple inputs. In this paper, we propose a novel mutual-guided dynamic network (MGDN) for image fusion, which allows for effective information utilization across different locations and inputs. Specifically, we design a mutual-guided dynamic filter (MGDF) for adaptive feature extraction, composed of a mutual-guided cross-attention (MGCA) module and a dynamic filter predictor, where the former incorporates additional guidance from different inputs and the latter generates spatial-variant kernels for different locations. In addition, we introduce a parallel feature fusion (PFF) module to effectively fuse local and global information of the extracted features. To further reduce the redundancy among the extracted features while simultaneously preserving their shared structural information, we devise a novel loss function that combines the minimization of normalized mutual information (NMI) with an estimated gradient mask. Experimental results on five benchmark datasets demonstrate that our proposed method outperforms existing methods on four image fusion tasks. The code and model are publicly available at: https://github.com/Guanys-dar/MGDN. This paper introduces MGDN, a novel mutual-guided dynamic network for image fusion, which allows for effective information utilization across different locations and inputs. Existing image fusion methods rely on static networks, limiting their ability to handle spatial and scene variations crucial for real-world applications. MGDN addresses this by dynamically adapting to content and leveraging information from multiple inputs. The core of MGDN is the Mutual-Guided Dynamic Filter (MGDF). It uses Mutual-Guided Cross-Attention (MGCA) to integrate guidance information from multiple inputs and a dynamic filter predictor to estimate spatial-variant filters for adaptive feature extraction. A Parallel Feature Fusion (PFF) module merges local and global information, and a masked MI loss is employed to reduce feature redundancy while preserving structural information. MGDN outperforms existing state-of-the-art image fusion methods on five benchmark datasets across four representative image fusion tasks. MGDN effectively integrates complementary information, preserves texture details, and maintains appropriate exposure levels in multi-exposure and multi-focus image fusion tasks. The proposed method excels in HDR deghosting, handling challenging scenes with saturations, motion, and significant intensity variations better than previous approaches. The computational complexity of dynamic filtering needs further optimization for real-time applications. Future work can explore extending MGDN to handle more than two input images for complex fusion scenarios. image fusion, dynamic filtering, mutual information, deep learning, computer vision
2308.12510 Report Masked Autoencoders are Efficient Class Incremental Learners Jiang-Tian Zhai, Xialei Liu, Andrew D. Bagdanov, Ke Li, Ming-Ming Cheng Class Incremental Learning (CIL) aims to sequentially learn new classes while avoiding catastrophic forgetting of previous knowledge. We propose to use Masked Autoencoders (MAEs) as efficient learners for CIL. MAEs were originally designed to learn useful representations through reconstructive unsupervised learning, and they can be easily integrated with a supervised loss for classification. Moreover, MAEs can reliably reconstruct original input images from randomly selected patches, which we use to store exemplars from past tasks more efficiently for CIL. We also propose a bilateral MAE framework to learn from image-level and embedding-level fusion, which produces better-quality reconstructed images and more stable representations. Our experiments confirm that our approach performs better than the state-of-the-art on CIFAR-100, ImageNet-Subset, and ImageNet-Full. The code is available at https://github.com/scok30/MAE-CIL . This paper introduces a novel bilateral Masked Autoencoder (MAE) framework for efficient Class Incremental Learning (CIL), leveraging the self-supervised reconstruction capabilities of MAEs for enhanced exemplar replay and representation learning. Addressing catastrophic forgetting in CIL is crucial for real-world applications where models need to adapt to new information without losing previously acquired knowledge. This work explores using MAEs for efficient exemplar storage and high-quality replay data generation in CIL. The proposed approach employs a bilateral MAE architecture with two branches: one for learning global features and another for detailed reconstruction. It utilizes random masking for efficient exemplar storage, reconstructs images from these masked patches, and incorporates a detailed loss to improve reconstruction quality and embedding diversity. The bilateral MAE framework achieves state-of-the-art performance on CIFAR-100, ImageNet-Subset, and ImageNet-Full, outperforming existing methods in average accuracy and forgetting rate. The method demonstrates the effectiveness of using masked image patches as exemplars for efficient storage and high-quality replay data generation. Ablation studies confirm the contribution of each component, including the bilateral architecture, self-supervised reconstruction, and masking ratio, to the overall performance improvement. The impact of varying the number of stored exemplars per class on performance could be further investigated. Exploring different masking strategies or incorporating additional self-supervision tasks might lead to further performance improvements. class incremental learning, catastrophic forgetting, masked autoencoders, exemplar replay, self-supervised learning
2308.12469 Report Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion Junjiao Tian, Lavisha Aggarwal, Andrea Colaco, Zsolt Kira, Mar Gonzalez-Franco Producing quality segmentation masks for images is a fundamental problem in computer vision. Recent research has explored large-scale supervised training to enable zero-shot segmentation on virtually any image style and unsupervised training to enable segmentation without dense annotations. However, constructing a model capable of segmenting anything in a zero-shot manner without any annotations is still challenging. In this paper, we propose to utilize the self-attention layers in stable diffusion models to achieve this goal because the pre-trained stable diffusion model has learned inherent concepts of objects within its attention layers. Specifically, we introduce a simple yet effective iterative merging process based on measuring KL divergence among attention maps to merge them into valid segmentation masks. The proposed method does not require any training or language dependency to extract quality segmentation for any images. On COCO-Stuff-27, our method surpasses the prior unsupervised zero-shot SOTA method by an absolute 26% in pixel accuracy and 17% in mean IoU. The project page is at \url{https://sites.google.com/view/diffseg/home}. Presents DiffSeg, an unsupervised and zero-shot segmentation method using a pre-trained stable diffusion model. Constructing a model capable of segmenting anything in a zero-shot manner without any annotations remains challenging. This method eliminates the need for annotations and prior knowledge of target images. DiffSeg leverages self-attention layers in stable diffusion models, aggregating attention tensors and merging them iteratively based on KL divergence to produce segmentation masks. DiffSeg surpasses previous unsupervised zero-shot methods on COCO-Stuff-27 (26% higher pixel accuracy, 17% higher mIoU). Outperforms prior works on Cityscapes using larger resolution input. Generalizes well to images of diverse styles, including sketches, paintings, and real-world photographs. Performance on specialized datasets like Cityscapes is not satisfactory, potentially due to resolution limitations and limited exposure to such scenes during pre-training. Computationally demanding, not real-time. unsupervised segmentation, zero-shot learning, stable diffusion, self-attention, kl divergence
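The merging criterion can be illustrated with a short sketch: each self-attention map is treated as a distribution over spatial locations, pairwise symmetric KL divergence is measured, and maps closer than a threshold are averaged into one proposal. This is a simplified, single-pass illustration of the idea; DiffSeg's full algorithm aggregates multi-resolution attention tensors and iterates the merge.

```python
import torch

def sym_kl(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Symmetric KL divergence between two attention maps, flattened to distributions."""
    p = p.flatten() + eps
    q = q.flatten() + eps
    p, q = p / p.sum(), q / q.sum()
    return 0.5 * ((p * (p / q).log()).sum() + (q * (q / p).log()).sum())

def merge_attention_maps(maps: torch.Tensor, threshold: float = 0.1) -> list:
    """Greedy single pass that merges attention maps whose symmetric KL falls
    below `threshold` (illustrative; DiffSeg iterates this process).

    maps: (N, H, W) stack of per-anchor self-attention maps.
    """
    proposals = []
    for m in maps:
        for group in proposals:
            if sym_kl(group["mean"], m) < threshold:
                group["members"].append(m)
                group["mean"] = torch.stack(group["members"]).mean(dim=0)
                break
        else:
            proposals.append({"members": [m], "mean": m})
    return [g["mean"] for g in proposals]

# Toy usage: 8 random 16x16 attention maps merged into segment proposals.
maps = torch.rand(8, 16, 16)
proposals = merge_attention_maps(maps, threshold=0.05)
```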
2308.12350 Report Diffusion-based Image Translation with Label Guidance for Domain Adaptive Semantic Segmentation Duo Peng, Ping Hu, Qiuhong Ke, Jun Liu Translating images from a source domain to a target domain for learning target models is one of the most common strategies in domain adaptive semantic segmentation (DASS). However, existing methods still struggle to preserve semantically-consistent local details between the original and translated images. In this work, we present an innovative approach that addresses this challenge by using source-domain labels as explicit guidance during image translation. Concretely, we formulate cross-domain image translation as a denoising diffusion process and utilize a novel Semantic Gradient Guidance (SGG) method to constrain the translation process, conditioning it on the pixel-wise source labels. Additionally, a Progressive Translation Learning (PTL) strategy is devised to enable the SGG method to work reliably across domains with large gaps. Extensive experiments demonstrate the superiority of our approach over state-of-the-art methods. This paper proposes a novel diffusion-based image translation framework for Domain Adaptive Semantic Segmentation (DASS) that uses source-domain labels as guidance to preserve semantic details. Existing image translation methods for DASS, often based on GANs, struggle to preserve local semantic consistency between original and translated images, leading to sub-optimal adaptation performance. The approach involves training an unconditional diffusion model on the target domain and then using it for translating source images. A novel Semantic Gradient Guidance (SGG) method, coupled with a Progressive Translation Learning (PTL) strategy, guides the translation process based on pixel-wise source labels, ensuring semantic consistency even across large domain gaps. Achieves state-of-the-art performance on GTA5→Cityscapes and SYNTHIA→Cityscapes benchmarks. Shows significant improvements over existing GAN-based image translation methods (3.2% to 20.1% improvement across different settings and backbones). Demonstrates more stable training and comparable inference time compared to other state-of-the-art DASS methods. The method requires training across multiple intermediate domains, which increases the overall training time. Future work could explore incorporating other guidance signals, such as image structure or context, to further improve the quality of translated images. domain adaptation, semantic segmentation, image translation, diffusion models, label guidance
2308.12059 Report Manipulating Embeddings of Stable Diffusion Prompts Niklas Deckers, Julia Peters, Martin Potthast Generative text-to-image models such as Stable Diffusion allow users to generate images based on a textual description, the prompt. Changing the prompt is still the primary means for the user to change a generated image as desired. However, changing the image by reformulating the prompt remains a difficult process of trial and error, which has led to the emergence of prompt engineering as a new field of research. We propose and analyze methods to change the embedding of a prompt directly instead of the prompt text. It allows for more fine-grained and targeted control that takes into account user intentions. Our approach treats the generative text-to-image model as a continuous function and passes gradients between the image space and the prompt embedding space. By addressing different user interaction problems, we can apply this idea in three scenarios: (1) Optimization of a metric defined in image space that could measure, for example, image style. (2) Assistance of users in creative tasks by enabling them to navigate the image space along a selection of directions of "near" prompt embeddings. (3) Changing the embedding of the prompt to include information that the user has seen in a particular seed but finds difficult to describe in the prompt. Our experiments demonstrate the feasibility of the described methods. This paper presents and analyzes three novel methods for manipulating the embeddings of prompts in Stable Diffusion, allowing for more targeted and fine-grained control over image generation compared to traditional prompt engineering. Traditional prompt engineering, while effective, can be tedious, unintuitive, and unpredictable due to the inherent ambiguity of language and the black-box nature of text-to-image models. This work aims to address these shortcomings by providing users with more direct control over image generation. The proposed methods involve treating Stable Diffusion as a continuous function and using gradient descent to modify prompt embeddings in three ways: (1) Metric-based optimization: optimizing embeddings with respect to specific image metrics (e.g., blurriness, sharpness, aesthetics). (2) Iterative human feedback: providing users with options generated from slightly modified embeddings, allowing for iterative refinement based on their choices. (3) Seed-invariant embeddings: reconstructing preferred image features observed with specific seeds, making image generation more robust to seed variations. Modifying prompt embeddings based on image metrics successfully alters image characteristics like blurriness, sharpness, and aesthetics. A user study demonstrates that iterative feedback on modified embeddings provides a more controlled and less tedious experience compared to prompt engineering, especially for creative tasks. The method for creating seed-invariant prompt embeddings shows promising preliminary results, demonstrating the potential to encode seed-specific information directly into the embedding. The effectiveness of metric-based optimization depends on the chosen metric and can lead to overfitting or artifacts if not carefully monitored. The iterative feedback method might be less effective for users with a specific target image in mind, as it relies on presented options aligning with their envisioned direction. stable diffusion, prompt engineering, text-to-image generation, image manipulation, human-computer interaction
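The metric-based scenario reduces to ordinary gradient descent in prompt-embedding space, treating the text-to-image model as a differentiable function of the embedding. The sketch below substitutes toy stand-ins for the generator, the metric, and the embedding shape; with Stable Diffusion one would keep gradients enabled through the actual pipeline and use the real (77, 768) CLIP text embedding.

```python
import torch
import torch.nn as nn

# Stand-ins for the real pipeline: a tiny differentiable "generator" mapping a
# prompt embedding to an image, and a scalar image metric (a toy sharpness proxy).
prompt_embedding = torch.randn(1, 8, 64, requires_grad=True)        # toy shape; real CLIP text embedding is (77, 768)
generator = nn.Sequential(nn.Linear(8 * 64, 3 * 16 * 16), nn.Sigmoid())
generator.requires_grad_(False)                                      # freeze the generator; only the embedding is optimized
metric = lambda img: img.var()                                       # maximize pixel variance as a toy "sharpness" score

optimizer = torch.optim.Adam([prompt_embedding], lr=1e-2)
for step in range(100):
    image = generator(prompt_embedding.flatten(1)).view(1, 3, 16, 16)
    loss = -metric(image)        # gradient ascent on the chosen image-space metric
    optimizer.zero_grad()
    loss.backward()              # gradients flow from image space back to the prompt embedding
    optimizer.step()
```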
2308.11974 Report Blending-NeRF: Text-Driven Localized Editing in Neural Radiance Fields Hyeonseop Song, Seokhun Choi, Hoseok Do, Chul Lee, Taehyeong Kim Text-driven localized editing of 3D objects is particularly difficult as locally mixing the original 3D object with the intended new object and style effects without distorting the object's form is not a straightforward process. To address this issue, we propose a novel NeRF-based model, Blending-NeRF, which consists of two NeRF networks: pretrained NeRF and editable NeRF. Additionally, we introduce new blending operations that allow Blending-NeRF to properly edit target regions which are localized by text. By using a pretrained vision-language aligned model, CLIP, we guide Blending-NeRF to add new objects with varying colors and densities, modify textures, and remove parts of the original object. Our extensive experiments demonstrate that Blending-NeRF produces naturally and locally edited 3D objects from various text prompts. Our project page is available at https://seokhunchoi.github.io/Blending-NeRF/ Introduces Blending-NeRF, a novel NeRF-based model for text-driven localized editing of 3D objects using a pretrained NeRF and an editable NeRF. Addresses the challenge of localized editing in 3D object editing, enabling specific modifications based on text prompts. Utilizes a layered NeRF architecture with blending operations to combine a pretrained NeRF with an editable NeRF. It leverages CLIP for text-image alignment and CLIPSeg for target region localization. Blending-NeRF successfully edits localized regions of 3D objects based on various text prompts, including color changes, density additions, and removals. The method outperforms baseline models, particularly in density-based editing tasks, demonstrating its ability for fine-grained control. It exhibits extensibility by integrating with Instant-NGP for memory efficiency and application to real-world scenes. Performance can be influenced by the accuracy of CLIPSeg in segmenting the target region. Limited patch size input to CLIP's image encoder can impact the sharpness of editing results, particularly with memory-intensive NeRF backbones. neural radiance fields, 3d object editing, text-driven editing, clip, localized editing
2308.11971 Report EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE Junyi Chen, Longteng Guo, Jia Sun, Shuai Shao, Zehuan Yuan, Liang Lin, Dongyu Zhang Building scalable vision-language models to learn from diverse, multimodal data remains an open challenge. In this paper, we introduce an Efficient Vision-languagE foundation model, namely EVE, which is one unified multimodal Transformer pre-trained solely by one unified pre-training task. Specifically, EVE encodes both vision and language within a shared Transformer network integrated with modality-aware sparse Mixture-of-Experts (MoE) modules, which capture modality-specific information by selectively switching to different experts. To unify pre-training tasks of vision and language, EVE performs masked signal modeling on image-text pairs to reconstruct masked signals, i.e., image pixels and text tokens, given visible signals. This simple yet effective pre-training objective accelerates training by 3.5x compared to the model pre-trained with Image-Text Contrastive and Image-Text Matching losses. Owing to the combination of the unified architecture and pre-training task, EVE is easy to scale up, enabling better downstream performance with fewer resources and faster training speed. Despite its simplicity, EVE achieves state-of-the-art performance on various vision-language downstream tasks, including visual question answering, visual reasoning, and image-text retrieval. The paper introduces EVE, an efficient vision-language foundation model based on a unified multimodal Transformer with modality-aware sparse Mixture-of-Experts (MoE) modules, pre-trained using a single unified masked signal modeling task. Building scalable vision-language models that learn from diverse multimodal data while remaining efficient in training and scaling up is an open challenge. EVE aims to address this by simplifying both the model architecture and the pre-training objective. EVE leverages a shared Transformer network for both vision and language, integrating modality-aware MoE modules to capture modality-specific information. It is pre-trained using a unified masked signal modeling task, reconstructing masked image pixels and text tokens given visible signals. EVE achieves state-of-the-art performance on various vision-language tasks, including visual question answering, visual reasoning, and image-text retrieval. The unified architecture and pre-training task enable EVE to be easily scaled up, leading to improved downstream performance with fewer resources and faster training. Pre-training EVE with masked signal modeling is 3.5 times faster than using Image-Text Contrastive and Image-Text Matching losses. The paper mainly focuses on exploring the effectiveness of the unified architecture and pre-training task for image and text modalities. The impact of using modality-specific MoE modules on model interpretability requires further investigation. vision-language pre-training, multimodal learning, mixture-of-experts, masked signal modeling, transformer
2308.11941 Report Boosting Diffusion Models with an Adaptive Momentum Sampler Xiyu Wang, Anh-Dung Dinh, Daochang Liu, Chang Xu Diffusion probabilistic models (DPMs) have been shown to generate high-quality images without the need for delicate adversarial training. However, the current sampling process in DPMs is prone to violent shaking. In this paper, we present a novel reverse sampler for DPMs inspired by the widely-used Adam optimizer. Our proposed sampler can be readily applied to a pre-trained diffusion model, utilizing momentum mechanisms and adaptive updating to smooth the reverse sampling process and ensure stable generation, resulting in outputs of enhanced quality. By implicitly reusing update directions from early steps, our proposed sampler achieves a better balance between high-level semantics and low-level details. Additionally, this sampler is flexible and can be easily integrated into pre-trained DPMs regardless of the sampler used during training. Our experimental results on multiple benchmarks demonstrate that our proposed reverse sampler yields remarkable improvements over different baselines. We will make the source code available. This paper introduces a novel, training-free reverse sampler for Diffusion Probabilistic Models (DPMs) inspired by the Adam optimizer, which uses momentum and adaptive updating to enhance the quality of generated images. Current DPMs' reverse sampling processes suffer from instability, leading to noisy images with missing high-level features. This novel sampler addresses this issue by smoothing the sampling trajectory and balancing high-level and low-level information in generated images. The proposed Adaptive Momentum Sampler incorporates a momentum term to accumulate past update directions, smoothing the sampling process. Additionally, it utilizes a moving average of second-order moments to adaptively adjust the denoising step size for each pixel, similar to the RMSProp optimizer. The adaptive momentum sampler significantly improves image generation quality over baseline samplers on various datasets, including CIFAR-10, ImageNet, CelebA, LSUN, and CelebA-HQ. The sampler excels in balancing high-level semantics (shapes, outlines) and low-level details (textures) in generated images. The proposed method is flexible and can be easily integrated with existing pre-trained DPMs without requiring additional training. The improvement of the sampler is less evident when using a small number of sampling steps. Future work includes incorporating the adaptive momentum strategy into the training process and extending the scheme to continuous settings with solid theoretical foundations. diffusion models, generative models, image generation, adaptive momentum, sampling algorithm
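A generic way to picture the sampler: keep Adam-style first- and second-moment estimates of the per-step denoising direction and take the smoothed, per-pixel rescaled step instead of the raw one. The sketch below is an assumption-laden illustration (stand-in denoiser, simplified step-size rule), not the paper's exact update.

```python
import torch

def adaptive_momentum_sampling(denoise_step, x_T, num_steps=50,
                               beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-inspired smoothing of a diffusion reverse process (generic sketch).

    denoise_step(x, t) should return the raw update direction proposed by any
    pre-trained sampler at step t (e.g., its x_{t-1} - x_t increment).
    """
    x = x_T
    m = torch.zeros_like(x)   # first moment: accumulated update direction
    v = torch.zeros_like(x)   # second moment: per-pixel squared magnitude
    for i, t in enumerate(range(num_steps, 0, -1), start=1):
        d = denoise_step(x, t)
        m = beta1 * m + (1 - beta1) * d
        v = beta2 * v + (1 - beta2) * d * d
        m_hat = m / (1 - beta1 ** i)        # bias correction, as in Adam
        v_hat = v / (1 - beta2 ** i)
        # Replace the raw step d with a smoothed, per-pixel rescaled step.
        x = x + m_hat / (v_hat.sqrt() + eps) * d.abs().mean()
    return x

# Toy usage with a stand-in "denoiser" that nudges the sample toward zero.
fake_step = lambda x, t: -0.05 * x
sample = adaptive_momentum_sampling(fake_step, torch.randn(1, 3, 32, 32))
```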
2308.11917 Report LFS-GAN: Lifelong Few-Shot Image Generation Juwon Seo, Ji-Su Kang, Gyeong-Moon Park We address a challenging lifelong few-shot image generation task for the first time. In this situation, a generative model learns a sequence of tasks using only a few samples per task. Consequently, the learned model encounters both catastrophic forgetting and overfitting problems at a time. Existing studies on lifelong GANs have proposed modulation-based methods to prevent catastrophic forgetting. However, they require considerable additional parameters and cannot generate high-fidelity and diverse images from limited data. On the other hand, the existing few-shot GANs suffer from severe catastrophic forgetting when learning multiple tasks. To alleviate these issues, we propose a framework called Lifelong Few-Shot GAN (LFS-GAN) that can generate high-quality and diverse images in lifelong few-shot image generation task. Our proposed framework learns each task using an efficient task-specific modulator - Learnable Factorized Tensor (LeFT). LeFT is rank-constrained and has a rich representation ability due to its unique reconstruction technique. Furthermore, we propose a novel mode seeking loss to improve the diversity of our model in low-data circumstances. Extensive experiments demonstrate that the proposed LFS-GAN can generate high-fidelity and diverse images without any forgetting and mode collapse in various domains, achieving state-of-the-art in lifelong few-shot image generation task. Surprisingly, we find that our LFS-GAN even outperforms the existing few-shot GANs in the few-shot image generation task. The code is available at Github. This paper addresses the novel and challenging task of lifelong few-shot image generation, where a model needs to learn a sequence of image generation tasks from very limited data without forgetting previous tasks. This task is important for real-world scenarios where data is scarce or costly to obtain and adapting a new model for each task is impractical. It combines the challenges of lifelong learning (avoiding catastrophic forgetting) and few-shot learning (generalizing from limited data). The paper proposes LFS-GAN, a framework that uses a novel weight modulation technique called Learnable Factorized Tensor (LeFT) to efficiently learn task-specific information without modifying pre-trained weights. Additionally, it introduces a cluster-wise mode seeking loss to enhance generation diversity, especially in low-data regimes. LFS-GAN successfully generates high-quality and diverse images in lifelong few-shot settings, outperforming baselines adapted from both lifelong GANs and few-shot GANs. The proposed LeFT modulation technique is highly efficient in terms of parameter count, using less than 1% of trainable parameters compared to the backbone generator. LFS-GAN also demonstrates superior performance in the standard few-shot image generation task, indicating its ability to generalize well to different data regimes. The paper mainly focuses on StyleGAN2 as a backbone and evaluates on limited datasets. Further exploration with different backbones and diverse datasets is needed. The impact of task sequence and potential biases in the dataset selection on the performance of LFS-GAN requires further investigation. lifelong learning, few-shot learning, image generation, generative adversarial networks (gans), weight modulation
2308.11793 Report Enhancing NeRF akin to Enhancing LLMs: Generalizable NeRF Transformer with Mixture-of-View-Experts Wenyan Cong, Hanxue Liang, Peihao Wang, Zhiwen Fan, Tianlong Chen, Mukund Varma, Yi Wang, Zhangyang Wang Cross-scene generalizable NeRF models, which can directly synthesize novel views of unseen scenes, have become a new spotlight of the NeRF field. Several existing attempts rely on increasingly end-to-end "neuralized" architectures, i.e., replacing scene representation and/or rendering modules with performant neural networks such as transformers, and turning novel view synthesis into a feed-forward inference pipeline. While those feedforward "neuralized" architectures still do not fit diverse scenes well out of the box, we propose to bridge them with the powerful Mixture-of-Experts (MoE) idea from large language models (LLMs), which has demonstrated superior generalization ability by balancing between larger overall model capacity and flexible per-instance specialization. Starting from a recent generalizable NeRF architecture called GNT, we first demonstrate that MoE can be neatly plugged in to enhance the model. We further customize a shared permanent expert and a geometry-aware consistency loss to enforce cross-scene consistency and spatial smoothness respectively, which are essential for generalizable view synthesis. Our proposed model, dubbed GNT with Mixture-of-View-Experts (GNT-MOVE), has experimentally shown state-of-the-art results when transferring to unseen scenes, indicating remarkably better cross-scene generalization in both zero-shot and few-shot settings. Our codes are available at https://github.com/VITA-Group/GNT-MOVE. This paper proposes GNT-MOVE, an LLM-inspired NeRF framework for generalizable novel view synthesis, by introducing Mixture-of-Experts (MoE) into GNT and customizing it with a permanent expert and a geometry-aware spatial consistency objective. Existing cross-scene generalizable NeRF models struggle to balance "generality" (covering diverse scenes) and "specialization" (modeling per-scene details). MoE, inspired by its success in LLMs, offers a potential solution. The authors integrate MoE into GNT's view transformer. To address the cross-scene consistency and spatial smoothness requirements of NeRF, they introduce a shared permanent expert and a geometry-aware spatial consistency objective. GNT-MOVE achieves state-of-the-art results in zero-shot generalization on LLFF, NeRF Synthetic, Shiny-6, Tanks-and-Temples, and NMR datasets. GNT-MOVE consistently outperforms previous SOTA methods in few-shot generalization on LLFF and NeRF Synthetic datasets. Analysis of expert selection reveals that GNT-MOVE effectively captures both cross-scene and cross-view consistency, as well as expert specialization for diverse rendering properties. The paper primarily focuses on applying MoE to the view transformer in GNT. Exploring its integration with the ray transformer could be promising. The impact of MoE on computational cost, while claimed to be low, is not thoroughly analyzed. nerf, novel view synthesis, mixture-of-experts, generalization, transformer
2308.11605 Report GOPro: Generate and Optimize Prompts in CLIP using Self-Supervised Learning Mainak Singha, Ankit Jha, Biplab Banerjee Large-scale foundation models, such as CLIP, have demonstrated remarkable success in visual recognition tasks by embedding images in a semantically rich space. Self-supervised learning (SSL) has also shown promise in improving visual recognition by learning invariant features. However, the combination of CLIP with SSL is found to face challenges due to the multi-task framework that blends CLIP's contrastive loss and SSL's loss, including difficulties with loss weighting and inconsistency among different views of images in CLIP's output space. To overcome these challenges, we propose a prompt learning-based model called GOPro, which is a unified framework that ensures similarity between various augmented views of input images in a shared image-text embedding space, using a pair of learnable image and text projectors atop CLIP, to promote invariance and generalizability. To automatically learn such prompts, we leverage the visual content and style primitives extracted from pre-trained CLIP and adapt them to the target task. In addition to CLIP's cross-domain contrastive loss, we introduce a visual contrastive loss and a novel prompt consistency loss, considering the different views of the images. GOPro is trained end-to-end on all three loss objectives, combining the strengths of CLIP and SSL in a principled manner. Empirical evaluations demonstrate that GOPro outperforms the state-of-the-art prompting techniques on three challenging domain generalization tasks across multiple benchmarks by a significant margin. Our code is available at https://github.com/mainaksingha01/GOPro. GOPro leverages contrastive SSL and pre-trained CLIP to generate domain and class-agnostic prompts for enhanced generalization and invariance in embedding space against various image transformations. Addresses limitations in combining CLIP with SSL, specifically in learning generalizable prompts and ensuring semantic invariance, for improved performance on domain and class generalization tasks. Learnable image and text projectors atop frozen CLIP. Employs visual contrastive loss (MoCo v3 augmentations), CLIP's image-text contrastive loss, and a novel prompt consistency loss (MoCo v3 and AugMix augmentations). Prompt learning leverages multi-scale visual content and style information from CLIP. GOPro outperforms SOTA prompting techniques on B2N class generalization, achieving superior scores and exceeding SLIP by a significant margin. GOPro effectively mitigates the generalization gap for diverse domains and classes, outperforming competitors in cross-dataset generalization. In domain generalization, GOPro surpasses other methods in both source and target domains, demonstrating its robustness. Limited exploration of prompt context lengths beyond 4 tokens. Future work can explore applications in specific domains like medical imaging and remote sensing. prompt learning, self-supervised learning, domain generalization, clip, vision-language models
2308.11568 Report SPANet: Frequency-balancing Token Mixer using Spectral Pooling Aggregation Modulation Guhnoo Yun, Juhan Yoo, Kijung Kim, Jeongho Lee, Dong Hwan Kim Recent studies show that self-attentions behave like low-pass filters (as opposed to convolutions) and enhancing their high-pass filtering capability improves model performance. Contrary to this idea, we investigate existing convolution-based models with spectral analysis and observe that improving the low-pass filtering in convolution operations also leads to performance improvement. To account for this observation, we hypothesize that utilizing optimal token mixers that capture balanced representations of both high- and low-frequency components can enhance the performance of models. We verify this by decomposing visual features into the frequency domain and combining them in a balanced manner. To handle this, we replace the balancing problem with a mask filtering problem in the frequency domain. Then, we introduce a novel token-mixer named SPAM and leverage it to derive a MetaFormer model termed as SPANet. Experimental results show that the proposed method provides a way to achieve this balance, and the balanced representations of both high- and low-frequency components can improve the performance of models on multiple computer vision tasks. Our code is available at https://doranlyong.github.io/projects/spanet/. This paper proposes SPANet, a novel MetaFormer model employing SPAM, a frequency-balancing token mixer, to enhance model performance by capturing balanced representations of high- and low-frequency components in visual features. Recent studies highlight the importance of balancing high- and low-pass filtering capabilities in token mixers for improved model performance, prompting the exploration of optimal token mixers. The authors introduce SPAM, which uses Spectral Pooling Gates (SPG) to decompose features into frequency components and recombine them with learned weights. They build SPANet by integrating SPAM into a MetaFormer architecture. SPANets outperform state-of-the-art CNNs and MetaFormers in image classification and semantic segmentation tasks. SPANets achieve competitive results in object detection and instance segmentation tasks. Ablation studies confirm the significance of individual SPAM components and design choices. SPANets exhibit limited performance improvement in dense prediction tasks due to the pre-trained backbone's bias toward low-frequency components. Exploration of frequency-balancing token mixers tailored for task-specific characteristics is needed. metaformer, token mixer, frequency balancing, spectral pooling, computer vision
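The frequency-balancing idea can be pictured with a small FFT-based mixer: split features into low- and high-frequency bands with a radial mask in the Fourier domain and recombine the bands with learnable weights. The module below is an illustrative stand-in, not the actual SPAM/SPG design; the cutoff and weighting scheme are assumptions.

```python
import torch
import torch.nn as nn

class FrequencyBalancedMixer(nn.Module):
    """Split features into low/high-frequency bands via the FFT and recombine
    them with learnable per-band weights (illustrative sketch, not SPAM itself)."""

    def __init__(self, cutoff: float = 0.25):
        super().__init__()
        self.cutoff = cutoff                           # radius of the low-pass band (fraction of Nyquist)
        self.w_low = nn.Parameter(torch.tensor(1.0))   # learnable weight for the low-frequency band
        self.w_high = nn.Parameter(torch.tensor(1.0))  # learnable weight for the high-frequency band

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, H, W)
        _, _, H, W = x.shape
        freq = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
        yy, xx = torch.meshgrid(torch.linspace(-1, 1, H, device=x.device),
                                torch.linspace(-1, 1, W, device=x.device),
                                indexing="ij")
        low_mask = ((yy ** 2 + xx ** 2).sqrt() <= self.cutoff).float()
        low = torch.fft.ifft2(torch.fft.ifftshift(freq * low_mask, dim=(-2, -1))).real
        high = x - low                                 # residual carries the high frequencies
        return self.w_low * low + self.w_high * high

# Toy usage: shape-preserving mixing of a (2, 16, 32, 32) feature map.
mixer = FrequencyBalancedMixer()
out = mixer(torch.randn(2, 16, 32, 32))
```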
2308.11506 Report LCCo: Lending CLIP to Co-Segmentation Xin Duan, Yan Yang, Liyuan Pan, Xiabi Liu This paper studies co-segmenting the common semantic object in a set of images. Existing works either rely on carefully engineered networks to mine the implicit semantic information in visual features or require extra data (i.e., classification labels) for training. In this paper, we leverage the contrastive language-image pre-training framework (CLIP) for the task. With a backbone segmentation network that independently processes each image from the set, we introduce semantics from CLIP into the backbone features, refining them in a coarse-to-fine manner with three key modules: i) an image set feature correspondence module, encoding global consistent semantic information of the image set; ii) a CLIP interaction module, using CLIP-mined common semantics of the image set to refine the backbone feature; iii) a CLIP regularization module, drawing CLIP towards this co-segmentation task, identifying the best CLIP semantic and using it to regularize the backbone feature. Experiments on four standard co-segmentation benchmark datasets show that the performance of our method outperforms state-of-the-art methods. This paper introduces LCCo, a novel framework that leverages the Contrastive Language-Image Pre-training (CLIP) model for the task of image co-segmentation, aiming to identify and segment common semantic objects within a set of images. Existing co-segmentation methods often struggle to accurately extract common semantic information, relying on complex network designs or requiring additional labeled data for training. This paper explores the use of CLIP to overcome these limitations. LCCo refines multi-scale features from a backbone segmentation network using CLIP. It employs three key modules: (1) Image Set Feature Correspondence Module - Encodes global semantic information of the image set at a coarse level. (2) CLIP Interaction Module - Modulates mid-level features using CLIP embeddings by fusing image and distilled text semantics. (3) CLIP Regularization Module - Identifies the most common semantic class from CLIP and uses it to refine fine-grained features, drawing CLIP towards the co-segmentation task. LCCo achieves state-of-the-art performance on four standard co-segmentation benchmarks (MSRC, Internet, iCoseg, and PASCAL) outperforming existing methods. The method demonstrates consistent performance improvements with an increasing number of input images, unlike some previous approaches. Ablation studies validate the effectiveness of each proposed module and the contribution of the novel segmentation and classification losses. The use of large-scale CLIP models increases computational demands compared to some past methods, though it remains relatively efficient. Future work could explore extending the framework to handle more complex scenarios, such as co-segmenting multiple objects. image co-segmentation, contrastive language-image pre-training (clip), semantic segmentation, zero-shot learning, computer vision
2308.11473 Report IT3D: Improved Text-to-3D Generation with Explicit View Synthesis Yiwen Chen, Chi Zhang, Xiaofeng Yang, Zhongang Cai, Gang Yu, Lei Yang, Guosheng Lin Recent strides in Text-to-3D techniques have been propelled by distilling knowledge from powerful large text-to-image diffusion models (LDMs). Nonetheless, existing Text-to-3D approaches often grapple with challenges such as over-saturation, inadequate detailing, and unrealistic outputs. This study presents a novel strategy that leverages explicitly synthesized multi-view images to address these issues. Our approach involves the utilization of image-to-image pipelines, empowered by LDMs, to generate posed high-quality images based on the renderings of coarse 3D models. Although the generated images mostly alleviate the aforementioned issues, challenges such as view inconsistency and significant content variance persist due to the inherent generative nature of large diffusion models, posing extensive difficulties in leveraging these images effectively. To overcome this hurdle, we advocate integrating a discriminator alongside a novel Diffusion-GAN dual training strategy to guide the training of 3D models. For the incorporated discriminator, the synthesized multi-view images are considered real data, while the renderings of the optimized 3D models function as fake data. We conduct a comprehensive set of experiments that demonstrate the effectiveness of our method over baseline approaches. IT3D is a novel plug-and-play refinement method for text-to-3D generation that leverages explicitly synthesized multi-view images and a Diffusion-GAN dual training strategy. Existing text-to-3D methods often struggle with issues like over-saturation, lack of detail, and unrealistic outputs. IT3D addresses these limitations by incorporating high-quality 2D image generation techniques. 1. Generate a coarse 3D model from text. 2. Synthesize a multi-view image dataset using an image-to-image pipeline conditioned on renderings of the coarse model. 3. Refine the 3D model using a Diffusion-GAN dual training strategy that combines diffusion prior with a discriminator trained on the synthesized dataset. Significantly enhances texture detail, geometry, and fidelity to text prompts compared to baseline methods. Demonstrates robustness by successfully refining models even with low-quality coarse models or imperfections in the synthesized dataset. Achieves a high user preference score (89.92%) compared to the baseline method in a user study. Performance is limited by the capabilities of the image-to-image pipeline used for dataset generation. Future work could explore dataset update strategies for further quality improvement. text-to-3d generation, diffusion models, generative adversarial networks (gans), image-to-image translation, 3d model refinement
2308.11417 Report ScanNet++: A High-Fidelity Dataset of 3D Indoor Scenes Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, Angela Dai We present ScanNet++, a large-scale dataset that couples together capture of high-quality and commodity-level geometry and color of indoor scenes. Each scene is captured with a high-end laser scanner at sub-millimeter resolution, along with registered 33-megapixel images from a DSLR camera, and RGB-D streams from an iPhone. Scene reconstructions are further annotated with an open vocabulary of semantics, with label-ambiguous scenarios explicitly annotated for comprehensive semantic understanding. ScanNet++ enables a new real-world benchmark for novel view synthesis, both from high-quality RGB capture, and importantly also from commodity-level images, in addition to a new benchmark for 3D semantic scene understanding that comprehensively encapsulates diverse and ambiguous semantic labeling scenarios. Currently, ScanNet++ contains 460 scenes, 280,000 captured DSLR images, and over 3.7M iPhone RGBD frames. ScanNet++ is a large-scale dataset of high-fidelity 3D indoor scenes, including high-resolution RGB images, commodity RGB-D videos, registered laser scans, and dense semantic annotations with an open vocabulary and multi-labeling. Existing datasets for 3D scene understanding and novel view synthesis lack either scale, high-quality capture, or dense and rich annotations, limiting the development of methods that generalize well. ScanNet++ bridges this divide by providing a large-scale dataset with high-quality data across multiple modalities. The authors captured 81 scenes using a high-end laser scanner, a DSLR camera, and an iPhone 13 Pro. They meticulously aligned all three modalities and densely annotated the reconstructions with semantic and instance labels, explicitly accounting for label ambiguities using multi-labeling. Novel view synthesis methods, even state-of-the-art ones, still face challenges in reconstructing detailed scenes with varying view-dependent effects. Training novel view synthesis models on commodity-level iPhone data, while targeting high-quality DSLR ground truth, presents a significant challenge due to motion blur, varying brightness, and limited field-of-view. The scale and diversity of ScanNet++ enable the training of generalizable priors for novel view synthesis, leading to improved performance compared to traditional single-scene training. Limited diversity due to the focus on indoor scenes and fixed DSLR settings can lead to overexposure or underexposure in certain areas. The expensive data collection process hinders the scalability of ScanNet++ compared to 2D datasets. 3d scene understanding, novel view synthesis, dataset, semantic segmentation, instance segmentation
2308.11357 Report Exemplar-Free Continual Transformer with Convolutions Anurag Roy, Vinay Kumar Verma, Sravan Voonna, Kripabandhu Ghosh, Saptarshi Ghosh, Abir Das Continual Learning (CL) involves training a machine learning model in a sequential manner to learn new information while retaining previously learned tasks without the presence of previous training data. Although there has been significant interest in CL, most recent CL approaches in computer vision have focused on convolutional architectures only. However, with the recent success of vision transformers, there is a need to explore their potential for CL. Although there have been some recent CL approaches for vision transformers, they either store training instances of previous tasks or require a task identifier during test time, which can be limiting. This paper proposes a new exemplar-free approach for class/task incremental learning called ConTraCon, which does not require task-id to be explicitly present during inference and avoids the need for storing previous training instances. The proposed approach leverages the transformer architecture and involves re-weighting the key, query, and value weights of the multi-head self-attention layers of a transformer trained on a similar task. The re-weighting is done using convolution, which enables the approach to maintain low parameter requirements per task. Additionally, an image augmentation-based entropic task identification approach is used to predict tasks without requiring task-ids during inference. Experiments on four benchmark datasets demonstrate that the proposed approach outperforms several competitive approaches while requiring fewer parameters. Proposes ConTraCon, a dynamic architecture for continual learning on transformers using task-specific convolutions and skip-gating to adapt pre-trained transformer weights for new tasks, achieving low memory overhead and strong performance in exemplar-free continual learning. Addresses the challenge of catastrophic forgetting in continual learning, particularly with vision transformers, by enabling efficient adaptation of learned representations to new tasks without storing past data. Leverages convolution operations to re-weight key, query, and value weights of pre-trained transformer encoders for new tasks. Employs learnable skip-gating to balance retaining old knowledge and adapting to new information. Uses an entropy-based task identification approach with image augmentations to infer task identity during inference without requiring explicit task labels. Outperforms state-of-the-art continual learning approaches, including exemplar-based methods, on CIFAR-100, TinyImageNet-200/10, ImageNet-100/10, and 5-Datasets benchmarks in both task and class incremental settings. Achieves superior accuracy with significantly lower memory overhead compared to existing methods, demonstrating efficient parameter use for continual learning. Shows robustness to different task orders, indicating that the initial task's choice does not significantly impact overall performance. The selection of the optimal kernel size for the convolution operation is based on a limited validation set and could be further explored. While the augmentation-based task prediction is lightweight, exploring alternative task inference strategies that don't rely on augmentations could be beneficial. continual learning, vision transformers, convolutional adaptation, exemplar-free learning, task identification
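A hedged sketch of the augmentation-based, entropy-driven task identification summarized in the row above. `task_models`, `augment`, and `n_aug` are illustrative stand-ins; the idea is simply that the task-adapted network with the lowest mean prediction entropy over augmented views is selected at inference.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def predict_task(image, task_models, augment, n_aug=8):
    """Entropy-based task inference: pick the task whose adapted network is
    most confident (lowest mean softmax entropy) over augmented views."""
    best_task, best_entropy = None, float("inf")
    views = torch.stack([augment(image) for _ in range(n_aug)])      # (n_aug, C, H, W)
    for task_id, model in enumerate(task_models):
        probs = F.softmax(model(views), dim=-1)                      # (n_aug, n_classes)
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1).mean()
        if entropy < best_entropy:
            best_task, best_entropy = task_id, entropy.item()
    return best_task
```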
2308.11331 Report GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training Xinchi Deng, Han Shi, Runhui Huang, Changlin Li, Hang Xu, Jianhua Han, James Kwok, Shen Zhao, Wei Zhang, Xiaodan Liang Cross-modal pre-training has shown impressive performance on a wide range of downstream tasks, benefiting from massive image-text pairs collected from the Internet. In practice, online data are growing constantly, highlighting the importance of the ability of pre-trained model to learn from data that is continuously growing. Existing works on cross-modal pre-training mainly focus on training a network with fixed architecture. However, it is impractical to limit the model capacity when considering the continuously growing nature of pre-training data in real-world applications. On the other hand, it is important to utilize the knowledge in the current model to obtain efficient training and better performance. To address the above issues, in this paper, we propose GrowCLIP, a data-driven automatic model growing algorithm for contrastive language-image pre-training with continuous image-text pairs as input. Specially, we adopt a dynamic growth space and seek out the optimal architecture at each growth step to adapt to online learning scenarios. And the shared encoder is proposed in our growth space to enhance the degree of cross-modal fusion. Besides, we explore the effect of growth in different dimensions, which could provide future references for the design of cross-modal model architecture. Finally, we employ parameter inheriting with momentum (PIM) to maintain the previous knowledge and address the issue of the local minimum dilemma. Compared with the existing methods, GrowCLIP improves 2.3% average top-1 accuracy on zero-shot image classification of 9 downstream tasks. As for zero-shot image retrieval, GrowCLIP can improve 1.2% for top-1 image-to-text recall on Flickr30K dataset. This paper proposes GrowCLIP, a data-driven automatic model growing algorithm for contrastive language-image pre-training, designed for scenarios where image-text pair data grows continuously. Existing cross-modal pre-training methods primarily use fixed architectures, which are not optimal for continuously growing datasets. This highlights the need for methods that can dynamically adapt model capacity to data size. GrowCLIP utilizes a dynamic growth space for neural architecture search, a shared encoder to enhance cross-modal fusion, and parameter inheriting with momentum (PIM) to efficiently transfer knowledge from previous models and avoid local minimum issues. GrowCLIP achieves up to 2.3% higher average top-1 accuracy on zero-shot image classification across nine datasets compared to existing methods. On zero-shot image-text retrieval, GrowCLIP demonstrates a 1.2% improvement in top-1 image-to-text recall on the Flickr30K dataset. The study also reveals insights into the relationship between model architecture, data size, and training efficiency in cross-modal pre-training. The effectiveness of GrowCLIP has only been demonstrated using the CC12M dataset. Future work will focus on extending GrowCLIP to real-world scenarios with constantly updated data from the web. cross-modal pre-training, model growing, online learning, neural architecture search, vision-language pre-training (vlp)
2308.11199 Report ConcatPlexer: Additional Dim1 Batching for Faster ViTs Donghoon Han, Seunghyeon Seo, Donghyeon Jeon, Jiho Jang, Chaerin Kong, Nojun Kwak Transformers have demonstrated tremendous success not only in the natural language processing (NLP) domain but also the field of computer vision, igniting various creative approaches and applications. Yet, the superior performance and modeling flexibility of transformers came with a severe increase in computation costs, and hence several works have proposed methods to reduce this burden. Inspired by a cost-cutting method originally proposed for language models, Data Multiplexing (DataMUX), we propose a novel approach for efficient visual recognition that employs additional dim1 batching (i.e., concatenation) that greatly improves the throughput with little compromise in the accuracy. We first introduce a naive adaptation of DataMux for vision models, Image Multiplexer, and devise novel components to overcome its weaknesses, rendering our final model, ConcatPlexer, at the sweet spot between inference speed and accuracy. The ConcatPlexer was trained on ImageNet1K and CIFAR100 dataset and it achieved 23.5% less GFLOPs than ViT-B/16 with 69.5% and 83.4% validation accuracy, respectively. This paper proposes ConcatPlexer, a novel framework for multiplexing images in the vision domain, aiming to improve computational efficiency by processing multiple images simultaneously. Transformer-based models, while powerful, are computationally expensive. Data multiplexing offers a promising way to reduce this cost, but it has been largely unexplored in vision. The authors adapt the DataMUX method from NLP to vision by introducing components like a Transformer Encoder Patchifier and a ConcatMultiplexer. They evaluate ConcatPlexer on ImageNet1K and CIFAR100 image classification tasks. ConcatPlexer achieves a favorable trade-off between accuracy and inference speed, significantly reducing computational cost compared to conventional ViT models. Increasing the number of multiplexed images (N_MUX) generally reduces accuracy, highlighting the challenge of this novel task. The proposed method outperforms a naive adaptation of DataMUX (Image Multiplexer), demonstrating its effectiveness in the vision domain. The performance of ConcatPlexer on ImageNet1K, while promising, lags behind conventional ViT models, suggesting room for improvement. The current Conv-based multiplexing method and hyperparameters could be further optimized to enhance performance. vision transformer, data multiplexing, computational efficiency, image classification, concatplexer
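A toy illustration of the dim-1 (token-axis) batching idea behind ConcatPlexer. The class and argument names are assumptions; the paper's actual patchifier is a small Transformer encoder, and the remaining components (demultiplexing, classification head) are omitted here.

```python
import torch.nn as nn

class ConcatMultiplexer(nn.Module):
    """Toy version of dim-1 batching: token sequences of several images are
    concatenated along the sequence axis and processed in one forward pass."""
    def __init__(self, patchifier: nn.Module, backbone: nn.Module, n_mux: int = 2):
        super().__init__()
        self.patchifier, self.backbone, self.n_mux = patchifier, backbone, n_mux

    def forward(self, images):                    # images: (B * n_mux, C, H, W)
        tokens = self.patchifier(images)          # (B * n_mux, L, D)
        _, seq_len, dim = tokens.shape
        tokens = tokens.reshape(-1, self.n_mux * seq_len, dim)  # concat along dim 1
        return self.backbone(tokens)              # shared encoder sees n_mux images at once
```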
2308.11194 Report ViLLA: Fine-Grained Vision-Language Representation Learning from Real-World Data Maya Varma, Jean-Benoit Delbrouck, Sarah Hooper, Akshay Chaudhari, Curtis Langlotz Vision-language models (VLMs), such as CLIP and ALIGN, are generally trained on datasets consisting of image-caption pairs obtained from the web. However, real-world multimodal datasets, such as healthcare data, are significantly more complex: each image (e.g. X-ray) is often paired with text (e.g. physician report) that describes many distinct attributes occurring in fine-grained regions of the image. We refer to these samples as exhibiting high pairwise complexity, since each image-text pair can be decomposed into a large number of region-attribute pairings. The extent to which VLMs can capture fine-grained relationships between image regions and textual attributes when trained on such data has not been previously evaluated. The first key contribution of this work is to demonstrate through systematic evaluations that as the pairwise complexity of the training dataset increases, standard VLMs struggle to learn region-attribute relationships, exhibiting performance degradations of up to 37% on retrieval tasks. In order to address this issue, we introduce ViLLA as our second key contribution. ViLLA, which is trained to capture fine-grained region-attribute relationships from complex datasets, involves two components: (a) a lightweight, self-supervised mapping model to decompose image-text samples into region-attribute pairs, and (b) a contrastive VLM to learn representations from generated region-attribute pairs. We demonstrate with experiments across four domains (synthetic, product, medical, and natural images) that ViLLA outperforms comparable VLMs on fine-grained reasoning tasks, such as zero-shot object detection (up to 3.6 AP50 points on COCO and 0.6 mAP points on LVIS) and retrieval (up to 14.2 R-Precision points). This paper investigates the performance of Vision-Language Models (VLMs) on real-world datasets with high pairwise complexity, where image-text pairs can be decomposed into many region-attribute pairings. They introduce ViLLA, a self-supervised approach to improve fine-grained reasoning in VLMs trained on such datasets. Standard VLMs, trained on simple image-caption pairs, struggle to capture fine-grained region-attribute relationships present in complex, real-world multimodal datasets. ViLLA uses a two-stage pipeline: 1) a lightweight mapping model decomposes image-text samples into region-attribute pairs using self-supervision, 2) a standard VLM is trained on these generated pairs to learn fine-grained representations. ViLLA outperforms comparable VLMs on zero-shot object detection (up to 3.6 AP50 points on COCO and 0.6 mAP points on LVIS). ViLLA achieves significant improvements in text-to-region and region-to-text retrieval tasks (up to 14.2 R-Precision points improvement). ViLLA's region-attribute mappings are up to 25.8 F1 points more accurate than prior methods. Evaluations are currently limited to image-text datasets. Region-attribute mapping accuracy evaluation is limited on datasets without ground-truth annotations. vision-language models, fine-grained reasoning, self-supervised learning, multimodal datasets, region-attribute mapping
2308.11130 Report Efficient View Synthesis with Neural Radiance Distribution Field Yushuang Wu, Xiao Li, Jinglu Wang, Xiaoguang Han, Shuguang Cui, Yan Lu Recent work on Neural Radiance Fields (NeRF) has demonstrated significant advances in high-quality view synthesis. A major limitation of NeRF is its low rendering efficiency due to the need for multiple network forwardings to render a single pixel. Existing methods to improve NeRF either reduce the number of required samples or optimize the implementation to accelerate the network forwarding. Despite these efforts, the problem of multiple sampling persists due to the intrinsic representation of radiance fields. In contrast, Neural Light Fields (NeLF) reduce the computation cost of NeRF by querying only one single network forwarding per pixel. To achieve a close visual quality to NeRF, existing NeLF methods require significantly larger network capacities which limits their rendering efficiency in practice. In this work, we propose a new representation called Neural Radiance Distribution Field (NeRDF) that targets efficient view synthesis in real-time. Specifically, we use a small network similar to NeRF while preserving the rendering speed with a single network forwarding per pixel as in NeLF. The key is to model the radiance distribution along each ray with frequency basis and predict frequency weights using the network. Pixel values are then computed via volume rendering on radiance distributions. Experiments show that our proposed method offers a better trade-off among speed, quality, and network size than existing methods: we achieve a ~254x speed-up over NeRF with similar network size, with only a marginal performance decline. Our project page is at yushuang-wu.github.io/NeRDF. This paper proposes Neural Radiance Distribution Field (NeRDF), a novel neural scene representation for real-time view synthesis. Existing view synthesis methods struggle to achieve a balance between high visual quality, fast rendering speed, and low memory cost. NeRDF aims to break this impossible trinity. NeRDF predicts the radiance distribution along each ray using a compact neural network. It leverages a knowledge distillation framework with a teacher NeRF and introduces online view sampling and a volume density constraint to enhance learning. NeRDF achieves comparable visual quality to NeRF-based methods while being significantly faster. On the LLFF dataset, NeRDF-8 achieves a rendering speed of ~21 FPS with an 8-layer MLP, outperforming most NeRF-based methods. With inference optimization, NeRDF achieves up to ~369 FPS, a ~1400x speed-up over an unoptimized NeRF. NeRDF needs to be extended to handle 360-degree scenes effectively. Future work includes improving view synthesis quality and extending NeRDF for dynamic scenes. view synthesis, neural radiance fields, neural light fields, knowledge distillation, real-time rendering
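A rough sketch of the single-forwarding-per-ray idea summarized above: the network predicts frequency weights from which density and color along the ray are reconstructed and then volume-rendered. The cosine basis, the weight layout, and the activation choices are assumptions, not the paper's exact parameterization.

```python
import torch

def render_ray(ray_feat, mlp, n_basis=16, n_samples=64, t_near=0.0, t_far=1.0):
    """One network call per ray: the MLP predicts frequency weights, from which
    density and color along the ray are reconstructed and volume-rendered."""
    w = mlp(ray_feat)                                    # (n_basis * 4,): sigma + RGB weights
    w_sigma, w_rgb = w[:n_basis], w[n_basis:].reshape(3, n_basis)
    t = torch.linspace(t_near, t_far, n_samples)         # sample positions along the ray
    k = torch.arange(n_basis, dtype=torch.float32)
    basis = torch.cos(torch.pi * k[None, :] * (t[:, None] - t_near) / (t_far - t_near))
    sigma = torch.relu(basis @ w_sigma)                   # (n_samples,)
    rgb = torch.sigmoid(basis @ w_rgb.T)                  # (n_samples, 3)
    delta = (t_far - t_near) / n_samples
    alpha = 1.0 - torch.exp(-sigma * delta)
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]
    return (trans * alpha)[:, None].mul(rgb).sum(dim=0)   # rendered pixel color
```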
2308.11093 Report Video OWL-ViT: Temporally-consistent open-world localization in video Georg Heigold, Matthias Minderer, Alexey Gritsenko, Alex Bewley, Daniel Keysers, Mario Lučić, Fisher Yu, Thomas Kipf We present an architecture and a training recipe that adapts pre-trained open-world image models to localization in videos. Understanding the open visual world (without being constrained by fixed label spaces) is crucial for many real-world vision tasks. Contrastive pre-training on large image-text datasets has recently led to significant improvements for image-level tasks. For more structured tasks involving object localization applying pre-trained models is more challenging. This is particularly true for video tasks, where task-specific data is limited. We show successful transfer of open-world models by building on the OWL-ViT open-vocabulary detection model and adapting it to video by adding a transformer decoder. The decoder propagates object representations recurrently through time by using the output tokens for one frame as the object queries for the next. Our model is end-to-end trainable on video data and enjoys improved temporal consistency compared to tracking-by-detection baselines, while retaining the open-world capabilities of the backbone detector. We evaluate our model on the challenging TAO-OW benchmark and demonstrate that open-world capabilities, learned from large-scale image-text pre-training, can be transferred successfully to open-world localization across diverse videos. The paper introduces Video OWL-ViT, an end-to-end trainable model for open-world object localization and tracking in videos. Open-world object understanding in videos is crucial for real-world applications but challenging due to limitations in labeled video data. The model adapts the OWL-ViT image-based open-world detector by adding a transformer decoder for temporal consistency and is trained on a combination of real and pseudo videos. Video OWL-ViT achieves competitive performance with tracking-by-detection baselines on the TAO-OW benchmark. The model shows strong generalization to unseen object classes in both TAO-OW and YT-VIS datasets. End-to-end learning of temporal associations leads to improved accuracy compared to heuristic matching methods. Performance on short object tracks remains a challenge due to limitations in training data and object presence modeling. Future work includes exploring better object presence indicators and leveraging larger and more diverse video datasets. open-world learning, object tracking, video understanding, vision transformer, open-vocabulary detection
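A minimal sketch of the recurrent query propagation described above, assuming generic `image_encoder` and `decoder` modules: the decoder's output tokens for one frame are fed back as the object queries for the next frame, which is what ties detections into temporally consistent tracks.

```python
import torch

def track_video(frames, image_encoder, decoder, init_queries):
    """Recurrent propagation: output tokens for frame t become the object
    queries for frame t+1, giving temporally consistent object slots."""
    queries = init_queries                        # (num_objects, D), learned initialization
    per_frame_outputs = []
    for frame in frames:                          # frames: iterable of (C, H, W) tensors
        image_tokens = image_encoder(frame)       # (num_patches, D)
        queries = decoder(queries, image_tokens)  # cross-attend to the current frame
        per_frame_outputs.append(queries)         # decode boxes / class embeddings from these
    return torch.stack(per_frame_outputs)         # (T, num_objects, D)
```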
2308.11025 Report Coordinate Quantized Neural Implicit Representations for Multi-view Reconstruction Sijia Jiang, Jing Hua, Zhizhong Han In recent years, huge progress has been made on learning neural implicit representations from multi-view images for 3D reconstruction. As an additional input complementing coordinates, using sinusoidal functions as positional encodings plays a key role in revealing high frequency details with coordinate-based neural networks. However, high frequency positional encodings make the optimization unstable, which results in noisy reconstructions and artifacts in empty space. To resolve this issue in a general sense, we introduce to learn neural implicit representations with quantized coordinates, which reduces the uncertainty and ambiguity in the field during optimization. Instead of continuous coordinates, we discretize continuous coordinates into discrete coordinates using nearest interpolation among quantized coordinates which are obtained by discretizing the field in an extremely high resolution. We use discrete coordinates and their positional encodings to learn implicit functions through volume rendering. This significantly reduces the variations in the sample space, and triggers more multi-view consistency constraints on intersections of rays from different views, which enables to infer implicit function in a more effective way. Our quantized coordinates do not bring any computational burden, and can seamlessly work upon the latest methods. Our evaluations under the widely used benchmarks show our superiority over the state-of-the-art. Our code is available at https://github.com/MachinePerceptionLab/CQ-NIR. This paper introduces quantized coordinates to stabilize the optimization process and enhance the accuracy of neural implicit representations learned from multi-view images. High-frequency positional encodings, while crucial for capturing detail in neural implicit representations, often lead to unstable optimization, resulting in noisy reconstructions and artifacts. This paper addresses this challenge by reducing uncertainty and ambiguity during optimization. The authors discretize continuous 3D coordinates into discrete ones using nearest interpolation based on a high-resolution grid of quantized coordinates. These discrete coordinates, along with their positional encodings, are then used as input for learning implicit functions via volume rendering. Quantized coordinates significantly reduce variations in the sample space, leading to more stable optimization. The approach effectively imposes multi-view consistency constraints, improving the accuracy of the learned implicit functions. Experiments on DTU, ScanNet, and Replica datasets demonstrate superior performance compared to state-of-the-art methods, showcasing smoother surfaces and finer geometric details. A very high resolution of quantized coordinates may degenerate the result due to less overlapped samples along rays. Future work involves exploring alternative discretization strategies beyond nearest interpolation to potentially further enhance accuracy. neural implicit representations, 3d reconstruction, multi-view reconstruction, quantized coordinates, positional encoding
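The coordinate discretization itself amounts to a nearest-neighbour snap onto a dense grid; a hedged one-function sketch, with the grid resolution and coordinate bounds as illustrative assumptions.

```python
import torch

def quantize_coords(x, resolution=2**10, low=-1.0, high=1.0):
    """Snap continuous 3D sample coordinates to the nearest vertex of a dense
    grid before computing positional encodings (resolution is illustrative)."""
    step = (high - low) / resolution
    return low + torch.round((x - low) / step) * step
```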
2308.10997 Report MarkovGen: Structured Prediction for Efficient Text-to-Image Generation Sadeep Jayasumana, Daniel Glasner, Srikumar Ramalingam, Andreas Veit, Ayan Chakrabarti, Sanjiv Kumar Modern text-to-image generation models produce high-quality images that are both photorealistic and faithful to the text prompts. However, this quality comes at significant computational cost: nearly all of these models are iterative and require running sampling multiple times with large models. This iterative process is needed to ensure that different regions of the image are not only aligned with the text prompt, but also compatible with each other. In this work, we propose a light-weight approach to achieving this compatibility between different regions of an image, using a Markov Random Field (MRF) model. We demonstrate the effectiveness of this method on top of the latent token-based Muse text-to-image model. The MRF richly encodes the compatibility among image tokens at different spatial locations to improve quality and significantly reduce the required number of Muse sampling steps. Inference with the MRF is significantly cheaper, and its parameters can be quickly learned through back-propagation by modeling MRF inference as a differentiable neural-network layer. Our full model, MarkovGen, uses this proposed MRF model to both speed up Muse by 1.5X and produce higher quality images by decreasing undesirable image artifacts. This paper introduces MarkovGen, a text-to-image model that leverages a Markov Random Field (MRF) to improve speed and quality of token-based image generation, focusing on the Muse model. Existing text-to-image models, though impressive, often require significant computational resources due to iterative sampling processes. MarkovGen addresses this by using an MRF to efficiently ensure compatibility between image regions, leading to faster and better image generation. The authors formulate the token arrangement problem as MAP inference in an MRF, capturing token compatibility via unary (neural network confidence) and pairwise (spatial and label compatibility) terms. They then integrate this MRF into Muse, replacing its later sampling steps to expedite the process. MarkovGen achieves a 1.5x speedup compared to the baseline Muse model. Human evaluation confirms that MarkovGen generates higher quality images than both early exit and full Muse models. Quantitative evaluation using FID scores on the MS-COCO dataset further validates MarkovGen's improved image quality over Muse. The current MRF model doesn't directly incorporate text prompt information, relying solely on unary terms for text guidance. Future work could explore joint training of the Muse model with the MRF layers for optimal unary generation. text-to-image generation, markov random field, structured prediction, muse, image quality
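A simplified, mean-field-style illustration of combining unary token confidences with a learned pairwise compatibility over spatial neighbours. The 4-neighbour aggregation with wrap-around and the fixed iteration count are assumptions; the paper learns its MRF parameters end-to-end as a differentiable layer.

```python
import torch
import torch.nn.functional as F

def mrf_refine(unary_logits, compat, n_iters=3):
    """Mean-field-style refinement of a grid of token distributions.
    unary_logits: (H, W, V) confidences from the token predictor.
    compat: (V, V) learned label-compatibility matrix (pairwise term)."""
    q = F.softmax(unary_logits, dim=-1)
    for _ in range(n_iters):
        # aggregate beliefs of the 4-connected spatial neighbours
        msg = (torch.roll(q, 1, 0) + torch.roll(q, -1, 0) +
               torch.roll(q, 1, 1) + torch.roll(q, -1, 1))
        q = F.softmax(unary_logits + msg @ compat, dim=-1)
    return q.argmax(dim=-1)   # refined token indices for the image grid
```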
2308.10916 Report Diffusion Model as Representation Learner Xingyi Yang, Xinchao Wang Diffusion Probabilistic Models (DPMs) have recently demonstrated impressive results on various generative tasks.Despite its promises, the learned representations of pre-trained DPMs, however, have not been fully understood. In this paper, we conduct an in-depth investigation of the representation power of DPMs, and propose a novel knowledge transfer method that leverages the knowledge acquired by generative DPMs for recognition tasks. Our study begins by examining the feature space of DPMs, revealing that DPMs are inherently denoising autoencoders that balance the representation learning with regularizing model capacity. To this end, we introduce a novel knowledge transfer paradigm named RepFusion. Our paradigm extracts representations at different time steps from off-the-shelf DPMs and dynamically employs them as supervision for student networks, in which the optimal time is determined through reinforcement learning. We evaluate our approach on several image classification, semantic segmentation, and landmark detection benchmarks, and demonstrate that it outperforms state-of-the-art methods. Our results uncover the potential of DPMs as a powerful tool for representation learning and provide insights into the usefulness of generative models beyond sample generation. The code is available at \url{https://github.com/Adamdad/Repfusion}. This paper investigates the representation learning capability of pre-trained Diffusion Probabilistic Models (DPMs) and proposes RepFusion, a novel knowledge transfer method for recognition tasks. While DPMs excel in generative tasks, their potential for representation learning remains underexplored. This work aims to bridge this gap and leverage pre-trained DPMs for improved recognition performance. The authors analyze DPMs as denoising autoencoders, revealing a trade-off between representation learning and regularization. RepFusion utilizes knowledge distillation, dynamically extracting intermediate representations from DPMs at optimal time steps determined via reinforcement learning. RepFusion enhances semantic segmentation, exceeding both knowledge distillation and self-supervised learning approaches on CelebAMask-HQ. It improves face keypoint detection, surpassing self-supervised methods on WFLW, particularly in challenging scenarios with pose variation, occlusion, and poor illumination. RepFusion boosts image classification accuracy on CIFAR-10 and Tiny-ImageNet, outperforming models distilled from supervised teachers. The study primarily focuses on visual recognition tasks, leaving exploration of other domains for future work. Future work can investigate the computational cost associated with the reinforcement learning component of RepFusion. diffusion probabilistic models, representation learning, knowledge distillation, reinforcement learning, visual recognition
2308.10902 Report CamP: Camera Preconditioning for Neural Radiance Fields Keunhong Park, Philipp Henzler, Ben Mildenhall, Jonathan T. Barron, Ricardo Martin-Brualla Neural Radiance Fields (NeRF) can be optimized to obtain high-fidelity 3D scene reconstructions of objects and large-scale scenes. However, NeRFs require accurate camera parameters as input -- inaccurate camera parameters result in blurry renderings. Extrinsic and intrinsic camera parameters are usually estimated using Structure-from-Motion (SfM) methods as a pre-processing step to NeRF, but these techniques rarely yield perfect estimates. Thus, prior works have proposed jointly optimizing camera parameters alongside a NeRF, but these methods are prone to local minima in challenging settings. In this work, we analyze how different camera parameterizations affect this joint optimization problem, and observe that standard parameterizations exhibit large differences in magnitude with respect to small perturbations, which can lead to an ill-conditioned optimization problem. We propose using a proxy problem to compute a whitening transform that eliminates the correlation between camera parameters and normalizes their effects, and we propose to use this transform as a preconditioner for the camera parameters during joint optimization. Our preconditioned camera optimization significantly improves reconstruction quality on scenes from the Mip-NeRF 360 dataset: we reduce error rates (RMSE) by 67% compared to state-of-the-art NeRF approaches that do not optimize for cameras like Zip-NeRF, and by 29% relative to state-of-the-art joint optimization approaches using the camera parameterization of SCNeRF. Our approach is easy to implement, does not significantly increase runtime, can be applied to a wide variety of camera parameterizations, and can straightforwardly be incorporated into other NeRF-like models. This paper proposes CamP, a preconditioning technique for camera parameters in Neural Radiance Fields (NeRFs) that improves joint optimization of camera parameters and scene reconstruction. NeRFs are sensitive to camera parameter accuracy, and existing joint optimization methods struggle with local minima due to the ill-conditioned nature of the problem caused by differing parameter sensitivities. The method analyzes camera parameterization effects on point projection using a proxy problem. It computes a whitening transform as a preconditioner, normalizing parameter effects and decorrelating them to improve optimization stability. Preconditioned camera optimization significantly improves reconstruction quality on the mip-NeRF 360 dataset, reducing error rates compared to non-optimizing and state-of-the-art joint optimization approaches. The FocalPose camera parameterization, when preconditioned, outperforms other alternatives on both synthetic and real datasets. Preconditioning consistently improves results across different camera parameterizations and datasets, including challenging cellphone captures with ARKit poses. The preconditioning approach may not always prevent local minima, particularly in challenging cases with significant camera pose errors. Dynamically updating the preconditioner during optimization could be beneficial but requires further investigation. neural radiance fields, camera optimization, 3d reconstruction, preconditioning, novel view synthesis
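A hedged sketch of computing a whitening preconditioner from a proxy projection problem, as summarized above. `proxy_fn` (projecting proxy points given camera parameters), the eigendecomposition route to the inverse square root, and the epsilon are illustrative choices rather than the paper's exact recipe.

```python
import torch

def camera_preconditioner(proxy_fn, cam_params, eps=1e-6):
    """Whitening preconditioner from a proxy problem: J is the Jacobian of
    projected proxy points w.r.t. the camera parameters; P ~ (J^T J + eps I)^(-1/2)
    decorrelates and rescales the parameters before joint optimization."""
    J = torch.autograd.functional.jacobian(proxy_fn, cam_params)   # (..., n_params)
    J = J.reshape(-1, cam_params.numel())
    cov = J.T @ J + eps * torch.eye(cam_params.numel())
    evals, evecs = torch.linalg.eigh(cov)
    return evecs @ torch.diag(evals.clamp_min(eps).rsqrt()) @ evecs.T  # inverse square root

# During joint training one would optimize a latent z and map it to camera
# parameters via cam = cam_init + P @ z, so gradients w.r.t. z are well-conditioned.
```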
2308.10718 Report Backdooring Textual Inversion for Concept Censorship Yutong Wu, Jie Zhang, Florian Kerschbaum, Tianwei Zhang Recent years have witnessed success in AIGC (AI Generated Content). People can make use of a pre-trained diffusion model to generate images of high quality or freely modify existing pictures with only prompts in nature language. More excitingly, the emerging personalization techniques make it feasible to create specific-desired images with only a few images as references. However, this induces severe threats if such advanced techniques are misused by malicious users, such as spreading fake news or defaming individual reputations. Thus, it is necessary to regulate personalization models (i.e., concept censorship) for their development and advancement. In this paper, we focus on the personalization technique dubbed Textual Inversion (TI), which is becoming prevailing for its lightweight nature and excellent performance. TI crafts the word embedding that contains detailed information about a specific object. Users can easily download the word embedding from public websites like Civitai and add it to their own stable diffusion model without fine-tuning for personalization. To achieve the concept censorship of a TI model, we propose leveraging the backdoor technique for good by injecting backdoors into the Textual Inversion embeddings. Briefly, we select some sensitive words as triggers during the training of TI, which will be censored for normal use. In the subsequent generation stage, if the triggers are combined with personalized embeddings as final prompts, the model will output a pre-defined target image rather than images including the desired malicious concept. To demonstrate the effectiveness of our approach, we conduct extensive experiments on Stable Diffusion, a prevailing open-sourced text-to-image model. Our code, data, and results are available at https://concept-censorship.github.io. This paper presents a novel method for concept censorship in Textual Inversion, a technique for personalizing text-to-image models. The increasing availability of AI-generated content, particularly through personalized text-to-image models, presents risks of misuse such as spreading misinformation and creating harmful content. This work aims to mitigate these risks by regulating the use of Textual Inversion. The authors propose to backdoor Textual Inversion embeddings by incorporating sensitive words as triggers during training. This results in the model generating a pre-defined target image instead of the desired content when these triggers are included in the prompts. The proposed method successfully embeds multiple themes into a single word embedding, allowing for censorship of various concepts. Backdoored Textual Inversion retains its utility for benign prompts, preserving the fidelity and editability of generated images. The backdoor is robust against potential countermeasures such as word embedding removal, perturbation, and adaptive attacks. The current approach requires training Textual Inversion from scratch, limiting its applicability to scenarios where training data is accessible. The method relies on a set of hyper-parameters, and finding the optimal configuration can be costly. textual inversion, concept censorship, backdoor attacks, text-to-image generation, diffusion models
2308.10648 Report EVE: Efficient zero-shot text-based Video Editing with Depth Map Guidance and Temporal Consistency Constraints Yutao Chen, Xingning Dong, Tian Gan, Chunluan Zhou, Ming Yang, Qingpei Guo Motivated by the superior performance of image diffusion models, more and more researchers strive to extend these models to the text-based video editing task. Nevertheless, current video editing tasks mainly suffer from the dilemma between the high fine-tuning cost and the limited generation capacity. Compared with images, we conjecture that videos necessitate more constraints to preserve the temporal consistency during editing. Towards this end, we propose EVE, a robust and efficient zero-shot video editing method. Under the guidance of depth maps and temporal consistency constraints, EVE derives satisfactory video editing results with an affordable computational and time cost. Moreover, recognizing the absence of a publicly available video editing dataset for fair comparisons, we construct a new benchmark ZVE-50 dataset. Through comprehensive experimentation, we validate that EVE could achieve a satisfactory trade-off between performance and efficiency. We will release our dataset and codebase to facilitate future researchers. EVE, a zero-shot, text-based video editing method that balances generation quality and efficiency by using depth map guidance and temporal consistency constraints. Current text-based video editing methods struggle with the trade-off between high fine-tuning costs and limited generation quality, especially in maintaining temporal consistency. EVE uses a pre-trained LDM and operates in a zero-shot manner. It incorporates depth map features in the DDIM inversion and denoising process and introduces a frame-aligned attention mechanism to enhance temporal consistency. EVE outperforms the baseline FateZero in both temporal and prompt consistency on the ZVE-50 dataset. Depth map guidance and frame-aligned attention are shown to significantly improve temporal consistency in edited videos. EVE is significantly faster than tuning-based methods and the zero-shot baseline FateZero. The performance gap between zero-shot and tuning-based video editing methods remains. Further research on enhancing temporal stability and knowledge distillation during the editing process is needed. video editing, zero-shot learning, diffusion models, depth maps, temporal consistency
2308.10608 Report FocalDreamer: Text-driven 3D Editing via Focal-fusion Assembly Yuhan Li, Yishun Dou, Yue Shi, Yu Lei, Xuanhong Chen, Yi Zhang, Peng Zhou, Bingbing Ni While text-3D editing has made significant strides in leveraging score distillation sampling, emerging approaches still fall short in delivering separable, precise and consistent outcomes that are vital to content creation. In response, we introduce FocalDreamer, a framework that merges base shape with editable parts according to text prompts for fine-grained editing within desired regions. Specifically, equipped with geometry union and dual-path rendering, FocalDreamer assembles independent 3D parts into a complete object, tailored for convenient instance reuse and part-wise control. We propose geometric focal loss and style consistency regularization, which encourage focal fusion and congruent overall appearance. Furthermore, FocalDreamer generates high-fidelity geometry and PBR textures which are compatible with widely-used graphics engines. Extensive experiments have highlighted the superior editing capabilities of FocalDreamer in both quantitative and qualitative evaluations. FocalDreamer, a novel framework for text-driven local 3D editing that enables separable, precise, and consistent modifications by assembling base shapes with editable parts. Current text-3D editing methods fall short in producing separable, precise, and consistent edits essential for content creation. FocalDreamer uses a two-stage training strategy for geometry and appearance, geometry union for merging parts, dual-path rendering for independent texture control, and introduces geometric focal loss and style consistency regularization. Outperforms baselines in qualitative and quantitative evaluations, demonstrating superior editing capabilities. Achieves higher CLIP similarity and direction similarity scores, indicating better prompt alignment and edit direction accuracy. User study confirms FocalDreamer's effectiveness in preserving base shapes while achieving prompt-relevant edits. Limited to single object editing and requires pre-defined base shapes. Future work includes extending to scene editing and exploring shape generation within focal regions. 3d editing, text-to-3d, score distillation sampling, geometry processing, deep learning
2308.10554 Report Improving Diversity in Zero-Shot GAN Adaptation with Semantic Variations Seogkyu Jeon, Bei Liu, Pilhyeon Lee, Kibeom Hong, Jianlong Fu, Hyeran Byun Training deep generative models usually requires a large amount of data. To alleviate the data collection cost, the task of zero-shot GAN adaptation aims to reuse well-trained generators to synthesize images of an unseen target domain without any further training samples. Due to the data absence, the textual description of the target domain and the vision-language models, e.g., CLIP, are utilized to effectively guide the generator. However, with only a single representative text feature instead of real images, the synthesized images gradually lose diversity as the model is optimized, which is also known as mode collapse. To tackle the problem, we propose a novel method to find semantic variations of the target text in the CLIP space. Specifically, we explore diverse semantic variations based on the informative text feature of the target domain while regularizing the uncontrolled deviation of the semantic information. With the obtained variations, we design a novel directional moment loss that matches the first and second moments of image and text direction distributions. Moreover, we introduce elastic weight consolidation and a relation consistency loss to effectively preserve valuable content information from the source domain, e.g., appearances. Through extensive experiments, we demonstrate the efficacy of the proposed methods in ensuring sample diversity in various scenarios of zero-shot GAN adaptation. We also conduct ablation studies to validate the effect of each proposed component. Notably, our model achieves a new state-of-the-art on zero-shot GAN adaptation in terms of both diversity and quality. This paper proposes a novel zero-shot GAN adaptation framework that leverages semantic variations to enhance diversity in generated images of unseen target domains, guided solely by textual descriptions. Training GANs requires massive data, limiting their application in data-scarce domains. Zero-shot adaptation, reusing pre-trained GANs for new domains without additional data, offers a solution but often suffers from mode collapse (limited diversity) due to relying on a single textual description. The proposed method employs a two-stage approach: 1) **Semantic Variation Learning:** Learnable perturbations are applied to the target text embedding in CLIP space to discover diverse yet semantically consistent variations. 2) **Directional Moment Loss:** A novel loss function encourages the generator to align the distribution of image-updating directions with the augmented text directions, promoting diverse generation. Significantly improves diversity over existing zero-shot GAN adaptation methods, as demonstrated by higher intra-cluster LPIPS scores. Achieves comparable performance to few-shot methods that use a small number of target domain images, highlighting its data efficiency. Successfully preserves content information from the source domain while adapting to the target domain, ensuring realistic and high-quality generation. Achieving complete alignment between image and text directions might be challenging for domains with large semantic gaps. The adaptation process currently relies on expert intervention for selecting source domain descriptions and optimal training iterations due to the lack of automatic quality assessment. generative adversarial networks, zero-shot learning, domain adaptation, clip, image generation
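A minimal sketch of a first- and second-moment matching loss between image-update directions and augmented text directions in CLIP space, as described above. Using per-dimension variance for the second moment is an assumption; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def directional_moment_loss(img_dirs, txt_dirs):
    """Match the first and second moments of the normalized CLIP-space direction
    distributions (image-update directions vs. augmented text directions)."""
    img_dirs = F.normalize(img_dirs, dim=-1)   # (N, D)
    txt_dirs = F.normalize(txt_dirs, dim=-1)   # (M, D)
    mean_term = (img_dirs.mean(0) - txt_dirs.mean(0)).pow(2).sum()
    var_term = (img_dirs.var(0) - txt_dirs.var(0)).pow(2).sum()
    return mean_term + var_term
```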
2308.10524 Report Dataset Quantization Daquan Zhou, Kai Wang, Jianyang Gu, Xiangyu Peng, Dongze Lian, Yifan Zhang, Yang You, Jiashi Feng State-of-the-art deep neural networks are trained with large amounts (millions or even billions) of data. The expensive computation and memory costs make it difficult to train them on limited hardware resources, especially for recent popular large language models (LLM) and computer vision models (CV). Recent popular dataset distillation methods are thus developed, aiming to reduce the number of training samples via synthesizing small-scale datasets via gradient matching. However, as the gradient calculation is coupled with the specific network architecture, the synthesized dataset is biased and performs poorly when used for training unseen architectures. To address these limitations, we present dataset quantization (DQ), a new framework to compress large-scale datasets into small subsets which can be used for training any neural network architectures. Extensive experiments demonstrate that DQ is able to generate condensed small datasets for training unseen network architectures with state-of-the-art compression ratios for lossless model training. To the best of our knowledge, DQ is the first method that can successfully distill large-scale datasets such as ImageNet-1k with a state-of-the-art compression ratio. Notably, with 60% data from ImageNet and 20% data from Alpaca's instruction tuning data, the models can be trained with negligible or no performance drop for both vision tasks (including classification, semantic segmentation, and object detection) as well as language tasks (including instruction tuning tasks such as BBH and DROP). This paper proposes Dataset Quantization (DQ), a novel framework to compress large-scale datasets into small subsets suitable for training unseen neural network architectures. Training state-of-the-art deep neural networks is computationally expensive due to the massive datasets involved. Existing dataset distillation methods have limitations in generalization and scalability, particularly for large datasets like ImageNet. DQ recursively divides the dataset into diverse bins based on submodular gains, ensuring representation and diversity. A compact subset is then created by uniformly sampling from these bins. Patch dropping and reconstruction using a pre-trained MAE model further reduce storage requirements. DQ achieves state-of-the-art compression ratios, achieving lossless compression on ImageNet-1K with 60% data and on Alpaca instruction dataset with 20% data. The compressed datasets generated by DQ demonstrate excellent cross-architecture generalization, effectively training unseen models from ResNet, ViT, and MobileNetV2 families. Models pre-trained on DQ-compressed ImageNet data perform competitively on downstream tasks like object detection (COCO) and semantic segmentation (ADE20K). The recursive sample selection process in DQ introduces extra computational overhead. Future work includes exploring more efficient non-recursive selection strategies and extending DQ to other tasks like video understanding and AIGC. dataset compression, dataset distillation, coreset selection, cross-architecture generalization, downstream task transfer
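A toy rendering of the bin-then-sample recipe described above: peel off bins greedily by a coverage-style (submodular) gain, then sample uniformly within each bin. The specific gain below is a stand-in for the paper's formulation, and the patch-dropping/MAE-reconstruction stage is omitted.

```python
import numpy as np

def quantize_dataset(features, n_bins=10, keep_ratio=0.1, rng=np.random.default_rng(0)):
    """Greedy bin partition by a coverage-style gain, then uniform sampling per bin."""
    remaining = list(range(len(features)))
    bins, per_bin = [], len(features) // n_bins
    for _ in range(n_bins):
        selected = []
        while len(selected) < per_bin and remaining:
            if selected:  # gain: distance to the closest already-selected sample
                d = np.linalg.norm(features[remaining][:, None] - features[selected][None], axis=-1)
                gains = d.min(axis=1)
            else:
                gains = np.linalg.norm(features[remaining], axis=-1)
            pick = remaining[int(np.argmax(gains))]
            selected.append(pick)
            remaining.remove(pick)
        bins.append(selected)
    keep = [i for b in bins
            for i in rng.choice(b, max(1, int(len(b) * keep_ratio)), replace=False)]
    return sorted(keep)   # indices of the compact subset
```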
2308.10490 Report Texture Generation on 3D Meshes with Point-UV Diffusion Xin Yu, Peng Dai, Wenbo Li, Lan Ma, Zhengzhe Liu, Xiaojuan Qi In this work, we focus on synthesizing high-quality textures on 3D meshes. We present Point-UV diffusion, a coarse-to-fine pipeline that marries the denoising diffusion model with UV mapping to generate 3D consistent and high-quality texture images in UV space. We start with introducing a point diffusion model to synthesize low-frequency texture components with our tailored style guidance to tackle the biased color distribution. The derived coarse texture offers global consistency and serves as a condition for the subsequent UV diffusion stage, aiding in regularizing the model to generate a 3D consistent UV texture image. Then, a UV diffusion model with hybrid conditions is developed to enhance the texture fidelity in the 2D UV space. Our method can process meshes of any genus, generating diversified, geometry-compatible, and high-fidelity textures. Code is available at https://cvmi-lab.github.io/Point-UV-Diffusion This paper introduces Point-UV Diffusion, a novel two-stage coarse-to-fine framework for generating high-quality and consistent textures on 3D meshes using diffusion models and UV mapping. Creating realistic textures on 3D surfaces is challenging due to the need for suitable representations that balance 3D consistency, high resolution, and geometric fidelity. The method uses a 3D point diffusion model in the coarse stage to colorize sampled surface points, guided by style information. These points are projected to UV space to generate a coarse texture. In the fine stage, a 2D UV diffusion model refines this texture with hybrid conditioning (coarse and smooth texture maps) to enhance detail and consistency. The approach generates high-quality textures with fine details while preserving geometric structures, outperforming existing methods in qualitative and quantitative (FID, KID) comparisons. It handles meshes with arbitrary topology and is adaptable for conditional generation based on text prompts or single-view images. Style guidance successfully addresses color bias in datasets, enhancing texture diversity. The method's performance relies heavily on the quality of UV mapping and may struggle with excessively fragmented UV maps. Training is limited by the scale and diversity of 3D datasets, potentially hindering the generation of highly complex and realistic textures. texture generation, 3d meshes, diffusion models, uv mapping, deep learning
2308.10273 Report Turning Waste into Wealth: Leveraging Low-Quality Samples for Enhancing Continuous Conditional Generative Adversarial Networks Xin Ding, Yongwei Wang, Zuheng Xu Continuous Conditional Generative Adversarial Networks (CcGANs) enable generative modeling conditional on continuous scalar variables (termed regression labels). However, they can produce subpar fake images due to limited training data. Although Negative Data Augmentation (NDA) effectively enhances unconditional and class-conditional GANs by introducing anomalies into real training images, guiding the GANs away from low-quality outputs, its impact on CcGANs is limited, as it fails to replicate negative samples that may occur during the CcGAN sampling. We present a novel NDA approach called Dual-NDA specifically tailored for CcGANs to address this problem. Dual-NDA employs two types of negative samples: visually unrealistic images generated from a pre-trained CcGAN and label-inconsistent images created by manipulating real images' labels. Leveraging these negative samples, we introduce a novel discriminator objective alongside a modified CcGAN training algorithm. Empirical analysis on UTKFace and Steering Angle reveals that Dual-NDA consistently enhances the visual fidelity and label consistency of fake images generated by CcGANs, exhibiting a substantial performance gain over the vanilla NDA. Moreover, by applying Dual-NDA, CcGANs demonstrate a remarkable advancement beyond the capabilities of state-of-the-art conditional GANs and diffusion models, establishing a new pinnacle of performance. Our codes can be found at https://github.com/UBCDingXin/Dual-NDA. This paper proposes Dual-NDA, a novel Negative Data Augmentation strategy specifically designed for Continuous Conditional Generative Adversarial Networks (CcGANs) to enhance the quality and label consistency of generated images. CcGANs, while effective for generative modeling with continuous scalar conditions (regression labels), often struggle to produce high-quality fake images due to limitations like sparse training data. Dual-NDA aims to address this by guiding CcGANs away from generating low-quality outputs. Dual-NDA employs two types of negative samples: (1) **Label-Inconsistent Real Images:** Generated by dynamically mismatching image-label pairs during discriminator training. (2) **Visually Unrealistic Fake Images:** Obtained by filtering outputs from a pre-trained CcGAN generator based on Naturalness Image Quality Evaluator (NIQE) scores. A modified CcGAN training mechanism incorporating a new vicinal discriminator loss utilizes these negative samples. Dual-NDA consistently enhances the visual fidelity (NIQE) and label consistency (Label Score) of fake images generated by CcGANs, showing substantial gains over baseline CcGANs and vanilla NDA. Dual-NDA-enhanced CcGANs outperform state-of-the-art class-conditional GANs (ReACGAN, ADCGAN) and diffusion models (ADM-G, CFG) on UTKFace and Steering Angle datasets. Ablation studies confirm the individual contributions of Type I and Type II negative samples, and the robustness of Dual-NDA to hyperparameter variations. The current implementation of Dual-NDA relies on a pre-trained CcGAN generator for creating Type II negative samples. Exploring alternative generation mechanisms could be beneficial. The paper focuses on image generation. Extending Dual-NDA to other data modalities like text or audio could be a potential research direction. generative adversarial networks, continuous conditional generative modeling, negative data augmentation, image generation, label consistency
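A hedged sketch of a discriminator objective with the two extra negative groups described above. Plain conditional BCE terms and the weighting scheme are assumptions standing in for the paper's vicinal formulation.

```python
import torch
import torch.nn.functional as F

def dual_nda_d_loss(disc, real, fake, type1_neg, type2_neg, lam1=0.5, lam2=0.5):
    """Discriminator objective with two extra negative groups: Type I are real
    images paired with mismatched regression labels, Type II are visually poor
    samples filtered (e.g. by NIQE) from a frozen pre-trained generator.
    Each argument is a (images, labels) pair."""
    def bce(pair, target):
        logits = disc(*pair)
        return F.binary_cross_entropy_with_logits(
            logits, torch.full_like(logits, target))
    return (bce(real, 1.0) + bce(fake, 0.0) +
            lam1 * bce(type1_neg, 0.0) + lam2 * bce(type2_neg, 0.0))
```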
2308.10257 Report Make-It-4D: Synthesizing a Consistent Long-Term Dynamic Scene Video from a Single Image Liao Shen, Xingyi Li, Huiqiang Sun, Juewen Peng, Ke Xian, Zhiguo Cao, Guosheng Lin We study the problem of synthesizing a long-term dynamic video from only a single image. This is challenging since it requires consistent visual content movements given large camera motions. Existing methods either hallucinate inconsistent perpetual views or struggle with long camera trajectories. To address these issues, it is essential to estimate the underlying 4D (including 3D geometry and scene motion) and fill in the occluded regions. To this end, we present Make-It-4D, a novel method that can generate a consistent long-term dynamic video from a single image. On the one hand, we utilize layered depth images (LDIs) to represent a scene, and they are then unprojected to form a feature point cloud. To animate the visual content, the feature point cloud is displaced based on the scene flow derived from motion estimation and the corresponding camera pose. Such 4D representation enables our method to maintain the global consistency of the generated dynamic video. On the other hand, we fill in the occluded regions by using a pretrained diffusion model to inpaint and outpaint the input image. This enables our method to work under large camera motions. Benefiting from our design, our method can be training-free which saves a significant amount of training time. Experimental results demonstrate the effectiveness of our approach, which showcases compelling rendering results. This paper proposes Make-It-4D, a training-free method for synthesizing consistent long-term dynamic videos from a single image, involving both large camera motions and dynamic object animations. Existing methods for generating dynamic scenes from single images struggle with either maintaining consistency over long camera trajectories or animating dynamic objects under large camera motions. Make-It-4D uses layered depth images (LDIs) for 3D scene representation and a pre-trained diffusion model for inpainting and outpainting. Scene animation is achieved using motion estimation to displace a feature point cloud, and the final video is rendered with a differentiable renderer. Make-It-4D outperforms state-of-the-art methods in terms of visual quality and consistency as demonstrated by quantitative and qualitative comparisons. The method is generalizable to diverse in-the-wild scenes and different image resolutions without requiring training. User studies confirm that Make-It-4D generates more realistic and immersive results compared to existing alternatives. The method may not effectively complement vertical scene information when the camera moves forward. Inaccurate depth estimation, particularly incorrect layering, can impact the method's performance. Future work will focus on addressing limitations in handling complex object movements and refining vertical scene completion. image animation, novel view synthesis, 3d scene representation, diffusion models, training-free methods
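A minimal sketch of lifting a (layered) depth image into a camera-space point cloud and displacing it by a per-point scene flow, which is the core of the 4D representation described above; feature attachment and the differentiable renderer are omitted, and the names are illustrative.

```python
import torch

def unproject_and_animate(depth, K, flow_3d=None):
    """Lift pixels into a 3D point cloud using intrinsics K, then displace the
    points by a per-point scene flow to animate the content."""
    h, w = depth.shape
    v, u = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                          torch.arange(w, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3)   # (HW, 3)
    points = (torch.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)        # camera-space points
    if flow_3d is not None:
        points = points + flow_3d.reshape(-1, 3)    # scene-flow displacement per point
    return points
```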
2308.10253 Report StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data Yanda Li, Chi Zhang, Gang Yu, Zhibin Wang, Bin Fu, Guosheng Lin, Chunhua Shen, Ling Chen, Yunchao Wei The remarkable multimodal capabilities demonstrated by OpenAI's GPT-4 have sparked significant interest in the development of multimodal Large Language Models (LLMs). A primary research objective of such models is to align visual and textual modalities effectively while comprehending human instructions. Current methodologies often rely on annotations derived from benchmark datasets to construct image-dialogue datasets for training purposes, akin to instruction tuning in LLMs. However, these datasets often exhibit domain bias, potentially constraining the generative capabilities of the models. In an effort to mitigate these limitations, we propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning. This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models to yield a diverse and controllable dataset with varied image content. Additionally, datasets can be arbitrarily scaled. This not only provides greater flexibility compared to existing methodologies but also significantly enhances several model capabilities. Our research includes comprehensive experiments conducted on various datasets. The results emphasize substantial enhancements in more than ten commonly assessed capabilities. Additionally, our model achieves state-of-the-art results across multiple widely recognized multimodal benchmarks. This paper introduces a novel data collection pipeline that uses generative AI models (ChatGPT and StableDiffusion) to synthesize image-dialogue pairs for training multimodal Large Language Models (LLMs). Existing methods for training multimodal LLMs often rely on benchmark datasets with limitations such as domain bias and lack of diversity, which restricts the models' capabilities. This new approach offers greater control and flexibility in data generation. The pipeline leverages ChatGPT to generate StableDiffusion image prompts and corresponding dialogues tailored to specific LLM capabilities. These prompts are then used to create images, forming image-dialogue pairs for training. This approach enables the creation of diverse datasets, including multi-turn dialogues and multi-image reasoning examples. The proposed method enhances performance across various LLM capabilities, including multi-image reasoning and understanding humor in images. The model trained with the synthesized data outperforms baseline models and achieves state-of-the-art results on multiple multimodal benchmarks. Qualitative analysis shows the model’s improved ability to follow instructions and generate more accurate and relevant responses compared to baseline models. The current pipeline faces limitations in generating text-rich images and tables due to constraints in text-to-image generation models. Future work aims to incorporate more advanced generative models to further enhance model abilities in areas like spatial comprehension and fine-grained recognition. multimodal large language models, visual instruction tuning, data augmentation, generative ai, image-dialogue generation
2308.10185 Report ViT-Lens: Initiating Omni-Modal Exploration through 3D Insights Weixian Lei, Yixiao Ge, Jianfeng Zhang, Dylan Sun, Kun Yi, Ying Shan, Mike Zheng Shou Though the success of CLIP-based training recipes in vision-language models, their scalability to more modalities (e.g., 3D, audio, etc.) is limited to large-scale data, which is expensive or even inapplicable for rare modalities. In this paper, we present ViT-Lens that facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning to a pre-defined space. Specifically, the modality-specific lens is tuned to project multimodal signals to the shared embedding space, which are then processed by a strong ViT that carries pre-trained image knowledge. The encoded multimodal representations are optimized toward aligning with the modal-independent space, pre-defined by off-the-shelf foundation models. A well-trained lens with a ViT backbone has the potential to serve as one of these foundation models, supervising the learning of subsequent modalities. ViT-Lens provides a unified solution for representation learning of increasing modalities with two appealing benefits: (i) Exploiting the pretrained ViT across tasks and domains effectively with efficient data regime; (ii) Emergent downstream capabilities of novel modalities are demonstrated due to the modality alignment space. We evaluate ViT-Lens in the context of 3D as an initial verification. In zero-shot 3D classification, ViT-Lens achieves substantial improvements over previous state-of-the-art, showing 52.0% accuracy on Objaverse-LVIS, 87.4% on ModelNet40, and 60.6% on ScanObjectNN. Furthermore, we enable zero-shot 3D question-answering by simply integrating the trained 3D lens into the InstructBLIP model without any adaptation. We will release the results of ViT-Lens on more modalities in the near future. ViT-Lens is a novel method for omni-modal representation learning that leverages a pre-trained vision transformer (ViT) to understand diverse modalities beyond images by introducing modality-specific learnable modules. Existing multi-modal learning methods require large-scale datasets for each new modality, which is impractical and resource-intensive. ViT-Lens addresses this by efficiently adapting the knowledge of a pre-trained ViT to new modalities. ViT-Lens maps input data from a new modality to the input space of a frozen pre-trained ViT using a modality embedding module and a Perceiver. The encoded representations are then aligned with features from anchor data (images, text, or image-text) from pre-trained foundation models like CLIP via contrastive learning. ViT-Lens achieves state-of-the-art zero-shot 3D classification accuracy on ModelNet40, ScanObjectNN, and Objaverse-LVIS, outperforming previous methods by significant margins. It demonstrates strong generalization and scalability by effectively leveraging the knowledge of pre-trained ViTs and scaling well with larger datasets and model sizes. By integrating the trained 3D encoder into an MLLM like InstructBLIP, ViT-Lens enables the LLM to understand and interact with 3D data in a zero-shot manner without requiring specific instruction tuning. The current implementation focuses on 3D shape understanding as an initial verification. Future work involves scaling up the training to incorporate more modalities and exploring additional emergent abilities. multimodal learning, representation learning, vision transformer, zero-shot learning, 3d shape understanding
2308.10174 Report Neural Interactive Keypoint Detection Jie Yang, Ailing Zeng, Feng Li, Shilong Liu, Ruimao Zhang, Lei Zhang This work proposes an end-to-end neural interactive keypoint detection framework named Click-Pose, which can significantly reduce more than 10 times labeling costs of 2D keypoint annotation compared with manual-only annotation. Click-Pose explores how user feedback can cooperate with a neural keypoint detector to correct the predicted keypoints in an interactive way for a faster and more effective annotation process. Specifically, we design the pose error modeling strategy that inputs the ground truth pose combined with four typical pose errors into the decoder and trains the model to reconstruct the correct poses, which enhances the self-correction ability of the model. Then, we attach an interactive human-feedback loop that allows receiving users' clicks to correct one or several predicted keypoints and iteratively utilizes the decoder to update all other keypoints with a minimum number of clicks (NoC) for efficient annotation. We validate Click-Pose in in-domain, out-of-domain scenes, and a new task of keypoint adaptation. For annotation, Click-Pose only needs 1.97 and 6.45 NoC@95 (at precision 95%) on COCO and Human-Art, reducing 31.4% and 36.3% efforts than the SOTA model (ViTPose) with manual correction, respectively. Besides, without user clicks, Click-Pose surpasses the previous end-to-end model by 1.4 AP on COCO and 3.0 AP on Human-Art. The code is available at https://github.com/IDEA-Research/Click-Pose. Click-Pose, an end-to-end neural interactive keypoint detection framework that significantly reduces 2D keypoint annotation costs. Manual keypoint annotation is time-consuming, labor-intensive, and error-prone. Existing model-assisted methods suffer from model bias and performance bottlenecks, especially in out-of-domain scenarios. Click-Pose builds upon ED-Pose and introduces: (1) Pose Error Modeling: enhances decoder robustness by training it to reconstruct accurate poses from erroneous ones. (2) Interactive Human-Feedback Loop: incorporates user clicks to correct keypoints and iteratively refines predictions. Reduces annotation time by over 10x compared to manual and 5x compared to SOTA model with manual correction. Requires 31.4% and 36.3% fewer clicks than ViTPose for 95% precision on COCO and Human-Art, respectively. Achieves state-of-the-art performance for end-to-end keypoint detection, outperforming ED-Pose by 1.4 AP on COCO and 3.0 AP on Human-Art. Current focus is on 2D body keypoints; extending to whole-body (dense) and 3D annotation is crucial. Exploring multi-task interactive annotation, where correcting one task influences others (e.g., pose, parsing, text). keypoint detection, human-in-the-loop, interactive annotation, pose error modeling, human-feedback loop
2308.10156 Report SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-form Layout-to-Image Generation Chengyou Jia, Minnan Luo, Zhuohang Dang, Guang Dai, Xiaojun Chang, Mengmeng Wang, Jingdong Wang Despite significant progress in Text-to-Image (T2I) generative models, even lengthy and complex text descriptions still struggle to convey detailed controls. In contrast, Layout-to-Image (L2I) generation, aiming to generate realistic and complex scene images from user-specified layouts, has risen to prominence. However, existing methods transform layout information into tokens or RGB images for conditional control in the generative process, leading to insufficient spatial and semantic controllability of individual instances. To address these limitations, we propose a novel Spatial-Semantic Map Guided (SSMG) diffusion model that adopts the feature map, derived from the layout, as guidance. Owing to rich spatial and semantic information encapsulated in well-designed feature maps, SSMG achieves superior generation quality with sufficient spatial and semantic controllability compared to previous works. Additionally, we propose the Relation-Sensitive Attention (RSA) and Location-Sensitive Attention (LSA) mechanisms. The former aims to model the relationships among multiple objects within scenes while the latter is designed to heighten the model's sensitivity to the spatial information embedded in the guidance. Extensive experiments demonstrate that SSMG achieves highly promising results, setting a new state-of-the-art across a range of metrics encompassing fidelity, diversity, and controllability. This paper proposes SSMG, a novel Spatial-Semantic Map Guided diffusion model for Layout-to-Image (L2I) generation, which utilizes a feature map derived from layout as guidance to achieve superior generation quality and controllability over individual instances. Existing L2I methods, whether token-guided or image-guided, struggle to effectively control both the spatial arrangements and semantic details of generated instances. This new method leverages the richness of feature maps for enhanced control. SSMG initializes a spatial-semantic map from layout and text descriptions, enhances it with Relation-Sensitive Attention (RSA) to model relationships among instances, and integrates it into a conditional diffusion model generation process via Location-Sensitive Attention (LSA). SSMG achieves state-of-the-art results on benchmark datasets, surpassing previous methods in fidelity, diversity, and controllability metrics. SSMG demonstrates superior spatial controllability, evidenced by a significant improvement in YOLO scores. The method supports free-form textual descriptions and diverse layout representations, going beyond bounding boxes and enhancing its flexibility. The paper acknowledges potential societal impacts and ethical concerns regarding the misuse of the model for generating harmful content. Future work can explore further applications of SSMG in other structured image generation tasks. layout-to-image generation, diffusion models, spatial control, semantic control, free-form generation
2308.10122 Report HollowNeRF: Pruning Hashgrid-Based NeRFs with Trainable Collision Mitigation Xiufeng Xie, Riccardo Gherardi, Zhihong Pan, Stephen Huang Neural radiance fields (NeRF) have garnered significant attention, with recent works such as Instant-NGP accelerating NeRF training and evaluation through a combination of hashgrid-based positional encoding and neural networks. However, effectively leveraging the spatial sparsity of 3D scenes remains a challenge. To cull away unnecessary regions of the feature grid, existing solutions rely on prior knowledge of object shape or periodically estimate object shape during training by repeated model evaluations, which are costly and wasteful. To address this issue, we propose HollowNeRF, a novel compression solution for hashgrid-based NeRF which automatically sparsifies the feature grid during the training phase. Instead of directly compressing dense features, HollowNeRF trains a coarse 3D saliency mask that guides efficient feature pruning, and employs an alternating direction method of multipliers (ADMM) pruner to sparsify the 3D saliency mask during training. By exploiting the sparsity in the 3D scene to redistribute hash collisions, HollowNeRF improves rendering quality while using a fraction of the parameters of comparable state-of-the-art solutions, leading to a better cost-accuracy trade-off. Our method delivers comparable rendering quality to Instant-NGP, while utilizing just 31% of the parameters. In addition, our solution can achieve a PSNR accuracy gain of up to 1dB using only 56% of the parameters. HollowNeRF, a novel NeRF compression solution using trainable hash collision mitigation to improve rendering accuracy with fewer parameters. Effectively leveraging spatial sparsity in 3D scenes for NeRF remains a challenge. HollowNeRF introduces a trainable 3D saliency grid to guide feature pruning, a zero-skipping gate to enhance MLP sparsity, and an ADMM pruner to enforce sparsity in the saliency grid. HollowNeRF achieves higher PSNR and lower LPIPS than Instant-NGP with fewer parameters. Using only 31% of the parameters, HollowNeRF delivers comparable rendering quality to Instant-NGP. A 1dB PSNR accuracy gain is achieved with only 56% of the parameters compared to Instant-NGP. Compression gains rely on scene sparsity; performance may regress for non-sparse scenes. Current implementation, like Instant-NGP, faces challenges in modeling reflective surfaces. neural radiance fields, nerf compression, hash collision mitigation, 3d saliency grid, admm pruner
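The ADMM pruner described above can be summarized in a few lines: the trainable 3D saliency grid is the primal variable, an auxiliary copy is pushed toward sparsity by soft thresholding, and a dual variable reconciles the two. The sketch below shows only that update rule under assumed hyperparameters and a stand-in rendering loss; it is not the paper's code.

```python
import torch

def soft_threshold(x, tau):
    # Proximal operator of the L1 norm: shrinks values toward zero.
    return torch.sign(x) * torch.clamp(x.abs() - tau, min=0.0)

# Trainable coarse 3D saliency grid (assumed 32^3 resolution for illustration).
saliency = torch.rand(32, 32, 32, requires_grad=True)
z = saliency.detach().clone()     # auxiliary (sparse) variable
u = torch.zeros_like(z)           # scaled dual variable
rho, l1_weight = 1.0, 0.01        # assumed ADMM hyperparameters

optimizer = torch.optim.Adam([saliency], lr=1e-2)
for step in range(100):
    optimizer.zero_grad()
    # Stand-in for the rendering loss of the full NeRF pipeline.
    render_loss = (saliency.mean() - 0.1) ** 2
    # Augmented-Lagrangian term keeps the grid close to its sparse copy.
    admm_loss = render_loss + 0.5 * rho * ((saliency - z + u) ** 2).sum()
    admm_loss.backward()
    optimizer.step()
    with torch.no_grad():
        z = soft_threshold(saliency + u, l1_weight / rho)  # z-update: sparsify
        u = u + saliency - z                               # dual update

print(f"fraction of zeroed saliency cells: {(z == 0).float().mean():.2f}")
```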
2308.10110 Report Robust Mixture-of-Expert Training for Convolutional Neural Networks Yihua Zhang, Ruisi Cai, Tianlong Chen, Guanhua Zhang, Huan Zhang, Pin-Yu Chen, Shiyu Chang, Zhangyang Wang, Sijia Liu Sparsely-gated Mixture of Expert (MoE), an emerging deep model architecture, has demonstrated a great promise to enable high-accuracy and ultra-efficient model inference. Despite the growing popularity of MoE, little work investigated its potential to advance convolutional neural networks (CNNs), especially in the plane of adversarial robustness. Since the lack of robustness has become one of the main hurdles for CNNs, in this paper we ask: How to adversarially robustify a CNN-based MoE model? Can we robustly train it like an ordinary CNN model? Our pilot study shows that the conventional adversarial training (AT) mechanism (developed for vanilla CNNs) no longer remains effective to robustify an MoE-CNN. To better understand this phenomenon, we dissect the robustness of an MoE-CNN into two dimensions: Robustness of routers (i.e., gating functions to select data-specific experts) and robustness of experts (i.e., the router-guided pathways defined by the subnetworks of the backbone CNN). Our analyses show that routers and experts are hard to adapt to each other in the vanilla AT. Thus, we propose a new router-expert alternating Adversarial training framework for MoE, termed AdvMoE. The effectiveness of our proposal is justified across 4 commonly-used CNN model architectures over 4 benchmark datasets. We find that AdvMoE achieves 1% ~ 4% adversarial robustness improvement over the original dense CNN, and enjoys the efficiency merit of sparsity-gated MoE, leading to more than 50% inference cost reduction. Codes are available at https://github.com/OPTML-Group/Robust-MoE-CNN. This paper proposes AdvMoE, a novel adversarial training framework for Mixture-of-Expert based Convolutional Neural Networks (MoE-CNNs). Conventional adversarial training methods, effective for standard CNNs, fail to robustify MoE-CNNs due to the complex interplay between routers (expert selectors) and experts. AdvMoE employs a bi-level optimization approach to alternately train routers and experts, enabling them to adapt to each other and collaboratively enhance robustness. AdvMoE significantly improves adversarial robustness over baseline methods, achieving 1% to 5% higher robust accuracy. AdvMoE outperforms adversarial training on dense CNNs while maintaining over 50% inference cost reduction, demonstrating the effectiveness of combining robustness and MoE efficiency. Analysis reveals that AdvMoE promotes better router utility, generating more diverse and robust expert pathways compared to conventional methods. AdvMoE requires twice the computational cost compared to vanilla adversarial training due to its alternating optimization scheme. Further investigation into the trade-off between the number of experts, model scale, and training efficiency is needed. adversarial robustness, mixture of experts, convolutional neural networks, bi-level optimization, efficient deep learning
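The router-expert alternation described above boils down to toggling which parameter group receives gradients while adversarial examples are regenerated in each phase. The sketch below uses a toy soft-routed MoE head and a single-step FGSM attack in place of the multi-step PGD that adversarial training normally uses; every module name and hyperparameter here is invented for illustration and is not the AdvMoE code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Minimal stand-in for an MoE-CNN: a router softly mixes two expert heads."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                                      nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.router = nn.Linear(8, 2)
        self.experts = nn.ModuleList([nn.Linear(8, num_classes) for _ in range(2)])

    def forward(self, x):
        h = self.backbone(x)
        gate = self.router(h).softmax(dim=-1)                 # soft routing for the sketch
        return sum(gate[:, i:i + 1] * e(h) for i, e in enumerate(self.experts))

def fgsm(model, x, y, eps=8 / 255):
    # One-step attack; the paper's setting would normally use multi-step PGD.
    x_adv = x.clone().detach().requires_grad_(True)
    grad = torch.autograd.grad(F.cross_entropy(model(x_adv), y), x_adv)[0]
    return (x_adv + eps * grad.sign()).clamp(0, 1).detach()

model = TinyMoE()
router_params = list(model.router.parameters())
expert_params = list(model.backbone.parameters()) + list(model.experts.parameters())
opt_r = torch.optim.SGD(router_params, lr=0.01)
opt_e = torch.optim.SGD(expert_params, lr=0.01)
x, y = torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,))

for phase in range(4):            # alternate: even phases tune routers, odd phases tune experts
    train_router = phase % 2 == 0
    for p in router_params:
        p.requires_grad_(train_router)
    for p in expert_params:
        p.requires_grad_(not train_router)
    opt = opt_r if train_router else opt_e
    x_adv = fgsm(model, x, y)     # regenerate adversarial examples every phase
    opt.zero_grad()
    F.cross_entropy(model(x_adv), y).backward()
    opt.step()
print("done alternating router/expert adversarial updates")
```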
2308.10079 Report MeDM: Mediating Image Diffusion Models for Video-to-Video Translation with Temporal Correspondence Guidance Ernie Chu, Tzuhsuan Huang, Shuo-Yen Lin, Jun-Cheng Chen This study introduces an efficient and effective method, MeDM, that utilizes pre-trained image Diffusion Models for video-to-video translation with consistent temporal flow. The proposed framework can render videos from scene position information, such as a normal G-buffer, or perform text-guided editing on videos captured in real-world scenarios. We employ explicit optical flows to construct a practical coding that enforces physical constraints on generated frames and mediates independent frame-wise scores. By leveraging this coding, maintaining temporal consistency in the generated videos can be framed as an optimization problem with a closed-form solution. To ensure compatibility with Stable Diffusion, we also suggest a workaround for modifying observation-space scores in latent Diffusion Models. Notably, MeDM does not require fine-tuning or test-time optimization of the Diffusion Models. Through extensive qualitative, quantitative, and subjective experiments on various benchmarks, the study demonstrates the effectiveness and superiority of the proposed approach. Our project page can be found at https://medm2023.github.io The paper presents MeDM, a method that uses pre-trained image Diffusion Models and optical flow for temporally consistent video-to-video translation. Generating videos or performing video-to-video translation with temporal consistency is challenging. Existing methods suffer from flickering or are computationally expensive. This method addresses these issues by efficiently leveraging pre-trained image diffusion models for high-quality video generation. The method uses optical flow to establish pixel correspondence across frames, creating a global pixel repository. This allows for harmonizing independently generated frames by minimizing temporal inconsistency. A workaround is also proposed to make it compatible with latent Diffusion Models. MeDM generates high-quality, temporally consistent videos from 3D assets, outperforming baselines on MPI Sintel and Virtual KITTI 2. It excels in text-guided video editing, successfully combining conflicting concepts into realistic videos on the DAVIS 2016 dataset. MeDM achieves effective video anonymization while preserving video content, outperforming DeepPrivacy in realism and identity concealment on CelebV-HQ videos. The method currently relies on discretized optical flow for adjacent frames, which may limit its ability to handle occlusions and large motions. The method does not explicitly address structural changes, which may lead to misalignment when objects undergo significant deformation. video generation, video-to-video translation, diffusion models, optical flow, temporal consistency
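The "global pixel repository" idea above reduces to a scatter-mean: every frame pixel is assigned, via optical flow, to an index in a shared canonical buffer; the buffer stores the mean of all observations mapped to it; each frame is then read back from the buffer. The snippet is one plausible reading of that closed-form harmonization with made-up correspondence indices rather than real optical flow, not the MeDM implementation.

```python
import torch

def harmonize(frames, index):
    """
    frames: (T, N, C) per-frame pixel values (flattened spatial dims).
    index:  (T, N) integer id of the repository cell each pixel maps to,
            e.g. derived from chained optical flow toward a reference frame.
    Returns frames rewritten from the per-cell mean, enforcing temporal consistency.
    """
    T, N, C = frames.shape
    num_cells = int(index.max()) + 1
    repo = torch.zeros(num_cells, C)
    count = torch.zeros(num_cells, 1)
    flat_idx = index.reshape(-1)                      # (T*N,)
    flat_val = frames.reshape(-1, C)                  # (T*N, C)
    repo.index_add_(0, flat_idx, flat_val)            # sum observations per cell
    count.index_add_(0, flat_idx, torch.ones(T * N, 1))
    repo = repo / count.clamp(min=1)                  # per-cell mean (closed-form minimizer)
    return repo[index]                                # read each frame back out

# Toy example: 3 frames, 4 pixels, RGB; pixels 0 and 1 of every frame share cells 0 and 1.
frames = torch.rand(3, 4, 3)
index = torch.tensor([[0, 1, 2, 3], [0, 1, 4, 5], [0, 1, 6, 7]])
print(harmonize(frames, index).shape)                 # torch.Size([3, 4, 3])
```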
2308.10040 Report ControlCom: Controllable Image Composition using Diffusion Model Bo Zhang, Yuxuan Duan, Jun Lan, Yan Hong, Huijia Zhu, Weiqiang Wang, Li Niu Image composition targets at synthesizing a realistic composite image from a pair of foreground and background images. Recently, generative composition methods are built on large pretrained diffusion models to generate composite images, considering their great potential in image generation. However, they suffer from lack of controllability on foreground attributes and poor preservation of foreground identity. To address these challenges, we propose a controllable image composition method that unifies four tasks in one diffusion model: image blending, image harmonization, view synthesis, and generative composition. Meanwhile, we design a self-supervised training framework coupled with a tailored pipeline of training data preparation. Moreover, we propose a local enhancement module to enhance the foreground details in the diffusion model, improving the foreground fidelity of composite images. The proposed method is evaluated on both public benchmark and real-world data, which demonstrates that our method can generate more faithful and controllable composite images than existing approaches. The code and model will be available at https://github.com/bcmi/ControlCom-Image-Composition. This supplementary material provides further details on the training data preparation, demonstrates the utility of controllable image composition, validates the effectiveness of different model components, showcases additional visual results, presents user study findings, and analyzes limitations with failure cases. This supplementary information aims to strengthen the main paper's findings and provide a comprehensive understanding of the controllable image composition method using a diffusion model. The authors elaborate on data augmentation techniques, training sample generation strategies, ablation studies of model components, qualitative comparisons with baseline methods, user study design for subjective evaluation, and analysis of failure cases. The proposed method with indicator (1,1) achieves superior performance in generating high-quality composite images with high fidelity compared to baseline methods. User study results demonstrate that the proposed method outperforms existing approaches in image blending, shows comparable performance in image harmonization, and exhibits advantages in generative composition. Ablation studies validate the contribution of each model component to the final performance, particularly the global fusion module, local enhancement module, and training data augmentation. The model faces challenges in synthesizing novel views for foreground objects when the input view and target view have minimal overlap. Low-quality input images, such as blurred or dim foregrounds, can lead to the generation of unnatural composite images with artifacts. image composition, diffusion model, controllable generation, data augmentation, user study
2308.10001 Report AltNeRF: Learning Robust Neural Radiance Field via Alternating Depth-Pose Optimization Kun Wang, Zhiqiang Yan, Huang Tian, Zhenyu Zhang, Xiang Li, Jun Li, Jian Yang Neural Radiance Fields (NeRF) have shown promise in generating realistic novel views from sparse scene images. However, existing NeRF approaches often encounter challenges due to the lack of explicit 3D supervision and imprecise camera poses, resulting in suboptimal outcomes. To tackle these issues, we propose AltNeRF -- a novel framework designed to create resilient NeRF representations using self-supervised monocular depth estimation (SMDE) from monocular videos, without relying on known camera poses. SMDE in AltNeRF masterfully learns depth and pose priors to regulate NeRF training. The depth prior enriches NeRF's capacity for precise scene geometry depiction, while the pose prior provides a robust starting point for subsequent pose refinement. Moreover, we introduce an alternating algorithm that harmoniously melds NeRF outputs into SMDE through a consistence-driven mechanism, thus enhancing the integrity of depth priors. This alternation empowers AltNeRF to progressively refine NeRF representations, yielding the synthesis of realistic novel views. Extensive experiments showcase the compelling capabilities of AltNeRF in generating high-fidelity and robust novel views that closely resemble reality. The paper proposes AltNeRF, a novel framework that leverages self-supervised monocular depth estimation (SMDE) to generate high-fidelity neural radiance fields from monocular videos, addressing the challenges of shape ambiguity and imprecise camera poses in NeRF creation. Existing NeRF approaches often struggle with suboptimal outcomes due to the lack of explicit 3D supervision and reliance on accurate camera poses, leading to inaccurate novel view synthesis and distorted scene geometry. AltNeRF addresses these limitations by introducing depth-pose priors learned from monocular videos. AltNeRF employs an alternating algorithm with two modules: Scene Prior Module (SPM) pretrained on a large dataset and fine-tuned on target video data to provide depth and pose priors, and Scene Representation Module (SRM) which utilizes these priors to learn 3D scene representation and refine camera poses. AltNeRF outperforms existing NeRF methods on novel view synthesis tasks across LLFF, CO3D, and Captures datasets, achieving higher PSNR, SSIM, and lower LPIPS values. It demonstrates superior geometry reconstruction ability on ScanNet, achieving significant improvements in depth estimation metrics (Abs Rel, Sq Rel, RMSE, etc.) compared to NeRF, DS-NeRF, and NerfingMVS. AltNeRF effectively estimates camera poses from monocular videos, even in challenging scenarios with complex camera motions, surpassing the performance of BARF and NoPe-NeRF. The performance of AltNeRF can be limited by the accuracy of the initial depth prior provided by SMDE, particularly in challenging scenes with textureless or view-limited regions. The alternating optimization process, while effective, can be computationally expensive, and exploring methods to improve its efficiency could be a potential area for future work. neural radiance fields, nerf, self-supervised monocular depth estimation, novel view synthesis, camera pose estimation
2308.09991 Report AltDiffusion: A Multilingual Text-to-Image Diffusion Model Fulong Ye, Guang Liu, Xinya Wu, Ledell Wu Large Text-to-Image(T2I) diffusion models have shown a remarkable capability to produce photorealistic and diverse images based on text inputs. However, existing works only support limited language input, e.g., English, Chinese, and Japanese, leaving users beyond these languages underserved and blocking the global expansion of T2I models. Therefore, this paper presents AltDiffusion, a novel multilingual T2I diffusion model that supports eighteen different languages. Specifically, we first train a multilingual text encoder based on the knowledge distillation. Then we plug it into a pretrained English-only diffusion model and train the model with a two-stage schema to enhance the multilingual capability, including concept alignment and quality improvement stage on a large-scale multilingual dataset. Furthermore, we introduce a new benchmark, which includes Multilingual-General-18(MG-18) and Multilingual-Cultural-18(MC-18) datasets, to evaluate the capabilities of T2I diffusion models for generating high-quality images and capturing culture-specific concepts in different languages. Experimental results on both MG-18 and MC-18 demonstrate that AltDiffusion outperforms current state-of-the-art T2I models, e.g., Stable Diffusion in multilingual understanding, especially with respect to culture-specific concepts, while still having comparable capability for generating high-quality images. All source code and checkpoints could be found in https://github.com/superhero-7/AltDiffuson. This paper introduces AltDiffusion, a novel multilingual Text-to-Image diffusion model that supports eighteen languages. Existing Text-to-Image models have limited language support, hindering global accessibility and introducing translation errors for non-English users. The authors train a multilingual text encoder via knowledge distillation and integrate it into a pre-trained English diffusion model. A two-stage training schema (concept alignment and quality improvement) is employed on a large-scale multilingual dataset. AltDiffusion is the first multilingual T2I model supporting eighteen languages. AltDiffusion outperforms translation-based Stable Diffusion and other multilingual diffusion models in multilingual understanding and image generation quality. AltDiffusion exhibits strong compatibility with downstream T2I tools such as ControlNet and LoRA, and supports mixed language inputs. The current version of AltDiffusion does not support all languages. Future work will focus on expanding language support and exploring alternative training approaches. text-to-image, diffusion models, multilingual, culture-specific concepts, knowledge distillation
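The first training step described above, distilling a multilingual text encoder into the text space of the original diffusion model, can be sketched as a regression between a frozen teacher and a multilingual student on parallel captions. The snippet below is an illustrative sketch, not the released training code: the choice of XLM-R as the student, the mean-pooled sentence embeddings, and the tiny parallel pairs are all assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer, CLIPTextModel, CLIPTokenizer

# Frozen English teacher: the text encoder used by the pretrained diffusion model.
teacher = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()
teacher_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

# Multilingual student plus a projection into the teacher's embedding width.
student = AutoModel.from_pretrained("xlm-roberta-base")
student_tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
proj = nn.Linear(student.config.hidden_size, teacher.config.hidden_size)

optimizer = torch.optim.AdamW(list(student.parameters()) + list(proj.parameters()), lr=1e-5)

# Tiny placeholder parallel corpus: (English, other-language) caption pairs.
pairs = [("a photo of a cat", "una foto de un gato"),
         ("a red car on the street", "une voiture rouge dans la rue")]

for en, xx in pairs:
    with torch.no_grad():
        t_in = teacher_tok(en, return_tensors="pt", padding=True)
        target = teacher(**t_in).last_hidden_state.mean(dim=1)   # teacher sentence embedding
    s_in = student_tok(xx, return_tensors="pt", padding=True)
    pred = proj(student(**s_in).last_hidden_state.mean(dim=1))   # student embedding in CLIP space
    loss = nn.functional.mse_loss(pred, target)                  # distillation objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```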
2308.09951 Report Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos Rui Qian, Shuangrui Ding, Xian Liu, Dahua Lin Self-supervised methods have shown remarkable progress in learning high-level semantics and low-level temporal correspondence. Building on these results, we take one step further and explore the possibility of integrating these two features to enhance object-centric representations. Our preliminary experiments indicate that query slot attention can extract different semantic components from the RGB feature map, while random sampling based slot attention can exploit temporal correspondence cues between frames to assist instance identification. Motivated by this, we propose a novel semantic-aware masked slot attention on top of the fused semantic features and correspondence maps. It comprises two slot attention stages with a set of shared learnable Gaussian distributions. In the first stage, we use the mean vectors as slot initialization to decompose potential semantics and generate semantic segmentation masks through iterative attention. In the second stage, for each semantics, we randomly sample slots from the corresponding Gaussian distribution and perform masked feature aggregation within the semantic area to exploit temporal correspondence patterns for instance identification. We adopt semantic- and instance-level temporal consistency as self-supervision to encourage temporally coherent object-centric representations. Our model effectively identifies multiple object instances with semantic structure, reaching promising results on unsupervised video object discovery. Furthermore, we achieve state-of-the-art performance on dense label propagation tasks, demonstrating the potential for object-centric analysis. The code is released at https://github.com/shvdiwnkozbw/SMTC. This paper proposes SMTC, a self-supervised architecture that leverages semantic and temporal correspondence cues to learn object-centric representations in videos. Humans rely on both semantic understanding and temporal correspondence for object-centric analysis. Most existing computational models only focus on one of these aspects, limiting their ability to represent objects effectively. The model uses a two-stage semantic-aware masked slot attention mechanism. First, it decomposes scenes into semantic components. Second, it identifies individual instances within each semantic component by leveraging temporal correspondence cues. SMTC achieves promising results on unsupervised object discovery in both single and multiple object scenarios. It reaches state-of-the-art performance on label propagation tasks including semi-supervised video object segmentation, pose tracking, and human part tracking. Ablation studies validate the importance of both semantic and temporal correspondence cues, as well as the effectiveness of the proposed two-stage slot attention design. The model faces challenges in generating precise boundaries for small objects due to the lack of pixel-level annotation. Incorporating multi-scale feature pyramids for better dense perception is left for future work. self-supervised learning, object-centric representation, video understanding, temporal correspondence, slot attention
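Both stages above build on slot attention, in which a small set of slot vectors competes for input features via attention normalized over slots rather than over inputs. Below is a compact, generic sketch of a single slot-attention module, not the paper's two-stage semantic-aware masked variant; the sizes are arbitrary.

```python
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    def __init__(self, dim, num_slots, iters=3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        self.slots_mu = nn.Parameter(torch.randn(1, num_slots, dim))  # slot initialization
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim) for _ in range(3))
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in, self.norm_slots = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, feats):                        # feats: (B, N, D) frame features
        B = feats.shape[0]
        feats = self.norm_in(feats)
        k, v = self.to_k(feats), self.to_v(feats)
        slots = self.slots_mu.expand(B, -1, -1)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=1)  # slots compete per location
            attn = attn / attn.sum(dim=-1, keepdim=True)                # normalize per slot
            updates = attn @ v                                          # (B, S, D)
            slots = self.gru(updates.reshape(-1, q.shape[-1]),
                             slots.reshape(-1, q.shape[-1])).reshape(B, self.num_slots, -1)
        return slots, attn                           # attn doubles as soft segmentation masks

slots, masks = SlotAttention(dim=64, num_slots=4)(torch.randn(2, 196, 64))
print(slots.shape, masks.shape)                      # (2, 4, 64) (2, 4, 196)
```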
2308.09939 Report Understanding Self-attention Mechanism via Dynamical System Perspective Zhongzhan Huang, Mingfu Liang, Jinghui Qin, Shanshan Zhong, Liang Lin The self-attention mechanism (SAM) is widely used in various fields of artificial intelligence and has successfully boosted the performance of different models. However, current explanations of this mechanism are mainly based on intuitions and experiences, while there still lacks direct modeling for how the SAM helps performance. To mitigate this issue, in this paper, based on the dynamical system perspective of the residual neural network, we first show that the intrinsic stiffness phenomenon (SP) in the high-precision solution of ordinary differential equations (ODEs) also widely exists in high-performance neural networks (NN). Thus the ability of NN to measure SP at the feature level is necessary to obtain high performance and is an important factor in the difficulty of training NN. Similar to the adaptive step-size method which is effective in solving stiff ODEs, we show that the SAM is also a stiffness-aware step size adaptor that can enhance the model's representational ability to measure intrinsic SP by refining the estimation of stiffness information and generating adaptive attention values, which provides a new understanding about why and how the SAM can benefit the model performance. This novel perspective can also explain the lottery ticket hypothesis in SAM, design new quantitative metrics of representational ability, and inspire a new theoretic-inspired approach, StepNet. Extensive experiments on several popular benchmarks demonstrate that StepNet can extract fine-grained stiffness information and measure SP accurately, leading to significant improvements in various visual tasks. This paper proposes a novel understanding of the self-attention mechanism (SAM) by connecting it to the numerical solution of stiff ordinary differential equations (ODEs). It argues that the SAM acts as a stiffness-aware step size adaptor that refines stiffness information and generates adaptive attention values to better measure the intrinsic stiffness phenomenon in neural networks. Current explanations of the SAM are largely intuitive and lack direct modeling of its performance impact. This work aims to establish a clearer relationship between the SAM and model performance by analyzing it through the lens of dynamical systems. The authors define the stiffness phenomenon (SP) in neural networks at the feature level and introduce the concept of a ground truth trajectory. They theoretically and empirically demonstrate that high-performance networks exhibit SP and that SAM effectively captures and measures this SP, leading to improved representational ability. Inspired by this, they propose StepNet, a novel self-attention network that better estimates stiffness information. The stiffness phenomenon, commonly observed in high-precision ODE solutions, is also prevalent in high-performance neural networks. The self-attention mechanism acts as a stiffness-aware step size adaptor, refining stiffness information and generating adaptive attention values to better measure SP. StepNet, inspired by this understanding, effectively captures and measures SP, leading to improved performance in image classification and object detection tasks. The paper focuses on channel attention networks, with transformer-based models briefly discussed. The optimal structure of the adaptor in StepNet requires further investigation. self-attention mechanism, dynamical systems, stiffness phenomenon, representational ability, stepnet
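The dynamical-system reading above can be made concrete with one equation. Interpreting a residual block as an explicit Euler step of an ODE, an attention value plays the role of a per-feature step size. The display below is a schematic restatement of that analogy in my own notation (h, a, C are not symbols taken from the paper), not a formula copied from it.

```latex
% Residual block as an explicit Euler step of \dot{x}(t) = f(x(t)):
x_{t+1} = x_t + h \, f(x_t) \qquad \text{(fixed step size } h\text{)}
% Self-attention supplies an input-dependent, per-feature step size instead:
x_{t+1} = x_t + a(x_t) \odot f(x_t), \qquad a(x_t) \in (0,1)^C
% Stiff feature directions (rapidly varying f) receive small entries of a(x_t),
% smooth directions receive large ones, mirroring adaptive step-size control
% in stiff ODE solvers.
```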
2308.09931 Report TDG: Text-guided Domain Generalization Geng Liu, Yuxi Wang Domain generalization (DG) attempts to generalize a model trained on single or multiple source domains to the unseen target domain. Benefiting from the success of Visual-and-Language Pre-trained models in recent years, we argue that it is crucial for domain generalization by introducing extra text information. In this paper, we develop a novel Text-guided Domain Generalization (TDG) paradigm for domain generalization, which includes three following aspects. Specifically, we first devise an automatic words generation method to extend the description of current domains with novel domain-relevant words. Then, we embed the generated domain information into the text feature space, by the proposed prompt learning-based text feature generation method, which shares a common representation space with the image feature. Finally, we utilize both input image features and generated text features to train a specially designed classifier that generalizes well on unseen target domains, while the image encoder is also updated under the supervision of gradients back propagated from the classifier. Our experimental results show that the techniques incorporated by TDG contribute to the performance in an easy implementation manner. Experimental results on several domain generalization benchmarks show that our proposed framework achieves superior performance by effectively leveraging generated text information in domain generalization. This paper proposes TDG, a Text-guided Domain Generalization paradigm that introduces extra text information into domain generalization by generating domain-relevant words and embedding them into a text feature space shared with the image features. Domain generalization must transfer a model trained on single or multiple source domains to unseen target domains; the success of vision-and-language pre-trained models suggests that extra text information can supply domain descriptions that purely visual training lacks. TDG first devises an automatic word generation method that extends the description of current domains with novel domain-relevant words, then embeds the generated domain information into the text feature space via a prompt learning-based text feature generation method, and finally trains a specially designed classifier on both image features and generated text features while the image encoder is updated by gradients back-propagated from the classifier. Experiments on several domain generalization benchmarks show that TDG achieves superior performance by effectively leveraging the generated text information. The techniques incorporated by TDG contribute to performance while remaining easy to implement. Sharing a common representation space between generated text features and image features helps the classifier generalize to unseen target domains. domain generalization, vision-language pre-training, prompt learning, text feature generation, unseen target domains
2308.09804 Report VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control Zi-Yuan Hu, Yanyang Li, Michael R. Lyu, Liwei Wang As the model size of pre-trained language models (PLMs) grows rapidly, full fine-tuning becomes prohibitively expensive for model training and storage. In vision-and-language (VL), parameter-efficient tuning (PET) techniques are proposed to integrate modular modifications (e.g., Adapter and LoRA) into encoder-decoder PLMs. By tuning a small set of trainable parameters, these techniques perform on par with full fine-tuning. However, excessive modular modifications and neglecting the functionality gap between the encoders and decoders can lead to performance degradation, while existing PET techniques (e.g., VL-Adapter) overlook these critical issues. In this paper, we propose a Vision-and-Language Parameter-Efficient Tuning (VL-PET) framework to impose effective control over modular modifications via a novel granularity-controlled mechanism. Considering different granularity-controlled matrices generated by this mechanism, a variety of model-agnostic VL-PET modules can be instantiated from our framework for better efficiency and effectiveness trade-offs. We further propose lightweight PET module designs to enhance VL alignment and modeling for the encoders and maintain text generation for the decoders. Extensive experiments conducted on four image-text tasks and four video-text tasks demonstrate the efficiency, effectiveness and transferability of our VL-PET framework. In particular, our VL-PET-large with lightweight PET module designs significantly outperforms VL-Adapter by 2.92% (3.41%) and LoRA by 3.37% (7.03%) with BART-base (T5-base) on image-text tasks. Furthermore, we validate the enhanced effect of employing our VL-PET designs on existing PET techniques, enabling them to achieve significant performance improvements. Our code is available at https://github.com/HenryHZY/VL-PET. Proposes VL-PET, a Vision-and-Language Parameter-Efficient Tuning framework for encoder-decoder generative PLMs with a novel granularity-controlled mechanism, multi-head modular modifications and lightweight PET module designs. To address the critical issues of excessive modular modifications leading to performance degradation and neglecting the functionality gap between the encoders and decoders in VL PET techniques. Introduces a granularity-controlled mechanism to regulate modular modifications, proposes a multi-head modular modification, and introduces lightweight PET module designs tailored for encoders (enhance VL alignment) and decoders (maintain text generation). VL-PET significantly outperforms state-of-the-art PET techniques on image-text tasks with BART-base and T5-base. VL-PET achieves comparable performance to full fine-tuning while using significantly fewer trainable parameters. VL-PET designs effectively enhance existing PET techniques like Compacter and VL-Adapter. Video-text experiments are conducted with only one seed, potentially affecting result reliability. Generalization of VL-PET designs to all VL tasks and other domains requires further investigation. parameter-efficient tuning, vision-and-language, generative plms, multi-task learning, granularity control
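The granularity-controlled mechanism above can be pictured as a bottleneck adapter whose update is gated by a learnable control matrix, with the matrix's shape and sharing pattern setting how coarse or fine the control is. The module below is a generic sketch of that idea with invented sizes, not the released VL-PET module.

```python
import torch
import torch.nn as nn

class GranularityControlledAdapter(nn.Module):
    """Bottleneck adapter whose modular modification is gated by a control matrix."""
    def __init__(self, dim, bottleneck=64, num_heads=4):
        super().__init__()
        self.down, self.up = nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim)
        # Learnable granularity-control matrix; sharing it across heads or channels
        # would yield coarser control, keeping it per-head/per-channel yields finer control.
        self.granularity = nn.Parameter(torch.ones(num_heads, dim // num_heads))

    def forward(self, hidden):                        # hidden: (B, N, D) PLM activations
        delta = self.up(torch.relu(self.down(hidden)))
        gate = self.granularity.reshape(1, 1, -1)     # broadcast over batch and tokens
        return hidden + gate * delta                  # controlled modular modification

adapter = GranularityControlledAdapter(dim=768)
print(adapter(torch.randn(2, 20, 768)).shape)         # torch.Size([2, 20, 768])
```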
2308.09779 Report EAVL: Explicitly Align Vision and Language for Referring Image Segmentation Yichen Yan, Xingjian He, Wenxuan Wang, Sihan Chen, Jing Liu Referring image segmentation aims to segment an object mentioned in natural language from an image. A main challenge is language-related localization, which means locating the object with the relevant language. Previous approaches mainly focus on the fusion of vision and language features without fully addressing language-related localization. In previous approaches, fused vision-language features are directly fed into a decoder and pass through a convolution with a fixed kernel to obtain the result, which follows a similar pattern as traditional image segmentation. This approach does not explicitly align language and vision features in the segmentation stage, resulting in a suboptimal language-related localization. Different from previous methods, we propose Explicitly Align the Vision and Language for Referring Image Segmentation (EAVL). Instead of using a fixed convolution kernel, we propose an Aligner which explicitly aligns the vision and language features in the segmentation stage. Specifically, a series of unfixed convolution kernels are generated based on the input l, and then are use to explicitly align the vision and language features. To achieve this, We generate multiple queries that represent different emphases of the language expression. These queries are transformed into a series of query-based convolution kernels. Then, we utilize these kernels to do convolutions in the segmentation stage and obtain a series of segmentation masks. The final result is obtained through the aggregation of all masks. Our method can not only fuse vision and language features effectively but also exploit their potential in the segmentation stage. And most importantly, we explicitly align language features of different emphases with the image features to achieve language-related localization. Our method surpasses previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by large margins. This paper presents EAVL, a novel framework for referring image segmentation that explicitly aligns vision and language features in the segmentation stage to enhance text-to-pixel fine-grained correlation. Referring image segmentation, aiming to segment an object referred by natural language from an image, faces a key challenge of text-to-pixel fine-grained correlation, which prior methods fail to address effectively. EAVL employs CLIP to extract vision and language features, generates multiple queries representing different emphases of the input sentence, transforms these queries into dynamic convolution kernels, and uses them to produce multiple segmentation masks that are then aggregated based on their importance scores. EAVL significantly outperforms previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref datasets. Explicit alignment of vision and language features in the segmentation stage through query-based convolution kernels proves highly effective. Utilizing global and fine-grained information from CLIP enhances the model's understanding of both visual and textual inputs. The model shows limitations in handling detailed areas, indicating a potential avenue for future research. The impact of varying query numbers on performance needs further exploration to optimize efficiency. referring image segmentation, text-to-pixel fine-grained correlation, vision-language alignment, dynamic convolution kernels, clip
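The query-based convolution kernels described above amount to predicting one dynamic 1x1 kernel per language query and applying all of them to the vision feature map with a grouped convolution, then aggregating the resulting masks by predicted importance. The sketch below is a generic illustration with invented sizes, not the authors' Aligner.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryKernelHead(nn.Module):
    """Turn Q language queries into Q dynamic 1x1 kernels and aggregate their masks."""
    def __init__(self, dim):
        super().__init__()
        self.to_kernel = nn.Linear(dim, dim)   # query -> convolution kernel weights
        self.to_score = nn.Linear(dim, 1)      # query -> importance score

    def forward(self, vis, queries):
        # vis: (B, D, H, W) fused visual features; queries: (B, Q, D) language queries.
        B, D, H, W = vis.shape
        Q = queries.shape[1]
        kernels = self.to_kernel(queries).reshape(B * Q, D, 1, 1)
        # Grouped conv applies each sample's Q kernels to its own feature map only.
        masks = F.conv2d(vis.reshape(1, B * D, H, W), kernels, groups=B)
        masks = masks.reshape(B, Q, H, W)
        scores = self.to_score(queries).softmax(dim=1)            # (B, Q, 1)
        return (masks * scores.unsqueeze(-1)).sum(dim=1)          # aggregate into one mask

head = QueryKernelHead(dim=256)
out = head(torch.randn(2, 256, 30, 30), torch.randn(2, 4, 256))
print(out.shape)   # torch.Size([2, 30, 30])
```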
2308.09718 Report Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training Xiaoyang Wu, Zhuotao Tian, Xin Wen, Bohao Peng, Xihui Liu, Kaicheng Yu, Hengshuang Zhao The rapid advancement of deep learning models often attributes to their ability to leverage massive training data. In contrast, such privilege has not yet fully benefited 3D deep learning, mainly due to the limited availability of large-scale 3D datasets. Merging multiple available data sources and letting them collaboratively train a single model is a potential solution. However, due to the large domain gap between 3D point cloud datasets, such mixed supervision could adversely affect the model's performance and lead to degenerated performance (i.e., negative transfer) compared to single-dataset training. In view of this challenge, we introduce Point Prompt Training (PPT), a novel framework for multi-dataset synergistic learning in the context of 3D representation learning that supports multiple pre-training paradigms. Based on this framework, we propose Prompt-driven Normalization, which adapts the model to different datasets with domain-specific prompts and Language-guided Categorical Alignment that decently unifies the multiple-dataset label spaces by leveraging the relationship between label text. Extensive experiments verify that PPT can overcome the negative transfer associated with synergistic learning and produce generalizable representations. Notably, it achieves state-of-the-art performance on each dataset using a single weight-shared model with supervised multi-dataset training. Moreover, when served as a pre-training framework, it outperforms other pre-training approaches regarding representation quality and attains remarkable state-of-the-art performance across over ten diverse downstream tasks spanning both indoor and outdoor 3D scenarios. This paper proposes Point Prompt Training (PPT), a novel framework for multi-dataset synergistic learning in 3D representation learning to overcome the negative transfer issue in existing methods. Scaling up 3D representation learning with limited data from different domains is crucial for the advancement of 3D deep learning. However, existing methods suffer from negative transfer when naively merging different datasets. PPT leverages domain-specific prompts to adapt the model to different datasets and employs a language-guided categorical alignment to unify the label space across datasets. It supports both supervised and unsupervised pre-training. PPT successfully mitigates negative transfer and achieves state-of-the-art performance on various indoor and outdoor 3D semantic segmentation benchmarks. It also shows superior performance in instance segmentation and data-efficient learning settings. The proposed framework is effective with both small and large-scale backbones and consistently improves performance. The exploration of more advanced prompting and pre-training techniques is needed. Designing more efficient large-scale 3D backbones is crucial for fully leveraging the benefit of PPT. 3d deep learning, representation learning, multi-dataset training, prompt learning, semantic segmentation
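Prompt-driven Normalization, as described above, can be pictured as a normalization layer whose affine parameters are produced from a learnable per-dataset prompt. The sketch below shows one plausible minimal form (domain-conditioned scale and shift around a shared, non-affine BatchNorm); the actual module design and the prompt dimension here are assumptions, not the released PPT code.

```python
import torch
import torch.nn as nn

class PromptDrivenNorm(nn.Module):
    """Normalization whose scale/shift are generated from a per-dataset prompt."""
    def __init__(self, channels, num_datasets, prompt_dim=32):
        super().__init__()
        self.norm = nn.BatchNorm1d(channels, affine=False)      # shared statistics
        self.prompts = nn.Embedding(num_datasets, prompt_dim)   # one learnable prompt per dataset
        self.to_scale = nn.Linear(prompt_dim, channels)
        self.to_shift = nn.Linear(prompt_dim, channels)

    def forward(self, feats, dataset_id):
        # feats: (N, C) point features; dataset_id: index of the source dataset.
        prompt = self.prompts(torch.tensor([dataset_id]))
        scale = 1.0 + self.to_scale(prompt)        # start near the identity mapping
        shift = self.to_shift(prompt)
        return self.norm(feats) * scale + shift

layer = PromptDrivenNorm(channels=64, num_datasets=3)
x = torch.randn(128, 64)                           # 128 points from dataset 1
print(layer(x, dataset_id=1).shape)                # torch.Size([128, 64])
```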
2308.09710 Report SimDA: Simple Diffusion Adapter for Efficient Video Generation Zhen Xing, Qi Dai, Han Hu, Zuxuan Wu, Yu-Gang Jiang The recent wave of AI-generated content has witnessed the great development and success of Text-to-Image (T2I) technologies. By contrast, Text-to-Video (T2V) still falls short of expectations though attracting increasing interests. Existing works either train from scratch or adapt large T2I model to videos, both of which are computation and resource expensive. In this work, we propose a Simple Diffusion Adapter (SimDA) that fine-tunes only 24M out of 1.1B parameters of a strong T2I model, adapting it to video generation in a parameter-efficient way. In particular, we turn the T2I model for T2V by designing light-weight spatial and temporal adapters for transfer learning. Besides, we change the original spatial attention to the proposed Latent-Shift Attention (LSA) for temporal consistency. With similar model architecture, we further train a video super-resolution model to generate high-definition (1024x1024) videos. In addition to T2V generation in the wild, SimDA could also be utilized in one-shot video editing with only 2 minutes tuning. Doing so, our method could minimize the training effort with extremely few tunable parameters for model adaptation. Proposes SimDA, a parameter-efficient video diffusion model based on Stable Diffusion, for text-guided video generation and editing, utilizing lightweight adapters and latent-shift attention for efficient spatial and temporal modeling. Addresses the limitations of existing Text-to-Video (T2V) methods that require significant computational resources and training time due to large model sizes, by enabling efficient adaptation of pre-trained Text-to-Image (T2I) models. Introduces spatial and temporal adapters to the Stable Diffusion model for transferring knowledge from image to video domain. Employs latent-shift attention for effective and efficient temporal modeling. Trains a separate super-resolution model for generating high-definition videos. Achieves state-of-the-art results on text-to-video generation benchmarks, outperforming or being on par with computationally expensive methods. Demonstrates significant speedup in training and inference time compared to other T2V techniques. Shows promising results in one-shot text-guided video editing, achieving superior performance with fewer training steps. Current model is limited to generating videos at a resolution of 1024x1024, future work could explore higher resolutions. SimDA currently relies on a two-stage training approach for super-resolution, exploring end-to-end solutions could further improve efficiency. text-to-video generation, video diffusion models, parameter-efficient fine-tuning, text-guided video editing, latent-shift attention
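Latent-Shift Attention, as summarized above, borrows the temporal-shift idea: a fraction of channels is shifted one frame backward and another fraction one frame forward before ordinary spatial self-attention, so each frame attends to features leaked from its neighbors. The snippet sketches only the shift operation on a latent tensor with assumed shapes and shift ratio; it is not the SimDA implementation.

```python
import torch

def latent_shift(x, shift_ratio=0.25):
    """
    x: (B, T, C, H, W) video latents.
    Shift the first chunk of channels one frame back and the second chunk one
    frame forward (zero-padded), then hand the result to spatial self-attention.
    """
    B, T, C, H, W = x.shape
    fold = int(C * shift_ratio) // 2
    out = x.clone()
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # bring features from the next frame
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # and from the previous frame
    out[:, -1, :fold] = 0                                  # zero-pad the temporal borders
    out[:, 0, fold:2 * fold] = 0
    return out

latents = torch.randn(1, 8, 320, 32, 32)   # 8-frame latent video (assumed feature width)
print(latent_shift(latents).shape)         # torch.Size([1, 8, 320, 32, 32])
```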
2308.09610 Report On the Effectiveness of LayerNorm Tuning for Continual Learning in Vision Transformers Thomas De Min, Massimiliano Mancini, Karteek Alahari, Xavier Alameda-Pineda, Elisa Ricci State-of-the-art rehearsal-free continual learning methods exploit the peculiarities of Vision Transformers to learn task-specific prompts, drastically reducing catastrophic forgetting. However, there is a tradeoff between the number of learned parameters and the performance, making such models computationally expensive. In this work, we aim to reduce this cost while maintaining competitive performance. We achieve this by revisiting and extending a simple transfer learning idea: learning task-specific normalization layers. Specifically, we tune the scale and bias parameters of LayerNorm for each continual learning task, selecting them at inference time based on the similarity between task-specific keys and the output of the pre-trained model. To make the classifier robust to incorrect selection of parameters during inference, we introduce a two-stage training procedure, where we first optimize the task-specific parameters and then train the classifier with the same selection procedure of the inference time. Experiments on ImageNet-R and CIFAR-100 show that our method achieves results that are either superior or on par with {the state of the art} while being computationally cheaper. This paper proposes Continual LayerNorm (C-LayerNorm), a novel rehearsal-free continual learning method that tunes task-specific LayerNorm parameters in Vision Transformers to mitigate catastrophic forgetting. Existing rehearsal-free methods rely on task-specific prompts and suffer from a trade-off between performance and the number of learned parameters, making them computationally expensive. This paper addresses this limitation by exploring a more efficient approach. The method involves learning distinct scale and bias parameters for LayerNorm layers for each task. During inference, task-specific keys are used to select the most relevant LayerNorm parameters based on the input. The paper introduces both two-stage (task identification followed by prediction) and single-stage (integrated task identification and prediction) variants of C-LayerNorm. C-LayerNorm achieves state-of-the-art accuracy on both CIFAR-100 and ImageNet-R benchmarks, outperforming existing rehearsal-free methods. The method significantly reduces the number of trainable parameters compared to prompt-based methods while maintaining competitive performance. The single-stage variant of C-LayerNorm offers faster inference time compared to two-stage methods (including existing prompt-based approaches) without significant performance degradation. Despite achieving higher accuracy, C-LayerNorm exhibits a slightly higher forgetting rate compared to prompt-based methods, suggesting room for improvement in parameter isolation. The single-stage variant, while faster, demonstrates slightly lower accuracy in certain scenarios compared to the two-stage variant, indicating a potential need for strategies to ensure consistent task identification across layers. continual learning, vision transformers, layer normalization, parameter-efficient fine-tuning, catastrophic forgetting
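The selection step above is easy to state in code: keep one (scale, bias) pair and one key vector per task, pick the task whose key is most similar to the frozen backbone's feature for the input, and run LayerNorm with that task's affine parameters. Below is a minimal sketch with invented dimensions and a plain cosine-similarity selector; it is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContinualLayerNorm(nn.Module):
    """One LayerNorm whose scale/bias are selected per task at inference time."""
    def __init__(self, dim, num_tasks):
        super().__init__()
        self.scales = nn.Parameter(torch.ones(num_tasks, dim))
        self.biases = nn.Parameter(torch.zeros(num_tasks, dim))
        self.keys = nn.Parameter(torch.randn(num_tasks, dim))   # task-identification keys

    def select_task(self, query):
        # query: (B, D) feature of the input from the frozen pre-trained backbone.
        sims = F.cosine_similarity(query.unsqueeze(1), self.keys.unsqueeze(0), dim=-1)
        return sims.argmax(dim=1)                               # (B,) predicted task ids

    def forward(self, x, query):
        # x: (B, N, D) token features to normalize.
        task = self.select_task(query)
        normed = F.layer_norm(x, x.shape[-1:])                  # normalization without affine
        return normed * self.scales[task].unsqueeze(1) + self.biases[task].unsqueeze(1)

ln = ContinualLayerNorm(dim=768, num_tasks=10)
x, cls_feat = torch.randn(2, 197, 768), torch.randn(2, 768)
print(ln(x, cls_feat).shape)                                    # torch.Size([2, 197, 768])
```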
2308.09599 Report Language-Guided Diffusion Model for Visual Grounding Sijia Chen, Baochun Li Visual grounding (VG) tasks involve explicit cross-modal alignment, as semantically corresponding image regions are to be located for the language phrases provided. Existing approaches complete such visual-text reasoning in a single-step manner. Their performance causes high demands on large-scale anchors and over-designed multi-modal fusion modules based on human priors, leading to complicated frameworks that may be difficult to train and overfit to specific scenarios. Even worse, such once-for-all reasoning mechanisms are incapable of refining boxes continuously to enhance query-region matching. In contrast, in this paper, we formulate an iterative reasoning process by denoising diffusion modeling. Specifically, we propose a language-guided diffusion framework for visual grounding, LG-DVG, which trains the model to progressively reason queried object boxes by denoising a set of noisy boxes with the language guide. To achieve this, LG-DVG gradually perturbs query-aligned ground truth boxes to noisy ones and reverses this process step by step, conditional on query semantics. Extensive experiments for our proposed framework on five widely used datasets validate the superior performance of solving visual grounding, a cross-modal alignment task, in a generative way. The source codes are available at \url{https://github.com/iQua/vgbase/tree/DiffusionVG}. Proposes LG-DVG, a language-guided diffusion model for visual grounding, which iteratively refines bounding boxes based on text queries using a Markov Chain. Addresses limitations of existing visual grounding methods that rely on single-step reasoning, complex architectures, and pre-defined anchors, leading to difficulties in training and overfitting. Formulates visual grounding as a generative task where noisy boxes are progressively denoised to target boxes guided by text queries. Employs a novel cross-modal transformer and a query-conditioned predictor within a diffusion model framework. Achieves competitive accuracy on phrase localization and referring expression comprehension tasks, outperforming most state-of-the-art methods. Demonstrates progressive refinement capability, with accuracy increasing as the number of sampling steps increases. Effectively handles one-to-many scenarios where a single query may correspond to multiple ground-truth boxes. Limited ability to fully exploit semantic relationships within text queries for enhanced reasoning. Future work could explore incorporating semantic parsing or graph-based representations of text queries. visual grounding, diffusion models, iterative reasoning, cross-modal alignment, generative models
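Training a model to denoise a set of noisy boxes starts from the standard diffusion forward process applied to box coordinates instead of pixels. The snippet below sketches only that box-noising step with an assumed linear beta schedule and normalized (cx, cy, w, h) boxes; it is not the LG-DVG training code.

```python
import torch

def noise_boxes(gt_boxes, t, T=1000):
    """
    Forward diffusion on normalized (cx, cy, w, h) boxes, used to train a model
    that denoises boxes step by step. The beta schedule is an assumed choice.
    gt_boxes: (N, 4) in [0, 1]; t: integer timestep in [0, T).
    """
    betas = torch.linspace(1e-4, 0.02, T)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t]
    x0 = gt_boxes * 2 - 1                        # rescale to the [-1, 1] signal range
    noise = torch.randn_like(x0)
    xt = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise
    return xt, noise                             # noisy boxes and the noise target

boxes = torch.tensor([[0.5, 0.5, 0.2, 0.3], [0.3, 0.6, 0.1, 0.1]])
xt, eps = noise_boxes(boxes, t=500)
print(xt.shape)                                  # torch.Size([2, 4])
```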
2308.09592 Report StableVideo: Text-driven Consistency-aware Diffusion Video Editing Wenhao Chai, Xun Guo, Gaoang Wang, Yan Lu Diffusion-based methods can generate realistic images and videos, but they struggle to edit existing objects in a video while preserving their appearance over time. This prevents diffusion models from being applied to natural video editing in practical scenarios. In this paper, we tackle this problem by introducing temporal dependency to existing text-driven diffusion models, which allows them to generate consistent appearance for the edited objects. Specifically, we develop a novel inter-frame propagation mechanism for diffusion video editing, which leverages the concept of layered representations to propagate the appearance information from one frame to the next. We then build up a text-driven video editing framework based on this mechanism, namely StableVideo, which can achieve consistency-aware video editing. Extensive experiments demonstrate the strong editing capability of our approach. Compared with state-of-the-art video editing methods, our approach shows superior qualitative and quantitative results. Our code is available at \href{https://github.com/rese1f/StableVideo}{this https URL}. This paper proposes StableVideo, a novel text-driven video editing framework that leverages diffusion models and Neural Layered Atlas (NLA) to enable consistent appearance editing of objects in videos. Existing diffusion-based video editing methods struggle to maintain temporal consistency in object appearance, limiting their practical application for natural video editing. StableVideo uses NLA to decompose videos into foreground and background layers. For foreground editing, it employs key frame editing with an inter-frame propagation mechanism to ensure geometric and temporal consistency. An aggregation network then generates the final edited foreground atlas from the key frames. The edited foreground and background are finally combined to reconstruct the edited video. StableVideo achieves high-quality video editing with consistent object appearance across time, outperforming state-of-the-art methods like Tune-A-Video and Text2LIVE. The proposed inter-frame propagation mechanism effectively maintains geometric and appearance consistency during key frame editing. The aggregation network successfully generates coherent atlas representations from the edited key frames, ensuring smooth transitions between edited frames. StableVideo's performance depends on the accuracy of NLA, which can be challenged by non-rigid objects or complex motions. The editing quality is limited by the capabilities of the underlying diffusion model, which may not always generate ideal results, especially for complex scenarios like humans or animals. video editing, diffusion models, temporal consistency, neural layered atlas, text-driven generation
2308.09544 Report Adapt Your Teacher: Improving Knowledge Distillation for Exemplar-free Continual Learning Filip Szatkowski, Mateusz Pyla, Marcin Przewięźlikowski, Sebastian Cygert, Bartłomiej Twardowski, Tomasz Trzciński In this work, we investigate exemplar-free class incremental learning (CIL) with knowledge distillation (KD) as a regularization strategy, aiming to prevent forgetting. KD-based methods are successfully used in CIL, but they often struggle to regularize the model without access to exemplars of the training data from previous tasks. Our analysis reveals that this issue originates from substantial representation shifts in the teacher network when dealing with out-of-distribution data. This causes large errors in the KD loss component, leading to performance degradation in CIL models. Inspired by recent test-time adaptation methods, we introduce Teacher Adaptation (TA), a method that concurrently updates the teacher and the main models during incremental training. Our method seamlessly integrates with KD-based CIL approaches and allows for consistent enhancement of their performance across multiple exemplar-free CIL benchmarks. The source code for our method is available at https://github.com/fszatkowski/cl-teacher-adaptation. This paper proposes Teacher Adaptation (TA), a simple yet effective method to improve knowledge distillation-based methods in exemplar-free class-incremental learning. Exemplar-free class-incremental learning (CIL) with knowledge distillation (KD) often struggles to regularize the model effectively due to substantial representation shifts in the teacher network when dealing with out-of-distribution data. This can lead to performance degradation in CIL models. TA continuously updates the teacher network by adjusting batch normalization statistics during the learning of a new task for both the current model and the teacher model saved from the previous task. This mitigates changes in the model caused by KD loss due to differing normalization statistics. Further improvement is achieved with a warmup phase that trains a new classification head before finetuning the whole model, ensuring more stable initialization. TA consistently improves results for various KD methods across standard CIL benchmarks (CIFAR100, TinyImageNet200, ImageNet100). TA shows more significant improvements in settings with a larger number of tasks and an equal split of classes, where the initial feature extractor is weaker. TA demonstrates enhanced performance under severe distribution shifts, tested on DomainNet and corrupted CIFAR100 scenarios. The performance of TA may be limited with certain KD loss functions, like MKD, which uses a sigmoid function that may result in insignificant probability differences. TA's effectiveness may be reduced when a sufficient number of exemplars are available, as they help mitigate normalization statistics divergence. continual learning, class-incremental learning, knowledge distillation, teacher adaptation, exemplar-free
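The Teacher Adaptation idea above (refreshing the frozen teacher's batch-normalization statistics on new-task data while distilling into the student) can be sketched in a few lines of PyTorch. This is a generic illustration under an LwF-style distillation loss, not the released cl-teacher-adaptation code; the function names and the tau/alpha hyperparameters are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def adapt_teacher_bn(teacher: nn.Module) -> nn.Module:
    """Freeze all teacher parameters but keep its BatchNorm layers in train mode, so their
    running statistics track the current task's data distribution during incremental training."""
    teacher.eval()
    for p in teacher.parameters():
        p.requires_grad_(False)
    for m in teacher.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.train()  # running_mean / running_var keep updating on new-task batches
    return teacher

def kd_step(student: nn.Module, teacher: nn.Module, x, y, tau: float = 2.0, alpha: float = 1.0):
    """One new-task training step: cross-entropy plus distillation against the adapted teacher."""
    with torch.no_grad():
        t_logits = teacher(x)                 # this forward pass also refreshes the BN statistics
    s_logits = student(x)
    n_old = t_logits.size(1)                  # the teacher only covers previously seen classes
    ce = F.cross_entropy(s_logits, y)
    kd = F.kl_div(F.log_softmax(s_logits[:, :n_old] / tau, dim=1),
                  F.softmax(t_logits / tau, dim=1),
                  reduction="batchmean") * tau ** 2
    return ce + alpha * kd
```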
2308.09540 Report Meta-ZSDETR: Zero-shot DETR with Meta-learning Lu Zhang, Chenbo Zhang, Jiajia Zhao, Jihong Guan, Shuigeng Zhou Zero-shot object detection aims to localize and recognize objects of unseen classes. Most existing works face two problems: the low recall of the RPN on unseen classes and the confusion of unseen classes with background. In this paper, we present the first method that combines DETR and meta-learning to perform zero-shot object detection, named Meta-ZSDETR, where model training is formalized as an individual episode-based meta-learning task. Different from Faster R-CNN based methods that first generate class-agnostic proposals and then classify them with a visual-semantic alignment module, Meta-ZSDETR directly predicts class-specific boxes with class-specific queries and further filters them with the predicted accuracy from the classification head. The model is optimized with meta-contrastive learning, which contains a regression head to generate the coordinates of class-specific boxes, a classification head to predict the accuracy of generated boxes, and a contrastive head that utilizes the proposed contrastive-reconstruction loss to further separate different classes in visual space. We conduct extensive experiments on two benchmark datasets, MS COCO and PASCAL VOC. Experimental results show that our method outperforms the existing ZSD methods by a large margin. Proposes Meta-ZSDETR, the first method combining DETR and meta-learning for zero-shot object detection, addressing limitations of previous Faster R-CNN based approaches. Existing zero-shot object detection methods suffer from low recall for unseen classes and confusion with background; Meta-ZSDETR overcomes these issues by utilizing DETR and meta-learning. Meta-ZSDETR formalizes training as an episodic meta-learning task, fusing object queries with semantic vectors to predict class-specific boxes, and utilizes meta-contrastive learning with regression, classification, and contrastive heads for optimization. Achieves state-of-the-art performance on PASCAL VOC and MS COCO, surpassing previous methods by a significant margin. Demonstrates strong generalization ability to unseen classes, improving mAP and recall significantly. Shows effectiveness of meta-contrastive learning and class-specific query fusion in improving detection accuracy. Computational cost is higher due to the large number of queries used in DETR. Future work includes exploring more efficient architectures and investigating other meta-learning strategies. zero-shot object detection, meta-learning, detr, contrastive learning, visual-semantic alignment
2308.09421 Report MonoNeRD: NeRF-like Representations for Monocular 3D Object Detection Junkai Xu, Liang Peng, Haoran Cheng, Hao Li, Wei Qian, Ke Li, Wenxiao Wang, Deng Cai In the field of monocular 3D detection, it is common practice to utilize scene geometric clues to enhance the detector's performance. However, many existing works adopt these clues explicitly such as estimating a depth map and back-projecting it into 3D space. This explicit methodology induces sparsity in 3D representations due to the increased dimensionality from 2D to 3D, and leads to substantial information loss, especially for distant and occluded objects. To alleviate this issue, we propose MonoNeRD, a novel detection framework that can infer dense 3D geometry and occupancy. Specifically, we model scenes with Signed Distance Functions (SDF), facilitating the production of dense 3D representations. We treat these representations as Neural Radiance Fields (NeRF) and then employ volume rendering to recover RGB images and depth maps. To the best of our knowledge, this work is the first to introduce volume rendering for M3D, and demonstrates the potential of implicit reconstruction for image-based 3D perception. Extensive experiments conducted on the KITTI-3D benchmark and Waymo Open Dataset demonstrate the effectiveness of MonoNeRD. Codes are available at https://github.com/cskkxjk/MonoNeRD. MonoNeRD, a novel monocular 3D object detection framework that leverages NeRF-like continuous 3D representations for accurate 3D perception from a single image. Existing methods using explicit depth information for 3D representations in monocular 3D detection suffer from sparsity and information loss, especially for distant objects. The method constructs position-aware frustum features from 2D image features and 3D coordinates, then uses them to generate signed distance fields and radiance fields. Volume rendering is employed to recover RGB images and depth maps, supervised by original images and LiDAR data. Finally, regular 3D voxel features are generated for object detection. MonoNeRD achieves state-of-the-art results on the KITTI 3D detection benchmark, especially for moderate and hard difficulty levels. The method exhibits superior performance in handling distant and occluded objects on both KITTI and Waymo datasets. Visualization results demonstrate that MonoNeRD produces denser and more continuous 3D representations compared to depth-map-based methods. The performance heavily relies on the modeling approach. Current implementation with bounds modeling might fail to predict 3D occupancy for areas outside the specified bounds, such as the sky. monocular 3d object detection, neural radiance fields, signed distance function, volume rendering, 3d representation learning
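MonoNeRD's supervision path ("employ volume rendering to recover RGB images and depth maps") reduces to the standard NeRF rendering integral once per-sample densities are available. Below is a minimal sketch of that step, assuming densities have already been derived from the SDF branch; shapes and names are illustrative.

```python
import torch

def volume_render(sigmas: torch.Tensor, rgbs: torch.Tensor, t_vals: torch.Tensor):
    """Standard volume rendering along rays.
    sigmas: (R, S) densities at S samples per ray; rgbs: (R, S, 3) per-sample colours;
    t_vals: (R, S) sample depths along each ray."""
    deltas = torch.diff(t_vals, dim=-1)
    deltas = torch.cat([deltas, torch.full_like(deltas[..., :1], 1e10)], dim=-1)
    alpha = 1.0 - torch.exp(-sigmas * deltas)                        # opacity of each interval
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[..., :1]),
                                     1.0 - alpha + 1e-10], dim=-1), dim=-1)[..., :-1]
    weights = alpha * trans                                          # contribution of each sample
    depth = (weights * t_vals).sum(dim=-1)                           # rendered depth value per ray
    rgb = (weights.unsqueeze(-1) * rgbs).sum(dim=-2)                 # rendered pixel colour per ray
    return rgb, depth

rgb, depth = volume_render(torch.rand(8, 64), torch.rand(8, 64, 3),
                           torch.linspace(2.0, 40.0, 64).expand(8, 64))
```

The rendered RGB and depth are then compared against the input image and LiDAR supervision, as described above.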
2308.09386 Report DReg-NeRF: Deep Registration for Neural Radiance Fields Yu Chen, Gim Hee Lee Although Neural Radiance Fields (NeRF) is popular in the computer vision community recently, registering multiple NeRFs has yet to gain much attention. Unlike the existing work, NeRF2NeRF, which is based on traditional optimization methods and needs human annotated keypoints, we propose DReg-NeRF to solve the NeRF registration problem on object-centric scenes without human intervention. After training NeRF models, our DReg-NeRF first extracts features from the occupancy grid in NeRF. Subsequently, our DReg-NeRF utilizes a transformer architecture with self-attention and cross-attention layers to learn the relations between pairwise NeRF blocks. In contrast to state-of-the-art (SOTA) point cloud registration methods, the decoupled correspondences are supervised by surface fields without any ground truth overlapping labels. We construct a novel view synthesis dataset with 1,700+ 3D objects obtained from Objaverse to train our network. When evaluated on the test set, our proposed method beats the SOTA point cloud registration methods by a large margin, with a mean $\text{RPE}=9.67^{\circ}$ and a mean $\text{RTE}=0.038$. Our code is available at https://github.com/AIBluefisher/DReg-NeRF. This paper introduces DReg-NeRF, a novel deep learning method for registering multiple Neural Radiance Fields (NeRFs) in object-centric scenes without human intervention or initializations. Registering multiple NeRFs, trained on data captured in different coordinate frames (e.g., from cameras without absolute pose information), is crucial for consistent novel view synthesis. Existing methods either rely on human annotations or struggle with the implicit nature of NeRF representations. DReg-NeRF extracts features from occupancy grids of NeRF models using a 3D Feature Pyramid Network. These features are then processed by a transformer with self-attention and cross-attention layers to learn inter and intra-feature relations. A decoder then predicts correspondences between point clouds and their confidence scores, supervised by surface fields from the NeRF models. Finally, a weighted Kabsch-Umeyama algorithm estimates the relative transformation. DReg-NeRF outperforms state-of-the-art point cloud registration methods (FGR, REGTR) on a novel dataset created from Objaverse, demonstrating its effectiveness for object-centric NeRF registration. Surface field supervision is shown to be critical for accurate registration compared to using noisy density fields. The method achieves fast inference times (0.4 seconds) making it practical for real-time applications. The current method is limited to object-centric scenes and struggles with unbounded scenes due to noisy geometry estimations in NeRF. The assumption of consistent scale between the registered NeRFs might not hold in real-world scenarios, necessitating further research. nerf, registration, deep learning, transformer, surface fields
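The final step described for DReg-NeRF, estimating the relative transformation with a weighted Kabsch-Umeyama algorithm from decoupled correspondences and confidence scores, can be sketched as follows. This is the textbook SVD solution (rotation and translation only, no scale), with illustrative names; in the paper the correspondences and weights come from the transformer decoder.

```python
import numpy as np

def weighted_kabsch(src: np.ndarray, dst: np.ndarray, w: np.ndarray):
    """Estimate the rigid transform (R, t) mapping src to dst under per-correspondence
    confidence weights, via SVD of the weighted cross-covariance. src, dst: (N, 3); w: (N,)."""
    w = w / w.sum()
    mu_s = (w[:, None] * src).sum(0)
    mu_d = (w[:, None] * dst).sum(0)
    cov = (dst - mu_d).T @ (w[:, None] * (src - mu_s))       # 3x3 weighted cross-covariance
    U, _, Vt = np.linalg.svd(cov)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # guard against reflections
    R = U @ S @ Vt
    t = mu_d - R @ mu_s
    return R, t

# Sanity check on a synthetic rotation + translation.
rng = np.random.default_rng(0)
src = rng.normal(size=(100, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
dst = src @ R_true.T + np.array([0.1, -0.2, 0.05])
R_est, t_est = weighted_kabsch(src, dst, np.ones(100))
```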
2308.09351 Report RLIPv2: Fast Scaling of Relational Language-Image Pre-training Hangjie Yuan, Shiwei Zhang, Xiang Wang, Samuel Albanie, Yining Pan, Tao Feng, Jianwen Jiang, Dong Ni, Yingya Zhang, Deli Zhao Relational Language-Image Pre-training (RLIP) aims to align vision representations with relational texts, thereby advancing the capability of relational reasoning in computer vision tasks. However, hindered by the slow convergence of RLIPv1 architecture and the limited availability of existing scene graph data, scaling RLIPv1 is challenging. In this paper, we propose RLIPv2, a fast converging model that enables the scaling of relational pre-training to large-scale pseudo-labelled scene graph data. To enable fast scaling, RLIPv2 introduces Asymmetric Language-Image Fusion (ALIF), a mechanism that facilitates earlier and deeper gated cross-modal fusion with sparsified language encoding layers. ALIF leads to comparable or better performance than RLIPv1 in a fraction of the time for pre-training and fine-tuning. To obtain scene graph data at scale, we extend object detection datasets with free-form relation labels by introducing a captioner (e.g., BLIP) and a designed Relation Tagger. The Relation Tagger assigns BLIP-generated relation texts to region pairs, thus enabling larger-scale relational pre-training. Through extensive experiments conducted on Human-Object Interaction Detection and Scene Graph Generation, RLIPv2 shows state-of-the-art performance on three benchmarks under fully-finetuning, few-shot and zero-shot settings. Notably, the largest RLIPv2 achieves 23.29mAP on HICO-DET without any fine-tuning, yields 32.22mAP with just 1% data and yields 45.09mAP with 100% data. Code and models are publicly available at https://github.com/JacobYuan7/RLIPv2. RLIPv2 is a fast-converging model for relational language-image pre-training that scales to large pseudo-labeled scene graph datasets, enabling improved relational reasoning in computer vision. Existing methods struggle to scale relational language-image pre-training due to slow convergence and limited scene graph data, hindering progress in relational reasoning. RLIPv2 introduces Asymmetric Language-Image Fusion (ALIF) for faster cross-modal alignment and leverages a Relation Tagger (R-Tagger) to pseudo-label object detection datasets with relation annotations. RLIPv2 achieves comparable or better performance than its predecessor RLIPv1 in a fraction of the training time. RLIPv2 demonstrates state-of-the-art results on HOI detection benchmarks like HICO-DET and V-COCO under various settings, including zero-shot, few-shot, and fully-finetuned. RLIPv2 excels in Scene Graph Generation (SGG), achieving state-of-the-art performance on the Open Images v6 dataset. The performance of the relational pseudo-labeling pipeline depends on the quality of the captions generated by external captioners. Future work includes exploring more advanced captioning methods and investigating the transferability of RLIPv2 to other relational reasoning tasks. vision-language pre-training, relational reasoning, human-object interaction detection, scene graph generation, pseudo-labeling
2308.09314 Report Retro-FPN: Retrospective Feature Pyramid Network for Point Cloud Semantic Segmentation Peng Xiang, Xin Wen, Yu-Shen Liu, Hui Zhang, Yi Fang, Zhizhong Han Learning per-point semantic features from the hierarchical feature pyramid is essential for point cloud semantic segmentation. However, most previous methods suffered from ambiguous region features or failed to refine per-point features effectively, which leads to information loss and ambiguous semantic identification. To resolve this, we propose Retro-FPN to model the per-point feature prediction as an explicit and retrospective refining process, which goes through all the pyramid layers to extract semantic features explicitly for each point. Its key novelty is a retro-transformer for summarizing semantic contexts from the previous layer and accordingly refining the features in the current stage. In this way, the categorization of each point is conditioned on its local semantic pattern. Specifically, the retro-transformer consists of a local cross-attention block and a semantic gate unit. The cross-attention serves to summarize the semantic pattern retrospectively from the previous layer. And the gate unit carefully incorporates the summarized contexts and refines the current semantic features. Retro-FPN is a pluggable neural network that applies to hierarchical decoders. By integrating Retro-FPN with three representative backbones, including both point-based and voxel-based methods, we show that Retro-FPN can significantly improve performance over state-of-the-art backbones. Comprehensive experiments on widely used benchmarks can justify the effectiveness of our design. The source is available at https://github.com/AllenXiangX/Retro-FPN Proposes Retro-FPN, a plug-and-play neural network, to enhance point cloud semantic segmentation by refining per-point features from hierarchical feature pyramids. Addresses the limitations of existing encoder-decoder frameworks that suffer from ambiguous region features or ineffective per-point feature refinement, leading to information loss and inaccurate semantic identification. Introduces a retro-transformer within each pyramid layer of the decoder. This transformer uses local cross-attention to summarize semantic contexts from the previous layer and a semantic gate unit to refine current features, enabling explicit and retrospective refinement of point-level semantic information. Achieves state-of-the-art performance on the S3DIS Area 5 benchmark (73.0 mIoU). Significantly improves performance across various backbones, including point-based and voxel-based methods, on S3DIS, ScanNet, and SemanticKITTI datasets. Demonstrates the effectiveness of the explicit and retrospective refinement strategy for per-point semantic feature learning. Reliance on K-NN search for local semantic contexts might not be optimal for all point cloud distributions. Future work could explore flexible neighbor searching strategies for more accurate context capturing and reduced computational cost. point cloud segmentation, semantic segmentation, feature pyramid network, retrospective refinement, transformer
2308.09306 Report DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability Runhui Huang, Jianhua Han, Guansong Lu, Xiaodan Liang, Yihan Zeng, Wei Zhang, Hang Xu Recently, large-scale diffusion models, e.g., Stable diffusion and DallE2, have shown remarkable results on image synthesis. On the other hand, large-scale cross-modal pre-trained models (e.g., CLIP, ALIGN, and FILIP) are competent for various downstream tasks by learning to align vision and language embeddings. In this paper, we explore the possibility of jointly modeling generation and discrimination. Specifically, we propose DiffDis to unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process. DiffDis first formulates the image-text discriminative problem as a generative diffusion process of the text embedding from the text encoder conditioned on the image. Then, we propose a novel dual-stream network architecture, which fuses the noisy text embedding with the knowledge of latent images from different scales for image-text discriminative learning. Moreover, the generative and discriminative tasks can efficiently share the image-branch network structure in the multi-modality model. Benefiting from diffusion-based unified training, DiffDis achieves both better generation ability and cross-modal semantic alignment in one architecture. Experimental results show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks, e.g., 1.65% improvement on average accuracy of zero-shot classification over 12 datasets and 2.42 improvement on FID of zero-shot image synthesis. Proposes DiffDis, a unified vision-language diffusion model that jointly learns text-conditioned image generation and image-text alignment within a single diffusion framework. Aims to bridge the gap between powerful generative diffusion models and cross-modal discriminative models by empowering the former with the ability to understand and discriminate cross-modal data. Reformulates image-text discrimination as a generative diffusion process for text embeddings conditioned on images, introduces a dual-stream network architecture for deep fusion of knowledge from latent images and text, and proposes a unified training paradigm that alternates between generative and discriminative tasks. Achieves 1.65% average accuracy improvement on zero-shot classification across 12 datasets compared to single-task baseline. Outperforms CLIP by 4.7% on average zero-shot classification accuracy and by 14.5% on average R@1 of image-text retrieval on Flickr30k and MSCOCO. Demonstrates comparable text-guided image generation quality to Stable Diffusion, achieving a 1.0 FID improvement. Generation quality for specific domains (e.g., humans, animals) can be further improved by incorporating domain-specific training data. Presence of watermarks in the training dataset can lead to watermarks in generated images. diffusion models, cross-modal learning, image generation, zero-shot classification, image-text retrieval
2308.09294 Report Self-Calibrated Cross Attention Network for Few-Shot Segmentation Qianxiong Xu, Wenting Zhao, Guosheng Lin, Cheng Long The key to the success of few-shot segmentation (FSS) lies in how to effectively utilize support samples. Most solutions compress support foreground (FG) features into prototypes, but lose some spatial details. Instead, others use cross attention to fuse query features with uncompressed support FG. Query FG could be fused with support FG, however, query background (BG) cannot find matched BG features in support FG, yet inevitably integrates dissimilar features. Besides, as both query FG and BG are combined with support FG, they get entangled, thereby leading to ineffective segmentation. To cope with these issues, we design a self-calibrated cross attention (SCCA) block. For efficient patch-based attention, query and support features are firstly split into patches. Then, we design a patch alignment module to align each query patch with its most similar support patch for better cross attention. Specifically, SCCA takes a query patch as Q, and groups the patches from the same query image and the aligned patches from the support image as K&V. In this way, the query BG features are fused with matched BG features (from query patches), and thus the aforementioned issues will be mitigated. Moreover, when calculating SCCA, we design a scaled-cosine mechanism to better utilize the support features for similarity calculation. Extensive experiments conducted on PASCAL-5^i and COCO-20^i demonstrate the superiority of our model, e.g., the mIoU score under 5-shot setting on COCO-20^i is 5.6%+ better than previous state-of-the-arts. The code is available at https://github.com/Sam1224/SCCAN. This paper proposes Self-Calibrated Cross Attention Network (SCCAN) for Few-Shot Segmentation (FSS) to enhance the utilization of support samples by tackling the background mismatch and foreground-background entanglement issues in existing cross-attention based methods. Existing FSS methods, particularly those based on cross-attention, struggle with effectively utilizing support samples due to mismatched background features and entanglement of foreground and background information, limiting segmentation accuracy. SCCAN leverages a Self-Calibrated Cross Attention (SCCA) block that calculates self and cross attentions concurrently, aligning query patches with the most similar support patches. It also employs a Pseudo Mask Aggregation (PMA) module to generate reliable pseudo masks for query images. SCCAN achieves state-of-the-art results on PASCAL-5i and COCO-20i datasets, significantly outperforming previous methods. The proposed SCCA block effectively addresses the background mismatch and foreground-background entanglement issues. The PMA module generates robust pseudo masks that aid in locating query foreground objects. The current k-shot strategy, which involves averaging support features for k>1, might not be optimal for cross-attention and needs further investigation. Exploring the potential of using support background information in a more effective manner for cross-attention based FSS. few-shot segmentation, cross attention, swin transformer, pseudo mask, foreground-background entanglement
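The "scaled-cosine mechanism" used inside SCCA can be illustrated with a minimal cross-attention sketch in which the attention logits are cosine similarities scaled by a temperature-like factor instead of raw dot products. The patch alignment and the grouping of query-image patches with aligned support patches into K/V are assumed to have happened upstream; the fixed scale value here is an assumption (the paper's version may use a learnable scale).

```python
import torch
import torch.nn.functional as F

def scaled_cosine_cross_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                                  scale: float = 10.0) -> torch.Tensor:
    """Cross attention whose logits are cosine similarities between query and key tokens,
    multiplied by a scale factor. q: (B, Nq, C); k, v: (B, Nk, C)."""
    q_n = F.normalize(q, dim=-1)
    k_n = F.normalize(k, dim=-1)
    attn = torch.softmax(scale * q_n @ k_n.transpose(-2, -1), dim=-1)   # (B, Nq, Nk)
    return attn @ v

# In SCCAN's setting, K/V would concatenate patches from the same query image with the
# support patches aligned to each query patch; random tokens stand in for them here.
q = torch.randn(2, 16, 64)
kv = torch.randn(2, 32, 64)
out = scaled_cosine_cross_attention(q, kv, kv)
```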
2308.09281 Report Diverse Cotraining Makes Strong Semi-Supervised Segmentor Yijiang Li, Xinjiang Wang, Lihe Yang, Litong Feng, Wayne Zhang, Ying Gao Deep co-training has been introduced to semi-supervised segmentation and achieves impressive results, yet few studies have explored the working mechanism behind it. In this work, we revisit the core assumption that supports co-training: multiple compatible and conditionally independent views. By theoretically deriving the generalization upper bound, we prove that the prediction similarity between two models negatively impacts the model's generalization ability. However, most current co-training models are tightly coupled together and violate this assumption. Such coupling leads to the homogenization of networks and confirmation bias, which consequently limits performance. To this end, we explore different dimensions of co-training and systematically increase diversity in input domains, augmentations, and model architectures to counteract homogenization. Our Diverse Co-training outperforms the state-of-the-art (SOTA) methods by a large margin across different evaluation protocols on Pascal and Cityscapes. For example, we achieve the best mIoU of 76.2%, 77.7% and 80.2% on Pascal with only 92, 183 and 366 labeled images, surpassing the previous best results by more than 5%. This paper investigates the lack of diversity in current deep co-training methods for semi-supervised segmentation and proposes Diverse Co-training, a holistic approach to increase diversity in input domains, augmentations, and model architectures. Co-training methods often suffer from homogenization, where the multiple models being trained become too similar, hindering performance. This paper proves theoretically and shows empirically that homogenization negatively impacts generalization ability in co-training. The paper theoretically analyzes the generalization upper bound of co-training, linking it to homogenization. It then explores three techniques to increase diversity: using different input domains (RGB and frequency), applying different augmentations to each model, and using different architectures (CNN and Transformer). These techniques are combined to form Diverse Co-training. Diverse Co-training significantly outperforms previous state-of-the-art methods on Pascal VOC 2012 and Cityscapes datasets across various partition protocols. The paper provides empirical evidence that each of the three proposed techniques (diverse input domains, augmentations, and architectures) contributes to improved performance by reducing homogenization. The proposed method achieves superior performance with fewer parameters compared to some previous SOTA methods, demonstrating its efficiency. The paper primarily focuses on two-model and three-model co-training, leaving the exploration of co-training with more models for future work. While the paper demonstrates the effectiveness of the chosen hyperparameters, a more thorough hyperparameter search for each setting might yield further performance gains. semi-supervised segmentation, co-training, diversity, homogenization, deep learning
2308.09139 Report The Unreasonable Effectiveness of Large Language-Vision Models for Source-free Video Domain Adaptation Giacomo Zara, Alessandro Conti, Subhankar Roy, Stéphane Lathuilière, Paolo Rota, Elisa Ricci Source-Free Video Unsupervised Domain Adaptation (SFVUDA) task consists in adapting an action recognition model, trained on a labelled source dataset, to an unlabelled target dataset, without accessing the actual source data. The previous approaches have attempted to address SFVUDA by leveraging self-supervision (e.g., enforcing temporal consistency) derived from the target data itself. In this work, we take an orthogonal approach by exploiting "web-supervision" from Large Language-Vision Models (LLVMs), driven by the rationale that LLVMs contain a rich world prior surprisingly robust to domain-shift. We showcase the unreasonable effectiveness of integrating LLVMs for SFVUDA by devising an intuitive and parameter-efficient method, which we name Domain Adaptation with Large Language-Vision models (DALL-V), that distills the world prior and complementary source model information into a student network tailored for the target. Despite the simplicity, DALL-V achieves significant improvement over state-of-the-art SFVUDA methods. This paper presents DALL-V, a novel Source-Free Video Unsupervised Domain Adaptation (SFVUDA) method that leverages Large Language-Vision Models (LLVMs) like CLIP to adapt action recognition models to unlabeled target domains without accessing source data. Existing SFVUDA methods often struggle with domain shift and rely heavily on self-supervision from the target data. This paper argues that the rich world prior encoded in LLVMs can effectively bridge the domain gap, exceeding the performance of current sophisticated SFVUDA methods. DALL-V works in two stages: (1) Target Adaptation: Uses zero-shot CLIP to pseudo-label target videos and fine-tunes a target-specific adapter. (2) Ensemble Distillation: Distills information from the source model, target adapter, and CLIP into a smaller student network for inference. DALL-V outperforms state-of-the-art SFVUDA methods, even exceeding some VUDA methods that use source data, achieving significant improvements on the Daily-DA, UCF-HMDB(full), and Sports-DA benchmarks. Ablation studies demonstrate the effectiveness of target adaptation, ensemble distillation, and the use of multiple templates for improving performance. UMAP visualizations show that DALL-V learns a more discriminative feature space compared to using CLIP or source model alone. Reliance on CLIP's black-box nature may pose limitations in safety-critical applications. Lack of theoretical guarantees that LLVMs will always outperform traditional SFVUDA methods. source-free domain adaptation, video action recognition, large language-vision models, clip, knowledge distillation
2308.09098 Report ImGeoNet: Image-induced Geometry-aware Voxel Representation for Multi-view 3D Object Detection Tao Tu, Shun-Po Chuang, Yu-Lun Liu, Cheng Sun, Ke Zhang, Donna Roy, Cheng-Hao Kuo, Min Sun We propose ImGeoNet, a multi-view image-based 3D object detection framework that models 3D space with an image-induced geometry-aware voxel representation. Unlike previous methods which aggregate 2D features into 3D voxels without considering geometry, ImGeoNet learns to induce geometry from multi-view images to alleviate the confusion arising from voxels of free space, and during the inference phase, only images from multiple views are required. Besides, a powerful pre-trained 2D feature extractor can be leveraged by our representation, leading to more robust performance. To evaluate the effectiveness of ImGeoNet, we conduct quantitative and qualitative experiments on three indoor datasets, namely ARKitScenes, ScanNetV2, and ScanNet200. The results demonstrate that ImGeoNet outperforms the current state-of-the-art multi-view image-based method, ImVoxelNet, on all three datasets in terms of detection accuracy. In addition, ImGeoNet shows great data efficiency by achieving results comparable to ImVoxelNet with 100 views while utilizing only 40 views. Furthermore, our studies indicate that our proposed image-induced geometry-aware representation can enable image-based methods to attain higher detection accuracy than the seminal point cloud-based method, VoteNet, in two practical scenarios: (1) scenarios where point clouds are sparse and noisy, such as in ARKitScenes, and (2) scenarios involving diverse object classes, particularly small-object classes, as is the case in ScanNet200. This paper proposes ImGeoNet, a multi-view image-based 3D object detection framework that utilizes an image-induced geometry-aware voxel representation. Existing multi-view image-based methods often overlook geometric information during feature volume construction, limiting their accuracy. ImGeoNet addresses this limitation by incorporating geometry awareness. ImGeoNet constructs a 3D voxel feature volume from multi-view images and then performs geometry shaping. This process involves predicting the likelihood of each voxel belonging to a surface and weighting the feature volume accordingly, thus emphasizing object surfaces and reducing the impact of free space. ImGeoNet outperforms the state-of-the-art multi-view image-based method, ImVoxelNet, on ARKitScenes, ScanNetV2, and ScanNet200 datasets. ImGeoNet achieves comparable results to ImVoxelNet with significantly fewer input views, demonstrating data efficiency. The proposed geometry-aware representation enables ImGeoNet to outperform the point cloud-based method, VoteNet, in scenarios with sparse point clouds (ARKitScenes) or diverse object classes (ScanNet200). There is a performance gap between ImGeoNet and using ground-truth depth for geometry shaping, indicating room for improvement in the Geometry Shaping Network. The inference time of ImGeoNet is slightly higher than that of ImVoxelNet for the same number of views. 3d object detection, multi-view images, geometry-aware representation, voxel feature volume, indoor scenes
2308.09091 Report Edit Temporal-Consistent Videos with Image Diffusion Model Yuanzhi Wang, Yong Li, Xiaoya Zhang, Xin Liu, Anbo Dai, Antoni B. Chan, Zhen Cui Large-scale text-to-image (T2I) diffusion models have been extended for text-guided video editing, yielding impressive zero-shot video editing performance. Nonetheless, the generated videos usually show spatial irregularities and temporal inconsistencies as the temporal characteristics of videos have not been faithfully modeled. In this paper, we propose an elegant yet effective Temporal-Consistent Video Editing (TCVE) method to mitigate the temporal inconsistency challenge for robust text-guided video editing. In addition to the utilization of a pretrained T2I 2D Unet for spatial content manipulation, we establish a dedicated temporal Unet architecture to faithfully capture the temporal coherence of the input video sequences. Furthermore, to establish coherence and interrelation between the spatial-focused and temporal-focused components, a cohesive spatial-temporal modeling unit is formulated. This unit effectively interconnects the temporal Unet with the pretrained 2D Unet, thereby enhancing the temporal consistency of the generated videos while preserving the capacity for video content manipulation. Quantitative experimental results and visualization results demonstrate that TCVE achieves state-of-the-art performance in both video temporal consistency and video editing capability, surpassing existing benchmarks in the field. This paper presents TCVE, a novel text-guided video editing method that leverages a dedicated temporal Unet and a spatial-temporal modeling unit to enhance temporal consistency in edited videos. Existing text-guided video editing methods often produce videos with temporal inconsistencies (e.g., flickering) due to inadequate temporal modeling. TCVE employs a pretrained 2D Unet for spatial editing and a dedicated temporal Unet to capture temporal coherence. A spatial-temporal modeling unit connects these Unets, fusing spatial and temporal information for improved consistency. TCVE outperforms state-of-the-art methods in quantitative metrics for frame consistency, textual alignment, and human preference. Ablation studies confirm the significant contributions of the temporal Unet and the spatial-temporal modeling unit. Qualitative results demonstrate TCVE's capability to generate temporally consistent videos with successful style transfer, object editing, background change, and multiple-object editing. TCVE may struggle with simultaneous manipulation of style, objects, and backgrounds due to limitations of image-based text embedding. Future work can explore incorporating video-based text embedding for enhanced video editing capabilities. text-guided video editing, temporal consistency, temporal unet, spatial-temporal modeling, diffusion models
2308.08947 Report Watch Your Steps: Local Image and Scene Editing by Text Instructions Ashkan Mirzaei, Tristan Aumentado-Armstrong, Marcus A. Brubaker, Jonathan Kelly, Alex Levinshtein, Konstantinos G. Derpanis, Igor Gilitschenski Denoising diffusion models have enabled high-quality image generation and editing. We present a method to localize the desired edit region implicit in a text instruction. We leverage InstructPix2Pix (IP2P) and identify the discrepancy between IP2P predictions with and without the instruction. This discrepancy is referred to as the relevance map. The relevance map conveys the importance of changing each pixel to achieve the edits, and is used to guide the modifications. This guidance ensures that the irrelevant pixels remain unchanged. Relevance maps are further used to enhance the quality of text-guided editing of 3D scenes in the form of neural radiance fields. A field is trained on relevance maps of training views, denoted as the relevance field, defining the 3D region within which modifications should be made. We perform iterative updates on the training views guided by rendered relevance maps from the relevance field. Our method achieves state-of-the-art performance on both image and NeRF editing tasks. Project page: https://ashmrz.github.io/WatchYourSteps/ This paper presents a method for localized image and scene editing using text instructions, leveraging the discrepancy in noise predictions from a diffusion model (InstructPix2Pix) with and without the instruction, termed the 'relevance map'. Existing diffusion-based editing methods often lead to over-editing, modifying regions not specified in the instruction. This method addresses this by explicitly localizing edits to relevant areas, enhancing fidelity to the original input. The method calculates a 'relevance map' by comparing noise predictions from InstructPix2Pix with and without the text instruction. This map guides the editing process, restricting changes to high-relevance regions. For 3D scene editing, a 'relevance field' is trained on these maps to maintain consistency across views. The method achieves state-of-the-art performance on image editing, surpassing baselines in preserving input fidelity while adhering to instructions. In 3D scene editing, it demonstrates superior performance in view consistency and edit localization compared to existing techniques. The generated outputs exhibit high quality and sharpness, outperforming baselines in terms of perceptual metrics like NIQE. The method's reliance on InstructPix2Pix for relevance prediction means it cannot recover from cases where InstructPix2Pix fails significantly. Future work could explore alternative diffusion models for improved robustness and generalization to more complex editing scenarios. image editing, scene editing, text-guided editing, diffusion models, neural radiance fields
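The relevance map itself (the discrepancy between IP2P predictions with and without the instruction) is straightforward to compute once the two noise predictions are available. A hedged sketch follows, assuming eps_with_instruction and eps_without_instruction are epsilon outputs of the same diffusion U-Net on the same noisy latent; the normalization and thresholding details here are illustrative rather than the paper's exact recipe.

```python
import torch

def relevance_map(eps_with_instruction: torch.Tensor,
                  eps_without_instruction: torch.Tensor,
                  quantile: float = 0.9):
    """Build a per-pixel relevance map from the discrepancy between two diffusion noise
    predictions for the same noisy latent: one conditioned on the edit instruction, one not.
    Both inputs are (C, H, W) epsilon predictions from the same model."""
    diff = (eps_with_instruction - eps_without_instruction).abs().mean(dim=0)   # (H, W)
    rel = (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)                # normalise to [0, 1]
    mask = (rel >= torch.quantile(rel, quantile)).float()                       # keep top-relevance pixels
    return rel, mask

# The two epsilon maps would come from two U-Net calls; random tensors stand in for them here.
rel, mask = relevance_map(torch.randn(4, 64, 64), torch.randn(4, 64, 64))
```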
2308.08884 Report SRMAE: Masked Image Modeling for Scale-Invariant Deep Representations Zhiming Wang, Lin Gu, Feng Lu Due to the prevalence of scale variance in nature images, we propose to use image scale as a self-supervised signal for Masked Image Modeling (MIM). Our method involves selecting random patches from the input image and downsampling them to a low-resolution format. Our framework utilizes the latest advances in super-resolution (SR) to design the prediction head, which reconstructs the input from low-resolution clues and other patches. After 400 epochs of pre-training, our Super Resolution Masked Autoencoders (SRMAE) get an accuracy of 82.1% on the ImageNet-1K task. Image scale signal also allows our SRMAE to capture scale invariance representation. For the very low resolution (VLR) recognition task, our model achieves the best performance, surpassing DeriveNet by 1.3%. Our method also achieves an accuracy of 74.84% on the task of recognizing low-resolution facial expressions, surpassing the current state-of-the-art FMD by 9.48%. This paper proposes Super Resolution Masked Autoencoders (SRMAE), a novel Masked Image Modeling (MIM) framework that leverages image scale as a self-supervised signal for learning scale-invariant representations. Scale variance is a prevalent characteristic of natural images and poses challenges for neural networks. Achieving scale invariance is crucial for advancing computer vision, particularly in low-resolution image recognition. SRMAE modifies the traditional MIM architecture by incorporating downsampled image patches as input to the prediction head alongside encoded high-resolution patches. It utilizes a High Preserving Block (HPB) module and a lightweight Vision Transformer (ViT) for resolution recovery, drawing inspiration from super-resolution techniques. SRMAE achieves 82.1% accuracy on ImageNet-1K after 400 epochs, demonstrating its ability to learn scale-invariant representations. In very low-resolution digit classification on the SVHN dataset, SRMAE surpasses previous state-of-the-art methods by 1.3%, achieving 89.14% accuracy. For low-resolution facial expression recognition on the ExpW dataset, SRMAE achieves 74.84% accuracy, surpassing the previous state-of-the-art by 9.5%. The paper acknowledges that using scale as a self-supervised signal might lead to suboptimal performance in from-scratch and fine-tuning scenarios compared to methods using original pixel intensity. Future work can explore incorporating additional modules for enhancing super-resolution capabilities to further improve performance. masked image modeling, self-supervised learning, scale invariance, super-resolution, low-resolution image recognition
2308.08857 Report D-IF: Uncertainty-aware Human Digitization via Implicit Distribution Field Xueting Yang, Yihao Luo, Yuliang Xiu, Wei Wang, Hao Xu, Zhaoxin Fan Realistic virtual humans play a crucial role in numerous industries, such as metaverse, intelligent healthcare, and self-driving simulation. But creating them on a large scale with high levels of realism remains a challenge. The utilization of deep implicit function sparks a new era of image-based 3D clothed human reconstruction, enabling pixel-aligned shape recovery with fine details. Subsequently, the vast majority of works locate the surface by regressing the deterministic implicit value for each point. However, should all points be treated equally regardless of their proximity to the surface? In this paper, we propose replacing the implicit value with an adaptive uncertainty distribution, to differentiate between points based on their distance to the surface. This simple ``value to distribution'' transition yields significant improvements on nearly all the baselines. Furthermore, qualitative results demonstrate that the models trained using our uncertainty distribution loss, can capture more intricate wrinkles, and realistic limbs. Code and models are available for research purposes at https://github.com/psyai-net/D-IF_release. This paper introduces D-IF, a novel method that utilizes implicit distribution fields to capture uncertainty in image-based 3D clothed human reconstruction, leading to improved detail recovery, particularly in challenging poses and loose garments. Creating realistic digital humans with intricate clothing is crucial for various industries, but current methods struggle to balance detail accuracy with handling loose garments and diverse poses. D-IF leverages a distribution-guided network to estimate point-wise occupancy distributions instead of deterministic values. It incorporates an uncertainty distribution loss to balance distribution sharpness, and an Occupancy Rectifier to refine coarse outputs. D-IF achieves state-of-the-art performance on CAPE dataset, outperforming previous methods in challenging pose reconstruction. The method effectively recovers intricate geometric features, mitigating artifacts like distorted limbs and missing details common in other approaches. D-IF acts as a plug-and-play module, demonstrably improving the accuracy of existing implicit-based human reconstruction methods. The method focuses on aleatoric uncertainty, with epistemic uncertainty not directly addressed. Future work could explore applying D-IF to broader shape reconstruction tasks beyond human bodies. 3d human reconstruction, implicit distribution fields, uncertainty estimation, deep learning, computer vision
2308.08769 Report Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes Zehan Wang, Haifeng Huang, Yang Zhao, Ziang Zhang, Zhou Zhao 3D scene understanding has gained significant attention due to its wide range of applications. However, existing methods for 3D scene understanding are limited to specific downstream tasks, which hinders their practicality in real-world applications. This paper presents Chat-3D, which combines the 3D visual perceptual ability of pre-trained 3D representations and the impressive reasoning and conversation capabilities of advanced LLMs to achieve the first universal dialogue systems for 3D scenes. Specifically, we align 3D representations into the feature space of LLMs, thus enabling LLMs to perceive the 3D world. Given the scarcity of 3D scene-text data, we propose a three-stage training strategy to efficiently utilize the available data for better alignment. To enhance the reasoning ability and develop a user-friendly interaction scheme, we further construct a high-quality object-centric 3D instruction dataset and design an associated object-centric prompt. Our experiments show that Chat-3D achieves an impressive ability to comprehend diverse instructions for 3D scenes, engage in intricate spatial reasoning, and incorporate external knowledge into its responses. Chat-3D achieves a 75.6% relative score compared with GPT-4 on the constructed instruction dataset. This paper introduces Chat-3D, the first universal dialogue system for 3D scenes, combining pre-trained 3D representations with the reasoning and conversational abilities of LLMs. Existing 3D scene understanding methods are limited to specific downstream tasks, hindering their practicality. Chat-3D enables general dialogue about 3D scenes, crucial for applications like robotics and human-robot interaction. A three-stage training scheme is used: 1) aligning 3D object features with word embeddings, 2) learning object relations via 3D scene-text data, and 3) fine-tuning with an object-centric instruction dataset. Chat-3D demonstrates impressive ability to comprehend diverse instructions for 3D scenes and engage in spatial reasoning. A novel three-stage training approach effectively aligns 3D representations with LLMs in low-resource scenarios. The constructed object-centric instruction dataset and prompt approach enhance Chat-3D's reasoning ability and user-friendliness. The reliance on 3D object segmentation, either from models or annotations, can impact performance. The current implementation focuses on indoor scenes, limiting generalizability to other environments. 3d scene understanding, universal dialogue system, multi-modal large language model, object-centric instruction dataset, spatial reasoning
2308.08754 Report Fine-grained Text and Image Guided Point Cloud Completion with CLIP Model Wei Song, Jun Zhou, Mingjie Wang, Hongchen Tan, Nannan Li, Xiuping Liu This paper focuses on the recently popular task of point cloud completion guided by multimodal information. Although existing methods have achieved excellent performance by fusing auxiliary images, there are still some deficiencies, including the poor generalization ability of the model and insufficient fine-grained semantic information for extracted features. In this work, we propose a novel multimodal fusion network for point cloud completion, which can simultaneously fuse visual and textual information to predict the semantic and geometric characteristics of incomplete shapes effectively. Specifically, to overcome the lack of prior information caused by the small-scale dataset, we employ a pre-trained vision-language model that is trained with a large amount of image-text pairs. Therefore, the textual and visual encoders of this large-scale model have stronger generalization ability. Then, we propose a multi-stage feature fusion strategy to fuse the textual and visual features into the backbone network progressively. Meanwhile, to further explore the effectiveness of fine-grained text descriptions for point cloud completion, we also build a text corpus with fine-grained descriptions, which can provide richer geometric details for 3D shapes. The rich text descriptions can be used for training and evaluating our network. Extensive quantitative and qualitative experiments demonstrate the superior performance of our method compared to state-of-the-art point cloud completion networks. This paper introduces FTPNet, a novel multimodal point cloud completion network that leverages pre-trained CLIP model for fusing visual and textual information to predict the complete 3D shape from a partial point cloud. Existing point cloud completion methods struggle with limited generalization ability due to small training datasets and insufficient fine-grained semantic information. This work addresses these limitations by incorporating rich multimodal features. The method uses a pre-trained CLIP model to extract visual features from rendered images and textual features from fine-grained geometric descriptions. A multi-stage fusion strategy integrates these features into a basic point cloud completion network. Additionally, a new text corpus 'ViPC-Text' is introduced, containing detailed descriptions of 3D shapes. FTPNet significantly outperforms state-of-the-art point cloud completion methods on both known and novel object categories from the ShapeNet-ViPC dataset. The use of pre-trained CLIP model leads to better generalization ability and improves the quality of reconstructed shapes. Fine-grained text descriptions significantly enhance the model's ability to understand and reconstruct complex structures and details. The model's understanding of fine-grained text information can be further improved. Future work can explore the development of a fine-grained and controllable text-guided 3D point cloud completion framework. point cloud completion, multimodal learning, clip model, fine-grained text descriptions, 3d shape understanding
2308.08428 Report ALIP: Adaptive Language-Image Pre-training with Synthetic Caption Kaicheng Yang, Jiankang Deng, Xiang An, Jiawei Li, Ziyong Feng, Jia Guo, Jing Yang, Tongliang Liu Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks by scaling up the dataset with image-text pairs collected from the web. However, the presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning. To address this issue, we first utilize the OFA model to generate synthetic captions that focus on the image content. The generated captions contain complementary information that is beneficial for pre-training. Then, we propose an Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic caption. As the core components of ALIP, the Language Consistency Gate (LCG) and Description Consistency Gate (DCG) dynamically adjust the weights of samples and image-text/caption pairs during the training process. Meanwhile, the adaptive contrastive loss can effectively reduce the impact of noise data and enhances the efficiency of pre-training data. We validate ALIP with experiments on different scales of models and pre-training datasets. Experiments results show that ALIP achieves state-of-the-art performance on multiple downstream tasks including zero-shot image-text retrieval and linear probe. To facilitate future research, the code and pre-trained models are released at https://github.com/deepglint/ALIP. This paper proposes ALIP (Adaptive Language-Image Pre-training), a bi-path model that leverages synthetic captions to enhance image-text representation learning and mitigate the impact of noisy web data. Existing web-crawled datasets for contrastive language-image pre-training suffer from noisy and mismatched image-text pairs, which can degrade representation quality. ALIP addresses this by incorporating synthetic captions and adaptive weighting mechanisms. ALIP uses OFA to generate synthetic image captions. It then employs two novel gates: Language Consistency Gate (LCG) to weight samples based on raw text and caption similarity and Description Consistency Gate (DCG) to adjust image-text/caption pair weights. These weights are integrated into an adaptive contrastive loss function. ALIP achieves state-of-the-art results on zero-shot image-text retrieval benchmarks Flickr30k and MSCOCO. ALIP significantly outperforms baselines in linear probe evaluation on 10 downstream datasets, demonstrating enhanced representation power. While showing improvements on CIFAR10 and CIFAR100, ALIP's zero-shot classification accuracy lags slightly behind state-of-the-art, potentially due to the coarse nature of generated captions. The current synthetic caption generation model primarily focuses on coarse-grained descriptions, limiting performance on fine-grained tasks. Future work will explore the integration of hierarchical information and finer-grained caption generation into ALIP. image-text representation learning, contrastive learning, synthetic captions, noise-robust learning, vision-language pre-training
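ALIP's adaptive contrastive loss amounts to a CLIP-style symmetric InfoNCE loss in which each image-text (or image-caption) pair is down-weighted by a consistency score from the gating modules. Below is a minimal sketch of such a weighted loss, with the gate outputs passed in as a precomputed pair_weights vector; the exact gating and weighting scheme in ALIP differs in detail, so treat this as an illustration of the idea only.

```python
import torch
import torch.nn.functional as F

def weighted_clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                       pair_weights: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss where each image-text pair contributes according to a
    consistency weight in [0, 1]. img_emb, txt_emb: (B, D) L2-normalised; pair_weights: (B,)."""
    logits = img_emb @ txt_emb.t() / temperature                        # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i = F.cross_entropy(logits, targets, reduction="none")        # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets, reduction="none")    # text -> image direction
    per_pair = 0.5 * (loss_i + loss_t)
    return (pair_weights * per_pair).sum() / (pair_weights.sum() + 1e-8)

img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
loss = weighted_clip_loss(img, txt, torch.rand(8))   # weights would come from the LCG/DCG gates
```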
2308.08393 Report SIGMA: Scale-Invariant Global Sparse Shape Matching Maolin Gao, Paul Roetzer, Marvin Eisenberger, Zorah Lähner, Michael Moeller, Daniel Cremers, Florian Bernard We propose a novel mixed-integer programming (MIP) formulation for generating precise sparse correspondences for highly non-rigid shapes. To this end, we introduce a projected Laplace-Beltrami operator (PLBO) which combines intrinsic and extrinsic geometric information to measure the deformation quality induced by predicted correspondences. We integrate the PLBO, together with an orientation-aware regulariser, into a novel MIP formulation that can be solved to global optimality for many practical problems. In contrast to previous methods, our approach is provably invariant to rigid transformations and global scaling, initialisation-free, has optimality guarantees, and scales to high resolution meshes with (empirically observed) linear time. We show state-of-the-art results for sparse non-rigid matching on several challenging 3D datasets, including data with inconsistent meshing, as well as applications in mesh-to-point-cloud matching. A novel mixed-integer programming formulation for generating precise sparse correspondences for highly non-rigid shapes, using a projected Laplace-Beltrami operator (PLBO) and an orientation-aware regulariser. Addresses limitations of previous methods, such as sensitivity to initialisation, lack of global optimality guarantees, and poor scalability to high-resolution meshes. Develops a PLBO that combines intrinsic and extrinsic geometry to measure deformation quality, integrates PLBO and orientation regulariser into a MIP formulation, and solves for correspondences and shape reconstruction. Achieves state-of-the-art accuracy on challenging datasets, including TOSCA, SMAL, SHREC20, and DT4D-M. Provably invariant to rigid transformations and global scaling, eliminating the need for pre-alignment. Exhibits linear scaling with mesh resolution, enabling application to high-resolution meshes. Performance is not yet perfect for partial shapes due to increased search space. Struggles with topological changes as the mesh of one shape cannot well-explain deformation into the other. shape matching, non-rigid deformation, mixed-integer programming, laplace-beltrami operator, global optimality
2308.08361 Report KernelWarehouse: Towards Parameter-Efficient Dynamic Convolution Chao Li, Anbang Yao Dynamic convolution learns a linear mixture of $n$ static kernels weighted with their sample-dependent attentions, demonstrating superior performance compared to normal convolution. However, existing designs are parameter-inefficient: they increase the number of convolutional parameters by $n$ times. This, together with the optimization difficulty, has stalled progress toward dynamic convolution designs that use a significantly larger value of $n$ (e.g., $n>100$ instead of the typical setting $n<10$) to push forward the performance boundary. In this paper, we propose $KernelWarehouse$, a more general form of dynamic convolution, which can strike a favorable trade-off between parameter efficiency and representation power. Its key idea is to redefine the basic concepts of "$kernels$" and "$assembling$ $kernels$" in dynamic convolution from the perspective of reducing kernel dimension and increasing kernel number significantly. In principle, KernelWarehouse enhances convolutional parameter dependencies within the same layer and across successive layers via tactful kernel partition and warehouse sharing, yielding a high degree of freedom to fit a desired parameter budget. We validate our method on ImageNet and MS-COCO datasets with different ConvNet architectures, and show that it attains state-of-the-art results. For instance, the ResNet18|ResNet50|MobileNetV2|ConvNeXt-Tiny model trained with KernelWarehouse on ImageNet reaches 76.05%|81.05%|75.52%|82.51% top-1 accuracy. Thanks to its flexible design, KernelWarehouse can even reduce the model size of a ConvNet while improving the accuracy, e.g., our ResNet18 model with a 36.45%|65.10% parameter reduction relative to the baseline shows a 2.89%|2.29% absolute improvement in top-1 accuracy. This paper presents KernelWarehouse, a more general and parameter-efficient form of dynamic convolution that balances parameter efficiency and representation power by leveraging parameter dependencies within and across convolutional layers. Existing dynamic convolution methods suffer from parameter inefficiency, hindering their capacity to utilize a large number of kernels for improved performance. KernelWarehouse introduces kernel partition and warehouse sharing. It divides kernels into smaller kernel cells, represents them as linear mixtures from a shared warehouse, and assembles them. A novel attention function with a specific initialization strategy facilitates diverse attention allocation for effective kernel cell weighting. KernelWarehouse consistently outperforms existing dynamic convolution methods on ImageNet and MS-COCO across various ConvNet architectures. It demonstrates the ability to significantly reduce model size while improving accuracy. The proposed attention function and initialization strategy are crucial for achieving optimal performance. The runtime speed of models trained with KernelWarehouse is slower than that of counterparts under a similar model-size budget due to the dense computation of linear mixtures. The paper explores KernelWarehouse on various ConvNets, but further investigation on deeper and larger architectures is limited by computational resources. dynamic convolution, parameter efficiency, kernel partition, warehouse sharing, attention mechanism
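The "assembling kernels" step, building a convolution kernel as attention-weighted linear mixtures of small kernel cells drawn from a shared warehouse, can be sketched as below. The 2x2 cell tiling, the warehouse size, and the static softmax attention are illustrative simplifications; in the actual method the attentions are sample-dependent and the partition scheme is chosen per layer.

```python
import torch

def assemble_kernel(warehouse: torch.Tensor, attention: torch.Tensor, cell_grid=(2, 2)) -> torch.Tensor:
    """Assemble one convolution kernel from a shared warehouse of kernel cells.
    warehouse: (E, c_out, c_in, kh, kw) -- E small kernel cells shared across layers.
    attention: (M, E) -- mixture weights for the M cells of this kernel (here M = 4 cells
    tiled on a 2x2 grid over the output/input channel dimensions)."""
    mixed = torch.einsum('me,eoikl->moikl', attention, warehouse)   # M linear mixtures of cells
    rows, cols = cell_grid
    # Tile the M mixed cells back into one full kernel along the channel dimensions.
    row_blocks = [torch.cat(list(mixed[r * cols:(r + 1) * cols]), dim=1) for r in range(rows)]
    return torch.cat(row_blocks, dim=0)

warehouse = torch.randn(16, 8, 8, 3, 3)                  # 16 shared 8x8x3x3 kernel cells
attention = torch.softmax(torch.randn(4, 16), dim=-1)    # sample-dependent in the real model
kernel = assemble_kernel(warehouse, attention)           # (16, 16, 3, 3) assembled kernel
```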
2308.08321 Report Stable and Causal Inference for Discriminative Self-supervised Deep Visual Representations Yuewei Yang, Hai Li, Yiran Chen In recent years, discriminative self-supervised methods have made significant strides in advancing various visual tasks. The central idea of learning a data encoder that is robust to data distortions/augmentations is straightforward yet highly effective. Although many studies have demonstrated the empirical success of various learning methods, the resulting learned representations can exhibit instability and hinder downstream performance. In this study, we analyze discriminative self-supervised methods from a causal perspective to explain these unstable behaviors and propose solutions to overcome them. Our approach draws inspiration from prior works that empirically demonstrate the ability of discriminative self-supervised methods to demix ground truth causal sources to some extent. Unlike previous work on causality-empowered representation learning, we do not apply our solutions during the training process but rather during the inference process to improve time efficiency. Through experiments on both controlled image datasets and realistic image datasets, we show that our proposed solutions, which involve tempering a linear transformation with controlled synthetic data, are effective in addressing these issues. This paper provides a causal perspective on the instability of discriminative self-supervised learning (SSL) methods and proposes solutions to improve the stability of learned representations during inference. Existing discriminative SSL methods, while effective, can exhibit unstable behavior when encountering subtle data shifts not encountered during training, hindering downstream task performance. Building upon prior work, the authors analyze SSL methods under a causal framework. They demonstrate that learned representations are robust to training augmentations but unstable to unseen data variable shifts. Two solutions, Robust Dimensions and Stable Inference Mapping, are proposed to mitigate this instability during inference. Unstable shifts in data variables lead to significant performance drops in downstream tasks, as demonstrated on Causal3DIdent and ImageNet. Robust Dimensions, leveraging the most important dimensions of stable representations, effectively alleviates deterioration by identifying robust features. Stable Inference Mapping, learning a linear transformation to absorb unstable shifts, improves accuracy on unseen data, as shown on Causal3DIdent and ObjectNet. The proposed solutions assume access to stable-unstable instance pairs or knowledge of specific data variable alterations, limiting their applicability in realistic scenarios. The effectiveness of Stable Inference Mapping might saturate with longer training, requiring more sophisticated interventions for further improvement. self-supervised learning, causal inference, representation learning, domain adaptation, inference stability
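The Stable Inference Mapping idea, learning a linear transformation at inference time that absorbs unstable shifts, reduces to fitting a linear map between paired embeddings. The sketch below is a minimal, assumption-laden illustration: the embeddings are random placeholders standing in for features from a frozen SSL encoder, and the ridge-regularized closed-form fit is an illustrative choice rather than the paper's exact procedure.

```python
# Minimal sketch: fit a linear map W that sends representations of "unstable"
# (shifted) inputs toward their paired "stable" counterparts, then apply it at
# inference before the downstream head.
import numpy as np

rng = np.random.default_rng(0)
d, n = 128, 2048
z_stable = rng.normal(size=(n, d))                  # embeddings of controlled views
shift = rng.normal(size=(d, d)) * 0.05 + np.eye(d)  # synthetic "unstable" perturbation
z_unstable = z_stable @ shift + 0.01 * rng.normal(size=(n, d))

# Ridge-regularized least squares: z_unstable @ W ~= z_stable.
lam = 1e-3
A = z_unstable.T @ z_unstable + lam * np.eye(d)
B = z_unstable.T @ z_stable
W = np.linalg.solve(A, B)

# At inference, shifted embeddings are mapped back toward the stable ones.
z_test = z_stable[:5] @ shift
z_mapped = z_test @ W
print(np.linalg.norm(z_test - z_stable[:5]), np.linalg.norm(z_mapped - z_stable[:5]))
```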
2308.08316 Report Dual-Stream Diffusion Net for Text-to-Video Generation Binhui Liu, Xin Liu, Anbo Dai, Zhiyong Zeng, Dan Wang, Zhen Cui, Jian Yang With the emergence of diffusion models, text-to-video generation has recently attracted increasing attention. An important bottleneck, however, is that generated videos often carry flickers and artifacts. In this work, we propose a dual-stream diffusion net (DSDN) to improve the consistency of content variations when generating videos. In particular, the two designed diffusion streams, a video content branch and a motion branch, not only run separately in their private spaces to produce personalized video variations as well as content, but are also aligned between the content and motion domains through our designed cross-transformer interaction module, which benefits the smoothness of generated videos. We also introduce a motion decomposer and combiner to facilitate the operation on video motion. Qualitative and quantitative experiments demonstrate that our method produces continuous videos with fewer flickers. This paper proposes a dual-stream diffusion net (DSDN) for text-to-video generation that improves the consistency of content variations and reduces flickers in generated videos. Generating realistic and continuous videos from text is a challenging task, and existing methods often produce videos with flickers and artifacts due to difficulties in modeling video dynamics. DSDN uses two diffusion streams – one for video content and one for motion. It leverages a pre-trained text-to-image diffusion model for content and a 3D U-Net for motion. A cross-transformer interaction module aligns the two streams, and motion decomposer/combiner modules facilitate motion processing. DSDN generates videos with higher frame consistency and better textual alignment compared to baselines like CogVideo and Text2Video-Zero. Ablation studies demonstrate the importance of both the content increment unit and motion unit in generating continuous and realistic videos. DSDN generates diverse videos with consistent content, as evidenced by varying actions, appearances, and subtle background changes in the generated cat videos. The content increment unit has limited parameter volume, potentially restricting the diversity of generated content. Future work could explore improving the content increment unit and investigating alternative motion modeling techniques. text-to-video generation, diffusion models, motion modeling, video consistency, deep learning
2308.08258 Report SceNeRFlow: Time-Consistent Reconstruction of General Dynamic Scenes Edith Tretschk, Vladislav Golyanik, Michael Zollhoefer, Aljaz Bozic, Christoph Lassner, Christian Theobalt Existing methods for the 4D reconstruction of general, non-rigidly deforming objects focus on novel-view synthesis and neglect correspondences. However, time consistency enables advanced downstream tasks like 3D editing, motion analysis, or virtual-asset creation. We propose SceNeRFlow to reconstruct a general, non-rigid scene in a time-consistent manner. Our dynamic-NeRF method takes multi-view RGB videos and background images from static cameras with known camera parameters as input. It then reconstructs the deformations of an estimated canonical model of the geometry and appearance in an online fashion. Since this canonical model is time-invariant, we obtain correspondences even for long-term, long-range motions. We employ neural scene representations to parametrize the components of our method. Like prior dynamic-NeRF methods, we use a backwards deformation model. We find non-trivial adaptations of this model necessary to handle larger motions: We decompose the deformations into a strongly regularized coarse component and a weakly regularized fine component, where the coarse component also extends the deformation field into the space surrounding the object, which enables tracking over time. We show experimentally that, unlike prior work that only handles small motion, our method enables the reconstruction of studio-scale motions. SceNeRFlow, an end-to-end differentiable, time-consistent 4D reconstruction method for general dynamic scenes from multi-view RGB input from static cameras. Time consistency in 4D reconstruction enables advanced downstream tasks like 3D editing, motion analysis, or virtual-asset creation by providing long-range, long-term dense 3D correspondences. The method employs a backward deformation model with a time-invariant canonical model for geometry and appearance, and time-dependent deformations. It uses a coarse-and-fine deformation decomposition and introduces a novel approach to extend the deformation field for handling large motion in online, timestamp-by-timestamp tracking. SceNeRFlow achieves time-consistent reconstructions even with large, studio-scale motions, outperforming previous methods in handling complex deformations. The method effectively establishes stable 3D correspondences over time, unlike previous methods that suffer from drift. A trade-off exists between time consistency and novel-view synthesis quality, as variants of SceNeRFlow with time-varying canonical models show improved view synthesis but degraded correspondences. The current method relies on multi-view input and simplifying assumptions like static background and lack of topology changes. Future work will focus on reducing the number of cameras required, incorporating a dynamic background, and handling topology changes in a time-consistent manner. 4d reconstruction, time consistency, neural scene representation, nerf, deformation modeling
2308.08220 Report Low-Light Image Enhancement with Illumination-Aware Gamma Correction and Complete Image Modelling Network Yinglong Wang, Zhen Liu, Jianzhuang Liu, Songcen Xu, Shuaicheng Liu This paper presents a novel network structure with illumination-aware gamma correction and complete image modelling to solve the low-light image enhancement problem. Low-light environments usually lead to less informative large-scale dark areas, so directly learning deep representations from low-light images is insensitive to recovering normal illumination. We propose to integrate the effectiveness of gamma correction with the strong modelling capacities of deep networks, which enables the correction factor gamma to be learned in a coarse-to-elaborate manner by adaptively perceiving the deviated illumination. Because the exponential operation introduces high computational complexity, we propose to use a Taylor series to approximate gamma correction, accelerating training and inference. Dark areas usually occupy large regions of low-light images, so common local modelling structures, e.g., CNN and SwinIR, are insufficient to recover accurate illumination across whole low-light images. We propose a novel Transformer block to completely simulate the dependencies of all pixels across images via a local-to-global hierarchical attention mechanism, so that dark areas can be inferred by borrowing information from distant informative regions in a highly effective manner. Extensive experiments on several benchmark datasets demonstrate that our approach outperforms state-of-the-art methods. This paper proposes IAGC, a novel network for low-light image enhancement by integrating illumination-aware gamma correction and a complete image modelling network. Existing methods struggle to effectively recover illumination from low-light images, especially in large-scale dark areas, leading to poor image quality and inaccurate color recovery. IAGC utilizes a three-stage coarse-to-fine strategy: 1) GGCM module for global brightness enhancement, 2) COMO-ViT block for learning illumination-recovered representations with a local-to-global self-attention mechanism, and 3) LGCM module for local illumination refinement. A Taylor series approximation is used to accelerate gamma correction. IAGC achieves state-of-the-art quantitative results (PSNR, SSIM) on the LOL datasets, outperforming existing methods by significant margins. The proposed method effectively enhances illumination and recovers image details in challenging low-light conditions. Ablation studies demonstrate the effectiveness of the gamma correction modules (GGCM, LGCM) and the local-to-global self-attention mechanism in COMO-ViT. IAGC may exhibit slight local color deviation in extreme low-light cases with severe contrast and hue damage. Future work includes exploring more advanced techniques to address the remaining color deviation in extremely challenging low-light environments. low-light image enhancement, gamma correction, vision transformer, self-attention, deep learning
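The Taylor-series shortcut for gamma correction is easy to make concrete: since I**gamma = exp(gamma * ln I), the exponential can be replaced by a truncated series. The snippet below is a small numerical sketch of that approximation only; the truncation order and value range are illustrative choices, not details taken from the paper.

```python
# Minimal sketch: approximate gamma correction I**gamma = exp(gamma * ln(I))
# with a truncated Taylor series of exp(.), avoiding the exponentiation
# (the truncation order K is an illustrative choice).
import numpy as np

def gamma_taylor(img, gamma, K=6):
    # img is expected in (0, 1]; clip to avoid log(0).
    u = gamma * np.log(np.clip(img, 1e-6, 1.0))
    out, term = np.ones_like(img), np.ones_like(img)
    for n in range(1, K + 1):
        term = term * u / n          # u**n / n!, built incrementally
        out = out + term
    return out

img = np.random.default_rng(0).uniform(0.05, 1.0, size=(4, 4))
print(np.max(np.abs(gamma_taylor(img, 0.45) - img ** 0.45)))  # small approximation error
```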
2308.08157 Report Learning to Generate Semantic Layouts for Higher Text-Image Correspondence in Text-to-Image Synthesis Minho Park, Jooyeol Yun, Seunghwan Choi, Jaegul Choo Existing text-to-image generation approaches have set high standards for photorealism and text-image correspondence, largely benefiting from web-scale text-image datasets, which can include up to 5 billion pairs. However, text-to-image generation models trained on domain-specific datasets, such as urban scenes, medical images, and faces, still suffer from low text-image correspondence due to the lack of text-image pairs. Additionally, collecting billions of text-image pairs for a specific domain can be time-consuming and costly. Thus, ensuring high text-image correspondence without relying on web-scale text-image datasets remains a challenging task. In this paper, we present a novel approach for enhancing text-image correspondence by leveraging available semantic layouts. Specifically, we propose a Gaussian-categorical diffusion process that simultaneously generates both images and corresponding layout pairs. Our experiments reveal that we can guide text-to-image generation models to be aware of the semantics of different image regions, by training the model to generate semantic labels for each pixel. We demonstrate that our approach achieves higher text-image correspondence compared to existing text-to-image generation approaches on the Multi-Modal CelebA-HQ and Cityscapes datasets, where text-image pairs are scarce. Code is available at https://pmh9960.github.io/research/GCDP This paper proposes a novel Gaussian-categorical diffusion process for text-to-image synthesis, aiming to improve text-image correspondence in domain-specific datasets where text-image pairs are scarce. Existing text-to-image models often struggle with low text-image correspondence when trained on domain-specific datasets due to the limited availability of text-image pairs. Collecting billions of pairs for specific domains is costly and challenging. The authors define a Gaussian-categorical diffusion process that models the joint distribution of images and corresponding semantic layouts. This approach allows the model to learn the semantics of different image regions by generating semantic labels for each pixel. The proposed method achieves higher text-image correspondence compared to existing text-to-image generation approaches on Multi-Modal CelebA-HQ and Cityscapes datasets. Analysis reveals that jointly generating image-layout pairs enables the model to be aware of image semantics during generation, improving its ability to match text descriptions with image regions. The model effectively generates image-layout pairs with high alignment, closely resembling the real distribution, and demonstrates promising results in cross-modal outpainting for semantic image synthesis and segmentation. Training the model necessitates semantic layout annotations, which may require additional effort. The model's performance on highly diverse datasets like MS-COCO needs further investigation and improvement. text-to-image synthesis, diffusion models, semantic layouts, text-image correspondence, domain-specific generation
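A Gaussian-categorical diffusion process noises the continuous image and the discrete layout jointly. The toy sketch below shows one forward step of such a joint process, with Gaussian noise on pixels and a uniform-resampling transition on per-pixel labels; the schedule value and the specific categorical kernel are illustrative assumptions, not the paper's exact formulation.

```python
# Toy sketch of one forward step of a joint Gaussian-categorical diffusion:
# pixels receive Gaussian noise, per-pixel semantic labels are resampled
# uniformly with probability beta_t (schedule value is illustrative).
import torch

def joint_forward_step(x, labels, num_classes, beta_t=0.02, generator=None):
    # Gaussian branch (image): x_t = sqrt(1 - beta) * x + sqrt(beta) * eps
    eps = torch.randn(x.shape, generator=generator)
    x_t = (1 - beta_t) ** 0.5 * x + beta_t ** 0.5 * eps
    # Categorical branch (layout): keep the label w.p. 1 - beta, else resample
    resample = torch.rand(labels.shape, generator=generator) < beta_t
    random_labels = torch.randint(0, num_classes, labels.shape, generator=generator)
    labels_t = torch.where(resample, random_labels, labels)
    return x_t, labels_t

g = torch.Generator().manual_seed(0)
x = torch.rand(1, 3, 8, 8)
labels = torch.randint(0, 19, (1, 8, 8), generator=g)
x_t, labels_t = joint_forward_step(x, labels, num_classes=19, generator=g)
print(x_t.shape, (labels_t != labels).float().mean().item())
```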
2308.08089 Report DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, Nan Duan Controllable video generation has gained significant attention in recent years. However, two main limitations persist: Firstly, most existing works focus on either text, image, or trajectory-based control, leading to an inability to achieve fine-grained control in videos. Secondly, trajectory control research is still in its early stages, with most experiments being conducted on simple datasets like Human3.6M. This constraint limits the models' capability to process open-domain images and effectively handle complex curved trajectories. In this paper, we propose DragNUWA, an open-domain diffusion-based video generation model. To tackle the issue of insufficient control granularity in existing works, we simultaneously introduce text, image, and trajectory information to provide fine-grained control over video content from semantic, spatial, and temporal perspectives. To resolve the problem of limited open-domain trajectory control in current research, we propose trajectory modeling with three aspects: a Trajectory Sampler (TS) to enable open-domain control of arbitrary trajectories, a Multiscale Fusion (MF) to control trajectories at different granularities, and an Adaptive Training (AT) strategy to generate consistent videos following trajectories. Our experiments validate the effectiveness of DragNUWA, demonstrating its superior performance in fine-grained control in video generation. The project homepage is at https://www.microsoft.com/en-us/research/project/dragnuwa/ DragNUWA, an open-domain diffusion-based video generation model that integrates text, image, and trajectory controls for fine-grained controllability. Existing methods for controllable video generation lack fine-grained control and struggle with complex trajectories in open-domain settings. DragNUWA introduces three key components: 1) Trajectory Sampler (TS) for sampling arbitrary trajectories from open-domain videos, 2) Multiscale Fusion (MF) for integrating trajectory, text, and image data at different granularities, and 3) Adaptive Training (AT) for generating consistent videos by transitioning from dense optical flow to user-defined trajectories. DragNUWA achieves fine-grained control over camera movements, including zooming and panning. The model effectively handles complex trajectories, including curved paths, variable lengths, and simultaneous control of multiple objects. DragNUWA demonstrates the essentiality of text, image, and trajectory controls for achieving comprehensive control over video generation. The model does not explicitly model camera movement, relying instead on learned representations from trajectory data. Future work could explore incorporating audio or other modalities for enhanced control and realism. video generation, controllable generation, diffusion models, trajectory control, multimodal generation
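A Trajectory Sampler of the kind described here can be pictured as picking sparse anchor points and chaining dense optical flow through time. The snippet below is a simplified, hypothetical sketch of that idea (nearest-neighbour flow lookup, no filtering of trajectory length or density), not DragNUWA's actual TS module; all names and shapes are illustrative.

```python
# Toy sketch: turn dense per-frame optical flow into a few sparse trajectories
# by seeding random anchor points and following the flow frame by frame.
import numpy as np

def sample_trajectories(flows, num_tracks=4, seed=0):
    # flows: (T, H, W, 2) dense optical flow between consecutive frames
    T, H, W, _ = flows.shape
    rng = np.random.default_rng(seed)
    pts = rng.uniform([0, 0], [W - 1, H - 1], size=(num_tracks, 2))  # (x, y) seeds
    tracks = [pts.copy()]
    for t in range(T):
        xi = np.clip(pts[:, 0].round().astype(int), 0, W - 1)
        yi = np.clip(pts[:, 1].round().astype(int), 0, H - 1)
        pts = pts + flows[t, yi, xi]          # follow the flow at each point
        tracks.append(pts.copy())
    return np.stack(tracks, axis=1)           # (num_tracks, T + 1, 2)

flows = 0.5 * np.random.default_rng(1).standard_normal((6, 64, 64, 2))
tracks = sample_trajectories(flows)
print(tracks.shape)  # (4, 7, 2): four trajectories across seven frames
```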
2308.07926 Report CoDeF: Content Deformation Fields for Temporally Consistent Video Processing Hao Ouyang, Qiuyu Wang, Yuxi Xiao, Qingyan Bai, Juntao Zhang, Kecheng Zheng, Xiaowei Zhou, Qifeng Chen, Yujun Shen We present the content deformation field CoDeF as a new type of video representation, which consists of a canonical content field aggregating the static contents in the entire video and a temporal deformation field recording the transformations from the canonical image (i.e., rendered from the canonical content field) to each individual frame along the time axis. Given a target video, these two fields are jointly optimized to reconstruct it through a carefully tailored rendering pipeline. We advisedly introduce some regularizations into the optimization process, urging the canonical content field to inherit semantics (e.g., the object shape) from the video. With such a design, CoDeF naturally supports lifting image algorithms for video processing, in the sense that one can apply an image algorithm to the canonical image and effortlessly propagate the outcomes to the entire video with the aid of the temporal deformation field. We experimentally show that CoDeF is able to lift image-to-image translation to video-to-video translation and lift keypoint detection to keypoint tracking without any training. More importantly, thanks to our lifting strategy that deploys the algorithms on only one image, we achieve superior cross-frame consistency in processed videos compared to existing video-to-video translation approaches, and even manage to track non-rigid objects like water and smog. Project page can be found at https://qiuyu96.github.io/CoDeF/. Introduces Content Deformation Fields (CoDeF), a novel video representation comprising a canonical content field for static content and a temporal deformation field mapping to individual frames, enabling the lifting of image algorithms to video processing. Addresses limitations in video processing quality and temporal consistency compared to image processing by representing video in a manner conducive to leveraging established image algorithms. Employs 2D/3D hash tables for efficient representation, annealed hash encoding for semantic correctness, flow-guided consistency for smoothness, and grouped fields for complex motions. Achieves superior video reconstruction quality (4.4 dB higher PSNR) and efficiency compared to layered neural atlas. Demonstrates successful lifting of image algorithms for video-to-video translation, keypoint tracking, object tracking, super-resolution, and user editing with enhanced temporal consistency. Outperforms existing video processing methods, particularly in temporal consistency and handling complex motions. Current method requires per-scene optimization, limiting scalability. Handling extreme viewpoint changes and large non-rigid deformations poses challenges. video representation, video processing, temporal consistency, content deformation field, image algorithm lifting
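The lifting strategy applies an image algorithm once to the canonical image and re-renders every frame through the temporal deformation field. The sketch below assumes the deformation field has already been converted to per-frame sampling grids in [-1, 1] and uses bilinear grid sampling to propagate an edit; it illustrates the propagation step only, not the paper's rendering pipeline.

```python
# Minimal sketch of the "lifting" step: the edited canonical image is sampled
# with each frame's deformation field (assumed to be given as a per-pixel
# sampling grid in [-1, 1]) to produce the edited frames.
import torch
import torch.nn.functional as F

def propagate_edit(edited_canonical, deformation_grid):
    # edited_canonical: (1, C, H, W); deformation_grid: (T, H, W, 2)
    T = deformation_grid.shape[0]
    canon = edited_canonical.expand(T, -1, -1, -1)
    return F.grid_sample(canon, deformation_grid, mode='bilinear',
                         padding_mode='border', align_corners=True)

canonical = torch.rand(1, 3, 64, 64)
# An identity grid plus a small per-frame shift stands in for the learned field.
ys, xs = torch.meshgrid(torch.linspace(-1, 1, 64), torch.linspace(-1, 1, 64), indexing='ij')
base = torch.stack([xs, ys], dim=-1)                       # (H, W, 2), x coordinate first
grids = torch.stack([base + 0.01 * t for t in range(8)])   # (T, H, W, 2)
frames = propagate_edit(canonical, grids)
print(frames.shape)  # torch.Size([8, 3, 64, 64])
```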
2308.07903 Report Relightable and Animatable Neural Avatar from Sparse-View Video Zhen Xu, Sida Peng, Chen Geng, Linzhan Mou, Zihan Yan, Jiaming Sun, Hujun Bao, Xiaowei Zhou This paper tackles the challenge of creating relightable and animatable neural avatars from sparse-view (or even monocular) videos of dynamic humans under unknown illumination. Compared to studio environments, this setting is more practical and accessible but poses an extremely challenging ill-posed problem. Previous neural human reconstruction methods are able to reconstruct animatable avatars from sparse views using deformed Signed Distance Fields (SDF) but cannot recover material parameters for relighting. While differentiable inverse rendering-based methods have succeeded in material recovery of static objects, it is not straightforward to extend them to dynamic humans as it is computationally intensive to compute pixel-surface intersection and light visibility on deformed SDFs for inverse rendering. To solve this challenge, we propose a Hierarchical Distance Query (HDQ) algorithm to approximate the world space distances under arbitrary human poses. Specifically, we estimate coarse distances based on a parametric human model and compute fine distances by exploiting the local deformation invariance of SDF. Based on the HDQ algorithm, we leverage sphere tracing to efficiently estimate the surface intersection and light visibility. This allows us to develop the first system to recover animatable and relightable neural avatars from sparse view (or monocular) inputs. Experiments demonstrate that our approach is able to produce superior results compared to state-of-the-art methods. Our code will be released for reproducibility. This paper introduces a novel system for reconstructing a relightable and animatable neural avatar from sparse-view or even monocular videos of a human subject under unknown, real-world illumination. Creating such avatars from readily available videos, without the need for specialized studios, significantly expands the potential applications in virtual reality, filmmaking, and video games. The system leverages neural inverse rendering techniques and introduces a novel Hierarchical Distance Query (HDQ) algorithm to efficiently estimate surface intersections and light visibility for physically based rendering. It achieves this by blending distance approximations from a parametric human model and a canonical neural signed distance field. The approach produces superior results compared to existing methods in terms of visual quality and physical accuracy. The HDQ algorithm proves to be essential for enabling accurate and efficient rendering under novel poses and lighting. The system successfully captures challenging material properties like skin shininess and specular highlights on clothing. The training time of the neural avatar is relatively long (20 hours). Future work could explore acceleration methods to speed up the training process. neural rendering, relighting, human avatar, inverse rendering, signed distance field
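The Hierarchical Distance Query can be thought of as a cheap coarse distance that is trusted far from the surface and a fine SDF that takes over near it, with sphere tracing marching along each ray using whichever distance applies. The toy sketch below uses analytic spheres as stand-ins for the parametric-model distance and the canonical neural SDF; the band threshold and all functions are illustrative, not the paper's.

```python
# Toy sketch of sphere tracing driven by a hierarchical distance query: a
# coarse distance (stand-in: a bounding sphere) is refined by a fine SDF
# (stand-in: an analytic sphere) only near the surface.
import numpy as np

def coarse_distance(p):                  # e.g. distance to a bounding primitive
    return np.linalg.norm(p) - 0.6

def fine_distance(p):                    # e.g. canonical SDF queried near the surface
    return np.linalg.norm(p) - 0.5

def hdq(p, band=0.1):
    d = coarse_distance(p)
    return fine_distance(p) if d < band else d

def sphere_trace(origin, direction, max_steps=64, eps=1e-4):
    t = 0.0
    for _ in range(max_steps):
        d = hdq(origin + t * direction)
        if d < eps:
            return t                     # hit: distance along the ray
        t += d
    return None                          # miss

hit_t = sphere_trace(np.array([0.0, 0.0, -2.0]), np.array([0.0, 0.0, 1.0]))
print(hit_t)  # ~1.5 for a radius-0.5 sphere centered at the origin
```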
2308.07868 Report ObjectSDF++: Improved Object-Compositional Neural Implicit Surfaces Qianyi Wu, Kaisiyuan Wang, Kejie Li, Jianmin Zheng, Jianfei Cai In recent years, neural implicit surface reconstruction has emerged as a popular paradigm for multi-view 3D reconstruction. Unlike traditional multi-view stereo approaches, the neural implicit surface-based methods leverage neural networks to represent 3D scenes as signed distance functions (SDFs). However, they tend to disregard the reconstruction of individual objects within the scene, which limits their performance and practical applications. To address this issue, previous work ObjectSDF introduced a nice framework of object-composition neural implicit surfaces, which utilizes 2D instance masks to supervise individual object SDFs. In this paper, we propose a new framework called ObjectSDF++ to overcome the limitations of ObjectSDF. First, in contrast to ObjectSDF whose performance is primarily restricted by its converted semantic field, the core component of our model is an occlusion-aware object opacity rendering formulation that directly volume-renders object opacity to be supervised with instance masks. Second, we design a novel regularization term for object distinction, which can effectively mitigate the issue that ObjectSDF may result in unexpected reconstruction in invisible regions due to the lack of constraint to prevent collisions. Our extensive experiments demonstrate that our novel framework not only produces superior object reconstruction results but also significantly improves the quality of scene reconstruction. Code and more resources can be found at https://qianyiwu.github.io/objectsdf++ This paper presents ObjectSDF++, a novel framework for object-compositional neural implicit surface reconstruction that improves upon ObjectSDF. Existing neural implicit surface reconstruction methods often overlook individual object reconstruction, limiting their application in scene editing and understanding. ObjectSDF++ introduces an occlusion-aware object opacity rendering scheme and an object distinction regularization term to enhance object and scene reconstruction quality. It leverages a multi-resolution feature grid and monocular geometry cues for faster convergence. ObjectSDF++ significantly improves both scene and object reconstruction quality compared to ObjectSDF, as demonstrated on the Replica dataset. The proposed occlusion-aware object opacity rendering proves crucial in enhancing surface reconstruction. ObjectSDF++ achieves state-of-the-art scene reconstruction results on the ScanNet dataset, demonstrating the benefits of object-compositional modeling. The training time for ObjectSDF++ remains high, requiring further optimization for real-time applications. The current framework primarily focuses on closed and solid objects, limiting its applicability to other object types. neural implicit surface reconstruction, object-compositional representation, occlusion-aware rendering, object distinction regularization, 3d scene understanding
2308.07863 Report StyleDiffusion: Controllable Disentangled Style Transfer via Diffusion Models Zhizhong Wang, Lei Zhao, Wei Xing Content and style (C-S) disentanglement is a fundamental problem and critical challenge of style transfer. Existing approaches based on explicit definitions (e.g., Gram matrix) or implicit learning (e.g., GANs) are neither interpretable nor easy to control, resulting in entangled representations and less satisfying results. In this paper, we propose a new C-S disentangled framework for style transfer without using previous assumptions. The key insight is to explicitly extract the content information and implicitly learn the complementary style information, yielding interpretable and controllable C-S disentanglement and style transfer. A simple yet effective CLIP-based style disentanglement loss coordinated with a style reconstruction prior is introduced to disentangle C-S in the CLIP image space. By further leveraging the powerful style removal and generative ability of diffusion models, our framework achieves results superior to the state of the art, as well as flexible C-S disentanglement and trade-off control. Our work provides new insights into the C-S disentanglement in style transfer and demonstrates the potential of diffusion models for learning well-disentangled C-S characteristics. A novel content-style disentangled framework named StyleDiffusion is proposed for artistic style transfer, leveraging diffusion models for explicit content extraction and implicit style learning. Existing style transfer methods suffer from entangled representations, lack of interpretability and controllability, resulting in less satisfying results. StyleDiffusion employs a diffusion-based style removal module to extract domain-aligned content information and a diffusion-based style transfer module to learn and transfer disentangled style guided by a CLIP-based style disentanglement loss and a style reconstruction prior. StyleDiffusion achieves superior style transfer results with fine details and well-preserved content, especially for challenging styles. The framework offers controllable content-style disentanglement and trade-off by adjusting the return step of diffusion models. Quantitative comparisons and user studies demonstrate the effectiveness and superiority of StyleDiffusion over state-of-the-art methods. The current model requires fine-tuning for each style, limiting its application to arbitrary style transfer. The efficiency of the method is hindered by the use of diffusion models, demanding further research on faster diffusion sampling. style transfer, diffusion models, content-style disentanglement, clip, deep learning
2308.07837 Report CCD-3DR: Consistent Conditioning in Diffusion for Single-Image 3D Reconstruction Yan Di, Chenyangguang Zhang, Pengyuan Wang, Guangyao Zhai, Ruida Zhang, Fabian Manhardt, Benjamin Busam, Xiangyang Ji, Federico Tombari In this paper, we present a novel shape reconstruction method leveraging a diffusion model to generate a sparse 3D point cloud for the object captured in a single RGB image. Recent methods typically leverage global embedding or local projection-based features as the condition to guide the diffusion model. However, such strategies fail to consistently align the denoised point cloud with the given image, leading to unstable conditioning and inferior performance. In this paper, we present CCD-3DR, which exploits a novel centered diffusion probabilistic model for consistent local feature conditioning. We constrain the noise and sampled point cloud from the diffusion model into a subspace where the point cloud center remains unchanged during the forward diffusion process and reverse process. The stable point cloud center further serves as an anchor to align each point with its corresponding local projection-based features. Extensive experiments on the synthetic benchmark ShapeNet-R2N2 demonstrate that CCD-3DR outperforms all competitors by a large margin, with over 40% improvement. We also provide results on the real-world dataset Pix3D to thoroughly demonstrate the potential of CCD-3DR in real-world applications. Code will be released soon. This paper presents CCD-3DR, a novel single-image 3D reconstruction method leveraging a centered denoising diffusion probabilistic model (CDPM) for consistent local feature conditioning. Existing diffusion-based 3D reconstruction methods suffer from uncontrollable center deviation of the point cloud during the denoising process, leading to inferior performance. CCD-3DR introduces CDPM, which constrains the noise and point cloud in diffusion and reverse processes to a subspace where the point cloud center is fixed at the origin, enabling consistent local feature conditioning. CCD-3DR significantly outperforms state-of-the-art methods on ShapeNet-R2N2, achieving over 40% improvement in F-Score. CCD-3DR demonstrates superior performance on the real-world Pix3D dataset, showcasing its potential for real-world applications. Ablation studies validate the effectiveness of the proposed CDPM and the consistent local feature conditioning scheme. The centralization scheme might slightly affect the diversity of generated shapes. Future work includes exploring advanced ordinary differential equation solvers to enhance inference speed. 3d reconstruction, diffusion models, single-image reconstruction, point cloud, local feature conditioning
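The centering constraint is simple to illustrate: if the noise added at every diffusion step is projected onto the zero-centroid subspace, the centroid of the point cloud never drifts, which keeps projection-based local features aligned. The snippet below is a minimal sketch of one forward step under that constraint; the schedule value is illustrative and the reverse process is omitted.

```python
# Minimal sketch of a "centered" forward diffusion step for point clouds: the
# Gaussian noise is projected onto the zero-centroid subspace, so the centroid
# of the noised cloud matches the centroid of the clean cloud.
import torch

def centered_forward(x0, alpha_bar=0.7):
    # x0: (N, 3) point cloud, assumed already centered at the origin
    eps = torch.randn_like(x0)
    eps = eps - eps.mean(dim=0, keepdim=True)       # zero-mean noise
    x_t = alpha_bar ** 0.5 * x0 + (1 - alpha_bar) ** 0.5 * eps
    return x_t, eps

x0 = torch.randn(1024, 3)
x0 = x0 - x0.mean(dim=0, keepdim=True)
x_t, _ = centered_forward(x0)
print(x_t.mean(dim=0))  # ~0: the center does not drift during diffusion
```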
2308.07815 Report ImbSAM: A Closer Look at Sharpness-Aware Minimization in Class-Imbalanced Recognition Yixuan Zhou, Yi Qu, Xing Xu, Hengtao Shen Class imbalance is a common challenge in real-world recognition tasks, where the majority of classes have few samples, also known as tail classes. We address this challenge with the perspective of generalization and empirically find that the promising Sharpness-Aware Minimization (SAM) fails to address generalization issues under the class-imbalanced setting. Through investigating this specific type of task, we identify that its generalization bottleneck primarily lies in the severe overfitting for tail classes with limited training data. To overcome this bottleneck, we leverage class priors to restrict the generalization scope of the class-agnostic SAM and propose a class-aware smoothness optimization algorithm named Imbalanced-SAM (ImbSAM). With the guidance of class priors, our ImbSAM specifically improves generalization targeting tail classes. We also verify the efficacy of ImbSAM on two prototypical applications of class-imbalanced recognition: long-tailed classification and semi-supervised anomaly detection, where our ImbSAM demonstrates remarkable performance improvements for tail classes and anomaly. Our code implementation is available at https://github.com/cool-xuan/Imbalanced_SAM. This paper proposes Imbalanced SAM (ImbSAM), a class-aware smoothness optimization algorithm that leverages class priors to improve generalization for tail classes in class-imbalanced recognition tasks. Standard Sharpness-Aware Minimization (SAM), while effective for balanced datasets, fails to address the generalization bottleneck in class-imbalanced settings, specifically the severe overfitting of tail classes with limited training data. ImbSAM incorporates class priors into SAM to restrict smoothness optimization to tail classes. It achieves this by dividing the training set into head and tail sub-sets based on data amount and applying SAM optimization only to the tail sub-set. ImbSAM demonstrates consistent accuracy improvement over baselines and SOTA methods on long-tailed classification benchmarks like CIFAR100-LT, ImageNet-LT and iNaturalist. It significantly improves recognition accuracy for tail classes, especially those with limited training data, effectively addressing the overfitting issue. ImbSAM also shows promising results in semi-supervised anomaly detection, enhancing AUCROC scores and outperforming previous SOTA methods on benchmark datasets. The performance of ImbSAM might be slightly affected when the anomaly ratio is extremely low (<1%) due to the overexposure of limited data. Future work will explore more sophisticated class prior construction methods beyond the simple data amount threshold. class-imbalance, long-tailed classification, anomaly detection, generalization, sharpness-aware minimization
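The ImbSAM recipe restricts the sharpness-aware perturbation to the tail-class loss while the head-class loss is optimized with a plain gradient. The sketch below is a single-tensor toy version of one such update, not the official implementation (which wraps a full optimizer and a data-amount-based class partition); the quadratic losses and hyperparameters are placeholders.

```python
# Toy sketch of one ImbSAM-style update: plain gradient on the head-class loss
# plus a SAM (sharpness-aware) gradient computed only on the tail-class loss.
import torch

def imbsam_step(w, head_loss_fn, tail_loss_fn, lr=0.1, rho=0.05):
    g_head = torch.autograd.grad(head_loss_fn(w), w)[0]
    # SAM perturbation restricted to the tail-class objective
    g_tail = torch.autograd.grad(tail_loss_fn(w), w)[0]
    eps = (rho * g_tail / (g_tail.norm() + 1e-12)).detach()
    g_tail_sam = torch.autograd.grad(tail_loss_fn(w + eps), w)[0]
    with torch.no_grad():
        w -= lr * (g_head + g_tail_sam)
    return w

# Quadratic toy losses standing in for the head- and tail-class terms.
w = torch.randn(10, requires_grad=True)
head = lambda v: ((v - 1.0) ** 2).mean()
tail = lambda v: ((v + 2.0) ** 2).mean()
for _ in range(200):
    imbsam_step(w, head, tail)
print(w.detach().mean())  # settles near -0.5, a compromise between the two minima
```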
2308.07749 Report Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model Bosheng Qin, Wentao Ye, Qifan Yu, Siliang Tang, Yueting Zhuang The rising demand for creating lifelike avatars in the digital realm has led to an increased need for generating high-quality human videos guided by textual descriptions and poses. We propose Dancing Avatar, designed to fabricate human motion videos driven by poses and textual cues. Our approach employs a pretrained T2I diffusion model to generate each video frame in an autoregressive fashion. The crux of innovation lies in our adept utilization of the T2I diffusion model for producing video frames successively while preserving contextual relevance. We surmount the hurdles posed by maintaining human character and clothing consistency across varying poses, along with upholding the background's continuity amidst diverse human movements. To ensure consistent human appearances across the entire video, we devise an intra-frame alignment module. This module assimilates text-guided synthesized human character knowledge into the pretrained T2I diffusion model, synergizing insights from ChatGPT. For preserving background continuity, we put forth a background alignment pipeline, amalgamating insights from segment anything and image inpainting techniques. Furthermore, we propose an inter-frame alignment module that draws inspiration from an auto-regressive pipeline to augment temporal consistency between adjacent frames, where the preceding frame guides the synthesis process of the current frame. Comparisons with state-of-the-art methods demonstrate that Dancing Avatar exhibits the capacity to generate human videos with markedly superior quality, both in terms of human and background fidelity, as well as temporal coherence compared to existing state-of-the-art approaches. This paper introduces Dancing Avatar, a novel pipeline for synthesizing high-quality human motion videos from text descriptions and pose sequences using a pretrained text-to-image diffusion model. Existing text-to-video models for human motion synthesis often produce low-quality videos with temporal inconsistencies. This work addresses these limitations by leveraging the power of pretrained text-to-image models. The proposed Dancing Avatar pipeline employs a pretrained T2I diffusion model and introduces three key modules: 1) Intra-frame alignment ensures consistent human appearance across frames, 2) Background alignment maintains background consistency, and 3) Inter-frame alignment enhances detail coherence between adjacent frames. Dancing Avatar generates human motion videos with superior quality compared to state-of-the-art approaches, as evidenced by lower NIQE and BRISQUE scores. The method exhibits strong alignment with input text prompts and poses, achieving lower Pose MSE and higher CLIP Text Consistency scores. Dancing Avatar excels in maintaining temporal consistency across frames, demonstrating lower Frame MSE and L1 scores, and higher CLIP Frame Consistency scores. The current implementation relies on multiple T2I diffusion models, which could be streamlined for efficiency. Future work can explore extending the framework to generate longer and more complex human motion sequences. human motion synthesis, text-to-video generation, text-to-image diffusion model, temporal consistency, video quality
2308.07732 Report UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation Haiyang Wang, Hao Tang, Shaoshuai Shi, Aoxue Li, Zhenguo Li, Bernt Schiele, Liwei Wang Jointly processing information from multiple sensors is crucial to achieving accurate and robust perception for reliable autonomous driving systems. However, current 3D perception research follows a modality-specific paradigm, leading to additional computation overheads and inefficient collaboration between different sensor data. In this paper, we present an efficient multi-modal backbone for outdoor 3D perception named UniTR, which processes a variety of modalities with unified modeling and shared parameters. Unlike previous works, UniTR introduces a modality-agnostic transformer encoder to handle these view-discrepant sensor data for parallel modal-wise representation learning and automatic cross-modal interaction without additional fusion steps. More importantly, to make full use of these complementary sensor types, we present a novel multi-modal integration strategy by both considering semantic-abundant 2D perspective and geometry-aware 3D sparse neighborhood relations. UniTR is also a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks. It sets a new state-of-the-art performance on the nuScenes benchmark, achieving +1.1 NDS higher for 3D object detection and +12.0 higher mIoU for BEV map segmentation with lower inference latency. Code will be available at https://github.com/Haiyang-W/UniTR. This paper introduces UniTR, a unified and efficient multi-modal transformer backbone for outdoor 3D perception that can process both 3D sparse point clouds and 2D multi-view dense images in parallel to learn unified bird's-eye-view (BEV) representations. Integrating information from multiple sensors like cameras and LiDARs is crucial for robust and accurate 3D perception in autonomous driving. However, existing methods often rely on modality-specific encoders and sequential processing, leading to computational overheads and inefficiencies. UniTR utilizes a modality-agnostic transformer encoder to handle view-discrepant sensor data in parallel. It introduces a novel multi-modal integration strategy by considering both semantic-abundant 2D perspective and geometry-aware 3D sparse neighborhood relations. UniTR achieves state-of-the-art performance on the nuScenes benchmark for 3D object detection and BEV map segmentation. It outperforms previous methods while exhibiting faster inference speed. The model shows robustness against sensor failures, including LiDAR and camera malfunctions. As a single-stride backbone primarily designed for outdoor BEV perception, UniTR's adaptability to tasks like indoor 3D perception is limited. The model lacks flexibility in switching between different sensor modalities (e.g., LiDAR-only or image-only) during inference. autonomous driving, 3d perception, multi-modal fusion, transformer, bird's-eye-view (bev)
2308.07665 Report Inversion-by-Inversion: Exemplar-based Sketch-to-Photo Synthesis via Stochastic Differential Equations without Training Ximing Xing, Chuang Wang, Haitao Zhou, Zhihao Hu, Chongxuan Li, Dong Xu, Qian Yu Exemplar-based sketch-to-photo synthesis allows users to generate photo-realistic images based on sketches. Recently, diffusion-based methods have achieved impressive performance on image generation tasks, enabling highly-flexible control through text-driven generation or energy functions. However, generating photo-realistic images with color and texture from sketch images remains challenging for diffusion models. Sketches typically consist of only a few strokes, with most regions left blank, making it difficult for diffusion-based methods to produce photo-realistic images. In this work, we propose a two-stage method named "Inversion-by-Inversion" for exemplar-based sketch-to-photo synthesis. This approach includes shape-enhancing inversion and full-control inversion. During the shape-enhancing inversion process, an uncolored photo is generated with the guidance of a shape-energy function. This step is essential to ensure control over the shape of the generated photo. In the full-control inversion process, we propose an appearance-energy function to control the color and texture of the final generated photo. Importantly, our Inversion-by-Inversion pipeline is training-free and can accept different types of exemplars for color and texture control. We conducted extensive experiments to evaluate our proposed method, and the results demonstrate its effectiveness. The code and project can be found at https://ximinng.github.io/inversion-by-inversion-project/. This paper presents "Inversion-by-Inversion", a novel training-free, exemplar-based sketch-to-photo synthesis method using stochastic differential equations (SDE). This approach addresses the challenge of diffusion models in generating photo-realistic images from sparse sketches by disentangling shape and appearance control using exemplars. The method utilizes a two-stage inversion process: shape-enhancing inversion generates an uncolored photo preserving the sketch's structure, followed by full-control inversion that incorporates the exemplar's color and texture while maintaining shape fidelity. Significantly outperforms baseline methods in FID scores, indicating higher visual quality and realism. Effectively balances shape control from the sketch and appearance control from the exemplar. Generalizes well to different types of exemplars, including photos, strokes, segmentation maps, and style images. The inference time can be further improved. Exploring more sophisticated energy functions for finer control over specific image features. sketch-to-photo synthesis, diffusion models, stochastic differential equations, exemplar-based image translation, energy-based models
2308.07615 Report Self-supervised Hypergraphs for Learning Multiple World Interpretations Alina Marcu, Mihai Pirvu, Dragos Costea, Emanuela Haller, Emil Slusanschi, Ahmed Nabil Belbachir, Rahul Sukthankar, Marius Leordeanu We present a method for learning multiple scene representations given a small labeled set, by exploiting the relationships between such representations in the form of a multi-task hypergraph. We also show how we can use the hypergraph to improve a powerful pretrained VisTransformer model without any additional labeled data. In our hypergraph, each node is an interpretation layer (e.g., depth or segmentation) of the scene. Within each hyperedge, one or several input nodes predict the layer at the output node. Thus, each node could be an input node in some hyperedges and an output node in others. In this way, multiple paths can reach the same node, to form ensembles from which we obtain robust pseudolabels, which allow self-supervised learning in the hypergraph. We test different ensemble models and different types of hyperedges and show superior performance to other multi-task graph models in the field. We also introduce Dronescapes, a large video dataset captured with UAVs in different complex real-world scenes, with multiple representations, suitable for multi-task learning. This paper introduces a novel self-supervised hypergraph model for learning multiple scene representations (e.g., segmentation, depth, surface normals) from limited labeled data. Learning multiple scene interpretations robustly with minimal human supervision is crucial for real-world applications, especially in complex scenarios like UAV navigation. The method constructs a multi-task hypergraph where nodes represent scene interpretations and hyperedges capture their relationships. Multiple paths through the hypergraph form ensembles, generating robust pseudolabels for self-supervised learning. Higher-order hyperedges outperform pairwise edges in capturing complex relationships between scene interpretations. Learned ensemble models for pseudolabel generation significantly improve accuracy compared to non-parametric methods. The hypergraph effectively improves both accuracy and temporal consistency of predictions during iterative self-supervised learning, even surpassing a state-of-the-art expert model when used for initialization. The model's performance on metric depth estimation, being highly scene-dependent, is less pronounced compared to other tasks. Future work includes exploring more complex hyperedge structures and extending the approach to incorporate temporal information for video understanding. self-supervised learning, multi-task learning, hypergraphs, scene understanding, uav vision
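The self-supervision signal comes from ensembles: several hyperedge paths predict the same output node, and a robust aggregate of their outputs becomes the pseudolabel. The snippet below sketches only that aggregation step with a per-pixel median; the paper additionally learns parametric ensemble models, so the median is an illustrative stand-in rather than the method itself.

```python
# Toy sketch of ensemble pseudolabelling in a multi-task hypergraph: several
# hyperedge paths predict the same node, and a robust per-pixel aggregate of
# their predictions becomes the self-supervision target for that node.
import numpy as np

def ensemble_pseudolabel(path_predictions):
    # path_predictions: (P, H, W) predictions for one node from P hyperedge paths
    return np.median(path_predictions, axis=0)

preds = np.stack([np.random.rand(32, 32) + 0.1 * p for p in range(5)])
pseudo = ensemble_pseudolabel(preds)
print(pseudo.shape)  # (32, 32), used as the training target for the output node
```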
2308.07605 Report SGDiff: A Style Guided Diffusion Model for Fashion Synthesis Zhengwentai Sun, Yanghong Zhou, Honghong He, P. Y. Mok This paper reports on the development of a novel style guided diffusion model (SGDiff) which overcomes certain weaknesses inherent in existing models for image synthesis. The proposed SGDiff combines image modality with a pretrained text-to-image diffusion model to facilitate creative fashion image synthesis. It addresses the limitations of text-to-image diffusion models by incorporating supplementary style guidance, substantially reducing training costs, and overcoming the difficulties of controlling synthesized styles with text-only inputs. This paper also introduces a new dataset, SG-Fashion, specifically designed for fashion image synthesis applications, offering high-resolution images and an extensive range of garment categories. By means of comprehensive ablation study, we examine the application of classifier-free guidance to a variety of conditions and validate the effectiveness of the proposed model for generating fashion images of the desired categories, product attributes, and styles. The contributions of this paper include a novel classifier-free guidance method for multi-modal feature fusion, a comprehensive dataset for fashion image synthesis application, a thorough investigation on conditioned text-to-image synthesis, and valuable insights for future research in the text-to-image synthesis domain. The code and dataset are available at https://github.com/taited/SGDiff. This paper presents SGDiff, a novel style-guided diffusion model for fashion synthesis that integrates image modality with a pretrained text-to-image diffusion model. Existing text-to-image diffusion models struggle to control synthesized styles with text-only inputs and have high training costs. SGDiff addresses these limitations by incorporating style guidance from images. SGDiff uses a pretrained CLIP image encoder to extract style representations and a Skip Cross-Attention module to fuse style and text modalities. It formulates synthesis as image reconstruction, learning from cropped image patches as style guidance. SGDiff successfully synthesizes fashion images with desired categories, attributes, and styles, outperforming existing methods qualitatively and quantitatively. A novel multi-condition classifier-free guidance approach is proposed, enabling flexible control over the generated images. A new dataset, SG-Fashion, is introduced, featuring high-resolution fashion images and a wide range of garment categories. The current implementation focuses on single garment synthesis. Future work will explore generating a complete outfit with multiple garments. The style guidance is limited to a single image patch. Investigating more sophisticated mechanisms for incorporating style information from multiple sources is planned. fashion synthesis, style guidance, text-to-image, diffusion models, clip
2308.07575 Report Story Visualization by Online Text Augmentation with Context Memory Daechul Ahn, Daneul Kim, Gwangmo Song, Seung Hwan Kim, Honglak Lee, Dongyeop Kang, Jonghyun Choi Story visualization (SV) is a challenging text-to-image generation task for the difficulty of not only rendering visual details from the text descriptions but also encoding a long-term context across multiple sentences. While prior efforts mostly focus on generating a semantically relevant image for each sentence, encoding a context spread across the given paragraph to generate contextually convincing images (e.g., with a correct character or with a proper background of the scene) remains a challenge. To this end, we propose a novel memory architecture for the Bi-directional Transformer framework with an online text augmentation that generates multiple pseudo-descriptions as supplementary supervision during training for better generalization to the language variation at inference. In extensive experiments on the two popular SV benchmarks, i.e., the Pororo-SV and Flintstones-SV, the proposed method significantly outperforms the state of the arts in various metrics including FID, character F1, frame accuracy, BLEU-2/3, and R-precision with similar or less computational complexity. Presents a novel memory architecture for Bi-directional Transformer with online text augmentation for story visualization, enhancing context encoding and linguistic generalization. Addresses the challenge of encoding long-term context across sentences in story visualization for generating contextually consistent images. Proposes a context memory module with attentive weighting for dense past information encoding and an online text augmentation scheme generating pseudo-descriptions during training for improved linguistic diversity. Significantly outperforms state-of-the-art story visualization methods in FID, character consistency, and semantic matching metrics. Demonstrates superior performance in preserving character consistency and background context compared to methods without the proposed memory module. Shows comparable or better performance in certain metrics compared to significantly larger pre-trained models like StoryDALL-E. Despite improvements, image quality (FID) still lags behind large pre-trained models due to differences in model size and training data scale. Further research can explore integrating the proposed method with larger models and investigating its applicability to video generation from long paragraphs. story visualization, text-to-image generation, context memory, online text augmentation, transformer
2308.07415 Report Semantify: Simplifying the Control of 3D Morphable Models using CLIP Omer Gralnik, Guy Gafni, Ariel Shamir We present Semantify: a self-supervised method that utilizes the semantic power of CLIP language-vision foundation model to simplify the control of 3D morphable models. Given a parametric model, training data is created by randomly sampling the model's parameters, creating various shapes and rendering them. The similarity between the output images and a set of word descriptors is calculated in CLIP's latent space. Our key idea is first to choose a small set of semantically meaningful and disentangled descriptors that characterize the 3DMM, and then learn a non-linear mapping from scores across this set to the parametric coefficients of the given 3DMM. The non-linear mapping is defined by training a neural network without a human-in-the-loop. We present results on numerous 3DMMs: body shape models, face shape and expression models, as well as animal shapes. We demonstrate how our method defines a simple slider interface for intuitive modeling, and show how the mapping can be used to instantly fit a 3D parametric body shape to in-the-wild images. Semantify is a self-supervised method that simplifies 3D morphable model control using CLIP, enabling intuitive modeling with semantically meaningful descriptors. Controlling 3DMMs is often difficult due to the uninterpretable nature of their parameters. Semantify addresses this by introducing semantic control using natural language descriptors. The method involves: (1) creating a dataset of rendered 3DMM shapes with varying parameters, (2) encoding these images and semantic descriptors into CLIP's latent space, (3) selecting a small set of disentangled descriptors, and (4) training a neural network to map descriptor scores to 3DMM coefficients. Semantify defines a simple slider interface for intuitive 3D model manipulation. It enables zero-shot fitting of 3D body shapes to in-the-wild images, achieving comparable results to state-of-the-art methods. User studies show Semantify is more user-friendly and efficient than traditional control methods. Mapper performance is dependent on the quality and diversity of the training dataset. While Semantify aims for a self-supervised approach, manual fine-tuning for specific 3DMMs could potentially enhance performance. 3d morphable models, clip, semantic modeling, zero-shot learning, human-computer interaction
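The mapping at the heart of Semantify is a small network from descriptor scores to 3DMM coefficients, trained on renders of randomly sampled shapes. The sketch below is a toy version of that mapper only: random tensors stand in for the CLIP similarity scores and the sampled coefficients, and the network size and training length are arbitrary illustrative choices.

```python
# Toy sketch of the Semantify-style mapper: an MLP regresses 3DMM coefficients
# from CLIP similarity scores against a fixed set of word descriptors. Random
# tensors below stand in for CLIP(image, descriptor) scores and the sampled
# coefficients that produced each rendered shape.
import torch
import torch.nn as nn

num_descriptors, num_coeffs = 8, 10      # e.g. "tall", "muscular", ... -> shape betas
mapper = nn.Sequential(
    nn.Linear(num_descriptors, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, num_coeffs),
)

scores = torch.rand(256, num_descriptors)   # placeholder descriptor scores
coeffs = torch.randn(256, num_coeffs)       # placeholder 3DMM coefficients
opt = torch.optim.Adam(mapper.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(mapper(scores), coeffs)
    loss.backward()
    opt.step()

# At edit time, each descriptor score acts as a semantic slider.
sliders = torch.tensor([[0.9, 0.1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]])
print(mapper(sliders).shape)  # torch.Size([1, 10]) -> feed to the 3DMM
```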
2308.07391 Report PARIS: Part-level Reconstruction and Motion Analysis for Articulated Objects Jiayi Liu, Ali Mahdavi-Amiri, Manolis Savva We address the task of simultaneous part-level reconstruction and motion parameter estimation for articulated objects. Given two sets of multi-view images of an object in two static articulation states, we decouple the movable part from the static part and reconstruct shape and appearance while predicting the motion parameters. To tackle this problem, we present PARIS: a self-supervised, end-to-end architecture that learns part-level implicit shape and appearance models and optimizes motion parameters jointly without any 3D supervision, motion, or semantic annotation. Our experiments show that our method generalizes better across object categories, and outperforms baselines and prior work that are given 3D point clouds as input. Our approach improves reconstruction relative to state-of-the-art baselines with a Chamfer-L1 distance reduction of 3.94 (45.2%) for objects and 26.79 (84.5%) for parts, and achieves 5% error rate for motion estimation across 10 object categories. Video summary at: https://youtu.be/tDSrROPCgUc Presents PARIS, a self-supervised method for part-level reconstruction and motion analysis of articulated objects from multi-view images in two static states. Enables understanding and manipulation of articulated objects in areas like robotics, animation, and industrial design, without expensive 3D supervision or category-specific models. Uses composite neural radiance fields to represent static and movable parts, with a transformation function to align the movable part to a canonical state. Employs self-supervisory losses based on input RGB images and object masks. Outperforms baselines in shape and appearance reconstruction, achieving a significant Chamfer-L1 distance reduction. Achieves accurate motion parameter estimation, with low errors in joint axis and state prediction. Demonstrates generalization to unseen object categories, unlike category-specific methods. Relies on a sufficient number of multi-view observations and pre-alignment of object states. Faces challenges with severe occlusions and highly symmetric movable parts. articulated objects, part-level reconstruction, motion analysis, self-supervised learning, neural radiance fields
2308.07314 Report Dual Associated Encoder for Face Restoration Yu-Ju Tsai, Yu-Lun Liu, Lu Qi, Kelvin C. K. Chan, Ming-Hsuan Yang Restoring facial details from low-quality (LQ) images has remained a challenging problem due to its ill-posedness induced by various degradations in the wild. The existing codebook prior mitigates the ill-posedness by leveraging an autoencoder and learned codebook of high-quality (HQ) features, achieving remarkable quality. However, existing approaches in this paradigm frequently depend on a single encoder pre-trained on HQ data for restoring HQ images, disregarding the domain gap between LQ and HQ images. As a result, the encoding of LQ inputs may be insufficient, resulting in suboptimal performance. To tackle this problem, we propose a novel dual-branch framework named DAEFR. Our method introduces an auxiliary LQ branch that extracts crucial information from the LQ inputs. Additionally, we incorporate association training to promote effective synergy between the two branches, enhancing code prediction and output quality. We evaluate the effectiveness of DAEFR on both synthetic and real-world datasets, demonstrating its superior performance in restoring facial details. Project page: https://liagm.github.io/DAEFR/ This paper introduces DAEFR, a novel dual-branch framework for restoring high-quality facial images from severely degraded ones, addressing limitations in existing codebook prior methods. Restoring facial details from low-quality images is crucial for various applications but challenging due to domain gaps and information loss between degraded and high-quality images. DAEFR utilizes an auxiliary LQ branch to extract domain-specific information from degraded inputs. It employs association training to align features from HQ and LQ encoders, bridging the domain gap. A multi-head cross-attention module then fuses these features, enhancing code prediction and restoration. DAEFR outperforms state-of-the-art methods in perceptual quality metrics (FID, NIQE) on real-world datasets, demonstrating robustness against severe degradation. On synthetic datasets, DAEFR achieves competitive performance in image quality (FID, LPIPS) and identity preservation (IDA, LMD). Ablation studies validate the effectiveness of the dual-branch architecture, association stage, and feature fusion module. DAEFR's performance may be limited in extreme pose situations due to the limited diversity of training data. Future work includes exploring alternative feature fusion techniques and extending the approach to handle other image restoration tasks. face restoration, codebook prior, dual-branch network, feature association, multi-head cross-attention
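A minimal sketch of the dual-branch fusion idea, assuming token-shaped encoder features and placeholder dimensions (this is not the released DAEFR code): high-quality-branch tokens query the auxiliary low-quality branch with multi-head cross-attention before code prediction.

```python
# Sketch of cross-attention fusion between an HQ encoder branch and an
# auxiliary LQ encoder branch; sizes are assumptions.
import torch
import torch.nn as nn

class CrossBranchFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, hq_feat, lq_feat):
        # hq_feat, lq_feat: (B, N, C) token sequences from the two encoders.
        # HQ tokens query the LQ branch so degradation-specific information
        # from the LQ input can be injected into the restoration features.
        fused, _ = self.attn(query=hq_feat, key=lq_feat, value=lq_feat)
        return self.norm(hq_feat + fused)

fusion = CrossBranchFusion()
out = fusion(torch.randn(2, 16 * 16, 256), torch.randn(2, 16 * 16, 256))
print(out.shape)  # torch.Size([2, 256, 256])
```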
2308.07102 Report Temporal Sentence Grounding in Streaming Videos Tian Gan, Xiao Wang, Yan Sun, Jianlong Wu, Qingpei Guo, Liqiang Nie This paper aims to tackle a novel task - Temporal Sentence Grounding in Streaming Videos (TSGSV). The goal of TSGSV is to evaluate the relevance between a video stream and a given sentence query. Unlike regular videos, streaming videos are acquired continuously from a particular source, and are always desired to be processed on-the-fly in many applications such as surveillance and live-stream analysis. Thus, TSGSV is challenging since it requires the model to infer without future frames and process long historical frames effectively, which is untouched in the early methods. To specifically address the above challenges, we propose two novel methods: (1) a TwinNet structure that enables the model to learn about upcoming events; and (2) a language-guided feature compressor that eliminates redundant visual frames and reinforces the frames that are relevant to the query. We conduct extensive experiments using ActivityNet Captions, TACoS, and MAD datasets. The results demonstrate the superiority of our proposed methods. A systematic ablation study also confirms their effectiveness. This paper introduces and tackles the novel task of Temporal Sentence Grounding in Streaming Videos (TSGSV), which aims to assess the relevance between a streaming video and a sentence query in an online manner. TSGSV is crucial for applications like surveillance and live-stream analysis, where real-time processing of continuous video streams is essential for identifying events of interest. The paper proposes a TwinNet architecture with an ordinary and a prophet network. The prophet network, with access to future frames during training, guides the ordinary network to understand upcoming events. Additionally, a language-guided feature compressor efficiently summarizes historical information relevant to the query. The proposed method outperforms modified offline temporal sentence grounding methods and online action detection methods on ActivityNet Captions, TACoS, and MAD datasets. Ablation studies confirm the importance of both the language-guided feature compressor and the prophet decoder for accurate and efficient TSGSV. The model's optimized implementation achieves real-time performance suitable for online inference. The current model relies on offline evaluation metrics due to the lack of established online evaluation protocols for TSGSV. Future work will explore extending the model for streaming video-text pretraining to enhance its performance further. temporal sentence grounding, streaming videos, online inference, twinnet, language-guided feature compression
2308.07037 Report Bayesian Flow Networks Alex Graves, Rupesh Kumar Srivastava, Timothy Atkinson, Faustino Gomez This paper introduces Bayesian Flow Networks (BFNs), a new class of generative model in which the parameters of a set of independent distributions are modified with Bayesian inference in the light of noisy data samples, then passed as input to a neural network that outputs a second, interdependent distribution. Starting from a simple prior and iteratively updating the two distributions yields a generative procedure similar to the reverse process of diffusion models; however it is conceptually simpler in that no forward process is required. Discrete and continuous-time loss functions are derived for continuous, discretised and discrete data, along with sample generation procedures. Notably, the network inputs for discrete data lie on the probability simplex, and are therefore natively differentiable, paving the way for gradient-based sample guidance and few-step generation in discrete domains such as language modelling. The loss function directly optimises data compression and places no restrictions on the network architecture. In our experiments BFNs achieve competitive log-likelihoods for image modelling on dynamically binarized MNIST and CIFAR-10, and outperform all known discrete diffusion models on the text8 character-level language modelling task. Introduces Bayesian Flow Networks (BFNs), a new class of generative model that modifies the parameters of independent distributions using Bayesian inference based on noisy data samples, then passes these parameters to a neural network to generate a second, interdependent distribution. Aims to combine the strengths of Bayesian inference for summarizing information about individual variables with the power of deep learning for integrating information across many variables. Also seeks to enable smooth and differentiable generative processes for discrete data, unlike traditional discrete diffusion models. Derives discrete and continuous-time loss functions based on minimizing the KL divergence between sender and receiver distributions. Provides specializations for continuous, discretized, and discrete data, along with algorithms for training, evaluation, and sample generation. BFNs achieve competitive log-likelihoods for image modeling on dynamically binarized MNIST and CIFAR-10. BFNs outperform all known discrete diffusion models on the text8 character-level language modeling task. Discretized loss function performs better than continuous loss for CIFAR-10 with 16 bins, but continuous loss performs better for 256 bins. The accuracy schedule used for binary and continuous data appears suboptimal. Further investigation is needed to understand why continuous loss performs better for CIFAR-10 with 256 bins. generative models, bayesian inference, deep learning, diffusion models, discrete data
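As a rough intuition for the generative procedure, the following toy sketch iterates the kind of step a BFN uses for continuous data: a textbook conjugate Gaussian update of per-dimension input parameters from a noisy sender sample, followed by a network that maps those parameters to an interdependent output distribution. The toy network, shapes, and accuracy values are assumptions, not the paper's exact loss or schedule.

```python
# Conceptual sketch only: conjugate Gaussian parameter update + network readout.
import torch
import torch.nn as nn

def bayesian_update(mu, rho, y, alpha):
    """Posterior of N(x; mu, 1/rho) after observing y ~ N(x, 1/alpha)."""
    rho_new = rho + alpha
    mu_new = (rho * mu + alpha * y) / rho_new
    return mu_new, rho_new

net = nn.Sequential(nn.Linear(64, 128), nn.GELU(), nn.Linear(128, 64))

mu = torch.zeros(1, 64)           # simple prior mean
rho = torch.ones(1, 64)           # simple prior precision
x = torch.randn(1, 64)            # data, used only to simulate the sender
for alpha in [0.2, 0.5, 1.0, 2.0]:            # increasing accuracy schedule
    y = x + torch.randn_like(x) / alpha**0.5  # noisy sender sample
    mu, rho = bayesian_update(mu, rho, y, alpha)
    output_params = net(mu)       # interdependent output-distribution params
```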
2308.07032 Report S3IM: Stochastic Structural SIMilarity and Its Unreasonable Effectiveness for Neural Fields Zeke Xie, Xindi Yang, Yujie Yang, Qi Sun, Yixiang Jiang, Haoran Wang, Yunfeng Cai, Mingming Sun Recently, Neural Radiance Field (NeRF) has shown great success in rendering novel-view images of a given scene by learning an implicit representation with only posed RGB images. NeRF and relevant neural field methods (e.g., neural surface representation) typically optimize a point-wise loss and make point-wise predictions, where one data point corresponds to one pixel. Unfortunately, this line of research failed to use the collective supervision of distant pixels, although it is known that pixels in an image or scene can provide rich structural information. To the best of our knowledge, we are the first to design a nonlocal multiplex training paradigm for NeRF and relevant neural field methods via a novel Stochastic Structural SIMilarity (S3IM) loss that processes multiple data points as a whole set instead of processing multiple inputs independently. Our extensive experiments demonstrate the unreasonable effectiveness of S3IM in improving NeRF and neural surface representation for nearly free. The improvements of quality metrics can be particularly significant for those relatively difficult tasks: e.g., the test MSE loss unexpectedly drops by more than 90% for TensoRF and DVGO over eight novel view synthesis tasks; a 198% F-score gain and a 64% Chamfer $L_{1}$ distance reduction for NeuS over eight surface reconstruction tasks. Moreover, S3IM is consistently robust even with sparse inputs, corrupted images, and dynamic scenes. This paper introduces S3IM, a novel Stochastic Structural SIMilarity index, and a corresponding multiplex training paradigm for Neural Radiance Fields (NeRF) and neural surface representation methods. S3IM captures nonlocal structural similarity information from stochastically sampled pixels and leverages it as a multiplex loss to improve training. Existing NeRF methods rely on point-wise losses (e.g., MSE), neglecting the rich structural information among pixels. This limits their performance, especially for challenging tasks such as few-shot learning and handling corrupted images. S3IM addresses this limitation by incorporating nonlocal structural similarity into the training process. S3IM computes SSIM on stochastically generated patches from sampled pixels, capturing nonlocal structural information. It then integrates this information into a multiplex loss function, combined with the conventional point-wise loss, to train NeRF and neural surface representation models. S3IM significantly improves image quality metrics (PSNR, SSIM, LPIPS) for NeRF variants like DVGO and TensoRF, achieving up to 16.43 and 24.75 PSNR gains on Replica Dataset. S3IM enhances robustness to sparse inputs and corrupted images, exhibiting even greater improvements with fewer or noisier training images. S3IM significantly benefits neural surface reconstruction, leading to substantial gains in both image quality metrics and geometric metrics (e.g., 64% Chamfer L1 distance reduction, 198% F-score gain for NeuS). The current study mainly focuses on S3IM for RGB image losses and could explore its application to depth or other non-RGB losses. Future work can investigate the theoretical understanding of how S3IM improves generalization and affects the flatness of the learned minima. neural radiance fields, nerf, neural rendering, surface reconstruction, multiplex loss
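The following is a deliberately simplified sketch of an S3IM-style term (one stochastic patch, a single global SSIM statistic instead of a windowed SSIM, and no repeated sampling), intended only to show how randomly sampled rays can be grouped into a pseudo-patch and compared structurally; the patch size and constants are assumptions.

```python
# Conceptual S3IM-style loss over stochastically grouped ray colors.
import torch

def s3im_loss(rendered, target, patch_hw=(64, 64)):
    # rendered, target: (N, 3) RGB values of N stochastically sampled rays.
    n = patch_hw[0] * patch_hw[1]
    idx = torch.randperm(rendered.shape[0])[:n]   # stochastic pixel grouping
    x = rendered[idx].reshape(*patch_hw, 3)
    y = target[idx].reshape(*patch_hw, 3)
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    ssim = ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
    return 1.0 - ssim                              # added to the point-wise loss

loss = s3im_loss(torch.rand(8192, 3), torch.rand(8192, 3))
```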
2308.07026 Report AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive Learning Ziqi Zhou, Shengshan Hu, Minghui Li, Hangtao Zhang, Yechao Zhang, Hai Jin Multimodal contrastive learning aims to train a general-purpose feature extractor, such as CLIP, on vast amounts of raw, unlabeled paired image-text data. This can greatly benefit various complex downstream tasks, including cross-modal image-text retrieval and image classification. Despite its promising prospect, the security issue of cross-modal pre-trained encoder has not been fully explored yet, especially when the pre-trained encoder is publicly available for commercial use. In this work, we propose AdvCLIP, the first attack framework for generating downstream-agnostic adversarial examples based on cross-modal pre-trained encoders. AdvCLIP aims to construct a universal adversarial patch for a set of natural images that can fool all the downstream tasks inheriting the victim cross-modal pre-trained encoder. To address the challenges of heterogeneity between different modalities and unknown downstream tasks, we first build a topological graph structure to capture the relevant positions between target samples and their neighbors. Then, we design a topology-deviation based generative adversarial network to generate a universal adversarial patch. By adding the patch to images, we minimize their embeddings similarity to different modality and perturb the sample distribution in the feature space, achieving universal non-targeted attacks. Our results demonstrate the excellent attack performance of AdvCLIP on two types of downstream tasks across eight datasets. We also tailor three popular defenses to mitigate AdvCLIP, highlighting the need for new defense mechanisms to defend cross-modal pre-trained encoders. AdvCLIP, a novel attack framework, is proposed to generate downstream-agnostic adversarial examples for multimodal contrastive learning models, exposing security vulnerabilities in these models and their downstream tasks. Multimodal pre-trained encoders, despite their promising applications in various downstream tasks, have security risks that haven't been fully explored, potentially impacting commercially available models and services. A topology-deviation based generative adversarial network is designed to generate universal adversarial patches. These patches disrupt the similarity between different modality embeddings and their topological relationships, leading to non-targeted attacks on downstream tasks. AdvCLIP demonstrates successful attacks on both image-text retrieval and image classification tasks across various datasets and model architectures. Transformer-based architectures are found to be more vulnerable to these adversarial attacks compared to ResNet-based models. AdvCLIP remains effective even when common defense mechanisms such as data corruption, pruning, and adversarial training are applied. The attack success rate of AdvCLIP can be influenced by the choice of surrogate dataset used for training the adversarial patch generator. Further research is needed to develop more robust defense mechanisms specifically designed to protect multimodal pre-trained encoders from these types of attacks. adversarial patch, pre-trained encoder, cross-modal retrieval, multimodal contrastive learning, security vulnerability
2308.06962 Report Color-NeuS: Reconstructing Neural Implicit Surfaces with Color Licheng Zhong, Lixin Yang, Kailin Li, Haoyu Zhen, Mei Han, Cewu Lu The reconstruction of object surfaces from multi-view images or monocular video is a fundamental issue in computer vision. However, much of the recent research concentrates on reconstructing geometry through implicit or explicit methods. In this paper, we shift our focus towards reconstructing mesh in conjunction with color. We remove the view-dependent color from neural volume rendering while retaining volume rendering performance through a relighting network. Mesh is extracted from the signed distance function (SDF) network for the surface, and color for each surface vertex is drawn from the global color network. To evaluate our approach, we conceived an in-hand object scanning task featuring numerous occlusions and dramatic shifts in lighting conditions. We've gathered several videos for this task, and the results surpass those of any existing methods capable of reconstructing mesh alongside color. Additionally, our method's performance was assessed using public datasets, including DTU, BlendedMVS, and OmniObject3D. The results indicated that our method performs well across all these datasets. Project page: https://colmar-zlicheng.github.io/color_neus. This paper proposes Color-NeuS, a novel method for reconstructing neural implicit surfaces with view-independent color, compatible with NeuS-like models. Reconstructing object surfaces with color from images is a fundamental problem. Existing methods struggle to balance accurate geometry reconstruction with view-independent color extraction, especially under challenging real-world conditions like occlusion and varying lighting. The method decouples view-dependent color in neural volume rendering by learning a view-independent global color and a view-dependent relighting effect. It uses a global color network for vertex color and a relighting network to maintain volume rendering performance. During inference, only the global color is used for mesh vertex coloring. Color-NeuS successfully reconstructs object surfaces with accurate color, outperforming alternative solutions and traditional methods like structured-light scanning and COLMAP. The method handles challenging real-world scenarios with occlusion and reflection effectively, as demonstrated on the IHO-Video dataset. Quantitative evaluations on DTU, BlendedMVS, OmniObject3D, and IHO-Video datasets show Color-NeuS achieves high-quality surface reconstruction and color accuracy. The relighting network relies on the gradient of the SDF as input, which might limit its performance when the SDF network is sub-optimal. Future work could explore more sophisticated relighting networks and incorporate techniques like differentiable rendering for improved accuracy. neural implicit surface, surface reconstruction, view-independent color, relighting network, neural rendering
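A minimal sketch of the color decomposition described above, with assumed layer sizes: a view-independent global color (used to paint mesh vertices) plus a view- and SDF-gradient-dependent relighting offset that is only needed to keep volume rendering accurate during training.

```python
# Illustrative decomposition into global color + relighting residual.
import torch
import torch.nn as nn

class DecomposedColor(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        self.global_color = nn.Sequential(        # colors the extracted mesh
            nn.Linear(3 + feat_dim, 128), nn.ReLU(), nn.Linear(128, 3))
        self.relight = nn.Sequential(              # view-dependent residual
            nn.Linear(3 + 3 + feat_dim, 128), nn.ReLU(), nn.Linear(128, 3))

    def forward(self, x, view_dir, sdf_grad, feat):
        base = torch.sigmoid(self.global_color(torch.cat([x, feat], -1)))
        offset = self.relight(torch.cat([view_dir, sdf_grad, feat], -1))
        rendered = torch.clamp(base + offset, 0.0, 1.0)  # used in volume rendering
        return rendered, base                             # base is view-independent

model = DecomposedColor()
rendered, base = model(torch.rand(1024, 3), torch.rand(1024, 3),
                       torch.rand(1024, 3), torch.rand(1024, 64))
```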
2308.06887 Report Robustified ANNs Reveal Wormholes Between Human Category Percepts Guy Gaziv, Michael J. Lee, James J. DiCarlo The visual object category reports of artificial neural networks (ANNs) are notoriously sensitive to tiny, adversarial image perturbations. Because human category reports (aka human percepts) are thought to be insensitive to those same small-norm perturbations -- and locally stable in general -- this argues that ANNs are incomplete scientific models of human visual perception. Consistent with this, we show that when small-norm image perturbations are generated by standard ANN models, human object category percepts are indeed highly stable. However, in this very same "human-presumed-stable" regime, we find that robustified ANNs reliably discover low-norm image perturbations that strongly disrupt human percepts. These previously undetectable human perceptual disruptions are massive in amplitude, approaching the same level of sensitivity seen in robustified ANNs. Further, we show that robustified ANNs support precise perceptual state interventions: they guide the construction of low-norm image perturbations that strongly alter human category percepts toward specific prescribed percepts. These observations suggest that for arbitrary starting points in image space, there exists a set of nearby "wormholes", each leading the subject from their current category perceptual state into a semantically very different state. Moreover, contemporary ANN models of biological visual processing are now accurate enough to consistently guide us to those portals. This paper provides evidence that robustified ANNs can discover low-norm image perturbations that strongly and precisely modulate human object category percepts, challenging the assumption that human categorization is highly robust to such perturbations. This finding is significant because it suggests the existence of "wormholes" in image space, where local perturbations can lead to drastic changes in human perception, and demonstrates the increasing accuracy of ANNs as scientific models of ventral visual processing. The authors generated image perturbations using robustified and vanilla ANNs in two modes: Disruption Modulation (DM) to induce model errors and Targeted Modulation (TM) to induce specific category judgments. They then measured the effects of these perturbations on human categorization behavior in a nine-way choice task. Human category percepts are highly sensitive to low-norm perturbations discovered by robustified ANNs, but not vanilla ANNs. Robustified ANNs allow for precise targeted modulation of human percepts, guiding them towards specific categories. These effects persist across different image distributions and even extend to composite category perceptions. The study primarily focuses on ResNet50 architecture and L2-norm perturbations, limiting the generalization of findings. While demonstrating the effectiveness of adversarial training, the study doesn't claim it as the mechanism behind human robustness. adversarial robustness, human perception, object categorization, neural networks, visual processing
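For intuition on how such perturbations are searched, here is a hedged sketch of a targeted, L2-constrained projected-gradient procedure of the kind used to steer percepts toward a prescribed class. The classifier below is a toy placeholder, not the robustified ResNet50 from the paper, and the budget/step values are assumptions.

```python
# Sketch of targeted L2 PGD against a (placeholder) robustified classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

def targeted_l2_perturbation(model, image, target_class, eps=30.0,
                             steps=64, step_size=2.0):
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(image + delta), target_class)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            # Normalized descent step toward the target class.
            delta -= step_size * grad / (grad.flatten(1).norm(dim=1)
                                         .view(-1, 1, 1, 1) + 1e-12)
            # Project onto the L2 ball of radius eps.
            norm = delta.flatten(1).norm(dim=1).view(-1, 1, 1, 1)
            delta *= (eps / norm).clamp(max=1.0)
            # Keep pixels of the perturbed image inside [0, 1].
            delta.copy_((image + delta).clamp(0, 1) - image)
    return (image + delta).detach()

toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 9))
x = torch.rand(2, 3, 224, 224)
adv = targeted_l2_perturbation(toy_model, x, torch.tensor([3, 7]))
```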
2308.06749 Report FastLLVE: Real-Time Low-Light Video Enhancement with Intensity-Aware Lookup Table Wenhao Li, Guangyang Wu, Wenyi Wang, Peiran Ren, Xiaohong Liu Low-Light Video Enhancement (LLVE) has received considerable attention in recent years. One of the critical requirements of LLVE is inter-frame brightness consistency, which is essential for maintaining the temporal coherence of the enhanced video. However, most existing single-image-based methods fail to address this issue, resulting in flickering effect that degrades the overall quality after enhancement. Moreover, 3D Convolution Neural Network (CNN)-based methods, which are designed for video to maintain inter-frame consistency, are computationally expensive, making them impractical for real-time applications. To address these issues, we propose an efficient pipeline named FastLLVE that leverages the Look-Up-Table (LUT) technique to maintain inter-frame brightness consistency effectively. Specifically, we design a learnable Intensity-Aware LUT (IA-LUT) module for adaptive enhancement, which addresses the low-dynamic problem in low-light scenarios. This enables FastLLVE to perform low-latency and low-complexity enhancement operations while maintaining high-quality results. Experimental results on benchmark datasets demonstrate that our method achieves the State-Of-The-Art (SOTA) performance in terms of both image quality and inter-frame brightness consistency. More importantly, our FastLLVE can process 1,080p videos at $\mathit{50+}$ Frames Per Second (FPS), which is $\mathit{2 \times}$ faster than SOTA CNN-based methods in inference time, making it a promising solution for real-time applications. The code is available at https://github.com/Wenhao-Li-777/FastLLVE. This paper proposes FastLLVE, a novel LUT-based framework for real-time low-light video enhancement, utilizing an Intensity-Aware LUT (IA-LUT) to maintain inter-frame brightness consistency. Maintaining brightness consistency in low-light video enhancement is crucial for high perceptual quality, but current methods struggle to balance efficiency and performance. Existing methods either suffer from flickering effects or are computationally expensive, making them impractical for real-time applications. The method uses a lightweight encoder-decoder network to extract features from the input video and generate a video-adaptive IA-LUT. The IA-LUT, incorporating enhancement intensity as an additional dimension, facilitates pixel-wise transformation for consistent enhancement and is efficiently implemented via CUDA. FastLLVE achieves state-of-the-art performance in terms of both image quality and inter-frame brightness consistency on benchmark datasets. It maintains superior brightness consistency compared to existing methods, as evidenced by lower AB (Var) and MABD values. The method achieves real-time processing speed of over 50 FPS for 1080p videos, making it significantly faster than CNN-based methods. The dependence on a separate denoising module, while improving visual quality, slightly impacts the overall efficiency. Future work will explore a denoising strategy specifically designed for LUT-based enhancement to further enhance efficiency. low-light video enhancement, lookup table, brightness consistency, real-time, intensity-aware lut
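The core idea of adding an enhancement-intensity dimension to a color LUT can be illustrated with the simplified sketch below, which uses nearest-neighbor indexing into a random 4D table instead of the interpolated, encoder-predicted, CUDA-implemented LUT of the actual method; the LUT size and tensor shapes are assumptions.

```python
# Simplified intensity-aware 4D LUT lookup (nearest-neighbor, random LUT).
import torch

def ia_lut_apply(frames, intensity, lut):
    """frames: (B, T, H, W, 3) in [0, 1]; intensity: (B, T, H, W) in [0, 1];
    lut: (S, S, S, S, 3) mapping (R, G, B, intensity) bins to output RGB."""
    s = lut.shape[0]
    rgb_idx = (frames * (s - 1)).round().long()          # (B, T, H, W, 3)
    int_idx = (intensity * (s - 1)).round().long()       # (B, T, H, W)
    r, g, b = rgb_idx.unbind(-1)
    return lut[r, g, b, int_idx]                          # (B, T, H, W, 3)

lut = torch.rand(17, 17, 17, 17, 3)                       # 17^4 LUT entries
out = ia_lut_apply(torch.rand(1, 8, 64, 64, 3), torch.rand(1, 8, 64, 64), lut)
```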
2308.06739 Report Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks David Junhao Zhang, Mutian Xu, Chuhui Xue, Wenqing Zhang, Xiaoguang Han, Song Bai, Mike Zheng Shou Despite the rapid advancement of unsupervised learning in visual representation, it requires training on large-scale datasets that demand costly data collection, and pose additional challenges due to concerns regarding data privacy. Recently, synthetic images generated by text-to-image diffusion models, have shown great potential for benefiting image recognition. Although promising, there has been inadequate exploration dedicated to unsupervised learning on diffusion-generated images. To address this, we start by uncovering that diffusion models' cross-attention layers inherently provide annotation-free attention masks aligned with corresponding text inputs on generated images. We then investigate the problems of three prevalent unsupervised learning techniques ( i.e., contrastive learning, masked modeling, and vision-language pretraining) and introduce customized solutions by fully exploiting the aforementioned free attention masks. Our approach is validated through extensive experiments that show consistent improvements in baseline models across various downstream tasks, including image classification, detection, segmentation, and image-text retrieval. By utilizing our method, it is possible to close the performance gap between unsupervised pretraining on synthetic data and real-world scenarios. This paper presents Free-ATM, a novel method that leverages the freely available attention masks from text-to-image diffusion models to enhance unsupervised learning on synthetic images. Unsupervised learning heavily relies on large-scale datasets, which are costly to collect and raise privacy concerns. Synthetic data offers a solution, but current methods for unsupervised learning on such data, particularly diffusion-generated images, are underdeveloped. The study leverages the inherent attention masks within diffusion models' cross-attention layers, which align with text inputs to highlight objects in generated images. These masks are then used to address limitations in three unsupervised learning techniques: contrastive learning, masked modeling, and vision-language pretraining. Utilizing the attention masks for instance-level contrastive learning improves performance on object detection and segmentation tasks. Applying the masks to guide the masking process in masked modeling leads to better image classification and semantic segmentation results. Employing the masks for generating position-aware prompts significantly boosts image-text retrieval performance in vision-language pretraining. The quality of synthetic images, while improving, still influences the overall performance gain. Exploring the impact of further increasing the volume of synthetic data used for pretraining. unsupervised learning, diffusion models, synthetic data, attention masks, computer vision
2308.06721 Report IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, Wei Yang Recent years have witnessed the strong power of large text-to-image diffusion models for the impressive generative capability to create high-fidelity images. However, it is very tricky to generate desired images using only text prompt as it often involves complex prompt engineering. An alternative to text prompt is image prompt, as the saying goes: "an image is worth a thousand words". Although existing methods of direct fine-tuning from pretrained models are effective, they require large computing resources and are not compatible with other base models, text prompt, and structural controls. In this paper, we present IP-Adapter, an effective and lightweight adapter to achieve image prompt capability for the pretrained text-to-image diffusion models. The key design of our IP-Adapter is decoupled cross-attention mechanism that separates cross-attention layers for text features and image features. Despite the simplicity of our method, an IP-Adapter with only 22M parameters can achieve comparable or even better performance to a fully fine-tuned image prompt model. As we freeze the pretrained diffusion model, the proposed IP-Adapter can be generalized not only to other custom models fine-tuned from the same base model, but also to controllable generation using existing controllable tools. With the benefit of the decoupled cross-attention strategy, the image prompt can also work well with the text prompt to achieve multimodal image generation. The project page is available at \url{https://ip-adapter.github.io}. Presents IP-Adapter, a lightweight image prompt adapter for text-to-image diffusion models, employing a decoupled cross-attention mechanism for effective integration of image features. Addresses the limitations of text prompts in image generation, enabling more intuitive and informative image-based prompts for controlling content generation. Leverages a pretrained CLIP image encoder to extract image features, employs a projection network to decompose global image embedding, and introduces decoupled cross-attention layers within the UNet architecture to effectively embed image features. Achieves comparable or even better performance than fully fine-tuned image prompt models and existing adapter methods. Demonstrates strong generalization capabilities by seamlessly integrating with custom models and existing structure control tools like ControlNet. Enables multimodal image generation by effectively combining image prompts with text prompts for enhanced control and diversity. Limited ability to generate highly consistent images with the subject of a given image. Further research is needed to enhance consistency and explore the use of fine-grained image features for improved control. image generation, diffusion models, image prompt, controllable generation, multimodal generation
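A minimal sketch of the decoupled cross-attention idea: the same latent query attends separately to text tokens and to projected image-prompt tokens, and the two outputs are summed with a weighting factor. This simplification uses two full attention modules (the actual adapter adds only new key/value projections to the frozen base attention), and the dimensions, head count, and number of image tokens are assumptions.

```python
# Illustrative decoupled cross-attention for text + image prompts.
import torch
import torch.nn as nn

class DecoupledCrossAttention(nn.Module):
    def __init__(self, dim=320, heads=8, scale=1.0):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.scale = scale                       # weights the image-prompt branch

    def forward(self, latent_tokens, text_tokens, image_tokens):
        out_text, _ = self.text_attn(latent_tokens, text_tokens, text_tokens)
        out_image, _ = self.image_attn(latent_tokens, image_tokens, image_tokens)
        return out_text + self.scale * out_image

attn = DecoupledCrossAttention()
out = attn(torch.randn(2, 4096, 320),   # UNet latent tokens
           torch.randn(2, 77, 320),     # text-prompt tokens
           torch.randn(2, 4, 320))      # projected image-prompt tokens
```

Setting `scale` to 0 recovers pure text conditioning, which is one way such an adapter can remain compatible with text-only generation.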
2308.06713 Report LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts Binbin Yang, Yi Luo, Ziliang Chen, Guangrun Wang, Xiaodan Liang, Liang Lin Thanks to the rapid development of diffusion models, unprecedented progress has been witnessed in image synthesis. Prior works mostly rely on pre-trained linguistic models, but a text is often too abstract to properly specify all the spatial properties of an image, e.g., the layout configuration of a scene, leading to the sub-optimal results of complex scene generation. In this paper, we achieve accurate complex scene generation by proposing a semantically controllable Layout-AWare diffusion model, termed LAW-Diffusion. Distinct from the previous Layout-to-Image generation (L2I) methods that only explore category-aware relationships, LAW-Diffusion introduces a spatial dependency parser to encode the location-aware semantic coherence across objects as a layout embedding and produces a scene with perceptually harmonious object styles and contextual relations. To be specific, we delicately instantiate each object's regional semantics as an object region map and leverage a location-aware cross-object attention module to capture the spatial dependencies among those disentangled representations. We further propose an adaptive guidance schedule for our layout guidance to mitigate the trade-off between the regional semantic alignment and the texture fidelity of generated objects. Moreover, LAW-Diffusion allows for instance reconfiguration while maintaining the other regions in a synthesized image by introducing a layout-aware latent grafting mechanism to recompose its local regional semantics. To better verify the plausibility of generated scenes, we propose a new evaluation metric for the L2I task, dubbed Scene Relation Score (SRS) to measure how the images preserve the rational and harmonious relations among contextual objects. Comprehensive experiments demonstrate that our LAW-Diffusion yields the state-of-the-art generative performance, especially with coherent object relations. This paper proposes LAW-Diffusion, a novel Layout-Aware diffusion model for synthesizing complex scene images with harmonious object relations from layout configurations. Existing text-to-image models struggle to accurately capture and maintain spatial relationships between objects in complex scenes, while previous layout-to-image methods often lack overall harmony and style consistency. LAW-Diffusion aims to address these limitations. LAW-Diffusion leverages a spatial dependency parser to encode location-aware semantic coherence across objects as a layout embedding. It utilizes a location-aware cross-object attention module to capture spatial dependencies and employs an adaptive guidance schedule for balancing regional semantic alignment and texture fidelity. It also incorporates a layout-aware latent grafting mechanism for instance reconfiguration within generated scenes. LAW-Diffusion outperforms state-of-the-art L2I methods in terms of FID, Inception Score, and Classification Accuracy Score, demonstrating its superior image fidelity. The proposed Scene Relation Score (SRS) metric highlights LAW-Diffusion's ability to generate scenes with plausible and coherent object relations. The layout-aware latent grafting mechanism enables flexible instance-level reconfiguration (adding, removing, restyling) while preserving overall scene coherence. LAW-Diffusion currently focuses on closed-world object categories pre-defined in the datasets. It lacks the ability to specify scene-level style and semantics through global scene descriptions. image generation, diffusion models, layout-to-image generation, scene understanding, computer vision
2308.06699 Report Neural Super-Resolution for Real-time Rendering with Radiance Demodulation Jia Li, Ziling Chen, Xiaolong Wu, Lu Wang, Beibei Wang, Lei Zhang It is time-consuming to render high-resolution images in applications such as video games and virtual reality, and thus super-resolution technologies become increasingly popular for real-time rendering. However, it is challenging to preserve sharp texture details, keep the temporal stability and avoid the ghosting artifacts in real-time super-resolution rendering. To address this issue, we introduce radiance demodulation to separate the rendered image or radiance into a lighting component and a material component, considering the fact that the light component is smoother than the rendered image so that the high-resolution material component with detailed textures can be easily obtained. We perform the super-resolution on the lighting component only and re-modulate it with the high-resolution material component to obtain the final super-resolution image with more texture details. A reliable warping module is proposed by explicitly marking the occluded regions to avoid the ghosting artifacts. To further enhance the temporal stability, we design a frame-recurrent neural network and a temporal loss to aggregate the previous and current frames, which can better capture the spatial-temporal consistency among reconstructed frames. As a result, our method is able to produce temporally stable results in real-time rendering with high-quality details, even in the challenging 4 $\times$ 4 super-resolution scenarios. This paper introduces a novel lightweight super-resolution method for real-time rendering that leverages radiance demodulation, a motion-unreliable region detection approach, and a frame-recurrent neural network. Real-time rendering demands both high resolution and low latency. Super-resolution rendering helps but struggles to preserve sharp texture details, maintain temporal stability, and avoid ghosting artifacts. The method demodulates the rendered image into lighting and material components, performing super-resolution only on the smoother lighting component. It identifies and mitigates ghosting artifacts using a motion mask and employs a frame-recurrent network with a temporal loss for temporal stability. Significantly outperforms state-of-the-art VSR and RRSR methods both qualitatively and quantitatively. Preserves more texture details and avoids ghosting artifacts compared to other methods. Achieves real-time performance with significant efficiency improvements over rendering high-resolution images directly. Generalization across different scenes comes at the cost of slightly reduced quality. Future work could explore hardware acceleration and further quality improvements. super-resolution, real-time rendering, radiance demodulation, motion mask, frame-recurrent neural network
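The demodulation/re-modulation step can be sketched in a few lines. In this illustration the super-resolution network is replaced by bilinear upsampling and the resolutions are placeholders; the real pipeline uses the frame-recurrent network and warping module described above.

```python
# Conceptual radiance demodulation and re-modulation.
import torch
import torch.nn.functional as F

def demodulate(radiance_lr, material_lr, eps=1e-4):
    # Lighting = rendered radiance divided by the material (texture/albedo) term.
    return radiance_lr / (material_lr + eps)

def remodulate(lighting_lr, material_hr, scale=4):
    lighting_hr = F.interpolate(lighting_lr, scale_factor=scale,
                                mode="bilinear", align_corners=False)
    return lighting_hr * material_hr          # re-attach high-res texture detail

radiance_lr = torch.rand(1, 3, 270, 480)
material_lr = torch.rand(1, 3, 270, 480)
material_hr = torch.rand(1, 3, 1080, 1920)   # material is cheap to render at full res
sr_frame = remodulate(demodulate(radiance_lr, material_lr), material_hr)
```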
2308.06624 Report ADRMX: Additive Disentanglement of Domain Features with Remix Loss Berker Demirel, Erchan Aptoula, Huseyin Ozkan The common assumption that train and test sets follow similar distributions is often violated in deployment settings. Given multiple source domains, domain generalization aims to create robust models capable of generalizing to new unseen domains. To this end, most of existing studies focus on extracting domain invariant features across the available source domains in order to mitigate the effects of inter-domain distributional changes. However, this approach may limit the model's generalization capacity by relying solely on finding common features among the source domains. It overlooks the potential presence of domain-specific characteristics that could be prevalent in a subset of domains, potentially containing valuable information. In this work, a novel architecture named Additive Disentanglement of Domain Features with Remix Loss (ADRMX) is presented, which addresses this limitation by incorporating domain variant features together with the domain invariant ones using an original additive disentanglement strategy. Moreover, a new data augmentation technique is introduced to further support the generalization capacity of ADRMX, where samples from different domains are mixed within the latent space. Through extensive experiments conducted on DomainBed under fair conditions, ADRMX is shown to achieve state-of-the-art performance. Code will be made available at GitHub after the revision process. This paper presents ADRMX, a novel architecture for domain generalization that leverages both domain variant and invariant features through an additive disentanglement strategy and a novel data augmentation technique. Domain generalization aims to improve the robustness of models when faced with distributional shifts between training (source) and unseen (target) domains, a common challenge in real-world deployments. ADRMX uses two backbones to extract label and domain features. It then employs an adversarial learning framework to disentangle domain-invariant features while using a novel remix loss and data augmentation technique to combine features from different domains in the latent space. ADRMX achieves state-of-the-art performance on the DomainBed benchmark, surpassing previous approaches. The additive modeling strategy, incorporating both domain-variant and invariant features, proves beneficial for generalization. The remix loss, facilitating data augmentation in the latent space, further improves the model's performance. The computational cost of ADRMX, particularly for large datasets, can be a limitation. Exploring alternative backbone architectures and data augmentation strategies could further enhance the performance. domain generalization, disentanglement, deep learning, image classification, data augmentation
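The additive feature model and the remix-style latent augmentation can be sketched as follows. The backbones, feature sizes, and class count are stand-ins, and the adversarial domain branch and exact loss wiring of the paper are omitted.

```python
# Sketch: additive label + domain features, with remix augmentation in latent space.
import torch
import torch.nn as nn

label_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))   # invariant part
domain_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))  # variant part
classifier = nn.Linear(128, 7)

x = torch.rand(16, 3, 32, 32)
z_label = label_net(x)
z_domain = domain_net(x)
logits = classifier(z_label + z_domain)        # additive composition

# Remix augmentation: pair each sample's label feature with another sample's
# domain feature and still require the original class prediction.
perm = torch.randperm(x.shape[0])
logits_remix = classifier(z_label + z_domain[perm])
```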
2308.06622 Report DFM-X: Augmentation by Leveraging Prior Knowledge of Shortcut Learning Shunxin Wang, Christoph Brune, Raymond Veldhuis, Nicola Strisciuglio Neural networks are prone to learn easy solutions from superficial statistics in the data, namely shortcut learning, which impairs generalization and robustness of models. We propose a data augmentation strategy, named DFM-X, that leverages knowledge about frequency shortcuts, encoded in Dominant Frequencies Maps computed for image classification models. We randomly select X% training images of certain classes for augmentation, and process them by retaining the frequencies included in the DFMs of other classes. This strategy compels the models to leverage a broader range of frequencies for classification, rather than relying on specific frequency sets. Thus, the models learn more deep and task-related semantics compared to their counterpart trained with standard setups. Unlike other commonly used augmentation techniques which focus on increasing the visual variations of training data, our method targets exploiting the original data efficiently, by distilling prior knowledge about destructive learning behavior of models from data. Our experimental results demonstrate that DFM-X improves robustness against common corruptions and adversarial attacks. It can be seamlessly integrated with other augmentation techniques to further enhance the robustness of models. Proposes DFM-X, a novel data augmentation method leveraging prior knowledge of frequency shortcuts to improve generalization and robustness of image classification models. Addresses the issue of shortcut learning in neural networks, where models rely on superficial statistics instead of task-related semantics, hindering generalization and robustness. Computes Dominant Frequency Maps (DFMs) for each class, identifying frequency shortcuts. Augments training images by filtering their frequency spectrum using DFMs of other classes, forcing models to utilize a broader range of frequencies. DFM-X improves robustness against common corruptions and adversarial attacks without sacrificing accuracy on clean images. Combining DFM-X with AugMix or AutoAugment further enhances robustness, indicating complementarity. The percentage of images augmented by DFM-X (X) influences robustness, with lower-capacity models benefiting from higher values. Limited investigation into the interplay between model capacity, DFM-X augmentation percentage, and specific augmentation operations. Further exploration of combining DFM-X with other augmentation techniques beyond AugMix and AutoAugment. shortcut learning, data augmentation, frequency analysis, robustness, generalization
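A minimal sketch of the augmentation step: keep only the frequency bins marked in a dominant-frequency map (here a random stand-in; in practice derived from a trained model for another class), so the network cannot rely on a narrow set of shortcut frequencies.

```python
# DFM-style frequency filtering of a training image.
import torch

def apply_dfm(image, dfm_mask):
    """image: (C, H, W) in [0, 1]; dfm_mask: (H, W) binary map of retained
    frequencies in centered (fftshift) layout."""
    spectrum = torch.fft.fftshift(torch.fft.fft2(image), dim=(-2, -1))
    filtered = spectrum * dfm_mask                        # drop unmarked bins
    out = torch.fft.ifft2(torch.fft.ifftshift(filtered, dim=(-2, -1)))
    return out.real.clamp(0, 1)

img = torch.rand(3, 32, 32)
mask = (torch.rand(32, 32) > 0.5).float()                 # stand-in DFM
augmented = apply_dfm(img, mask)
```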
2308.06571 Report ModelScope Text-to-Video Technical Report Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, Shiwei Zhang This paper introduces ModelScopeT2V, a text-to-video synthesis model that evolves from a text-to-image synthesis model (i.e., Stable Diffusion). ModelScopeT2V incorporates spatio-temporal blocks to ensure consistent frame generation and smooth movement transitions. The model could adapt to varying frame numbers during training and inference, rendering it suitable for both image-text and video-text datasets. ModelScopeT2V brings together three components (i.e., VQGAN, a text encoder, and a denoising UNet), totally comprising 1.7 billion parameters, in which 0.5 billion parameters are dedicated to temporal capabilities. The model demonstrates superior performance over state-of-the-art methods across three evaluation metrics. The code and an online demo are available at \url{https://modelscope.cn/models/damo/text-to-video-synthesis/summary}. This paper introduces ModelScopeT2V, the first open-source text-to-video synthesis model based on diffusion models. Creating a publicly available and effective text-to-video synthesis model can catalyze further research efforts and advancements in video generation. ModelScopeT2V incorporates a spatio-temporal block into the diffusion-based UNet architecture to model temporal dependencies and is trained on both image-text and video-text datasets using a multi-frame training strategy. ModelScopeT2V demonstrates superior performance over state-of-the-art methods on FID-vid and FVD metrics. ModelScopeT2V generates videos with diverse and dynamic motion. The code and online demos are publicly available, fostering community engagement and novel applications. The model could be further enhanced by incorporating multi-condition approaches or LoRA techniques. Future work could focus on generating longer videos with more semantic information. text-to-video synthesis, diffusion models, spatio-temporal modeling, multi-frame training, open-source
2308.06548 Report Revisiting Vision Transformer from the View of Path Ensemble Shuning Chang, Pichao Wang, Hao Luo, Fan Wang, Mike Zheng Shou Vision Transformers (ViTs) are normally regarded as a stack of transformer layers. In this work, we propose a novel view of ViTs showing that they can be seen as ensemble networks containing multiple parallel paths with different lengths. Specifically, we equivalently transform the traditional cascade of multi-head self-attention (MSA) and feed-forward network (FFN) into three parallel paths in each transformer layer. Then, we utilize the identity connection in our new transformer form and further transform the ViT into an explicit multi-path ensemble network. From the new perspective, these paths perform two functions: the first is to provide the feature for the classifier directly, and the second is to provide the lower-level feature representation for subsequent longer paths. We investigate the influence of each path for the final prediction and discover that some paths even pull down the performance. Therefore, we propose the path pruning and EnsembleScale skills for improvement, which cut out the underperforming paths and re-weight the ensemble components, respectively, to optimize the path combination and make the short paths focus on providing high-quality representation for subsequent paths. We also demonstrate that our path combination strategies can help ViTs go deeper and act as high-pass filters to filter out partial low-frequency signals. To further enhance the representation of paths served for subsequent paths, self-distillation is applied to transfer knowledge from the long paths to the short paths. This work calls for more future research to explain and design ViTs from new perspectives. This paper presents a novel perspective on Vision Transformers (ViTs), demonstrating that they can be interpreted as ensemble networks comprising multiple parallel paths of varying lengths. This ensemble view provides a new framework for understanding and optimizing ViTs by manipulating the contributions of individual paths. The authors mathematically decouple the traditional cascade of Multi-Head Self-Attention (MHSA) and Feed-Forward Network (FFN) layers into parallel paths, leveraging identity connections to construct an explicit ensemble network. The analysis reveals that short paths often contribute minimally to the final prediction accuracy and may even hinder performance. Path pruning, eliminating underperforming short paths, and EnsembleScale, re-weighting path contributions, are introduced to optimize path combination, leading to improved accuracy. A self-distillation method is proposed to transfer knowledge from longer to shorter paths, further enhancing representation learning and boosting overall performance. The study focuses on image classification tasks, leaving the applicability of the ensemble view to other vision tasks for future investigation. While the ensemble view provides a new perspective, exploring alternative path manipulation techniques beyond pruning and scaling could yield further insights. vision transformers, ensemble networks, path pruning, ensemblescale, self-distillation
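A toy sketch of the ensemble view at the classifier: features contributed by paths of different lengths are combined with learnable per-path weights (the EnsembleScale idea), and path pruning simply corresponds to dropping entries from the path list. The feature dimension and class count are placeholders.

```python
# Illustrative EnsembleScale-style combination of path features.
import torch
import torch.nn as nn

class EnsembleScaleHead(nn.Module):
    def __init__(self, num_paths, dim=384, num_classes=1000):
        super().__init__()
        self.scales = nn.Parameter(torch.ones(num_paths))  # per-path weights
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, paths):                  # list of (B, dim) path features
        stacked = torch.stack(paths, dim=0)    # (P, B, dim)
        weighted = (self.scales.view(-1, 1, 1) * stacked).sum(dim=0)
        return self.classifier(weighted)

head = EnsembleScaleHead(num_paths=4)
logits = head([torch.randn(2, 384) for _ in range(4)])
```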
2308.06531 Report SegPrompt: Boosting Open-world Segmentation via Category-level Prompt Learning Muzhi Zhu, Hengtao Li, Hao Chen, Chengxiang Fan, Weian Mao, Chenchen Jing, Yifan Liu, Chunhua Shen Current closed-set instance segmentation models rely on pre-defined class labels for each mask during training and evaluation, largely limiting their ability to detect novel objects. Open-world instance segmentation (OWIS) models address this challenge by detecting unknown objects in a class-agnostic manner. However, previous OWIS approaches completely erase category information during training to keep the model's ability to generalize to unknown objects. In this work, we propose a novel training mechanism termed SegPrompt that uses category information to improve the model's class-agnostic segmentation ability for both known and unknown categories. In addition, the previous OWIS training setting exposes the unknown classes to the training set and brings information leakage, which is unreasonable in the real world. Therefore, we provide a new open-world benchmark closer to a real-world scenario by dividing the dataset classes into known-seen-unseen parts. For the first time, we focus on the model's ability to discover objects that never appear in the training set images. Experiments show that SegPrompt can improve the overall and unseen detection performance by 5.6% and 6.1% in AR on our new benchmark without affecting the inference efficiency. We further demonstrate the effectiveness of our method on existing cross-dataset transfer and strongly supervised settings, leading to 5.5% and 12.3% relative improvement. Proposes SegPrompt, a category-level prompt learning method for open-world segmentation that boosts the segmentation performance on unseen categories by leveraging the knowledge from seen classes. Addresses the limitations of current open-world segmentation methods that struggle to generalize to novel categories not present in the training data. Introduces a new benchmark, LVIS-OW, to evaluate open-world segmentation by dividing categories into known, seen, and unseen sets. Employs category-level prompt learning to transfer knowledge from seen categories to unseen ones during training. Demonstrates the effectiveness of category-level prompt learning in improving segmentation performance on unseen categories. Establishes a new benchmark, LVIS-OW, for evaluating open-world segmentation with a focus on unseen categories. Highlights the importance of considering semantic overlap between seen and unseen categories in open-world segmentation. Limited evaluation of SegPrompt on other query-based models besides Mask2former. Reliance on the availability of a sufficient number of seen categories for effective knowledge transfer. open-world segmentation, prompt learning, unseen object segmentation, long-tailed recognition, lvis-ow benchmark
2308.06412 Report Taming Self-Training for Open-Vocabulary Object Detection Shiyu Zhao, Samuel Schulter, Long Zhao, Zhixing Zhang, Vijay Kumar B. G, Yumin Suh, Manmohan Chandraker, Dimitris N. Metaxas Recent studies have shown promising performance in open-vocabulary object detection (OVD) by utilizing pseudo labels (PLs) from pretrained vision and language models (VLMs). However, teacher-student self-training, a powerful and widely used paradigm to leverage PLs, is rarely explored for OVD. This work identifies two challenges of using self-training in OVD: noisy PLs from VLMs and frequent distribution changes of PLs. To address these challenges, we propose SAS-Det that tames self-training for OVD from two key perspectives. First, we present a split-and-fusion (SAF) head that splits a standard detection into an open-branch and a closed-branch. This design can reduce noisy supervision from pseudo boxes. Moreover, the two branches learn complementary knowledge from different training data, significantly enhancing performance when fused together. Second, in our view, unlike in closed-set tasks, the PL distributions in OVD are solely determined by the teacher model. We introduce a periodic update strategy to decrease the number of updates to the teacher, thereby decreasing the frequency of changes in PL distributions, which stabilizes the training process. Extensive experiments demonstrate SAS-Det is both efficient and effective. SAS-Det outperforms recent models of the same scale by a clear margin and achieves 37.4 AP50 and 29.1 APr on novel categories of the COCO and LVIS benchmarks, respectively. Code is available at \url{https://github.com/xiaofeng94/SAS-Det}. This paper proposes SAS-Det, a novel open-vocabulary object detection method leveraging self-training with a split-and-fusion head and periodic teacher updates to address challenges of noisy pseudo labels and distribution shifts from pretrained vision and language models. Open-Vocabulary Detection (OVD) aims to detect objects from novel categories without specific training examples, demanding efficient utilization of pseudo labels from pretrained Vision and Language Models (VLMs). This work tackles the challenges of noisy pseudo labels and distribution shifts in self-training for OVD, crucial for accurate and robust detection in open-world scenarios. SAS-Det introduces a split-and-fusion head dividing detection into open and closed branches to mitigate noisy supervision from pseudo boxes. It also employs a periodic teacher update strategy to stabilize training by reducing the frequency of pseudo label distribution changes. SAS-Det achieves state-of-the-art performance on COCO and LVIS OVD benchmarks. Ablation studies demonstrate the effectiveness of the split-and-fusion head and periodic updates. The method shows promising efficiency in pseudo labeling compared to prior art. Self-training with a teacher model increases GPU memory consumption. Online pseudo labeling, although faster than previous methods, still adds overhead to the training process. open-vocabulary object detection, self-training, pseudo labels, vision and language models, split-and-fusion head
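The periodic-update idea can be sketched as a plain self-training loop in which the frozen teacher that produces pseudo labels is synchronized with the student only every fixed number of iterations, rather than every step. The models, batches, and training step below are placeholders.

```python
# Sketch of self-training with a periodically updated teacher.
import copy
import torch
import torch.nn as nn

def train_with_periodic_teacher(student, batches, train_step, period=1000):
    teacher = copy.deepcopy(student).eval()
    for p in teacher.parameters():
        p.requires_grad_(False)
    for it, batch in enumerate(batches):
        with torch.no_grad():
            pseudo_labels = teacher(batch)            # placeholder pseudo-labeling
        train_step(student, batch, pseudo_labels)     # GT + pseudo-box losses
        if (it + 1) % period == 0:                    # infrequent teacher refresh
            teacher.load_state_dict(student.state_dict())

# Toy usage with stand-in modules and a no-op training step.
student = nn.Linear(16, 4)
train_with_periodic_teacher(student, [torch.randn(2, 16) for _ in range(5)],
                            lambda s, b, pl: None, period=2)
```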
2308.06248 Report FunnyBirds: A Synthetic Vision Dataset for a Part-Based Analysis of Explainable AI Methods Robin Hesse, Simone Schaub-Meyer, Stefan Roth The field of explainable artificial intelligence (XAI) aims to uncover the inner workings of complex deep neural models. While being crucial for safety-critical domains, XAI inherently lacks ground-truth explanations, making its automatic evaluation an unsolved problem. We address this challenge by proposing a novel synthetic vision dataset, named FunnyBirds, and accompanying automatic evaluation protocols. Our dataset allows performing semantically meaningful image interventions, e.g., removing individual object parts, which has three important implications. First, it enables analyzing explanations on a part level, which is closer to human comprehension than existing methods that evaluate on a pixel level. Second, by comparing the model output for inputs with removed parts, we can estimate ground-truth part importances that should be reflected in the explanations. Third, by mapping individual explanations into a common space of part importances, we can analyze a variety of different explanation types in a single common framework. Using our tools, we report results for 24 different combinations of neural models and XAI methods, demonstrating the strengths and weaknesses of the assessed methods in a fully automatic and systematic manner. This paper introduces "FunnyBirds," a synthetic vision dataset specifically designed for the quantitative evaluation and analysis of explainable AI (XAI) methods. Evaluating XAI methods is challenging due to the lack of ground-truth explanations. Existing automatic evaluation methods often rely on pixel-level interventions, which are not aligned with human perception and can introduce domain shifts. The authors create a synthetic dataset of bird images with controllable features (beak, wings, feet, eyes, tail). They propose a multi-dimensional analysis framework (FunnyBirds framework) with six evaluation protocols covering completeness, correctness, and contrastivity of explanations. They also showcase custom analyses for deeper insights into specific XAI methods. Methods relying on simpler model structures like BagNet achieve higher explainability scores. Integrated Gradients performs best among model-agnostic methods across different backbones. The study reveals weaknesses in the ability of assessed XAI methods to reliably communicate the relative importance of input features, particularly in terms of correctness. The synthetic nature of the dataset might not fully represent real-world image complexities. The framework currently focuses on a subset of explainability dimensions, omitting aspects like compactness and confidence. explainable ai, xai evaluation, synthetic datasets, computer vision, deep learning
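The part-level importance estimate can be illustrated as follows: render the scene with each part removed in turn and measure the drop in the target-class logit relative to the full image. The rendering function and classifier here are toy stand-ins for the dataset's part-removal interventions and the model under analysis.

```python
# Sketch of estimating ground-truth part importances via part removal.
import torch

@torch.no_grad()
def part_importances(model, render_fn, class_idx, parts):
    """render_fn(removed_part) returns a (1, 3, H, W) image; removed_part=None
    keeps all parts. Importance of a part = drop in the target-class logit."""
    full_logit = model(render_fn(None))[0, class_idx]
    return {p: (full_logit - model(render_fn(p))[0, class_idx]).item()
            for p in parts}

toy_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 50))
scores = part_importances(toy_model, lambda part: torch.rand(1, 3, 64, 64),
                          class_idx=7, parts=["beak", "wings", "feet", "eyes", "tail"])
```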
2308.06160 Report DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models Weijia Wu, Yuzhong Zhao, Hao Chen, Yuchao Gu, Rui Zhao, Yefei He, Hong Zhou, Mike Zheng Shou, Chunhua Shen Current deep networks are very data-hungry and benefit from training on large-scale datasets, which are often time-consuming to collect and annotate. By contrast, synthetic data can be generated infinitely using generative models such as DALL-E and diffusion models, with minimal effort and cost. In this paper, we present DatasetDM, a generic dataset generation model that can produce diverse synthetic images and the corresponding high-quality perception annotations (e.g., segmentation masks, and depth). Our method builds upon the pre-trained diffusion model and extends text-guided image synthesis to perception data generation. We show that the rich latent code of the diffusion model can be effectively decoded as accurate perception annotations using a decoder module. Training the decoder only needs less than 1% (around 100 images) manually labeled images, enabling the generation of an infinitely large annotated dataset. Then these synthetic data can be used for training various perception models for downstream tasks. To showcase the power of the proposed approach, we generate datasets with rich dense pixel-wise labels for a wide range of downstream tasks, including semantic segmentation, instance segmentation, and depth estimation. Notably, it achieves 1) state-of-the-art results on semantic segmentation and instance segmentation; 2) significantly more robust on domain generalization than using the real data alone; and state-of-the-art results in zero-shot segmentation setting; and 3) flexibility for efficient application and novel task composition (e.g., image editing). The project website and code can be found at https://weijiawu.github.io/DatasetDM_page/ and https://github.com/showlab/DatasetDM, respectively. Presents DatasetDM, a text-to-data generation model that leverages pre-trained diffusion models to produce synthetic images with diverse perception annotations (e.g., segmentation masks, depth) using minimal manually labeled data. Addresses the data-hungry nature of deep learning models for perception tasks by enabling the generation of infinitely large annotated datasets with minimal cost and effort. Trains a unified perception decoder (P-Decoder) on a small set of real images paired with their latent representations extracted from a pre-trained diffusion model using diffusion inversion. Employs GPT-4 to enhance prompt diversity and guide data generation. Achieves state-of-the-art results on semantic and instance segmentation tasks. Exhibits significantly improved robustness in domain generalization compared to using real data alone. Offers flexibility for novel task composition, such as image editing. The quality and complexity of synthesized data are limited by the capabilities of the base diffusion model. Further improvements in prompt generation efficiency and domain-specific prompt design are possible. synthetic data generation, text-to-image synthesis, perception tasks, diffusion models, domain generalization
2308.06101 Report Taming the Power of Diffusion Models for High-Quality Virtual Try-On with Appearance Flow Junhong Gou, Siyu Sun, Jianfu Zhang, Jianlou Si, Chen Qian, Liqing Zhang Virtual try-on is a critical image synthesis task that aims to transfer clothes from one image to another while preserving the details of both humans and clothes. While many existing methods rely on Generative Adversarial Networks (GANs) to achieve this, flaws can still occur, particularly at high resolutions. Recently, the diffusion model has emerged as a promising alternative for generating high-quality images in various applications. However, simply using clothes as a condition for guiding the diffusion model to inpaint is insufficient to maintain the details of the clothes. To overcome this challenge, we propose an exemplar-based inpainting approach that leverages a warping module to guide the diffusion model's generation effectively. The warping module performs initial processing on the clothes, which helps to preserve the local details of the clothes. We then combine the warped clothes with clothes-agnostic person image and add noise as the input of diffusion model. Additionally, the warped clothes is used as local conditions for each denoising process to ensure that the resulting output retains as much detail as possible. Our approach, namely Diffusion-based Conditional Inpainting for Virtual Try-ON (DCI-VTON), effectively utilizes the power of the diffusion model, and the incorporation of the warping module helps to produce high-quality and realistic virtual try-on results. Experimental results on VITON-HD demonstrate the effectiveness and superiority of our method. This paper presents DCI-VTON, a novel diffusion model-based virtual try-on framework that utilizes appearance flow for high-quality image synthesis. Existing GAN-based virtual try-on methods often struggle to maintain detail, particularly at high resolutions, highlighting the need for more robust approaches. Diffusion models offer an appealing alternative with superior generative capabilities. DCI-VTON consists of two main modules: 1) a warping module that predicts an appearance flow field to align clothes to the target person's pose, generating a coarse composite image. 2) a refinement module that leverages a diffusion model to refine the initial result using warped clothes as local conditions during denoising. DCI-VTON outperforms previous state-of-the-art virtual try-on methods on standard benchmarks like VITON-HD across various resolutions. The inclusion of a warping module proves beneficial, particularly in challenging scenarios involving significant pose changes. Ablation studies demonstrate the complementary nature of global, local, and initial conditions in guiding the diffusion model's generation process. The model currently focuses on trying on upper-body garments, leaving the extension to full-body outfits for future exploration. While DCI-VTON effectively handles various clothes styles, addressing highly intricate designs or extreme poses remains an area for improvement. virtual try-on, diffusion models, appearance flow, high-resolution image synthesis, conditional image generation
2308.06097 Report RIGID: Recurrent GAN Inversion and Editing of Real Face Videos Yangyang Xu, Shengfeng He, Kwan-Yee K. Wong, Ping Luo GAN inversion is indispensable for applying the powerful editability of GAN to real images. However, existing methods invert video frames individually often leading to undesired inconsistent results over time. In this paper, we propose a unified recurrent framework, named Recurrent vIdeo GAN Inversion and eDiting (RIGID), to explicitly and simultaneously enforce temporally coherent GAN inversion and facial editing of real videos. Our approach models the temporal relations between current and previous frames from three aspects. To enable a faithful real video reconstruction, we first maximize the inversion fidelity and consistency by learning a temporal compensated latent code. Second, we observe incoherent noises lie in the high-frequency domain that can be disentangled from the latent space. Third, to remove the inconsistency after attribute manipulation, we propose an in-between frame composition constraint such that the arbitrary frame must be a direct composite of its neighboring frames. Our unified framework learns the inherent coherence between input frames in an end-to-end manner, and therefore it is agnostic to a specific attribute and can be applied to arbitrary editing of the same video without re-training. Extensive experiments demonstrate that RIGID outperforms state-of-the-art methods qualitatively and quantitatively in both inversion and editing tasks. The deliverables can be found in https://cnnlstm.github.io/RIGID Proposes RIGID, a recurrent framework for temporally coherent GAN inversion and facial editing of real videos. Existing methods struggle to maintain temporal consistency when inverting and editing videos using GANs, leading to unrealistic and disjointed results. A recurrent encoder learns temporal compensated latent codes and disentangles high-frequency artifacts for coherent inversion. A novel in-between frame composition constraint enforces smoothness in edited videos. Achieves comparable or better results in video inversion quality and temporal coherence than optimization-based methods (e.g., STIT) with significantly faster inference times. Enables attribute-agnostic editing, allowing various edits on a single video without retraining. Outperforms competitors in maintaining temporal coherence and identity preservation during video editing, as evidenced by quantitative metrics and visual comparisons. Limited editing capability for hair portions outside the cropped face region. Higher GPU memory requirements during training compared to some alternatives. gan inversion, video editing, temporal coherence, recurrent neural networks, generative adversarial networks
2308.06093 Report Experts Weights Averaging: A New General Training Scheme for Vision Transformers Yongqi Huang, Peng Ye, Xiaoshui Huang, Sheng Li, Tao Chen, Tong He, Wanli Ouyang Structural re-parameterization is a general training scheme for Convolutional Neural Networks (CNNs), which achieves performance improvement without increasing inference cost. As Vision Transformers (ViTs) are gradually surpassing CNNs in various visual tasks, one may question: if a training scheme specifically for ViTs exists that can also achieve performance improvement without increasing inference cost? Recently, Mixture-of-Experts (MoE) has attracted increasing attention, as it can efficiently scale up the capacity of Transformers at a fixed cost through sparsely activated experts. Considering that MoE can also be viewed as a multi-branch structure, can we utilize MoE to implement a ViT training scheme similar to structural re-parameterization? In this paper, we affirmatively answer these questions, with a new general training strategy for ViTs. Specifically, we decouple the training and inference phases of ViTs. During training, we replace some Feed-Forward Networks (FFNs) of the ViT with specially designed, more efficient MoEs that assign tokens to experts by random uniform partition, and perform Experts Weights Averaging (EWA) on these MoEs at the end of each iteration. After training, we convert each MoE into an FFN by averaging the experts, transforming the model back into original ViT for inference. We further provide a theoretical analysis to show why and how it works. Comprehensive experiments across various 2D and 3D visual tasks, ViT architectures, and datasets validate the effectiveness and generalizability of the proposed training scheme. Besides, our training scheme can also be applied to improve performance when fine-tuning ViTs. Lastly, but equally important, the proposed EWA technique can significantly improve the effectiveness of naive MoE in various 2D visual small datasets and 3D visual tasks. This paper introduces Experts Weights Averaging (EWA), a novel training scheme for Vision Transformers (ViTs) that improves performance without increasing inference cost. Existing methods for improving model performance, such as structural re-parameterization and Mixture-of-Experts (MoE), are either limited to CNNs or introduce significant computational overhead during inference. EWA aims to overcome these limitations. EWA decouples training and inference phases: During training, it replaces some ViT feed-forward networks (FFNs) with specially designed, more efficient MoEs using random uniform partition. It then averages expert weights after each training iteration. During inference, each MoE is converted back into a single FFN by averaging its expert weights. EWA training consistently improves the performance of various ViT architectures on diverse 2D and 3D visual tasks and datasets. EWA fine-tuning further enhances the performance of pre-trained ViT models. Experts Weights Averaging significantly improves the effectiveness of naive MoE, particularly on small 2D visual datasets and 3D visual tasks where naive MoE struggles. The optimal share rate for Experts Weights Averaging needs to be determined for each ViT architecture. The paper primarily focuses on image classification and semantic segmentation tasks. Exploring EWA's applicability to other vision tasks like object detection and instance segmentation is a potential avenue for future research. vision transformer, mixture-of-experts, structural re-parameterization, weight averaging, deep learning
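A hedged sketch of the two averaging steps summarized above: the experts of one MoE layer are partially averaged during training and fully averaged into a single FFN for inference. The share rate `alpha`, the exact averaging rule, and the toy expert modules are assumptions; only the overall idea follows the summary.

```python
# Sketch of (1) a per-iteration partial averaging of experts and
# (2) collapsing all experts into one FFN for inference.
import copy
import torch
import torch.nn as nn

def ewa_step(experts, alpha=0.5):
    """Pull every expert part-way toward the mean of all experts (training time)."""
    with torch.no_grad():
        param_groups = [list(e.parameters()) for e in experts]
        for params in zip(*param_groups):
            mean = torch.stack([p.data for p in params]).mean(dim=0)
            for p in params:
                p.data.mul_(1.0 - alpha).add_(mean, alpha=alpha)

def collapse_to_ffn(experts):
    """Average all experts into one FFN, so inference cost equals a plain ViT."""
    ffn = copy.deepcopy(experts[0])
    with torch.no_grad():
        for i, p in enumerate(ffn.parameters()):
            stacked = torch.stack([list(e.parameters())[i].data for e in experts])
            p.data.copy_(stacked.mean(dim=0))
    return ffn

# Toy usage: four "experts", averaged during training, collapsed afterwards.
experts = [nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64)) for _ in range(4)]
ewa_step(experts, alpha=0.5)        # would run at the end of every training iteration
ffn = collapse_to_ffn(experts)      # plain FFN used by the ViT at inference
```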
2308.06038 Report Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning Chun-Mei Feng, Kai Yu, Yong Liu, Salman Khan, Wangmeng Zuo Benefiting from prompt tuning, recent years have witnessed the promising performance of pre-trained vision-language models, e.g., CLIP, on versatile downstream tasks. In this paper, we focus on a particular setting of learning adaptive prompts on the fly for each test sample from an unseen new domain, which is known as test-time prompt tuning (TPT). Existing TPT methods typically rely on data augmentation and confidence selection. However, conventional data augmentation techniques, e.g., random resized crops, suffer from the lack of data diversity, while entropy-based confidence selection alone is not sufficient to guarantee prediction fidelity. To address these issues, we propose a novel TPT method, named DiffTPT, which leverages pre-trained diffusion models to generate diverse and informative new data. Specifically, we incorporate augmented data by both conventional method and pre-trained stable diffusion to exploit their respective merits, improving the model's ability to adapt to unknown new test data. Moreover, to ensure the prediction fidelity of generated data, we introduce a cosine similarity-based filtration technique to select the generated data with higher similarity to the single test sample. Our experiments on test datasets with distribution shifts and unseen categories demonstrate that DiffTPT improves the zero-shot accuracy by an average of 5.13% compared to the state-of-the-art TPT method. Our code and models will be publicly released. This paper proposes DiffTPT, a novel test-time prompt tuning method that leverages pre-trained diffusion models for diverse data augmentation, enhancing the performance of pre-trained vision-language models on unseen domains. Existing test-time prompt tuning methods suffer from limited data diversity and insufficient prediction fidelity when adapting to new domains. DiffTPT employs Stable Diffusion to generate diverse augmented images from test samples and introduces a cosine similarity-based filtration to remove spurious augmentations, balancing data diversity and prediction fidelity. DiffTPT achieves state-of-the-art zero-shot accuracy, outperforming existing test-time prompt tuning methods by an average of 5.13%. Combining DiffTPT with few-shot prompt tuning methods further improves both in-domain and out-of-distribution performance. Ablation studies confirm the effectiveness of diffusion-based augmentation, cosine similarity filtration, and the impact of augmented data size and prompt updating steps. The inference speed of using diffusion models for data augmentation can be further improved. Exploring other filtration techniques or combining multiple metrics to further enhance prediction fidelity. test-time prompt tuning, diffusion models, data augmentation, zero-shot learning, vision-language models
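The cosine similarity-based filtration can be sketched in a few lines: generated augmentations are kept only if their features are close enough to the test image's feature. The feature extractor is a stand-in for a CLIP-style image encoder, and the threshold value is an assumption.

```python
# Minimal sketch of similarity-based filtration of diffusion-generated
# augmentations for a single test sample.
import torch
import torch.nn.functional as F

def filter_augmentations(test_feat, aug_feats, threshold=0.8):
    """Return indices of augmentations to keep (cosine similarity >= threshold)."""
    test_feat = F.normalize(test_feat, dim=-1)            # (D,)
    aug_feats = F.normalize(aug_feats, dim=-1)            # (N, D)
    sims = aug_feats @ test_feat                           # (N,) cosine similarities
    return torch.nonzero(sims >= threshold).flatten()

# Toy usage with random features standing in for CLIP embeddings.
test_feat = torch.randn(512)
aug_feats = torch.randn(32, 512)                           # 32 diffusion-generated views
keep = filter_augmentations(test_feat, aug_feats, threshold=0.1)
print(f"kept {keep.numel()} / {aug_feats.size(0)} augmented views")
```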
2308.06027 Report Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation Yuki Endo Text-to-image synthesis has achieved high-quality results with recent advances in diffusion models. However, text input alone has high spatial ambiguity and limited user controllability. Most existing methods allow spatial control through additional visual guidance (e.g., sketches and semantic masks) but require additional training with annotated images. In this paper, we propose a method for spatially controlling text-to-image generation without further training of diffusion models. Our method is based on the insight that the cross-attention maps reflect the positional relationship between words and pixels. Our aim is to control the attention maps according to given semantic masks and text prompts. To this end, we first explore a simple approach of directly swapping the cross-attention maps with constant maps computed from the semantic regions. Some prior works also allow training-free spatial control of text-to-image diffusion models by directly manipulating cross-attention maps. However, these approaches still suffer from misalignment to given masks because manipulated attention maps are far from actual ones learned by diffusion models. To address this issue, we propose masked-attention guidance, which can generate images more faithful to semantic masks via indirect control of attention to each word and pixel by manipulating noise images fed to diffusion models. Masked-attention guidance can be easily integrated into pre-trained off-the-shelf diffusion models (e.g., Stable Diffusion) and applied to the tasks of text-guided image editing. Experiments show that our method enables more accurate spatial control than baselines qualitatively and quantitatively. This paper proposes "masked-attention guidance," a training-free method to spatially control text-to-image generation in diffusion models using visual guidance like semantic masks. Text-to-image generation lacks controllability due to the spatial ambiguity of text descriptions. Existing methods for spatial control often require costly additional training. The method leverages cross-attention maps in diffusion models, which reflect word-pixel relationships. It manipulates noise maps fed to the model to indirectly guide the cross-attention towards user-specified regions in the semantic mask. Quantitative evaluation on COCO dataset shows significant improvement in mask alignment (mIoU) compared to training-free baselines. Qualitative results demonstrate better alignment with semantic masks and ability to handle diverse and challenging inputs. Analysis of cross-attention maps shows that the method effectively guides attention estimation in the diffusion model. The method struggles with small or detailed regions in the semantic mask. Strong guidance can sometimes lead to unnatural image generation. Future work includes handling other visual guidance types like scribbles and addressing limitations in handling small regions. text-to-image generation, diffusion models, spatial control, semantic masks, cross-attention
2308.06015 Report Enhancing Generalization of Universal Adversarial Perturbation through Gradient Aggregation Xuannan Liu, Yaoyao Zhong, Yuhang Zhang, Lixiong Qin, Weihong Deng Deep neural networks are vulnerable to universal adversarial perturbation (UAP), an instance-agnostic perturbation capable of fooling the target model for most samples. Compared to instance-specific adversarial examples, UAP is more challenging as it needs to generalize across various samples and models. In this paper, we examine the serious dilemma of UAP generation methods from a generalization perspective -- the gradient vanishing problem using small-batch stochastic gradient optimization and the local optima problem using large-batch optimization. To address these problems, we propose a simple and effective method called Stochastic Gradient Aggregation (SGA), which alleviates the gradient vanishing and escapes from poor local optima at the same time. Specifically, SGA employs the small-batch training to perform multiple iterations of inner pre-search. Then, all the inner gradients are aggregated as a one-step gradient estimation to enhance the gradient stability and reduce quantization errors. Extensive experiments on the standard ImageNet dataset demonstrate that our method significantly enhances the generalization ability of UAP and outperforms other state-of-the-art methods. The code is available at https://github.com/liuxuannan/Stochastic-Gradient-Aggregation. This paper proposes Stochastic Gradient Aggregation (SGA), a novel method to enhance the generalization ability of Universal Adversarial Perturbations (UAPs). Existing UAP generation methods suffer from either gradient vanishing with small-batch training or sub-optimal generalization with large-batch training. This paper addresses this dilemma to improve UAP's generalization across diverse samples and models. SGA employs inner-outer iterations. It conducts pre-search with multiple inner iterations using small-batch samples. Then, it aggregates all inner gradients to update UAP with a one-step gradient estimation, enhancing gradient stability and reducing quantization errors. SGA outperforms state-of-the-art methods in the white-box setting, achieving a higher fooling ratio across five tested models. SGA also significantly improves the fooling ratio in the black-box setting, demonstrating better cross-model generalization ability. SGA maintains superior performance in limit-sample settings, effectively crafting UAPs with only 500 training samples. The paper primarily focuses on the ImageNet dataset, limiting the generalizability of the findings. Future work could explore the effectiveness of SGA in conjunction with other advanced gradient optimization techniques. universal adversarial perturbation, generalization, gradient vanishing, quantization error, stochastic gradient aggregation
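A minimal sketch of the inner/outer structure described above: several small-batch inner iterations pre-search around the current UAP, and their gradients are aggregated into one outer update. The model, loss, step sizes, and perturbation budget below are toy assumptions, not the paper's configuration.

```python
# Sketch of one outer update of a universal adversarial perturbation from
# aggregated small-batch inner gradients.
import torch
import torch.nn as nn
import torch.nn.functional as F

def sga_outer_step(model, uap, batches, labels,
                   inner_lr=1e-3, outer_lr=1e-2, eps=10 / 255):
    """One outer UAP update from aggregated small-batch inner gradients."""
    delta = uap.clone()
    grads = []
    for x, y in zip(batches, labels):                   # inner pre-search iterations
        delta = delta.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x + delta), y)     # maximize loss -> fool the model
        grad = torch.autograd.grad(loss, delta)[0]
        grads.append(grad)
        delta = (delta + inner_lr * grad.sign()).clamp(-eps, eps)
    agg = torch.stack(grads).mean(dim=0)                # one-step gradient estimate
    return (uap + outer_lr * agg.sign()).clamp(-eps, eps)

# Toy usage on a random classifier and fake small batches.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
uap = torch.zeros(3, 32, 32)
batches = [torch.rand(4, 3, 32, 32) for _ in range(8)]
labels = [torch.randint(0, 10, (4,)) for _ in range(8)]
uap = sga_outer_step(model, uap, batches, labels)
```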
2308.05739 Report Zero Grads: Learning Local Surrogate Losses for Non-Differentiable Graphics Michael Fischer, Tobias Ritschel Gradient-based optimization is now ubiquitous across graphics, but unfortunately can not be applied to problems with undefined or zero gradients. To circumvent this issue, the loss function can be manually replaced by a "surrogate" that has similar minima but is differentiable. Our proposed framework, ZeroGrads, automates this process by learning a neural approximation of the objective function, which in turn can be used to differentiate through arbitrary black-box graphics pipelines. We train the surrogate on an actively smoothed version of the objective and encourage locality, focusing the surrogate's capacity on what matters at the current training episode. The fitting is performed online, alongside the parameter optimization, and self-supervised, without pre-computed data or pre-trained models. As sampling the objective is expensive (it requires a full rendering or simulator run), we devise an efficient sampling scheme that allows for tractable run-times and competitive performance at little overhead. We demonstrate optimizing diverse non-convex, non-differentiable black-box problems in graphics, such as visibility in rendering, discrete parameter spaces in procedural modelling or optimal control in physics-driven animation. In contrast to other derivative-free algorithms, our approach scales well to higher dimensions, which we demonstrate on problems with up to 35k interlinked variables. This paper introduces a novel optimization method that replaces gradient calculations with a learned surrogate model, allowing for gradient-free optimization in various computer graphics applications. The method addresses the limitations of traditional gradient-based optimization techniques in scenarios where gradients are unavailable or computationally expensive, expanding the scope of optimizable problems in computer graphics. The method trains a neural network to approximate the local behavior of the objective function around a current parameter set. This surrogate model then provides gradient estimates for optimization, avoiding the need for explicit gradient calculations. The method demonstrates comparable performance to gradient-based methods on low-dimensional tasks. It scales effectively to high-dimensional problems involving tens of thousands of parameters, surpassing traditional gradient-free algorithms. The approach proves successful across various computer graphics applications, including texture optimization, mesh reconstruction, and caustic rendering. The method's performance might be sensitive to the choice of hyperparameters, such as the spread of the locality kernel and the number of samples used for gradient estimation. Further investigation is needed to explore the generalization capabilities of the learned surrogate models across different problem instances. gradient-free optimization, surrogate modeling, computer graphics, high-dimensional optimization, differentiable rendering
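The surrogate-gradient loop can be sketched as: sample the black-box objective locally around the current parameters, fit a small network online, and step along the surrogate's gradient. The black-box function, network size, and sampling spread below are illustrative assumptions.

```python
# Sketch of gradient-free optimization through a locally fitted neural surrogate.
import torch
import torch.nn as nn

def blackbox(theta):
    """Stand-in for a non-differentiable renderer/simulator objective."""
    return ((theta - 2.0) ** 2).sum().detach()

dim = 8
theta = torch.zeros(dim)
surrogate = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))
surr_opt = torch.optim.Adam(surrogate.parameters(), lr=1e-2)

for step in range(200):
    # 1) Locally sample the objective around the current parameters.
    samples = theta + 0.1 * torch.randn(16, dim)              # locality via a small spread
    values = torch.stack([blackbox(s) for s in samples]).unsqueeze(1)
    # 2) Fit the surrogate online on these (sample, value) pairs.
    for _ in range(5):
        surr_opt.zero_grad()
        fit_loss = ((surrogate(samples) - values) ** 2).mean()
        fit_loss.backward()
        surr_opt.step()
    # 3) Use the surrogate's gradient in place of the missing true gradient.
    theta = theta.detach().requires_grad_(True)
    surrogate(theta.unsqueeze(0)).sum().backward()
    theta = (theta - 0.05 * theta.grad).detach()

print(blackbox(theta))   # should approach 0 as theta approaches 2
```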
2308.05733 Report FrozenRecon: Pose-free 3D Scene Reconstruction with Frozen Depth Models Guangkai Xu, Wei Yin, Hao Chen, Chunhua Shen, Kai Cheng, Feng Zhao 3D scene reconstruction is a long-standing vision task. Existing approaches can be categorized into geometry-based and learning-based methods. The former leverages multi-view geometry but can face catastrophic failures due to the reliance on accurate pixel correspondence across views. The latter was proffered to mitigate these issues by learning 2D or 3D representation directly. However, without a large-scale video or 3D training data, it can hardly generalize to diverse real-world scenarios due to the presence of tens of millions or even billions of optimization parameters in the deep network. Recently, robust monocular depth estimation models trained with large-scale datasets have been proven to possess weak 3D geometry prior, but they are insufficient for reconstruction due to the unknown camera parameters, the affine-invariant property, and inter-frame inconsistency. Here, we propose a novel test-time optimization approach that can transfer the robustness of affine-invariant depth models such as LeReS to challenging diverse scenes while ensuring inter-frame consistency, with only dozens of parameters to optimize per video frame. Specifically, our approach involves freezing the pre-trained affine-invariant depth model's depth predictions, rectifying them by optimizing the unknown scale-shift values with a geometric consistency alignment module, and employing the resulting scale-consistent depth maps to robustly obtain camera poses and achieve dense scene reconstruction, even in low-texture regions. Experiments show that our method achieves state-of-the-art cross-dataset reconstruction on five zero-shot testing datasets. This paper presents FrozenRecon, a novel test-time optimization approach for robust and efficient 3D scene reconstruction from monocular videos. It leverages the robustness of pre-trained affine-invariant depth models while ensuring inter-frame consistency. Existing 3D reconstruction methods, whether geometry-based or learning-based, often face limitations such as reliance on accurate pixel correspondence, need for large-scale training data, or susceptibility to low-texture regions. FrozenRecon addresses these limitations by efficiently transferring the robustness of pre-trained depth models to diverse real-world scenes. FrozenRecon freezes a pre-trained affine-invariant depth model and optimizes a sparse set of parameters (scale, shift, weight factors) per frame to achieve scale-consistent depth maps. It jointly optimizes camera poses and intrinsic parameters alongside depth, guided by photometric and geometric consistency constraints. FrozenRecon achieves state-of-the-art cross-dataset reconstruction performance on five unseen datasets, outperforming previous methods in terms of accuracy and robustness. The method is efficient, requiring optimization of only dozens of parameters per frame, unlike learning-based methods with millions of parameters. FrozenRecon demonstrates strong generalization ability, effectively reconstructing diverse scenes without requiring offline-acquired camera parameters. The method assumes a pinhole camera model, which may limit its accuracy in scenarios with significant lens distortion. Future work could explore incorporating semantic information or multi-scale features to further enhance reconstruction quality, especially in challenging low-texture or dynamic environments. 3d scene reconstruction, monocular video, affine-invariant depth, test-time optimization, geometric consistency
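A hedged sketch of the per-frame rectification: the frozen depth predictions are adjusted only by learnable scale and shift values, optimized under a consistency objective. The simple adjacent-frame consistency term below stands in for the paper's full photometric/geometric alignment with optimized poses and intrinsics.

```python
# Sketch of rectifying frozen affine-invariant depth with two scalars per frame.
import torch

frozen_depths = [torch.rand(1, 64, 64) for _ in range(10)]   # affine-invariant predictions
scales = torch.ones(len(frozen_depths), requires_grad=True)
shifts = torch.zeros(len(frozen_depths), requires_grad=True)
optimizer = torch.optim.Adam([scales, shifts], lr=1e-2)

def rectified(i):
    """Scale-consistent depth for frame i from the frozen prediction + 2 scalars."""
    return scales[i] * frozen_depths[i] + shifts[i]

for step in range(100):
    loss = torch.zeros(())
    for i in range(len(frozen_depths) - 1):
        # Placeholder consistency: adjacent rectified depths should agree
        # (the real method warps via optimized poses/intrinsics first).
        loss = loss + (rectified(i) - rectified(i + 1)).abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```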
2308.05695 Report Masked Diffusion as Self-supervised Representation Learner Zixuan Pan, Jianxu Chen, Yiyu Shi Denoising diffusion probabilistic models have recently demonstrated state-of-the-art generative performance and have been used as strong pixel-level representation learners. This paper decomposes the interrelation between the generative capability and representation learning ability inherent in diffusion models. We present the masked diffusion model (MDM), a scalable self-supervised representation learner for semantic segmentation, substituting the conventional additive Gaussian noise of traditional diffusion with a masking mechanism. Our proposed approach convincingly surpasses prior benchmarks, demonstrating remarkable advancements in both medical and natural image semantic segmentation tasks, particularly in few-shot scenarios. This paper introduces the Masked Diffusion Model (MDM), a new self-supervised representation learning approach for semantic segmentation that replaces the additive Gaussian noise of traditional diffusion models with a masking mechanism. This work addresses the limitations of denoising diffusion probabilistic models (DDPM) for representation learning by decoupling generative capability from representation learning ability and proposing a more efficient alternative. MDM utilizes a masking strategy guided by a sampled timestep, reconstructing the original image from a masked version. It also leverages the Structural Similarity Index (SSIM) loss to enhance structural information preservation during reconstruction, improving semantic representation quality. MDM outperforms DDPM and other state-of-the-art methods on both medical (GlaS, MoNuSeg) and natural image (FFHQ-34, CelebA-19) segmentation tasks. It excels in few-shot scenarios, achieving comparable results to full label settings with significantly fewer labels. Ablation studies demonstrate the effectiveness of the masking strategy, SSIM loss, and the importance of selecting appropriate diffusion timesteps and UNet decoder blocks. The current implementation focuses on U-Net architecture and evaluation on a limited set of datasets. Exploring other architectures and datasets is crucial. While SSIM loss shows promise, investigating alternative optimization objectives tailored for specific tasks and datasets could lead to further improvements. self-supervised learning, semantic segmentation, diffusion models, representation learning, few-shot learning
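The masking mechanism can be sketched as follows: a sampled timestep controls the fraction of masked patches, and the network reconstructs the original image (the paper additionally uses an SSIM term; only an L2 reconstruction is shown here). Patch size, schedule, and the tiny network are assumptions.

```python
# Sketch of timestep-controlled patch masking with image reconstruction.
import torch
import torch.nn as nn

def mask_patches(images, t, T=1000, patch=8):
    """Zero out a fraction of patches that grows with the sampled timestep t."""
    b, c, h, w = images.shape
    n_patches = (h // patch) * (w // patch)
    n_masked = int(n_patches * t / T)
    out = images.clone()
    for i in range(b):
        idx = torch.randperm(n_patches)[:n_masked]
        for j in idx.tolist():
            r, col = divmod(j, w // patch)
            out[i, :, r * patch:(r + 1) * patch, col * patch:(col + 1) * patch] = 0
    return out

model = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.Conv2d(32, 3, 3, padding=1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

images = torch.rand(4, 3, 64, 64)
t = int(torch.randint(1, 1000, (1,)))            # sampled timestep controls the masking ratio
masked = mask_patches(images, t)
loss = ((model(masked) - images) ** 2).mean()    # reconstruct the unmasked image
optimizer.zero_grad(); loss.backward(); optimizer.step()
```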
2308.05659 Report AD-CLIP: Adapting Domains in Prompt Space Using CLIP Mainak Singha, Harsh Pal, Ankit Jha, Biplab Banerjee Although deep learning models have shown impressive performance on supervised learning tasks, they often struggle to generalize well when the training (source) and test (target) domains differ. Unsupervised domain adaptation (DA) has emerged as a popular solution to this problem. However, current DA techniques rely on visual backbones, which may lack semantic richness. Despite the potential of large-scale vision-language foundation models like CLIP, their effectiveness for DA has yet to be fully explored. To address this gap, we introduce AD-CLIP, a domain-agnostic prompt learning strategy for CLIP that aims to solve the DA problem in the prompt space. We leverage the frozen vision backbone of CLIP to extract both image style (domain) and content information, which we apply to learn prompt tokens. Our prompts are designed to be domain-invariant and class-generalizable, by conditioning prompt learning on image style and content features simultaneously. We use standard supervised contrastive learning in the source domain, while proposing an entropy minimization strategy to align domains in the embedding space given the target domain data. We also consider a scenario where only target domain samples are available during testing, without any source domain data, and propose a cross-domain style mapping network to hallucinate domain-agnostic tokens. Our extensive experiments on three benchmark DA datasets demonstrate the effectiveness of AD-CLIP compared to existing literature. This paper proposes AD-CLIP, a novel domain adaptation framework leveraging prompt learning with the CLIP model. AD-CLIP learns domain-invariant and class-generic prompt tokens using visual features extracted from CLIP's vision encoder, aiming to improve cross-domain generalization. Existing domain adaptation techniques rely heavily on visual backbones, which may lack semantic richness and lead to sub-optimal performance. This work explores the potential of large-scale vision-language models like CLIP for improved domain adaptation. AD-CLIP utilizes the frozen vision and text backbones of CLIP. It introduces learnable style and content projectors to enable prompt learning from visual information of different layers from the CLIP vision encoder. The framework learns three types of prompt tokens: domain tokens, image tokens, and class tokens. Additionally, it employs distribution divergence loss and entropy minimization loss for domain alignment. AD-CLIP achieves state-of-the-art performance on three benchmark domain adaptation datasets: Office-Home, VisDA-2017, and Mini-DomainNet. The method demonstrates the effectiveness of learning domain-invariant and class-generic prompt tokens for improving cross-domain generalization. Ablation studies validate the importance of incorporating multi-scale visual features, the domain-agnostic token, and the proposed loss functions for optimal performance. While AD-CLIP demonstrates strong overall performance, it exhibits limitations on certain classes of the VisDA-2017 dataset, indicating scope for further improvement. Future work will focus on extending AD-CLIP to specific applications like person re-identification and medical imaging, where domain adaptation is crucial. domain adaptation, prompt learning, clip, vision-language models, unsupervised learning
2308.05128 Report High-Level Parallelism and Nested Features for Dynamic Inference Cost and Top-Down Attention André Peter Kelm, Niels Hannemann, Bruno Heberle, Lucas Schmidt, Tim Rolff, Christian Wilms, Ehsan Yaghoubi, Simone Frintrop This paper introduces a novel network topology that seamlessly integrates dynamic inference cost with a top-down attention mechanism, addressing two significant gaps in traditional deep learning models. Drawing inspiration from human perception, we combine sequential processing of generic low-level features with parallelism and nesting of high-level features. This design not only reflects a finding from recent neuroscience research regarding spatially and contextually distinct neural activations in human cortex, but also introduces a novel "cutout" technique: the ability to selectively activate only network segments of task-relevant categories to optimize inference cost and eliminate the need for re-training. We believe this paves the way for future network designs that are lightweight and adaptable, making them suitable for a wide range of applications, from compact edge devices to large-scale clouds. Our proposed topology also comes with a built-in top-down attention mechanism, which allows processing to be directly influenced by either enhancing or inhibiting category-specific high-level features, drawing parallels to the selective attention mechanism observed in human cognition. Using targeted external signals, we experimentally enhanced predictions across all tested models. In terms of dynamic inference cost, our methodology can achieve an exclusion of up to 73.48% of parameters and 84.41% fewer giga-multiply-accumulate (GMAC) operations; analysis against comparative baselines shows an average reduction of 40% in parameters and 8% in GMACs across the cases we evaluated. This paper introduces a novel network topology called SeqPar (Sequential-Parallel) for deep learning models, combining sequential processing for low-level features with parallel processing for high-level features to enable dynamic inference cost reduction and top-down attention. Current deep learning models lack the ability to dynamically adjust inference costs or incorporate top-down attention, limiting their efficiency and adaptability in tasks where high-level knowledge is available. The proposed SeqPar structure separates high-level features into parallel branches, allowing for category-specific feature extraction. This enables a novel "cutout" technique where only relevant branches are activated during inference based on prior knowledge, reducing computation. The structure also inherently allows for top-down attention by amplifying or inhibiting specific branches. The SeqPar structure with cutouts achieves up to 73.48% reduction in parameters and 84.41% fewer GMACs, with an average reduction of 40% in parameters and 8% in GMACs compared to baselines. The built-in top-down attention mechanism, tested by amplifying target category features, consistently improves classification accuracy across various datasets. Nested SeqPar structures (NHL), grouping categories by similarity, further reduce parameter count and improve accuracy compared to conventional ResNet models on ImageNet100. The performance of NHL on ImageNet with 1000 categories is limited, potentially due to the small number of training images per category. The optimal split point for transitioning from sequential to parallel processing requires further investigation. deep learning, dynamic inference, top-down attention, network topology, image classification
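A hedged sketch of the "cutout" idea: a shared sequential stem feeds parallel category-specific branches, and at inference only the branches of task-relevant categories are evaluated. Branch granularity (one branch per class here) and the tiny modules are illustrative assumptions, not the paper's architecture.

```python
# Sketch of a sequential stem with parallel per-category branches and a
# "cutout" that skips branches of irrelevant categories at inference.
import torch
import torch.nn as nn

class SeqParNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Shared sequential stem for generic low-level features.
        self.stem = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(4), nn.Flatten())
        # One parallel high-level branch per category (assumed granularity).
        self.branches = nn.ModuleList(nn.Linear(16 * 4 * 4, 1) for _ in range(num_classes))

    def forward(self, x, active=None):
        feats = self.stem(x)                               # shared low-level processing
        active = range(len(self.branches)) if active is None else active
        scores = torch.full((x.size(0), len(self.branches)), float("-inf"))
        for i in active:                                   # "cutout": inactive branches are skipped
            scores[:, i] = self.branches[i](feats).squeeze(1)
        return scores

net = SeqParNet()
x = torch.rand(2, 3, 32, 32)
full = net(x)                          # evaluate all category branches
subset = net(x, active=[1, 4, 7])      # prior knowledge: only three relevant categories
```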
2308.05095 Report LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation Leigang Qu, Shengqiong Wu, Hao Fei, Liqiang Nie, Tat-Seng Chua In the text-to-image generation field, recent remarkable progress in Stable Diffusion makes it possible to generate rich kinds of novel photorealistic images. However, current models still face misalignment issues (e.g., problematic spatial relation understanding and numeration failure) in complex natural scenes, which impedes the high-faithfulness text-to-image generation. Although recent efforts have been made to improve controllability by giving fine-grained guidance (e.g., sketch and scribbles), this issue has not been fundamentally tackled since users have to provide such guidance information manually. In this work, we strive to synthesize high-fidelity images that are semantically aligned with a given textual prompt without any guidance. Toward this end, we propose a coarse-to-fine paradigm to achieve layout planning and image generation. Concretely, we first generate the coarse-grained layout conditioned on a given textual prompt via in-context learning based on Large Language Models. Afterward, we propose a fine-grained object-interaction diffusion method to synthesize high-faithfulness images conditioned on the prompt and the automatically generated layout. Extensive experiments demonstrate that our proposed method outperforms the state-of-the-art models in terms of layout and image generation. Our code and settings are available at https://layoutllm-t2i.github.io. This paper proposes LayoutLLM-T2I, a novel approach for text-to-image generation that leverages the layout planning abilities of Large Language Models (LLMs) to enhance the faithfulness of synthesized images, particularly in complex scenes. Current text-to-image models struggle with complex scenes, exhibiting issues like spatial relation misunderstanding and numeration errors. This work aims to address this by incorporating layout planning into the generation process. The method employs a two-stage process: 1) **Text-to-Layout Induction:** An LLM is used to generate a coarse-grained layout from the text prompt, aided by a feedback-based sampler learning mechanism. 2) **Layout-guided Image Generation:** A layout-aware adapter is integrated into a pre-trained diffusion model, enabling relation-aware object interaction guided by the generated layout. LayoutLLM-T2I significantly outperforms existing methods in both layout generation and image synthesis, achieving state-of-the-art results on the COCO dataset. The feedback-based sampler learning mechanism is shown to effectively activate and improve the layout planning capabilities of LLMs. The relation-aware image generation module, incorporating object interactions, is crucial for enhancing the faithfulness of images, particularly in complex scenes. The performance of layout planning with LLMs is sensitive to the number of in-context examples, highlighting the need for further research on sample efficiency. The work focuses on layout planning as a single modality; future work could explore the integration of other modalities, such as depth information, to further enhance image faithfulness. text-to-image generation, diffusion model, large language model, layout planning, scene understanding
2308.04868 Report InstantAvatar: Efficient 3D Head Reconstruction via Surface Rendering Antonio Canela, Pol Caselles, Ibrar Malik, Eduard Ramon, Jaime García, Jordi Sánchez-Riera, Gil Triginer, Francesc Moreno-Noguer Recent advances in full-head reconstruction have been obtained by optimizing a neural field through differentiable surface or volume rendering to represent a single scene. While these techniques achieve an unprecedented accuracy, they take several minutes, or even hours, due to the expensive optimization process required. In this work, we introduce InstantAvatar, a method that recovers full-head avatars from few images (down to just one) in a few seconds on commodity hardware. In order to speed up the reconstruction process, we propose a system that combines, for the first time, a voxel-grid neural field representation with a surface renderer. Notably, a naive combination of these two techniques leads to unstable optimizations that do not converge to valid solutions. In order to overcome this limitation, we present a novel statistical model that learns a prior distribution over 3D head signed distance functions using a voxel-grid based architecture. The use of this prior model, in combination with other design choices, results into a system that achieves 3D head reconstructions with comparable accuracy as the state-of-the-art with a 100x speed-up. InstantAvatar: a method for fast full-head avatar reconstruction from a few images (down to one) using a novel statistical model that combines a voxel-grid neural field representation with a surface renderer. Existing full-head reconstruction techniques, while accurate, are slow (taking minutes or hours) due to expensive optimization processes. This work aims to achieve comparable accuracy at significantly faster speeds. The method leverages: (1) a multi-resolution grid-based neural field trained on a dataset of 3D head scans to represent a prior distribution of head SDFs, (2) differentiable surface rendering for optimization, (3) monocular normal predictions to guide and speed up convergence, and (4) a parallel ray-surface intersection algorithm inspired by volume rendering for efficiency. Achieves comparable accuracy to state-of-the-art methods like H3D-Net and SIRA. Reconstructs full-head avatars in seconds, a 100x speedup over neural-field-based alternatives. Successfully combines a grid-based architecture with surface rendering for fast and accurate 3D reconstruction. Memory concerns arise from the use of dense grids. The representation capacity is limited by the accuracy of the predicted normals and the prior architecture could be improved to allow grid features optimization. 3d reconstruction, neural fields, surface rendering, avatar generation, statistical shape models
2308.04832 Report TSSR: A Truncated and Signed Square Root Activation Function for Neural Networks Yuanhao Gong Activation functions are essential components of neural networks. In this paper, we introduce a new activation function called the Truncated and Signed Square Root (TSSR) function. This function is distinctive because it is odd, nonlinear, monotone and differentiable. Its gradient is continuous and always positive. Thanks to these properties, it has the potential to improve the numerical stability of neural networks. Several experiments confirm that the proposed TSSR has better performance than other state-of-the-art activation functions. The proposed function has significant implications for the development of neural network models and can be applied to a wide range of applications in fields such as computer vision, natural language processing, and speech recognition. Introduces a novel activation function called Truncated and Signed Square Root (TSSR) for improved neural network performance. Existing activation functions lack the ideal combination of mathematical properties for optimal numerical stability and performance. Analyzes desired properties of activation functions, proposes TSSR function, compares TSSR to existing functions on CIFAR-10/100 datasets using various network architectures. TSSR is odd, monotone, differentiable, has unbounded values, and a bounded continuous gradient. TSSR outperforms ReLU, Mish, and Serf in accuracy on CIFAR-10/100 benchmarks. TSSR shows promise for enhancing accuracy and efficiency in a variety of neural network applications. Current experiments are limited to CIFAR datasets and a subset of network architectures. Future work includes testing TSSR on larger datasets and a wider range of applications. activation function, tssr, neural network, deep learning, numerical stability
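The entry lists the function's properties but not its formula. The sketch below is one form reconstructed from those properties (identity inside [-1, 1], a signed square root outside, matched so that value and slope are continuous at the boundary); it should be treated as an assumption rather than the paper's exact definition.

```python
# Sketch of an activation with the stated properties: odd, monotone,
# differentiable, unbounded values, bounded continuous positive gradient.
import torch
import torch.nn as nn

class TSSR(nn.Module):
    def forward(self, x):
        inner = x.abs() <= 1
        # clamp(min=1) keeps the square root well behaved where the inner branch is used
        outer = torch.sign(x) * (2 * torch.sqrt(x.abs().clamp(min=1)) - 1)
        return torch.where(inner, x, outer)

x = torch.tensor([-4.0, -1.0, 0.0, 0.5, 1.0, 4.0], requires_grad=True)
y = TSSR()(x)
y.sum().backward()
print(y.detach())    # e.g. f(4) = 2*sqrt(4) - 1 = 3, f(-4) = -3
print(x.grad)        # gradient stays in (0, 1]
```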
2308.04830 Report VAST: Vivify Your Talking Avatar via Zero-Shot Expressive Facial Style Transfer Liyang Chen, Zhiyong Wu, Runnan Li, Weihong Bao, Jun Ling, Xu Tan, Sheng Zhao Current talking face generation methods mainly focus on speech-lip synchronization. However, insufficient investigation on the facial talking style leads to a lifeless and monotonous avatar. Most previous works fail to imitate expressive styles from arbitrary video prompts and ensure the authenticity of the generated video. This paper proposes an unsupervised variational style transfer model (VAST) to vivify the neutral photo-realistic avatars. Our model consists of three key components: a style encoder that extracts facial style representations from the given video prompts; a hybrid facial expression decoder to model accurate speech-related movements; a variational style enhancer that enhances the style space to be highly expressive and meaningful. With our essential designs on facial style learning, our model is able to flexibly capture the expressive facial style from arbitrary video prompts and transfer it onto a personalized image renderer in a zero-shot manner. Experimental results demonstrate the proposed approach contributes to a more vivid talking avatar with higher authenticity and richer expressiveness. Proposes VAST, a novel variational style transfer model, to generate vivid talking avatars by transferring expressive facial styles from arbitrary video prompts onto neutral avatars in a zero-shot manner. Existing talking face generation methods lack expressiveness and struggle to imitate natural styles from arbitrary videos, limiting the creation of engaging and realistic avatars. VAST leverages a style encoder, variational style enhancer, and hybrid decoder. The style encoder extracts style representation from video prompts. The enhancer enriches style space using normalizing flow. The hybrid decoder predicts speech-related and weakly-related expressions separately to ensure authenticity. VAST outperforms state-of-the-art methods in generating high-fidelity videos with accurate lip synchronization. The variational style enhancer significantly improves the expressiveness of generated avatars. VAST exhibits strong performance in transferring various facial styles, as demonstrated by subjective and objective evaluations. The image renderer's performance is limited by the training data, leading to artifacts when transferring highly exaggerated styles. Exploring alternative rendering techniques and training on more diverse data could further enhance the quality of generated avatars. talking face generation, facial style transfer, variational autoencoder, normalizing flow, zero-shot learning
2308.04829 Report MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation Kaixin Cai, Pengzhen Ren, Yi Zhu, Hang Xu, Jianzhuang Liu, Changlin Li, Guangrun Wang, Xiaodan Liang Recently, semantic segmentation models trained with image-level text supervision have shown promising results in challenging open-world scenarios. However, these models still face difficulties in learning fine-grained semantic alignment at the pixel level and predicting accurate object masks. To address this issue, we propose MixReorg, a novel and straightforward pre-training paradigm for semantic segmentation that enhances a model's ability to reorganize patches mixed across images, exploring both local visual relevance and global semantic coherence. Our approach involves generating fine-grained patch-text pairs data by mixing image patches while preserving the correspondence between patches and text. The model is then trained to minimize the segmentation loss of the mixed images and the two contrastive losses of the original and restored features. With MixReorg as a mask learner, conventional text-supervised semantic segmentation models can achieve highly generalizable pixel-semantic alignment ability, which is crucial for open-world segmentation. After training with large-scale image-text data, MixReorg models can be applied directly to segment visual objects of arbitrary categories, without the need for further fine-tuning. Our proposed framework demonstrates strong performance on popular zero-shot semantic segmentation benchmarks, outperforming GroupViT by significant margins of 5.0%, 6.2%, 2.5%, and 3.4% mIoU on PASCAL VOC2012, PASCAL Context, MS COCO, and ADE20K, respectively. This paper proposes MixReorg, a novel pre-training paradigm for open-world semantic segmentation that leverages mixed patch reorganization to enhance a model's ability to learn fine-grained semantic alignment from image-text pairs. Existing text-supervised semantic segmentation models struggle to learn fine-grained semantic alignment at the pixel level, limiting their performance in challenging open-world scenarios. This work addresses this limitation by improving cross-modal alignment using a novel pre-training approach. MixReorg generates fine-grained patch-text pairs by mixing image patches while preserving their textual correspondence. The model is then trained to reconstruct the original images and predict segmentation masks from the mixed images, using both segmentation and contrastive losses. The process involves three stages: contextual mixing, progressive mixing, and mixing restoration. MixReorg outperforms previous state-of-the-art methods on open-world semantic segmentation benchmarks, including PASCAL VOC2012, PASCAL Context, MS COCO, and ADE20K. The method effectively learns fine-grained semantic alignment, as demonstrated by its ability to accurately segment mixed images. MixReorg shows significant improvements in handling complex segmentation examples and segmenting stuff classes compared to previous methods. The contextual mixing stage increases computational cost during training. The constructed patch-text pairs, while fine-grained, are not yet at the pixel level, leaving room for further improvement. semantic segmentation, open-world learning, vision-language pre-training, cross-modal alignment, mixed image modeling
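The patch-mixing step can be sketched as shuffling patches across a batch while recording which source image (and hence which caption) each patch came from, so the fine-grained patch-text correspondence survives mixing. Patch size and batch layout are assumptions.

```python
# Sketch of cross-image patch mixing with a patch-to-caption index map.
import torch

def mix_patches(images, patch=16):
    """Return mixed images plus a per-patch map of source-image indices."""
    b, c, h, w = images.shape
    gh, gw = h // patch, w // patch
    # (B, C, gh, gw, patch, patch) -> flatten the patch grid per image
    patches = images.unfold(2, patch, patch).unfold(3, patch, patch)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, gh * gw, c, patch, patch)
    src = torch.arange(b).unsqueeze(1).expand(b, gh * gw)        # source image of each patch
    perm = torch.randperm(b * gh * gw)                            # shuffle patches across images
    patches = patches.reshape(b * gh * gw, c, patch, patch)[perm].reshape(b, gh * gw, c, patch, patch)
    src = src.reshape(-1)[perm].reshape(b, gh * gw)               # patch -> caption index map
    mixed = patches.reshape(b, gh, gw, c, patch, patch).permute(0, 3, 1, 4, 2, 5)
    mixed = mixed.reshape(b, c, h, w)
    return mixed, src

images = torch.rand(4, 3, 224, 224)          # each image has its own caption
mixed, patch_to_caption = mix_patches(images)
print(mixed.shape, patch_to_caption.shape)   # (4, 3, 224, 224), (4, 196)
```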
2308.04826 Report WaveNeRF: Wavelet-based Generalizable Neural Radiance Fields Muyu Xu, Fangneng Zhan, Jiahui Zhang, Yingchen Yu, Xiaoqin Zhang, Christian Theobalt, Ling Shao, Shijian Lu Neural Radiance Field (NeRF) has shown impressive performance in novel view synthesis via implicit scene representation. However, it usually suffers from poor scalability as requiring densely sampled images for each new scene. Several studies have attempted to mitigate this problem by integrating Multi-View Stereo (MVS) technique into NeRF while they still entail a cumbersome fine-tuning process for new scenes. Notably, the rendering quality will drop severely without this fine-tuning process and the errors mainly appear around the high-frequency features. In the light of this observation, we design WaveNeRF, which integrates wavelet frequency decomposition into MVS and NeRF to achieve generalizable yet high-quality synthesis without any per-scene optimization. To preserve high-frequency information when generating 3D feature volumes, WaveNeRF builds Multi-View Stereo in the Wavelet domain by integrating the discrete wavelet transform into the classical cascade MVS, which disentangles high-frequency information explicitly. With that, disentangled frequency features can be injected into classic NeRF via a novel hybrid neural renderer to yield faithful high-frequency details, and an intuitive frequency-guided sampling strategy can be designed to suppress artifacts around high-frequency regions. Extensive experiments over three widely studied benchmarks show that WaveNeRF achieves superior generalizable radiance field modeling when only given three images as input. This paper proposes WaveNeRF, a novel generalizable Neural Radiance Field (NeRF) model that leverages wavelet frequency decomposition within a multi-view stereo (MVS) framework to achieve high-quality novel view synthesis without per-scene optimization. Existing generalizable NeRF methods often suffer from performance degradation and artifacts when per-scene fine-tuning is not employed, particularly around high-frequency image regions. This work addresses this limitation by explicitly incorporating high-frequency information during training. The proposed WaveNeRF introduces a Wavelet Multi-view Stereo (WMVS) module to extract both spatial and frequency domain features. It also employs a Frequency-guided Sampling Strategy (FSS) to concentrate sampling near object surfaces. These components are integrated into a Hybrid Neural Renderer (HNR) that combines spatial and frequency information for enhanced rendering. WaveNeRF outperforms existing generalizable NeRF methods on DTU, NeRF Synthetic, and LLFF datasets, demonstrating superior performance with only three input views. The proposed Frequency-guided Sampling Strategy (FSS) effectively increases the density of samples around object surfaces, leading to improved detail rendering. Evaluation using the HFIV metric confirms that WaveNeRF effectively reconstructs high-frequency details compared to previous methods. The model's performance with a larger number of input views is limited by GPU memory constraints. The reliance on MVS techniques may lead to artifacts in regions with inaccurate stereo reconstruction. neural radiance fields, novel view synthesis, multi-view stereo, wavelet transform, frequency decomposition
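The frequency disentanglement can be illustrated with a single-level Haar discrete wavelet transform that splits a feature map into a low-frequency approximation and three high-frequency detail sub-bands; this is the generic transform, not the paper's cascade-MVS integration.

```python
# Sketch of a single-level 2D Haar DWT on a feature map.
import torch

def haar_dwt2(x):
    """x: (B, C, H, W) with even H, W -> (LL, and three detail bands), each (B, C, H/2, W/2)."""
    a = x[:, :, 0::2, 0::2]   # top-left of each 2x2 block
    b = x[:, :, 0::2, 1::2]   # top-right
    c = x[:, :, 1::2, 0::2]   # bottom-left
    d = x[:, :, 1::2, 1::2]   # bottom-right
    ll = (a + b + c + d) / 2  # low-frequency approximation
    lh = (a + b - c - d) / 2  # detail (high-frequency) sub-band
    hl = (a - b + c - d) / 2  # detail (high-frequency) sub-band
    hh = (a - b - c + d) / 2  # detail (high-frequency) sub-band
    return ll, lh, hl, hh

feat = torch.rand(1, 32, 64, 64)           # e.g. a feature map from an MVS encoder
ll, lh, hl, hh = haar_dwt2(feat)
print(ll.shape)                            # (1, 32, 32, 32); detail bands keep edge information
```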
2308.04758 Report Bird's-Eye-View Scene Graph for Vision-Language Navigation Rui Liu, Xiaohan Wang, Wenguan Wang, Yi Yang Vision-language navigation (VLN), which entails an agent to navigate 3D environments following human instructions, has shown great advances. However, current agents are built upon panoramic observations, which hinders their ability to perceive 3D scene geometry and easily leads to ambiguous selection of panoramic view. To address these limitations, we present a BEV Scene Graph (BSG), which leverages multi-step BEV representations to encode scene layouts and geometric cues of indoor environment under the supervision of 3D detection. During navigation, BSG builds a local BEV representation at each step and maintains a BEV-based global scene map, which stores and organizes all the online collected local BEV representations according to their topological relations. Based on BSG, the agent predicts a local BEV grid-level decision score and a global graph-level decision score, combined with a sub-view selection score on panoramic views, for more accurate action prediction. Our approach significantly outperforms state-of-the-art methods on REVERIE, R2R, and R4R, showing the potential of BEV perception in VLN. This paper presents BEV Scene Graph (BSG), a novel approach for vision-language navigation (VLN) that leverages Bird’s-Eye-View (BEV) representations to overcome limitations of panoramic-view based methods. Current VLN agents based on panoramic views struggle to perceive 3D scene geometry and suffer from ambiguity in action prediction due to multiple candidate nodes mapping to the same view. BEV perception offers a solution by encoding scene layouts and geometric cues effectively. BSG builds local BEV representations at each navigation step and maintains a global scene map connecting them topologically. It leverages 3D detection on multi-step BEV representations to encode object-level information. Finally, it predicts actions based on a fused score from local BEV grid-level and global graph-level decision scores. BSG significantly outperforms state-of-the-art methods on REVERIE, R2R, and R4R benchmarks. On REVERIE, BSG surpasses the previous best model by 5.14% on Success Rate and 3.21% on Remote Grounding Success on the val unseen split. Ablation studies validate the contribution of individual components like BEV updating, neighborhood size for node embeddings, and the importance of 3D object detection. The model is trained in static environments, limiting its applicability to dynamic real-world scenarios with moving objects. Future work could explore the integration of more advanced BEV frameworks and address the challenges of amodal perception for enhanced scene understanding. vision-language navigation, bird's-eye-view, 3d object detection, scene graph, embodied ai
2308.04657 Report Which Tokens to Use? Investigating Token Reduction in Vision Transformers Joakim Bruslund Haurum, Sergio Escalera, Graham W. Taylor, Thomas B. Moeslund Since the introduction of the Vision Transformer (ViT), researchers have sought to make ViTs more efficient by removing redundant information in the processed tokens. While different methods have been explored to achieve this goal, we still lack understanding of the resulting reduction patterns and how those patterns differ across token reduction methods and datasets. To close this gap, we set out to understand the reduction patterns of 10 different token reduction methods using four image classification datasets. By systematically comparing these methods on the different classification tasks, we find that the Top-K pruning method is a surprisingly strong baseline. Through in-depth analysis of the different methods, we determine that: the reduction patterns are generally not consistent when varying the capacity of the backbone model, the reduction patterns of pruning-based methods significantly differ from fixed radial patterns, and the reduction patterns of pruning-based methods are correlated across classification datasets. Finally we report that the similarity of reduction patterns is a moderate-to-strong proxy for model performance. Project page at https://vap.aau.dk/tokens. This paper conducts a systematic comparison and analysis of 10 state-of-the-art token reduction methods for Vision Transformers (ViTs) across four image classification datasets. Token reduction methods aim to improve ViT efficiency by removing redundant tokens. However, a lack of understanding exists regarding how these methods differ and their reduction patterns across datasets. The authors implemented 10 token reduction methods, including pruning and merging based approaches, and evaluated their performance on ImageNet, NABirds, COCO, and NUS-WIDE. They analyzed the consistency of reduction patterns when varying keep rate, backbone capacity, and datasets. Top-K pruning and its extension, EViT, consistently perform well across all datasets and backbone capacities. Reduction patterns of pruning-based methods are not consistent when varying backbone capacity but are consistent when changing the keep rate. Reduction patterns of pruning-based methods are highly correlated across datasets, suggesting common token usage patterns despite dataset differences. The analysis is limited to the image classification task and using an ImageNet pre-trained backbone. Efficiency evaluation of the methods is not included in the study. vision transformer, token reduction, pruning, merging, efficiency
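Since Top-K pruning is highlighted as a surprisingly strong baseline, a minimal sketch of it is given below: patch tokens are scored (here by CLS-token attention, a common but assumed scoring choice) and only the top-k survive to the next block. All names are placeholders rather than the paper's code.

```python
# Hypothetical sketch of Top-K token pruning for a ViT block.
import torch

def topk_prune(tokens: torch.Tensor, cls_attn: torch.Tensor, keep_rate: float) -> torch.Tensor:
    """tokens: (B, N, D) patch tokens (CLS excluded); cls_attn: (B, N) attention from CLS."""
    k = max(1, int(tokens.shape[1] * keep_rate))
    idx = cls_attn.topk(k, dim=1).indices                     # (B, k) indices of kept tokens
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])  # (B, k, D)
    return torch.gather(tokens, 1, idx)                       # pruned token set

B, N, D = 2, 196, 768
kept = topk_prune(torch.randn(B, N, D), torch.rand(B, N), keep_rate=0.5)
print(kept.shape)  # torch.Size([2, 98, 768])
```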
2308.04603 Report A Brief Yet In-Depth Survey of Deep Learning-Based Image Watermarking Xin Zhong, Arjon Das, Fahad Alrasheedi, Abdullah Tanvir This paper presents a comprehensive survey on deep learning-based image watermarking, a technique that entails the invisible embedding and extraction of watermarks within a cover image, aiming to offer a seamless blend of robustness and adaptability. We navigate the complex landscape of this interdisciplinary domain, linking historical foundations, current innovations, and prospective developments. Unlike existing literature, our study concentrates exclusively on image watermarking with deep learning, delivering an in-depth, yet brief analysis enriched by three fundamental contributions. First, we introduce a refined categorization, segmenting the field into Embedder-Extractor, Deep Networks as a Feature Transformation, and Hybrid Methods. This taxonomy, inspired by the varied roles of deep learning across studies, is designed to infuse clarity, offering readers technical insights and directional guidance. Second, our exploration dives into representative methodologies, encapsulating the diverse research directions and inherent challenges within each category to provide a consolidated perspective. Lastly, we venture beyond established boundaries to outline emerging frontiers, offering a detailed insight into prospective research avenues. This paper presents a comprehensive survey of deep learning-based image watermarking techniques, aiming to provide a consolidated perspective on historical foundations, current innovations, and prospective developments. The integration of deep learning in image watermarking is crucial due to its potential for enhanced robustness, adaptability, and the ability to learn and adapt to evolving threats. The paper categorizes deep learning-based image watermarking into three types: (1) Embedder-Extractor Joint Training, (2) Deep Networks as a Feature Transformation, and (3) Hybrid Methods. The methodologies, challenges, and representative solutions within each category are then analyzed. Joint training of embedder and extractor networks has proven effective, leading to numerous variations and innovations. Using deep networks for feature transformation, particularly with pre-trained models, offers promising results in zero watermarking. Hybrid methods, combining traditional watermarking calculations with deep learning, leverage the strengths of both approaches for enhanced efficiency. Current research primarily focuses on static images, neglecting the dynamism of video content and real-time applications. There's a need for standardized evaluation metrics and benchmark datasets to enable accurate comparison and evaluation of different deep learning-based image watermarking techniques. deep learning, image watermarking, embedder-extractor, feature transformation, hybrid methods
2308.04553 Report From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Bias Maan Qraitem, Kate Saenko, Bryan A. Plummer Visual recognition models are prone to learning spurious correlations induced by a biased training set where certain conditions B (e.g., Indoors) are over-represented in certain classes Y (e.g., Big Dogs). Synthetic data from generative models offers a promising direction to mitigate this issue by augmenting underrepresented conditions in the real dataset. However, this introduces another potential source of bias from generative model artifacts in the synthetic data. Indeed, as we will show, prior work uses synthetic data to resolve the model's bias toward B, but it doesn't correct the models' bias toward the pair (B, G) where G denotes whether the sample is real or synthetic. Thus, the model could simply learn signals based on the pair (B, G) (e.g., Synthetic Indoors) to make predictions about Y (e.g., Big Dogs). To address this issue, we propose a two-step training pipeline that we call From Fake to Real (FFR). The first step of FFR pre-trains a model on balanced synthetic data to learn robust representations across subgroups. In the second step, FFR fine-tunes the model on real data using ERM or common loss-based bias mitigation methods. By training on real and synthetic data separately, FFR avoids the issue of bias toward signals from the pair (B, G). In other words, synthetic data in the first step provides effective unbiased representations that boost performance in the second step. Indeed, our analysis of the high bias setting (99.9%) shows that FFR improves performance over the state-of-the-art by 7-14% over three datasets (CelebA, UTK-Face, and SpuCO Animals). This paper proposes 'From Fake to Real' (FFR), a two-step training pipeline using synthetic data to mitigate spurious correlations in visual recognition models. Existing methods for mitigating bias with synthetic data introduce new biases due to distributional differences between real and synthetic data. This work aims to address this limitation. FFR first pre-trains on balanced synthetic data to learn robust representations. Then, it fine-tunes on real data using ERM or common loss-based bias mitigation methods. FFR outperforms prior synthetic data augmentation methods, especially in high-bias settings. FFR is more data-efficient, achieving better results with less synthetic data. Qualitative analysis shows FFR focuses on relevant features and disregards spurious background features, unlike other methods. The use of pre-trained text-to-image models for synthetic data generation may introduce new biases. Evaluation datasets, while reflecting recent advancements, are smaller than those used in large-scale systems. bias mitigation, synthetic data augmentation, spurious correlations, visual recognition, generative models
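A minimal sketch of the two-step FFR recipe described above, assuming generic PyTorch data loaders and a torchvision backbone; the dummy tensors, epoch counts, and learning rates are placeholders, not the authors' settings.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models import resnet50

def train_epoch(model, loader, optimizer, criterion):
    model.train()
    for images, labels in loader:
        optimizer.zero_grad()
        criterion(model(images), labels).backward()
        optimizer.step()

# Dummy stand-ins for the group-balanced synthetic set and the (biased) real set.
synthetic_loader = DataLoader(TensorDataset(torch.randn(8, 3, 224, 224),
                                            torch.randint(0, 2, (8,))), batch_size=4)
real_loader = DataLoader(TensorDataset(torch.randn(8, 3, 224, 224),
                                       torch.randint(0, 2, (8,))), batch_size=4)

model = resnet50(num_classes=2)
criterion = nn.CrossEntropyLoss()

# Step 1: pre-train on balanced synthetic images to learn unbiased representations.
opt = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
for _ in range(2):
    train_epoch(model, synthetic_loader, opt, criterion)

# Step 2: fine-tune on real data with plain ERM (or any loss-based bias-mitigation objective).
opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
for _ in range(2):
    train_epoch(model, real_loader, opt, criterion)
```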
2308.04409 Report V-DETR: DETR with Vertex Relative Position Encoding for 3D Object Detection Yichao Shen, Zigang Geng, Yuhui Yuan, Yutong Lin, Ze Liu, Chunyu Wang, Han Hu, Nanning Zheng, Baining Guo We introduce a highly performant 3D object detector for point clouds using the DETR framework. The prior attempts all end up with suboptimal results because they fail to learn accurate inductive biases from the limited scale of training data. In particular, the queries often attend to points that are far away from the target objects, violating the locality principle in object detection. To address the limitation, we introduce a novel 3D Vertex Relative Position Encoding (3DV-RPE) method which computes position encoding for each point based on its relative position to the 3D boxes predicted by the queries in each decoder layer, thus providing clear information to guide the model to focus on points near the objects, in accordance with the principle of locality. In addition, we systematically improve the pipeline from various aspects such as data normalization based on our understanding of the task. We show exceptional results on the challenging ScanNetV2 benchmark, achieving significant improvements over the previous 3DETR in AP25/AP50 from 65.0%/47.0% to 77.8%/66.0%, respectively. In addition, our method sets a new record on ScanNetV2 and SUN RGB-D datasets. Code will be released at http://github.com/yichaoshen-MS/V-DETR. This paper introduces V-DETR, a novel 3D object detection method for point clouds using the DETR framework, enhanced with a 3D Vertex Relative Position Encoding (3DV-RPE) method for improved object localization. Previous DETR-based 3D object detectors struggled to learn accurate inductive biases from limited training data, leading to queries attending to irrelevant points far from target objects. This paper aims to address this limitation and improve 3D object detection accuracy. V-DETR introduces 3DV-RPE, which computes position encoding for each point based on its relative offset to the vertices of the predicted 3D boxes. It operates in a canonical object space to ensure consistent encoding regardless of object orientation. The approach also incorporates object-based normalization for box parameterization and leverages advancements in 2D DETR. V-DETR with 3DV-RPE significantly outperforms previous DETR-based methods and achieves state-of-the-art results on ScanNetV2 and SUN RGB-D datasets. 3DV-RPE effectively guides the model to focus on points near objects, improving localization accuracy, especially under higher IoU thresholds. The approach demonstrates advantages over voxel expansion methods by directly utilizing accurate 3D surface information. The method's performance with a high number of object queries and one-to-many matching can be sensitive to the number of input points. Future work includes extending the approach to outdoor 3D object detection and unifying architecture design for indoor and outdoor 3D detection tasks. 3d object detection, point cloud processing, detr, transformer, position encoding
2308.04352 Report 3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment Ziyu Zhu, Xiaojian Ma, Yixin Chen, Zhidong Deng, Siyuan Huang, Qing Li 3D vision-language grounding (3D-VL) is an emerging field that aims to connect the 3D physical world with natural language, which is crucial for achieving embodied intelligence. Current 3D-VL models rely heavily on sophisticated modules, auxiliary losses, and optimization tricks, which calls for a simple and unified model. In this paper, we propose 3D-VisTA, a pre-trained Transformer for 3D Vision and Text Alignment that can be easily adapted to various downstream tasks. 3D-VisTA simply utilizes self-attention layers for both single-modal modeling and multi-modal fusion without any sophisticated task-specific design. To further enhance its performance on 3D-VL tasks, we construct ScanScribe, the first large-scale 3D scene-text pairs dataset for 3D-VL pre-training. ScanScribe contains 2,995 RGB-D scans for 1,185 unique indoor scenes originating from ScanNet and 3R-Scan datasets, along with paired 278K scene descriptions generated from existing 3D-VL tasks, templates, and GPT-3. 3D-VisTA is pre-trained on ScanScribe via masked language/object modeling and scene-text matching. It achieves state-of-the-art results on various 3D-VL tasks, ranging from visual grounding and dense captioning to question answering and situated reasoning. Moreover, 3D-VisTA demonstrates superior data efficiency, obtaining strong performance even with limited annotations during downstream task fine-tuning. This paper proposes 3D-VisTA, a pre-trained Transformer for 3D vision and text alignment. Current 3D-VL models rely heavily on sophisticated modules, auxiliary losses, and optimization tricks, while 3D-VisTA provides a simple and unified approach. The authors construct ScanScribe, a large-scale 3D scene-text pairs dataset, and pre-train 3D-VisTA on it with masked language/object modeling and scene-text matching objectives. 3D-VisTA achieves state-of-the-art results on various 3D-VL tasks, including visual grounding, dense captioning, question answering, and situated reasoning. Pre-training on ScanScribe significantly improves the performance of 3D-VisTA. 3D-VisTA demonstrates superior data efficiency, obtaining strong results even with limited annotations. The data amount in ScanScribe is still insufficient for large-scale 3D-VL pre-training. 3D-VisTA currently uses an offline 3D object detection module, which may be a bottleneck for further improvement. 3d vision-language grounding, pre-trained transformer, self-supervised learning, large-scale dataset, 3d scene understanding
2308.04288 Report Cloth2Tex: A Customized Cloth Texture Generation Pipeline for 3D Virtual Try-On Daiheng Gao, Xu Chen, Xindi Zhang, Qi Wang, Ke Sun, Bang Zhang, Liefeng Bo, Qixing Huang Fabricating and designing 3D garments has become extremely demanding with the increasing need for synthesizing realistic dressed persons for a variety of applications, e.g. 3D virtual try-on, digitalization of 2D clothes into 3D apparel, and cloth animation. It thus necessitates a simple and straightforward pipeline to obtain high-quality texture from simple input, such as 2D reference images. Since traditional warping-based texture generation methods require a significant number of control points to be manually selected for each type of garment, which can be a time-consuming and tedious process. We propose a novel method, called Cloth2Tex, which eliminates the human burden in this process. Cloth2Tex is a self-supervised method that generates texture maps with reasonable layout and structural consistency. Another key feature of Cloth2Tex is that it can be used to support high-fidelity texture inpainting. This is done by combining Cloth2Tex with a prevailing latent diffusion model. We evaluate our approach both qualitatively and quantitatively and demonstrate that Cloth2Tex can generate high-quality texture maps and achieve the best visual effects in comparison to other methods. Project page: tomguluson92.github.io/projects/cloth2tex/ The paper proposes Cloth2Tex, a two-stage pipeline for converting 2D clothing images into 3D textured meshes, supporting a wider variety of clothing types than previous methods. This is important for applications like virtual try-on, 3D garment design, and digital fashion, where realistic 3D clothes are crucial. The method uses neural mesh rendering to obtain coarse textures and a diffusion model-based data simulation approach to train a texture refinement network. Cloth2Tex generates high-fidelity 3D textures with sharp details, outperforming previous state-of-the-art methods in terms of visual quality and quantitative metrics. The proposed method supports 10+ clothing categories, significantly more than previous work. User studies confirm that Cloth2Tex generates more realistic and consistent results compared to 2D and 3D baselines. Cloth2Tex faces challenges with clothes having complex patterns and maintaining uniformity for garments with densely assembled grids. Future work includes exploring methods for generating homogeneous meshes with uniformly-spaced triangles to address texture uniformity issues. 3d texture synthesis, virtual try-on, neural mesh rendering, latent diffusion models, texture inpainting
2308.04206 Report Exploring Transformers for Open-world Instance Segmentation Jiannan Wu, Yi Jiang, Bin Yan, Huchuan Lu, Zehuan Yuan, Ping Luo Open-world instance segmentation is a rising task, which aims to segment all objects in the image by learning from a limited number of base-category objects. This task is challenging, as the number of unseen categories could be hundreds of times larger than that of seen categories. Recently, the DETR-like models have been extensively studied in the closed world while stay unexplored in the open world. In this paper, we utilize the Transformer for open-world instance segmentation and present SWORD. Firstly, we introduce to attach the stop-gradient operation before classification head and further add IoU heads for discovering novel objects. We demonstrate that a simple stop-gradient operation not only prevents the novel objects from being suppressed as background, but also allows the network to enjoy the merit of heuristic label assignment. Secondly, we propose a novel contrastive learning framework to enlarge the representations between objects and background. Specifically, we maintain a universal object queue to obtain the object center, and dynamically select positive and negative samples from the object queries for contrastive learning. While the previous works only focus on pursuing average recall and neglect average precision, we show the prominence of SWORD by giving consideration to both criteria. Our models achieve state-of-the-art performance in various open-world cross-category and cross-dataset generalizations. Particularly, in VOC to non-VOC setup, our method sets new state-of-the-art results of 40.0% on ARb100 and 34.9% on ARm100. For COCO to UVO generalization, SWORD significantly outperforms the previous best open-world model by 5.9% on APm and 8.1% on ARm100. Presents SWORD, a Transformer-based framework for open-world instance segmentation, using a stop-gradient operation and contrastive learning to enhance novel object discovery. Addresses the limitations of existing open-world instance segmentation models that struggle to identify unseen objects in images. Utilizes a stop-gradient operation before the classification head and incorporates IoU heads to prevent novel object suppression and enable heuristic label assignment. Employs contrastive learning with a universal object queue to learn distinct representations between objects and background. Achieves state-of-the-art performance on cross-category generalization benchmarks like VOC to non-VOC and COCO to LVIS. Demonstrates significant improvements in cross-dataset generalization from COCO to UVO and COCO to Objects365. Shows that using pseudo ground-truths from SWORD further enhances performance, creating a strong model extension. Pseudo ground-truth training, while beneficial for recall, can negatively impact average precision due to potential noise. Further research is needed to address the balance between average precision and recall when incorporating pseudo labels. open-world instance segmentation, transformers, contrastive learning, stop-gradient operation, pseudo ground-truth training
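The stop-gradient trick described above can be sketched in a few lines: the classification head receives detached query features, so gradients that would suppress unlabeled novel objects as background never reach the shared representation, while an IoU head scores localization quality. The module names and dimensions below are assumptions for illustration, not the SWORD release.

```python
# Illustrative sketch of the stop-gradient + IoU-head design described for SWORD.
import torch
import torch.nn as nn

class SWORDHead(nn.Module):
    def __init__(self, dim=256, num_classes=80):
        super().__init__()
        self.cls_head = nn.Linear(dim, num_classes + 1)  # +1 for background
        self.iou_head = nn.Linear(dim, 1)                # objectness via IoU prediction
        self.box_head = nn.Linear(dim, 4)

    def forward(self, queries):                          # queries: (B, Q, dim)
        cls_logits = self.cls_head(queries.detach())     # stop-gradient before classification
        iou_scores = self.iou_head(queries).sigmoid()    # trained normally on matched queries
        boxes = self.box_head(queries).sigmoid()
        return cls_logits, iou_scores, boxes

out = SWORDHead()(torch.randn(2, 100, 256))
print([o.shape for o in out])
```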
2308.04079 Report 3D Gaussian Splatting for Real-Time Radiance Field Rendering Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis Radiance Field methods have recently revolutionized novel-view synthesis of scenes captured with multiple photos or videos. However, achieving high visual quality still requires neural networks that are costly to train and render, while recent faster methods inevitably trade off speed for quality. For unbounded and complete scenes (rather than isolated objects) and 1080p resolution rendering, no current method can achieve real-time display rates. We introduce three key elements that allow us to achieve state-of-the-art visual quality while maintaining competitive training times and importantly allow high-quality real-time (>= 30 fps) novel-view synthesis at 1080p resolution. First, starting from sparse points produced during camera calibration, we represent the scene with 3D Gaussians that preserve desirable properties of continuous volumetric radiance fields for scene optimization while avoiding unnecessary computation in empty space; Second, we perform interleaved optimization/density control of the 3D Gaussians, notably optimizing anisotropic covariance to achieve an accurate representation of the scene; Third, we develop a fast visibility-aware rendering algorithm that supports anisotropic splatting and both accelerates training and allows realtime rendering. We demonstrate state-of-the-art visual quality and real-time rendering on several established datasets. This paper introduces a novel method for real-time radiance field rendering using 3D Gaussian splatting, achieving state-of-the-art visual quality at real-time frame rates. Existing radiance field methods for novel view synthesis either compromise quality for speed or require long training times and are not capable of real-time rendering. The method represents the scene as 3D Gaussians, optimized via an interleaved optimization and density control process. It employs a fast, differentiable tile-based rasterizer for rendering, supporting anisotropic splatting and efficient gradient backpropagation. The method achieves state-of-the-art visual quality on par with Mip-NeRF360 while maintaining competitive training times. It enables real-time rendering (≥ 30 fps) at 1080p resolution for various scenes. The proposed 3D Gaussian representation is compact and efficiently captures complex geometry. Artifacts may appear in regions with sparse view coverage. The method's memory footprint, while lower than previous point-based methods, is larger compared to NeRF-based solutions. novel view synthesis, radiance fields, 3d gaussian splatting, real-time rendering, differentiable rendering
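As a toy illustration of the anisotropic splatting at the core of the method, the snippet below projects a 3D Gaussian's covariance into image space with the standard EWA-style local affine approximation (Sigma_2D = J W Sigma W^T J^T). It is a NumPy sketch of the math only; the paper implements this inside a tile-based CUDA rasterizer.

```python
# Toy projection of an anisotropic 3D Gaussian covariance to 2D image space.
import numpy as np

def project_covariance(Sigma3d, R_wc, t_wc, fx, fy, p_world):
    p_cam = R_wc @ p_world + t_wc                    # Gaussian center in camera coordinates
    x, y, z = p_cam
    # Jacobian of the perspective projection at p_cam (local affine approximation)
    J = np.array([[fx / z, 0.0, -fx * x / z**2],
                  [0.0, fy / z, -fy * y / z**2]])
    W = R_wc                                         # rotation part of the view transform
    return J @ W @ Sigma3d @ W.T @ J.T               # 2x2 image-space covariance

Sigma3d = np.diag([0.04, 0.01, 0.09])                # anisotropic Gaussian
Sigma2d = project_covariance(Sigma3d, np.eye(3), np.array([0.0, 0.0, 2.0]),
                             500.0, 500.0, np.array([0.1, -0.2, 1.0]))
print(Sigma2d)
```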
2308.03793 Report ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation Xuefeng Hu, Ke Zhang, Lu Xia, Albert Chen, Jiajia Luo, Yuyin Sun, Ken Wang, Nan Qiao, Xiao Zeng, Min Sun, Cheng-Hao Kuo, Ram Nevatia Large-scale Pre-Training Vision-Language Model such as CLIP has demonstrated outstanding performance in zero-shot classification, e.g. achieving 76.3% top-1 accuracy on ImageNet without seeing any example, which leads to potential benefits to many tasks that have no labeled data. However, while applying CLIP to a downstream target domain, the presence of visual and text domain gaps and cross-modality misalignment can greatly impact the model performance. To address such challenges, we propose ReCLIP, the first source-free domain adaptation method for vision-language models, which does not require any source data or target labeled data. ReCLIP first learns a projection space to mitigate the misaligned visual-text embeddings and learns pseudo labels, and then deploys cross-modality self-training with the pseudo labels, to update visual and text encoders, refine labels and reduce domain gaps and misalignments iteratively. With extensive experiments, we demonstrate ReCLIP reduces the average error rate of CLIP from 30.17% to 25.06% on 22 image classification benchmarks. Code available at https://github.com/michiganleon/ReCLIP_WACV. ReCLIP is a novel source-free domain adaptation method for Vision-Language Models (VLMs) like CLIP, addressing performance degradation due to domain gaps and misaligned visual-text embeddings. VLMs like CLIP, while powerful, suffer performance drops in target domains due to visual and text domain gaps and misaligned cross-modality embeddings, necessitating adaptation. ReCLIP learns a projection space to realign embeddings and generate pseudo labels. It then iteratively refines these labels and updates visual and text encoders via cross-modality self-training. ReCLIP significantly outperforms baseline adaptation methods (AaD, POUF) and the original CLIP across 22 image classification benchmarks. ReCLIP demonstrates consistent improvement on various VLM architectures and pre-training strategies. The proposed projection space and label propagation method effectively generate accurate pseudo labels for self-training. Label propagation accuracy becomes unstable on datasets with more than 500 categories, requiring further investigation. Exploring the use of augmentation consistency in conjunction with ReCLIP could potentially enhance adaptation performance. source-free domain adaptation, vision-language models, clip, cross-modality alignment, self-training
2308.03772 Report Improved Neural Radiance Fields Using Pseudo-depth and Fusion Jingliang Li, Qiang Zhou, Chaohui Yu, Zhengda Lu, Jun Xiao, Zhibin Wang, Fan Wang Since the advent of Neural Radiance Fields, novel view synthesis has received tremendous attention. The existing approach for the generalization of radiance field reconstruction primarily constructs an encoding volume from nearby source images as additional inputs. However, these approaches cannot efficiently encode the geometric information of real scenes with various scale objects/structures. In this work, we propose constructing multi-scale encoding volumes and providing multi-scale geometry information to NeRF models. To make the constructed volumes as close as possible to the surfaces of objects in the scene and the rendered depth more accurate, we propose to perform depth prediction and radiance field reconstruction simultaneously. The predicted depth map will be used to supervise the rendered depth, narrow the depth range, and guide points sampling. Finally, the geometric information contained in point volume features may be inaccurate due to occlusion, lighting, etc. To this end, we propose enhancing the point volume feature from depth-guided neighbor feature fusion. Experiments demonstrate the superior performance of our method in both novel view synthesis and dense geometry modeling without per-scene optimization. This paper proposes an end-to-end framework for generalizable radiance field reconstruction using multi-scale encoding volumes, an auxiliary depth prediction head, and depth-guided adaptive feature fusion. Existing NeRF methods struggle to generalize across scenes with diverse object scales and delicate geometry. This work aims to address these limitations and improve rendering quality. The framework constructs pyramid encoding volumes to provide multi-scale geometric information. An auxiliary depth prediction head guides point sampling and refines depth ranges. Depth-guided adaptive feature fusion enhances point volume features. The method achieves state-of-the-art results on view synthesis benchmarks, including DTU, NeRF Synthetic, and Real Forward-Facing datasets. Depth reconstruction is significantly improved, as demonstrated by quantitative metrics and qualitative comparisons. Ablation studies validate the contribution of each proposed module, particularly the multi-scale approach, depth guidance, and feature fusion. The method's performance on scenes with highly complex geometry or significant occlusions could be further investigated. Exploring more efficient architectures for encoding volumes and feature fusion could reduce computational cost. neural radiance fields, novel view synthesis, multi-scale representation, depth prediction, feature fusion
2308.03757 Report 3D Motion Magnification: Visualizing Subtle Motions with Time Varying Radiance Fields Brandon Y. Feng, Hadi Alzayer, Michael Rubinstein, William T. Freeman, Jia-Bin Huang Motion magnification helps us visualize subtle, imperceptible motion. However, prior methods only work for 2D videos captured with a fixed camera. We present a 3D motion magnification method that can magnify subtle motions from scenes captured by a moving camera, while supporting novel view rendering. We represent the scene with time-varying radiance fields and leverage the Eulerian principle for motion magnification to extract and amplify the variation of the embedding of a fixed point over time. We study and validate our proposed principle for 3D motion magnification using both implicit and tri-plane-based radiance fields as our underlying 3D scene representation. We evaluate the effectiveness of our method on both synthetic and real-world scenes captured under various camera setups. This paper presents a 3D motion magnification method using Neural Radiance Fields (NeRF), extending the capabilities of traditional 2D video motion magnification to handle dynamic 3D scenes and novel view synthesis. This method allows for the visualization and analysis of subtle, imperceptible 3D motions, overcoming the limitations of prior 2D methods that fail on moving camera footage and lack 3D motion analysis. The method leverages the Eulerian principle for motion magnification, applying it to the feature embedding space of NeRF instead of directly to color values. It represents the scene with time-varying radiance fields and amplifies temporal variations in the embedding of 3D points, enabling 3D motion magnified rendering. This is achieved by either modifying the positional encoding or by applying 2D video magnification techniques to tri-plane based learned embedding functions. Experiments on synthetic scenes show superior performance of the 3D motion magnification compared to applying 2D methods on rendered videos. Phase-based Eulerian magnification on tri-plane features exhibits the best performance among the explored methods. The method generalizes to real-world scenes, successfully magnifying subtle motions from both multi-view and single-camera captures, even supporting handheld videos. Performance is limited by the quality of NeRF reconstruction, which can be affected by factors like motion blur and inaccurate camera pose estimation. Future work could explore alternative embedding functions and magnification methods for better handling of complex motions and challenging capture conditions. motion magnification, neural radiance fields, 3d vision, novel view synthesis, eulerian analysis
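A minimal sketch of the Eulerian principle as described above, applied to the time-varying embedding of a single fixed 3D point: subtract the static component, amplify the residual variation, and feed the result to the renderer. The linear (unfiltered) amplification and array shapes are simplifying assumptions; the paper also studies phase-based variants on tri-plane features.

```python
# Linear Eulerian magnification of a point's time-varying feature embedding.
import numpy as np

def eulerian_magnify(embeddings: np.ndarray, alpha: float) -> np.ndarray:
    """embeddings: (T, D) embedding of one 3D point over T time steps."""
    mean = embeddings.mean(axis=0, keepdims=True)   # static component
    variation = embeddings - mean                   # subtle temporal variation
    return mean + (1.0 + alpha) * variation         # amplified embedding for rendering

T, D = 60, 32
emb = np.random.randn(1, D) + 0.01 * np.sin(np.linspace(0, 6.28, T))[:, None]
print(eulerian_magnify(emb, alpha=20.0).shape)      # (60, 32)
```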
2308.03747 Report Mask Frozen-DETR: High Quality Instance Segmentation with One GPU Zhanhao Liang, Yuhui Yuan In this paper, we aim to study how to build a strong instance segmenter with minimal training time and GPUs, as opposed to the majority of current approaches that pursue more accurate instance segmenter by building more advanced frameworks at the cost of longer training time and higher GPU requirements. To achieve this, we introduce a simple and general framework, termed Mask Frozen-DETR, which can convert any existing DETR-based object detection model into a powerful instance segmentation model. Our method only requires training an additional lightweight mask network that predicts instance masks within the bounding boxes given by a frozen DETR-based object detector. Remarkably, our method outperforms the state-of-the-art instance segmentation method Mask DINO in terms of performance on the COCO test-dev split (55.3% vs. 54.7%) while being over 10X times faster to train. Furthermore, all of our experiments can be trained using only one Tesla V100 GPU with 16 GB of memory, demonstrating the significant efficiency of our proposed framework. This paper proposes Mask Frozen-DETR, a method to convert existing DETR-based object detectors into strong instance segmenters by training an additional lightweight mask network using the output of a frozen DETR detector. Training modern instance segmentation models from scratch is resource-intensive and time-consuming. This work explores a more efficient approach by leveraging readily available, powerful object detection models. The method utilizes a frozen DETR-based object detector to generate bounding boxes. Then, it trains a lightweight mask network, incorporating image feature encoder, box feature encoder, and query feature encoder, to predict instance masks within those boxes. Mask Frozen-DETR outperforms the state-of-the-art Mask DINO on COCO test-dev (55.3% vs. 54.7%). The method significantly reduces training time, achieving more than 10x speedup compared to Mask DINO. All experiments were conducted on a single Tesla V100 GPU with 16 GB memory, demonstrating its efficiency. The work primarily focuses on COCO dataset; further validation on other datasets is needed. Fine-tuning the frozen DETR detector might yield further improvements, although at the cost of increased training time. instance segmentation, object detection, detr, efficient training, mask frozen-detr
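A hedged sketch of the overall recipe: a frozen DETR-style detector supplies image features and boxes, and only a lightweight mask head over box-aligned features is trained. The feature stride, crop size, and head design below are simplified assumptions, not the released model.

```python
# Sketch: train only a small mask head on features cropped around frozen-detector boxes.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class LightMaskHead(nn.Module):
    def __init__(self, in_ch=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 1, 1))                     # per-box binary mask logits

    def forward(self, feats, boxes):
        # feats: (B, C, H, W) from the frozen detector; boxes: list of (N_i, 4) in image coords
        crops = roi_align(feats, boxes, output_size=(28, 28),
                          spatial_scale=feats.shape[-1] / 800)  # assumes an 800px-wide input image
        return self.net(crops)                        # (sum N_i, 1, 28, 28)

feats = torch.randn(1, 256, 100, 100)                 # e.g. a stride-8 feature map
boxes = [torch.tensor([[50., 60., 300., 400.]])]
print(LightMaskHead()(feats, boxes).shape)            # torch.Size([1, 1, 28, 28])
```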
2308.03610 Report AvatarVerse: High-quality & Stable 3D Avatar Creation from Text and Pose Huichao Zhang, Bowen Chen, Hao Yang, Liao Qu, Xu Wang, Li Chen, Chao Long, Feida Zhu, Kang Du, Min Zheng Creating expressive, diverse and high-quality 3D avatars from highly customized text descriptions and pose guidance is a challenging task, due to the intricacy of modeling and texturing in 3D that ensure details and various styles (realistic, fictional, etc). We present AvatarVerse, a stable pipeline for generating expressive high-quality 3D avatars from nothing but text descriptions and pose guidance. In specific, we introduce a 2D diffusion model conditioned on DensePose signal to establish 3D pose control of avatars through 2D images, which enhances view consistency from partially observed scenarios. It addresses the infamous Janus Problem and significantly stabilizes the generation process. Moreover, we propose a progressive high-resolution 3D synthesis strategy, which obtains substantial improvement over the quality of the created 3D avatars. To this end, the proposed AvatarVerse pipeline achieves zero-shot 3D modeling of 3D avatars that are not only more expressive, but also in higher quality and fidelity than previous works. Rigorous qualitative evaluations and user studies showcase AvatarVerse's superiority in synthesizing high-fidelity 3D avatars, leading to a new standard in high-quality and stable 3D avatar creation. Our project page is: https://avatarverse3d.github.io AvatarVerse, a pipeline to automatically generate high-quality and stable 3D avatars from text prompts and poses. Automating the creation of high-quality 3D avatars can save resources in fields like game production and AR/VR. Leverages a DensePose-conditioned ControlNet to optimize an explicit NeRF with a progressive high-resolution generation strategy and avatar surface smoothing. Generates higher-quality avatars with more detail than previous methods. Enables flexible avatar generation, including partial avatars and arbitrary poses. Outperforms SOTA methods in user studies for both geometry and texture quality. The quality of the generated avatars still has room for improvement. The current framework relies on the pre-trained SMPL model, limiting its generalizability to other 3D objects. 3d avatar generation, text-to-3d, densepose, controlnet, neural radiance fields
2308.03463 Report DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis Zhongjie Duan, Lizhou You, Chengyu Wang, Cen Chen, Ziheng Wu, Weining Qian, Jun Huang In recent years, diffusion models have emerged as the most powerful approach in image synthesis. However, applying these models directly to video synthesis presents challenges, as it often leads to noticeable flickering contents. Although recently proposed zero-shot methods can alleviate flicker to some extent, we still struggle to generate coherent videos. In this paper, we propose DiffSynth, a novel approach that aims to convert image synthesis pipelines to video synthesis pipelines. DiffSynth consists of two key components: a latent in-iteration deflickering framework and a video deflickering algorithm. The latent in-iteration deflickering framework applies video deflickering to the latent space of diffusion models, effectively preventing flicker accumulation in intermediate steps. Additionally, we propose a video deflickering algorithm, named patch blending algorithm, that remaps objects in different frames and blends them together to enhance video consistency. One of the notable advantages of DiffSynth is its general applicability to various video synthesis tasks, including text-guided video stylization, fashion video synthesis, image-guided video stylization, video restoring, and 3D rendering. In the task of text-guided video stylization, we make it possible to synthesize high-quality videos without cherry-picking. The experimental results demonstrate the effectiveness of DiffSynth. All videos can be viewed on our project page. Source codes will also be released. This paper introduces DiffSynth, a novel approach for converting image synthesis pipelines to video synthesis pipelines using diffusion models, resulting in coherent and realistic video generation. Directly applying image synthesis methods to videos leads to flickering and inconsistencies. DiffSynth addresses these challenges and enables the application of diffusion models to video synthesis with high quality. DiffSynth utilizes a latent in-iteration deflickering framework to remove flicker in the latent space during intermediate synthesis steps. It also employs a patch blending algorithm, based on patch matching, to blend objects across frames for enhanced video consistency. DiffSynth effectively eliminates flicker and generates coherent videos without cherry-picking. It outperforms existing methods in quantitative metrics such as Pixel-MSE, CLIP Score, FID, and user studies. The approach demonstrates general applicability in various video synthesis tasks including stylization, synthesis, restoration, and 3D rendering. The computational efficiency of DiffSynth can be further improved. The blending operator in the patch blending algorithm can be further enhanced for better detail generation. video synthesis, diffusion models, deflickering, patch matching, latent space
2308.03040 Report Learning Fine-Grained Features for Pixel-wise Video Correspondences Rui Li, Shenglong Zhou, Dong Liu Video analysis tasks rely heavily on identifying the pixels from different frames that correspond to the same visual target. To tackle this problem, recent studies have advocated feature learning methods that aim to learn distinctive representations to match the pixels, especially in a self-supervised fashion. Unfortunately, these methods have difficulties for tiny or even single-pixel visual targets. Pixel-wise video correspondences were traditionally related to optical flows, which however lead to deterministic correspondences and lack robustness on real-world videos. We address the problem of learning features for establishing pixel-wise correspondences. Motivated by optical flows as well as the self-supervised feature learning, we propose to use not only labeled synthetic videos but also unlabeled real-world videos for learning fine-grained representations in a holistic framework. We adopt an adversarial learning scheme to enhance the generalization ability of the learned features. Moreover, we design a coarse-to-fine framework to pursue high computational efficiency. Our experimental results on a series of correspondence-based tasks demonstrate that the proposed method outperforms state-of-the-art rivals in both accuracy and efficiency. This paper proposes a method to learn fine-grained features for establishing pixel-wise correspondences in videos by combining supervised learning on synthetic data with self-supervised and adversarial learning on unlabeled real-world data. Accurately identifying pixel-wise correspondences across video frames is crucial for various computer vision tasks, but existing methods struggle to capture fine-grained differences over space and time, especially on real-world videos. The proposed approach leverages synthetic videos with optical flow labels to learn an initial feature representation. It then introduces soft labeling to convert deterministic correspondences into probabilistic maps, enhancing the model's robustness. Furthermore, it incorporates self-supervised reconstructive learning on unlabeled real-world videos and employs adversarial training to bridge the domain gap between synthetic and real data. The method achieves state-of-the-art results on point tracking benchmarks like BADJA, JHMDB, TAP-Vid-DAVIS, and TAP-Vid-Kinetics, demonstrating its effectiveness in capturing fine-grained motion. It also surpasses previous methods in semi-supervised video object segmentation on DAVIS-2017, highlighting the benefits of fine-grained features for this task. A proposed coarse-to-fine framework maintains competitive accuracy while significantly improving computational efficiency. The authors acknowledge that the focus on fine-grained features might hinder object-centric learning in some cases, suggesting further exploration. Future work could investigate leveraging more powerful 2D feature extractors for generating soft labels. video correspondences, fine-grained feature learning, self-supervised learning, adversarial training, optical flow
2308.02935 Report Bias Behind the Wheel: Fairness Analysis of Autonomous Driving Systems Xinyue Li, Zhenpeng Chen, Jie M. Zhang, Federica Sarro, Ying Zhang, Xuanzhe Liu This paper analyzes fairness in automated pedestrian detection, a crucial but under-explored issue in autonomous driving systems. We evaluate eight state-of-the-art deep learning-based pedestrian detectors across demographic groups on large-scale real-world datasets. To enable thorough fairness testing, we provide extensive annotations for the datasets, resulting in 8,311 images with 16,070 gender labels, 20,115 age labels, and 3,513 skin tone labels. Our findings reveal significant fairness issues, particularly related to age. The undetected proportions for children are 20.14% higher compared to adults. Furthermore, we explore how various driving scenarios affect the fairness of pedestrian detectors. We find that pedestrian detectors demonstrate significant gender biases during night time, potentially exacerbating the prevalent societal issue of female safety concerns during nighttime out. Moreover, we observe that pedestrian detectors can demonstrate both enhanced fairness and superior performance under specific driving conditions, which challenges the fairness-performance trade-off theory widely acknowledged in the fairness literature. We publicly release the code, data, and results to support future research on fairness in autonomous driving. This paper presents the first comprehensive study on fairness issues in pedestrian detection for autonomous driving, evaluating eight state-of-the-art detectors across diverse demographic groups. Fairness in autonomous driving systems, crucial for preventing discriminatory outcomes and ensuring equal treatment, remains under-explored. This study aims to uncover and analyze these issues to pave the way for more equitable and unbiased systems. The study evaluates eight deep learning-based pedestrian detectors on four real-world datasets enriched with manually annotated demographic labels (gender, age, skin tone). The authors analyze performance disparities (miss rates) across demographic groups under different driving scenarios (brightness, contrast, weather conditions). State-of-the-art pedestrian detectors exhibit significant age bias, with a 20.14% higher miss rate for children compared to adults. Significant gender bias is observed during nighttime, with higher miss rates for females, potentially exacerbating safety concerns. Contrary to common belief, pedestrian detectors can achieve enhanced fairness and detection performance under specific driving scenarios (e.g., higher brightness). The manual labeling process, while mitigated by using two annotators and an arbitrator, poses inherent subjectivity. The study's focus on eight specific pedestrian detectors and four datasets might introduce selection bias, although the authors carefully chose representative models and widely-used datasets. fairness, pedestrian detection, autonomous driving, deep learning, bias
2308.02915 Report DiffDance: Cascaded Human Motion Diffusion Model for Dance Generation Qiaosong Qi, Le Zhuo, Aixi Zhang, Yue Liao, Fei Fang, Si Liu, Shuicheng Yan When hearing music, it is natural for people to dance to its rhythm. Automatic dance generation, however, is a challenging task due to the physical constraints of human motion and rhythmic alignment with target music. Conventional autoregressive methods introduce compounding errors during sampling and struggle to capture the long-term structure of dance sequences. To address these limitations, we present a novel cascaded motion diffusion model, DiffDance, designed for high-resolution, long-form dance generation. This model comprises a music-to-dance diffusion model and a sequence super-resolution diffusion model. To bridge the gap between music and motion for conditional generation, DiffDance employs a pretrained audio representation learning model to extract music embeddings and further align its embedding space to motion via contrastive loss. During training our cascaded diffusion model, we also incorporate multiple geometric losses to constrain the model outputs to be physically plausible and add a dynamic loss weight that adaptively changes over diffusion timesteps to facilitate sample diversity. Through comprehensive experiments performed on the benchmark dataset AIST++, we demonstrate that DiffDance is capable of generating realistic dance sequences that align effectively with the input music. These results are comparable to those achieved by state-of-the-art autoregressive methods. Presents DiffDance, a cascaded motion diffusion model for generating high-resolution, long-form dance sequences from music. Addresses limitations of autoregressive methods in dance generation, which suffer from compounding errors and struggle to capture long-term structure. DiffDance leverages diffusion models for realistic and diverse dance sequence generation. Employs a two-stage approach: 1) Music-to-Dance diffusion model generates low-resolution dance. 2) Sequence Super-Resolution diffusion model upscales to high-resolution. Uses Wav2CLIP for music embedding and aligns it with motion embedding via contrastive loss. Incorporates geometric losses and dynamic loss weight for realistic and diverse motion. Achieves state-of-the-art performance on FID_k and Beat Align Score, demonstrating superior dance quality and music-dance alignment. Generates dance sequences with distinct long-term choreographic structures, as observed in user studies. Shows the effectiveness of the cascaded approach, embedding alignment, and geometric losses through ablation studies. Limited diversity in geometric features potentially due to regularization losses. Beat Align Score, while effective, may not capture all nuances of human dance evaluation. diffusion model, music-to-dance generation, conditional generation, multimodal learning, motion synthesis
2308.02874 Report Sketch and Text Guided Diffusion Model for Colored Point Cloud Generation Zijie Wu, Yaonan Wang, Mingtao Feng, He Xie, Ajmal Mian Diffusion probabilistic models have achieved remarkable success in text guided image generation. However, generating 3D shapes is still challenging due to the lack of sufficient data containing 3D models along with their descriptions. Moreover, text based descriptions of 3D shapes are inherently ambiguous and lack details. In this paper, we propose a sketch and text guided probabilistic diffusion model for colored point cloud generation that conditions the denoising process jointly with a hand drawn sketch of the object and its textual description. We incrementally diffuse the point coordinates and color values in a joint diffusion process to reach a Gaussian distribution. Colored point cloud generation thus amounts to learning the reverse diffusion process, conditioned by the sketch and text, to iteratively recover the desired shape and color. Specifically, to learn effective sketch-text embedding, our model adaptively aggregates the joint embedding of text prompt and the sketch based on a capsule attention network. Our model uses staged diffusion to generate the shape and then assign colors to different parts conditioned on the appearance prompt while preserving precise shapes from the first stage. This gives our model the flexibility to extend to multiple tasks, such as appearance re-editing and part segmentation. Experimental results demonstrate that our model outperforms recent state-of-the-art in point cloud generation. This paper introduces STPD, a novel sketch and text guided diffusion model for generating colored 3D point clouds. Generating 3D shapes from text is challenging due to data scarcity and ambiguity in textual descriptions. STPD addresses this using sketches, which provide unambiguous geometric information. STPD uses a capsule attention network to extract sparse features from sketches, fuses them with text embeddings, and employs a staged diffusion process to generate shape and color separately. STPD outperforms state-of-the-art methods in colored point cloud generation. The attention-based capsule network effectively learns from sparse sketch data. STPD demonstrates strong representation learning ability, applicable to 3D object classification and part segmentation. STPD's generalization ability is limited by training data size. Handling conflicting sketch and text inputs needs further investigation. 3d point cloud generation, diffusion models, sketch-based modeling, text-to-3d, capsule attention networks
2308.02840 Report Learning Unified Decompositional and Compositional NeRF for Editable Novel View Synthesis Yuxin Wang, Wayne Wu, Dan Xu Implicit neural representations have shown powerful capacity in modeling real-world 3D scenes, offering superior performance in novel view synthesis. In this paper, we target a more challenging scenario, i.e., joint scene novel view synthesis and editing based on implicit neural scene representations. State-of-the-art methods in this direction typically consider building separate networks for these two tasks (i.e., view synthesis and editing). Thus, the modeling of interactions and correlations between these two tasks is very limited, which, however, is critical for learning high-quality scene representations. To tackle this problem, in this paper, we propose a unified Neural Radiance Field (NeRF) framework to effectively perform joint scene decomposition and composition for modeling real-world scenes. The decomposition aims at learning disentangled 3D representations of different objects and the background, allowing for scene editing, while scene composition models an entire scene representation for novel view synthesis. Specifically, with a two-stage NeRF framework, we learn a coarse stage for predicting a global radiance field as guidance for point sampling, and in the second fine-grained stage, we perform scene decomposition by a novel one-hot object radiance field regularization module and a pseudo supervision via inpainting to handle ambiguous background regions occluded by objects. The decomposed object-level radiance fields are further composed by using activations from the decomposition module. Extensive quantitative and qualitative results show the effectiveness of our method for scene decomposition and composition, outperforming state-of-the-art methods for both novel-view synthesis and editing tasks. This paper presents a novel Neural Radiance Field (NeRF) framework that unifies scene decomposition and composition for editable novel view synthesis. Existing methods for object-aware scene representation often use separate networks for view synthesis and editing, limiting the modeling of interactions between these tasks crucial for high-quality representations. The proposed two-stage framework first learns a global radiance field for point sampling guidance. In the fine-grained stage, it performs decomposition using learnable object codes, one-hot object radiance regularization, and in-painting pseudo-supervision for occluded regions. Composition is achieved by utilizing learned activation weights for object-level radiance fields. The proposed method demonstrates superior performance in novel view synthesis compared to state-of-the-art methods like ObjectNeRF and ObjectSDF. It enables effective scene decomposition, allowing for object manipulations such as removal, addition, duplication, and position changes. The framework shows clear advantages in background rendering quality, particularly in handling unseen or occluded regions. The current method relies on object masks as supervisory signals, which may limit its applicability in scenarios where such annotations are unavailable. Future work could explore the extension of the framework to handle dynamic scenes with moving objects. neural radiance fields, novel view synthesis, scene decomposition, scene composition, object editing
2308.02669 Report ConceptLab: Creative Concept Generation using VLM-Guided Diffusion Prior Constraints Elad Richardson, Kfir Goldberg, Yuval Alaluf, Daniel Cohen-Or Recent text-to-image generative models have enabled us to transform our words into vibrant, captivating imagery. The surge of personalization techniques that has followed has also allowed us to imagine unique concepts in new scenes. However, an intriguing question remains: How can we generate a new, imaginary concept that has never been seen before? In this paper, we present the task of creative text-to-image generation, where we seek to generate new members of a broad category (e.g., generating a pet that differs from all existing pets). We leverage the under-studied Diffusion Prior models and show that the creative generation problem can be formulated as an optimization process over the output space of the diffusion prior, resulting in a set of "prior constraints". To keep our generated concept from converging into existing members, we incorporate a question-answering Vision-Language Model (VLM) that adaptively adds new constraints to the optimization problem, encouraging the model to discover increasingly more unique creations. Finally, we show that our prior constraints can also serve as a strong mixing mechanism allowing us to create hybrids between generated concepts, introducing even more flexibility into the creative process. ConceptLab, a method for generating novel image concepts (e.g., a new type of pet) that belong to a broad category (e.g., pets) but differ from existing members of that category (e.g., cats, dogs). Existing text-to-image generation techniques excel at generating existing concepts or personalizing models to specific subjects but lack the ability to creatively imagine entirely new concepts within a category. ConceptLab optimizes a token embedding in the text encoder space of a pretrained text-to-image diffusion model. It uses "prior constraints" derived from CLIP similarities between a target category and existing members, leveraging a Diffusion Prior model to guide the optimization process. An iterative feedback loop with a VLM expands the set of negative constraints, fostering greater concept uniqueness. ConceptLab successfully generates novel concepts across various categories, like pets, buildings, and even artistic styles. Generated concepts can be seamlessly integrated into different scenes and artistic renderings through text prompts. Quantitative and user study evaluations confirm ConceptLab's superiority over baseline methods like negative prompting in creating unique and diverse concepts within target categories. Editing generated concepts using text prompts does not always consistently maintain the concept's unique properties. The success of ConceptLab can be limited by the performance of the VLM used for adaptive negative constraint generation. creative generation, text-to-image synthesis, diffusion models, diffusion prior, vision-language models
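The "prior constraints" can be sketched as a simple optimization objective in CLIP space: keep the learned concept embedding close to the broad category while pushing it away from existing members, with the VLM adding new negatives over iterations. Random tensors stand in for CLIP embeddings here, and the exact loss form is an illustrative assumption rather than the paper's formulation.

```python
# Toy prior-constraint loss: stay in the broad category, diverge from the nearest member.
import torch
import torch.nn.functional as F

def prior_constraint_loss(concept, positive, negatives, lam=1.0):
    """concept: (D,) learned embedding; positive: (D,) category text; negatives: (N, D)."""
    pos_sim = F.cosine_similarity(concept, positive, dim=0)
    neg_sim = F.cosine_similarity(concept.unsqueeze(0), negatives, dim=1)  # (N,)
    return -pos_sim + lam * neg_sim.max()

concept = torch.randn(512, requires_grad=True)       # stands in for the optimized token embedding
loss = prior_constraint_loss(concept, torch.randn(512), torch.randn(8, 512))
loss.backward()
print(concept.grad.shape)  # torch.Size([512])
```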
2308.02552 Report Degeneration-Tuning: Using Scrambled Grid shield Unwanted Concepts from Stable Diffusion Zixuan Ni, Longhui Wei, Jiacheng Li, Siliang Tang, Yueting Zhuang, Qi Tian Owing to the unrestricted nature of the content in the training data, large text-to-image diffusion models, such as Stable Diffusion (SD), are capable of generating images with potentially copyrighted or dangerous content based on corresponding textual concepts information. This includes specific intellectual property (IP), human faces, and various artistic styles. However, Negative Prompt, a widely used method for content removal, frequently fails to conceal this content due to inherent limitations in its inference logic. In this work, we propose a novel strategy named Degeneration-Tuning (DT) to shield contents of unwanted concepts from SD weights. By utilizing Scrambled Grid to reconstruct the correlation between undesired concepts and their corresponding image domain, we guide SD to generate meaningless content when such textual concepts are provided as input. As this adaptation occurs at the level of the model's weights, the SD, after DT, can be grafted onto other conditional diffusion frameworks like ControlNet to shield unwanted concepts. In addition to qualitatively showcasing the effectiveness of our DT method in protecting various types of concepts, a quantitative comparison of the SD before and after DT indicates that the DT method does not significantly impact the generative quality of other contents. The FID and IS scores of the model on COCO-30K exhibit only minor changes after DT, shifting from 12.61 and 39.20 to 13.04 and 38.25, respectively, which clearly outperforms the previous methods. This paper introduces Degeneration-Tuning (DT), a novel technique to prevent Stable Diffusion from generating images of undesired concepts by disrupting the low-frequency visual information associated with these concepts, guiding the model to produce meaningless content instead. Large text-to-image diffusion models like Stable Diffusion, trained on unrestricted data, risk generating potentially copyrighted or harmful content. Existing methods for content removal, such as Negative Prompt or Safety Filters, have limitations. DT offers a solution by directly modifying model weights, making it robust to parameter leakage. DT employs a Scrambled Grid operation to disrupt the low-frequency visual content of targeted concepts. The model is then fine-tuned on this degraded dataset alongside anchor images generated without the specific concepts, effectively masking the original semantic content. DT successfully shields various concepts, including specific IPs, artistic styles, and individuals, without significantly affecting the generation quality of other content. DT remains effective when grafted onto other conditional diffusion models like ControlNet. Continual DT, while feasible, presents challenges in maintaining image quality due to potential bias amplification. Continual DT requires further investigation to address the observed decline in generated image quality. The impact of DT on the generation of conceptually related terms needs further exploration. stable diffusion, content protection, degeneration-tuning, scrambled grid, continual learning
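The Scrambled Grid operation described above can be sketched as cutting the image into a grid of patches and randomly permuting them, which destroys the low-frequency layout tied to the concept while keeping local statistics. The grid size and tensor layout are assumptions for illustration.

```python
# Toy Scrambled Grid: shuffle grid cells of an image to degrade its low-frequency content.
import torch

def scramble_grid(image: torch.Tensor, grid: int = 4) -> torch.Tensor:
    """image: (C, H, W) with H and W divisible by `grid`."""
    c, h, w = image.shape
    ph, pw = h // grid, w // grid
    patches = image.reshape(c, grid, ph, grid, pw).permute(1, 3, 0, 2, 4)  # (grid, grid, C, ph, pw)
    patches = patches.reshape(grid * grid, c, ph, pw)
    patches = patches[torch.randperm(grid * grid)]                         # shuffle the grid cells
    patches = patches.reshape(grid, grid, c, ph, pw).permute(2, 0, 3, 1, 4)
    return patches.reshape(c, h, w)

scrambled = scramble_grid(torch.rand(3, 512, 512))
print(scrambled.shape)  # torch.Size([3, 512, 512])
```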
2308.02535 Report Learning to Generate Training Datasets for Robust Semantic Segmentation Marwane Hariat, Olivier Laurent, Rémi Kazmierczak, Shihao Zhang, Andrei Bursuc, Angela Yao, Gianni Franchi Semantic segmentation methods have advanced significantly. Still, their robustness to real-world perturbations and object types not seen during training remains a challenge, particularly in safety-critical applications. We propose a novel approach to improve the robustness of semantic segmentation techniques by leveraging the synergy between label-to-image generators and image-to-label segmentation models. Specifically, we design Robusta, a novel robust conditional generative adversarial network to generate realistic and plausible perturbed images that can be used to train reliable segmentation models. We conduct in-depth studies of the proposed generative model, assess the performance and robustness of the downstream segmentation network, and demonstrate that our approach can significantly enhance the robustness in the face of real-world perturbations, distribution shifts, and out-of-distribution samples. Our results suggest that this approach could be valuable in safety-critical applications, where the reliability of perception modules such as semantic segmentation is of utmost importance and comes with a limited computational budget in inference. We release our code at https://github.com/ENSTA-U2IS-AI/robusta. This paper presents Robusta, a novel cascaded cGAN architecture that improves the robustness of semantic segmentation models against input perturbations and enables them to detect outlier objects. Robustness in semantic segmentation is crucial for safety-critical applications like autonomous driving where unexpected objects or conditions can lead to failures. Robusta leverages attention layers and sub-networks to generate realistic images even from corrupted label maps. The generated images are used to train an observer network for anomaly detection. The authors introduce a new framework to evaluate the robustness of label-to-image generators and compare Robusta to SOTA methods. Robusta generates images comparable or superior in quality to SOTA label-to-image translation methods. Robusta exhibits superior robustness to label map perturbations compared to other cGANs. Using Robusta generated data improves the robustness and out-of-distribution detection capabilities of semantic segmentation models. The paper primarily focuses on specific types of outliers and may not generalize to all unseen objects. The two-stage training process of Robusta increases computational cost compared to single-stage methods. semantic segmentation, robustness, generative adversarial networks, anomaly detection, out-of-distribution detection
2308.02487 Report Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, Liang-Chieh Chen Open-vocabulary segmentation is a challenging task requiring segmenting and recognizing objects from an open set of categories. One way to address this challenge is to leverage multi-modal models, such as CLIP, to provide image and text features in a shared embedding space, which bridges the gap between closed-vocabulary and open-vocabulary recognition. Hence, existing methods often adopt a two-stage framework to tackle the problem, where the inputs first go through a mask generator and then through the CLIP model along with the predicted masks. This process involves extracting features from images multiple times, which can be ineffective and inefficient. By contrast, we propose to build everything into a single-stage framework using a shared Frozen Convolutional CLIP backbone, which not only significantly simplifies the current two-stage pipeline, but also remarkably yields a better accuracy-cost trade-off. The proposed FC-CLIP benefits from the following observations: the frozen CLIP backbone maintains the ability of open-vocabulary classification and can also serve as a strong mask generator, and the convolutional CLIP generalizes well to a larger input resolution than the one used during contrastive image-text pretraining. When training on COCO panoptic data only and testing in a zero-shot manner, FC-CLIP achieves 26.8 PQ, 16.8 AP, and 34.1 mIoU on ADE20K, 18.2 PQ, 27.9 mIoU on Mapillary Vistas, 44.0 PQ, 26.8 AP, 56.2 mIoU on Cityscapes, outperforming the prior art by +4.2 PQ, +2.4 AP, +4.2 mIoU on ADE20K, +4.0 PQ on Mapillary Vistas and +20.1 PQ on Cityscapes, respectively. Additionally, the training and testing time of FC-CLIP is 7.5x and 6.6x faster than the same prior art, while using 5.9x fewer parameters. FC-CLIP also sets a new state-of-the-art performance across various open-vocabulary semantic segmentation datasets. Code at https://github.com/bytedance/fc-clip This paper introduces FC-CLIP, a single-stage framework for open-vocabulary segmentation that builds upon a shared frozen convolutional CLIP backbone. Existing open-vocabulary segmentation methods rely on two-stage frameworks that are inefficient and ineffective due to separate feature extraction for mask generation and classification. FC-CLIP addresses these limitations with a unified and efficient approach. FC-CLIP leverages a frozen convolutional CLIP backbone for both mask generation and classification. It consists of a class-agnostic mask generator, an in-vocabulary classifier trained on seen classes, and an out-of-vocabulary classifier for novel classes, combined using geometric ensembling. FC-CLIP achieves state-of-the-art results on open-vocabulary panoptic segmentation benchmarks, including ADE20K, Cityscapes, and Mapillary Vistas, outperforming prior art like ODISE significantly. FC-CLIP demonstrates strong performance in open-vocabulary semantic segmentation, achieving state-of-the-art results on ADE20K-847 and PASCAL-Context-459. FC-CLIP offers a significantly faster inference speed, running 6.6 times faster than ODISE. The paper identifies potential for further research in better utilizing CLIP for mask segmentation and classification. Addressing potential biases present in the Internet data used for CLIP pre-training is crucial. open-vocabulary segmentation, panoptic segmentation, semantic segmentation, clip, single-stage framework
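A sketch of the geometric ensembling mentioned for FC-CLIP, assuming the commonly used form in open-vocabulary segmentation: the in-vocabulary and out-of-vocabulary class probabilities are fused with a geometric mean whose exponent differs for seen and unseen classes. The exponent values `alpha`/`beta` and the function name are placeholders, not the paper's settings.

```python
import torch

def geometric_ensemble(p_in, p_out, seen_mask, alpha=0.4, beta=0.8):
    """Geometric ensemble of in- and out-of-vocabulary class probabilities.

    p_in, p_out: [num_masks, num_classes] probabilities from the in-vocabulary
                 head and from frozen-CLIP mask pooling, respectively.
    seen_mask:   boolean [num_classes], True for categories seen in training.
    alpha/beta:  weight given to the CLIP (out-of-vocabulary) branch for
                 seen / unseen classes; values here are illustrative only.
    """
    w = seen_mask.float() * alpha + (~seen_mask).float() * beta     # [C]
    fused = p_in.clamp_min(1e-8) ** (1 - w) * p_out.clamp_min(1e-8) ** w
    return fused / fused.sum(dim=-1, keepdim=True)
```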
2308.02299 Report RegionBLIP: A Unified Multi-modal Pre-training Framework for Holistic and Regional Comprehension Qiang Zhou, Chaohui Yu, Shaofeng Zhang, Sitong Wu, Zhibing Wang, Fan Wang In this work, we investigate extending the comprehension of Multi-modal Large Language Models (MLLMs) to regional objects. To this end, we propose to extract features corresponding to regional objects as soft prompts for LLM, which provides a straightforward and scalable approach and eliminates the need for LLM fine-tuning. To effectively extract regional features from regular image features and irregular point cloud features, we present a novel and unified position-assisted feature extraction module. Furthermore, training an MLLM from scratch is highly time-consuming. Thus, we propose incrementally extending existing pre-trained MLLMs to comprehend more modalities and the regional objects of those modalities. Specifically, we freeze the Q-Former from BLIP-2, an impressive MLLM, and optimize the modality-specific Lora parameters in Q-Former and LLM for each newly introduced modality. The freezing of the Q-Former eliminates the need for extensive pre-training on massive image-text data. The frozen Q-Former pre-trained on massive image-text data is also beneficial for the pre-training on image-region-text data. We name our framework RegionBLIP. We pre-train RegionBLIP on image-region-text, point-cloud-text, and point-cloud-region-text data. Experimental results verify that RegionBLIP can preserve the image comprehension capability of BLIP-2 and further gain a comprehension of the newly introduced point cloud modality and regional objects. The Data, Code, and Pre-trained models will be available at https://github.com/mightyzau/RegionBLIP. This paper presents RegionBLIP, a unified Multi-modal Large Language Model (MLLM) framework that incorporates both holistic and regional object comprehension for image and point cloud modalities. Comprehending regional objects is essential in many applications like virtual reality. Existing MLLMs struggle to efficiently incorporate this capability, especially across multiple modalities. The authors introduce a position-assisted feature extraction (PaFE) module to extract regional features from both regular image features and irregular point cloud features. They also propose an incremental pre-training scheme that freezes the Q-Former from BLIP-2 and learns modality-specific Lora parameters, enabling efficient extension to new modalities. RegionBLIP preserves the image comprehension capabilities of BLIP-2 while extending it to point cloud and regional object comprehension. The PaFE module significantly improves regional comprehension performance for both image and point cloud modalities. The incremental pre-training scheme effectively extends MLLM's comprehension capabilities to new modalities without retraining on massive datasets. The performance of point cloud region captioning is somewhat limited due to not utilizing point cloud color information. Future work will involve increasing the size of the RegionCap dataset to improve the generalization of image-region comprehension for MLLM models. multi-modal learning, large language models, region comprehension, incremental pre-training, point cloud understanding
2308.02236 Report FB-BEV: BEV Representation from Forward-Backward View Transformations Zhiqi Li, Zhiding Yu, Wenhai Wang, Anima Anandkumar, Tong Lu, Jose M. Alvarez View Transformation Module (VTM), where transformations happen between multi-view image features and Bird-Eye-View (BEV) representation, is a crucial step in camera-based BEV perception systems. Currently, the two most prominent VTM paradigms are forward projection and backward projection. Forward projection, represented by Lift-Splat-Shoot, leads to sparsely projected BEV features without post-processing. Backward projection, with BEVFormer being an example, tends to generate false-positive BEV features from incorrect projections due to the lack of utilization on depth. To address the above limitations, we propose a novel forward-backward view transformation module. Our approach compensates for the deficiencies in both existing methods, allowing them to enhance each other to obtain higher quality BEV representations mutually. We instantiate the proposed module with FB-BEV, which achieves a new state-of-the-art result of 62.4% NDS on the nuScenes test set. Code and models are available at https://github.com/NVlabs/FB-BEV. Proposes FB-BEV, a novel forward-backward view transformation module for camera-based 3D object detection that generates dense and accurate Bird's Eye View (BEV) representations. Existing view transformation modules for BEV perception either produce sparse BEV features (forward projection) or suffer from false positives due to inaccurate depth utilization (backward projection). Combines forward projection (F-VTM) to generate initial sparse BEV and depth-aware backward projection (B-VTM) to refine foreground regions identified by a lightweight Foreground Region Proposal Network (FRPN). Depth consistency ensures accurate feature mapping in B-VTM. FB-BEV achieves state-of-the-art 62.4% NDS on the nuScenes test set, outperforming previous methods. Depth-aware backward projection significantly improves performance compared to standard backward projection, demonstrating effective depth utilization. FRPN improves both inference efficiency and detection accuracy by focusing refinement on foreground regions. The two-stage design, while efficient, could be further optimized for end-to-end training. Exploration of alternative depth-aware mechanisms for backward projection could yield further performance gains. 3d object detection, "birds eye view (bev)", view transformation module (vtm), forward-backward projection, depth consistency
2308.02157 Report Improved Order Analysis and Design of Exponential Integrator for Diffusion Models Sampling Qinsheng Zhang, Jiaming Song, Yongxin Chen Efficient differential equation solvers have significantly reduced the sampling time of diffusion models (DMs) while retaining high sampling quality. Among these solvers, exponential integrators (EI) have gained prominence by demonstrating state-of-the-art performance. However, existing high-order EI-based sampling algorithms rely on degenerate EI solvers, resulting in inferior error bounds and reduced accuracy in contrast to the theoretically anticipated results under optimal settings. This situation makes the sampling quality extremely vulnerable to seemingly innocuous design choices such as timestep schedules. For example, an inefficient timestep scheduler might necessitate twice the number of steps to achieve a quality comparable to that obtained through carefully optimized timesteps. To address this issue, we reevaluate the design of high-order differential solvers for DMs. Through a thorough order analysis, we reveal that the degeneration of existing high-order EI solvers can be attributed to the absence of essential order conditions. By reformulating the differential equations in DMs and capitalizing on the theory of exponential integrators, we propose refined EI solvers that fulfill all the order conditions, which we designate as Refined Exponential Solver (RES). Utilizing these improved solvers, RES exhibits more favorable error bounds theoretically and achieves superior sampling efficiency and stability in practical applications. For instance, a simple switch from the single-step DPM-Solver++ to our order-satisfied RES solver when Number of Function Evaluations (NFE) $=9$, results in a reduction of numerical defects by $25.2\%$ and FID improvement of $25.4\%$ (16.77 vs 12.51) on a pre-trained ImageNet diffusion model. This paper proposes Refined Exponential Solver (RES), an improved exponential integrator for diffusion model sampling that addresses the order condition violations in existing methods. Existing high-order exponential integrator-based sampling algorithms often lead to suboptimal performance due to the use of degenerate solvers that violate necessary order conditions. This results in worse error bounds and reduced accuracy compared to theoretical expectations. The authors perform a thorough order analysis of single-step numerical schemes for the diffusion probability flow ODE, identify the overlooked order conditions, and derive a refined exponential integrator that satisfies these conditions. They also extend the analysis to multistep deterministic and stochastic sampling algorithms. RES demonstrates significantly smaller numerical defects and faster convergence compared to existing methods like DDIM, Heun, and DPM-Solver++. The reduction in numerical defects achieved by RES translates to improved sampling quality, as evidenced by lower (better) FID scores. RES exhibits enhanced robustness to suboptimal time-step schedules compared to other methods. The choice of logarithmic transformation for the noise level, while empirically beneficial, lacks theoretical justification. Training-free methods, even with RES, are still slower than GANs or distillation-based methods. diffusion models, sampling algorithms, exponential integrators, numerical ode solvers, order conditions
2308.02154 Report SDDM: Score-Decomposed Diffusion Models on Manifolds for Unpaired Image-to-Image Translation Shikun Sun, Longhui Wei, Junliang Xing, Jia Jia, Qi Tian Recent score-based diffusion models (SBDMs) show promising results in unpaired image-to-image translation (I2I). However, existing methods, either energy-based or statistically-based, provide no explicit form of the interfered intermediate generative distributions. This work presents a new score-decomposed diffusion model (SDDM) on manifolds to explicitly optimize the tangled distributions during image generation. SDDM derives manifolds to make the distributions of adjacent time steps separable and decompose the score function or energy guidance into an image ``denoising" part and a content ``refinement" part. To refine the image in the same noise level, we equalize the refinement parts of the score function and energy guidance, which permits multi-objective optimization on the manifold. We also leverage the block adaptive instance normalization module to construct manifolds with lower dimensions but still concentrated with the perturbed reference image. SDDM outperforms existing SBDM-based methods with much fewer diffusion steps on several I2I benchmarks. This paper proposes SDDM, a novel score-decomposed diffusion model on manifolds, for unpaired image-to-image translation. Existing score-based diffusion models for image translation lack explicit control over intermediate generative distributions, leading to suboptimal results. SDDM decomposes score function and energy guidance into "denoising" and "refinement" parts using manifolds. It utilizes statistical guidance to separate adjacent time-step distributions and leverages the BAdaIN module to construct low-dimensional manifolds. Finally, it performs multi-objective optimization on these manifolds. SDDM achieves superior performance on I2I benchmarks compared to other SBDM-based methods. It requires significantly fewer diffusion steps (100) than methods like EGSDE (1000) while achieving better or comparable results. Ablation studies confirm the effectiveness of score decomposition, BAdaIN-based manifolds, and multi-objective optimization. The approach introduces additional computations, albeit negligible compared to neural network inferences. Future work includes exploring stronger energy functions and applying SDDM to a wider range of image translation tasks. image-to-image translation, diffusion models, score-based models, manifold optimization, generative models
2308.02117 Report VQGraph: Rethinking Graph Representation Space for Bridging GNNs and MLPs Ling Yang, Ye Tian, Minkai Xu, Zhongyi Liu, Shenda Hong, Wei Qu, Wentao Zhang, Bin Cui, Muhan Zhang, Jure Leskovec GNN-to-MLP distillation aims to utilize knowledge distillation (KD) to learn computationally-efficient multi-layer perceptron (student MLP) on graph data by mimicking the output representations of teacher GNN. Existing methods mainly make the MLP to mimic the GNN predictions over a few class labels. However, the class space may not be expressive enough for covering numerous diverse local graph structures, thus limiting the performance of knowledge transfer from GNN to MLP. To address this issue, we propose to learn a new powerful graph representation space by directly labeling nodes' diverse local structures for GNN-to-MLP distillation. Specifically, we propose a variant of VQ-VAE to learn a structure-aware tokenizer on graph data that can encode each node's local substructure as a discrete code. The discrete codes constitute a codebook as a new graph representation space that is able to identify different local graph structures of nodes with the corresponding code indices. Then, based on the learned codebook, we propose a new distillation target, namely soft code assignments, to directly transfer the structural knowledge of each node from GNN to MLP. The resulting framework VQGraph achieves new state-of-the-art performance on GNN-to-MLP distillation in both transductive and inductive settings across seven graph datasets. We show that VQGraph with better performance infers faster than GNNs by 828x, and also achieves accuracy improvement over GNNs and stand-alone MLPs by 3.90% and 28.05% on average, respectively. Code: https://github.com/YangLing0818/VQGraph. This paper introduces VQGraph, a novel GNN-to-MLP distillation framework that enhances the expressiveness of graph representation space by directly labeling diverse local node structures using a codebook for structure-aware knowledge transfer. Existing GNN-to-MLP distillation methods rely on class label space which lacks expressiveness to capture the diverse local graph structures, limiting their performance. VQGraph leverages a variant of VQ-VAE to learn a structure-aware tokenizer on graph data, encoding each node's local substructure into a discrete code. These codes constitute a codebook, forming a powerful representation space that can distinguish different local structures. VQGraph then utilizes this codebook to perform structure-aware distillation by minimizing the KL divergence between GNN and MLP predictions over the discrete codes (soft code assignment). VQGraph achieves state-of-the-art performance on GNN-to-MLP distillation, outperforming teacher GNNs by 3.90% on average accuracy while being 828x faster in inference. The learned representation space in VQGraph is more compact and better captures both local and global graph structural information compared to existing methods. Extensive experiments across seven datasets, including both transductive and inductive settings, show the effectiveness and robustness of VQGraph. The codebook size selection is crucial and currently relies on dataset-specific tuning. Further exploration of different relation modules for computing code assignments could be beneficial. graph neural networks, knowledge distillation, graph representation learning, structure-aware distillation, vq-vae
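A minimal sketch of the soft code assignment distillation described for VQGraph, under stated assumptions: the soft assignment is a softmax over similarities between node representations and the learned codebook (here negative squared distance with a temperature `tau`, both of which are my placeholders), and the student MLP is trained to match the teacher GNN's assignments with a KL term.

```python
import torch
import torch.nn.functional as F

def soft_code_assignment(node_repr, codebook, tau=1.0):
    """Distribution over codebook entries for each node.

    node_repr: [N, d] node representations (from the GNN teacher or MLP student).
    codebook:  [K, d] code vectors learned by the graph VQ-VAE tokenizer.
    """
    dist = torch.cdist(node_repr, codebook) ** 2      # [N, K] squared distances
    return F.softmax(-dist / tau, dim=-1)

def code_distillation_loss(mlp_repr, gnn_repr, codebook, tau=1.0):
    """KL(teacher soft codes || student soft codes), averaged over nodes."""
    p_teacher = soft_code_assignment(gnn_repr.detach(), codebook, tau)
    log_q_student = torch.log(soft_code_assignment(mlp_repr, codebook, tau) + 1e-8)
    return F.kl_div(log_q_student, p_teacher, reduction="batchmean")
```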
2308.02065 Report On the Biometric Capacity of Generative Face Models Vishnu Naresh Boddeti, Gautam Sreekumar, Arun Ross There has been tremendous progress in generating realistic faces with high fidelity over the past few years. Despite this progress, a crucial question remains unanswered: "Given a generative face model, how many unique identities can it generate?" In other words, what is the biometric capacity of the generative face model? A scientific basis for answering this question will benefit evaluating and comparing different generative face models and establish an upper bound on their scalability. This paper proposes a statistical approach to estimate the biometric capacity of generated face images in a hyperspherical feature space. We employ our approach on multiple generative models, including unconditional generators like StyleGAN, Latent Diffusion Model, and "Generated Photos," as well as DCFace, a class-conditional generator. We also estimate capacity w.r.t. demographic attributes such as gender and age. Our capacity estimates indicate that (a) under ArcFace representation at a false acceptance rate (FAR) of 0.1%, StyleGAN3 and DCFace have a capacity upper bound of $1.43\times10^6$ and $1.190\times10^4$, respectively; (b) the capacity reduces drastically as we lower the desired FAR with an estimate of $1.796\times10^4$ and $562$ at FAR of 1% and 10%, respectively, for StyleGAN3; (c) there is no discernible disparity in the capacity w.r.t gender; and (d) for some generative models, there is an appreciable disparity in the capacity w.r.t age. Code is available at https://github.com/human-analysis/capacity-generative-face-models. This paper proposes the first statistically robust method for estimating the biometric capacity, or the maximum number of unique identities a generative face model can produce, by analyzing the distribution of generated faces in a hyperspherical feature space. Estimating capacity provides an upper bound on the scalability of generative face models without exhaustive empirical evaluation, allowing for informed deployment and comparison of different models based on the uniqueness of generated identities. The approach involves representing generated faces in a hyperspherical feature space (using face recognition models like ArcFace, AdaFace), approximating population and class-specific manifolds as hyperspherical caps, and calculating capacity as the ratio of their surface areas. StyleGAN3 and DCFace have capacity upper bounds of 1.43 million and 11,900 respectively at a false acceptance rate (FAR) of 0.1%. Capacity decreases drastically with stricter FAR thresholds (e.g., StyleGAN3 capacity drops to 562 at 10% FAR). While capacity remains consistent across genders, some models show disparity in capacity across different age groups. The estimation relies on the assumption that intra-class variance can be approximated from real-face datasets. The approach provides an upper bound, and relaxing assumptions could lead to tighter capacity estimates in future work. generative face models, biometric capacity, hyperspherical feature space, face recognition, diversity and uniqueness
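The capacity estimate described above reduces to a ratio of hyperspherical cap areas. For the unit sphere S^{d-1}, the fraction of surface area within angular radius theta (theta <= pi/2) is (1/2) I_{sin^2 theta}((d-1)/2, 1/2), where I is the regularized incomplete beta function; the sketch below divides the population-cap area by a per-identity cap area. The specific angles and the printed number are made-up illustrations, and the paper's estimator additionally ties the class angle to a target FAR.

```python
import numpy as np
from scipy.special import betainc

def cap_area_fraction(theta, d):
    """Fraction of the unit sphere S^{d-1} within angular radius theta (radians),
    valid for 0 < theta <= pi/2."""
    return 0.5 * betainc((d - 1) / 2.0, 0.5, np.sin(theta) ** 2)

def capacity_estimate(theta_population, theta_class, d=512):
    """Capacity ~ area of the population cap / area of a per-identity cap."""
    return cap_area_fraction(theta_population, d) / cap_area_fraction(theta_class, d)

# Illustrative only: ArcFace-like 512-d features, population cap ~60 degrees,
# per-identity cap ~58 degrees. In high dimensions the estimate is extremely
# sensitive to these two angles (here it lands around 10^4).
print(capacity_estimate(np.deg2rad(60), np.deg2rad(58)))
```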
2308.01944 Report Dynamic Token-Pass Transformers for Semantic Segmentation Yuang Liu, Qiang Zhou, Jing Wang, Fan Wang, Jun Wang, Wei Zhang Vision transformers (ViT) usually extract features via forwarding all the tokens in the self-attention layers from top to toe. In this paper, we introduce dynamic token-pass vision transformers (DoViT) for semantic segmentation, which can adaptively reduce the inference cost for images with different complexity. DoViT gradually stops partial easy tokens from self-attention calculation and keeps the hard tokens forwarding until meeting the stopping criteria. We employ lightweight auxiliary heads to make the token-pass decision and divide the tokens into keeping/stopping parts. With a token separate calculation, the self-attention layers are speeded up with sparse tokens and still work friendly with hardware. A token reconstruction module is built to collect and reset the grouped tokens to their original position in the sequence, which is necessary to predict correct semantic masks. We conduct extensive experiments on two common semantic segmentation tasks, and demonstrate that our method greatly reduces about 40% $\sim$ 60% FLOPs and the drop of mIoU is within 0.8% for various segmentation transformers. The throughput and inference speed of ViT-L/B are increased to more than 2$\times$ on Cityscapes. This paper presents DoViT, a dynamic token-pass vision transformer for semantic segmentation that adaptively reduces inference cost based on image complexity. Current vision transformers, though achieving high performance, are computationally expensive, making them prohibitive for real-time applications and resource-constrained devices. DoViT uses a semantic early-probe scheme to progressively stop easy tokens from self-attention calculation based on prediction confidence. It employs separate self-attention for remaining tokens and reconstructs the token sequence to ensure correct semantic prediction. DoViT reduces FLOPs by 40-60% with less than 0.8% mIoU drop on Cityscapes compared to standard ViT backbones. Throughput and FPS are improved to over 2x on Cityscapes, demonstrating significant speedup. The adaptive token-pass allows for image-specific inference cost, leading to varying levels of computation reduction based on complexity. Smaller networks show less FLOPs reduction due to lower confidence at early-probe stages, especially on challenging datasets like ADE20K. Future work includes combining data-aware acceleration with parameter-aware compression techniques and extending it to other dense prediction tasks. semantic segmentation, vision transformer, model acceleration, dynamic token pass, early-probe
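A sketch of the token-pass decision in DoViT, assuming the simplest confidence rule: a lightweight auxiliary head predicts per-token class probabilities at an intermediate layer, and tokens whose maximum probability exceeds a threshold are "stopped" while the rest keep flowing through later self-attention layers. The threshold value and function names are illustrative.

```python
import torch

@torch.no_grad()
def token_pass_decision(tokens, aux_head, threshold=0.95):
    """Split tokens into keep/stop groups from an auxiliary head's confidence.

    tokens:   [B, N, C] patch tokens at an intermediate layer.
    aux_head: lightweight module mapping C -> num_classes logits per token.
    Returns a boolean mask [B, N] that is True for "hard" tokens that must
    keep passing through the remaining self-attention layers.
    """
    probs = aux_head(tokens).softmax(dim=-1)     # [B, N, K]
    confidence = probs.max(dim=-1).values        # [B, N]
    return confidence < threshold

# In DoViT the stopped tokens are gathered out of the sequence (so later
# self-attention runs on a sparse token set) and a reconstruction module
# scatters every token back to its original position before the final
# segmentation head; only the keep/stop decision is sketched here.
```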
2308.01904 Report DETR Doesn't Need Multi-Scale or Locality Design Yutong Lin, Yuhui Yuan, Zheng Zhang, Chen Li, Nanning Zheng, Han Hu This paper presents an improved DETR detector that maintains a "plain" nature: using a single-scale feature map and global cross-attention calculations without specific locality constraints, in contrast to previous leading DETR-based detectors that reintroduce architectural inductive biases of multi-scale and locality into the decoder. We show that two simple technologies are surprisingly effective within a plain design to compensate for the lack of multi-scale feature maps and locality constraints. The first is a box-to-pixel relative position bias (BoxRPB) term added to the cross-attention formulation, which well guides each query to attend to the corresponding object region while also providing encoding flexibility. The second is masked image modeling (MIM)-based backbone pre-training which helps learn representation with fine-grained localization ability and proves crucial for remedying dependencies on the multi-scale feature maps. By incorporating these technologies and recent advancements in training and problem formation, the improved "plain" DETR showed exceptional improvements over the original DETR detector. By leveraging the Object365 dataset for pre-training, it achieved 63.9 mAP accuracy using a Swin-L backbone, which is highly competitive with state-of-the-art detectors which all heavily rely on multi-scale feature maps and region-based feature extraction. Code is available at https://github.com/impiga/Plain-DETR . This paper proposes an improved DETR detector that maintains a "plain" nature by using a single-scale feature map and global cross-attention calculations without specific locality constraints, unlike previous DETR-based detectors. The paper aims to improve upon the original DETR detector while preserving its simplicity and reducing reliance on domain-specific architectural biases. The paper introduces two key technologies: 1) Box-to-pixel relative position bias (BoxRPB) to guide cross-attention computation, and 2) Masked image modeling (MIM) pre-training to enhance feature representation with fine-grained localization. BoxRPB significantly improves detection accuracy by +8.9 mAP over the plain DETR baseline. MIM pre-training further boosts performance by +7.4 mAP and enables the removal of multi-scale feature maps. The improved plain DETR achieves 63.9 mAP with a Swin-L backbone, making it competitive with state-of-the-art detectors. The paper primarily focuses on object detection, and its generalizability to other vision tasks needs further exploration. Further investigation into the interplay between BoxRPB and MIM pre-training is needed. object detection, detr, transformer, relative position bias, masked image modeling
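A minimal version of the box-to-pixel relative position bias (BoxRPB) idea: for each query's predicted box and each key pixel, encode the signed offsets from the pixel to the box's four edges with a small MLP and add the result to the cross-attention logits per head. The paper's exact parameterization (e.g. axis-decomposed terms and offset scaling) may differ; this module and its hidden size are a sketch.

```python
import torch
import torch.nn as nn

class BoxRPB(nn.Module):
    """Minimal box-to-pixel relative position bias for DETR cross-attention."""

    def __init__(self, num_heads, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, num_heads)
        )

    def forward(self, boxes, pixel_xy):
        """boxes: [Q, 4] as (x1, y1, x2, y2); pixel_xy: [P, 2] key coordinates.
        Returns a bias of shape [num_heads, Q, P] to add to attention logits."""
        px, py = pixel_xy[:, 0], pixel_xy[:, 1]          # [P]
        dx1 = px[None, :] - boxes[:, 0:1]                # [Q, P] offsets to each edge
        dy1 = py[None, :] - boxes[:, 1:2]
        dx2 = px[None, :] - boxes[:, 2:3]
        dy2 = py[None, :] - boxes[:, 3:4]
        rel = torch.stack([dx1, dy1, dx2, dy2], dim=-1)  # [Q, P, 4]
        bias = self.mlp(rel)                             # [Q, P, num_heads]
        return bias.permute(2, 0, 1)                     # [num_heads, Q, P]
```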
2308.01779 Report Point2Mask: Point-supervised Panoptic Segmentation via Optimal Transport Wentong Li, Yuqian Yuan, Song Wang, Jianke Zhu, Jianshu Li, Jian Liu, Lei Zhang Weakly-supervised image segmentation has recently attracted increasing research attentions, aiming to avoid the expensive pixel-wise labeling. In this paper, we present an effective method, namely Point2Mask, to achieve high-quality panoptic prediction using only a single random point annotation per target for training. Specifically, we formulate the panoptic pseudo-mask generation as an Optimal Transport (OT) problem, where each ground-truth (gt) point label and pixel sample are defined as the label supplier and consumer, respectively. The transportation cost is calculated by the introduced task-oriented maps, which focus on the category-wise and instance-wise differences among the various thing and stuff targets. Furthermore, a centroid-based scheme is proposed to set the accurate unit number for each gt point supplier. Hence, the pseudo-mask generation is converted into finding the optimal transport plan at a globally minimal transportation cost, which can be solved via the Sinkhorn-Knopp Iteration. Experimental results on Pascal VOC and COCO demonstrate the promising performance of our proposed Point2Mask approach to point-supervised panoptic segmentation. Source code is available at: https://github.com/LiWentomng/Point2Mask. Presents Point2Mask, a novel weakly supervised panoptic segmentation method that leverages Optimal Transport (OT) to generate pseudo-labels from single point annotations. Addresses the limitations of existing weakly supervised methods that struggle to accurately segment objects using only point supervision, particularly in differentiating between nearby instances of the same category. 1. **Feature Learning:** Employs a two-branch network to extract category- and instance-level representations. 2. **OT-based Pseudo-label Generation:** Formulates an OT problem to assign pixels to ground truth labels based on a cost function that considers semantic and boundary information. 3. **Training:** Trains a panoptic segmentation model using the generated pseudo-labels. Achieves state-of-the-art performance on Pascal VOC and COCO datasets using single-point supervision. Demonstrates superior performance compared to previous weakly supervised methods, particularly in distinguishing nearby instances. Showcases the effectiveness of OT in assigning pixels for accurate pseudo-label generation. May not perform well on dense objects of the same category due to reliance on single-point annotation. Relies on a relatively simple segmentation architecture which could limit performance on complex scenes. panoptic segmentation, weakly supervised learning, optimal transport, point annotation, pseudo-label
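The pseudo-mask assignment in Point2Mask is solved with Sinkhorn-Knopp iterations on an entropic optimal transport problem. Below is a generic Sinkhorn-Knopp implementation; the task-oriented cost maps and the centroid-based unit numbers from the paper are not reproduced, they simply enter as the `cost`, `supply`, and `demand` inputs (which must carry equal total mass).

```python
import torch

def sinkhorn_knopp(cost, supply, demand, eps=0.05, n_iters=50):
    """Entropic optimal transport via Sinkhorn-Knopp.

    cost:   [M, N] transport cost between M gt point labels and N pixels
            (in Point2Mask built from semantic and boundary maps).
    supply: [M] mass each gt point label can supply.
    demand: [N] mass each pixel must receive (sum(supply) == sum(demand)).
    Returns the transport plan T of shape [M, N]; assigning each pixel to the
    argmax over rows of T yields the panoptic pseudo-mask.
    """
    K = torch.exp(-cost / eps)                    # Gibbs kernel
    u = torch.ones_like(supply)
    v = torch.ones_like(demand)
    for _ in range(n_iters):
        u = supply / (K @ v).clamp_min(1e-8)
        v = demand / (K.t() @ u).clamp_min(1e-8)
    return u[:, None] * K * v[None, :]
```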
2308.01766 Report Neural Poisson Surface Reconstruction: Resolution-Agnostic Shape Reconstruction from Point Clouds Hector Andrade-Loarca, Julius Hege, Daniel Cremers, Gitta Kutyniok We introduce Neural Poisson Surface Reconstruction (nPSR), an architecture for shape reconstruction that addresses the challenge of recovering 3D shapes from points. Traditional deep neural networks face challenges with common 3D shape discretization techniques due to their computational complexity at higher resolutions. To overcome this, we leverage Fourier Neural Operators to solve the Poisson equation and reconstruct a mesh from oriented point cloud measurements. nPSR exhibits two main advantages: First, it enables efficient training on low-resolution data while achieving comparable performance at high-resolution evaluation, thanks to the resolution-agnostic nature of FNOs. This feature allows for one-shot super-resolution. Second, our method surpasses existing approaches in reconstruction quality while being differentiable and robust with respect to point sampling rates. Overall, the neural Poisson surface reconstruction not only improves upon the limitations of classical deep neural networks in shape reconstruction but also achieves superior results in terms of reconstruction quality, running time, and resolution agnosticism. Introduces Neural Poisson Surface Reconstruction, a novel architecture using Fourier Neural Operators for reconstructing 3D shapes from oriented point clouds by solving the Poisson equation. Addresses limitations of traditional deep learning methods in 3D shape reconstruction, particularly in handling high resolutions and low sampling rates. Leverages Fourier Neural Operators to learn a mapping from a point cloud rasterized to a voxel grid representation of the divergence of the normal field, to the reconstructed shape. Employs Otsu's thresholding and marching cubes for post-processing. Significantly outperforms existing methods in low sampling scenarios (3,000-25,000 points). Achieves comparable performance to state-of-the-art in high sampling regimes (250,000 points). Exhibits resolution agnosticism, enabling training on low-resolution data and evaluating on higher resolutions with similar performance. Requires pre-determined resolution for training data, potentially leading to loss of detail. Further exploration of alternative architectures and regularization techniques for optimization. 3d shape reconstruction, point cloud processing, fourier neural operator, poisson surface reconstruction, resolution agnostic
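For context on what the Fourier Neural Operator in nPSR learns to approximate: classical Poisson surface reconstruction solves laplacian(chi) = div(V) for an indicator-like field chi, given the divergence of the oriented normal field V. On a periodic voxel grid this has a closed-form spectral solution, sketched below; nPSR replaces this fixed inverse with a learned, resolution-agnostic operator, so the snippet illustrates the underlying equation rather than the paper's network.

```python
import numpy as np

def poisson_solve_fft(div_v):
    """Solve laplacian(chi) = div_v on a periodic R x R x R voxel grid via FFT.

    div_v: divergence of the rasterized oriented normal field.
    Returns chi, whose level set can then be meshed with marching cubes.
    """
    r = div_v.shape[0]
    k = np.fft.fftfreq(r) * 2 * np.pi
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    denom = -(kx ** 2 + ky ** 2 + kz ** 2)
    denom[0, 0, 0] = 1.0                    # avoid division by zero at the DC term
    chi_hat = np.fft.fftn(div_v) / denom
    chi_hat[0, 0, 0] = 0.0                  # fix the free additive constant
    return np.real(np.fft.ifftn(chi_hat))
```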
2308.01544 Report Multimodal Neurons in Pretrained Text-Only Transformers Sarah Schwettmann, Neil Chowdhury, Samuel Klein, David Bau, Antonio Torralba Language models demonstrate remarkable capacity to generalize representations learned in one modality to downstream tasks in other modalities. Can we trace this ability to individual neurons? We study the case where a frozen text transformer is augmented with vision using a self-supervised visual encoder and a single linear projection learned on an image-to-text task. Outputs of the projection layer are not immediately decodable into language describing image content; instead, we find that translation between modalities occurs deeper within the transformer. We introduce a procedure for identifying "multimodal neurons" that convert visual representations into corresponding text, and decoding the concepts they inject into the model's residual stream. In a series of experiments, we show that multimodal neurons operate on specific visual concepts across inputs, and have a systematic causal effect on image captioning. The paper investigates the emergence of "multimodal neurons" within the MLP layers of a frozen text transformer (GPT-J) augmented with a self-supervised visual encoder (BEIT) for image captioning. This work aims to understand how language models, trained solely on text, demonstrate cross-modal generalization abilities when combined with visual encoders. The authors introduce a gradient-based attribution method to identify neurons that significantly contribute to predicting specific nouns in image captions. They decode the language contributions of these neurons by analyzing the corresponding output embedding weights. Image representations projected into the transformer's embedding space do not directly encode interpretable semantic information, implying that cross-modal translation happens within the transformer. Multimodal neurons, found predominantly in earlier transformer layers, exhibit selectivity for specific visual concepts and consistently translate them into related text. Ablating these multimodal neurons significantly alters the generated captions, demonstrating their causal role in translating visual information into language. The study focuses on a single vision-language model (LiMBeR-BEIT) with separate vision and language components. Future work should investigate the presence of multimodal neurons in other architectures. While the authors demonstrate the existence and influence of multimodal neurons, a deeper understanding of their formation and how they assemble concepts from upstream representations is needed. multimodal learning, vision-language models, transformer networks, neuron interpretability, cross-modal generalization
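One common form of the gradient-based attribution used to rank candidate multimodal neurons is activation times gradient of the target token's logit, summed over the image-prompt positions; the sketch below assumes that form, which may differ in detail from the paper's exact score.

```python
import torch

def neuron_attributions(activations, token_logit):
    """Activation x gradient attribution for a single target token.

    activations: [T, H] hidden activations of one MLP layer (T image-prompt
                 positions, H units), captured from the forward pass and
                 still attached to the autograd graph.
    token_logit: scalar logit of the target noun at its prediction position.
    Returns an [H] score per unit; the top-scoring units are candidate
    "multimodal neurons" whose output-embedding decodings can be inspected.
    """
    grads = torch.autograd.grad(token_logit, activations, retain_graph=True)[0]
    return (activations * grads).sum(dim=0)
```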
2308.01536 Report MFIM: Megapixel Facial Identity Manipulation Sanghyeon Na Face swapping is a task that changes a facial identity of a given image to that of another person. In this work, we propose a novel face-swapping framework called Megapixel Facial Identity Manipulation (MFIM). The face-swapping model should achieve two goals. First, it should be able to generate a high-quality image. We argue that a model which is proficient in generating a megapixel image can achieve this goal. However, generating a megapixel image is generally difficult without careful model design. Therefore, our model exploits pretrained StyleGAN in the manner of GAN-inversion to effectively generate a megapixel image. Second, it should be able to effectively transform the identity of a given image. Specifically, it should be able to actively transform ID attributes (e.g., face shape and eyes) of a given image into those of another person, while preserving ID-irrelevant attributes (e.g., pose and expression). To achieve this goal, we exploit 3DMM that can capture various facial attributes. Specifically, we explicitly supervise our model to generate a face-swapped image with the desirable attributes using 3DMM. We show that our model achieves state-of-the-art performance through extensive experiments. Furthermore, we propose a new operation called ID mixing, which creates a new identity by semantically mixing the identities of several people. It allows the user to customize the new identity. This paper presents MFIM, a novel face-swapping framework that generates high-quality megapixel face-swapped images and effectively performs identity transformation. Face swapping has applications in entertainment, privacy protection, and the theatrical industry, making high-quality and effective face swapping techniques increasingly important. MFIM utilizes a pretrained StyleGAN generator and a facial attribute encoder to generate images. It leverages 3DMM for explicit supervision during training to ensure effective identity transformation, particularly in face shape. It introduces a novel ID mixing operation, creating new identities by combining attributes from multiple source images. MFIM achieves state-of-the-art performance on face swapping benchmarks, outperforming baselines in identity, shape, expression, and pose metrics. The use of style maps in the encoder allows MFIM to preserve details from the target image, addressing limitations of previous StyleGAN-based methods. The ID mixing operation enables semantic control over identity creation, blending global and local attributes from multiple source images without requiring additional training or labels. The disentanglement of ID and ID-irrelevant representations in MFIM can be further improved to prevent attribute leakage. Investigating the application of MFIM to high-frequency detail reconstruction, potentially through techniques like ROI-only synthesis, is a promising direction for future work. face swapping, gan inversion, stylegan, 3dmm, identity mixing
2308.01532 Report Multimodal Adaptation of CLIP for Few-Shot Action Recognition Jiazheng Xing, Mengmeng Wang, Xiaojun Hou, Guang Dai, Jingdong Wang, Yong Liu Applying large-scale pre-trained visual models like CLIP to few-shot action recognition tasks can benefit performance and efficiency. Utilizing the "pre-training, fine-tuning" paradigm makes it possible to avoid training a network from scratch, which can be time-consuming and resource-intensive. However, this method has two drawbacks. First, limited labeled samples for few-shot action recognition necessitate minimizing the number of tunable parameters to mitigate over-fitting, also leading to inadequate fine-tuning that increases resource consumption and may disrupt the generalized representation of models. Second, the video's extra-temporal dimension challenges few-shot recognition's effective temporal modeling, while pre-trained visual models are usually image models. This paper proposes a novel method called Multimodal Adaptation of CLIP (MA-CLIP) to address these issues. It adapts CLIP for few-shot action recognition by adding lightweight adapters, which can minimize the number of learnable parameters and enable the model to transfer across different tasks quickly. The adapters we design can combine information from video-text multimodal sources for task-oriented spatiotemporal modeling, which is fast, efficient, and has low training costs. Additionally, based on the attention mechanism, we design a text-guided prototype construction module that can fully utilize video-text information to enhance the representation of video prototypes. Our MA-CLIP is plug-and-play, which can be used in any different few-shot action recognition temporal alignment metric. This paper proposes MA-CLIP, a novel method that adapts the CLIP model for few-shot action recognition by incorporating lightweight adapters and a text-guided prototype construction module. Few-shot action recognition benefits from large pre-trained models but suffers from overfitting with limited data and difficulty in effective temporal modeling. MA-CLIP addresses these issues by using adapters to minimize trainable parameters and enable quick task transfer while enhancing temporal modeling with minimal extra parameters. MA-CLIP freezes the pre-trained CLIP encoders and inserts lightweight adapters for task-specific spatiotemporal modeling. These adapters leverage multimodal information from video and text. A text-guided prototype construction module, based on attention, enhances video prototype representations. MA-CLIP is designed to be compatible with any temporal alignment metric used in few-shot action recognition. MA-CLIP achieves state-of-the-art performance on five widely used datasets for few-shot action recognition, surpassing previous methods in accuracy. The use of adapters allows for significant reduction in trainable parameters and training time compared to full fine-tuning of the visual encoder, making it more efficient. Experiments demonstrate that incorporating text information significantly boosts performance, highlighting the importance of multimodal learning for this task. The performance improvement from CLIP pre-training is less significant for datasets where temporal information is crucial. Future work could explore different adapter architectures or integrate other parameter-efficient fine-tuning techniques. few-shot action recognition, clip, multimodal learning, parameter-efficient fine-tuning, adapters
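For reference, the lightweight adapters mentioned above are variants of the standard bottleneck adapter: a down-projection, nonlinearity, and up-projection with a residual connection, trained while the surrounding pretrained block stays frozen. MA-CLIP's specific placement and video-text fusion differ; this is only the generic building block.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic lightweight adapter; only these parameters receive gradients
    while the pretrained CLIP block around it stays frozen."""

    def __init__(self, dim, reduction=4):
        super().__init__()
        hidden = dim // reduction
        self.down = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.up = nn.Linear(hidden, dim)
        nn.init.zeros_(self.up.weight)   # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

# Typical use: freeze the CLIP encoders, wrap each transformer block so its
# output passes through an adapter, and train only the adapter and the
# text-guided prototype-construction parameters.
```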
2308.01499 Report TDMD: A Database for Dynamic Color Mesh Subjective and Objective Quality Explorations Qi Yang, Joel Jung, Timon Deschamps, Xiaozhong Xu, Shan Liu Dynamic colored meshes (DCM) are widely used in various applications; however, these meshes may undergo different processes, such as compression or transmission, which can distort them and degrade their quality. To facilitate the development of objective metrics for DCMs and study the influence of typical distortions on their perception, we create the Tencent - dynamic colored mesh database (TDMD) containing eight reference DCM objects with six typical distortions. Using processed video sequences (PVS) derived from the DCM, we have conducted a large-scale subjective experiment that resulted in 303 distorted DCM samples with mean opinion scores, making the TDMD the largest available DCM database to our knowledge. This database enabled us to study the impact of different types of distortion on human perception and offer recommendations for DCM compression and related tasks. Additionally, we have evaluated three types of state-of-the-art objective metrics on the TDMD, including image-based, point-based, and video-based metrics. Our experimental results highlight the strengths and weaknesses of each metric, and we provide suggestions about the selection of metrics in practical DCM applications. The TDMD will be made publicly available at the following location: https://multimedia.tencent.com/resources/tdmd. This paper introduces TDMD, a new database for Dynamic Colored Mesh (DCM) quality assessment. It contains 8 reference DCMs and 6 types of distortions (color noise, texture map downsampling, geometrical Gaussian noise, mesh decimation, MPEG lossy compression, and texture map compression) at various severity levels, totaling 303 distorted samples with MOS obtained via subjective experiments. Existing mesh quality assessment work mainly focuses on static, often non-colored meshes. However, DCMs are increasingly used, demanding dedicated quality assessment tools and an understanding of how distortions impact human perception. The researchers applied distortions to reference DCMs, converted them into processed video sequences (PVSs) using a predefined camera path, and conducted subjective experiments to obtain MOS. They then evaluated the performance of three types of objective metrics (image-based, point-based, and video-based) on TDMD. The impact of mesh decimation and texture map compression on perceived quality is limited at the tested levels. Point-based metric PCQM_p and video-based metric MS-SSIM achieve the best performance in predicting DCM quality. Sampling resolution and method impact the performance of point-based metrics, with denser sampling generally leading to higher accuracy. The study only considers a 2D monitor viewing environment, while VR viewing might yield different results. Further research is needed to explore optimal camera paths for PVS generation, as different DCM content might have varying regions of interest. dynamic mesh quality assessment, subjective experiment, database, objective metric, point cloud
2308.01472 Report Reverse Stable Diffusion: What prompt was used to generate this image? Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, Mubarak Shah Text-to-image diffusion models such as Stable Diffusion have recently attracted the interest of many researchers, and inverting the diffusion process can play an important role in better understanding the generative process and how to engineer prompts in order to obtain the desired images. To this end, we introduce the new task of predicting the text prompt given an image generated by a generative diffusion model. We combine a series of white-box and black-box models (with and without access to the weights of the diffusion network) to deal with the proposed task. We propose a novel learning framework comprising of a joint prompt regression and multi-label vocabulary classification objective that generates improved prompts. To further improve our method, we employ a curriculum learning procedure that promotes the learning of image-prompt pairs with lower labeling noise (i.e. that are better aligned), and an unsupervised domain-adaptive kernel learning method that uses the similarities between samples in the source and target domains as extra features. We conduct experiments on the DiffusionDB data set, predicting text prompts from images generated by Stable Diffusion. Our novel learning framework produces excellent results on the aforementioned task, yielding the highest gains when applied on the white-box model. In addition, we make an interesting discovery: training a diffusion model on the prompt generation task can make the model generate images that are much better aligned with the input prompts, when the model is directly reused for text-to-image generation. This paper introduces the novel task of predicting the text prompt embedding given an image generated by a text-to-image diffusion model, aiming to reverse the generative process and better understand prompt engineering. Understanding the image-to-text mapping in diffusion models is crucial for improving prompt design, understanding the generative process, and potentially enhancing image generation quality. The paper proposes a learning framework that combines white-box and black-box models, incorporating three novel components: a joint prompt regression and multi-label vocabulary classification objective, a curriculum learning procedure for handling noisy labels, and a domain-adaptive kernel learning (DAKL) method for leveraging target domain information. The proposed learning framework, particularly the classification head and curriculum learning, consistently improves the performance across different image encoders. The joint framework, combining embeddings from multiple models, outperforms individual models, with DAKL further enhancing performance. Training a diffusion model on the prompt generation task leads to generating images better aligned with the input prompts, showcasing a promising application for improving text-to-image generation quality. The paper primarily focuses on Stable Diffusion and a single dataset (DiffusionDB), potentially limiting generalizability. The computational cost of DAKL, despite using k-means for efficiency, can still be a concern for larger datasets. diffusion models, text-to-image generation, image-to-text generation, prompt engineering, curriculum learning
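A minimal sketch of the joint objective described for prompt prediction, under stated assumptions: the model regresses the sentence-level prompt embedding (cosine-distance term) and simultaneously predicts which vocabulary words occur in the prompt as a multi-label classification (BCE term). The weighting `w`, the cosine form of the regression, and the names are illustrative.

```python
import torch
import torch.nn.functional as F

def prompt_generation_loss(pred_emb, target_emb, vocab_logits, vocab_targets, w=1.0):
    """Joint prompt-embedding regression + multi-label vocabulary classification.

    pred_emb / target_emb: [B, d] predicted vs. ground-truth prompt embeddings
                           (e.g. from a frozen sentence encoder).
    vocab_logits:          [B, V] per-word logits over a fixed vocabulary.
    vocab_targets:         [B, V] multi-hot vector of words present in the prompt.
    """
    reg = 1.0 - F.cosine_similarity(pred_emb, target_emb, dim=-1).mean()
    cls = F.binary_cross_entropy_with_logits(vocab_logits, vocab_targets.float())
    return reg + w * cls
```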
2308.01390 Report OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, Ludwig Schmidt We introduce OpenFlamingo, a family of autoregressive vision-language models ranging from 3B to 9B parameters. OpenFlamingo is an ongoing effort to produce an open-source replication of DeepMind's Flamingo models. On seven vision-language datasets, OpenFlamingo models average between 80 - 89% of corresponding Flamingo performance. This technical report describes our models, training data, hyperparameters, and evaluation suite. We share our models and code at https://github.com/mlfoundations/open_flamingo. Introduced OpenFlamingo, a family of open-source autoregressive vision-language models with 3B to 9B parameters, replicating DeepMind's Flamingo models. Addresses the lack of open-source alternatives to closed-source autoregressive vision-language models, enabling research on their capabilities and safety. Trained on LAION-2B and Multimodal C4 datasets, using CLIP as the vision encoder and MPT or RedPajama language models as decoders. OpenFlamingo models achieve 80-89% of corresponding Flamingo models' performance on seven vision-language datasets. Performance generally improves with more in-context examples but at a lower rate than Flamingo. OpenFlamingo models exhibit limitations in visual question answering, particularly in counting, answer verbosity, and handling non-central objects. Limited performance with many in-context examples potentially due to few images in training sequences. Unexpected performance degradation in 4B models with frozen image and end-of-chunk embeddings. vision-language models, open-source, flamingo, in-context learning, multimodal
2308.01379 Report Computational Long Exposure Mobile Photography Eric Tabellion, Nikhil Karnad, Noa Glaser, Ben Weiss, David E. Jacobs, Yael Pritch Long exposure photography produces stunning imagery, representing moving elements in a scene with motion-blur. It is generally employed in two modalities, producing either a foreground or a background blur effect. Foreground blur images are traditionally captured on a tripod-mounted camera and portray blurred moving foreground elements, such as silky water or light trails, over a perfectly sharp background landscape. Background blur images, also called panning photography, are captured while the camera is tracking a moving subject, to produce an image of a sharp subject over a background blurred by relative motion. Both techniques are notoriously challenging and require additional equipment and advanced skills. In this paper, we describe a computational burst photography system that operates in a hand-held smartphone camera app, and achieves these effects fully automatically, at the tap of the shutter button. Our approach first detects and segments the salient subject. We track the scene motion over multiple frames and align the images in order to preserve desired sharpness and to produce aesthetically pleasing motion streaks. We capture an under-exposed burst and select the subset of input frames that will produce blur trails of controlled length, regardless of scene or camera motion velocity. We predict inter-frame motion and synthesize motion-blur to fill the temporal gaps between the input frames. Finally, we composite the blurred image with the sharp regular exposure to protect the sharpness of faces or areas of the scene that are barely moving, and produce a final high resolution and high dynamic range (HDR) photograph. Our system democratizes a capability previously reserved to professionals, and makes this creative style accessible to most casual photographers. More information and supplementary material can be found on our project webpage: https://motion-mode.github.io/ This paper presents a computational burst photography system for smartphones that automatically produces long exposure effects, with either blurred foregrounds or backgrounds, by compensating for camera and subject motion. Long exposure photography, traditionally requiring tripods, filters, and advanced skills, is made accessible to casual photographers through this system. The system analyzes scene motion, detects and tracks subjects, aligns images for desired sharpness, predicts motion for blur synthesis, and composites with a sharp exposure for optimal results. The system effectively synthesizes long exposure effects in both foreground and background blur modes, as demonstrated by examples. A novel background blur alignment technique using temporal regularization produces aesthetically pleasing, consistent motion blur trails. A simplified motion prediction model, designed for mobile efficiency, achieves comparable quality to more complex models. Background blur for very small subjects can lead to misalignments due to prediction and tracking errors. Large motion disparities exceeding the model's receptive field can cause artifacts, limiting the system's ability to handle all motion magnitudes. computational photography, long exposure, motion blur, mobile photography, computer vision
2308.01316 Report Patched Denoising Diffusion Models For High-Resolution Image Synthesis Zheng Ding, Mengqi Zhang, Jiajun Wu, Zhuowen Tu We propose an effective denoising diffusion model for generating high-resolution images (e.g., 1024$\times$512), trained on small-size image patches (e.g., 64$\times$64). We name our algorithm Patch-DM, in which a new feature collage strategy is designed to avoid the boundary artifact when synthesizing large-size images. Feature collage systematically crops and combines partial features of the neighboring patches to predict the features of a shifted image patch, allowing the seamless generation of the entire image due to the overlap in the patch feature space. Patch-DM produces high-quality image synthesis results on our newly collected dataset of nature images (1024$\times$512), as well as on standard benchmarks of smaller sizes (256$\times$256), including LSUN-Bedroom, LSUN-Church, and FFHQ. We compare our method with previous patch-based generation methods and achieve state-of-the-art FID scores on all four datasets. Further, Patch-DM also reduces memory complexity compared to the classic diffusion models. Proposes Patch-DM, a patch-based denoising diffusion model for high-resolution image synthesis, using a novel feature collage strategy to avoid boundary artifacts. Addresses limitations of current diffusion models in high-resolution image generation due to high computational costs and memory requirements. Trains a patch-level denoising U-Net model with a feature collage strategy, where features from neighboring patches are combined to predict shifted patches, ensuring consistency. Achieves state-of-the-art FID scores on a newly collected dataset of 1024x512 natural images and standard benchmarks (LSUN-Bedroom, LSUN-Church, FFHQ). Generates high-quality images with minimal boundary artifacts despite being patch-based. Reduces memory complexity compared to classic diffusion models due to the patch-level representation. Loss of some detailed image information when downsampling for global condition extraction using pre-trained encoders. Limited exploration of patch sizes beyond 64x64. image synthesis, denoising diffusion models, high-resolution images, patch-based generation, feature collage
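A toy illustration of the feature collage idea, assuming the simplest horizontal case: the features for a patch shifted by half a patch width are stitched from the right half of the left neighbor's features and the left half of the right neighbor's features, so neighboring patches overlap in feature space. Patch-DM applies this in both axes and at several U-Net levels; the function here is only a sketch.

```python
import torch

def feature_collage_horizontal(feat_left, feat_right):
    """Collage features of two horizontally adjacent patches.

    feat_left, feat_right: [B, C, H, W] U-Net features of neighboring patches.
    Returns features aligned with the patch shifted right by W/2.
    """
    w = feat_left.shape[-1]
    return torch.cat([feat_left[..., w // 2:], feat_right[..., : w // 2]], dim=-1)
```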
2308.01300 Report Revisiting DETR Pre-training for Object Detection Yan Ma, Weicong Liang, Bohan Chen, Yiduo Hao, Bojian Hou, Xiangyu Yue, Chao Zhang, Yuhui Yuan Motivated by the remarkable achievements of DETR-based approaches on COCO object detection and segmentation benchmarks, recent endeavors have been directed towards elevating their performance through self-supervised pre-training of Transformers while preserving a frozen backbone. Noteworthy advancements in accuracy have been documented in certain studies. Our investigation delved deeply into a representative approach, DETReg, and its performance assessment in the context of emerging models like $\mathcal{H}$-Deformable-DETR. Regrettably, DETReg proves inadequate in enhancing the performance of robust DETR-based models under full data conditions. To dissect the underlying causes, we conduct extensive experiments on COCO and PASCAL VOC probing elements such as the selection of pre-training datasets and strategies for pre-training target generation. By contrast, we employ an optimized approach named Simple Self-training which leads to marked enhancements through the combination of an improved box predictor and the Objects$365$ benchmark. The culmination of these endeavors results in a remarkable AP score of $59.3\%$ on the COCO val set, outperforming $\mathcal{H}$-Deformable-DETR + Swin-L without pre-training by $1.4\%$. Moreover, a series of synthetic pre-training datasets, generated by merging contemporary image-to-text (LLaVA) and text-to-image (SDXL) models, significantly amplifies object detection capabilities. This paper revisits self-supervised pre-training for DETR object detection models, finding existing methods ineffective for stronger DETR variants and proposing a Simple Self-training scheme with improved pre-training targets. Pre-training the Transformer components of DETR models is crucial to fully realize their potential and enhance object detection performance. The paper investigates the limitations of DETReg, proposes using pseudo-boxes and pseudo-class predictions as pre-training targets, and explores using synthetic datasets generated by text-to-image models. Simple Self-training significantly outperforms DETReg and achieves competitive results on COCO (59.3% AP). Accurate pseudo-box targets are more crucial than classification targets for effective pre-training. Pre-training with synthetic datasets generated from text-to-image models shows promising results, comparable to using real data (Objects365). The study primarily focuses on object detection, leaving extensions to other vision tasks for future work. Further exploration of larger batch sizes and longer training schedules for pre-training is necessary. object detection, detr, self-supervised learning, pre-training, synthetic data
2308.01140 Report Dynamically Scaled Temperature in Self-Supervised Contrastive Learning Siladittya Manna, Soumitri Chattopadhyay, Rakesh Dey, Saumik Bhattacharya, Umapada Pal In contemporary self-supervised contrastive algorithms like SimCLR, MoCo, etc., the task of balancing attraction between two semantically similar samples and repulsion between two samples of different classes is primarily affected by the presence of hard negative samples. While the InfoNCE loss has been shown to impose penalties based on hardness, the temperature hyper-parameter is the key to regulating the penalties and the trade-off between uniformity and tolerance. In this work, we focus our attention on improving the performance of InfoNCE loss in self-supervised learning by proposing a novel cosine similarity dependent temperature scaling function to effectively optimize the distribution of the samples in the feature space. We also provide mathematical analyses to support the construction of such a dynamically scaled temperature function. Experimental evidence shows that the proposed framework outperforms the contrastive loss-based SSL algorithms. The paper proposes DySTreSS, a novel self-supervised contrastive learning framework that dynamically scales the temperature parameter in the InfoNCE loss based on cosine similarity. The temperature parameter in InfoNCE loss significantly impacts the trade-off between uniformity and tolerance in feature representation. Dynamically scaling it helps to better optimize this trade-off and improve representation learning. The authors theoretically analyze the effect of temperature on local and global feature structures, deriving criteria for a suitable temperature scaling function. They propose a cosine-based function that satisfies these criteria and apply it to the SimCLR framework. DySTreSS outperforms state-of-the-art SSL methods like SimCLR, MoCov2, and DCL on linear evaluation benchmarks including ImageNet and CIFAR. The proposed method also shows superior performance on transfer learning tasks for both image and text modalities. Ablation studies validate the effectiveness of the chosen temperature function and its impact on uniformity, tolerance, and overall accuracy. The paper primarily focuses on cosine similarity-based temperature scaling and its effectiveness on other similarity measures is not explored. The impact of dynamic temperature scaling on computational overhead, specifically for large-scale datasets and models, is not discussed. self-supervised learning, contrastive learning, infonce loss, temperature scaling, representation learning
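As a concrete illustration of the idea above — replacing the fixed InfoNCE temperature with one that depends on pairwise cosine similarity — the PyTorch sketch below uses a hypothetical cosine-shaped mapping from similarity to temperature. The function `dynamic_temperature`, its bounds, and the two-view batch layout are illustrative assumptions, not the paper's exact DySTreSS formulation.

```python
import torch
import torch.nn.functional as F

def dynamic_temperature(sim, tau_min=0.07, tau_max=0.2):
    # Hypothetical scaling: map cosine similarity in [-1, 1] to a temperature
    # in [tau_min, tau_max]; the paper derives its own cosine-based function.
    return tau_min + (tau_max - tau_min) * (1.0 + torch.cos(torch.pi * (sim + 1) / 2)) / 2

def similarity_scaled_infonce(z1, z2):
    """InfoNCE over two augmented views with a per-pair temperature."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)          # (2N, d)
    sim = z @ z.t()                                             # pairwise cosine similarities
    n = z1.size(0)
    pos_idx = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])  # i <-> i+n positives
    logits = sim / dynamic_temperature(sim)                     # element-wise temperature
    logits.fill_diagonal_(float('-inf'))                        # drop self-similarity
    return F.cross_entropy(logits, pos_idx)
```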
2308.01045 Report Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation Quan Tang, Bowen Zhang, Jiajun Liu, Fagui Liu, Yifan Liu Vision transformers have achieved leading performance on various visual tasks yet still suffer from high computational complexity. The situation deteriorates in dense prediction tasks like semantic segmentation, as high-resolution inputs and outputs usually imply more tokens involved in computations. Directly removing the less attentive tokens has been discussed for the image classification task but can not be extended to semantic segmentation since a dense prediction is required for every patch. To this end, this work introduces a Dynamic Token Pruning (DToP) method based on the early exit of tokens for semantic segmentation. Motivated by the coarse-to-fine segmentation process by humans, we naturally split the widely adopted auxiliary-loss-based network architecture into several stages, where each auxiliary block grades every token's difficulty level. We can finalize the prediction of easy tokens in advance without completing the entire forward pass. Moreover, we keep $k$ highest confidence tokens for each semantic category to uphold the representative context information. Thus, computational complexity will change with the difficulty of the input, akin to the way humans do segmentation. Experiments suggest that the proposed DToP architecture reduces on average $20\% - 35\%$ of computational cost for current semantic segmentation methods based on plain vision transformers without accuracy degradation. This paper introduces Dynamic Token Pruning (DToP), a method for reducing computational cost in vision transformers for semantic segmentation by allowing early exit of easy-to-recognize tokens. Vision transformers achieve high performance but suffer from heavy computational overhead, especially in dense prediction tasks like semantic segmentation where high-resolution images generate numerous tokens. DToP divides the network into stages using inherent auxiliary blocks. It grades token difficulty at each stage, finalizing predictions for easy tokens and pruning them from further computation, while harder tokens proceed to subsequent stages. DToP reduces computational cost by 20-35% on average without sacrificing accuracy on benchmarks like ADE20K, Pascal Context, and COCO-Stuff-10K. The method effectively allocates computation by pruning more tokens in simple images and fewer in complex ones. Keeping the 'k' most confident tokens for each semantic category during pruning helps retain contextual information and improves performance. DToP, like other dynamic networks, faces limitations in fully utilizing mini-batch computation efficiency. Future work includes optimizing DToP to further expedite vision transformers. semantic segmentation, vision transformer, token pruning, computational efficiency, dynamic network
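The pruning decision described above can be sketched as follows: an auxiliary head grades every token, confidently predicted tokens are finalized early, and the k most confident tokens of each class are retained as context for later stages. The threshold, k, and the use of max softmax probability as the difficulty score are assumptions for illustration, not the exact DToP configuration.

```python
import torch

def dtop_prune_step(aux_logits, threshold=0.9, k=5):
    """One schematic pruning step: decide which tokens continue to deeper stages.

    aux_logits: (num_tokens, num_classes) predictions from this stage's auxiliary head.
    Returns a boolean mask of tokens that stay in computation and a mask of tokens
    whose prediction is finalized at this stage.
    """
    probs = aux_logits.softmax(dim=-1)
    conf, pred = probs.max(dim=-1)
    easy = conf >= threshold              # high-confidence tokens may exit early
    keep = ~easy                          # hard tokens always continue
    # Keep the k most confident easy tokens of each class as context carriers.
    for c in pred[easy].unique():
        cls_idx = torch.nonzero((pred == c) & easy, as_tuple=False).squeeze(1)
        top = cls_idx[conf[cls_idx].topk(min(k, cls_idx.numel())).indices]
        keep[top] = True
    finalized = easy & ~keep              # these tokens take their current prediction
    return keep, finalized
```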
2308.00951 Report From Sparse to Soft Mixtures of Experts Joan Puigcerver, Carlos Riquelme, Basil Mustafa, Neil Houlsby Sparse mixture of expert architectures (MoEs) scale model capacity without large increases in training or inference costs. Despite their success, MoEs suffer from a number of issues: training instability, token dropping, inability to scale the number of experts, or ineffective finetuning. In this work, we propose Soft MoE, a fully-differentiable sparse Transformer that addresses these challenges, while maintaining the benefits of MoEs. Soft MoE performs an implicit soft assignment by passing different weighted combinations of all input tokens to each expert. As in other MoE works, experts in Soft MoE only process a subset of the (combined) tokens, enabling larger model capacity at lower inference cost. In the context of visual recognition, Soft MoE greatly outperforms standard Transformers (ViTs) and popular MoE variants (Tokens Choice and Experts Choice). For example, Soft MoE-Base/16 requires 10.5x lower inference cost (5.7x lower wall-clock time) than ViT-Huge/14 while matching its performance after similar training. Soft MoE also scales well: Soft MoE Huge/14 with 128 experts in 16 MoE layers has over 40x more parameters than ViT Huge/14, while inference time cost grows by only 2%, and it performs substantially better. The paper introduces Soft MoE, a fully-differentiable sparse Transformer that addresses the challenges of training instability, token dropping, scalability, and ineffective fine-tuning in existing sparse Mixture of Expert (MoE) architectures. Sparse MoEs are crucial for scaling model capacity without excessive computational costs, making them essential for improving performance in various tasks like visual recognition. Soft MoE utilizes a soft assignment mechanism, computing weighted averages of input tokens for each expert instead of discrete token-to-expert assignments. This approach simplifies training, avoids token dropping and expert imbalance, and enhances speed. Soft MoE consistently outperforms dense Vision Transformers (ViTs) and other sparse MoE variants (Tokens Choice and Experts Choice) in image classification tasks, achieving better performance with lower training costs. The paper demonstrates Soft MoE's scalability to thousands of experts, enabling the training of large models with improved performance and manageable inference costs. Experiments on image-language contrastive learning show that representations learned by Soft MoE are beneficial for other tasks like image-text alignment. The current design of Soft MoE makes its application in auto-regressive decoders challenging due to the need to preserve causality between tokens. While Soft MoE maintains computational efficiency, its memory requirements can increase with a large number of experts, especially when using one slot per expert, which is often optimal for performance. mixture of experts, transformers, visual recognition, sparse models, image classification
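The soft dispatch/combine mechanism summarized above can be written compactly. The sketch below is a minimal, single-sequence (no batch) version with a placeholder MLP expert; dimensions and hyperparameters are illustrative, and normalization details of the full layer are omitted.

```python
import torch
import torch.nn as nn

class SoftMoE(nn.Module):
    """Minimal Soft MoE layer: soft dispatch of tokens to slots, expert MLPs on
    slots, soft combine back to tokens."""

    def __init__(self, dim, num_experts=4, slots_per_expert=1, hidden=256):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(dim, num_experts * slots_per_expert))
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )
        self.p = slots_per_expert

    def forward(self, x):                      # x: (num_tokens, dim)
        logits = x @ self.slots                # (tokens, slots)
        dispatch = logits.softmax(dim=0)       # normalize over tokens, per slot
        combine = logits.softmax(dim=1)        # normalize over slots, per token
        slot_inputs = dispatch.t() @ x         # (slots, dim): weighted token mixes
        slot_outputs = torch.cat([
            expert(slot_inputs[i * self.p:(i + 1) * self.p])   # each expert sees its slots
            for i, expert in enumerate(self.experts)
        ], dim=0)
        return combine @ slot_outputs          # (tokens, dim)
```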
2308.00906 Report ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation Yasheng Sun, Yifan Yang, Houwen Peng, Yifei Shen, Yuqing Yang, Han Hu, Lili Qiu, Hideki Koike While language-guided image manipulation has made remarkable progress, the challenge of how to instruct the manipulation process faithfully reflecting human intentions persists. An accurate and comprehensive description of a manipulation task using natural language is laborious and sometimes even impossible, primarily due to the inherent uncertainty and ambiguity present in linguistic expressions. Is it feasible to accomplish image manipulation without resorting to external cross-modal language information? If this possibility exists, the inherent modality gap would be effortlessly eliminated. In this paper, we propose a novel manipulation methodology, dubbed ImageBrush, that learns visual instructions for more accurate image editing. Our key idea is to employ a pair of transformation images as visual instructions, which not only precisely captures human intention but also facilitates accessibility in real-world scenarios. Capturing visual instructions is particularly challenging because it involves extracting the underlying intentions solely from visual demonstrations and then applying this operation to a new image. To address this challenge, we formulate visual instruction learning as a diffusion-based inpainting problem, where the contextual information is fully exploited through an iterative process of generation. A visual prompting encoder is carefully devised to enhance the model's capacity in uncovering human intent behind the visual instructions. Extensive experiments show that our method generates engaging manipulation results conforming to the transformations entailed in demonstrations. Moreover, our model exhibits robust generalization capabilities on various downstream tasks such as pose transfer, image translation and video inpainting. This paper introduces ImageBrush, a novel framework for image manipulation that utilizes pairs of exemplar images as visual instructions, eliminating the need for language-based guidance. This approach addresses challenges associated with the ambiguity and limitations of language in accurately conveying human intention for image manipulation. ImageBrush leverages a diffusion-based inpainting strategy with a grid-like input containing exemplar images and a target image. A visual prompting encoder extracts semantic relationships and a user interface enables bounding box annotations for specifying regions of interest. ImageBrush outperforms language-guided methods in qualitative evaluations, demonstrating superior fidelity to provided examples. Quantitative results on diverse in-the-wild datasets demonstrate ImageBrush's superior performance in tasks like image translation, pose transfer, and video inpainting. Ablation studies highlight the importance of each component in ImageBrush, including the diffusion process, visual prompting encoder, and region of interest interface. ImageBrush may face challenges with significant disparities between instructions and query images. Handling intricate details like subtle background changes or small object additions remains challenging. image manipulation, visual instruction, diffusion models, in-context learning, visual prompting
2308.00773 Report High-Fidelity Eye Animatable Neural Radiance Fields for Human Face Hengfei Wang, Zhongqun Zhang, Yihua Cheng, Hyung Jin Chang Face rendering using neural radiance fields (NeRF) is a rapidly developing research area in computer vision. While recent methods primarily focus on controlling facial attributes such as identity and expression, they often overlook the crucial aspect of modeling eyeball rotation, which holds importance for various downstream tasks. In this paper, we aim to learn a face NeRF model that is sensitive to eye movements from multi-view images. We address two key challenges in eye-aware face NeRF learning: how to effectively capture eyeball rotation for training and how to construct a manifold for representing eyeball rotation. To accomplish this, we first fit FLAME, a well-established parametric face model, to the multi-view images considering multi-view consistency. Subsequently, we introduce a new Dynamic Eye-aware NeRF (DeNeRF). DeNeRF transforms 3D points from different views into a canonical space to learn a unified face NeRF model. We design an eye deformation field for the transformation, including rigid transformation, e.g., eyeball rotation, and non-rigid transformation. Through experiments conducted on the ETH-XGaze dataset, we demonstrate that our model is capable of generating high-fidelity images with accurate eyeball rotation and non-rigid periocular deformation, even under novel viewing angles. Furthermore, we show that utilizing the rendered images can effectively enhance gaze estimation performance. This paper introduces DeNeRF, a novel dynamic eye-aware Neural Radiance Field (NeRF) capable of rendering high-fidelity faces with animatable eyes from multi-view images under novel viewpoints and eye poses. Controllable eye movement in face rendering is crucial for realism and downstream tasks like gaze estimation, yet existing face NeRF models often overlook this aspect. The proposed DeNeRF leverages multi-view face tracking with FLAME to accurately capture eyeball rotation. It then learns a unified face NeRF in a canonical space, employing an eye deformation field (including rigid and non-rigid transformations) to transform 3D points from observation space to the canonical space. DeNeRF generates high-fidelity face images with accurate eyeball rotation and non-rigid periocular deformation, even under novel viewing angles. Quantitative comparisons demonstrate DeNeRF's superior performance over existing 2D and 3D face rendering methods, particularly in the eye region. Using DeNeRF's rendered images for data augmentation significantly improves the performance of downstream gaze estimation tasks. The model currently requires multi-view images, limiting its applicability to single-view scenarios. The computational cost of DeNeRF is relatively high, hindering its deployment in real-time applications. neural radiance fields, face rendering, eye animation, gaze estimation, computer vision
2308.00759 Report Decomposition Ascribed Synergistic Learning for Unified Image Restoration Jinghao Zhang, Feng Zhao Learning to restore multiple image degradations within a single model is quite beneficial for real-world applications. Nevertheless, existing works typically concentrate on regarding each degradation independently, while their relationship has been less exploited to ensure the synergistic learning. To this end, we revisit the diverse degradations through the lens of singular value decomposition, with the observation that the decomposed singular vectors and singular values naturally undertake the different types of degradation information, dividing various restoration tasks into two groups, i.e., singular vector dominated and singular value dominated. The above analysis renders a more unified perspective to ascribe the diverse degradations, compared to previous task-level independent learning. The dedicated optimization of degraded singular vectors and singular values inherently utilizes the potential relationship among diverse restoration tasks, attributing to the Decomposition Ascribed Synergistic Learning (DASL). Specifically, DASL comprises two effective operators, namely, Singular VEctor Operator (SVEO) and Singular VAlue Operator (SVAO), to favor the decomposed optimization, which can be lightly integrated into existing image restoration backbone. Moreover, the congruous decomposition loss has been devised for auxiliary. Extensive experiments on blended five image restoration tasks demonstrate the effectiveness of our method. This paper proposes Decomposition Ascribed Synergistic Learning (DASL), a novel approach to unified image restoration that leverages the relationship between different degradation types through singular value decomposition. Existing multi-degradation learning methods often treat each degradation independently, neglecting their potential synergistic relationships. DASL aims to address this limitation by enabling a more unified learning process. DASL decomposes image degradations based on singular value decomposition, observing that singular vectors and singular values capture distinct degradation information. It then employs two operators, Singular VEctor Operator (SVEO) and Singular VAlue Operator (SVAO), to optimize degraded singular vectors and values, respectively, alongside a congruous decomposition loss. DASL consistently outperforms existing general image restoration and all-in-one methods on five common image restoration tasks. The method demonstrates reduced computational complexity and faster inference compared to baseline methods. Ablation studies confirm the contribution of SVEO, SVAO, and decomposition loss to the performance gain. Exploring more sophisticated correlations beyond decomposed singular vectors and singular values. Investigating the potential of leveraging the distribution discrepancy of degradations on separate orders of decomposed components. image restoration, multi-degradation learning, singular value decomposition, synergistic learning, deep learning
2308.00755 Report The Bias Amplification Paradox in Text-to-Image Generation Preethi Seshadri, Sameer Singh, Yanai Elazar Bias amplification is a phenomenon in which models exacerbate biases or stereotypes present in the training data. In this paper, we study bias amplification in the text-to-image domain using Stable Diffusion by comparing gender ratios in training vs. generated images. We find that the model appears to amplify gender-occupation biases found in the training data (LAION) considerably. However, we discover that amplification can be largely attributed to discrepancies between training captions and model prompts. For example, an inherent difference is that captions from the training data often contain explicit gender information while our prompts do not, which leads to a distribution shift and consequently inflates bias measures. Once we account for distributional differences between texts used for training and generation when evaluating amplification, we observe that amplification decreases drastically. Our findings illustrate the challenges of comparing biases in models and their training data, and highlight confounding factors that impact analyses. This paper investigates bias amplification in text-to-image models, focusing on how distributional differences between training captions and generation prompts contribute to the phenomenon. Understanding bias amplification is crucial as it can exacerbate stereotypes and disparities. The work aims to explain why models amplify biases despite being trained to fit the training data. The authors analyze gender-occupation bias in Stable Diffusion and its training dataset (LAION). They compare gender ratios in generated images to those in training images, considering different methods for selecting relevant training captions. Naively selecting training captions based on occupation keywords leads to an overestimation of bias amplification. Excluding captions with explicit gender indicators and using nearest neighbors based on text embeddings to select training captions significantly reduces observed amplification. Prompting the model with training captions directly results in minimal amplification, suggesting that the model largely reflects the bias present in the training data when distributional differences are minimized. The analysis doesn't account for biases stemming from the text embedding model (CLIP). Gender classification relies on a binary model, neglecting nuances in gender identity and potentially perpetuating stereotypes. bias amplification, text-to-image generation, stable diffusion, gender bias, dataset bias
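In spirit, the amplification measure compared here reduces to a per-occupation gap between the gender ratio observed in generated images and the ratio in matched training data; the tiny helper below shows that reading of the metric. The paper's actual estimator and caption-matching procedure are more involved, and the example numbers in the comment are purely hypothetical.

```python
def amplification(train_female_ratio: float, generated_female_ratio: float) -> float:
    """Gap between the female ratio measured in generated images and the ratio in
    the matched training captions/images for one occupation; positive values mean
    the model exaggerates the training skew."""
    return generated_female_ratio - train_female_ratio

# A hypothetical occupation measured at 0.80 in training and 0.95 in generations
# would score amplification(0.80, 0.95) == +0.15.
```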
2308.00729 Report Ada-DQA: Adaptive Diverse Quality-aware Feature Acquisition for Video Quality Assessment Hongbo Liu, Mingda Wu, Kun Yuan, Ming Sun, Yansong Tang, Chuanchuan Zheng, Xing Wen, Xiu Li Video quality assessment (VQA) has attracted growing attention in recent years. However, the great expense of annotating large-scale VQA datasets has become the main obstacle for current deep-learning methods. To surmount the constraint of insufficient training data, in this paper, we first consider the complete range of video distribution diversity (i.e., content, distortion, motion) and employ diverse pretrained models (e.g., architecture, pretext task, pre-training dataset) to benefit quality representation. An Adaptive Diverse Quality-aware feature Acquisition (Ada-DQA) framework is proposed to capture desired quality-related features generated by these frozen pretrained models. By leveraging the Quality-aware Acquisition Module (QAM), the framework is able to extract more essential and relevant features to represent quality. Finally, the learned quality representation is utilized as supplementary supervisory information, along with the supervision of the labeled quality score, to guide the training of a relatively lightweight VQA model in a knowledge distillation manner, which largely reduces the computational cost during inference. Experimental results on three mainstream no-reference VQA benchmarks clearly show the superior performance of Ada-DQA in comparison with current state-of-the-art approaches without using extra training data of VQA. This paper proposes Ada-DQA, an Adaptive Diverse Quality-aware Feature Acquisition framework for Video Quality Assessment (VQA) that leverages diverse pre-trained models to overcome limitations of limited labeled training data in VQA. DNN-based VQA methods suffer from the limited scale of existing VQA datasets and using only content-aware features from pre-trained models is insufficient to represent quality degradation in videos. Ada-DQA constructs a pool of diverse pre-trained models (different architectures, pre-training tasks, datasets) covering various quality-related factors (content, distortion, motion). A Quality-aware Acquisition Module (QAM) dynamically captures desired features from these models, with a sparsity constraint on gating weights to emphasize crucial features. Finally, knowledge distillation transfers learned representations to a lightweight VQA model. Ada-DQA achieves state-of-the-art results on three NR-VQA benchmarks (KoNViD-1k, LIVE-VQC, YouTube-UGC) without using external QA training data. Using diverse pre-trained models outperforms using a single pre-trained model consistently across datasets. Adding a sparsity constraint to QAM leads to continuous performance improvement as the number of pre-trained models increases. The quality-related information provided by adding more pre-trained models plateaus at a certain point. Future work could explore incorporating more diverse pre-trained models or other techniques beyond knowledge distillation. video quality assessment, diverse pretrained model, knowledge distillation, quality-aware representation, sparsity constraint
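A stripped-down sketch of the gating idea in the QAM is given below: features from several frozen pretrained models are combined with learned gates, and an L1 term on the gates encourages sparsity. The shapes, the sigmoid gating, and the shared feature dimension are assumptions for illustration; the actual module and the subsequent distillation step are more elaborate.

```python
import torch
import torch.nn as nn

class QualityAwareGating(nn.Module):
    """Adaptively weight features from N frozen pretrained models (schematic)."""

    def __init__(self, num_models, feat_dim, out_dim=512):
        super().__init__()
        self.gates = nn.Parameter(torch.zeros(num_models))    # one gate per model
        self.proj = nn.Linear(feat_dim, out_dim)

    def forward(self, feats):
        # feats: (num_models, batch, feat_dim), assumed already projected to a
        # common dimension by per-model adapters (not shown).
        w = torch.sigmoid(self.gates)                         # gates in (0, 1)
        fused = (w[:, None, None] * feats).sum(dim=0)
        sparsity = w.sum()                                    # L1 penalty term
        return self.proj(fused), sparsity
```

During training the sparsity term would be added to the quality-regression loss with a small coefficient, pushing the gates of uninformative models toward zero.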
2308.00727 Report Adaptive Semantic Consistency for Cross-domain Few-shot Classification Hengchu Lu, Yuanjie Shao, Xiang Wang, Changxin Gao Cross-domain few-shot classification (CD-FSC) aims to identify novel target classes with a few samples, assuming that there exists a domain shift between source and target domains. Existing state-of-the-art practices typically pre-train on source domain and then finetune on the few-shot target data to yield task-adaptive representations. Despite promising progress, these methods are prone to overfitting the limited target distribution due to data scarcity and ignore the transferable knowledge learned in the source domain. To alleviate this problem, we propose a simple plug-and-play Adaptive Semantic Consistency (ASC) framework, which improves cross-domain robustness by preserving source transfer capability during the finetuning stage. Concretely, we reuse the source images in the pretraining phase and design an adaptive weight assignment strategy to highlight the samples similar to target domain, aiming to aggregate informative target-related knowledge from source domain. Subsequently, a semantic consistency regularization is applied to constrain the consistency between the semantic features of the source images output by the source model and target model. In this way, the proposed ASC enables explicit transfer of source domain knowledge to prevent the model from overfitting the target domain. Extensive experiments on multiple benchmarks demonstrate the effectiveness of the proposed ASC, and ASC provides consistent improvements over the baselines. The source code will be released. This paper proposes Adaptive Semantic Consistency (ASC), a plug-and-play framework for cross-domain few-shot classification that mitigates overfitting by preserving transferable knowledge from the source domain during finetuning. Existing cross-domain few-shot classification methods are prone to overfitting the limited target data and often neglect valuable knowledge learned from the source domain. ASC employs an adaptive weight assignment strategy to emphasize source domain samples similar to the target domain. It also introduces a semantic consistency regularization, constraining the semantic features of source images from the source and target models to be consistent during finetuning. ASC consistently improves performance on multiple benchmarks compared to baseline methods. The adaptive weight assignment strategy effectively highlights transferable knowledge from the source domain. Regularizing semantic-level features is more effective than mid-level features in preserving transferable knowledge and preventing negative transfer. The source image selection strategy relies on source image labels, which may not always be available. Future work can explore alternative strategies for selecting relevant source images without label dependency. cross-domain few-shot learning, semantic consistency, transfer learning, overfitting prevention, few-shot classification
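The two ingredients described above — per-sample weights that favor source images close to the target domain, and a consistency term between the frozen source model's and the finetuned model's features on those images — can be sketched as follows. The cosine-based weighting and distance are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def adaptive_weights(src_feats, target_prototype, tau=0.1):
    """Hypothetical weighting: source samples closer to a target-domain prototype
    (e.g., the mean feature of the few-shot support set) get larger weights."""
    sim = F.cosine_similarity(src_feats, target_prototype[None, :], dim=1)
    return torch.softmax(sim / tau, dim=0)          # weights sum to 1

def semantic_consistency_loss(src_model_feats, tgt_model_feats, weights):
    """Weighted consistency between the frozen source model's features and the
    finetuned (target) model's features on the same reused source images."""
    per_sample = 1.0 - F.cosine_similarity(src_model_feats, tgt_model_feats, dim=1)
    return (weights * per_sample).sum()
```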
2308.00692 Report LISA: Reasoning Segmentation via Large Language Model Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, Jiaya Jia Although perception systems have made remarkable advancements in recent years, they still rely on explicit human instruction or pre-defined categories to identify the target objects before executing visual recognition tasks. Such systems cannot actively reason and comprehend implicit user intention. In this work, we propose a new segmentation task -- reasoning segmentation. The task is designed to output a segmentation mask given a complex and implicit query text. Furthermore, we establish a benchmark comprising over one thousand image-instruction-mask data samples, incorporating intricate reasoning and world knowledge for evaluation purposes. Finally, we present LISA: large Language Instructed Segmentation Assistant, which inherits the language generation capabilities of multimodal Large Language Models (LLMs) while also possessing the ability to produce segmentation masks. We expand the original vocabulary with a <SEG> token and propose the embedding-as-mask paradigm to unlock the segmentation capability. Remarkably, LISA can handle cases involving complex reasoning and world knowledge. Also, it demonstrates robust zero-shot capability when trained exclusively on reasoning-free datasets. In addition, fine-tuning the model with merely 239 reasoning segmentation data samples results in further performance enhancement. Both quantitative and qualitative experiments show our method effectively unlocks new reasoning segmentation capabilities for multimodal LLMs. Code, models, and data are available at https://github.com/dvlab-research/LISA. This paper introduces 'reasoning segmentation', a new task requiring segmentation masks to be generated from implicit text queries involving complex reasoning and world knowledge. Current perception systems rely heavily on explicit instructions, highlighting the need for systems capable of understanding implicit user intent, crucial for advancing AI and robotics. The authors present LISA, a model that equips multimodal LLMs with segmentation abilities. It leverages an embedding-as-mask paradigm, where a '<SEG>' token's embedding is decoded into a segmentation mask, enabling end-to-end training. LISA effectively handles complex reasoning and world knowledge in segmentation tasks. It exhibits strong zero-shot performance on the ReasonSeg benchmark, even when trained solely on reasoning-free datasets. Fine-tuning LISA on a mere 239 reasoning segmentation samples considerably boosts its performance. The current performance bottleneck might lie in the query text understanding, suggesting the need for stronger multimodal LLMs. The research highlights the need for more reasoning segmentation training data to further improve performance. reasoning segmentation, multimodal large language models, implicit instruction understanding, embedding-as-mask, lisa
2308.00520 Report NormKD: Normalized Logits for Knowledge Distillation Zhihao Chi, Tu Zheng, Hengjia Li, Zheng Yang, Boxi Wu, Binbin Lin, Deng Cai Logit based knowledge distillation gets less attention in recent years since feature based methods perform better in most cases. Nevertheless, we find it still has untapped potential when we re-investigate the temperature, which is a crucial hyper-parameter to soften the logit outputs. For most of the previous works, it was set as a fixed value for the entire distillation procedure. However, as the logits from different samples are distributed quite variously, it is not feasible to soften all of them to an equal degree by just a single temperature, which may make the previous work transfer the knowledge of each sample inadequately. In this paper, we restudy the hyper-parameter temperature and figure out its incapability to distill the knowledge from each sample sufficiently when it is a single value. To address this issue, we propose Normalized Knowledge Distillation (NormKD), with the purpose of customizing the temperature for each sample according to the characteristic of the sample's logit distribution. Compared to the vanilla KD, NormKD barely has extra computation or storage cost but performs significantly better on CIFAR-100 and ImageNet for image classification. Furthermore, NormKD can be easily applied to the other logit based methods and achieve better performance which can be closer to or even better than the feature based method. This paper proposes NormKD, a novel knowledge distillation approach that customizes the temperature for each sample based on its logit distribution, enhancing knowledge transfer from teacher to student models. Existing logit-based knowledge distillation methods often use a fixed temperature, which inadequately softens logits from different samples with varying distributions, hindering effective knowledge transfer. NormKD replaces the fixed temperature with the scaled standard deviation of each sample's logit output. This normalizes the logits, enabling more equal knowledge distillation from individual samples. NormKD significantly outperforms vanilla KD on CIFAR-100 and ImageNet datasets, demonstrating its effectiveness. Combining NormKD with other logit-based methods, such as DKD, further boosts performance, surpassing even some feature-based methods. NormKD achieves these improvements with minimal computational overhead, making it efficient and easy to implement. The assumption of logit distributions as normal distributions may not always hold true, potentially limiting the effectiveness in certain cases. Future work could explore alternative methods to better characterize and normalize logit distributions for enhanced knowledge distillation. knowledge distillation, logit-based distillation, temperature scaling, normalization, deep learning
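Following the description above, a per-sample temperature can be derived from the spread of each sample's logits and plugged into the usual KD objective. The sketch below sets the temperature from the teacher logits' standard deviation times a scale factor; whether teacher and student each use their own statistic, and the exact gradient rescaling, are details the paper handles differently and are only approximated here.

```python
import torch
import torch.nn.functional as F

def normkd_style_loss(student_logits, teacher_logits, scale=2.0, eps=1e-6):
    """Knowledge distillation with a per-sample temperature (sketch).

    student_logits, teacher_logits: (batch, num_classes)
    """
    # Per-sample temperature from the spread of the teacher's logits.
    tau = teacher_logits.std(dim=1, keepdim=True) * scale + eps      # (batch, 1)
    p_teacher = F.softmax(teacher_logits / tau, dim=1)
    log_p_student = F.log_softmax(student_logits / tau, dim=1)
    # 'batchmean' matches the usual KD reduction; the classic T^2 rescaling is
    # applied here with the shared scale factor only, a simplification.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * scale ** 2
```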
2308.00458 Report Center Contrastive Loss for Metric Learning Bolun Cai, Pengfei Xiong, Shangxuan Tian Contrastive learning is a major studied topic in metric learning. However, sampling effective contrastive pairs remains a challenge due to factors such as limited batch size, imbalanced data distribution, and the risk of overfitting. In this paper, we propose a novel metric learning function called Center Contrastive Loss, which maintains a class-wise center bank and compares the category centers with the query data points using a contrastive loss. The center bank is updated in real-time to boost model convergence without the need for well-designed sample mining. The category centers are well-optimized classification proxies to re-balance the supervisory signal of each class. Furthermore, the proposed loss combines the advantages of both contrastive and classification methods by reducing intra-class variations and enhancing inter-class differences to improve the discriminative power of embeddings. Our experimental results, as shown in Figure 1, demonstrate that a standard network (ResNet50) trained with our loss achieves state-of-the-art performance and faster convergence. This paper proposes Center Contrastive Loss (CCL), a novel metric learning loss function that maintains and updates a class-wise center bank, contrasting category centers with query data points using a contrastive loss. CCL overcomes limitations of existing contrastive learning methods, addressing challenges in sampling effective pairs due to factors like limited batch size and imbalanced data distribution. CCL utilizes a center bank updated in sync with the encoder, contrasting it with data points using a contrastive loss enhanced with a large-margin component. This reduces intra-class variations while enhancing inter-class differences, boosting discriminative power. CCL achieves state-of-the-art Recall@1 accuracy on benchmark datasets like SOP, CUB, and Cars196. It exhibits faster convergence compared to previous methods, achieving superior performance within a fraction of training epochs. CCL demonstrates robustness to noisy labels, outperforming other robust metric learning methods under various noise settings. The impact of hyperparameters like hypersphere radius (s) is not extensively explored. Future work could investigate extensions of CCL for broader applications such as face recognition, person re-identification, and clustering. metric learning, contrastive learning, center loss, image retrieval, deep learning
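A compact sketch of the loss described above: embeddings are scored against per-class centers, an additive margin is applied to the ground-truth logit, and the center bank is refreshed from the current batch. The scale, margin, and EMA update rule are illustrative choices rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

class CenterContrastiveLoss(torch.nn.Module):
    def __init__(self, num_classes, dim, s=32.0, margin=0.2, momentum=0.9):
        super().__init__()
        self.register_buffer("centers", F.normalize(torch.randn(num_classes, dim), dim=1))
        self.s, self.m, self.momentum = s, margin, momentum

    def forward(self, emb, labels):
        emb = F.normalize(emb, dim=1)
        cos = emb @ self.centers.t()                                   # (B, C) cosine scores
        margin = F.one_hot(labels, self.centers.size(0)).float() * self.m
        loss = F.cross_entropy(self.s * (cos - margin), labels)        # margin on positives
        with torch.no_grad():                                          # real-time center update
            for c in labels.unique():
                batch_mean = emb[labels == c].mean(dim=0)
                new_c = self.momentum * self.centers[c] + (1 - self.momentum) * batch_mean
                self.centers[c] = F.normalize(new_c, dim=0)
        return loss
```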
2308.00261 Report Improving Pixel-based MIM by Reducing Wasted Modeling Capability Yuan Liu, Songyang Zhang, Jiacheng Chen, Zhaohui Yu, Kai Chen, Dahua Lin There has been significant progress in Masked Image Modeling (MIM). Existing MIM methods can be broadly categorized into two groups based on the reconstruction target: pixel-based and tokenizer-based approaches. The former offers a simpler pipeline and lower computational cost, but it is known to be biased toward high-frequency details. In this paper, we provide a set of empirical studies to confirm this limitation of pixel-based MIM and propose a new method that explicitly utilizes low-level features from shallow layers to aid pixel reconstruction. By incorporating this design into our base method, MAE, we reduce the wasted modeling capability of pixel-based MIM, improving its convergence and achieving non-trivial improvements across various downstream tasks. To the best of our knowledge, we are the first to systematically investigate multi-level feature fusion for isotropic architectures like the standard Vision Transformer (ViT). Notably, when applied to a smaller model (e.g., ViT-S), our method yields significant performance gains, such as 1.2\% on fine-tuning, 2.8\% on linear probing, and 2.6\% on semantic segmentation. Code and models are available at https://github.com/open-mmlab/mmpretrain. This paper proposes Multi-level Feature Fusion (MFF), a method to improve pixel-based Masked Image Modeling (MIM) by incorporating low-level features from shallow layers into the output layer for pixel reconstruction. Pixel-based MIM, while simple and efficient, is biased towards high-frequency details, wasting modeling capacity that could be used to capture low-frequency semantics crucial for downstream tasks. MFF extends MAE by fusing features from multiple shallow layers with the output layer. It explores different projection layers (linear, non-linear) and fusion strategies (weighted average pooling, self-attention). MFF significantly improves MAE's performance on ImageNet-1K classification, COCO object detection, and ADE20K semantic segmentation. MFF enhances training efficiency, achieving comparable results to MAE with 5x fewer epochs. Analysis reveals that MFF reduces high-frequency bias in learned features and flattens the loss landscape, aiding optimization. The study primarily focuses on ViT architecture and its effectiveness on other architectures needs further investigation. The selection of specific layers for fusion is currently based on empirical analysis and a more principled approach could be explored. masked image modeling, self-supervised learning, multi-level feature fusion, vision transformer, representation learning
2308.00255 Report LGViT: Dynamic Early Exiting for Accelerating Vision Transformer Guanyu Xu, Jiawei Hao, Li Shen, Han Hu, Yong Luo, Hui Lin, Jialie Shen Recently, the efficient deployment and acceleration of powerful vision transformers (ViTs) on resource-limited edge devices for providing multimedia services have become attractive tasks. Although early exiting is a feasible solution for accelerating inference, most works focus on convolutional neural networks (CNNs) and transformer models in natural language processing (NLP). Moreover, the direct application of early exiting methods to ViTs may result in substantial performance degradation. To tackle this challenge, we systematically investigate the efficacy of early exiting in ViTs and point out that the insufficient feature representations in shallow internal classifiers and the limited ability to capture target semantic information in deep internal classifiers restrict the performance of these methods. We then propose an early exiting framework for general ViTs termed LGViT, which incorporates heterogeneous exiting heads, namely, local perception head and global aggregation head, to achieve an efficiency-accuracy trade-off. In particular, we develop a novel two-stage training scheme, including end-to-end training and self-distillation with the backbone frozen to generate early exiting ViTs, which facilitates the fusion of global and local information extracted by the two types of heads. We conduct extensive experiments using three popular ViT backbones on three vision datasets. Results demonstrate that our LGViT can achieve competitive performance with approximately 1.8 $\times$ speed-up. This paper proposes LGViT, an early exiting framework for Vision Transformers (ViTs) that uses heterogeneous exiting heads to improve inference speed while maintaining accuracy. Deploying powerful ViTs on resource-limited edge devices for real-time multimedia applications is challenging due to their high computational complexity. Early exiting offers a solution but needs to be adapted for ViTs to avoid performance degradation. LGViT incorporates local perception heads (based on convolution) at shallow exiting points and global aggregation heads (based on self-attention) at deep exiting points. This leverages the strengths of both convolution and self-attention for better feature representation. A novel two-stage training strategy, including end-to-end training and self-distillation, is used to further improve performance. LGViT achieves competitive performance with an average speed-up of 1.8x compared to the original ViT models while sacrificing only 2% accuracy on three vision datasets. The heterogeneous exiting heads (LPH + GAH) outperform other exiting architectures, such as using only MLP, convolution, or attention, in terms of speed-accuracy trade-off. The proposed two-stage training strategy is shown to be more effective than other training schemes, like normal, weighted, distillation, and alternating training, for early exiting in ViTs. The exiting positions and optimal exiting paths are currently chosen manually. Future work will explore using Bayesian optimization to automate the exiting decision process. vision transformer, early exit, heterogeneous exiting heads, self-distillation, inference acceleration
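At inference time, early exiting of the kind described above amounts to running the backbone block by block and stopping at the first exiting head that is confident enough. The sketch below treats the heterogeneous heads as black boxes and uses a max-softmax-probability threshold on a single image; the exit criterion and threshold value are assumptions for illustration.

```python
import torch

@torch.no_grad()
def early_exit_inference(blocks, exit_heads, x, threshold=0.8):
    """blocks: list of transformer blocks; exit_heads: dict {block_idx: exiting head}.
    Assumes a single image (batch size 1); returns logits and the exit depth."""
    logits = None
    for i, block in enumerate(blocks):
        x = block(x)
        if i in exit_heads:
            logits = exit_heads[i](x)
            conf = logits.softmax(dim=-1).max().item()
            if conf >= threshold:            # confident enough: exit early
                return logits, i
    return logits, len(blocks) - 1           # no early exit: use the deepest head's output
```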
2308.00135 Report InFusion: Inject and Attention Fusion for Multi Concept Zero-Shot Text-based Video Editing Anant Khandelwal Large text-to-image diffusion models have achieved remarkable success in generating diverse, high-quality images. Additionally, these models have been successfully leveraged to edit input images by just changing the text prompt. But when these models are applied to videos, the main challenge is to ensure temporal consistency and coherence across frames. In this paper, we propose InFusion, a framework for zero-shot text-based video editing leveraging large pre-trained image diffusion models. Our framework specifically supports editing of multiple concepts with pixel-level control over diverse concepts mentioned in the editing prompt. Specifically, we inject the difference in features obtained with source and edit prompts from U-Net residual blocks of decoder layers. When these are combined with injected attention features, it becomes feasible to query the source contents and scale edited concepts along with the injection of unedited parts. The editing is further controlled in a fine-grained manner with mask extraction and attention fusion, which cut the edited part from the source and paste it into the denoising pipeline for the editing prompt. Our framework is a low-cost alternative to one-shot tuned models for editing since it does not require training. We demonstrated complex concept editing with a generalised image model (Stable Diffusion v1.5) using LoRA. Adaptation is compatible with all the existing image diffusion techniques. Extensive experimental results demonstrate the effectiveness of existing methods in rendering high-quality and temporally consistent videos. This paper introduces InFusion, a zero-shot text-based video editing framework that leverages pre-trained image diffusion models (specifically Stable Diffusion v1.5) to enable multi-concept editing with pixel-level control. The method addresses limitations in existing video editing techniques that struggle to maintain temporal consistency and fine-grained control when modifying multiple concepts within a video. InFusion employs a two-part strategy: 1) **Inject**: Injects differences in spatial and attention features from source and edit prompts into the denoising pipeline to guide concept modification. 2) **Attention Fusion**: Uses masks extracted from cross-attention maps to combine source and edit attention, ensuring accurate concept replacement while preserving unedited content. Achieves state-of-the-art temporal consistency and editing accuracy in edited videos compared to baseline methods, as evidenced by CLIP metrics and user studies. Demonstrates successful editing of complex concepts, including object replacement, color changes, style transfer, and scene modifications, while maintaining source video fidelity. Offers a cost-effective and flexible alternative to one-shot fine-tuned models, as it requires no training and is compatible with existing image diffusion techniques. The mask thresholding in Attention Fusion might require case-by-case adjustments for optimal performance. Future work could explore extending InFusion to incorporate additional control mechanisms, such as motion guidance or user-specified editing regions. video editing, text-guided synthesis, diffusion models, zero-shot learning, temporal consistency
2307.16867 Report Revisiting the Parameter Efficiency of Adapters from the Perspective of Precision Redundancy Shibo Jie, Haoqing Wang, Zhi-Hong Deng Current state-of-the-art results in computer vision depend in part on fine-tuning large pre-trained vision models. However, with the exponential growth of model sizes, the conventional full fine-tuning, which needs to store an individual network copy for each task, leads to increasingly huge storage and transmission overhead. Adapter-based Parameter-Efficient Tuning (PET) methods address this challenge by tuning lightweight adapters inserted into the frozen pre-trained models. In this paper, we investigate how to make adapters even more efficient, reaching a new minimum size required to store a task-specific fine-tuned network. Inspired by the observation that the parameters of adapters converge at flat local minima, we find that adapters are resistant to noise in parameter space, which means they are also resistant to low numerical precision. To train low-precision adapters, we propose a computationally efficient quantization method which minimizes the quantization error. Through extensive experiments, we find that low-precision adapters exhibit minimal performance degradation, and even 1-bit precision is sufficient for adapters. The experimental results demonstrate that 1-bit adapters outperform all other PET methods on both the VTAB-1K benchmark and few-shot FGVC tasks, while requiring the smallest storage size. Our findings show, for the first time, the significant potential of quantization techniques in PET, providing a general solution to enhance the parameter efficiency of adapter-based PET methods. Code: https://github.com/JieShibo/PETL-ViT This paper explores precision redundancy in adapter-based parameter-efficient tuning (PET) for vision transformers, proposing a method to train and store adapters in low-bit parameter space, significantly improving their efficiency with minimal performance loss. Storing task-specific fine-tuned large vision models incurs prohibitive storage and transmission costs. Adapter-based PET methods, while more efficient than full fine-tuning, still require significant storage, especially for numerous tasks. This work leverages the precision redundancy in adapters to further improve efficiency. The authors analyze the loss landscape of adapters and find they converge at flatter minima, implying resilience to noise, including quantization error. They propose an efficient quantization-aware training method based on empirical observations of adapter parameter distributions, minimizing quantization error during training. Quantizing adapters to low-bit precision, even 1-bit, results in negligible performance degradation unlike quantizing entire models. With a fixed storage budget, 1-bit quantized adapters achieve superior performance compared to higher bit-width settings. The proposed 1-bit adapter method outperforms previous PET methods, including low-rank factorization methods, while using the smallest storage size on both VTAB-1K and few-shot FGVC tasks. The study focuses on ViT backbones, and the optimal bit-width for other architectures may require further investigation. Exploring more sophisticated quantization strategies beyond the Gaussian distribution assumption could further improve performance. parameter-efficient tuning, vision transformers, quantization, adapter-based tuning, low-bit neural networks
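Training adapters at 1-bit precision typically relies on a quantizer in the forward pass and a straight-through estimator in the backward pass; the sketch below uses a sign quantizer with a per-tensor scale (mean absolute value), which is a common stand-in and not the paper's error-minimizing quantizer derived from its Gaussian assumption.

```python
import torch

class BinaryQuant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        alpha = w.abs().mean()               # per-tensor scale (illustrative choice)
        return alpha * torch.sign(w)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                   # straight-through estimator

def binary_adapter_linear(x, weight_fp, bias):
    """Adapter linear layer whose full-precision weight is binarized on the fly
    during quantization-aware training; only sign bits and one scale need storing."""
    w_q = BinaryQuant.apply(weight_fp)
    return x @ w_q.t() + bias
```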
2307.16813 Report Capturing Co-existing Distortions in User-Generated Content for No-reference Video Quality Assessment Kun Yuan, Zishang Kong, Chuanchuan Zheng, Ming Sun, Xing Wen Video Quality Assessment (VQA), which aims to predict the perceptual quality of a video, has attracted rising attention with the rapid development of streaming media technology, such as Facebook, TikTok, Kwai, and so on. Compared with other sequence-based visual tasks (e.g., action recognition), VQA faces two under-estimated challenges unresolved in User Generated Content (UGC) videos. First, it is not rare that several frames containing serious distortions (e.g., blocking, blurriness) can determine the perceptual quality of the whole video, while other sequence-based tasks require more frames of equal importance for representations. Second, the perceptual quality of a video exhibits a multi-distortion distribution, due to the differences in the duration and probability of occurrence for various distortions. In order to solve the above challenges, we propose Visual Quality Transformer (VQT) to extract quality-related sparse features more efficiently. Methodologically, a Sparse Temporal Attention (STA) is proposed to sample keyframes by analyzing the temporal correlation between frames, which reduces the computational complexity from $O(T^2)$ to $O(T \log T)$. Structurally, a Multi-Pathway Temporal Network (MPTN) utilizes multiple STA modules with different degrees of sparsity in parallel, capturing co-existing distortions in a video. Experimentally, VQT demonstrates superior performance to many state-of-the-art methods in three public no-reference VQA datasets. Furthermore, VQT shows better performance in four full-reference VQA datasets against widely-adopted industrial algorithms (i.e., VMAF and AVQT). This paper proposes Visual Quality Transformer (VQT), a novel Transformer-based architecture designed for no-reference video quality assessment, specifically targeting the challenges posed by co-existing distortions in user-generated content. Accurately assessing the quality of user-generated content (UGC) videos, often characterized by diverse and co-existing distortions, is crucial for various applications like content filtering and video enhancement. VQT utilizes two key components: a Sparse Temporal Attention (STA) module for efficiently sampling keyframes containing distortions, and a Multi-Pathway Temporal Network (MPTN) to capture different distortion characteristics simultaneously. VQT significantly outperforms state-of-the-art methods on three NR-VQA datasets, demonstrating substantial improvements in prediction accuracy. It surpasses even widely-adopted industrial algorithms (VMAF and AVQT) on four FR-VQA datasets, highlighting its robust generalization ability. VQT exhibits good generalization to general video classification tasks, achieving competitive results on Kinetics-400 while being computationally efficient. The current keyframe selection in STA relies on pre-defined hyperparameters, which could be improved with adaptive mechanisms. Future work could investigate further speed-up techniques like knowledge distillation and quantization for real-time applications. video quality assessment, user-generated content, video transformer, sparse temporal attention, co-existing distortions
2307.16686 Report Guiding Image Captioning Models Toward More Specific Captions Simon Kornblith, Lala Li, Zirui Wang, Thao Nguyen Image captioning is conventionally formulated as the task of generating captions for images that match the distribution of reference image-caption pairs. However, reference captions in standard captioning datasets are short and may not uniquely identify the images they describe. These problems are further exacerbated when models are trained directly on image-alt text pairs collected from the internet. In this work, we show that it is possible to generate more specific captions with minimal changes to the training process. We implement classifier-free guidance for an autoregressive captioning model by fine-tuning it to estimate both conditional and unconditional distributions over captions. The guidance scale applied at decoding controls a trade-off between maximizing $p(\mathrm{caption}|\mathrm{image})$ and $p(\mathrm{image}|\mathrm{caption})$. Compared to standard greedy decoding, decoding with a guidance scale of 2 substantially improves reference-free metrics such as CLIPScore (0.808 vs. 0.775) and caption$\to$image retrieval performance in the CLIP embedding space (recall@1 44.6% vs. 26.5%), but worsens standard reference-based captioning metrics (e.g., CIDEr 78.6 vs 126.1). We further explore the use of language models to guide the decoding process, obtaining small improvements over the Pareto frontier of reference-free vs. reference-based captioning metrics that arises from classifier-free guidance, and substantially improving the quality of captions generated from a model trained only on minimally curated web data. This paper investigates methods for guiding image captioning models to generate more specific captions, focusing on classifier-free guidance (CFG) and language model (LM) guidance. Standard image captioning models often produce generic captions. This paper addresses this issue by exploring techniques to enhance caption specificity, aiming to better capture image details. The authors employ CFG, a technique originally designed for diffusion models, by fine-tuning an autoregressive captioning model to estimate conditional and unconditional caption distributions. Additionally, they explore using a few-shot prompted LM to guide the caption generation process, influencing caption style and improving quality. Applying CFG with a guidance scale of 2 substantially improves reference-free metrics such as CLIPScore and caption-to-image retrieval performance but negatively impacts reference-based metrics like CIDEr. LM guidance with descriptive prompts slightly outperforms CFG in balancing reference-free and reference-based metrics. LM guidance significantly enhances captions generated by a model trained on minimally curated web data, demonstrating its potential for zero-shot captioning. The study primarily utilizes greedy decoding, which may not be optimal for LM guidance with structured prompts. Exploring beam search could be beneficial. While CFG enhances caption specificity, it can lead to grammatical errors and nonsensical words at higher guidance scales. Further research on regularizing the estimator of pointwise mutual information could mitigate this. image captioning, classifier-free guidance, language model guidance, caption specificity, zero-shot captioning
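The decoding rule implied above combines the conditional and unconditional next-token distributions with a guidance scale. The greedy-decoding sketch below assumes a hypothetical model interface in which `model(tokens, image)` returns next-token logits and passing `image=None` selects the unconditional branch; the real model's API will differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def cfg_greedy_decode(model, image, bos_id, eos_id, gamma=2.0, max_len=32):
    tokens = torch.tensor([[bos_id]])
    for _ in range(max_len):
        log_p_cond = F.log_softmax(model(tokens, image)[:, -1], dim=-1)
        log_p_uncond = F.log_softmax(model(tokens, None)[:, -1], dim=-1)
        # gamma = 1 recovers standard conditional decoding; larger gamma trades
        # maximizing p(caption|image) against p(image|caption), as described above.
        guided = log_p_uncond + gamma * (log_p_cond - log_p_uncond)
        next_id = guided.argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_id], dim=1)
        if next_id.item() == eos_id:
            break
    return tokens
```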
2307.16601 Report Sampling to Distill: Knowledge Transfer from Open-World Data Yuzheng Wang, Zhaoyu Chen, Jie Zhang, Dingkang Yang, Zuhao Ge, Yang Liu, Siao Liu, Yunquan Sun, Wenqiang Zhang, Lizhe Qi Data-Free Knowledge Distillation (DFKD) is a novel task that aims to train high-performance student models using only the teacher network without original training data. Despite encouraging results, existing DFKD methods rely heavily on generation modules with high computational costs. Meanwhile, they ignore the fact that the generated and original data exist domain shifts due to the lack of supervision information. Moreover, knowledge is transferred through each example, ignoring the implicit relationship among multiple examples. To this end, we propose a novel Open-world Data Sampling Distillation (ODSD) method without a redundant generation process. First, we try to sample open-world data close to the original data's distribution by an adaptive sampling module. Then, we introduce a low-noise representation to alleviate the domain shifts and build a structured relationship of multiple data examples to exploit data knowledge. Extensive experiments on CIFAR-10, CIFAR-100, NYUv2, and ImageNet show that our ODSD method achieves state-of-the-art performance. Especially, we improve 1.50\%-9.59\% accuracy on the ImageNet dataset compared with the existing results. This paper proposes Open-world Data Sampling Distillation (ODSD), a novel data-free knowledge distillation method that avoids the computational cost of data generation by effectively utilizing open-world unlabeled data. Existing DFKD methods suffer from high computational costs associated with generation modules and domain shifts between generated and original data. Additionally, they often overlook the implicit relationship among multiple data examples, limiting knowledge transfer. ODSD employs Adaptive Prototype Sampling (APS) to select unlabeled data resembling the original data distribution. It introduces Denoising Contrastive Relational Distillation (DCRD) with low-noise representation to mitigate label noise and utilizes a contrastive structured relationship to leverage knowledge from both data and the teacher network. ODSD achieves state-of-the-art performance on CIFAR-10, CIFAR-100, NYUv2, and ImageNet datasets. The method shows significant improvements, exceeding existing methods by up to 9.59% accuracy on ImageNet. Experiments demonstrate the effectiveness of the proposed sampling method, distillation approach, and structured knowledge framework. The performance of prototype-based sampling might be sensitive to the choice of clustering algorithm and the number of prototypes. Future work could explore more effective contrastive learning strategies for structured knowledge distillation. knowledge distillation, data-free learning, contrastive learning, domain adaptation, computer vision
2307.16586 Report SAMFlow: Eliminating Any Fragmentation in Optical Flow with Segment Anything Model Shili Zhou, Ruian He, Weimin Tan, Bo Yan Optical Flow Estimation aims to find the 2D dense motion field between two frames. Due to the limitation of model structures and training datasets, existing methods often rely too much on local clues and ignore the integrity of objects, resulting in fragmented motion estimation. Through theoretical analysis, we find the pre-trained large vision models are helpful in optical flow estimation, and we notice that the recently famous Segment Anything Model (SAM) demonstrates a strong ability to segment complete objects, which is suitable for solving the fragmentation problem. We thus propose a solution to embed the frozen SAM image encoder into FlowFormer to enhance object perception. To address the challenge of in-depth utilizing SAM in non-segmentation tasks like optical flow estimation, we propose an Optical Flow Task-Specific Adaption scheme, including a Context Fusion Module to fuse the SAM encoder with the optical flow context encoder, and a Context Adaption Module to adapt the SAM features for optical flow task with Learned Task-Specific Embedding. Our proposed SAMFlow model reaches 0.86/2.10 clean/final EPE and 3.55/12.32 EPE/F1-all on Sintel and KITTI-15 training set, surpassing Flowformer by 8.5%/9.9% and 13.2%/16.3%. Furthermore, our model achieves state-of-the-art performance on the Sintel and KITTI-15 benchmarks, ranking #1 among all two-frame methods on Sintel clean pass. This paper proposes SAMFlow, a novel approach to enhance the accuracy of optical flow estimation by embedding a frozen Segment Anything Model (SAM) image encoder into FlowFormer, effectively addressing fragmentation issues. Existing optical flow estimation methods often produce fragmented results due to the limitations of datasets and model structures, relying too heavily on local clues and ignoring object integrity. SAM's strong object segmentation ability makes it suitable for solving this fragmentation problem. The authors integrate the SAM image encoder into FlowFormer and propose an Optical Flow Task-Specific Adaption scheme. This scheme consists of a Context Fusion Module (CFM) to fuse SAM and FlowFormer encoder features, and a Context Adaption Module (CAM) to adapt the fused features for optical flow estimation using Learned Task-Specific Embedding (LTSE). SAMFlow achieves state-of-the-art performance on Sintel and KITTI-15 benchmarks, surpassing FlowFormer by a significant margin. The model exhibits strong robustness against fragmentation attacks, outperforming other methods in scenarios with occlusions and complex textures. SAMFlow ranks #1 among all two-frame methods on the Sintel clean pass benchmark. The model's performance improvement comes with increased computational cost, although this can be mitigated by using smaller SAM encoder scales. The authors primarily focus on two-frame optical flow estimation and plan to explore the application of SAM in multi-frame settings in future work. optical flow estimation, fragmentation, segment anything model (sam), flowformer, task-specific adaptation
2307.16489 Report BAGM: A Backdoor Attack for Manipulating Text-to-Image Generative Models Jordan Vice, Naveed Akhtar, Richard Hartley, Ajmal Mian The rise in popularity of text-to-image generative artificial intelligence (AI) has attracted widespread public interest. We demonstrate that this technology can be attacked to generate content that subtly manipulates its users. We propose a Backdoor Attack on text-to-image Generative Models (BAGM), which upon triggering, infuses the generated images with manipulative details that are naturally blended in the content. Our attack is the first to target three popular text-to-image generative models across three stages of the generative process by modifying the behaviour of the embedded tokenizer, the language model or the image generative model. Based on the penetration level, BAGM takes the form of a suite of attacks that are referred to as surface, shallow and deep attacks in this article. Given the existing gap within this domain, we also contribute a comprehensive set of quantitative metrics designed specifically for assessing the effectiveness of backdoor attacks on text-to-image models. The efficacy of BAGM is established by attacking state-of-the-art generative models, using a marketing scenario as the target domain. To that end, we contribute a dataset of branded product images. Our embedded backdoors increase the bias towards the target outputs by more than five times the usual, without compromising the model robustness or the generated content utility. By exposing generative AI's vulnerabilities, we encourage researchers to tackle these challenges and practitioners to exercise caution when using pre-trained models. Relevant code, input prompts and supplementary material can be found at https://github.com/JJ-Vice/BAGM, and the dataset is available at: https://ieee-dataport.org/documents/marketable-foods-mf-dataset. Keywords: Generative Artificial Intelligence, Generative Models, Text-to-Image generation, Backdoor Attacks, Trojan, Stable Diffusion. The paper proposes BAGM, a novel backdoor attack framework targeting text-to-image generative AI models, demonstrating manipulation of generated outputs across different stages of the generative process. The work exposes vulnerabilities of increasingly popular text-to-image AI models to subtle manipulation, raising important security and ethical concerns as these models can be exploited to influence user sentiments. The authors introduce three types of backdoor attacks: (1) Surface attack manipulating tokenization, (2) Shallow attack targeting the language model, and (3) Deep attack targeting the image generative model. They evaluate the attacks on three popular text-to-image pipelines (Stable Diffusion, Kandinsky, DeepFloyd-IF) using the proposed Marketable Foods dataset and novel evaluation metrics. Backdoor attacks successfully injected into all three pipelines at various stages, demonstrating the vulnerability of these systems. Proposed attacks achieved high attack success rates while maintaining low impact on model utility, making them stealthy and effective. The paper establishes a benchmark for evaluating backdoor attacks on generative AI models through novel metrics, paving the way for future research on defense mechanisms. The paper primarily focuses on a marketing scenario for demonstration. Exploring other domains and attack vectors is crucial for a comprehensive understanding of these vulnerabilities. While the proposed metrics offer a valuable starting point, further research is needed to develop more robust and comprehensive evaluation standards for generative AI model attacks. generative ai, backdoor attacks, text-to-image synthesis, model security, digital marketing
2307.16441 Report Interactive Neural Painting Elia Peruzzo, Willi Menapace, Vidit Goel, Federica Arrigoni, Hao Tang, Xingqian Xu, Arman Chopikyan, Nikita Orlov, Yuxiao Hu, Humphrey Shi, Nicu Sebe, Elisa Ricci In the last few years, Neural Painting (NP) techniques became capable of producing extremely realistic artworks. This paper advances the state of the art in this emerging research domain by proposing the first approach for Interactive NP. Considering a setting where a user looks at a scene and tries to reproduce it on a painting, our objective is to develop a computational framework to assist the users creativity by suggesting the next strokes to paint, that can be possibly used to complete the artwork. To accomplish such a task, we propose I-Paint, a novel method based on a conditional transformer Variational AutoEncoder (VAE) architecture with a two-stage decoder. To evaluate the proposed approach and stimulate research in this area, we also introduce two novel datasets. Our experiments show that our approach provides good stroke suggestions and compares favorably to the state of the art. Additional details, code and examples are available at https://helia95.github.io/inp-website. This paper introduces Interactive Neural Painting (INP), a novel image generation task where a computational tool assists users in painting by suggesting subsequent strokes based on a reference image and user input, aiming to make painting more accessible. Current Neural Painting (NP) methods lack interactivity and limit user control over the artistic process. INP addresses this gap by enabling user participation and integrating their style, potentially democratizing artistic expression through painting. The proposed method, INP-VAE, uses a conditional transformer VAE architecture. It leverages a context encoder to extract information from the reference image, user-painted canvas, and recent strokes. A two-stage decoder predicts stroke parameters, ensuring coherence with the reference and mimicking human painting styles learned from a synthetic dataset. INP-VAE generates stroke suggestions that accurately reflect reference images while adhering to characteristics of human painting demonstrations. The method exhibits superior performance compared to adapted state-of-the-art NP techniques in terms of stroke sequence similarity, diversity, and adherence to painting style. A user study confirms a clear preference for INP-VAE over baselines in generating stroke sequences that align with human-like painting processes. The reliance on synthetic datasets for training, while demonstrating the framework's capability, highlights the need for real human painting data for further refinement. Future research can explore incorporating user feedback mechanisms to adapt the model's suggestions and better align with user intentions over time. interactive neural painting, image generation, conditional vae, transformer networks, human-computer interaction
2307.16371 Report MobileVidFactory: Automatic Diffusion-Based Social Media Video Generation for Mobile Devices from Text Junchen Zhu, Huan Yang, Wenjing Wang, Huiguo He, Zixi Tuo, Yongsheng Yu, Wen-Huang Cheng, Lianli Gao, Jingkuan Song, Jianlong Fu, Jiebo Luo Videos for mobile devices become the most popular access to share and acquire information recently. For the convenience of users' creation, in this paper, we present a system, namely MobileVidFactory, to automatically generate vertical mobile videos where users only need to give simple texts mainly. Our system consists of two parts: basic and customized generation. In the basic generation, we take advantage of the pretrained image diffusion model, and adapt it to a high-quality open-domain vertical video generator for mobile devices. As for the audio, by retrieving from our big database, our system matches a suitable background sound for the video. Additionally to produce customized content, our system allows users to add specified screen texts to the video for enriching visual expression, and specify texts for automatic reading with optional voices as they like. Introduces MobileVidFactory, the first automatic system for generating vertical videos for mobile devices from text, incorporating both basic and user-customized content creation. Addresses the growing popularity of vertical videos on social media and the need for accessible, easy-to-use video creation tools. Combines a pretrained image diffusion model adapted for vertical video generation, an audio retrieval model for background sound, and optional user-specified text overlays and text-to-speech narration. Generates high-quality vertical videos with detailed frames and smooth motion. Enables users to customize videos with text overlays and personalized voiceovers. Offers a user-friendly way to create engaging content for mobile consumption. The current training dataset for vertical video finetuning is limited. Exploring more sophisticated audio-visual matching techniques is of interest. vertical video generation, diffusion model, mobile video, text-to-video, social media
2307.16275 Report Stylized Projected GAN: A Novel Architecture for Fast and Realistic Image Generation Md Nurul Muttakin, Malik Shahid Sultan, Robert Hoehndorf, Hernando Ombao Generative Adversarial Networks generate data using a generator and a discriminator. GANs usually produce high-quality images, but training them in an adversarial setting is a difficult task, requiring high computation power and hyper-parameter regularization to converge. Projected GANs tackle the training difficulty of GANs by using transfer learning to project the generated and real samples into a pre-trained feature space. Projected GANs improve the training time and convergence but produce artifacts in the generated images which reduce the quality of the generated samples. We propose an optimized architecture called Stylized Projected GAN, which integrates the mapping network of StyleGAN with the Skip Layer Excitation of FastGAN. The integrated modules are incorporated within the generator architecture of FastGAN to mitigate the problem of artifacts in the generated images. The paper proposes Stylized Projected GAN (SPGAN), a novel architecture that integrates the mapping network of StyleGAN with Skip Layer Excitation (SLE) of FastGAN for faster generation of realistic images with fewer training samples. Training GANs is challenging, often requiring extensive computational resources and large datasets. Existing methods like Projected GANs, while faster, suffer from artifacts in generated images. This work addresses the need for architectures that balance training speed and image quality. The authors experiment with different combinations of architectural components from StyleGAN and FastGAN, focusing on the generator design. They investigate the impact of integrating the mapping network at different resolutions, using deeper mapping networks, and combining the mapping network with SLE. SPGAN with stylization in initial layers significantly reduces the number of training samples required compared to Projected GAN, achieving better FID, KID, and precision scores. Integrating the mapping network with SLE in later layers is ineffective, suggesting artifacts originate in low-resolution layers. Deeper mapping networks improve image diversity (higher recall) but may slightly reduce image quality (lower precision). Despite improvements, artifacts persist in generated images. Future work will focus on refining the discriminator to address this. Potential solutions include incorporating artifact-aware loss functions, a separate artifact classification head, or an encoder-based approach for artifact detection. generative adversarial networks, image generation, transfer learning, stylegan, fastgan
2307.16204 Report Open-Set Domain Adaptation with Visual-Language Foundation Models Qing Yu, Go Irie, Kiyoharu Aizawa Unsupervised domain adaptation (UDA) has proven to be very effective in transferring knowledge obtained from a source domain with labeled data to a target domain with unlabeled data. Owing to the lack of labeled data in the target domain and the possible presence of unknown classes, open-set domain adaptation (ODA) has emerged as a potential solution to identify these classes during the training phase. Although existing ODA approaches aim to solve the distribution shifts between the source and target domains, most methods fine-tuned ImageNet pre-trained models on the source domain with the adaptation on the target domain. Recent visual-language foundation models (VLFM), such as Contrastive Language-Image Pre-Training (CLIP), are robust to many distribution shifts and, therefore, should substantially improve the performance of ODA. In this work, we explore generic ways to adopt CLIP, a popular VLFM, for ODA. We investigate the performance of zero-shot prediction using CLIP, and then propose an entropy optimization strategy to assist the ODA models with the outputs of CLIP. The proposed approach achieves state-of-the-art results on various benchmarks, demonstrating its effectiveness in addressing the ODA problem. This paper proposes a novel method for Open-Set Domain Adaptation (ODA) that leverages the power of Visual-Language Foundation Models (VLFM), particularly CLIP. Existing ODA methods often struggle with unknown classes and distribution shifts between domains. This work leverages the robust zero-shot capabilities and large-scale pre-training of CLIP to enhance ODA performance. The method utilizes the zero-shot predictions from CLIP and an entropy optimization strategy. The entropy of CLIP's predictions identifies potential unknown samples. An ODA model is then trained on source data and adapted to the target domain using entropy separation and CLIP's predictions as guidance. The proposed method achieves state-of-the-art results on various ODA benchmarks, including Office, Office-Home, VisDA, and DomainNet. The approach is effective for both ODA and Source-Free ODA (SF-ODA). The study reveals that CLIP's zero-shot performance is comparable to existing ODA methods, especially on datasets with coarse-grained classes and common domains. The current method utilizes a simple strategy for leveraging CLIP. Exploring more sophisticated integration and fine-tuning strategies could further improve performance. Future work includes investigating computationally efficient methods for adapting CLIP to ODA while preventing overfitting to the source domain. open-set domain adaptation, source-free domain adaptation, visual-language foundation models, clip, zero-shot learning
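As a rough illustration of the entropy-based split described above, the sketch below flags high-entropy CLIP zero-shot predictions as unknown-class candidates. It assumes pre-extracted, L2-normalised CLIP embeddings; the temperature, threshold, and function name are placeholders rather than the paper's settings.

```python
import torch

def entropy_split(image_feats: torch.Tensor,
                  text_feats: torch.Tensor,
                  threshold: float,
                  temperature: float = 0.01):
    """Split target-domain samples into known / unknown candidates via CLIP.

    image_feats: [N, D] L2-normalised CLIP image embeddings (target domain).
    text_feats:  [C, D] L2-normalised CLIP text embeddings of the C known
                 source classes (e.g. prompts like "a photo of a <class>").
    High-entropy zero-shot predictions are treated as potential unknowns;
    the remaining samples receive CLIP pseudo-labels to guide adaptation.
    """
    probs = (image_feats @ text_feats.t() / temperature).softmax(dim=-1)    # [N, C]
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)           # [N]
    unknown_mask = entropy > threshold
    pseudo_labels = probs.argmax(dim=-1)
    return pseudo_labels, unknown_mask
```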
2307.16184 Report UnIVAL: Unified Model for Image, Video, Audio and Language Tasks Mustafa Shukor, Corentin Dancette, Alexandre Rame, Matthieu Cord Large Language Models (LLMs) have made the ambitious quest for generalist agents significantly far from being a fantasy. A key hurdle for building such general models is the diversity and heterogeneity of tasks and modalities. A promising solution is unification, allowing the support of a myriad of tasks and modalities within one unified framework. While few large models (e.g., Flamingo (Alayrac et al., 2022), trained on massive datasets, can support more than two modalities, current small to mid-scale unified models are still limited to 2 modalities, usually image-text or video-text. The question that we ask is: is it possible to build efficiently a unified model that can support all modalities? To answer this, we propose UnIVAL, a step further towards this ambitious goal. Without relying on fancy datasets sizes or models with billions of parameters, the ~ 0.25B parameter UnIVAL model goes beyond two modalities and unifies text, images, video, and audio into a single model. Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning. UnIVAL shows competitive performance to existing state-of-the-art approaches, across image and video-text tasks. The feature representations learned from image and video-text modalities, allows the model to achieve competitive performance when finetuned on audio-text tasks, despite not being pretrained on audio. Thanks to the unified model, we propose a novel study on multimodal model merging via weight interpolation of models trained on different multimodal tasks, showing their benefits in particular for out-of-distribution generalization. Finally, we motivate unification by showing the synergy between tasks. The model weights and code are released here: https://github.com/mshukor/UnIVAL. Proposes UnIVAL, a unified model handling image, video, and audio-text tasks within a single architecture, vocabulary, input/output format, and training objective. Overcomes limitations of models focused on one or two modalities by leveraging synergies between diverse tasks and modalities for a more generalist approach. Employs a Transformer-based encoder-decoder LM with modality-specific CNN encoders, pretrained on a variety of image/video-text datasets using a multimodal curriculum learning and task balancing strategy. Achieves competitive performance on image/video-text tasks, including new SOTA on Visual Grounding. Shows strong generalization to new modalities, achieving competitive performance on audio-text tasks without pretraining on audio data. Demonstrates the effectiveness of weight interpolation for merging models finetuned on different multimodal tasks, improving multitask performance without inference overhead. Limited performance on complex instructions and tasks requiring intricate reasoning. Hallucinations and potential biases inherited from training data need further mitigation. multimodal learning, unified models, curriculum learning, weight interpolation, generalist agents
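The weight-interpolation study mentioned above rests on a very simple operation, sketched below under the assumption that both checkpoints share the same architecture, floating-point parameters, and pretrained initialisation; the function name, the uniform interpolation coefficient, and the file names in the usage comment are illustrative.

```python
import torch

def interpolate_checkpoints(state_a: dict, state_b: dict, lam: float = 0.5) -> dict:
    """Linearly interpolate two finetuned checkpoints of the same model.

    Works in the regime where both models were finetuned from the same
    pretrained weights. lam = 1.0 returns model A, lam = 0.0 returns model B.
    """
    assert state_a.keys() == state_b.keys(), "checkpoints must share parameters"
    return {k: lam * state_a[k] + (1.0 - lam) * state_b[k] for k in state_a}

# merged = interpolate_checkpoints(torch.load("caption_task.pt"),
#                                  torch.load("vqa_task.pt"), lam=0.5)
# model.load_state_dict(merged)
```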
2307.16183 Report HD-Fusion: Detailed Text-to-3D Generation Leveraging Multiple Noise Estimation Jinbo Wu, Xiaobo Gao, Xing Liu, Zhengyang Shen, Chen Zhao, Haocheng Feng, Jingtuo Liu, Errui Ding In this paper, we study Text-to-3D content generation leveraging 2D diffusion priors to enhance the quality and detail of the generated 3D models. Recent progress (Magic3D) in text-to-3D has shown that employing high-resolution (e.g., 512 x 512) renderings can lead to the production of high-quality 3D models using latent diffusion priors. To enable rendering at even higher resolutions, which has the potential to further augment the quality and detail of the models, we propose a novel approach that combines multiple noise estimation processes with a pretrained 2D diffusion prior. Distinct from the Bar-Tal et al.s' study which binds multiple denoised results to generate images from texts, our approach integrates the computation of scoring distillation losses such as SDS loss and VSD loss which are essential techniques for the 3D content generation with 2D diffusion priors. We experimentally evaluated the proposed approach. The results show that the proposed approach can generate high-quality details compared to the baselines. This paper proposes HD-Fusion, a novel text-to-3D generation method leveraging multiple noise estimation processes with pretrained 2D diffusion priors to produce highly detailed 3D models. Generating high-quality, detailed 3D models from text is crucial for applications like the Metaverse, requiring computationally expensive and data-intensive 3D diffusion models. HD-Fusion addresses this challenge by using 2D diffusion priors for efficient training and high-quality output. The method utilizes a two-stage coarse-to-fine approach. First, a neural field represents the object's shape and color, optimized using SDS loss in the latent space. The second stage uses a DMTet model and a color network, optimized by rendering views at higher resolutions and employing multiple noise estimation for memory efficiency. ControlNet is incorporated for geometric accuracy. The proposed multiple noise estimation enables training with high-resolution rendering, leading to finer details compared to baselines. HD-Fusion outperforms SOTA methods like Magic3D and Fantasia3D in terms of visual quality. Task-specific guidance, like pose guidance using ControlNet, significantly improves geometric accuracy, as shown in 3D human character generation. The impact of varying the number of tiles in the multiple noise estimation process needs further investigation. Exploring the combination of the proposed method with other advancements in text-to-3D, such as VSD, could lead to even better visual quality. Future work involves investigating the application of the proposed approach to more challenging tasks, such as 3D scene generation from text. text-to-3d generation, diffusion models, multiple noise estimation, high-resolution rendering, controlnet
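For context, the score distillation machinery referenced above is usually written as the gradient below (the standard SDS formulation; the paper's exact notation, and its VSD variant, may differ). Here $g(\theta)$ is the differentiable renderer, $\hat{\epsilon}_{\phi}$ is the frozen 2D diffusion model's noise prediction for prompt $y$, and $w(t)$ is a timestep weighting; the multiple-noise-estimation scheme evaluates such terms over several tiles of a high-resolution rendering and aggregates them to keep memory bounded.

```latex
\nabla_{\theta}\,\mathcal{L}_{\mathrm{SDS}}(\theta)
  \;=\; \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_{\phi}(x_t;\,y,\,t)-\epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad x = g(\theta),\quad
x_t = \alpha_t\,x + \sigma_t\,\epsilon,\ \ \epsilon\sim\mathcal{N}(0, I).
```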
2307.16151 Report StylePrompter: All Styles Need Is Attention Chenyi Zhuang, Pan Gao, Aljosa Smolic GAN inversion aims at inverting given images into corresponding latent codes for Generative Adversarial Networks (GANs), especially StyleGAN where exists a disentangled latent space that allows attribute-based image manipulation at latent level. As most inversion methods build upon Convolutional Neural Networks (CNNs), we transfer a hierarchical vision Transformer backbone innovatively to predict $\mathcal{W^+}$ latent codes at token level. We further apply a Style-driven Multi-scale Adaptive Refinement Transformer (SMART) in $\mathcal{F}$ space to refine the intermediate style features of the generator. By treating style features as queries to retrieve lost identity information from the encoder's feature maps, SMART can not only produce high-quality inverted images but also surprisingly adapt to editing tasks. We then prove that StylePrompter lies in a more disentangled $\mathcal{W^+}$ and show the controllability of SMART. Finally, quantitative and qualitative experiments demonstrate that StylePrompter can achieve desirable performance in balancing reconstruction quality and editability, and is "smart" enough to fit into most edits, outperforming other $\mathcal{F}$-involved inversion methods. This paper introduces StylePrompter, a novel Transformer-based GAN inversion framework for generating high-quality, editable images by mapping real images into StyleGAN's latent space. Balancing high-quality image inversion with flexible editing capabilities in GANs remains a challenge. Existing methods often struggle with this trade-off, especially in deeper, more expressive latent spaces like StyleGAN's $\mathcal{F}$ space. The authors employ a hierarchical Swin Transformer backbone to predict latent codes ($\mathcal{W^+}$) at a token level, allowing for disentangled attribute learning. Additionally, they introduce a Style-driven Multi-scale Adaptive Refinement Transformer (SMART) block to refine intermediate style features in $\mathcal{F}$ space, enhancing reconstruction quality and enabling flexible editing. StylePrompter achieves a better balance between reconstruction quality and editability compared to previous methods. The study shows that the predicted latent codes in $\mathcal{W^+}$ space exhibit a higher degree of disentanglement, enabling more controlled and meaningful image manipulations. The proposed SMART block effectively refines style features, enhancing inversion quality while surprisingly preserving editing capabilities in the $\mathcal{F}$ space. The model struggles to reconstruct out-of-domain details, potentially due to style feature modification at a shallow layer. Future work could explore stacking multiple SMART blocks for progressive refinement and improved out-of-domain detail reconstruction. gan inversion, transformer, image editing, multi-scale attention, stylegan
2307.16125 Report SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, Ying Shan Based on powerful Large Language Models (LLMs), recent generative Multimodal Large Language Models (MLLMs) have gained prominence as a pivotal research area, exhibiting remarkable capability for both comprehension and generation. In this work, we address the evaluation of generative comprehension in MLLMs as a preliminary step towards a comprehensive assessment of generative models, by introducing a benchmark named SEED-Bench. SEED-Bench consists of 19K multiple choice questions with accurate human annotations (x 6 larger than existing benchmarks), which spans 12 evaluation dimensions including the comprehension of both the image and video modality. We develop an advanced pipeline for generating multiple-choice questions that target specific evaluation dimensions, integrating both automatic filtering and manual verification processes. Multiple-choice questions with groundtruth options derived from human annotation enables an objective and efficient assessment of model performance, eliminating the need for human or GPT intervention during evaluation. We further evaluate the performance of 18 models across all 12 dimensions, covering both the spatial and temporal understanding. By revealing the limitations of existing MLLMs through evaluation results, we aim for SEED-Bench to provide insights for motivating future research. We will launch and consistently maintain a leaderboard to provide a platform for the community to assess and investigate model capability. This paper introduces SEED-Bench, a large-scale benchmark designed to evaluate the generative comprehension abilities of Multimodal Large Language Models (MLLMs). Existing benchmarks for evaluating MLLMs are limited in scale, scope, and objectivity. SEED-Bench addresses these limitations by providing a comprehensive and objective evaluation framework for MLLMs. SEED-Bench leverages foundation models to extract visual information from images and videos, which is then used by ChatGPT/GPT-4 to generate multiple-choice questions. The generated questions are then filtered automatically and manually to ensure quality and relevance. Most MLLMs still exhibit limited performance across all 12 evaluation dimensions, especially in fine-grained temporal understanding and text recognition. InstructBLIP achieves state-of-the-art results on SEED-Bench, demonstrating superior performance in 8 out of 12 evaluation dimensions. VideoLLMs, despite being trained on video data, fail to achieve competitive performance on temporal understanding tasks compared to ImageLLMs. The current version of SEED-Bench primarily focuses on multiple-choice questions, potentially limiting the diversity of evaluated abilities. Future work includes expanding the benchmark with additional evaluation dimensions, incorporating more diverse question formats, and exploring automatic generation of video-related questions. multimodal large language models, benchmarking, generative comprehension, visual reasoning, temporal understanding
2307.15860 Report What can Discriminator do? Towards Box-free Ownership Verification of Generative Adversarial Network Ziheng Huang, Boheng Li, Yan Cai, Run Wang, Shangwei Guo, Liming Fang, Jing Chen, Lina Wang In recent decades, Generative Adversarial Network (GAN) and its variants have achieved unprecedented success in image synthesis. However, well-trained GANs are under the threat of illegal theft or leakage. The prior studies on remote ownership verification assume a black-box setting where the defender can query the suspicious model with specific inputs, which we find is insufficient for generation tasks. To this end, in this paper, we propose a novel IP protection scheme for GANs where ownership verification can be done by checking outputs only, without choosing the inputs (i.e., box-free setting). Specifically, we make use of the unexploited potential of the discriminator to learn a hypersphere that captures the unique distribution learned by the paired generator. Extensive evaluations on two popular GAN tasks and more than 10 GAN architectures demonstrate that our proposed scheme effectively verifies ownership. Our proposed scheme is shown to be immune to popular input-based removal attacks and robust against other existing attacks. The source code and models are available at https://github.com/AbstractTeen/gan_ownership_verification This paper proposes a novel, box-free ownership verification scheme for Generative Adversarial Networks (GANs) by leveraging the discriminator's ability to capture the generator's learned data distribution. Existing black-box verification methods for GANs are vulnerable to input manipulation and ambiguity attacks, particularly in tasks where deterministic inputs are not feasible. The method trains a hypersphere-based classifier using the discriminator's representations. This classifier captures the unique distribution of images generated by the paired generator. A pearson correlation loss is introduced during training to prevent discriminator degradation and preserve its representational capacity. The method effectively distinguishes between GANs with different architectures, training datasets, and even initialization seeds. The scheme is robust against model pruning and output image transformations while maintaining acceptable image quality. It is resilient to ambiguity attacks as it relies on the discriminator's unique representation, which is difficult to replicate without the original discriminator. The security relies on the secrecy of the discriminator, as its disclosure could enable attacks. Future work can explore extending the approach to other generative models like diffusion models. generative adversarial networks, ownership verification, box-free verification, discriminator representation, ambiguity attack
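A minimal sketch of the hypersphere idea, in the spirit of Deep SVDD: fit a centre and radius on the paired discriminator's features of generator outputs, then verify ownership of a suspect image by checking whether its feature falls inside the sphere. The simple quantile radius below, and all names, are illustrative assumptions; the paper additionally regularises training with a Pearson-correlation loss, which this sketch omits.

```python
import torch

@torch.no_grad()
def fit_hypersphere(features: torch.Tensor, quantile: float = 0.95):
    """Fit a one-class hypersphere on discriminator features.

    features: [N, D] penultimate-layer discriminator responses to images
              produced by the paired (protected) generator.
    Returns a centre and a radius covering `quantile` of training distances.
    """
    center = features.mean(dim=0)
    radius = torch.quantile((features - center).norm(dim=-1), quantile)
    return center, radius

@torch.no_grad()
def verify_ownership(suspect_feats: torch.Tensor, center: torch.Tensor,
                     radius: torch.Tensor) -> torch.Tensor:
    """True where a suspect output lies inside the claimed generator's sphere."""
    return (suspect_feats - center).norm(dim=-1) <= radius
```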
2307.15697 Report SimDETR: Simplifying self-supervised pretraining for DETR Ioannis Maniadis Metaxas, Adrian Bulat, Ioannis Patras, Brais Martinez, Georgios Tzimiropoulos DETR-based object detectors have achieved remarkable performance but are sample-inefficient and exhibit slow convergence. Unsupervised pretraining has been found to be helpful to alleviate these impediments, allowing training with large amounts of unlabeled data to improve the detector's performance. However, existing methods have their own limitations, like keeping the detector's backbone frozen in order to avoid performance degradation and utilizing pretraining objectives misaligned with the downstream task. To overcome these limitations, we propose a simple pretraining framework for DETR-based detectors that consists of three simple yet key ingredients: (i) richer, semantics-based initial proposals derived from high-level feature maps, (ii) discriminative training using object pseudo-labels produced via clustering, (iii) self-training to take advantage of the improved object proposals learned by the detector. We report two main findings: (1) Our pretraining outperforms prior DETR pretraining works on both the full and low data regimes by significant margins. (2) We show we can pretrain DETR from scratch (including the backbone) directly on complex image datasets like COCO, paving the path for unsupervised representation learning directly using DETR. This paper proposes SimDETR, a self-supervised pretraining framework for DETR-based object detectors that improves sample efficiency and convergence speed. DETR-based detectors, while achieving high performance, are known for slow convergence and requiring large amounts of labeled data. SimDETR uses three key components: 1) semantics-based initial proposals from clustered high-level feature maps, 2) class-aware pretraining using object pseudo-labels derived from clustering, and 3) iterative self-training for refining object proposals and enhancing supervision. SimDETR outperforms prior DETR pretraining methods in full data, semi-supervised, and few-shot settings. SimDETR allows pretraining DETR from scratch, including the backbone, directly on complex datasets like COCO, demonstrating effective unsupervised representation learning. SimDETR achieves competitive results for self-supervised representation learning on scene-centric images, indicating its potential for general-purpose representation learning. The performance of SimDETR, while competitive, is still slightly lower than object-centric pretraining on ImageNet, suggesting further room for improvement. The paper focuses on DETR-based architectures, leaving the exploration of SimDETR's effectiveness on other detection frameworks for future work. object detection, self-supervised learning, detr, unsupervised pretraining, representation learning
2307.15640 Report CLIP Brings Better Features to Visual Aesthetics Learners Liwu Xu, Jinjin Xu, Yuzhe Yang, Yijie Huang, Yanchun Xie, Yaqian Li The success of pre-training approaches on a variety of downstream tasks has revitalized the field of computer vision. Image aesthetics assessment (IAA) is one of the ideal application scenarios for such methods due to subjective and expensive labeling procedure. In this work, an unified and flexible two-phase \textbf{C}LIP-based \textbf{S}emi-supervised \textbf{K}nowledge \textbf{D}istillation paradigm is proposed, namely \textbf{\textit{CSKD}}. Specifically, we first integrate and leverage a multi-source unlabeled dataset to align rich features between a given visual encoder and an off-the-shelf CLIP image encoder via feature alignment loss. Notably, the given visual encoder is not limited by size or structure and, once well-trained, it can seamlessly serve as a better visual aesthetic learner for both student and teacher. In the second phase, the unlabeled data is also utilized in semi-supervised IAA learning to further boost student model performance when applied in latency-sensitive production scenarios. By analyzing the attention distance and entropy before and after feature alignment, we notice an alleviation of feature collapse issue, which in turn showcase the necessity of feature alignment instead of training directly based on CLIP image encoder. Extensive experiments indicate the superiority of CSKD, which achieves state-of-the-art performance on multiple widely used IAA benchmarks. This paper proposes CSKD, a novel CLIP-based two-phase Semi-supervised Knowledge Distillation method for Image Aesthetics Assessment (IAA), which improves both the generalization ability and the knowledge distillation efficiency of IAA algorithms. IAA suffers from poor model generalization ability due to the subjective and expensive labeling procedure. Existing DL-based methods have high complexity hindering their deployment on mobile devices, while lightweight models usually suffer from unacceptable performance drop. This paper utilizes the representation ability of CLIP to improve the performance of IAA models. The method consists of two phases: 1) CLIP-based Feature Alignment (CFA): aligns the features of a given visual encoder with an off-the-shelf CLIP image encoder using a large unlabeled dataset; 2) Semi-supervised Knowledge Distillation (SKD): fine-tunes a teacher model using labeled IAA data and then trains a student model with both labeled and unlabeled data by minimizing the difference between their predictions and human/pseudo labels. CSKD achieves state-of-the-art performance on multiple IAA benchmarks, including AVA, AADB, and PARA. Analysis of attention maps before and after CFA indicates an alleviation of the feature collapse issue. Semi-supervised knowledge distillation with unlabeled data significantly boosts student model performance. Limitation1: The performance improvement brought by using a larger CLIP model is not thoroughly investigated. Limitation2: The method is only evaluated on three IAA datasets, and its generalization ability to other datasets needs further validation. Future work will focus on exploring the impact of different CLIP models and applying the method to other image-related tasks. image aesthetics assessment, clip, knowledge distillation, semi-supervised learning, feature alignment
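The first-phase feature alignment can be pictured as a simple distillation objective between the trainable visual encoder and the frozen CLIP image encoder. The sketch below uses a cosine-alignment loss with a learnable projection head, which is an assumption on our part; CSKD's exact alignment head and loss may differ.

```python
import torch
import torch.nn.functional as F

def feature_alignment_loss(student_feats: torch.Tensor,
                           clip_feats: torch.Tensor,
                           proj: torch.nn.Linear) -> torch.Tensor:
    """Align a visual encoder's pooled features with frozen CLIP image features.

    student_feats: [B, D_s] features of the encoder being trained (any backbone).
    clip_feats:    [B, D_c] features from the frozen CLIP image encoder,
                   computed on the same (unlabeled) images.
    proj:          learnable linear head mapping D_s -> D_c.
    """
    s = F.normalize(proj(student_feats), dim=-1)
    t = F.normalize(clip_feats.detach(), dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()   # cosine-distance alignment
```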
2307.15353 Report Supervised Homography Learning with Realistic Dataset Generation Hai Jiang, Haipeng Li, Songchen Han, Haoqiang Fan, Bing Zeng, Shuaicheng Liu In this paper, we propose an iterative framework, which consists of two phases: a generation phase and a training phase, to generate realistic training data and yield a supervised homography network. In the generation phase, given an unlabeled image pair, we utilize the pre-estimated dominant plane masks and homography of the pair, along with another sampled homography that serves as ground truth to generate a new labeled training pair with realistic motion. In the training phase, the generated data is used to train the supervised homography network, in which the training data is refined via a content consistency module and a quality assessment module. Once an iteration is finished, the trained network is used in the next data generation phase to update the pre-estimated homography. Through such an iterative strategy, the quality of the dataset and the performance of the network can be gradually and simultaneously improved. Experimental results show that our method achieves state-of-the-art performance and existing supervised methods can be also improved based on the generated dataset. Code and dataset are available at https://github.com/JianghaiSCU/RealSH. This paper proposes an iterative deep framework to generate realistic datasets for supervised homography learning and trains a high-precision homography estimation network. Supervised homography learning methods lag behind unsupervised methods due to the lack of qualified training data that simultaneously satisfies both label criteria and realism criteria. The framework iteratively generates training data and trains the homography network. It uses pre-estimated dominant plane masks, initial homographies, and a sampled ground truth homography to synthesize realistic image pairs. A content consistency module and a quality assessment module are introduced to refine the generated data during training. The method achieves state-of-the-art performance on CA-unsup and GHOF benchmarks, outperforming existing supervised and unsupervised methods. The generated dataset, CA-sup, significantly improves the performance of existing supervised methods, demonstrating its effectiveness. Ablation studies validate the contribution of each component in the framework, including the dataset generation strategy, content consistency module, and quality assessment module. The method relies on the accuracy of pre-estimated dominant plane masks and initial homographies. The iterative process requires more computation compared to traditional training methods. homography estimation, dataset generation, supervised learning, deep learning, computer vision
2307.15333 Report Dynamic PlenOctree for Adaptive Sampling Refinement in Explicit NeRF Haotian Bai, Yiqi Lin, Yize Chen, Lin Wang The explicit neural radiance field (NeRF) has gained considerable interest for its efficient training and fast inference capabilities, making it a promising direction such as virtual reality and gaming. In particular, PlenOctree (POT)[1], an explicit hierarchical multi-scale octree representation, has emerged as a structural and influential framework. However, POT's fixed structure for direct optimization is sub-optimal as the scene complexity evolves continuously with updates to cached color and density, necessitating refining the sampling distribution to capture signal complexity accordingly. To address this issue, we propose the dynamic PlenOctree DOT, which adaptively refines the sample distribution to adjust to changing scene complexity. Specifically, DOT proposes a concise yet novel hierarchical feature fusion strategy during the iterative rendering process. Firstly, it identifies the regions of interest through training signals to ensure adaptive and efficient refinement. Next, rather than directly filtering out valueless nodes, DOT introduces the sampling and pruning operations for octrees to aggregate features, enabling rapid parameter learning. Compared with POT, our DOT outperforms it by enhancing visual quality, reducing over $55.15$/$68.84\%$ parameters, and providing 1.7/1.9 times FPS for NeRF-synthetic and Tanks $\&$ Temples, respectively. Project homepage:https://vlislab22.github.io/DOT. [1] Yu, Alex, et al. "Plenoctrees for real-time rendering of neural radiance fields." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021. This paper proposes DOT, a dynamic PlenOctree structure that adaptively refines the sample distribution in explicit NeRF based on training signals, improving rendering quality and efficiency. Fixed octree structures like POT are sub-optimal as scene complexity changes during training. DOT addresses this by dynamically calibrating the octree structure for better adaptation. DOT uses a hierarchical feature fusion strategy. It identifies regions of interest based on training signals like ray weight and then prunes valueless regions while sampling more in complex areas. This process iteratively refines the octree, aggregating features for efficient representation. DOT significantly reduces the number of parameters compared to POT (over 55% on synthetic and 68% on Tanks & Temples). It enhances rendering quality, achieving better PSNR, SSIM, and LPIPS scores than POT on both datasets. DOT achieves a considerable speedup, nearly doubling the FPS of POT on synthetic and Tanks & Temples datasets. DOT relies on pretrained NeRF-SH models, inheriting the limitation of potentially long initial training times. Future work includes exploring methods to train the model from scratch with signal-guided sample allocation. neural radiance fields, plenoctree, adaptive sampling, hierarchical feature fusion, real-time rendering
2307.15157 Report R-LPIPS: An Adversarially Robust Perceptual Similarity Metric Sara Ghazanfari, Siddharth Garg, Prashanth Krishnamurthy, Farshad Khorrami, Alexandre Araujo Similarity metrics have played a significant role in computer vision to capture the underlying semantics of images. In recent years, advanced similarity metrics, such as the Learned Perceptual Image Patch Similarity (LPIPS), have emerged. These metrics leverage deep features extracted from trained neural networks and have demonstrated a remarkable ability to closely align with human perception when evaluating relative image similarity. However, it is now well-known that neural networks are susceptible to adversarial examples, i.e., small perturbations invisible to humans crafted to deliberately mislead the model. Consequently, the LPIPS metric is also sensitive to such adversarial examples. This susceptibility introduces significant security concerns, especially considering the widespread adoption of LPIPS in large-scale applications. In this paper, we propose the Robust Learned Perceptual Image Patch Similarity (R-LPIPS) metric, a new metric that leverages adversarially trained deep features. Through a comprehensive set of experiments, we demonstrate the superiority of R-LPIPS compared to the classical LPIPS metric. The code is available at https://github.com/SaraGhazanfari/R-LPIPS. This paper introduces R-LPIPS, an adversarially robust perceptual similarity metric designed to address the vulnerability of the LPIPS metric to adversarial examples. The sensitivity of LPIPS to adversarial perturbations poses significant security risks, especially in applications like copyright infringement detection and digital forensics where image similarity assessment is crucial. R-LPIPS leverages adversarially trained deep features, incorporating adversarial training into the LPIPS training process to enhance its robustness. R-LPIPS demonstrates superior robustness compared to LPIPS when evaluated against adversarial attacks (l-infinity-PGD and l2-PGD) across various data distortions. The natural 2AFC score of R-LPIPS remains comparable to LPIPS, indicating that robustness is achieved without sacrificing accuracy. New perceptual attacks (R-PPGA and R-LPA) based on R-LPIPS prove to be more effective than attacks based on LPIPS, successfully breaking the defenses of the perceptually robust model PAT. The adversarial training of R-LPIPS currently focuses on x0, with potential for further exploration by applying AT to x1 or both x0 and x1. While adversarial training provides empirical robustness, R-LPIPS lacks theoretical guarantees. Investigating theoretical foundations for perceptual metrics like R-LPIPS is an important area for future work. perceptual similarity metric, adversarial robustness, lpips, adversarial training, perceptual attacks
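The kind of attack that motivates R-LPIPS can be sketched with a few lines of projected gradient descent that perturbs one input so the perceptual distance blows up. The sketch assumes the public `lpips` package with inputs in [-1, 1]; step sizes and iteration counts are illustrative, and R-LPIPS itself replaces the metric's features with adversarially trained ones rather than attacking at test time.

```python
import torch
import lpips  # reference implementation of the LPIPS metric (pip install lpips)

def pgd_maximize_lpips(x0: torch.Tensor, x1: torch.Tensor,
                       eps: float = 8 / 255, alpha: float = 2 / 255,
                       steps: int = 10) -> torch.Tensor:
    """l_inf PGD that perturbs x0 to maximise LPIPS(x0 + delta, x1).

    A robust metric (R-LPIPS) should keep the attacked distance close to the
    clean LPIPS(x0, x1); the vanilla metric typically does not.
    Inputs are expected in [-1, 1], shape [B, 3, H, W].
    """
    metric = lpips.LPIPS(net="alex").eval()
    delta = torch.zeros_like(x0, requires_grad=True)
    for _ in range(steps):
        dist = metric(x0 + delta, x1).sum()
        dist.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()            # ascend on the distance
            delta.clamp_(-eps, eps)                       # l_inf projection
            delta.copy_((x0 + delta).clamp(-1, 1) - x0)   # keep images in range
        delta.grad.zero_()
    return (x0 + delta).detach()
```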
2307.15139 Report Online Clustered Codebook Chuanxia Zheng, Andrea Vedaldi Vector Quantisation (VQ) is experiencing a comeback in machine learning, where it is increasingly used in representation learning. However, optimizing the codevectors in existing VQ-VAE is not entirely trivial. A problem is codebook collapse, where only a small subset of codevectors receive gradients useful for their optimisation, whereas a majority of them simply ``dies off'' and is never updated or used. This limits the effectiveness of VQ for learning larger codebooks in complex computer vision tasks that require high-capacity representations. In this paper, we present a simple alternative method for online codebook learning, Clustering VQ-VAE (CVQ-VAE). Our approach selects encoded features as anchors to update the ``dead'' codevectors, while optimising the codebooks which are alive via the original loss. This strategy brings unused codevectors closer in distribution to the encoded features, increasing the likelihood of being chosen and optimized. We extensively validate the generalization capability of our quantiser on various datasets, tasks (e.g. reconstruction and generation), and architectures (e.g. VQ-VAE, VQGAN, LDM). Our CVQ-VAE can be easily integrated into the existing models with just a few lines of code. This paper introduces CVQ-VAE, a novel Vector Quantisation (VQ) method addressing codebook collapse in representation learning by dynamically initializing codebooks using online feature clustering. Codebook collapse limits the effectiveness of VQ, particularly for large codebooks in complex computer vision tasks requiring high-capacity representations. CVQ-VAE aims to overcome this limitation and improve the utilization of large codebooks. CVQ-VAE dynamically initializes unoptimized codevectors by resampling from learned features. Unlike traditional clustering, it employs running averages of encoded features across mini-batches to handle changing feature representations during deep network training. CVQ-VAE significantly outperforms previous VQ methods like VQ-VAE and SQ-VAE on various datasets. It achieves superior reconstruction quality compared to state-of-the-art methods like VQGAN, even under high compression ratios. The method demonstrates strong generalization capabilities across different tasks, datasets, and architectures, including VQ-VAE, VQGAN, and LDM. While CVQ-VAE demonstrates promising results, the exploration of optimal codebook dimensionality remains an open question. Future work could investigate the application of CVQ-VAE to broader downstream tasks beyond generation and completion. vector quantisation, representation learning, codebook collapse, image generation, deep learning
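The core trick, pulling unused codevectors back toward the encoded-feature distribution, can be sketched as below. This is a simplified per-batch version under stated assumptions (usage tracked by an EMA, anchors drawn uniformly from the current batch, illustrative threshold and decay); CVQ-VAE itself maintains running averages of features across mini-batches and studies several anchor-sampling variants.

```python
import torch

@torch.no_grad()
def refresh_dead_codes(codebook: torch.Tensor,        # [K, D] codevectors
                       usage_ema: torch.Tensor,       # [K] EMA of selection counts
                       batch_features: torch.Tensor,  # [N, D] encoder outputs
                       usage_threshold: float = 1e-3,
                       decay: float = 0.5) -> torch.Tensor:
    """Blend rarely-used ("dead") codevectors with sampled encoder features.

    Dead codes are moved toward the feature distribution so that they are
    likely to be selected again and start receiving gradients, instead of
    being filtered out or left to drift.
    """
    dead = usage_ema < usage_threshold
    n_dead = int(dead.sum())
    if n_dead == 0:
        return codebook
    idx = torch.randint(0, batch_features.shape[0], (n_dead,),
                        device=batch_features.device)
    anchors = batch_features[idx]
    codebook[dead] = decay * codebook[dead] + (1.0 - decay) * anchors
    return codebook
```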
2307.15131 Report Seal-3D: Interactive Pixel-Level Editing for Neural Radiance Fields Xiangyu Wang, Jingsen Zhu, Qi Ye, Yuchi Huo, Yunlong Ran, Zhihua Zhong, Jiming Chen With the popularity of implicit neural representations, or neural radiance fields (NeRF), there is a pressing need for editing methods to interact with the implicit 3D models for tasks like post-processing reconstructed scenes and 3D content creation. While previous works have explored NeRF editing from various perspectives, they are restricted in editing flexibility, quality, and speed, failing to offer direct editing response and instant preview. The key challenge is to conceive a locally editable neural representation that can directly reflect the editing instructions and update instantly. To bridge the gap, we propose a new interactive editing method and system for implicit representations, called Seal-3D, which allows users to edit NeRF models in a pixel-level and free manner with a wide range of NeRF-like backbone and preview the editing effects instantly. To achieve the effects, the challenges are addressed by our proposed proxy function mapping the editing instructions to the original space of NeRF models in the teacher model and a two-stage training strategy for the student model with local pretraining and global finetuning. A NeRF editing system is built to showcase various editing types. Our system can achieve compelling editing effects with an interactive speed of about 1 second. Seal-3D, an interactive pixel-level editing method for neural radiance fields that supports instant preview. Existing NeRF editing methods are limited in flexibility, quality, and speed, lacking direct editing response and instant preview. The method uses a proxy function to map editing instructions to the original NeRF space and a two-stage training strategy (local pretraining for instant preview and global finetuning for refinement) for a student NeRF model. Interactive editing with instant preview (≈1s) is achieved. The method supports various editing types including geometry and color edits. The student model can generate higher quality results than the teacher model due to multi-view consistency. The method does not support complex view-dependent lighting effects. It cannot handle reconstruction failures in the original NeRF model. neural radiance fields, nerf editing, interactive editing, 3d scene editing, instant preview
2307.15055 Report PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking Yang Zheng, Adam W. Harley, Bokui Shen, Gordon Wetzstein, Leonidas J. Guibas We introduce PointOdyssey, a large-scale synthetic dataset, and data generation framework, for the training and evaluation of long-term fine-grained tracking algorithms. Our goal is to advance the state-of-the-art by placing emphasis on long videos with naturalistic motion. Toward the goal of naturalism, we animate deformable characters using real-world motion capture data, we build 3D scenes to match the motion capture environments, and we render camera viewpoints using trajectories mined via structure-from-motion on real videos. We create combinatorial diversity by randomizing character appearance, motion profiles, materials, lighting, 3D assets, and atmospheric effects. Our dataset currently includes 104 videos, averaging 2,000 frames long, with orders of magnitude more correspondence annotations than prior work. We show that existing methods can be trained from scratch in our dataset and outperform the published variants. Finally, we introduce modifications to the PIPs point tracking method, greatly widening its temporal receptive field, which improves its performance on PointOdyssey as well as on two real-world benchmarks. Our data and code are publicly available at: https://pointodyssey.com Introduces PointOdyssey, a large-scale synthetic dataset for training and evaluating long-term fine-grained tracking algorithms, featuring long videos with naturalistic motion and diverse scenes. Addresses the lack of datasets for fine-grained long-range tracking that reflect the complexities and opportunities of real-world video. Generates synthetic data using motion capture data to animate characters, recreates real-world environments, randomizes scene attributes, and provides pixel-perfect annotations for long-range trajectories. Existing methods trained on PointOdyssey outperform their publicly available variants. A modified PIPs method with an extended temporal receptive field and template updates (PIPs++) achieves state-of-the-art performance on PointOdyssey and real-world benchmarks. PointOdyssey presents a more challenging benchmark than existing real-world datasets like TAP-Vid-DAVIS and CroHD. Dataset currently lacks large outdoor scenes with significant camera travel. Exploration of trackers utilizing scene-level and semantic cues, beyond low-level appearance matching, remains an open challenge. point tracking, synthetic dataset, long-term tracking, motion capture, scene understanding
2307.15049 Report Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-trained Vision-Language Models Kecheng Zheng, Wei Wu, Ruili Feng, Kai Zhu, Jiawei Liu, Deli Zhao, Zheng-Jun Zha, Wei Chen, Yujun Shen Prompt tuning and adapter tuning have shown great potential in transferring pre-trained vision-language models (VLMs) to various downstream tasks. In this work, we design a new type of tuning method, termed as regularized mask tuning, which masks the network parameters through a learnable selection. Inspired by neural pathways, we argue that the knowledge required by a downstream task already exists in the pre-trained weights but just gets concealed in the upstream pre-training stage. To bring the useful knowledge back into light, we first identify a set of parameters that are important to a given downstream task, then attach a binary mask to each parameter, and finally optimize these masks on the downstream data with the parameters frozen. When updating the mask, we introduce a novel gradient dropout strategy to regularize the parameter selection, in order to prevent the model from forgetting old knowledge and overfitting the downstream data. Experimental results on 11 datasets demonstrate the consistent superiority of our method over previous alternatives. It is noteworthy that we manage to deliver 18.73% performance improvement compared to the zero-shot CLIP via masking an average of only 2.56% parameters. Furthermore, our method is synergistic with most existing parameter-efficient tuning methods and can boost the performance on top of them. Project page can be found here (https://wuw2019.github.io/R-AMT/). The paper introduces Regularized Mask Tuning (R-MT), a new technique for adapting pre-trained vision-language models (VLMs) to downstream tasks by selectively masking parameters using learnable binary masks. Existing efficient tuning methods like prompt tuning and adapter tuning do not fully exploit the potential of pre-trained VLM parameters. R-MT aims to uncover hidden task-specific knowledge within these parameters, inspired by the concept of neural pathways in the brain. R-MT identifies key parameters based on gradient changes during downstream task training. Binary masks are attached to these parameters and optimized with gradient dropout regularization. This regularization incorporates general knowledge from the pre-trained VLM to prevent forgetting and overfitting. R-MT consistently outperforms existing methods, including prompt tuning and adapter tuning, on 11 image classification datasets. R-MT achieves 18.73% performance improvement over zero-shot CLIP while masking only 2.56% of parameters on average. R-MT is synergistic with existing methods and can boost their performance by around 3%. R-MT has not been evaluated on open-world detection and segmentation tasks due to computational resource limitations. Future work will explore applying R-MT to other visual tasks such as segmentation. vision-language models, parameter-efficient tuning, mask tuning, few-shot learning, gradient dropout regularization
2307.15033 Report Diverse Inpainting and Editing with GAN Inversion Ahmet Burak Yildirim, Hamza Pehlivan, Bahri Batuhan Bilecen, Aysegul Dundar Recent inversion methods have shown that real images can be inverted into StyleGAN's latent space and numerous edits can be achieved on those images thanks to the semantically rich feature representations of well-trained GAN models. However, extensive research has also shown that image inversion is challenging due to the trade-off between high-fidelity reconstruction and editability. In this paper, we tackle an even more difficult task, inverting erased images into GAN's latent space for realistic inpaintings and editings. Furthermore, by augmenting inverted latent codes with different latent samples, we achieve diverse inpaintings. Specifically, we propose to learn an encoder and mixing network to combine encoded features from erased images with StyleGAN's mapped features from random samples. To encourage the mixing network to utilize both inputs, we train the networks with generated data via a novel set-up. We also utilize higher-rate features to prevent color inconsistencies between the inpainted and unerased parts. We run extensive experiments and compare our method with state-of-the-art inversion and inpainting methods. Qualitative metrics and visual comparisons show significant improvements. This paper introduces a novel framework for diverse image inpainting and editing using GAN inversion. It leverages an encoder and mixing network to combine encoded features from erased images with randomly sampled latent codes from StyleGAN. This approach addresses the limitations of existing GAN inversion methods that struggle with the trade-off between high-fidelity reconstruction and editability, particularly in the challenging scenario of inpainting erased images. The framework utilizes a two-stage training pipeline. First, it trains an encoder and mixing network with generated data to ensure diversity. Second, it incorporates skip connections to achieve high-fidelity reconstructions and seamless transitions between unerased and erased pixels. The proposed method significantly outperforms state-of-the-art models in terms of FID, LPIPS, U-IDS, and P-IDS metrics for image inpainting. The framework demonstrates robustness across different mask difficulty levels and generalizes well to diverse datasets like FFHQ, AFHQ Cat, and AFHQ Dog. The model successfully performs diverse inpainting and enables image editing on erased regions using InterfaceGAN directions. The diversity of inpainting results, while improved, is still limited by the model's ability to generate semantically consistent pixels. Future work could explore alternative mixing network architectures or training strategies to further enhance the diversity and realism of inpainted outputs. gan inversion, image inpainting, image editing, generative adversarial networks, stylegan
2307.14971 Report Take-A-Photo: 3D-to-2D Generative Pre-training of Point Cloud Models Ziyi Wang, Xumin Yu, Yongming Rao, Jie Zhou, Jiwen Lu With the overwhelming trend of mask image modeling led by MAE, generative pre-training has shown a remarkable potential to boost the performance of fundamental models in 2D vision. However, in 3D vision, the over-reliance on Transformer-based backbones and the unordered nature of point clouds have restricted the further development of generative pre-training. In this paper, we propose a novel 3D-to-2D generative pre-training method that is adaptable to any point cloud model. We propose to generate view images from different instructed poses via the cross-attention mechanism as the pre-training scheme. Generating view images has more precise supervision than its point cloud counterpart, thus assisting 3D backbones to have a finer comprehension of the geometrical structure and stereoscopic relations of the point cloud. Experimental results have proved the superiority of our proposed 3D-to-2D generative pre-training over previous pre-training methods. Our method is also effective in boosting the performance of architecture-oriented approaches, achieving state-of-the-art performance when fine-tuning on ScanObjectNN classification and ShapeNetPart segmentation tasks. Code is available at https://github.com/wangzy22/TAP. This paper proposes TAP, a novel 3D-to-2D generative pre-training method for point cloud models that enhances geometric structure and stereoscopic relation understanding. Existing 3D generative pre-training methods suffer from imprecise supervision and limited backbone adaptability. This work aims to address these limitations. TAP generates view images from different poses using a pose-dependent Photograph Module and a 2D generator. The module encodes pose information into queries for cross-attention with 3D features, enabling the model to learn projection relations. The generated images are supervised by rendered ground truth images with MSE loss. TAP consistently improves performance across various point cloud backbone architectures. It outperforms previous generative pre-training methods on ScanObjectNN classification and achieves state-of-the-art results on ShapeNetPart segmentation. The method demonstrates superior performance in few-shot learning scenarios and shows promising results in scene-level dense prediction tasks. The current implementation relies on a relatively simple 2D generator, which could be further improved for generating higher-fidelity images. Exploring the effectiveness of perceptual loss with more realistic rendered images is an intriguing avenue for future work. 3d vision, point cloud analysis, generative pre-training, cross-modal learning, self-supervised learning
2307.14918 Report GET3D--: Learning GET3D from Unconstrained Image Collections Fanghua Yu, Xintao Wang, Zheyuan Li, Yan-Pei Cao, Ying Shan, Chao Dong The demand for efficient 3D model generation techniques has grown exponentially, as manual creation of 3D models is time-consuming and requires specialized expertise. While generative models have shown potential in creating 3D textured shapes from 2D images, their applicability in 3D industries is limited due to the lack of a well-defined camera distribution in real-world scenarios, resulting in low-quality shapes. To overcome this limitation, we propose GET3D--, the first method that directly generates textured 3D shapes from 2D images with unknown pose and scale. GET3D-- comprises a 3D shape generator and a learnable camera sampler that captures the 6D external changes on the camera. In addition, we propose a novel training schedule to stably optimize both the shape generator and camera sampler in a unified framework. By controlling external variations using the learnable camera sampler, our method can generate aligned shapes with clear textures. Extensive experiments demonstrate the efficacy of GET3D--, which precisely fits the 6D camera pose distribution and generates high-quality shapes on both synthetic and realistic unconstrained datasets. GET3D-- generates textured 3D shapes from 2D images with unknown and unconstrained camera poses. Existing 3D generation methods often assume fixed or known camera distributions, limiting their applicability to real-world images with unconstrained camera poses. GET3D-- employs a 3D shape generator and a learnable 6D camera sampler. It uses a novel training schedule: (1) initializes the shape generator with a fixed camera distribution, (2) initializes the camera sampler with the learned coarse shapes, (3) jointly trains both, and (4) fine-tunes the shape generator. It also uses camera compensation and a shape align loss to decouple object and camera transformations. GET3D-- generates higher-quality shapes and textures compared to baseline GET3D on unconstrained datasets. The learnable camera sampler effectively captures the underlying 6D camera distribution. Camera compensation and shape align loss are crucial for accurate shape and texture generation. The method assumes camera component independence and single Gaussian ground-truth distribution. Shape align loss might introduce noise when object shapes vary greatly. 3d shape generation, camera pose estimation, unconstrained images, generative adversarial networks, differentiable rendering
2307.14770 Report 3DPortraitGAN: Learning One-Quarter Headshot 3D GANs from a Single-View Portrait Dataset with Diverse Body Poses Yiqian Wu, Hao Xu, Xiangjun Tang, Hongbo Fu, Xiaogang Jin 3D-aware face generators are typically trained on 2D real-life face image datasets that primarily consist of near-frontal face data, and as such, they are unable to construct one-quarter headshot 3D portraits with complete head, neck, and shoulder geometry. Two reasons account for this issue: First, existing facial recognition methods struggle with extracting facial data captured from large camera angles or back views. Second, it is challenging to learn a distribution of 3D portraits covering the one-quarter headshot region from single-view data due to significant geometric deformation caused by diverse body poses. To this end, we first create the dataset 360°-Portrait-HQ (360°PHQ for short) which consists of high-quality single-view real portraits annotated with a variety of camera parameters (the yaw angles span the entire 360° range) and body poses. We then propose 3DPortraitGAN, the first 3D-aware one-quarter headshot portrait generator that learns a canonical 3D avatar distribution from the 360°PHQ dataset with body pose self-learning. Our model can generate view-consistent portrait images from all camera angles with a canonical one-quarter headshot 3D representation. Our experiments show that the proposed framework can accurately predict portrait body poses and generate view-consistent, realistic portrait images with complete geometry from all camera angles. This paper introduces 3DPortraitGAN, the first 3D-aware one-quarter headshot portrait generator that can learn a canonical 3D avatar distribution from a single-view portrait dataset with diverse body poses. Existing 3D-aware face generators are limited to frontal views and lack complete neck and shoulder geometry due to limitations in existing datasets. The authors create a new dataset, 360°-Portrait-HQ (360°PHQ), containing single-view portraits with diverse camera angles and body poses. They then propose a 3DPortraitGAN framework with a body pose-aware discriminator and a deformation module to generate view-consistent one-quarter headshot portraits with complete geometry. 3DPortraitGAN generates high-quality, view-consistent portrait images from 360° camera angles. The model accurately predicts body poses, surpassing the accuracy of coarse poses obtained from off-the-shelf methods. Quantitative evaluation shows 3DPortraitGAN outperforms state-of-the-art methods in FID and facial identity consistency metrics. The deformation module, solely based on the SMPL model, does not consider the generated geometry, leading to artifacts and high computational cost. The pose predictor in the generator is prone to collapsing during training, limiting the model's ability to achieve perfectly canonical representations. portrait generation, 3d-aware gans, deformable neural radiance fields, single-view reconstruction, body pose estimation
2307.14735 Report Test Time Adaptation for Blind Image Quality Assessment Subhadeep Roy, Shankhanil Mitra, Soma Biswas, Rajiv Soundararajan While the design of blind image quality assessment (IQA) algorithms has improved significantly, the distribution shift between the training and testing scenarios often leads to a poor performance of these methods at inference time. This motivates the study of test time adaptation (TTA) techniques to improve their performance at inference time. Existing auxiliary tasks and loss functions used for TTA may not be relevant for quality-aware adaptation of the pre-trained model. In this work, we introduce two novel quality-relevant auxiliary tasks at the batch and sample levels to enable TTA for blind IQA. In particular, we introduce a group contrastive loss at the batch level and a relative rank loss at the sample level to make the model quality aware and adapt to the target data. Our experiments reveal that even using a small batch of images from the test distribution helps achieve significant improvement in performance by updating the batch normalization statistics of the source model. This paper introduces novel self-supervised test-time adaptation (TTA) techniques for blind image quality assessment (IQA) to address the challenge of distribution shifts between training and testing data. Existing IQA algorithms often suffer from poor generalization ability due to distribution shifts between training and testing scenarios. TTA offers a promising solution to adapt pre-trained IQA models to target data distributions at inference time, thereby improving their performance. The proposed TTA-IQA method introduces two novel quality-relevant auxiliary tasks at the batch and sample levels to enable TTA for IQA: 1) Group Contrastive (GC) loss: Contrasting groups of low and high-quality images in a batch to capture quality discriminative information. 2) Rank loss: Enforcing the model to rank the image quality of distorted augmentations of each test sample to maintain quality order. TTA-IQA significantly improves the performance of four different quality-aware source models (TReS, MUSIQ, HyperIQA, MetaIQA) on four IQA databases (KonIQ-10k, PIPAL, CID2013, LIVE-IQA). The combination of rank loss and GC loss consistently outperforms using either loss individually, demonstrating their complementary nature. TTA-IQA effectively adapts to target data even with small batch sizes, highlighting its efficiency in real-world scenarios. The choice of distortion types for the rank loss relies on the source model's knowledge, which may be inaccurate for significantly different target distributions. Future work includes exploring more sophisticated auxiliary tasks and extending TTA-IQA to video quality assessment. test-time adaptation, blind image quality assessment, group contrastive learning, rank loss, distribution shift
2307.14659 Report LLDiffusion: Learning Degradation Representations in Diffusion Models for Low-Light Image Enhancement Tao Wang, Kaihao Zhang, Ziqian Shao, Wenhan Luo, Bjorn Stenger, Tae-Kyun Kim, Wei Liu, Hongdong Li Current deep learning methods for low-light image enhancement (LLIE) typically rely on pixel-wise mapping learned from paired data. However, these methods often overlook the importance of considering degradation representations, which can lead to sub-optimal outcomes. In this paper, we address this limitation by proposing a degradation-aware learning scheme for LLIE using diffusion models, which effectively integrates degradation and image priors into the diffusion process, resulting in improved image enhancement. Our proposed degradation-aware learning scheme is based on the understanding that degradation representations play a crucial role in accurately modeling and capturing the specific degradation patterns present in low-light images. To this end, we first present a joint learning framework for both image generation and image enhancement to learn the degradation representations. Second, to leverage the learned degradation representations, we develop a Low-Light Diffusion model (LLDiffusion) with a well-designed dynamic diffusion module. This module takes into account both the color map and the latent degradation representations to guide the diffusion process. By incorporating these conditioning factors, the proposed LLDiffusion can effectively enhance low-light images, considering both the inherent degradation patterns and the desired color fidelity. Finally, we evaluate our proposed method on several well-known benchmark datasets, including synthetic and real-world unpaired datasets. Extensive experiments on public benchmarks demonstrate that our LLDiffusion outperforms state-of-the-art LLIE methods both quantitatively and qualitatively. The source code and pre-trained models are available at https://github.com/TaoWangzj/LLDiffusion. This paper introduces LLDiffusion, a novel degradation-aware diffusion model for low-light image enhancement, which integrates degradation representations into the enhancement process. Current LLIE methods often overlook degradation representations, leading to sub-optimal results with artifacts or unnatural enhancements. LLDiffusion addresses this by explicitly modeling and utilizing degradation patterns. The approach involves a two-stage process: (1) Joint learning of degradation representations through a degradation generation network and an enhancement diffusion module. (2) Enhancement using a dynamic diffusion module conditioned on learned degradation representations and image priors (color maps). LLDiffusion outperforms state-of-the-art LLIE methods on benchmark datasets (LOL, LOL-v2, VE-LOL) both quantitatively and qualitatively. The method exhibits strong generalization ability, effectively enhancing images from unseen datasets (DICM, MEF, NPE). Ablation studies confirm the contribution of each component, highlighting the importance of degradation representation learning and the dynamic diffusion module. The latent map encoder currently has a simple structure and could be improved with increased width and depth for potentially better performance. Future work will explore extending LLDiffusion for low-light video enhancement. low-light image enhancement, diffusion models, degradation representations, deep learning, computer vision
2307.14638 Report EqGAN: Feature Equalization Fusion for Few-shot Image Generation Yingbo Zhou, Zhihao Yue, Yutong Ye, Pengyu Zhang, Xian Wei, Mingsong Chen Due to the absence of fine structure and texture information, existing fusion-based few-shot image generation methods suffer from unsatisfactory generation quality and diversity. To address this problem, we propose a novel feature Equalization fusion Generative Adversarial Network (EqGAN) for few-shot image generation. Unlike existing fusion strategies that rely on either deep features or local representations, we design two separate branches to fuse structures and textures by disentangling encoded features into shallow and deep contents. To refine image contents at all feature levels, we equalize the fused structure and texture semantics at different scales and supplement the decoder with richer information by skip connections. Since the fused structures and textures may be inconsistent with each other, we devise a consistent equalization loss between the equalized features and the intermediate output of the decoder to further align the semantics. Comprehensive experiments on three public datasets demonstrate that, EqGAN not only significantly improves generation performance with FID score (by up to 32.7%) and LPIPS score (by up to 4.19%), but also outperforms the state-of-the-arts in terms of accuracy (by up to 1.97%) for downstream classification tasks. The paper proposes EqGAN, a feature equalization fusion-based generative adversarial network for few-shot image generation. Existing fusion-based methods suffer from unsatisfactory generation quality and diversity due to semantic entanglement when fusing image features. EqGAN disentangles encoded features into structure and texture branches, performs multi-scale feature equalization fusion, and introduces a consistent equalization loss to align fused semantics. EqGAN significantly improves FID and LPIPS scores compared to state-of-the-art methods, demonstrating superior image quality and diversity. Ablation studies confirm the effectiveness of each component in the feature equalization fusion strategy. EqGAN boosts the accuracy of downstream classification tasks by providing higher-quality augmented images. The model's performance might be further enhanced by exploring more sophisticated fusion strategies. The computational cost of EqGAN is relatively high due to the multi-scale feature processing. few-shot image generation, generative adversarial networks, feature fusion, semantic alignment, image quality
2307.14620 Report NeRF-Det: Learning Geometry-Aware Volumetric Representation for Multi-View 3D Object Detection Chenfeng Xu, Bichen Wu, Ji Hou, Sam Tsai, Ruilong Li, Jialiang Wang, Wei Zhan, Zijian He, Peter Vajda, Kurt Keutzer, Masayoshi Tomizuka We present NeRF-Det, a novel method for indoor 3D detection with posed RGB images as input. Unlike existing indoor 3D detection methods that struggle to model scene geometry, our method makes novel use of NeRF in an end-to-end manner to explicitly estimate 3D geometry, thereby improving 3D detection performance. Specifically, to avoid the significant extra latency associated with per-scene optimization of NeRF, we introduce sufficient geometry priors to enhance the generalizability of NeRF-MLP. Furthermore, we subtly connect the detection and NeRF branches through a shared MLP, enabling an efficient adaptation of NeRF to detection and yielding geometry-aware volumetric representations for 3D detection. Our method outperforms state-of-the-arts by 3.9 mAP and 3.1 mAP on the ScanNet and ARKITScenes benchmarks, respectively. We provide extensive analysis to shed light on how NeRF-Det works. As a result of our joint-training design, NeRF-Det is able to generalize well to unseen scenes for object detection, view synthesis, and depth estimation tasks without requiring per-scene optimization. Code is available at \url{https://github.com/facebookresearch/NeRF-Det}. Presents NeRF-Det, a novel method for indoor 3D object detection from posed RGB images, leveraging NeRF to learn geometry-aware volumetric representations. Addresses the challenge of ambiguous scene geometry in indoor 3D detection from RGB-only images by explicitly modeling it using NeRF. Jointly trains a NeRF branch with the 3D detection pipeline, sharing a geometry MLP and using augmented image features (including variance and color) as priors for NeRF. It estimates an opacity field from density to refine volume features. Outperforms state-of-the-art RGB-only methods by 3.9 mAP and 3.1 mAP on ScanNet and ARKITScenes, respectively. Demonstrates the effectiveness of NeRF over depth maps and cost volume for scene geometry modeling in 3D detection. Shows generalization ability to novel view synthesis and depth estimation on unseen scenes without per-scene optimization. The detection branch might hinder the NeRF branch's performance by potentially erasing low-level details. Future work includes adapting NeRF-Det for outdoor 3D detection, addressing challenges like dynamic objects and unbounded scenes. 3d object detection, neural radiance fields (nerf), multi-view geometry, indoor scene understanding, geometry-aware representations
2307.14611 Report TextManiA: Enriching Visual Feature by Text-driven Manifold Augmentation Moon Ye-Bin, Jisoo Kim, Hongyeob Kim, Kilho Son, Tae-Hyun Oh We propose TextManiA, a text-driven manifold augmentation method that semantically enriches visual feature spaces, regardless of class distribution. TextManiA augments visual data with intra-class semantic perturbation by exploiting easy-to-understand visually mimetic words, i.e., attributes. This work is built on an interesting hypothesis that general language models, e.g., BERT and GPT, encompass visual information to some extent, even without training on visual training data. Given the hypothesis, TextManiA transfers pre-trained text representation obtained from a well-established large language encoder to a target visual feature space being learned. Our extensive analysis hints that the language encoder indeed encompasses visual information at least useful to augment visual representation. Our experiments demonstrate that TextManiA is particularly powerful in scarce samples with class imbalance as well as even distribution. We also show compatibility with the label mix-based approaches in evenly distributed scarce data. Proposes TextManiA, a method that enriches visual features by transferring attribute information from text embeddings to visual feature spaces, particularly beneficial for long-tailed and scarce data. Addresses the challenge of performance degradation in learning models when faced with data distribution shifts, especially in long-tailed distributions and scarce data scenarios. Leverages visually mimetic words (attributes) encoded by language models (BERT, GPT-2, CLIP) to augment visual features. Computes difference vectors between text embeddings with and without attributes, projects them onto the target visual feature space, and adds them to the original features. TextManiA consistently improves performance on long-tailed classification benchmarks (CIFAR-100-LT, ImageNet-LT), demonstrating its effectiveness in handling skewed class distributions. Outperforms or complements mix-based augmentation methods in scarce data classification tasks (CIFAR-100-10%, Tiny-ImageNet-10%), highlighting the benefit of intra-class semantic perturbation. Improves few-shot object detection accuracy (PASCAL VOC, MS-COCO) by enhancing the classification head's performance, especially in low-shot settings. Current attribute set limited to color and size, exploring additional attributes could further enhance performance. More effective attribute selection methods for specific tasks and datasets could be investigated. data augmentation, long-tail classification, scarce data, few-shot learning, vision and language
2307.14489 Report SuperInpaint: Learning Detail-Enhanced Attentional Implicit Representation for Super-resolutional Image Inpainting Canyu Zhang, Qing Guo, Xiaoguang Li, Renjie Wan, Hongkai Yu, Ivor Tsang, Song Wang In this work, we introduce a challenging image restoration task, referred to as SuperInpaint, which aims to reconstruct missing regions in low-resolution images and generate completed images with arbitrarily higher resolutions. We have found that this task cannot be effectively addressed by stacking state-of-the-art super-resolution and image inpainting methods as they amplify each other's flaws, leading to noticeable artifacts. To overcome these limitations, we propose the detail-enhanced attentional implicit representation (DEAR) that can achieve SuperInpaint with a single model, resulting in high-quality completed images with arbitrary resolutions. Specifically, we use a deep convolutional network to extract the latent embedding of an input image and then enhance the high-frequency components of the latent embedding via an adaptive high-pass filter. This leads to detail-enhanced semantic embedding. We further feed the semantic embedding into an unmask-attentional module that suppresses embeddings from ineffective masked pixels. Additionally, we extract a pixel-wise importance map that indicates which pixels should be used for image reconstruction. Given the coordinates of a pixel we want to reconstruct, we first collect its neighboring pixels in the input image and extract their detail-enhanced semantic embeddings, unmask-attentional semantic embeddings, importance values, and spatial distances to the desired pixel. Then, we feed all the above terms into an implicit representation and generate the color of the specified pixel. To evaluate our method, we extend three existing datasets for this new task and build 18 meaningful baselines using SOTA inpainting and super-resolution methods. Extensive experimental results demonstrate that our method outperforms all existing methods by a significant margin on four widely used metrics. This paper identifies a novel and challenging image restoration task, termed "SuperInpaint", which focuses on reconstructing missing regions in low-resolution images and generating high-fidelity completed images at any desired higher resolution. Existing image inpainting methods can't handle resolution changes, and super-resolution methods struggle with large missing regions. Combining them directly leads to amplified artifacts and unsatisfactory results. The authors propose DEAR (Detail-Enhanced Attentional Implicit Representation) for this task. DEAR leverages implicit image representation and incorporates three key modules: 1) Detail-Enhanced Semantic Embedding (DSE) to enhance high-frequency details. 2) Unmask-Attentional Semantic Embedding (USE) to suppress information from ineffective masked pixels. 3) Pixel-wise Importance Map to identify pixels suitable for reconstruction. DEAR significantly outperforms all 18 constructed baselines (combinations of SOTA inpainting and super-resolution methods) on three newly created datasets for SuperInpaint. DEAR achieves superior performance in terms of PSNR, SSIM, L1, and LPIPS across a wide range of upscaling ratios. Ablation studies confirm the effectiveness of each proposed module (DSE, USE, PIM) in contributing to the overall performance gain. The current work primarily focuses on reconstructing images with a single upscale ratio during training. Exploring the feasibility of training a single DEAR model for arbitrary upscale ratios is an intriguing direction for future work. image inpainting, super-resolution, implicit neural representation, detail enhancement, attention mechanism
2307.14352 Report General Image-to-Image Translation with One-Shot Image Guidance Bin Cheng, Zuhao Liu, Yunbo Peng, Yue Lin Large-scale text-to-image models pre-trained on massive text-image pairs show excellent performance in image synthesis recently. However, image can provide more intuitive visual concepts than plain text. People may ask: how can we integrate the desired visual concept into an existing image, such as our portrait? Current methods are inadequate in meeting this demand as they lack the ability to preserve content or translate visual concepts effectively. Inspired by this, we propose a novel framework named visual concept translator (VCT) with the ability to preserve content in the source image and translate the visual concepts guided by a single reference image. The proposed VCT contains a content-concept inversion (CCI) process to extract contents and concepts, and a content-concept fusion (CCF) process to gather the extracted information to obtain the target image. Given only one reference image, the proposed VCT can complete a wide range of general image-to-image translation tasks with excellent results. Extensive experiments are conducted to prove the superiority and effectiveness of the proposed methods. Codes are available at https://github.com/CrystalNeuro/visual-concept-translator. This paper proposes Visual Concept Translator (VCT), a novel framework for general image-to-image translation guided by a single reference image. Image-guided I2I, integrating visual concepts from a reference image into a source image while preserving content, has broad applications in areas like game production and art creation. Existing methods struggle to effectively translate visual concepts while preserving source content. VCT employs a two-step process: (1) Content-Concept Inversion (CCI) extracts content and concept embeddings from the source and reference images respectively using techniques like Pivot Turning Inversion and Multi-concept Inversion. (2) Content-Concept Fusion (CCF) utilizes a dual-stream denoising architecture with an attention control mechanism to combine extracted information and generate the target image. VCT demonstrates superior performance in general I2I tasks compared to GAN-based and existing diffusion-based methods, effectively translating concepts from reference images while preserving source image content. The method excels in style transfer tasks, outperforming state-of-the-art approaches by effectively transferring artistic styles from reference images to content images. Ablation studies confirm the efficacy of individual VCT components, including Multi-concept Inversion, Pivotal Turning Inversion, and Attention Control, highlighting their contributions to the framework's performance. The paper acknowledges a trade-off between preserving source image structure and incorporating semantic changes from the reference image, suggesting further exploration of this balance. Future work could investigate extending VCT to incorporate multiple reference images for more complex concept fusion and manipulation. image-to-image translation, visual concept, diffusion models, one-shot learning, attention mechanism
2307.14331 Report Visual Instruction Inversion: Image Editing via Visual Prompting Thao Nguyen, Yuheng Li, Utkarsh Ojha, Yong Jae Lee Text-conditioned image editing has emerged as a powerful tool for editing images. However, in many situations, language can be ambiguous and ineffective in describing specific image edits. When faced with such challenges, visual prompts can be a more informative and intuitive way to convey ideas. We present a method for image editing via visual prompting. Given example pairs that represent the "before" and "after" images of an edit, our goal is to learn a text-based editing direction that can be used to perform the same edit on new images. We leverage the rich, pretrained editing capabilities of text-to-image diffusion models by inverting visual prompts into editing instructions. Our results show that with just one example pair, we can achieve competitive results compared to state-of-the-art text-conditioned image editing frameworks. This paper presents a novel framework for image editing that learns specific editing instructions from before-and-after image pairs, enabling intuitive editing with diffusion models. Describing desired image edits with text can be challenging due to the ambiguity of language. Visual prompting offers a more intuitive and precise way to convey specific image transformations. The proposed method leverages a pretrained text-conditioned image editing diffusion model (InstructPix2Pix). By optimizing a textual instruction to reconstruct the "after" image from the "before" image while aligning with their semantic difference in CLIP embedding space, the method learns an edit direction applicable to new images. The method achieves competitive performance against state-of-the-art text-conditioned image editing models, demonstrating its effectiveness in learning and applying edits from visual prompts. Using identical noise during training and testing helps balance the extent of editing and faithfulness to the input image. The learned instructions can be combined with user-provided text prompts, allowing for flexible and specific image manipulations. The method's reliance on a pretrained model limits its editing scope and might inherit unwanted biases. Further research is needed to investigate the sensitivity to prompt selection and explore the potential of diffusion models as task solvers for computer vision tasks. image editing, visual prompting, diffusion models, text-to-image synthesis, computer vision
2307.14073 Report VideoControlNet: A Motion-Guided Video-to-Video Translation Framework by Using Diffusion Model with ControlNet Zhihao Hu, Dong Xu Recently, diffusion models like StableDiffusion have achieved impressive image generation results. However, the generation process of such diffusion models is uncontrollable, which makes it hard to generate videos with continuous and consistent content. In this work, by using the diffusion model with ControlNet, we propose a new motion-guided video-to-video translation framework called VideoControlNet to generate various videos based on the given prompts and the condition from the input video. Inspired by the video codecs that use motion information for reducing temporal redundancy, our framework uses motion information to prevent the regeneration of the redundant areas for content consistency. Specifically, we generate the first frame (i.e., the I-frame) by using the diffusion model with ControlNet. Then we generate other key frames (i.e., the P-frames) based on the previous I/P-frame by using our newly proposed motion-guided P-frame generation (MgPG) method, in which the P-frames are generated based on the motion information and the occlusion areas are inpainted by using the diffusion model. Finally, the remaining frames (i.e., the B-frames) are generated by using our motion-guided B-frame interpolation (MgBI) module. Our experiments demonstrate that our proposed VideoControlNet inherits the generation capability of the pre-trained large diffusion model and extends the image diffusion model to the video diffusion model by using motion information. More results are provided at our project page. Proposed VideoControlNet, a motion-guided video-to-video translation framework using a diffusion model with ControlNet, for generating diverse and content-consistent videos from prompts and input video conditions. Existing video diffusion models struggle to generate videos with continuous and consistent content due to the uncontrollable nature of the diffusion process. Leverages motion information to prevent redundant area regeneration and uses diffusion-model-based inpainting for new content. Employs a motion-guided P-frame generation (MgPG) module for keyframes and a motion-guided B-frame interpolation (MgBI) module for intermediate frames. Outperforms state-of-the-art methods in user preference and objective metrics like FVD, IS, FID. Generates high-quality videos with better content consistency compared to methods like Text2LIVE. Offers flexibility in controlling video style and enables video editing through masks and prompts. Relies on accurate optical flow estimation for optimal performance, which can be challenging for complex motion. Strong motion guidance necessitates detailed conditions (depth maps, canny maps), limiting flexibility in condition types (e.g., segmentation maps). video generation, diffusion models, video-to-video translation, controlnet, motion guidance
2307.14063 Report ECO: Ensembling Context Optimization for Vision-Language Models Lorenzo Agnolucci, Alberto Baldrati, Francesco Todino, Federico Becattini, Marco Bertini, Alberto Del Bimbo Image recognition has recently witnessed a paradigm shift, where vision-language models are now used to perform few-shot classification based on textual prompts. Among these, the CLIP model has shown remarkable capabilities for zero-shot transfer by matching an image and a custom textual prompt in its latent space. This has paved the way for several works that focus on engineering or learning textual contexts for maximizing CLIP's classification capabilities. In this paper, we follow this trend by learning an ensemble of prompts for image classification. We show that learning diverse and possibly shorter contexts improves considerably and consistently the results rather than relying on a single trainable prompt. In particular, we report better few-shot capabilities with no additional cost at inference time. We demonstrate the capabilities of our approach on 11 different benchmarks. This paper introduces ECO, a method that enhances prompt learning for few-shot image classification in vision-language models by learning an ensemble of diverse and shorter textual prompts instead of a single, longer prompt. ECO improves upon existing prompt learning methods, which often focus on optimizing a single textual prompt, by leveraging the power of prompt ensembling to achieve more robust and accurate results, especially in few-shot scenarios. ECO learns multiple sets of context tokens (prompts) with a reduced number of tokens per prompt while keeping the total number of trainable parameters the same as single-prompt methods like CoOp. The learned prompts are then combined using prompt ensembling, effectively averaging their textual features for classification. ECO consistently outperforms existing methods, including zero-shot CLIP and CoOp, on 11 different image classification benchmarks. The method proves to be more data-efficient, showing significant improvements even with a limited number of training shots (1 or 2). ECO maintains computational efficiency at inference time as the learned prompt features can be pre-computed and used as a single prompt. The current study focuses on evaluating ECO with CoOp; further research could explore its integration with other prompt learning techniques like CoCoOp and MaPLe. While ECO effectively balances context length and the number of prompts, determining the optimal configuration for specific datasets or tasks might require further investigation. prompt learning, prompt ensembling, few-shot learning, image classification, vision-language models
2307.14030 Report Consensus-Adaptive RANSAC Luca Cavalli, Daniel Barath, Marc Pollefeys, Viktor Larsson RANSAC and its variants are widely used for robust estimation, however, they commonly follow a greedy approach to finding the highest scoring model while ignoring other model hypotheses. In contrast, Iteratively Reweighted Least Squares (IRLS) techniques gradually approach the model by iteratively updating the weight of each correspondence based on the residuals from previous iterations. Inspired by these methods, we propose a new RANSAC framework that learns to explore the parameter space by considering the residuals seen so far via a novel attention layer. The attention mechanism operates on a batch of point-to-model residuals, and updates a per-point estimation state to take into account the consensus found through a lightweight one-step transformer. This rich state then guides the minimal sampling between iterations as well as the model refinement. We evaluate the proposed approach on essential and fundamental matrix estimation on a number of indoor and outdoor datasets. It outperforms state-of-the-art estimators by a significant margin adding only a small runtime overhead. Moreover, we demonstrate good generalization properties of our trained model, indicating its effectiveness across different datasets and tasks. The proposed attention mechanism and one-step transformer provide an adaptive behavior that enhances the performance of RANSAC, making it a more effective tool for robust estimation. Code is available at https://github.com/cavalli1234/CA-RANSAC. Proposes CA-RANSAC, a novel RANSAC framework that leverages consensus from previous iterations to enhance sampling and model refinement during robust estimation. Addresses limitations of traditional RANSAC methods that ignore sub-optimal model hypotheses, leading to improved exploration of the parameter space and better model selection. Introduces a consensus-based attention mechanism operating on point-to-model residuals, updating per-point estimation states using a one-step transformer to guide minimal sampling and non-linear model refinement. Outperforms state-of-the-art estimators in essential and fundamental matrix estimation tasks on indoor and outdoor datasets. Demonstrates superior accuracy, particularly in low-error regimes, indicating effective model refinement. Exhibits good generalization across different datasets, matching strategies, and estimation tasks. Current implementation relies on a fixed number of iterations without an early termination criterion. Exploration of more efficient biased sampling schemes within inlier pools could further enhance performance. ransac, robust estimation, attention mechanism, consensus-based learning, minimal sample selection
2307.13974 Report Tracking Anything in High Quality Jiawen Zhu, Zhenyu Chen, Zeqi Hao, Shijie Chang, Lu Zhang, Dong Wang, Huchuan Lu, Bin Luo, Jun-Yan He, Jin-Peng Lan, Hanyuan Chen, Chenyang Li Visual object tracking is a fundamental video task in computer vision. Recently, the notably increasing power of perception algorithms allows the unification of single/multiobject and box/mask-based tracking. Among them, the Segment Anything Model (SAM) attracts much attention. In this report, we propose HQTrack, a framework for High Quality Tracking anything in videos. HQTrack mainly consists of a video multi-object segmenter (VMOS) and a mask refiner (MR). Given the object to be tracked in the initial frame of a video, VMOS propagates the object masks to the current frame. The mask results at this stage are not accurate enough since VMOS is trained on several closed-set video object segmentation (VOS) datasets, which has limited ability to generalize to complex and corner scenes. To further improve the quality of tracking masks, a pretrained MR model is employed to refine the tracking results. As a compelling testament to the effectiveness of our paradigm, without employing any tricks such as test-time data augmentations and model ensemble, HQTrack ranks 2nd in the Visual Object Tracking and Segmentation (VOTS2023) challenge. Code and models are available at https://github.com/jiawen-zhu/HQTrack. HQTrack, a framework for High Quality Tracking anything in videos, comprising a video multi-object segmenter (VMOS) and a mask refiner (MR) Addresses challenges in the VOTS2023 challenge, such as long-term sequences, disappearing/reappearing targets, and complex scenes VMOS (based on DeAOT) propagates object masks across frames, and MR (using HQ-SAM) refines the masks by leveraging a pre-trained segmentation model. Joint tracking outperforms separate tracking for multiple objects. Multi-scale propagation mechanism and InternImage-T backbone significantly improve VMOS performance. Selectively refining masks with HQ-SAM based on IoU threshold enhances overall accuracy. Limited exploration of the relationship between long-term memory gap and object disappearance/reappearance. Further investigation on the influence of different mask refiners. visual object tracking, video object segmentation, multi-object tracking, mask refinement, hq-sam
2307.13908 Report Points-to-3D: Bridging the Gap between Sparse Points and Shape-Controllable Text-to-3D Generation Chaohui Yu, Qiang Zhou, Jingliang Li, Zhe Zhang, Zhibin Wang, Fan Wang Text-to-3D generation has recently garnered significant attention, fueled by 2D diffusion models trained on billions of image-text pairs. Existing methods primarily rely on score distillation to leverage the 2D diffusion priors to supervise the generation of 3D models, e.g., NeRF. However, score distillation is prone to suffer the view inconsistency problem, and implicit NeRF modeling can also lead to an arbitrary shape, thus leading to less realistic and uncontrollable 3D generation. In this work, we propose a flexible framework of Points-to-3D to bridge the gap between sparse yet freely available 3D points and realistic shape-controllable 3D generation by distilling the knowledge from both 2D and 3D diffusion models. The core idea of Points-to-3D is to introduce controllable sparse 3D points to guide the text-to-3D generation. Specifically, we use the sparse point cloud generated from the 3D diffusion model, Point-E, as the geometric prior, conditioned on a single reference image. To better utilize the sparse 3D points, we propose an efficient point cloud guidance loss to adaptively drive the NeRF's geometry to align with the shape of the sparse 3D points. In addition to controlling the geometry, we propose to optimize the NeRF for a more view-consistent appearance. To be specific, we perform score distillation to the publicly available 2D image diffusion model ControlNet, conditioned on text as well as depth map of the learned compact geometry. Qualitative and quantitative comparisons demonstrate that Points-to-3D improves view consistency and achieves good shape controllability for text-to-3D generation. Points-to-3D provides users with a new way to improve and control text-to-3D generation. Presents Points-to-3D, a novel text-to-3D generation framework that bridges the gap between sparse 3D points and realistic, shape-controllable 3D generation by leveraging pre-trained 2D and 3D diffusion models. Addresses limitations in existing text-to-3D methods, such as view inconsistency (Janus problem) and lack of shape controllability, aiming for more realistic and controllable 3D content generation. Utilizes a pre-trained point cloud diffusion model (Point-E) to generate sparse 3D points from a reference image, guides NeRF geometry using an efficient point cloud guidance loss, and optimizes appearance via score distillation from a controllable 2D diffusion model (ControlNet) conditioned on text and learned depth map. Significantly alleviates view inconsistency in generated 3D content compared to baselines. Achieves good controllability over 3D shapes by leveraging reference images and sparse 3D point guidance. Demonstrates superior performance in terms of CLIP R-precision and user preference for view consistency and prompt relevance. Performance can be affected by limitations of the underlying pre-trained 2D and 3D diffusion models. Currently requires a reference image for shape guidance, limiting spontaneity in content creation. text-to-3d, diffusion models, nerf, point cloud, shape controllability
2307.13856 Report On the unreasonable vulnerability of transformers for image restoration -- and an easy fix Shashank Agnihotri, Kanchana Vaishnavi Gandikota, Julia Grabinski, Paramanand Chandramouli, Margret Keuper Following their success in visual recognition tasks, Vision Transformers(ViTs) are being increasingly employed for image restoration. As a few recent works claim that ViTs for image classification also have better robustness properties, we investigate whether the improved adversarial robustness of ViTs extends to image restoration. We consider the recently proposed Restormer model, as well as NAFNet and the "Baseline network" which are both simplified versions of a Restormer. We use Projected Gradient Descent (PGD) and CosPGD, a recently proposed adversarial attack tailored to pixel-wise prediction tasks for our robustness evaluation. Our experiments are performed on real-world images from the GoPro dataset for image deblurring. Our analysis indicates that contrary to as advocated by ViTs in image classification works, these models are highly susceptible to adversarial attacks. We attempt to improve their robustness through adversarial training. While this yields a significant increase in robustness for Restormer, results on other networks are less promising. Interestingly, the design choices in NAFNet and Baselines, which were based on iid performance, and not on robust generalization, seem to be at odds with the model robustness. Thus, we investigate this further and find a fix. This paper investigates the adversarial robustness of Transformer-based image restoration networks, namely Restormer, Baseline Network, and NAFNet. This study is important because while these networks achieve state-of-the-art performance on clean images, their robustness to adversarial attacks is crucial for real-world applications, especially in safety-critical domains. The authors evaluate the robustness of these networks using PGD and CosPGD attacks on the GoPro image deblurring dataset. They analyze the effects of adversarial training as a defense mechanism and study the impact of different architectural choices on robustness. Transformer-based restoration networks are highly vulnerable to adversarial attacks, exhibiting significant performance drops and distinct spectral artifacts. Adversarial training effectively improves robustness and reduces spectral artifacts, with Restormer showing the most significant gains. Design choices in NAFNet and Baseline Network, aimed at simplifying Restormer, negatively impact robustness. Replacing GELU activation with ReLU in the Intermediate network significantly improves robustness. While adversarial training and design changes improve robustness, there is still a considerable gap in achieving ideal restoration quality. Future work could explore alternative methods beyond adversarial training to enhance robustness and image quality. adversarial robustness, image restoration, vision transformers, deblurring, adversarial training
2307.13746 Report ChildGAN: Large Scale Synthetic Child Facial Data Using Domain Adaptation in StyleGAN Muhammad Ali Farooq, Wang Yao, Gabriel Costache, Peter Corcoran In this research work, we propose ChildGAN, a pair of GAN networks derived from StyleGAN2 for generating synthetic facial data of boys and girls. ChildGAN is built by performing smooth domain transfer using transfer learning. It provides photo-realistic, high-quality data samples. A large-scale dataset is rendered with a variety of smart facial transformations: facial expressions, age progression, eye blink effects, head pose, skin and hair color variations, and variable lighting conditions. The dataset comprises more than 300k distinct data samples. Further, the uniqueness and characteristics of the rendered facial features are validated by running different computer vision application tests, which include a CNN-based child gender classifier, a face localization and facial landmark detection test, identity similarity evaluation using ArcFace, and eye detection and eye aspect ratio tests. The results demonstrate that synthetic child facial data of high quality offers an alternative to the cost and complexity of collecting a large-scale dataset from real children. This paper presents ChildGAN, a pair of GAN networks based on StyleGAN2 for generating large-scale, high-quality synthetic child facial images. Large-scale child facial datasets are crucial for various AI applications but are challenging to acquire due to ethical and privacy concerns. Synthetic data offers a viable alternative. ChildGAN leverages transfer learning to adapt StyleGAN2, trained on adult faces, to generate child faces. It incorporates smart transformations like facial expressions, aging, and lighting for data diversity. ChildGAN generates over 300k unique child face images with diverse attributes. Validation tests using gender classification, facial landmark detection, and identity similarity confirm the high quality and diversity of the synthetic data. Eye aspect ratio tests on the synthetic data demonstrate realistic eye blinking effects. Quantitative validation of the synthetic data distribution against a real-world ground truth remains challenging. Expanding ChildGAN to encompass greater ethnic diversity is a potential area for future research. synthetic data generation, generative adversarial networks (gans), facial image analysis, child facial recognition, transfer learning
2307.13720 Report Composite Diffusion | whole >= Σparts Vikram Jamwal, Ramaneswaran S For an artist or a graphic designer, the spatial layout of a scene is a critical design choice. However, existing text-to-image diffusion models provide limited support for incorporating spatial information. This paper introduces Composite Diffusion as a means for artists to generate high-quality images by composing from the sub-scenes. The artists can specify the arrangement of these sub-scenes through a flexible free-form segment layout. They can describe the content of each sub-scene primarily using natural text and additionally by utilizing reference images or control inputs such as line art, scribbles, human pose, canny edges, and more. We provide a comprehensive and modular method for Composite Diffusion that enables alternative ways of generating, composing, and harmonizing sub-scenes. Further, we wish to evaluate the composite image for effectiveness in both image quality and achieving the artist's intent. We argue that existing image quality metrics lack a holistic evaluation of image composites. To address this, we propose novel quality criteria especially relevant to composite generation. We believe that our approach provides an intuitive method of art creation. Through extensive user surveys, quantitative and qualitative analysis, we show how it achieves greater spatial, semantic, and creative control over image generation. In addition, our methods do not need to retrain or modify the architecture of the base diffusion models and can work in a plug-and-play manner with the fine-tuned models. This paper introduces Composite Diffusion, a novel approach for generating high-quality images by composing sub-scenes arranged by artists in a free-form layout. Existing text-to-image models offer limited spatial control, making it difficult for artists to dictate object layout and properties within a scene. This method seeks to grant artists greater creative control. The method utilizes pre-trained diffusion models and divides the generation into two stages: (1) Scaffolding: sub-scenes are generated independently using text descriptions, reference images, or control conditions. (2) Harmonization: Sub-scenes are blended and refined in the context of each other, ensuring coherence. Composite Diffusion demonstrates superior performance in spatial fidelity and content fidelity compared to text-to-image and serial inpainting baselines. The method allows for controlled variation in image generation through modifications in segment layout, text descriptions, and the use of fine-tuned models. Qualitative evaluation through user surveys and artist collaboration confirms the effectiveness of Composite Diffusion in creating high-quality, customizable artwork. The current implementation's performance is limited by the granularity of sub-scenes supported by the diffusion model's image space. Achieving precise object shape conformance in text-only conditioning remains a challenge, often necessitating the use of control condition inputs. image generation, diffusion models, spatial control, composite images, generative ai
2307.13639 Report Fake It Without Making It: Conditioned Face Generation for Accurate 3D Face Reconstruction Will Rowan, Patrik Huber, Nick Pears, Andrew Keeling Accurate 3D face reconstruction from 2D images is an enabling technology with applications in healthcare, security, and creative industries. However, current state-of-the-art methods either rely on supervised training with very limited 3D data or self-supervised training with 2D image data. To bridge this gap, we present a method to generate a large-scale synthesised dataset of 250K photorealistic images and their corresponding shape parameters and depth maps, which we call SynthFace. Our synthesis method conditions Stable Diffusion on depth maps sampled from the FLAME 3D Morphable Model (3DMM) of the human face, allowing us to generate a diverse set of shape-consistent facial images that is designed to be balanced in race and gender. We further propose ControlFace, a deep neural network, trained on SynthFace, which achieves competitive performance on the NoW benchmark, without requiring 3D supervision or manual 3D asset creation. The complete SynthFace dataset will be made publicly available upon publication. This paper introduces SynthFace, a large-scale synthetic dataset of 250K photorealistic face images with corresponding 3D shape parameters and depth maps, and ControlFace, a deep neural network trained on SynthFace for 3D face reconstruction. Accurate 3D face reconstruction from 2D images is crucial for applications in various fields, but existing methods are limited by the scarcity of paired 2D-to-3D data. SynthFace addresses this by providing a large-scale dataset for supervised training. SynthFace is generated by conditioning Stable Diffusion, a text-to-image diffusion model, on depth maps of 3D faces from the FLAME model. This generates photorealistic images with known 3D shape. ControlFace is then trained on SynthFace to regress 3DMM parameters from facial images. SynthFace is the largest dataset of its kind, containing 250K photorealistic face images with corresponding 3D shape information, balanced by race and gender. ControlFace, trained on SynthFace, achieves competitive performance on the NoW benchmark for 3D face reconstruction. This approach demonstrates the potential of combining 2D and 3D generative models for improving 3D face reconstruction. The current iteration of SynthFace does not model facial expressions, limiting the scope of ControlFace to shape prediction. The use of ArcFace, an identity descriptor network, to extract shape information might introduce errors. Future work could explore networks specifically designed for shape extraction. 3d face reconstruction, synthetic data, stable diffusion, 3d morphable model, controlnet
2307.13240 Report Fashion Matrix: Editing Photos by Just Talking Zheng Chong, Xujie Zhang, Fuwei Zhao, Zhenyu Xie, Xiaodan Liang The utilization of Large Language Models (LLMs) for the construction of AI systems has garnered significant attention across diverse fields. The extension of LLMs to the domain of fashion holds substantial commercial potential but also inherent challenges due to the intricate semantic interactions in fashion-related generation. To address this issue, we developed a hierarchical AI system called Fashion Matrix dedicated to editing photos by just talking. This system facilitates diverse prompt-driven tasks, encompassing garment or accessory replacement, recoloring, addition, and removal. Specifically, Fashion Matrix employs LLM as its foundational support and engages in iterative interactions with users. It employs a range of Semantic Segmentation Models (e.g., Grounded-SAM, MattingAnything, etc.) to delineate the specific editing masks based on user instructions. Subsequently, Visual Foundation Models (e.g., Stable Diffusion, ControlNet, etc.) are leveraged to generate edited images from text prompts and masks, thereby facilitating the automation of fashion editing processes. Experiments demonstrate the outstanding ability of Fashion Matrix to explore the collaborative potential of functionally diverse pre-trained models in the domain of fashion editing. Presents Fashion Matrix, a novel hierarchical AI system that leverages Large Language Models (LLMs) to enable conversational photo editing in the fashion domain. Addresses the limitations of existing image editing tools that lack fine-grained control and struggle with the nuanced semantic understanding required in fashion-related applications. Integrates LLMs with Semantic Segmentation Models (e.g., Grounded-SAM, MattingAnything) and Visual Foundation Models (e.g., Stable Diffusion, ControlNet) to enable multi-round dialogue-based editing with tasks like garment replacement, recoloring, addition, and removal. Introduces an 'AutoMasker' module that combines human parsing, pose estimation, and semantic segmentation for precise editing mask generation. Outperforms text-based try-on methods (Text2Human, FICE) in terms of image quality (CLIP Score, IS), naturalness, and text-image matching. Demonstrates the potential of combining functionally diverse pre-trained models for complex fashion editing tasks through extensive zero-shot experiments. LLM optimization specifically for the fashion domain is needed for improved performance. More detailed Semantic Segmentation Models for both humans and fashion items would enhance system capabilities. fashion editing, large language models, conversational ai, semantic segmentation, image generation
2307.13226 Report Strivec: Sparse Tri-Vector Radiance Fields Quankai Gao, Qiangeng Xu, Hao Su, Ulrich Neumann, Zexiang Xu We propose Strivec, a novel neural representation that models a 3D scene as a radiance field with sparsely distributed and compactly factorized local tensor feature grids. Our approach leverages tensor decomposition, following the recent work TensoRF, to model the tensor grids. In contrast to TensoRF which uses a global tensor and focuses on their vector-matrix decomposition, we propose to utilize a cloud of local tensors and apply the classic CANDECOMP/PARAFAC (CP) decomposition to factorize each tensor into triple vectors that express local feature distributions along spatial axes and compactly encode a local neural field. We also apply multi-scale tensor grids to discover the geometry and appearance commonalities and exploit spatial coherence with the tri-vector factorization at multiple local scales. The final radiance field properties are regressed by aggregating neural features from multiple local tensors across all scales. Our tri-vector tensors are sparsely distributed around the actual scene surface, discovered by a fast coarse reconstruction, leveraging the sparsity of a 3D scene. We demonstrate that our model can achieve better rendering quality while using significantly fewer parameters than previous methods, including TensoRF and Instant-NGP. This paper introduces Strivec, a novel neural scene representation that leverages sparse, multi-scale, tri-vector tensors to represent local radiance fields for high-quality novel view synthesis. Existing methods, while achieving progress in compactness and quality, struggle to balance representing intricate local details with efficient use of model capacity. Strivec aims to address this by combining the sparsity of local representations with the efficiency of shared feature encoding. Strivec distributes local tensors based on coarse scene geometry. Each tensor uses CP decomposition to factorize its feature grid into tri-vector components. Features are aggregated from neighboring tensors at multiple scales to regress volume density and view-dependent color, enabling efficient and accurate radiance field rendering. Strivec achieves state-of-the-art rendering quality on both synthetic (NeRF Synthetic) and real (ScanNet, Tanks and Temples) datasets, outperforming previous methods like TensoRF and Instant-NGP. Strivec achieves this superior quality with significantly fewer parameters, demonstrating its efficient representation power. The paper conducts ablation studies showcasing the benefits of multi-scale representation, tri-vector factorization, and robustness to initial geometry choice. While achieving high quality and compactness, Strivec's optimization is slower than TensoRF due to the multi-tensor aggregation. Exploring acceleration strategies while maintaining quality could be beneficial. The paper observes that adding more tensor components yields diminishing returns after a certain point. Investigating techniques to better capture high-frequency details with increased capacity is a potential avenue for future work. neural radiance fields, novel view synthesis, tensor decomposition, 3d scene representation, sparse representation
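To make the tri-vector (rank-R CP) idea concrete, here is a toy sketch of a single local factorized grid queried at 3D points: each axis stores R one-dimensional lines, and the feature at (x, y, z) is the elementwise product of the three interpolated lines. Class and parameter names are illustrative assumptions, not the paper's implementation (which also distributes many such tensors sparsely and aggregates across scales).

```python
import torch
import torch.nn.functional as F

class TriVectorGrid(torch.nn.Module):
    """One local rank-R CP-factorized feature grid: three sets of R 1-D lines."""
    def __init__(self, rank=16, line_len=64):
        super().__init__()
        self.rank = rank
        # shape (1, R, L, 1) so each line can be sampled with grid_sample
        self.lines = torch.nn.ParameterList(
            [torch.nn.Parameter(0.1 * torch.randn(1, rank, line_len, 1)) for _ in range(3)]
        )

    def forward(self, pts):                                  # pts: (N, 3) in [-1, 1]
        feats = []
        for axis in range(3):
            coord = pts[:, axis].view(1, -1, 1, 1)           # sample along this axis only
            grid = torch.cat([torch.zeros_like(coord), coord], dim=-1)
            line = F.grid_sample(self.lines[axis], grid, align_corners=True)  # (1, R, N, 1)
            feats.append(line.reshape(self.rank, -1))        # (R, N)
        return (feats[0] * feats[1] * feats[2]).T            # (N, R) CP product per point
```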
2307.12981 Report 3D-LLM: Injecting the 3D World into Large Language Models Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, Chuang Gan Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi- view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs. Project Page: : https://vis-www.cs.umass.edu/3dllm/. This paper introduces 3D-LLMs, a new family of large language models that can understand and interact with 3D scenes represented as point clouds with features. Existing LLMs and VLMs lack grounding in the 3D physical world, limiting their ability to reason about spatial relationships, affordances, and other 3D concepts. 3D-LLMs bridge this gap. The authors generate a 300k 3D-language dataset covering various tasks and train 3D-LLMs using pretrained 2D VLMs (Flamingo, BLIP-2) as backbones. They extract 3D features from multi-view images and incorporate a 3D localization mechanism. 3D-LLMs outperform state-of-the-art baselines on the ScanQA 3D question answering benchmark. Held-in experiments demonstrate 3D-LLMs' effectiveness in 3D captioning, task decomposition, and 3D-assisted dialog. Qualitative examples showcase 3D-LLMs' ability to perform tasks beyond the scope of existing LLMs and VLMs, such as navigation and grounding. The current 3D feature extractor relies on rendering 3D scenes into multi-view images, introducing an additional rendering process. Future work includes exploring end-to-end training with 3D data and expanding 3D-LLMs to more complex 3D reasoning and planning tasks. large language models, 3d vision, vision-language models, 3d scene understanding, 3d reasoning
2307.12972 Report DFA3D: 3D Deformable Attention For 2D-to-3D Feature Lifting Hongyang Li, Hao Zhang, Zhaoyang Zeng, Shilong Liu, Feng Li, Tianhe Ren, Lei Zhang In this paper, we propose a new operator, called 3D DeFormable Attention (DFA3D), for 2D-to-3D feature lifting, which transforms multi-view 2D image features into a unified 3D space for 3D object detection. Existing feature lifting approaches, such as Lift-Splat-based and 2D attention-based, either use estimated depth to get pseudo LiDAR features and then splat them to a 3D space, which is a one-pass operation without feature refinement, or ignore depth and lift features by 2D attention mechanisms, which achieve finer semantics while suffering from a depth ambiguity problem. In contrast, our DFA3D-based method first leverages the estimated depth to expand each view's 2D feature map to 3D and then utilizes DFA3D to aggregate features from the expanded 3D feature maps. With the help of DFA3D, the depth ambiguity problem can be effectively alleviated from the root, and the lifted features can be progressively refined layer by layer, thanks to the Transformer-like architecture. In addition, we propose a mathematically equivalent implementation of DFA3D which can significantly improve its memory efficiency and computational speed. We integrate DFA3D into several methods that use 2D attention-based feature lifting with only a few modifications in code and evaluate on the nuScenes dataset. The experiment results show a consistent improvement of +1.41\% mAP on average, and up to +15.1\% mAP improvement when high-quality depth information is available, demonstrating the superiority, applicability, and huge potential of DFA3D. The code is available at https://github.com/IDEA-Research/3D-deformable-attention.git. This paper introduces 3D Deformable Attention (DFA3D), a novel operator for 2D-to-3D feature lifting in multi-view 3D object detection. It addresses the depth ambiguity issue present in existing 2D attention-based methods. Existing feature lifting methods suffer from limitations: Lift-Splat methods lack feature refinement and struggle with depth errors, while 2D attention-based approaches exhibit depth ambiguity due to ignoring depth information. DFA3D leverages estimated depth to expand 2D feature maps into 3D. It then uses a depth-weighted 2D deformable attention mechanism for efficient feature aggregation, addressing the memory consumption issue of explicit 3D feature expansion. DFA3D effectively alleviates the depth ambiguity problem by sampling features in 3D space. The Transformer-like architecture with DFA3D allows for progressive feature refinement over multiple layers. Experiments on the nuScenes dataset show consistent improvements, with an average increase of +1.41% mAP and up to +15.1% mAP with high-quality depth. The performance of DFA3D relies on the quality of estimated depth. Future work includes exploring the integration of temporal information for improved depth estimation. 3d object detection, multi-view vision, feature lifting, deformable attention, depth ambiguity
2307.12967 Report Learning Dense Correspondences between Photos and Sketches Xuanchen Lu, Xiaolong Wang, Judith E Fan Humans effortlessly grasp the connection between sketches and real-world objects, even when these sketches are far from realistic. Moreover, human sketch understanding goes beyond categorization -- critically, it also entails understanding how individual elements within a sketch correspond to parts of the physical world it represents. What are the computational ingredients needed to support this ability? Towards answering this question, we make two contributions: first, we introduce a new sketch-photo correspondence benchmark, $\textit{PSC6k}$, containing 150K annotations of 6250 sketch-photo pairs across 125 object categories, augmenting the existing Sketchy dataset with fine-grained correspondence metadata. Second, we propose a self-supervised method for learning dense correspondences between sketch-photo pairs, building upon recent advances in correspondence learning for pairs of photos. Our model uses a spatial transformer network to estimate the warp flow between latent representations of a sketch and photo extracted by a contrastive learning-based ConvNet backbone. We found that this approach outperformed several strong baselines and produced predictions that were quantitatively consistent with other warp-based methods. However, our benchmark also revealed systematic differences between predictions of the suite of models we tested and those of humans. Taken together, our work suggests a promising path towards developing artificial systems that achieve more human-like understanding of visual images at different levels of abstraction. Project page: https://photo-sketch-correspondence.github.io This paper introduces PSC6k, a new benchmark for photo-sketch dense correspondence learning, and proposes a self-supervised method for learning dense correspondences between sketch-photo pairs. Understanding the link between sketches and real-world objects is crucial for bridging the gap between human and artificial vision systems. This task requires robust image understanding across domains and levels of abstraction, particularly in aligning semantic correspondences between stylized and photorealistic images. The PSC6k benchmark augments the Sketchy dataset with 150K keypoint annotations on 6250 sketch-photo pairs. The proposed self-supervised method utilizes a contrastive learning-based ConvNet backbone to extract latent representations and a spatial transformer network to estimate the warp flow between a sketch and a photo, aiming to maximize their feature map similarity. The proposed method outperforms existing self-supervised and weakly supervised methods on PSC6k, setting a new state-of-the-art. Analysis reveals systematic differences between model predictions and human annotations, highlighting areas for future improvement. The photo-sketch contrastive learning procedure reduces the texture bias in learned representations, leading to a stronger shape bias more aligned with human perception. The model exhibits limitations in handling non-continuous transformations and aligning fine structures. Future work could explore stroke-based keypoints for improved coverage of semantically meaningful sketch regions. sketch understanding, dense correspondence learning, self-supervised learning, contrastive learning, spatial transformer network
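As a rough illustration of the warp-and-compare objective described above, the sketch below warps photo features by a predicted flow field and scores alignment against sketch features with cosine similarity. Function names, the flow parameterization, and the loss form are assumptions for exposition, not the released code.

```python
import torch
import torch.nn.functional as F

def warp_features(feat, flow):
    # feat: (B, C, H, W); flow: (B, 2, H, W) offsets in normalized [-1, 1] units
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).to(feat).expand(B, H, W, 2)   # identity grid
    grid = base + flow.permute(0, 2, 3, 1)                             # displaced grid
    return F.grid_sample(feat, grid, align_corners=True)

def alignment_loss(photo_feat, sketch_feat, flow):
    warped = warp_features(photo_feat, flow)
    sim = F.cosine_similarity(warped, sketch_feat, dim=1)   # (B, H, W) per-location similarity
    return 1.0 - sim.mean()                                  # maximize feature-map similarity
```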
2307.12909 Report Dyn-E: Local Appearance Editing of Dynamic Neural Radiance Fields Shangzhan Zhang, Sida Peng, Yinji ShenTu, Qing Shuai, Tianrun Chen, Kaicheng Yu, Hujun Bao, Xiaowei Zhou Recently, the editing of neural radiance fields (NeRFs) has gained considerable attention, but most prior works focus on static scenes while research on the appearance editing of dynamic scenes is relatively lacking. In this paper, we propose a novel framework to edit the local appearance of dynamic NeRFs by manipulating pixels in a single frame of training video. Specifically, to locally edit the appearance of dynamic NeRFs while preserving unedited regions, we introduce a local surface representation of the edited region, which can be inserted into and rendered along with the original NeRF and warped to arbitrary other frames through a learned invertible motion representation network. By employing our method, users without professional expertise can easily add desired content to the appearance of a dynamic scene. We extensively evaluate our approach on various scenes and show that our approach achieves spatially and temporally consistent editing results. Notably, our approach is versatile and applicable to different variants of dynamic NeRF representations. This paper introduces Dyn-E, a novel framework for local appearance editing of dynamic Neural Radiance Fields (NeRFs) by manipulating pixels in a single training video frame. Current NeRF editing methods mainly focus on static scenes, leaving dynamic scene editing, crucial for volumetric video editing, underexplored. Dyn-E lifts the edited region to 3D space, forming a textured mesh. It utilizes an invertible network to represent the local surface motion, propagating edits across video frames while preserving unedited areas. Dyn-E achieves spatially and temporally consistent editing results, outperforming baselines relying on scene flow or optical flow warping. The local surface representation effectively handles occlusions between edited content and the original dynamic NeRF. Dyn-E demonstrates versatility by being applicable to various dynamic NeRF representations like HyperNeRF, DynamicNeRF, and Neural Body. The current method assumes the edited region is mostly occlusion-free, which might not hold in complex scenarios. Future work could explore incorporating semantic information or user interaction for more controllable editing. dynamic neural radiance fields, appearance editing, 3d scene editing, volumetric video editing, invertible networks
2307.12868 Report Understanding the Latent Space of Diffusion Models through the Lens of Riemannian Geometry Yong-Hyun Park, Mingi Kwon, Jaewoong Choi, Junghyo Jo, Youngjung Uh Despite the success of diffusion models (DMs), we still lack a thorough understanding of their latent space. To understand the latent space $\mathbf{x}_t \in \mathcal{X}$, we analyze them from a geometrical perspective. Our approach involves deriving the local latent basis within $\mathcal{X}$ by leveraging the pullback metric associated with their encoding feature maps. Remarkably, our discovered local latent basis enables image editing capabilities by moving $\mathbf{x}_t$, the latent space of DMs, along the basis vector at specific timesteps. We further analyze how the geometric structure of DMs evolves over diffusion timesteps and differs across different text conditions. This confirms the known phenomenon of coarse-to-fine generation, as well as reveals novel insights such as the discrepancy between $\mathbf{x}_t$ across timesteps, the effect of dataset complexity, and the time-varying influence of text prompts. To the best of our knowledge, this paper is the first to present image editing through $\mathbf{x}$-space traversal, editing only once at specific timestep $t$ without any additional training, and providing thorough analyses of the latent structure of DMs. The code to reproduce our experiments can be found at https://github.com/enkeejunior1/Diffusion-Pullback. This paper introduces a novel approach for analyzing and manipulating the latent space of diffusion models (DMs) using a geometrical perspective, leveraging the pullback metric to discover local latent bases. Understanding the latent space of DMs is crucial for leveraging their full potential, especially in image editing and manipulation, which existing methods struggle to fully utilize. The authors employ the pullback metric to define distances in the latent space based on the local Euclidean metric of the corresponding feature space. They use SVD on the Jacobian of the mapping between these spaces to discover local latent bases. Traversing along the discovered latent basis enables semantic image editing at various diffusion timesteps. The latent space structure evolves from low-frequency to high-frequency components as the generative process progresses, reflecting the coarse-to-fine generation. Textual prompts in text-to-image DMs influence the latent space structure, with similar prompts yielding similar structures, but this influence diminishes in later generative stages. The discovered latent directions can sometimes exhibit entanglement between attributes, likely due to dataset biases. While effective in many cases, the method occasionally leads to abrupt changes during editing, highlighting the need for further exploration of the complex geometry of the DM latent space. diffusion models, latent space, image editing, pullback metric, riemannian geometry
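The pullback-metric idea reduces, locally, to taking the SVD of the Jacobian of the encoding map and reading off the right singular vectors as latent directions. The toy sketch below does this for a small stand-in network; in the paper's setting the input would be the noisy latent x_t and the map would be the diffusion U-Net's intermediate features, typically handled with low-rank approximations rather than a full Jacobian.

```python
import torch

# stand-in for the encoding feature map f: x -> h
f = torch.nn.Sequential(torch.nn.Linear(64, 256), torch.nn.Tanh(), torch.nn.Linear(256, 128))
x = torch.randn(64)

J = torch.autograd.functional.jacobian(lambda v: f(v), x)   # (out_dim, in_dim) = (128, 64)
U, S, Vh = torch.linalg.svd(J, full_matrices=False)
basis = Vh                                                   # rows: local latent directions in x-space

step = 3.0
edited = x + step * basis[0]   # move x along the dominant local direction (largest feature change)
```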
2307.12751 Report ICF-SRSR: Invertible scale-Conditional Function for Self-Supervised Real-world Single Image Super-Resolution Reyhaneh Neshatavar, Mohsen Yavartanoo, Sanghyun Son, Kyoung Mu Lee Single image super-resolution (SISR) is a challenging ill-posed problem that aims to up-sample a given low-resolution (LR) image to a high-resolution (HR) counterpart. Due to the difficulty in obtaining real LR-HR training pairs, recent approaches are trained on simulated LR images degraded by simplified down-sampling operators, e.g., bicubic. Such an approach can be problematic in practice because of the large gap between the synthesized and real-world LR images. To alleviate the issue, we propose a novel Invertible scale-Conditional Function (ICF), which can scale an input image and then restore the original input with different scale conditions. By leveraging the proposed ICF, we construct a novel self-supervised SISR framework (ICF-SRSR) to handle the real-world SR task without using any paired/unpaired training data. Furthermore, our ICF-SRSR can generate realistic and feasible LR-HR pairs, which can make existing supervised SISR networks more robust. Extensive experiments demonstrate the effectiveness of the proposed method in handling SISR in a fully self-supervised manner. Our ICF-SRSR demonstrates superior performance compared to the existing methods trained on synthetic paired images in real-world scenarios and exhibits comparable performance compared to state-of-the-art supervised/unsupervised methods on public benchmark datasets. This paper proposes ICF-SRSR, a novel self-supervised framework for single image super-resolution (SISR) using an invertible scale-conditional function (ICF). ICF-SRSR addresses the issue of poor generalization in real-world SISR tasks, which stems from models being trained on synthetic datasets with simplified down-sampling operators. ICF-SRSR leverages a learnable ICF that can up-sample and down-sample an input image based on different scale conditions. The framework is trained in a self-supervised manner by minimizing the distance between the original input and the generated images after consecutive up-down and down-up stages. ICF-SRSR outperforms existing self-supervised and some supervised methods on synthetic datasets. It surpasses methods trained on synthetic datasets when evaluated on real-world datasets. ICF-SRSR can generate realistic low-resolution and high-resolution image pairs, beneficial for training other SISR models. The paper's evaluation on real-world datasets is limited due to the scarcity of aligned low-resolution and high-resolution image pairs. Future work will focus on creating a large-scale real-world dataset and exploring applications of ICF in other image restoration tasks. super-resolution, self-supervised learning, real-world image super-resolution, invertible scale-conditional function, image restoration
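A schematic of the two self-supervised cycles implied by the invertible scale-conditional function: up-then-down and down-then-up should both reproduce the input. The module interface `icf(img, scale)` and the L1 losses are assumptions used to sketch the training signal, not the paper's exact formulation.

```python
import torch.nn.functional as F

def self_supervised_losses(icf, lr_img, scale=2.0):
    up = icf(lr_img, scale)             # super-resolve the input
    down_of_up = icf(up, 1.0 / scale)   # shrink the result back
    loss_ud = F.l1_loss(down_of_up, lr_img)

    down = icf(lr_img, 1.0 / scale)     # synthesize a plausible lower-resolution image
    up_of_down = icf(down, scale)       # and restore the original resolution
    loss_du = F.l1_loss(up_of_down, lr_img)
    return loss_ud + loss_du
```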
2307.12732 Report CLIP-KD: An Empirical Study of CLIP Model Distillation Chuanguang Yang, Zhulin An, Libo Huang, Junyu Bi, Xinqiang Yu, Han Yang, Boyu Diao, Yongjun Xu Contrastive Language-Image Pre-training (CLIP) has become a promising language-supervised visual pre-training framework. This paper aims to distill small CLIP models supervised by a large teacher CLIP model. We propose several distillation strategies, including relation, feature, gradient and contrastive paradigms, to examine the effectiveness of CLIP-Knowledge Distillation (KD). We show that a simple feature mimicry with Mean Squared Error loss works surprisingly well. Moreover, interactive contrastive learning across teacher and student encoders is also effective in performance improvement. We explain that the success of CLIP-KD can be attributed to maximizing the feature similarity between teacher and student. The unified method is applied to distill several student models trained on CC3M+12M. CLIP-KD improves student CLIP models consistently over zero-shot ImageNet classification and cross-modal retrieval benchmarks. When using ViT-L/14 pretrained on Laion-400M as the teacher, CLIP-KD achieves 57.5\% and 55.4\% zero-shot top-1 ImageNet accuracy over ViT-B/16 and ResNet-50, surpassing the original CLIP without KD by 20.5\% and 20.1\% margins, respectively. Our code is released on https://github.com/winycg/CLIP-KD. This paper investigates various knowledge distillation (KD) strategies for compressing CLIP models, improving the performance of smaller CLIP models under the supervision of a larger, pretrained teacher model. Smaller CLIP models are desirable for resource-constrained applications, but they often suffer from performance degradation compared to larger models. This work aims to bridge this gap using KD. The paper proposes and evaluates several KD strategies, including: (1) Contrastive Relational Distillation, (2) Feature Distillation, (3) Masked Feature Distillation, (4) Gradient Distillation, (5) Interactive Contrastive Learning, and (6) Augmented Feature Distillation. These methods are analyzed individually and in combination. Feature Distillation with Mean Squared Error loss performs surprisingly well, significantly improving student performance. Interactive Contrastive Learning, which promotes interaction between teacher and student encoders, also leads to significant gains. The effectiveness of different KD methods is correlated with their ability to maximize feature similarity between teacher and student models. Distilling knowledge from significantly larger teachers to smaller students might not be optimal due to potential capacity gaps. Exploring more advanced distillation strategies, such as incorporating intermediate layer distillation with architecture-aware mechanisms, could further improve performance. knowledge distillation, clip, contrastive learning, multimodal learning, model compression
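Since the headline finding is that plain feature mimicry with an MSE loss works well, here is a minimal sketch of that piece: project the student's image/text embeddings to the teacher's width and regress them onto the frozen teacher embeddings. Module names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

class FeatureMimicry(torch.nn.Module):
    """MSE feature distillation from a frozen teacher CLIP to a smaller student."""
    def __init__(self, student_dim=512, teacher_dim=768):
        super().__init__()
        self.proj_img = torch.nn.Linear(student_dim, teacher_dim)
        self.proj_txt = torch.nn.Linear(student_dim, teacher_dim)

    def forward(self, s_img, s_txt, t_img, t_txt):
        loss_img = F.mse_loss(self.proj_img(s_img), t_img.detach())  # teacher is not updated
        loss_txt = F.mse_loss(self.proj_txt(s_txt), t_txt.detach())
        return loss_img + loss_txt
```

This term would typically be added to the student's usual contrastive (CLIP) loss with a weighting factor chosen on a validation set.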
2307.12730 Report COCO-O: A Benchmark for Object Detectors under Natural Distribution Shifts Xiaofeng Mao, Yuefeng Chen, Yao Zhu, Da Chen, Hang Su, Rong Zhang, Hui Xue Practical object detection applications can lose their effectiveness on image inputs with natural distribution shifts. This problem leads the research community to pay more attention to the robustness of detectors under Out-Of-Distribution (OOD) inputs. Existing works construct datasets to benchmark the detector's OOD robustness for a specific application scenario, e.g., Autonomous Driving. However, these datasets lack universality and make it hard to benchmark general detectors built on common tasks such as COCO. To give a more comprehensive robustness assessment, we introduce COCO-O(ut-of-distribution), a test dataset based on COCO with 6 types of natural distribution shifts. COCO-O has a large distribution gap with training data and results in a significant 55.7% relative performance drop on a Faster R-CNN detector. We leverage COCO-O to conduct experiments on more than 100 modern object detectors to investigate if their improvements are credible or just over-fitting to the COCO test set. Unfortunately, most classic detectors in early years do not exhibit strong OOD generalization. We further study the robustness effect of recent breakthroughs in detector architecture design, augmentation and pre-training techniques. Some empirical findings are revealed: 1) Compared with detection head or neck, backbone is the most important part for robustness; 2) An end-to-end detection transformer design brings no enhancement, and may even reduce robustness; 3) Large-scale foundation models have made a great leap on robust object detection. We hope our COCO-O could provide a rich testbed for robustness study of object detection. The dataset will be available at https://github.com/alibaba/easyrobust/tree/main/benchmarks/coco_o. This paper introduces COCO-O, a new benchmark dataset designed to evaluate the robustness of object detectors when faced with natural distribution shifts. Existing robustness benchmarks for object detection either rely on synthetic data or focus on specific scenarios. COCO-O addresses this gap by providing a diverse set of real-world images with natural distribution shifts, enabling a more comprehensive robustness assessment of modern object detectors. The authors construct COCO-O by collecting images from six distinct domains: sketch, weather, cartoon, painting, tattoo, and handmake. These domains represent varying degrees of object abstraction and introduce realistic challenges for object detection models. They evaluate a wide range of detectors, including classic architectures and state-of-the-art models, on COCO-O and analyze the impact of factors such as architecture design, data augmentation, and pre-training on robustness. Contrary to expectations, most classic detectors and recent architectural advancements in object detection show limited progress in robustness to natural distribution shifts. The backbone network plays a more crucial role in OOD robustness than other detector components like the neck or head. Large-scale foundation models, particularly those pre-trained on massive image-language datasets, exhibit significantly improved robustness on COCO-O, highlighting the potential of data scale and external knowledge for robust object detection. The reasons behind the poor performance of DETR-based models on COCO-O require further investigation. Future work will focus on developing novel techniques to enhance the OOD robustness of object detection algorithms, leveraging the challenges and insights provided by COCO-O. object detection, robustness, benchmark dataset, distribution shift, out-of-distribution generalization
2307.12616 Report CTVIS: Consistent Training for Online Video Instance Segmentation Kaining Ying, Qing Zhong, Weian Mao, Zhenhua Wang, Hao Chen, Lin Yuanbo Wu, Yifan Liu, Chengxiang Fan, Yunzhi Zhuge, Chunhua Shen The discrimination of instance embeddings plays a vital role in associating instances across time for online video instance segmentation (VIS). Instance embedding learning is directly supervised by the contrastive loss computed upon the contrastive items (CIs), which are sets of anchor/positive/negative embeddings. Recent online VIS methods leverage CIs sourced from one reference frame only, which we argue is insufficient for learning highly discriminative embeddings. Intuitively, a possible strategy to enhance CIs is replicating the inference phase during training. To this end, we propose a simple yet effective training strategy, called Consistent Training for Online VIS (CTVIS), which is devoted to aligning the training and inference pipelines in terms of building CIs. Specifically, CTVIS constructs CIs as in inference, using momentum-averaged embeddings and a memory-bank storage mechanism, and adding noise to the relevant embeddings. Such an extension allows a reliable comparison between embeddings of current instances and the stable representations of historical instances, thereby conferring an advantage in modeling VIS challenges such as occlusion, re-identification, and deformation. Empirically, CTVIS outstrips the SOTA VIS models by up to +5.0 points on three VIS benchmarks, including YTVIS19 (55.1% AP), YTVIS21 (50.1% AP) and OVIS (35.5% AP). Furthermore, we find that pseudo-videos transformed from images can train robust models surpassing fully-supervised ones. This paper presents CTVIS, a novel training strategy for online video instance segmentation (VIS) that aligns training and inference pipelines to learn highly discriminative instance embeddings, thereby enhancing instance association across video frames. Accurate instance association in videos, especially under challenges like occlusion and re-identification, is crucial for VIS and its downstream applications. CTVIS leverages a memory bank to store momentum-averaged embeddings and constructs contrastive items by comparing against these stable representations. It further introduces noise during memory bank updates to simulate real-world tracking challenges. CTVIS significantly outperforms state-of-the-art VIS methods on YTVIS19, YTVIS21, and OVIS benchmarks. The method effectively leverages long video sequences during training to improve embedding discrimination. Training CTVIS solely on pseudo-videos generated from augmented still images achieves competitive performance, surpassing fully-supervised counterparts. The reliance on pseudo-videos for training might introduce biases if the augmentation strategies do not fully encapsulate real-world video characteristics. Future work could explore the integration of CTVIS with other query-based instance segmentation models and evaluate its generalization to other video-related tasks like video panoptic segmentation. video instance segmentation, instance embedding learning, contrastive learning, memory bank, data augmentation
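A simplified sketch of the two mechanisms described above, a momentum-averaged memory bank with noisy updates and a contrastive loss computed against it. Class names, the noise model, and hyper-parameters are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

class InstanceMemory:
    """Momentum-averaged per-instance embeddings, used as stable contrastive references."""
    def __init__(self, momentum=0.9):
        self.bank = {}              # instance id -> running embedding (D,)
        self.momentum = momentum

    def update(self, inst_id, emb, noise_std=0.05):
        emb = emb + noise_std * torch.randn_like(emb)        # simulate tracking noise
        if inst_id not in self.bank:
            self.bank[inst_id] = emb.detach()
        else:
            m = self.momentum
            self.bank[inst_id] = m * self.bank[inst_id] + (1 - m) * emb.detach()

    def contrastive_loss(self, inst_id, emb, temperature=0.1):
        ids = list(self.bank.keys())
        refs = F.normalize(torch.stack([self.bank[i] for i in ids]), dim=-1)  # (K, D)
        sims = refs @ F.normalize(emb, dim=-1) / temperature                  # (K,)
        target = torch.tensor(ids.index(inst_id))
        return F.cross_entropy(sims.unsqueeze(0), target.unsqueeze(0))
```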
2307.12612 Report Less is More: Focus Attention for Efficient DETR Dehua Zheng, Wenhui Dong, Hailin Hu, Xinghao Chen, Yunhe Wang DETR-like models have significantly boosted the performance of detectors and even outperformed classical convolutional models. However, all tokens are treated equally without discrimination brings a redundant computational burden in the traditional encoder structure. The recent sparsification strategies exploit a subset of informative tokens to reduce attention complexity maintaining performance through the sparse encoder. But these methods tend to rely on unreliable model statistics. Moreover, simply reducing the token population hinders the detection performance to a large extent, limiting the application of these sparse models. We propose Focus-DETR, which focuses attention on more informative tokens for a better trade-off between computation efficiency and model accuracy. Specifically, we reconstruct the encoder with dual attention, which includes a token scoring mechanism that considers both localization and category semantic information of the objects from multi-scale feature maps. We efficiently abandon the background queries and enhance the semantic interaction of the fine-grained object queries based on the scores. Compared with the state-of-the-art sparse DETR-like detectors under the same setting, our Focus-DETR gets comparable complexity while achieving 50.4AP (+2.2) on COCO. The code is available at https://github.com/huawei-noah/noah-research/tree/master/Focus-DETR and https://gitee.com/mindspore/models/tree/master/research/cv/Focus-DETR. Focus-DETR, a novel DETR-like model that focuses attention on informative tokens using a scoring mechanism incorporating localization and category semantic information, achieving a better computation-accuracy trade-off. DETR-like models, while effective, suffer from redundant computation in the encoder due to treating all tokens equally. A scoring mechanism with top-down score modulations across multi-scale features identifies foreground and fine-grained object tokens. These tokens are processed through an encoder with dual attention, enhancing semantic information and reducing computation. Achieves 50.4 AP (+2.2 AP over Sparse DETR) on COCO with comparable complexity. Outperforms state-of-the-art sparse DETR-like models with ResNet-50, ResNet-101, and Swin Transformer backbones. Demonstrates the effectiveness of focusing on informative tokens and enhancing their semantic representation. Exploring more hierarchical semantic grading strategies beyond position and category information. Developing a unified scoring mechanism and feature enhancement algorithm for the entire Transformer. object detection, detr, transformer, attention mechanism, efficient computation
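The core efficiency idea, scoring tokens and attending only over the informative ones, can be sketched as follows. The score head, keep ratio, and the decision to pass unselected tokens through untouched are simplifying assumptions; the paper's dual-attention encoder and multi-scale score modulation are more involved.

```python
import torch

def select_informative_tokens(tokens, score_head, keep_ratio=0.3):
    """Keep the top-scoring fraction of encoder tokens for subsequent attention."""
    # tokens: (B, N, C); score_head: any module mapping C -> 1 foreground/semantic score
    scores = score_head(tokens).squeeze(-1)                  # (B, N)
    k = max(1, int(keep_ratio * tokens.shape[1]))
    topk = scores.topk(k, dim=1).indices                     # (B, k) indices of kept tokens
    idx = topk.unsqueeze(-1).expand(-1, -1, tokens.shape[2])
    focused = tokens.gather(1, idx)                          # (B, k, C) fine-grained tokens
    return focused, topk, scores
```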
2307.12574 Report A Good Student is Cooperative and Reliable: CNN-Transformer Collaborative Learning for Semantic Segmentation Jinjing Zhu, Yunhao Luo, Xu Zheng, Hao Wang, Lin Wang In this paper, we strive to answer the question "how to collaboratively learn convolutional neural network (CNN)-based and vision transformer (ViT)-based models by selecting and exchanging the reliable knowledge between them for semantic segmentation?" Accordingly, we propose an online knowledge distillation (KD) framework that can simultaneously learn compact yet effective CNN-based and ViT-based models with two key technical breakthroughs to take full advantage of CNNs and ViT while compensating their limitations. Firstly, we propose heterogeneous feature distillation (HFD) to improve students' consistency in low-layer feature space by mimicking heterogeneous features between CNNs and ViT. Secondly, to facilitate the two students to learn reliable knowledge from each other, we propose bidirectional selective distillation (BSD) that can dynamically transfer selective knowledge. This is achieved by 1) region-wise BSD determining the directions of knowledge transferred between the corresponding regions in the feature space and 2) pixel-wise BSD discerning which of the prediction knowledge to be transferred in the logit space. Extensive experiments on three benchmark datasets demonstrate that our proposed framework outperforms the state-of-the-art online distillation methods by a large margin, and shows its efficacy in learning collaboratively between ViT-based and CNN-based models. This supplementary material provides detailed insights into the CNN-Transformer collaborative learning framework for semantic segmentation, focusing on heterogeneous feature distillation (HFD) and region-wise bidirectional selective distillation (BSD). The proposed method addresses the challenge of effectively transferring knowledge between CNN and Transformer models for semantic segmentation, leveraging their complementary strengths. The method uses HFD to align heterogeneous features from early layers and BSD to selectively transfer knowledge in a region-wise manner based on prediction reliability. The method enables CNNs and Transformers to learn collaboratively and improve each other's performance. BSD facilitates the selection and exchange of reliable knowledge between the models, leading to enhanced segmentation accuracy. Experimental results demonstrate the effectiveness of the proposed approach compared to vanilla training and other distillation methods. The current study focuses on specific CNN and Transformer architectures; exploring other architectures could further enhance the method's applicability. Investigating the impact of different knowledge distillation strategies within the framework could lead to further performance improvements. semantic segmentation, collaborative learning, knowledge distillation, cnn-transformer, heterogeneous feature distillation
2307.12560 Report Interpolating between Images with Diffusion Models Clinton J. Wang, Polina Golland One little-explored frontier of image generation and editing is the task of interpolating between two input images, a feature missing from all currently deployed image generation pipelines. We argue that such a feature can expand the creative applications of such models, and propose a method for zero-shot interpolation using latent diffusion models. We apply interpolation in the latent space at a sequence of decreasing noise levels, then perform denoising conditioned on interpolated text embeddings derived from textual inversion and (optionally) subject poses. For greater consistency, or to specify additional criteria, we can generate several candidates and use CLIP to select the highest quality image. We obtain convincing interpolations across diverse subject poses, image styles, and image content, and show that standard quantitative metrics such as FID are insufficient to measure the quality of an interpolation. Code and data are available at https://clintonjwang.github.io/interpolation. This paper presents a novel method for generating high-quality interpolations between two real images using pre-trained latent diffusion models, a task not addressed by existing image generation techniques. Real image interpolation can broaden the creative applications of image generation models in fields like art, media, and design. The method involves interpolating noisy latent representations of the input images at progressively decreasing noise levels, guided by interpolated text embeddings and optionally, subject poses. CLIP is used to select high-quality outputs from multiple generated candidates. The proposed method generates convincing interpolations across diverse image styles, content, and subject poses. Adding noise to parent latent vectors before interpolation leads to more semantically meaningful transformations compared to alternative schemes. Standard image generation metrics like FID and PPL are insufficient to evaluate the quality of interpolations, as they favor simple alpha composites over creative transformations. The method may struggle to interpolate images with significant differences in style, layout, or semantic mapping of objects. Future work can explore non-uniform interpolation schedules and address limitations in handling large stylistic and semantic gaps between images. latent diffusion models, image interpolation, image editing, textual inversion, clip
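To illustrate the interpolation primitives mentioned above in isolation from any particular diffusion pipeline: spherical interpolation of two noised latents and linear interpolation of their text embeddings. The exact schedule of noise levels, the use of textual inversion, and pose conditioning are omitted; treat this as a hedged sketch, not the authors' pipeline.

```python
import torch

def slerp(z0, z1, t, eps=1e-7):
    """Spherical interpolation between two latent tensors of the same shape."""
    a, b = z0.flatten(), z1.flatten()
    cos = ((a @ b) / (a.norm() * b.norm())).clamp(-1 + eps, 1 - eps)
    omega = torch.arccos(cos)
    return (torch.sin((1 - t) * omega) * z0 + torch.sin(t * omega) * z1) / torch.sin(omega)

def interpolate_conditions(lat0, lat1, emb0, emb1, t):
    latent = slerp(lat0, lat1, t)           # noisy latents: spherical path
    text_emb = (1 - t) * emb0 + t * emb1    # text embeddings: linear path
    return latent, text_emb
```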
2307.12493 Report TF-ICON: Diffusion-Based Training-Free Cross-Domain Image Composition Shilin Lu, Yanzhu Liu, Adams Wai-Kin Kong Text-driven diffusion models have exhibited impressive generative capabilities, enabling various image editing tasks. In this paper, we propose TF-ICON, a novel Training-Free Image COmpositioN framework that harnesses the power of text-driven diffusion models for cross-domain image-guided composition. This task aims to seamlessly integrate user-provided objects into a specific visual context. Current diffusion-based methods often involve costly instance-based optimization or finetuning of pretrained models on customized datasets, which can potentially undermine their rich prior. In contrast, TF-ICON can leverage off-the-shelf diffusion models to perform cross-domain image-guided composition without requiring additional training, finetuning, or optimization. Moreover, we introduce the exceptional prompt, which contains no information, to facilitate text-driven diffusion models in accurately inverting real images into latent representations, forming the basis for compositing. Our experiments show that equipping Stable Diffusion with the exceptional prompt outperforms state-of-the-art inversion methods on various datasets (CelebA-HQ, COCO, and ImageNet), and that TF-ICON surpasses prior baselines in versatile visual domains. Code is available at https://github.com/Shilin-LU/TF-ICON This paper introduces TF-ICON, a training-free image composition framework that leverages pre-trained text-to-image diffusion models for cross-domain image composition. Existing diffusion-based image composition methods require expensive training or finetuning, potentially harming model priors. This work offers a training-free alternative for diverse visual domains. The method uses an 'exceptional prompt' to accurately invert real images into latent codes. It then performs composition by injecting composite self-attention maps during the denoising process, ensuring seamless object integration across domains. High-order diffusion ODE solvers are shown to outperform DDIM for real image inversion. Introducing an exceptional prompt allows for accurate image inversion in text-driven diffusion models, exceeding SOTA methods on CelebA-HQ, COCO, and ImageNet datasets. TF-ICON surpasses prior baselines in qualitative and quantitative evaluations, demonstrating superior performance in cross-domain image composition. TF-ICON's reliance on self-attention maps limits its ability to generate object views significantly different from the reference image. The approach inherits the limitations and biases of the underlying Stable Diffusion model, potentially leading to artifacts in certain situations. image composition, diffusion models, training-free, cross-domain, image inversion
2307.12392 Report Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision Menghao Li, Chunlei Wang, Wenquan Feng, Shuchang Lyu, Guangliang Cheng, Xiangtai Li, Binghao Liu, Qi Zhao Visual Grounding (VG) aims at localizing target objects from an image based on given expressions and has made significant progress with the development of detection and vision transformer. However, existing VG methods tend to generate false-alarm objects when presented with inaccurate or irrelevant descriptions, which commonly occur in practical applications. Moreover, existing methods fail to capture fine-grained features, accurate localization, and sufficient context comprehension from the whole image and textual descriptions. To address both issues, we propose an Iterative Robust Visual Grounding (IR-VG) framework with Masked Reference based Centerpoint Supervision (MRCS). The framework introduces iterative multi-level vision-language fusion (IMVF) for better alignment. We use MRCS to achieve more accurate localization with point-wise feature supervision. Then, to improve the robustness of VG, we also present a multi-stage false-alarm sensitive decoder (MFSD) to prevent the generation of false-alarm objects when presented with inaccurate expressions. The proposed framework is evaluated on five regular VG datasets and two newly constructed robust VG datasets. Extensive experiments demonstrate that IR-VG achieves new state-of-the-art (SOTA) results, with improvements of 25% and 10% compared to existing SOTA approaches on the two newly proposed robust VG datasets. Moreover, the proposed framework is also verified effective on five regular VG datasets. Codes and models will be publicly available at https://github.com/cv516Buaa/IR-VG. This paper introduces IR-VG, a novel iterative robust visual grounding framework that tackles the issue of false alarms in visual grounding tasks, where models incorrectly detect objects when presented with inaccurate or irrelevant textual descriptions. Current visual grounding methods often fail to accurately detect target objects when provided with irrelevant or inaccurate textual descriptions, a common occurrence in real-world applications. IR-VG leverages three key modules: Masked Reference based Centerpoint Supervision (MRCS) for enhanced fine-grained feature representation and localization accuracy, Iterative Multi-Level Vision-Language Fusion (IMVF) for better multi-modal understanding, and Multi-Stage False-Alarm Sensitive Decoder (MFSD) to identify and prevent false alarm predictions. IR-VG achieves state-of-the-art performance on five regular visual grounding datasets, demonstrating its effectiveness in general visual grounding tasks. The framework significantly outperforms existing methods on two newly proposed robust visual grounding datasets (RefCOCOg_F and ReferItGame_F), demonstrating its robustness to irrelevant or inaccurate descriptions. Ablation studies and qualitative analysis validate the contribution of each module (MRCS, IMVF, MFSD) to the overall performance improvement. The paper acknowledges the need for more sophisticated frameworks to handle false alarms in future work. Future research will explore the issue of irrelevant expressions in foundation models like Grounding DINO. visual grounding, robustness, false alarm detection, multi-modal learning, vision-language understanding
2307.12348 Report ResShift: Efficient Diffusion Model for Image Super-resolution by Residual Shifting Zongsheng Yue, Jianyi Wang, Chen Change Loy Diffusion-based image super-resolution (SR) methods are mainly limited by the low inference speed due to the requirements of hundreds or even thousands of sampling steps. Existing acceleration sampling techniques inevitably sacrifice performance to some extent, leading to over-blurry SR results. To address this issue, we propose a novel and efficient diffusion model for SR that significantly reduces the number of diffusion steps, thereby eliminating the need for post-acceleration during inference and its associated performance deterioration. Our method constructs a Markov chain that transfers between the high-resolution image and the low-resolution image by shifting the residual between them, substantially improving the transition efficiency. Additionally, an elaborate noise schedule is developed to flexibly control the shifting speed and the noise strength during the diffusion process. Extensive experiments demonstrate that the proposed method obtains superior or at least comparable performance to current state-of-the-art methods on both synthetic and real-world datasets, even only with 15 sampling steps. Our code and model are available at https://github.com/zsyOAOA/ResShift. This paper proposes ResShift, an efficient diffusion model for image super-resolution that significantly reduces the number of diffusion steps required, achieving superior performance with just 15 steps. Existing diffusion-based SR methods suffer from slow inference speed due to hundreds or thousands of sampling steps. Acceleration techniques compromise performance, leading to over-blurry results. ResShift constructs a Markov chain that shifts the residual between the high-resolution and low-resolution images, enabling efficient transition. A flexible noise schedule controls shifting speed and noise strength. ResShift achieves superior or comparable performance to state-of-the-art methods on synthetic and real-world datasets with only 15 sampling steps. It offers a better fidelity-realism trade-off compared to existing diffusion-based SR methods. The proposed noise schedule provides flexibility in controlling the shifting speed and noise level, allowing for a trade-off between fidelity and realism. ResShift's inference speed, while faster than existing diffusion-based methods, is still slower than GAN-based approaches due to its iterative nature. The model, like other SR methods, may struggle with severely degraded real-world images not well-represented by synthetic degradation models used in training. image super-resolution, diffusion model, efficient inference, noise schedule, markov chain
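The abstract's "shifting the residual" can be pictured as a forward step that moves the HR image toward the LR image by a growing fraction of their residual while injecting noise whose scale grows with that fraction. The schedule values and the noise scaling `kappa` below are illustrative guesses at the general form, not the paper's exact transition or hyper-parameters.

```python
import torch

def residual_shift_forward(x0_hr, y_lr, eta_t, kappa=2.0):
    """Schematic forward step: shift x0 toward the LR image along their residual, plus noise."""
    mean = x0_hr + eta_t * (y_lr - x0_hr)          # shift by a fraction eta_t of the residual
    std = kappa * (eta_t ** 0.5)                   # noise strength tied to the shift amount
    return mean + std * torch.randn_like(x0_hr)

# eta_t increases from ~0 (essentially the HR image) to 1 (a noisy version of the LR image)
etas = torch.linspace(0.001, 1.0, 15)              # e.g., 15 steps, matching the short chain above
```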
2307.12280 Report Downstream-agnostic Adversarial Examples Ziqi Zhou, Shengshan Hu, Ruizhi Zhao, Qian Wang, Leo Yu Zhang, Junhui Hou, Hai Jin Self-supervised learning usually uses a large amount of unlabeled data to pre-train an encoder which can be used as a general-purpose feature extractor, such that downstream users only need to perform fine-tuning operations to enjoy the benefit of "large model". Despite this promising prospect, the security of pre-trained encoder has not been thoroughly investigated yet, especially when the pre-trained encoder is publicly available for commercial use. In this paper, we propose AdvEncoder, the first framework for generating downstream-agnostic universal adversarial examples based on the pre-trained encoder. AdvEncoder aims to construct a universal adversarial perturbation or patch for a set of natural images that can fool all the downstream tasks inheriting the victim pre-trained encoder. Unlike traditional adversarial example works, the pre-trained encoder only outputs feature vectors rather than classification labels. Therefore, we first exploit the high frequency component information of the image to guide the generation of adversarial examples. Then we design a generative attack framework to construct adversarial perturbations/patches by learning the distribution of the attack surrogate dataset to improve their attack success rates and transferability. Our results show that an attacker can successfully attack downstream tasks without knowing either the pre-training dataset or the downstream dataset. We also tailor four defenses for pre-trained encoders, the results of which further prove the attack ability of AdvEncoder. AdvEncoder, the first framework for generating downstream-agnostic universal adversarial examples based on pre-trained encoders in self-supervised learning. Pre-trained encoders are increasingly used in various downstream tasks, raising concerns about their security as their vulnerabilities could impact numerous applications. A frequency-based generative attack framework is employed to construct adversarial perturbations or patches by learning the distribution of an attacker's surrogate dataset. AdvEncoder achieves high attack success rates and transferability against different downstream tasks, such as image classification and image retrieval. The attack remains effective even when the attacker has no knowledge of the pre-training dataset or downstream tasks. Existing defenses, like data corruption, fine-tuning, pruning, and adversarial training, show limited effectiveness against AdvEncoder. The attack performance of AdvEncoder may vary depending on the similarity between the attacker's surrogate dataset and the pre-training/downstream datasets. Further exploration is needed to develop more robust defenses specifically tailored to protect pre-trained encoders from adversarial attacks. Future work includes investigating the effectiveness of AdvEncoder on other downstream tasks beyond image classification and retrieval. adversarial examples, self-supervised learning, pre-trained encoders, universal adversarial perturbations, universal adversarial patches
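The AdvEncoder entry above optimizes universal perturbations against a frozen encoder's feature space. Below is a generic sketch of that setting: a single perturbation is optimized to push adversarial features away from clean features over a surrogate dataset. The cosine objective, hyperparameters, and loop structure are my simplifications and omit AdvEncoder's generator and high-frequency guidance.

```python
import torch
import torch.nn.functional as F

def train_universal_perturbation(encoder, loader, eps=10 / 255, epochs=10, lr=1e-2):
    """Optimize one perturbation delta that degrades features for all images (sketch).

    encoder: frozen pre-trained encoder mapping images -> feature vectors
    loader:  iterable of image batches (assumed 224x224, in [0, 1]) from a surrogate set
    """
    encoder.eval()
    for p in encoder.parameters():
        p.requires_grad_(False)

    delta = torch.zeros(1, 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)

    for _ in range(epochs):
        for x in loader:
            clean_feat = encoder(x).detach()
            adv_feat = encoder((x + delta).clamp(0, 1))
            # push adversarial features away from clean ones (simplified objective)
            loss = F.cosine_similarity(adv_feat, clean_feat, dim=-1).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
            with torch.no_grad():
                delta.clamp_(-eps, eps)      # keep the perturbation small
    return delta.detach()
```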
2307.12217 Report LoLep: Single-View View Synthesis with Locally-Learned Planes and Self-Attention Occlusion Inference Cong Wang, Yu-Ping Wang, Dinesh Manocha We propose a novel method, LoLep, which regresses Locally-Learned planes from a single RGB image to represent scenes accurately, thus generating better novel views. Without the depth information, regressing appropriate plane locations is a challenging problem. To solve this issue, we pre-partition the disparity space into bins and design a disparity sampler to regress local offsets for multiple planes in each bin. However, only using such a sampler makes the network not convergent; we further propose two optimizing strategies that combine with different disparity distributions of datasets and propose an occlusion-aware reprojection loss as a simple yet effective geometric supervision technique. We also introduce a self-attention mechanism to improve occlusion inference and present a Block-Sampling Self-Attention (BS-SA) module to address the problem of applying self-attention to large feature maps. We demonstrate the effectiveness of our approach and generate state-of-the-art results on different datasets. Compared to MINE, our approach has an LPIPS reduction of 4.8%-9.0% and an RV reduction of 73.9%-83.5%. We also evaluate the performance on real-world images and demonstrate the benefits. Proposes LoLep, a novel single-view view synthesis method using locally-learned planes to represent scenes and generate better novel views from a single RGB image. Existing methods struggle to represent occluded regions well, and while layered representations are suitable, they either require excessive computing power or rely on depth maps for accurate plane locations. Utilizes a disparity sampler to regress locally-learned plane locations, introduces two parameter optimization strategies for different disparity distributions, and proposes an occlusion-aware reprojection loss for geometric supervision. A Block-Sampling Self-Attention (BS-SA) module enhances occlusion inference on large feature maps. Outperforms MINE on KITTI, RealEstate10K, and Flowers Light Field datasets with improved LPIPS, SSIM, PSNR, and Rendering Variance (RV). Generates sharper and more realistic novel views with better handling of occlusions and scene geometry compared to previous methods. Demonstrates significant improvements in depth estimation on NYU-Depth V2 and iBims-1 datasets, highlighting accurate scene representation. The locally-learned planes, while effective, represent a suboptimal solution due to their restriction to specific disparity bins. Future work will focus on developing new techniques to optimize planes across the entire disparity space while preventing clustering, potentially achieving even better results. single-view view synthesis, locally-learned planes, occlusion inference, self-attention mechanism, multiplane image (mpi)
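The LoLep entry above regresses plane locations by pre-partitioning the disparity range into bins and predicting bounded local offsets inside each bin. A minimal sketch of that disparity sampler is below; the bin count, disparity range, and sigmoid bounding are assumptions chosen for illustration.

```python
import torch

def plane_disparities(raw_offsets, num_bins=8, planes_per_bin=4,
                      d_min=0.01, d_max=1.0):
    """Turn network outputs into per-plane disparities confined to bins (sketch).

    raw_offsets: (B, num_bins * planes_per_bin) unbounded network outputs
    Returns:     (B, num_bins * planes_per_bin) disparities in [d_min, d_max]
    """
    B = raw_offsets.shape[0]
    edges = torch.linspace(d_min, d_max, num_bins + 1)      # bin boundaries
    lo, width = edges[:-1], edges[1:] - edges[:-1]          # (num_bins,)
    offsets = torch.sigmoid(raw_offsets).view(B, num_bins, planes_per_bin)
    disparities = lo.view(1, -1, 1) + offsets * width.view(1, -1, 1)
    return disparities.view(B, -1)

d = plane_disparities(torch.randn(2, 32))
print(d.shape, d.min().item(), d.max().item())
```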
2307.12101 Report Spatial Self-Distillation for Object Detection with Inaccurate Bounding Boxes Di Wu, Pengfei Chen, Xuehui Yu, Guorong Li, Zhenjun Han, Jianbin Jiao Object detection via inaccurate bounding box supervision has attracted broad interest due to the expensive high-quality annotation data or the occasional inevitability of low annotation quality (e.g., tiny objects). Previous works usually utilize multiple instance learning (MIL), which highly depends on category information, to select and refine a low-quality box. Those methods suffer from object drift, group prediction, and part domination problems without exploring spatial information. In this paper, we heuristically propose a Spatial Self-Distillation based Object Detector (SSD-Det) to mine spatial information to refine the inaccurate box in a self-distillation fashion. SSD-Det utilizes a Spatial Position Self-Distillation (SPSD) module to exploit spatial information and an interactive structure to combine spatial information and category information, thus constructing a high-quality proposal bag. To further improve the selection procedure, a Spatial Identity Self-Distillation (SISD) module is introduced in SSD-Det to obtain spatial confidence to help select the best proposals. Experiments on MS-COCO and VOC datasets with noisy box annotations verify our method's effectiveness and achieve state-of-the-art performance. The code is available at https://github.com/ucas-vg/PointTinyBenchmark/tree/SSD-Det. This paper proposes SSD-Det, a Spatial Self-Distillation based object detector that addresses the challenge of training object detectors with inaccurate bounding box annotations. Training object detectors typically requires large amounts of accurately annotated data, which is expensive and time-consuming. Inaccurate annotations are common, especially with automated labeling techniques. This paper addresses this challenge by developing a method robust to such inaccuracies. SSD-Det leverages spatial information through two novel modules: Spatial Position Self-Distillation (SPSD) and Spatial Identity Self-Distillation (SISD). SPSD refines proposal bag construction by learning semantic-spatial correspondence, while SISD improves proposal selection by predicting object-aware IoU. SSD-Det significantly outperforms state-of-the-art methods on MS-COCO and VOC datasets with various noise levels. SPSD effectively improves proposal bag quality, leading to a higher upper bound for proposal selection. SISD successfully integrates object-relevant spatial confidence, improving proposal selection accuracy. The current implementation of SSD-Det is limited to two-stage object detectors. Future work can explore the extension of SSD-Det to other detection frameworks and tasks, such as instance segmentation. object detection, noisy annotations, self-distillation, spatial information, proposal refinement
2307.12027 Report On the Effectiveness of Spectral Discriminators for Perceptual Quality Improvement Xin Luo, Yunan Zhu, Shunxin Xu, Dong Liu Several recent studies advocate the use of spectral discriminators, which evaluate the Fourier spectra of images for generative modeling. However, the effectiveness of the spectral discriminators is not well interpreted yet. We tackle this issue by examining the spectral discriminators in the context of perceptual image super-resolution (i.e., GAN-based SR), as SR image quality is susceptible to spectral changes. Our analyses reveal that the spectral discriminator indeed performs better than the ordinary (a.k.a. spatial) discriminator in identifying the differences in the high-frequency range; however, the spatial discriminator holds an advantage in the low-frequency range. Thus, we suggest that the spectral and spatial discriminators shall be used simultaneously. Moreover, we improve the spectral discriminators by first calculating the patch-wise Fourier spectrum and then aggregating the spectra by Transformer. We verify the effectiveness of the proposed method twofold. On the one hand, thanks to the additional spectral discriminator, our obtained SR images have their spectra better aligned to those of the real images, which leads to a better PD tradeoff. On the other hand, our ensembled discriminator predicts the perceptual quality more accurately, as evidenced in the no-reference image quality assessment task. This paper analyzes the effectiveness of spectral discriminators for improving perceptual quality in GAN-based image super-resolution and proposes using spatial and spectral discriminators in combination. Spectral discriminators, which analyze images in the frequency domain, have been proposed to address spectral discrepancies between generated and real images, but their effectiveness remains unclear. The authors analyze the robustness of spatial and spectral discriminators under frequency perturbations, revealing their complementary strengths. They propose a Dual Transformer discriminator combining a Spatial Transformer and a Spectral Transformer with a per-patch Fourier Transform. Spectral discriminators excel at identifying high-frequency noise, complementing spatial discriminators' strength in detecting low-frequency deficiencies. Combining spatial and spectral discriminators in a Dual Transformer discriminator leads to better spectral alignment between generated and real images, improving perceptual quality in super-resolution. The Dual Transformer discriminator demonstrates superior performance in no-reference image quality assessment compared to spatial discriminators alone. Better-aligned spectra don't always guarantee improved perceptual quality. Training spatial and spectral discriminators separately increases computational overhead. image super-resolution, generative adversarial networks, spectral discriminators, perceptual quality, frequency analysis
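The spectral-discriminator entry above computes patch-wise Fourier spectra before aggregating them with a Transformer. Here is a minimal sketch of the patch-wise spectrum computation; the patch size and the log-amplitude choice are assumptions, and the Transformer aggregation is omitted.

```python
import torch

def patchwise_spectrum(img, patch=16):
    """Compute log-amplitude Fourier spectra for non-overlapping patches (sketch).

    img: (B, C, H, W) with H and W divisible by `patch`
    Returns: (B, num_patches, C * patch * patch) tokens for a spectral Transformer
    """
    B, C, H, W = img.shape
    # split into (B, C, H/p, p, W/p, p) patches, then reorder to patch-major layout
    x = img.view(B, C, H // patch, patch, W // patch, patch)
    x = x.permute(0, 2, 4, 1, 3, 5).reshape(B, -1, C, patch, patch)
    spec = torch.fft.fft2(x, norm="ortho")     # per-patch 2D FFT
    amp = torch.log1p(spec.abs())              # log-amplitude spectrum
    return amp.flatten(2)                      # one token per patch

tokens = patchwise_spectrum(torch.rand(2, 3, 128, 128))
print(tokens.shape)   # (2, 64, 768)
```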
2307.11978 Report Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels? Cheng-En Wu, Yu Tian, Haichao Yu, Heng Wang, Pedro Morgado, Yu Hen Hu, Linjie Yang Vision-language models such as CLIP learn a generic text-image embedding from large-scale training data. A vision-language model can be adapted to a new classification task through few-shot prompt tuning. We find that such a prompt tuning process is highly robust to label noises. This intrigues us to study the key reasons contributing to the robustness of the prompt tuning paradigm. We conducted extensive experiments to explore this property and find the key factors are: 1) the fixed classname tokens provide a strong regularization to the optimization of the model, reducing gradients induced by the noisy samples; 2) the powerful pre-trained image-text embedding that is learned from diverse and generic web data provides strong prior knowledge for image classification. Further, we demonstrate that noisy zero-shot predictions from CLIP can be used to tune its own prompt, significantly enhancing prediction accuracy in the unsupervised setting. The code is available at https://github.com/CEWu/PTNL. This paper discovers and analyzes the surprising robustness of prompt tuning for vision-language models (e.g., CLIP) against noisy labels, outperforming traditional transfer learning methods like fine-tuning and linear probes. Learning with noisy labels is crucial for real-world applications where perfectly annotated data is scarce. This study reveals the robustness of prompt tuning, a data-efficient method, in handling such imperfect data. The authors conduct extensive experiments on various datasets, comparing prompt tuning with linear probes and fine-tuning under different noise levels and types. They analyze the impact of different components like class embeddings, learnable prompts, and robust loss functions (GCE) on the model's performance. Prompt tuning of CLIP demonstrates significantly higher robustness to noisy labels compared to fine-tuning or linear probing methods. The fixed classname tokens within the prompt, along with CLIP's pre-trained text encoder, provide strong regularization, preventing overfitting to noisy data. Leveraging this robustness, a novel unsupervised prompt tuning approach is proposed, utilizing randomly sampled pseudo labels to enhance CLIP zero-shot performance. The study primarily focuses on CLIP, leaving the exploration of other vision-language models for future work. Investigating the impact of varying prompt lengths and exploring a wider range of robust loss functions beyond GCE could further enhance the understanding of noise robustness in prompt tuning. prompt tuning, vision-language models, noisy labels, clip, unsupervised learning
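The prompt-tuning entry above mentions robust loss functions such as generalized cross-entropy (GCE). For reference, the standard GCE loss is (1 - p_y^q) / q, which interpolates between cross-entropy (q -> 0) and MAE (q = 1); whether this exact variant and q value are used in every experiment is not stated in the summary, so treat the snippet as a generic reference implementation.

```python
import torch
import torch.nn.functional as F

def gce_loss(logits, targets, q=0.7):
    """Generalized cross-entropy: (1 - p_y^q) / q, robust to noisy labels.

    logits:  (B, num_classes) scores (e.g., CLIP image-text similarities)
    targets: (B,) possibly-noisy integer labels
    """
    probs = F.softmax(logits, dim=-1)
    p_y = probs.gather(1, targets.unsqueeze(1)).squeeze(1).clamp_min(1e-8)
    return ((1.0 - p_y.pow(q)) / q).mean()

loss = gce_loss(torch.randn(4, 10), torch.randint(0, 10, (4,)))
print(loss.item())
```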
2307.11932 Report RIC: Rotate-Inpaint-Complete for Generalizable Scene Reconstruction Isaac Kasahara, Shubham Agrawal, Selim Engin, Nikhil Chavan-Dafle, Shuran Song, Volkan Isler General scene reconstruction refers to the task of estimating the full 3D geometry and texture of a scene containing previously unseen objects. In many practical applications such as AR/VR, autonomous navigation, and robotics, only a single view of the scene may be available, making the scene reconstruction task challenging. In this paper, we present a method for scene reconstruction by structurally breaking the problem into two steps: rendering novel views via inpainting and 2D to 3D scene lifting. Specifically, we leverage the generalization capability of large visual language models (Dalle-2) to inpaint the missing areas of scene color images rendered from different views. Next, we lift these inpainted images to 3D by predicting normals of the inpainted image and solving for the missing depth values. By predicting for normals instead of depth directly, our method allows for robustness to changes in depth distributions and scale. With rigorous quantitative evaluation, we show that our method outperforms multiple baselines while providing generalization to novel objects and scenes. This paper introduces Rotate-Inpaint-Complete (RIC), a novel method for 3D scene reconstruction from a single RGB-D image, leveraging the inpainting capabilities of large visual language models to handle novel objects and scenes. Reconstructing complete 3D scenes from limited viewpoints is crucial for various applications like AR/VR, robotics, and autonomous navigation. Existing methods struggle with novel objects and cluttered scenes. RIC generates novel views by rotating the input image, uses DALL-E for inpainting missing regions, predicts surface normals and occlusion boundaries from inpainted images, and optimizes depth using these predictions. Finally, it filters inconsistencies across viewpoints to refine the 3D reconstruction. RIC outperforms baselines in 3D scene reconstruction metrics on both in-distribution (YCB-V) and out-of-distribution (HOPE) datasets, demonstrating generalizability. The method shows robustness to prompt specificity, indicating the effectiveness of view selection in preserving sufficient context for inpainting. Qualitative results highlight RIC's ability to generate realistic novel views and complete scene geometry, even for heavily occluded objects. DALL-E's tendency to generate unrealistic elements can impact reconstruction quality, although mitigated through consistency filtering. Reconstructing the backside of objects remains challenging due to limited context at large viewpoints. 3d scene reconstruction, single-view reconstruction, dall-e, inpainting, novel view synthesis
2307.11828 Report Enhancing Your Trained DETRs with Box Refinement Yiqun Chen, Qiang Chen, Peize Sun, Shoufa Chen, Jingdong Wang, Jian Cheng We present a conceptually simple, efficient, and general framework for localization problems in DETR-like models. We add plugins to well-trained models instead of inefficiently designing new models and training them from scratch. The method, called RefineBox, refines the outputs of DETR-like detectors by lightweight refinement networks. RefineBox is easy to implement and train as it only leverages the features and predicted boxes from the well-trained detection models. Our method is also efficient as we freeze the trained detectors during training. In addition, we can easily generalize RefineBox to various trained detection models without any modification. We conduct experiments on COCO and LVIS 1.0. Experimental results indicate the effectiveness of our RefineBox for DETR and its representative variants (Figure 1). For example, the performance gains for DETR, Conditional-DETR, DAB-DETR, and DN-DETR are 2.4 AP, 2.5 AP, 1.9 AP, and 1.6 AP, respectively. We hope our work will bring the attention of the detection community to the localization bottleneck of current DETR-like models and highlight the potential of the RefineBox framework. Code and models will be publicly available at: https://github.com/YiqunChen1999/RefineBox. This paper introduces RefineBox, a novel framework to enhance the localization accuracy of pre-trained DETR-like object detectors by adding a lightweight refinement network. The authors identify localization accuracy as the bottleneck in DETR-like detectors, hindering further performance improvement even with perfect classification. RefineBox leverages the feature pyramid network from the pre-trained detector's backbone and refines predicted bounding boxes using a series of Refiner modules. Crucially, the original detector's parameters are frozen during training. RefineBox consistently improves Average Precision (AP) across various DETR-like models on COCO and LVIS datasets. Significant gains are observed in AP75 and AP for small objects, highlighting improved localization accuracy. The framework is lightweight, adding minimal parameters and FLOPs, making it efficient for training and inference. The simple design of the current refinement network may limit further performance gains. Exploring more sophisticated architectures is left for future work. The paper mainly focuses on improving localization. Investigating the impact of refining classification jointly with localization is a promising direction. object detection, detr, localization accuracy, refinement network, two-stage detection
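To illustrate the "frozen detector + lightweight refiner" pattern in the RefineBox entry above, here is a small sketch of a box-refinement head that predicts residual corrections from per-query features and the detector's predicted boxes. The delta parameterization, layer sizes, and training-loop comments are illustrative assumptions, not the paper's exact Refiner module.

```python
import torch
import torch.nn as nn

class BoxRefiner(nn.Module):
    """Lightweight head predicting residual corrections for predicted boxes (sketch)."""

    def __init__(self, feat_dim=256, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 4, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),            # (dx, dy, dw, dh) deltas
        )

    def forward(self, query_feats, boxes_cxcywh):
        deltas = self.mlp(torch.cat([query_feats, boxes_cxcywh], dim=-1))
        cx = boxes_cxcywh[..., 0] + deltas[..., 0] * boxes_cxcywh[..., 2]
        cy = boxes_cxcywh[..., 1] + deltas[..., 1] * boxes_cxcywh[..., 3]
        w = boxes_cxcywh[..., 2] * deltas[..., 2].exp()
        h = boxes_cxcywh[..., 3] * deltas[..., 3].exp()
        return torch.stack([cx, cy, w, h], dim=-1)

# training sketch: the trained detector stays frozen, only the refiner is updated
# for p in detector.parameters(): p.requires_grad_(False)
# refined = BoxRefiner()(query_feats.detach(), pred_boxes.detach())
# loss = l1_loss(refined, gt_boxes) + giou_loss(refined, gt_boxes)
```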
2307.11661 Report Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts Mayug Maniparambil, Chris Vorster, Derek Molloy, Noel Murphy, Kevin McGuinness, Noel E. O'Connor Contrastive pretrained large Vision-Language Models (VLMs) like CLIP have revolutionized visual representation learning by providing good performance on downstream datasets. VLMs are 0-shot adapted to a downstream dataset by designing prompts that are relevant to the dataset. Such prompt engineering makes use of domain expertise and a validation dataset. Meanwhile, recent developments in generative pretrained models like GPT-4 mean they can be used as advanced internet search tools. They can also be manipulated to provide visual information in any structure. In this work, we show that GPT-4 can be used to generate text that is visually descriptive and how this can be used to adapt CLIP to downstream tasks. We show considerable improvements in 0-shot transfer accuracy on specialized fine-grained datasets like EuroSAT (~7%), DTD (~7%), SUN397 (~4.6%), and CUB (~3.3%) when compared to CLIP's default prompt. We also design a simple few-shot adapter that learns to choose the best possible sentences to construct generalizable classifiers that outperform the recently proposed CoCoOP by ~2% on average and by over 4% on 4 specialized fine-grained datasets. The code, prompts, and auxiliary text dataset is available at https://github.com/mayug/VDT-Adapter. This paper proposes a novel method to enhance CLIP's zero-shot and few-shot domain adaptation capabilities by leveraging GPT-4 generated visually descriptive text (VDT) information. This approach addresses the limitations of prompt engineering, which relies on domain expertise and often yields inconsistent results due to prompt sensitivity. Using VDT provides richer semantic information for CLIP, leading to more accurate and generalizable classification. The methodology involves two main stages: 1) **VDT generation:** GPT-4 is prompted to generate detailed visual descriptions for each class in a dataset. 2) **CLIP adaptation:** For zero-shot transfer, VDT is incorporated into prompt ensembles. For few-shot transfer, a lightweight adapter network (CLIP-A-self) with self-attention is trained to selectively aggregate the most relevant VDT, improving classification accuracy on unseen classes. GPT-4 generated VDT significantly improves CLIP's 0-shot performance on 12 diverse datasets, with an average gain of 2% and even larger improvements on fine-grained datasets like EuroSAT (7%), DTD (7%), and CUB (3.3%). CLIP-A-self, utilizing VDT, outperforms existing few-shot methods like CoCoOp by 3% on average in the Base-to-New setting, demonstrating better generalization ability. Analysis of attention weights reveals that CLIP-A-self effectively learns to prioritize visually relevant VDT sentences, contributing to its superior performance. While GPT-4 provides high-quality VDT, its dependence on a paid API might pose scalability constraints. Future work can explore incorporating other modalities like object detection or scene graphs to further enrich CLIP's understanding, potentially leading to even better performance. vision-language models, clip, gpt-4, prompt engineering, few-shot learning
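The CLIP/GPT-4 entry above builds zero-shot classifiers from ensembles of visually descriptive sentences. A minimal sketch of that ensemble is below; it assumes a CLIP-style model exposing encode_text/encode_image (as in the OpenAI CLIP and open_clip packages), and the prompt template plus the class_descriptions dictionary are placeholders rather than the paper's exact prompts.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_vdt_classifier(model, tokenizer, class_descriptions):
    """Average text embeddings of per-class visual descriptions into a classifier.

    class_descriptions: dict mapping class name -> list of GPT-4 description strings
    model / tokenizer:  CLIP-style model and its text tokenizer
    """
    weights = []
    for name, sentences in class_descriptions.items():
        prompts = [f"a photo of a {name}. {s}" for s in sentences]  # assumed template
        emb = model.encode_text(tokenizer(prompts))
        emb = F.normalize(emb, dim=-1).mean(dim=0)   # ensemble over descriptions
        weights.append(F.normalize(emb, dim=0))
    return torch.stack(weights, dim=1)               # (dim, num_classes)

@torch.no_grad()
def zero_shot_predict(model, images, classifier):
    img = F.normalize(model.encode_image(images), dim=-1)
    return (img @ classifier).argmax(dim=-1)
```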
2307.11558 Report Advancing Visual Grounding with Scene Knowledge: Benchmark and Method Zhihong Chen, Ruifei Zhang, Yibing Song, Xiang Wan, Guanbin Li Visual grounding (VG) aims to establish fine-grained alignment between vision and language. Ideally, it can be a testbed for vision-and-language models to evaluate their understanding of the images and texts and their reasoning abilities over their joint space. However, most existing VG datasets are constructed using simple description texts, which do not require sufficient reasoning over the images and texts. This has been demonstrated in a recent study [luo2022goes], where a simple LSTM-based text encoder without pretraining can achieve state-of-the-art performance on mainstream VG datasets. Therefore, in this paper, we propose a novel benchmark of Scene Knowledge-guided Visual Grounding (SK-VG), where the image content and referring expressions are not sufficient to ground the target objects, forcing the models to have a reasoning ability on the long-form scene knowledge. To perform this task, we propose two approaches to accept the triple-type input, where the former embeds knowledge into the image features before the image-query interaction; the latter leverages linguistic structure to assist in computing the image-text matching. We conduct extensive experiments to analyze the above methods and show that the proposed approaches achieve promising results but still leave room for improvement, including performance and interpretability. The dataset and code are available at https://github.com/zhjohnchan/SK-VG. This paper introduces SK-VG, a new benchmark dataset for visual grounding that requires models to reason over image, scene knowledge, and query triples for accurate object localization. Existing visual grounding datasets lack complex language understanding and reasoning challenges, failing to evaluate the full reasoning capabilities of vision-language models. SK-VG addresses this by incorporating scene knowledge, forcing models to reason beyond simple visual descriptions. The authors construct SK-VG with real images, manually annotated with scene stories and referring expressions. Two approaches, KeViLI (one-stage knowledge embedding) and LeViLM (two-stage linguistic-enhanced matching), are proposed to address this new task. LeViLM significantly outperforms traditional visual grounding models on SK-VG, demonstrating the effectiveness of incorporating scene knowledge. Fine-tuning LeViLM on SK-VG substantially improves performance compared to zero-shot or linear probing, highlighting the importance of model adaptation for this task. While LeViLM achieves decent results on easy/medium difficulty levels, it struggles with the hard split, indicating the need for further research in multi-hop reasoning and model interpretability. Scene knowledge annotation can be subjective and biased due to the creative nature of the task and annotator differences. The scale of SK-VG is relatively smaller compared to existing datasets due to the complexity and time-consuming annotation process. visual grounding, scene knowledge, reasoning, benchmark dataset, vision-language
2307.11458 Report Strip-MLP: Efficient Token Interaction for Vision MLP Guiping Cao, Shengda Luo, Wenjian Huang, Xiangyuan Lan, Dongmei Jiang, Yaowei Wang, Jianguo Zhang Token interaction is one of the core operations in MLP-based models to exchange and aggregate information between different spatial locations. However, the power of token interaction on the spatial dimension is highly dependent on the spatial resolution of the feature maps, which limits the model's expressive ability, especially in deep layers where the features are down-sampled to a small spatial size. To address this issue, we present a novel method called Strip-MLP to enrich the token interaction power in three ways. Firstly, we introduce a new MLP paradigm called the Strip MLP layer that allows a token to interact with other tokens in a cross-strip manner, enabling the tokens in a row (or column) to contribute to the information aggregations in adjacent but different strips of rows (or columns). Secondly, a Cascade Group Strip Mixing Module (CGSMM) is proposed to overcome the performance degradation caused by small spatial feature size. The module allows tokens to interact more effectively in a within-patch and cross-patch manner, independent of the feature spatial size. Finally, based on the Strip MLP layer, we propose a novel Local Strip Mixing Module (LSMM) to boost the token interaction power in the local region. Extensive experiments demonstrate that Strip-MLP significantly improves the performance of MLP-based models on small datasets and obtains comparable or even better results on ImageNet. In particular, Strip-MLP models achieve higher average Top-1 accuracy than existing MLP-based models by +2.44% on Caltech-101 and +2.16% on CIFAR-100. The source code will be available at https://github.com/Med-Process/Strip_MLP. This paper proposes Strip-MLP, an efficient vision MLP model that enriches token interaction power through a novel Strip MLP layer, Cascade Group Strip Mixing Module (CGSMM), and Local Strip Mixing Module (LSMM). Existing MLP-based models suffer from degraded token interaction power, especially in deep layers with down-sampled feature maps, limiting their expressive ability. The Strip MLP layer aggregates adjacent tokens in a cross-strip manner. CGSMM enables effective token interaction within and across channel-wise patches. LSMM enhances local token interactions. Strip-MLP significantly outperforms previous MLP-based models on small datasets like Caltech-101 and CIFAR-100. It achieves comparable or superior results on ImageNet-1K with fewer parameters and FLOPs than other MLP, CNN, and Transformer models. Ablation studies demonstrate the effectiveness of each proposed component. The optimal patch number in CGSMM depends on dataset scale and requires validation. Future work includes exploring the application of the Strip MLP layer in other vision tasks. vision mlp, token interaction, image classification, efficient model, strip mlp
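To give a feel for strip-style token mixing in the Strip-MLP entry above, here is a deliberately simplified row/column mixing layer: tokens are mixed along each row and each column with shared linear layers. The actual Strip MLP layer additionally mixes across adjacent strips and is combined with CGSMM/LSMM, which this sketch omits; sizes and the residual/norm placement are assumptions.

```python
import torch
import torch.nn as nn

class SimpleStripMixing(nn.Module):
    """Mix tokens along rows and columns with shared linear layers (simplified)."""

    def __init__(self, height, width, channels):
        super().__init__()
        self.row_mix = nn.Linear(width, width)     # mixes tokens within each row
        self.col_mix = nn.Linear(height, height)   # mixes tokens within each column
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                          # x: (B, C, H, W)
        x = x + self.row_mix(x)                                        # along W
        x = x + self.col_mix(x.transpose(-1, -2)).transpose(-1, -2)    # along H
        return self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

y = SimpleStripMixing(14, 14, 64)(torch.rand(2, 64, 14, 14))
print(y.shape)
```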
2307.11418 Report FaceCLIPNeRF: Text-driven 3D Face Manipulation using Deformable Neural Radiance Fields Sungwon Hwang, Junha Hyung, Daejin Kim, Min-Jung Kim, Jaegul Choo As recent advances in Neural Radiance Fields (NeRF) have enabled high-fidelity 3D face reconstruction and novel view synthesis, its manipulation also became an essential task in 3D vision. However, existing manipulation methods require extensive human labor, such as a user-provided semantic mask and manual attribute search unsuitable for non-expert users. Instead, our approach is designed to require a single text to manipulate a face reconstructed with NeRF. To do so, we first train a scene manipulator, a latent code-conditional deformable NeRF, over a dynamic scene to control a face deformation using the latent code. However, representing a scene deformation with a single latent code is unfavorable for compositing local deformations observed in different instances. As so, our proposed Position-conditional Anchor Compositor (PAC) learns to represent a manipulated scene with spatially varying latent codes. Their renderings with the scene manipulator are then optimized to yield high cosine similarity to a target text in CLIP embedding space for text-driven manipulation. To the best of our knowledge, our approach is the first to address the text-driven manipulation of a face reconstructed with NeRF. Extensive results, comparisons, and ablation studies demonstrate the effectiveness of our approach. Presents FaceCLIPNeRF, a method for text-driven 3D face manipulation using deformable neural radiance fields, enabling control of facial expressions using only text prompts. Existing 3D face manipulation methods with NeRF are labor-intensive, requiring manual input like semantic masks or attribute adjustments, making them unsuitable for non-expert users. This work addresses this by enabling manipulation with just a single text prompt. The method first trains a scene manipulator based on HyperNeRF to control facial deformations with latent codes. To overcome limitations in representing complex expressions, a Position-conditional Anchor Compositor (PAC) is introduced. This PAC learns to combine learned deformation anchors, enabling the representation of a manipulated scene with spatially varying latent codes. Finally, the rendered images are optimized to align with a target text's attributes in CLIP embedding space. FaceCLIPNeRF successfully manipulates facial expressions using both descriptive and emotional text prompts. The method outperforms baselines in quantitative metrics such as R-precision and LPIPS, demonstrating superior text reflectivity and visual quality. User studies confirm that FaceCLIPNeRF effectively reflects target text attributes while preserving visual realism and face identity. The method relies on a pre-trained human segmentation network for excluding dynamic scene elements during camera pose estimation, potentially limiting generalizability. Future work could explore expanding the range of manipulable facial attributes and improving the fine-grained control over facial features. 3d face manipulation, neural radiance fields, text-driven manipulation, deformable nerf, clip
2307.11342 Report Tuning Pre-trained Model via Moment Probing Mingze Gao, Qilong Wang, Zhenyi Lin, Pengfei Zhu, Qinghua Hu, Jingbo Zhou Recently, efficient fine-tuning of large-scale pre-trained models has attracted increasing research interest, where linear probing (LP) as a fundamental module is involved in exploiting the final representations for task-dependent classification. However, most of the existing methods focus on how to effectively introduce a few learnable parameters, and little work pays attention to the commonly used LP module. In this paper, we propose a novel Moment Probing (MP) method to further explore the potential of LP. Distinguished from LP, which builds a linear classification head based on the mean of final features (e.g., word tokens for ViT) or classification tokens, our MP performs a linear classifier on the feature distribution, which provides stronger representation ability by exploiting richer statistical information inherent in features. Specifically, we represent the feature distribution by its characteristic function, which is efficiently approximated by using first- and second-order moments of features. Furthermore, we propose a multi-head convolutional cross-covariance (MHC³) to compute second-order moments in an efficient and effective manner. By considering that MP could affect feature learning, we introduce a partially shared module to learn two recalibrating parameters (PSRP) for backbones based on MP, namely MP+. Extensive experiments on ten benchmarks using various models show that our MP significantly outperforms LP and is competitive with counterparts at less training cost, while our MP+ achieves state-of-the-art performance. This paper proposes Moment Probing (MP), a novel method for fine-tuning large pre-trained models that outperforms linear probing (LP) by leveraging feature distribution for classification. Existing efficient fine-tuning methods primarily focus on introducing learnable parameters while overlooking the potential of the commonly used LP module. This paper addresses this gap by enhancing the representation power of LP for improved performance. MP models feature distribution using the characteristic function, approximating it with first- and second-order moments. A multi-head convolutional cross-covariance (MHC³) method efficiently computes second-order moments. Furthermore, a partially shared module (PSRP) is introduced to learn recalibrating parameters for the backbone, resulting in MP+. MP consistently outperforms LP and achieves competitive or better performance than existing parameter-efficient methods at a lower training cost. MP generalizes well across pre-training strategies, few-shot settings, and out-of-distribution datasets. MP+, incorporating feature learning, surpasses full fine-tuning and other efficient methods, achieving state-of-the-art performance. The paper primarily focuses on classification tasks; future work could explore MP's applicability to other tasks like prompt learning. Further investigation into the theoretical properties and limitations of MHC³ is warranted. fine-tuning, linear probing, parameter-efficient learning, transfer learning, moment probing
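To show what "classifying on moments instead of the mean" in the Moment Probing entry above looks like, here is a simplified probe that feeds the concatenation of the first-order mean and a flattened second-order covariance of (projected) token features to a linear head. The paper's MHC³ computes second-order statistics far more efficiently with multi-head convolutional cross-covariance; this sketch only conveys the general idea, and the projection dimension is an arbitrary choice.

```python
import torch
import torch.nn as nn

class SimpleMomentProbe(nn.Module):
    """Linear classifier over first- and second-order moments of token features."""

    def __init__(self, dim, num_classes, proj_dim=32):
        super().__init__()
        self.proj = nn.Linear(dim, proj_dim)    # shrink dim so the covariance stays small
        self.head = nn.Linear(dim + proj_dim * proj_dim, num_classes)

    def forward(self, tokens):                  # tokens: (B, N, dim) from a frozen ViT
        mean = tokens.mean(dim=1)               # first-order moment
        z = self.proj(tokens)
        z = z - z.mean(dim=1, keepdim=True)
        cov = torch.einsum("bnd,bne->bde", z, z) / (z.shape[1] - 1)   # second-order
        return self.head(torch.cat([mean, cov.flatten(1)], dim=-1))

logits = SimpleMomentProbe(dim=768, num_classes=100)(torch.randn(2, 197, 768))
print(logits.shape)
```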
2307.11335 Report Tri-MipRF: Tri-Mip Representation for Efficient Anti-Aliasing Neural Radiance Fields Wenbo Hu, Yuling Wang, Lin Ma, Bangbang Yang, Lin Gao, Xiao Liu, Yuewen Ma Despite the tremendous progress in neural radiance fields (NeRF), we still face a dilemma of the trade-off between quality and efficiency, e.g., MipNeRF presents fine-detailed and anti-aliased renderings but takes days for training, while Instant-ngp can accomplish the reconstruction in a few minutes but suffers from blurring or aliasing when rendering at various distances or resolutions due to ignoring the sampling area. To this end, we propose a novel Tri-Mip encoding that enables both instant reconstruction and anti-aliased high-fidelity rendering for neural radiance fields. The key is to factorize the pre-filtered 3D feature spaces in three orthogonal mipmaps. In this way, we can efficiently perform 3D area sampling by taking advantage of 2D pre-filtered feature maps, which significantly elevates the rendering quality without sacrificing efficiency. To cope with the novel Tri-Mip representation, we propose a cone-casting rendering technique to efficiently sample anti-aliased 3D features with the Tri-Mip encoding considering both pixel imaging and observing distance. Extensive experiments on both synthetic and real-world datasets demonstrate our method achieves state-of-the-art rendering quality and reconstruction speed while maintaining a compact representation that reduces 25% model size compared against Instant-ngp. This paper proposes Tri-MipRF, a novel neural radiance field representation that enables both instant reconstruction and anti-aliased high-fidelity rendering. Existing NeRF methods face a trade-off between quality and efficiency, struggling to achieve both high-quality anti-aliased renderings and fast reconstruction. The method introduces a novel Tri-Mip encoding that factorizes pre-filtered 3D feature spaces into three orthogonal mipmaps, allowing efficient 3D area sampling using 2D feature maps. It also employs a cone-casting rendering technique with adaptive sphere sampling based on pixel imaging and distance, coupled with a hybrid volume-surface rendering strategy for real-time performance. Tri-MipRF achieves state-of-the-art rendering quality with fine details and reduced aliasing on multi-scale Blender datasets. The method achieves fast reconstruction within five minutes on a single GPU, comparable to Instant-ngp but with superior rendering quality. Tri-MipRF maintains a compact representation, reducing the model size by 25% compared to Instant-ngp. The reliance on proxy mesh generation for real-time rendering introduces additional steps and may impact performance for complex scenes. Exploration of alternative rendering strategies beyond hybrid volume-surface rendering could further enhance efficiency. neural radiance fields, anti-aliasing, mipmap, cone casting, real-time rendering
2307.11308 Report DPM-OT: A New Diffusion Probabilistic Model Based on Optimal Transport Zezeng Li, ShengHao Li, Zhanpeng Wang, Na Lei, Zhongxuan Luo, Xianfeng Gu Sampling from diffusion probabilistic models (DPMs) can be viewed as a piecewise distribution transformation, which generally requires hundreds or thousands of steps of the inverse diffusion trajectory to get a high-quality image. Recent progress in designing fast samplers for DPMs achieves a trade-off between sampling speed and sample quality by knowledge distillation or adjusting the variance schedule or the denoising equation. However, it can't be optimal in both aspects and often suffer from mode mixture in short steps. To tackle this problem, we innovatively regard inverse diffusion as an optimal transport (OT) problem between latents at different stages and propose the DPM-OT, a unified learning framework for fast DPMs with a direct expressway represented by OT map, which can generate high-quality samples within around 10 function evaluations. By calculating the semi-discrete optimal transport map between the data latents and the white noise, we obtain an expressway from the prior distribution to the data distribution, while significantly alleviating the problem of mode mixture. In addition, we give the error bound of the proposed method, which theoretically guarantees the stability of the algorithm. Extensive experiments validate the effectiveness and advantages of DPM-OT in terms of speed and quality (FID and mode mixture), thus representing an efficient solution for generative modeling. Source codes are available at https://github.com/cognaclee/DPM-OT This paper introduces DPM-OT, a new diffusion probabilistic model for fast sampling that leverages optimal transport (OT) to build an expressway between latents at different stages of the inverse diffusion process. Existing fast DPMs often compromise sample quality or introduce mode mixture due to approximating a continuous diffusion process. DPM-OT addresses these limitations by utilizing OT. DPM-OT computes a semi-discrete optimal transport (SDOT) map between white noise and data latents at an intermediate diffusion step. This map acts as an expressway to quickly bring the noise to a near-perfect initial point for subsequent inverse diffusion, significantly reducing sampling steps. DPM-OT generates high-quality images with fewer function evaluations compared to state-of-the-art models. The proposed method effectively mitigates mode mixture, leading to more semantically meaningful samples. Theoretical analysis proves that DPM-OT can fit the target data distribution no worse than traditional DPMs. A limitation is the storage requirement for noisy training samples at the intermediate diffusion step. Future work includes extending DPM-OT to conditional image generation tasks. diffusion probabilistic models, optimal transport, fast sampling, mode mixture, generative modeling
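The DPM-OT entry above relies on a semi-discrete optimal transport (SDOT) map from a continuous prior to a discrete set of data latents. Below is a generic Monte Carlo dual-ascent sketch for such a map (each noise sample is assigned to the latent minimizing the power-diagram cost, and the potentials are adjusted until each latent receives its target mass). This is a textbook construction under my assumptions of quadratic cost and uniform target weights, not DPM-OT's exact algorithm or hyperparameters.

```python
import torch

def fit_sdot_potentials(noise_sampler, targets, steps=2000, batch=1024, lr=0.1):
    """Fit semi-discrete OT potentials h so each target latent receives equal mass.

    noise_sampler(n): returns (n, d) samples from the continuous source (e.g., N(0, I))
    targets:          (K, d) discrete target latents
    """
    K = targets.shape[0]
    nu = torch.full((K,), 1.0 / K)            # target weights (uniform here)
    h = torch.zeros(K)
    for _ in range(steps):
        x = noise_sampler(batch)                               # (batch, d)
        cost = 0.5 * torch.cdist(x, targets) ** 2 - h          # power-diagram cost
        cell = cost.argmin(dim=1)                              # assigned cell per sample
        hist = torch.bincount(cell, minlength=K).float() / batch
        h = h + lr * (nu - hist)              # ascent on the semi-discrete dual
    return h

def sdot_map(x, targets, h):
    """Transport x to its assigned target latent (the 'expressway' endpoint)."""
    cost = 0.5 * torch.cdist(x, targets) ** 2 - h
    return targets[cost.argmin(dim=1)]

targets = torch.randn(128, 16)                # e.g., data latents at an intermediate step
h = fit_sdot_potentials(lambda n: torch.randn(n, 16), targets, steps=200)
```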
2307.11086 Report PAPR: Proximity Attention Point Rendering Yanshu Zhang, Shichong Peng, Alireza Moazeni, Ke Li Learning accurate and parsimonious point cloud representations of scene surfaces from scratch remains a challenge in 3D representation learning. Existing point-based methods often suffer from the vanishing gradient problem or require a large number of points to accurately model scene geometry and texture. To address these limitations, we propose Proximity Attention Point Rendering (PAPR), a novel method that consists of a point-based scene representation and a differentiable renderer. Our scene representation uses a point cloud where each point is characterized by its spatial position, influence score, and view-independent feature vector. The renderer selects the relevant points for each ray and produces accurate colours using their associated features. PAPR effectively learns point cloud positions to represent the correct scene geometry, even when the initialization drastically differs from the target geometry. Notably, our method captures fine texture details while using only a parsimonious set of points. We also demonstrate four practical applications of our method: zero-shot geometry editing, object manipulation, texture transfer, and exposure control. More results and code are available on our project website at https://zvict.github.io/papr/. This paper introduces Proximity Attention Point Rendering (PAPR), a novel method for learning and rendering parsimonious 3D point cloud scene representations directly from multi-view RGB images. Learning accurate and concise representations of 3D scenes is crucial for various applications in computer vision and graphics, including novel view synthesis, scene editing, and virtual reality. Existing methods often struggle to balance representation capacity and computational complexity. PAPR leverages a point cloud representation, where each point is defined by its position, influence score, and a view-independent feature vector. It utilizes a differentiable renderer with a ray-dependent point embedding and proximity attention to select relevant points and combine their features for generating high-quality images. PAPR effectively learns accurate geometry and texture details from scratch, even with random point cloud initialization. It outperforms prior point-based and volumetric methods in terms of image quality while using a parsimonious point set. PAPR enables several practical applications, including zero-shot geometry editing, object manipulation, texture transfer, and exposure control. The current pruning strategy assumes a near-constant background color, limiting its applicability in complex backgrounds. Future work can explore learning a separate model to handle background variations and extend the method's capability with more points and deeper networks. 3d scene representation, point cloud rendering, neural rendering, differentiable rendering, proximity attention
2307.11073 Report OBJECT 3DIT: Language-guided 3D-aware Image Editing Oscar Michel, Anand Bhattad, Eli VanderBilt, Ranjay Krishna, Aniruddha Kembhavi, Tanmay Gupta Existing image editing tools, while powerful, typically disregard the underlying 3D geometry from which the image is projected. As a result, edits made using these tools may become detached from the geometry and lighting conditions that are at the foundation of the image formation process. In this work, we formulate the new task of language-guided 3D-aware editing, where objects in an image should be edited according to a language instruction in the context of the underlying 3D scene. To promote progress towards this goal, we release OBJECT: a dataset consisting of 400K editing examples created from procedurally generated 3D scenes. Each example consists of an input image, an editing instruction in language, and the edited image. We also introduce 3DIT: single and multi-task models for four editing tasks. Our models show impressive abilities to understand the 3D composition of entire scenes, factoring in surrounding objects, surfaces, lighting conditions, shadows, and physically-plausible object configurations. Surprisingly, despite training only on synthetic scenes from OBJECT, the editing capabilities of 3DIT generalize to real-world images. This work introduces a novel model, 3DIT, for 3D-aware language-guided image editing that considers scene context, including geometry, lighting, and object interactions. Existing image editing methods often fall short in maintaining 3D consistency, leading to unrealistic edits. 3DIT aims to address this gap by leveraging language instructions for object manipulation while preserving scene realism. The authors created a dataset, Objaverse Editing in Context (OEC), with 400k editing examples generated from 3D scenes. They fine-tuned a diffusion model on OEC for four tasks: object translation, rotation, insertion, and removal. 3DIT outperforms baselines on quantitative metrics for realism and faithfulness. Human evaluation shows a strong preference for 3DIT outputs, demonstrating superior geometric and lighting consistency. The model exhibits promising generalization to real-world images despite being trained solely on synthetic data. The current model is limited to single-object manipulations, and real-world performance can be further improved. Future work includes extending 3DIT to multiple object edits and more complex scene manipulations, and exploring fine-tuning on real-world data. image editing, 3d-aware, language-guided, diffusion models, synthetic data
2307.11035 Report Cascade-DETR: Delving into High-Quality Universal Object Detection Mingqiao Ye, Lei Ke, Siyuan Li, Yu-Wing Tai, Chi-Keung Tang, Martin Danelljan, Fisher Yu Object localization in general environments is a fundamental part of vision systems. While dominating on the COCO benchmark, recent Transformer-based detection methods are not competitive in diverse domains. Moreover, these methods still struggle to very accurately estimate the object bounding boxes in complex environments. We introduce Cascade-DETR for high-quality universal object detection. We jointly tackle the generalization to diverse domains and localization accuracy by proposing the Cascade Attention layer, which explicitly integrates object-centric information into the detection decoder by limiting the attention to the previous box prediction. To further enhance accuracy, we also revisit the scoring of queries. Instead of relying on classification scores, we predict the expected IoU of the query, leading to substantially more well-calibrated confidences. Lastly, we introduce a universal object detection benchmark, UDB10, that contains 10 datasets from diverse domains. While also advancing the state-of-the-art on COCO, Cascade-DETR substantially improves DETR-based detectors on all datasets in UDB10, even by over 10 mAP in some cases. The improvements under stringent quality requirements are even more pronounced. Our code and models will be released at https://github.com/SysCV/cascade-detr. This paper proposes Cascade-DETR, a new DETR-based object detection model, for high-quality universal object detection. It tackles the generalization to diverse domains and localization accuracy of DETR-based detectors. Existing Transformer-based object detection methods, while achieving SOTA performance on COCO, show limitations in generalizing to diverse domains and achieving high accuracy in bounding box estimations. Cascade-DETR introduces two main components: (1) Cascade Attention: constrains cross-attention within iteratively refined predicted bounding boxes, injecting local object-centric prior. (2) IoU-aware Query Recalibration: predicts IoU of each query to recalibrate classification scores for better reflecting prediction quality. A new benchmark, UDB10, consisting of 10 datasets from diverse domains, is also introduced. Cascade-DETR outperforms SOTA DETR-based methods on UDB10, improving UniAP by 5.7. On COCO, Cascade-DETR achieves significant improvements, especially under strict IoU thresholds, indicating better bounding box accuracy. Cascade-DETR consistently outperforms baselines across various domains in UDB10, even when fine-tuned from COCO pre-trained models, showcasing its generalizability. The paper assumes the availability of bounding box annotations for training across all datasets, which might not always be feasible in real-world scenarios. Further exploration on incorporating different types of weak supervision, such as image-level tags or point annotations, for training Cascade-DETR is left for future work. object detection, transformers, detr, bounding box accuracy, generalization
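The Cascade-DETR entry above recalibrates query confidences with a predicted IoU instead of relying on the classification score alone. The snippet below shows one common way to combine the two signals at inference (a weighted geometric mean); the exact combination rule and the value of alpha are assumptions for illustration, not necessarily the paper's formulation.

```python
import torch

def recalibrate_scores(cls_prob, pred_iou, alpha=0.5):
    """Combine classification probability with predicted IoU (sketch).

    cls_prob: (num_queries, num_classes) per-class probabilities
    pred_iou: (num_queries,) predicted IoU of each query's box, in [0, 1]
    alpha:    assumed balancing exponent between the two signals
    """
    pred_iou = pred_iou.clamp(0, 1).unsqueeze(-1)
    return cls_prob.clamp_min(1e-8) ** alpha * pred_iou ** (1.0 - alpha)

scores = recalibrate_scores(torch.rand(300, 80), torch.rand(300))
print(scores.shape)
```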
2307.10984 Report Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image Wei Yin, Chi Zhang, Hao Chen, Zhipeng Cai, Gang Yu, Kaixuan Wang, Xiaozhi Chen, Chunhua Shen Reconstructing accurate 3D scenes from images is a long-standing vision task. Due to the ill-posedness of the single-image reconstruction problem, most well-established methods are built upon multi-view geometry. State-of-the-art (SOTA) monocular metric depth estimation methods can only handle a single camera model and are unable to perform mixed-data training due to the metric ambiguity. Meanwhile, SOTA monocular methods trained on large mixed datasets achieve zero-shot generalization by learning affine-invariant depths, which cannot recover real-world metrics. In this work, we show that the key to a zero-shot single-view metric depth model lies in the combination of large-scale data training and resolving the metric ambiguity from various camera models. We propose a canonical camera space transformation module, which explicitly addresses the ambiguity problems and can be effortlessly plugged into existing monocular models. Equipped with our module, monocular models can be stably trained with over 8 million images with thousands of camera models, resulting in zero-shot generalization to in-the-wild images with unseen camera settings. Experiments demonstrate SOTA performance of our method on 7 zero-shot benchmarks. Notably, our method won the championship in the 2nd Monocular Depth Estimation Challenge. Our method enables the accurate recovery of metric 3D structures on randomly collected internet images, paving the way for plausible single-image metrology. The potential benefits extend to downstream tasks, which can be significantly improved by simply plugging in our model. For example, our model relieves the scale drift issues of monocular-SLAM (Fig. 1), leading to high-quality metric scale dense mapping. The code is available at https://github.com/YvanYin/Metric3D. This paper introduces Metric3D, a novel method for zero-shot metric 3D prediction from a single image, enabling accurate metric 3D reconstruction from in-the-wild images. Existing methods for single-image 3D reconstruction either rely on object-specific priors or can only predict affine-invariant depths, lacking real-world metric information crucial for applications like metrology and robotics. Metric3D addresses the metric ambiguity issue by introducing a canonical camera space transformation module (CSTM). This module transforms training data to a canonical camera space, allowing the model to learn metric depth information across diverse camera settings. Additionally, it leverages a random proposal normalization loss (RPNL) to enhance depth accuracy by emphasizing local geometric details. Metric3D achieves state-of-the-art performance on 7 zero-shot benchmarks, outperforming existing methods in terms of metric depth accuracy and generalization ability. The method enables plausible single-image metrology, demonstrated by its ability to accurately measure object sizes in real-world images. It significantly improves the performance of downstream tasks like monocular SLAM, enabling metric-scale dense mapping by providing accurate depth priors. The accuracy of the metric reconstruction relies on the availability and accuracy of camera intrinsic parameters. Future work includes exploring ways to estimate camera intrinsics directly from images to further enhance the applicability of the method. 3d reconstruction, metric depth estimation, zero-shot learning, single-image metrology, monocular slam
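The canonical camera space transformation in the Metric3D entry above can be illustrated with a simple focal-ratio rescaling: depth labels are mapped into a canonical camera by the ratio of a canonical focal length to the image's actual focal length, and predictions are mapped back at inference. This is a sketch of the label-side variant as I understand it; the canonical focal value below is a placeholder, not the paper's setting.

```python
import torch

CANONICAL_FOCAL = 1000.0   # placeholder canonical focal length (pixels)

def to_canonical_depth(depth, focal):
    """Rescale metric depth labels into the canonical camera space (sketch)."""
    return depth * (CANONICAL_FOCAL / focal)

def to_metric_depth(pred_canonical_depth, focal):
    """Undo the canonical transform to recover real-world metric depth."""
    return pred_canonical_depth * (focal / CANONICAL_FOCAL)

# example: a prediction made in canonical space for an image with focal length 720 px
pred = torch.rand(1, 1, 480, 640) * 10.0
metric = to_metric_depth(pred, focal=720.0)
```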
2307.10854 Report BlendFace: Re-designing Identity Encoders for Face-Swapping Kaede Shiohara, Xingchao Yang, Takafumi Taketomi The great advancements of generative adversarial networks and face recognition models in computer vision have made it possible to swap identities on images from single sources. Although a lot of studies seems to have proposed almost satisfactory solutions, we notice previous methods still suffer from an identity-attribute entanglement that causes undesired attributes swapping because widely used identity encoders, eg, ArcFace, have some crucial attribute biases owing to their pretraining on face recognition tasks. To address this issue, we design BlendFace, a novel identity encoder for face-swapping. The key idea behind BlendFace is training face recognition models on blended images whose attributes are replaced with those of another mitigates inter-personal biases such as hairsyles. BlendFace feeds disentangled identity features into generators and guides generators properly as an identity loss function. Extensive experiments demonstrate that BlendFace improves the identity-attribute disentanglement in face-swapping models, maintaining a comparable quantitative performance to previous methods. This paper introduces BlendFace, a novel identity encoder for face-swapping that mitigates identity-attribute entanglement, a common problem in existing models. Existing face-swapping models often exhibit identity-attribute entanglement due to biases in pre-trained identity encoders like ArcFace, leading to undesired attribute swapping (e.g., hairstyles). BlendFace is trained on blended images with swapped attributes, reducing bias towards specific features. It's then integrated into a face-swapping model as both the source feature extractor and identity loss function. BlendFace improves identity-attribute disentanglement in face-swapping, reducing unwanted attribute transfer. It shows comparable or superior performance to state-of-the-art methods on FaceForensics++ in identity similarity and attribute preservation (expression, pose, gaze). BlendFace enhances the visual consistency of swapped faces compared to existing models. BlendFace may not effectively handle large differences in face shapes between source and target images. Preserving hard occlusions (e.g., hands) remains challenging due to limited training data. face swapping, identity encoder, generative adversarial networks, attribute disentanglement, face recognition
2307.10829 Report Exact Diffusion Inversion via Bi-directional Integration Approximation Guoqiang Zhang, J. P. Lewis, W. Bastiaan Kleijn Recently, various methods have been proposed to address the inconsistency issue of DDIM inversion to enable image editing, such as EDICT [36] and Null-text inversion [22]. However, the above methods introduce considerable computational overhead. In this paper, we propose a new technique, named \emph{bi-directional integration approximation} (BDIA), to perform exact diffusion inversion with negligible computational overhead. Suppose we would like to estimate the next diffusion state $\boldsymbol{z}_{i-1}$ at timestep $t_i$ with the historical information $(i,\boldsymbol{z}_i)$ and $(i+1,\boldsymbol{z}_{i+1})$. We first obtain the estimated Gaussian noise $\hat{\boldsymbol{\epsilon}}(\boldsymbol{z}_i,i)$, and then apply the DDIM update procedure twice for approximating the ODE integration over the next time-slot $[t_i, t_{i-1}]$ in the forward manner and the previous time-slot $[t_i, t_{i+1}]$ in the backward manner. The DDIM step for the previous time-slot is used to refine the integration approximation made earlier when computing $\boldsymbol{z}_i$. A nice property of BDIA-DDIM is that the update expression for $\boldsymbol{z}_{i-1}$ is a linear combination of $(\boldsymbol{z}_{i+1}, \boldsymbol{z}_i, \hat{\boldsymbol{\epsilon}}(\boldsymbol{z}_i,i))$. This allows for exact backward computation of $\boldsymbol{z}_{i+1}$ given $(\boldsymbol{z}_i, \boldsymbol{z}_{i-1})$, thus leading to exact diffusion inversion. It is demonstrated with experiments that (round-trip) BDIA-DDIM is particularly effective for image editing. Our experiments further show that BDIA-DDIM produces markedly better image sampling qualities than DDIM for text-to-image generation. BDIA can also be applied to improve the performance of other ODE solvers in addition to DDIM. In our work, it is found that applying BDIA to the EDM sampling procedure produces consistently better performance over four pre-trained models. This paper proposes BDIA (bi-directional integration approximation), a novel technique for achieving exact diffusion inversion with DDIM, reducing computational overhead compared to methods like EDICT. Exact diffusion inversion is crucial for image editing applications in diffusion models, but existing methods introduce significant computational overhead. This work aims to address this limitation. BDIA-DDIM approximates ODE integration using both forward and backward DDIM updates at each timestep. This allows for expressing each diffusion state as a linear combination of previous states and estimated noise, enabling exact inversion without doubling NFEs. BDIA-DDIM achieves superior image sampling quality (FID score) compared to DDIM and DPM-Solver in text-to-image generation. Incorporating BDIA into both DDIM and EDM significantly improves FID scores for unconditional image generation. BDIA-DDIM demonstrates promising results in both text-based and ControlNet-based image editing, achieving comparable quality to EDICT while reducing NFEs by approximately half. The paper primarily focuses on DDIM and EDM. Exploring BDIA's application with other ODE solvers (e.g., PLMS, DEIS) is left for future work. The paper evaluates image editing qualitatively and with FID scores. Further quantitative evaluation using other metrics (e.g., LPIPS) could provide a more comprehensive understanding of BDIA-DDIM's performance.
diffusion models, image editing, ode solvers, ddim inversion, exact inversion
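The exact-inversion property follows from the linear-combination form of the update: once z_{i-1} equals z_{i+1} plus a term that depends only on z_i, the inverse direction recovers z_{i+1} by simple subtraction, with no fixed-point iteration or extra network evaluations. The sketch below shows only this coupling-style structure; the placeholder increment stands in for the combined forward/backward DDIM half-steps and does not use the paper's actual coefficients.

    import torch

    def increment(z_i, i, eps_model):
        # Placeholder for BDIA's combination of the forward DDIM step over [t_i, t_{i-1}]
        # and the backward DDIM step over [t_i, t_{i+1}]; the only property used below
        # is that it depends on (z_i, i) alone.
        return 0.1 * eps_model(z_i, i)

    def bdia_step(z_ip1, z_i, i, eps_model):
        # z_{i-1} = z_{i+1} + g(z_i): the next state from the two most recent states.
        return z_ip1 + increment(z_i, i, eps_model)

    def bdia_invert(z_i, z_im1, i, eps_model):
        # Exact inversion: z_{i+1} = z_{i-1} - g(z_i), recovered by subtraction.
        return z_im1 - increment(z_i, i, eps_model)

    if __name__ == "__main__":
        eps_model = lambda z, i: torch.tanh(z)        # stand-in noise predictor
        z_ip1, z_i = torch.randn(4), torch.randn(4)
        z_im1 = bdia_step(z_ip1, z_i, i=5, eps_model=eps_model)
        recovered = bdia_invert(z_i, z_im1, i=5, eps_model=eps_model)
        assert torch.allclose(recovered, z_ip1)       # exact up to floating-point error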
2307.10816 Report BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion Jinheng Xie, Yuexiang Li, Yawen Huang, Haozhe Liu, Wentian Zhang, Yefeng Zheng, Mike Zheng Shou Recent text-to-image diffusion models have demonstrated an astonishing capacity to generate high-quality images. However, researchers mainly studied the way of synthesizing images with only text prompts. While some works have explored using other modalities as conditions, considerable paired data, e.g., box/mask-image pairs, and fine-tuning time are required for nurturing models. As such paired data is time-consuming and labor-intensive to acquire and restricted to a closed set, this potentially becomes the bottleneck for applications in an open world. This paper focuses on the simplest form of user-provided conditions, e.g., box or scribble. To mitigate the aforementioned problem, we propose a training-free method to control objects and contexts in the synthesized images adhering to the given spatial conditions. Specifically, three spatial constraints, i.e., Inner-Box, Outer-Box, and Corner Constraints, are designed and seamlessly integrated into the denoising step of diffusion models, requiring no additional training and massive annotated layout data. Extensive experimental results demonstrate that the proposed constraints can control what and where to present in the images while retaining the ability of Diffusion models to synthesize with high fidelity and diverse concept coverage. The code is publicly available at https://github.com/showlab/BoxDiff. This paper proposes BoxDiff, a training-free method for controlling the location and scale of objects in images synthesized by pre-trained text-to-image diffusion models, using simple spatial constraints like boxes or scribbles. Current text-to-image models lack fine-grained spatial control, and existing layout-to-image methods require significant paired training data and are limited to closed-set categories. BoxDiff addresses these limitations by providing a training-free approach for spatial control in open-world settings. BoxDiff works by applying spatial constraints (Inner-Box, Outer-Box, and Corner) to the cross-attention maps between text tokens and intermediate features during the denoising step of diffusion models. These constraints guide the synthesis process to adhere to the user-provided spatial conditions. BoxDiff successfully controls the location and scale of synthesized objects according to user-provided boxes or scribbles. The method outperforms existing fully-supervised layout-to-image methods in terms of both semantic accuracy and alignment with spatial conditions. BoxDiff retains the high fidelity and diverse concept coverage of the underlying diffusion models, enabling the synthesis of novel objects and scenes. The precision of spatial control is limited by the resolution of the cross-attention maps used. BoxDiff may struggle with unusual prompts or combinations of objects that infrequently co-occur. text-to-image synthesis, diffusion models, spatial constraints, training-free, open-world
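A rough sketch of how a box constraint on a cross-attention map can act as training-free guidance: reward high attention of the target token inside the user box, penalize attention outside it, and nudge the latent along the negative gradient of that energy during denoising. The mean-based constraint below is a simplified stand-in for the paper's Inner-Box/Outer-Box/Corner formulation, and all names are hypothetical.

    import torch

    def box_constraint_loss(attn_map, box_mask, eps=1e-8):
        # attn_map: (H, W) cross-attention of one text token; box_mask: (H, W) binary box.
        inside = (attn_map * box_mask).sum() / (box_mask.sum() + eps)
        outside = (attn_map * (1 - box_mask)).sum() / ((1 - box_mask).sum() + eps)
        # Small when attention concentrates inside the box and vanishes outside it.
        return (1.0 - inside) + outside

    def guide_latent(latent, attn_fn, box_mask, step_size=0.1):
        # One training-free guidance step: differentiate the constraint w.r.t. the latent.
        latent = latent.detach().requires_grad_(True)
        loss = box_constraint_loss(attn_fn(latent), box_mask)
        grad, = torch.autograd.grad(loss, latent)
        return (latent - step_size * grad).detach()

    if __name__ == "__main__":
        H = W = 16
        mask = torch.zeros(H, W); mask[4:12, 4:12] = 1.0
        proj = torch.randn(H * W, H * W)                      # toy attention "model"
        attn_fn = lambda z: torch.sigmoid(proj @ z.flatten()).reshape(H, W)
        latent = guide_latent(torch.randn(H, W), attn_fn, mask)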
2307.10797 Report HyperReenact: One-Shot Reenactment via Jointly Learning to Refine and Retarget Faces Stella Bounareli, Christos Tzelepis, Vasileios Argyriou, Ioannis Patras, Georgios Tzimiropoulos In this paper, we present our method for neural face reenactment, called HyperReenact, that aims to generate realistic talking head images of a source identity, driven by a target facial pose. Existing state-of-the-art face reenactment methods train controllable generative models that learn to synthesize realistic facial images, yet producing reenacted faces that are prone to significant visual artifacts, especially under the challenging condition of extreme head pose changes, or requiring expensive few-shot fine-tuning to better preserve the source identity characteristics. We propose to address these limitations by leveraging the photorealistic generation ability and the disentangled properties of a pretrained StyleGAN2 generator, by first inverting the real images into its latent space and then using a hypernetwork to perform: (i) refinement of the source identity characteristics and (ii) facial pose re-targeting, eliminating this way the dependence on external editing methods that typically produce artifacts. Our method operates under the one-shot setting (i.e., using a single source frame) and allows for cross-subject reenactment, without requiring any subject-specific fine-tuning. We compare our method both quantitatively and qualitatively against several state-of-the-art techniques on the standard benchmarks of VoxCeleb1 and VoxCeleb2, demonstrating the superiority of our approach in producing artifact-free images, exhibiting remarkable robustness even under extreme head pose changes. We make the code and the pretrained models publicly available at: https://github.com/StelaBou/HyperReenact . HyperReenact, a neural face reenactment method that refines and retargets facial images using a pretrained StyleGAN2 model and a hypernetwork. Existing methods struggle to produce realistic results in one-shot settings or with extreme head pose changes. This method aims to address these limitations by leveraging the strengths of pretrained StyleGAN2 and hypernetworks. The method uses a hypernetwork to modify the weights of the StyleGAN2 generator based on appearance features from a source image and pose features from a target image. It operates in one-shot (using a single source frame) and is trained in three phases: real image inversion, self reenactment, and cross-subject reenactment. Outperforms state-of-the-art methods on identity preservation and facial pose transfer, especially on challenging cases with large head pose differences. Produces artifact-free images, as demonstrated through quantitative metrics (CSIM, LPIPS, FID, FVD, APD, AED) and qualitative comparisons. Exhibits robustness to extreme head pose variations, outperforming other methods on a specifically designed benchmark. Struggles to reconstruct detailed accessories, like hats or eyeglasses, potentially due to underrepresentation in the training dataset. Does not refine background details. face reenactment, hypernetworks, stylegan2, one-shot learning, facial image editing
2307.10776 Report Urban Radiance Field Representation with Deformable Neural Mesh Primitives Fan Lu, Yan Xu, Guang Chen, Hongsheng Li, Kwan-Yee Lin, Changjun Jiang Neural Radiance Fields (NeRFs) have achieved great success in the past few years. However, most current methods still require intensive resources due to ray marching-based rendering. To construct urban-level radiance fields efficiently, we design Deformable Neural Mesh Primitive~(DNMP), and propose to parameterize the entire scene with such primitives. The DNMP is a flexible and compact neural variant of classic mesh representation, which enjoys both the efficiency of rasterization-based rendering and the powerful neural representation capability for photo-realistic image synthesis. Specifically, a DNMP consists of a set of connected deformable mesh vertices with paired vertex features to parameterize the geometry and radiance information of a local area. To constrain the degree of freedom for optimization and lower the storage budgets, we enforce the shape of each primitive to be decoded from a relatively low-dimensional latent space. The rendering colors are decoded from the vertex features (interpolated with rasterization) by a view-dependent MLP. The DNMP provides a new paradigm for urban-level scene representation with appealing properties: $(1)$ High-quality rendering. Our method achieves leading performance for novel view synthesis in urban scenarios. $(2)$ Low computational costs. Our representation enables fast rendering (2.07ms/1k pixels) and low peak memory usage (110MB/1k pixels). We also present a lightweight version that can run 33$\times$ faster than vanilla NeRFs, and comparable to the highly-optimized Instant-NGP (0.61 vs 0.71ms/1k pixels). Project page: \href{https://dnmp.github.io/}{https://dnmp.github.io/}. This paper proposes Deformable Neural Mesh Primitive (DNMP), a novel neural scene representation for efficient and high-quality urban view synthesis, leveraging the efficiency of classic meshes and the representation power of neural features. Existing neural rendering methods, especially those for large-scale urban environments, suffer from high computational costs and memory footprints due to ray marching-based rendering. They also lack explicit surface constraints, leading to less robust novel view synthesis. The proposed method voxelizes the urban scene and assigns each voxel a DNMP, which parameterizes local geometry and radiance. DNMP shapes are decoded from a compact latent space for robust optimization, while radiance features are associated with mesh vertices. The method utilizes efficient rasterization for feature interpolation and rendering. The method achieves state-of-the-art novel view synthesis quality on KITTI-360 and Waymo datasets, outperforming baselines in terms of PSNR, SSIM, and LPIPS. It exhibits strong robustness against viewpoint changes, generating high-quality rendering results even with significant view differences from the training set. DNMP enables a 5x faster rendering speed and uses only 1/5 of the peak memory compared to Mip-NeRF 360, achieving a speed comparable to the highly optimized Instant-NGP. The current framework is based on the static-scene assumption and cannot handle moving objects. Future work includes extending the method to incorporate dynamic elements for more general application scenarios. neural rendering, urban scene representation, deformable mesh, novel view synthesis, efficient rendering
2307.10711 Report AdjointDPM: Adjoint Sensitivity Method for Gradient Backpropagation of Diffusion Probabilistic Models Jiachun Pan, Jun Hao Liew, Vincent Y. F. Tan, Jiashi Feng, Hanshu Yan Existing customization methods require access to multiple reference examples to align pre-trained diffusion probabilistic models (DPMs) with user-provided concepts. This paper aims to address the challenge of DPM customization when the only available supervision is a differentiable metric defined on the generated contents. Since the sampling procedure of DPMs involves recursive calls to the denoising UNet, na\"ive gradient backpropagation requires storing the intermediate states of all iterations, resulting in extremely high memory consumption. To overcome this issue, we propose a novel method AdjointDPM, which first generates new samples from diffusion models by solving the corresponding probability-flow ODEs. It then uses the adjoint sensitivity method to backpropagate the gradients of the loss to the models' parameters (including conditioning signals, network weights, and initial noises) by solving another augmented ODE. To reduce numerical errors in both the forward generation and gradient backpropagation processes, we further reparameterize the probability-flow ODE and augmented ODE as simple non-stiff ODEs using exponential integration. Finally, we demonstrate the effectiveness of AdjointDPM on three interesting tasks: converting visual effects into identification text embeddings, finetuning DPMs for specific types of stylization, and optimizing initial noise to generate adversarial samples for security auditing. Proposes AdjointDPM, a novel gradient backpropagation technique for diffusion probabilistic models (DPMs) based on the adjoint sensitivity method, enabling the optimization of DPM parameters like network weights, conditioning signals, and noisy states under differentiable loss functions. Addresses the significant memory consumption problem of naive backpropagation in DPMs, especially for tasks like guided generation and model customization, which require optimizing model parameters to achieve specific properties in generated content. Leverages the adjoint sensitivity method to compute gradients by solving a backward ODE, eliminating the need to store intermediate states of all iterations. Further reduces numerical errors by reparameterizing the probability-flow ODE and augmented ODE as simple non-stiff ODEs using exponential integration. AdjointDPM enables guided sampling, allowing the guidance of Stable Diffusion to synthesize images of specific animal breeds under the supervision of fine-grained vision classifiers. Reveals potential security issues in DPM-based generation systems by successfully finding initial noise states that lead to the generation of NSFW content capable of bypassing safety filters. Facilitates stylization via a single reference image, enabling AdjointDPM to fine-tune a Stable Diffusion model for style defined by the Gram matrix of the reference, generalizing stylization capabilities to different objects. The guidance of FGVC models does not fully resolve issues with inaccurate details in generated images. The effectiveness of style transfer depends on the selection of appropriate weights for style and content loss terms. diffusion probabilistic models, adjoint sensitivity method, guided generation, model customization, security auditing
2307.10584 Report Reference-based Painterly Inpainting via Diffusion: Crossing the Wild Reference Domain Gap Dejia Xu, Xingqian Xu, Wenyan Cong, Humphrey Shi, Zhangyang Wang Have you ever imagined how it would look if we placed new objects into paintings? For example, what would it look like if we placed a basketball into Claude Monet's ``Water Lilies, Evening Effect''? We propose Reference-based Painterly Inpainting, a novel task that crosses the wild reference domain gap and implants novel objects into artworks. Although previous works have examined reference-based inpainting, they are not designed for large domain discrepancies between the target and the reference, such as inpainting an artistic image using a photorealistic reference. This paper proposes a novel diffusion framework, dubbed RefPaint, to ``inpaint more wildly'' by taking such references with large domain gaps. Built with an image-conditioned diffusion model, we introduce a ladder-side branch and a masked fusion mechanism to work with the inpainting mask. By decomposing the CLIP image embeddings at inference time, one can manipulate the strength of semantic and style information with ease. Experiments demonstrate that our proposed RefPaint framework produces significantly better results than existing methods. Our method enables creative painterly image inpainting with reference objects that would otherwise be difficult to achieve. Project page: https://vita-group.github.io/RefPaint/ This paper introduces "Reference-based Painterly Inpainting", a novel task that implants new objects into artworks, even with significant domain gaps between the reference object and artistic background, and proposes a novel diffusion-based framework, called RefPaint, to address it. This task enables creative painterly image inpainting with reference objects, going beyond the limitations of existing reference-based inpainting methods that struggle with large domain discrepancies and text-based inpainting's ambiguity in specifying desired content. The RefPaint framework, built upon an image-conditioned diffusion model, introduces a ladder-side branch for masked image encoding and a masked fusion mechanism to incorporate inpainting masks. It uses PCA-decomposed CLIP image embeddings for disentangled semantic and style fusion via classifier-free guidance, allowing control over the trade-off between reference semantics and background style. RefPaint successfully injects new objects into artistic images while preserving the background style, even with challenging domain gaps. The disentangled semantic and style fusion allows fine-grained control over the inpainted content, balancing fidelity to the reference object and stylistic consistency with the artwork. Quantitative comparisons using CLIP image distance demonstrate that RefPaint outperforms existing methods in terms of integrating reference objects while maintaining background style. The model suffers from slow inference speed, which is a common limitation of diffusion models. Handling complex cases, such as multiple objects or objects with detailed textures, requires further exploration. image inpainting, diffusion models, reference-based inpainting, painterly style transfer, image harmonization
2307.10504 Report Identifying Interpretable Subspaces in Image Representations Neha Kalibhat, Shweta Bhardwaj, Bayan Bruss, Hamed Firooz, Maziar Sanjabi, Soheil Feizi We propose Automatic Feature Explanation using Contrasting Concepts (FALCON), an interpretability framework to explain features of image representations. For a target feature, FALCON captions its highly activating cropped images using a large captioning dataset (like LAION-400m) and a pre-trained vision-language model like CLIP. Each word among the captions is scored and ranked leading to a small number of shared, human-understandable concepts that closely describe the target feature. FALCON also applies contrastive interpretation using lowly activating (counterfactual) images, to eliminate spurious concepts. Although many existing approaches interpret features independently, we observe in state-of-the-art self-supervised and supervised models, that less than 20% of the representation space can be explained by individual features. We show that features in larger spaces become more interpretable when studied in groups and can be explained with high-order scoring concepts through FALCON. We discuss how extracted concepts can be used to explain and debug failures in downstream tasks. Finally, we present a technique to transfer concepts from one (explainable) representation space to another unseen representation space by learning a simple linear transformation. Code available at https://github.com/NehaKalibhat/falcon-explain. This paper proposes FALCON, an interpretability framework that automatically identifies human-understandable concepts encoded by features in image representations. Understanding what information is encoded in image representations, especially in self-supervised models, is crucial for their deployment and generalization. FALCON leverages a probe dataset, a large captioning dataset (LAION-400m), and a pre-trained vision-language model (CLIP). It captions highly activating image crops for a target feature and extracts shared concepts. It also uses contrastive interpretation with lowly activating images to filter out spurious concepts. FALCON successfully identifies meaningful concepts for individual features and, more surprisingly, for groups of features, which are shown to be more interpretable. Human evaluation via Amazon Mechanical Turk demonstrates the high relevance and explainability of FALCON's extracted concepts. The paper showcases the transferability of concepts across different representation spaces by learning a simple linear transformation. Extending FALCON to explain vision-language models and non-image domains remains for future work. The framework currently relies on a pre-trained vision-language model, which might be limiting for models trained on specialized tasks or data. interpretability, image representation, concept extraction, self-supervised learning, contrastive interpretation
2307.10373 Report TokenFlow: Consistent Diffusion Features for Consistent Video Editing Michal Geyer, Omer Bar-Tal, Shai Bagon, Tali Dekel The generative AI revolution has recently expanded to videos. Nevertheless, current state-of-the-art video models are still lagging behind image models in terms of visual quality and user control over the generated content. In this work, we present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing. Specifically, given a source video and a target text-prompt, our method generates a high-quality video that adheres to the target text, while preserving the spatial layout and motion of the input video. Our method is based on a key observation that consistency in the edited video can be obtained by enforcing consistency in the diffusion feature space. We achieve this by explicitly propagating diffusion features based on inter-frame correspondences, readily available in the model. Thus, our framework does not require any training or fine-tuning, and can work in conjunction with any off-the-shelf text-to-image editing method. We demonstrate state-of-the-art editing results on a variety of real-world videos. Webpage: https://diffusion-tokenflow.github.io/ Introduces TokenFlow, a technique that leverages the internal representations of videos in text-to-image diffusion models to enable consistent and high-quality video editing. Existing video generation models lag behind image models and struggle to achieve both high visual quality and temporal consistency in edited videos. TokenFlow addresses this gap by harnessing the power of readily available, state-of-the-art image diffusion models for video editing. TokenFlow extracts and analyzes diffusion features from a pre-trained image diffusion model (Stable Diffusion). It then enforces consistency by propagating edits based on inter-frame feature correspondences found in the original video. This approach ensures that edits adhere to the target text prompt while maintaining temporal coherence. TokenFlow generates high-quality edited videos that exhibit strong adherence to the target text prompts. Quantitative evaluations, including warping error and user studies, demonstrate that TokenFlow significantly outperforms existing and concurrent video editing methods in terms of temporal consistency. The method is efficient, reducing per-frame editing time by 20% compared to applying image editing techniques frame-by-frame. TokenFlow is currently limited to edits that preserve the original structure of the video, as it relies on the original video's motion and feature correspondences. The method's success is partially dependent on the accuracy of the underlying image editing technique used in conjunction with TokenFlow. Future work may explore combining TokenFlow with improved decoders to further enhance video quality and minimize flickering. video editing, diffusion models, temporal consistency, text-driven editing, stable diffusion
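The feature-propagation step can be pictured as a nearest-neighbour lookup: correspondences are computed between token features of the source video, and those indices are then used to pull tokens from the edited keyframe. A heavily simplified single-keyframe sketch (cosine nearest neighbours over flattened token grids; the actual method blends between two neighbouring keyframes and operates inside specific attention layers of the diffusion U-Net):

    import torch
    import torch.nn.functional as F

    def propagate_tokens(src_frame_feats, src_key_feats, edited_key_feats):
        # src_frame_feats:  (N, D) original-video tokens of the frame being edited
        # src_key_feats:    (M, D) original-video tokens of a keyframe
        # edited_key_feats: (M, D) tokens of the *edited* keyframe
        # Correspondences are found in the original feature space and then used to
        # sample edited tokens, which preserves the source video's motion and layout.
        sim = F.normalize(src_frame_feats, dim=-1) @ F.normalize(src_key_feats, dim=-1).T
        nn_idx = sim.argmax(dim=-1)          # (N,) index of the closest keyframe token
        return edited_key_feats[nn_idx]      # (N, D) propagated edited features

    if __name__ == "__main__":
        N, M, D = 256, 256, 64
        frame, key = torch.randn(N, D), torch.randn(M, D)
        edited_key = key + 0.1 * torch.randn(M, D)
        print(propagate_tokens(frame, key, edited_key).shape)  # torch.Size([256, 64])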
2307.10159 Report FABRIC: Personalizing Diffusion Models with Iterative Feedback Dimitri von Rütte, Elisabetta Fedele, Jonathan Thomm, Lukas Wolf In an era where visual content generation is increasingly driven by machine learning, the integration of human feedback into generative models presents significant opportunities for enhancing user experience and output quality. This study explores strategies for incorporating iterative human feedback into the generative process of diffusion-based text-to-image models. We propose FABRIC, a training-free approach applicable to a wide range of popular diffusion models, which exploits the self-attention layer present in the most widely used architectures to condition the diffusion process on a set of feedback images. To ensure a rigorous assessment of our approach, we introduce a comprehensive evaluation methodology, offering a robust mechanism to quantify the performance of generative visual models that integrate human feedback. We show that generation results improve over multiple rounds of iterative feedback through exhaustive analysis, implicitly optimizing arbitrary user preferences. The potential applications of these findings extend to fields such as personalized content creation and customization. Presents FABRIC, a training-free method incorporating iterative user feedback (liked and disliked images) into text-to-image diffusion models for improved image generation aligned with user preferences. Addresses the limitations of current text-to-image models, which often require iterative prompt engineering and struggle to capture nuanced user preferences. Leverages attention-based reference image conditioning by injecting information from feedback images into the self-attention layer of a diffusion model's U-Net during the denoising process. FABRIC effectively guides image generation toward user preferences, evidenced by improved scores from a human preference prediction model. It successfully steers generation towards a target image when feedback is provided based on similarity to that target. The method demonstrates orthogonality with other Stable Diffusion enhancements, enabling improvements on top of existing techniques like LoRA and fine-tuned checkpoints. FABRIC may struggle to expand the generative distribution beyond the initial text-conditioned output, potentially limiting exploration. The current feedback mechanism relies on binary preferences (like/dislike), which could be expanded for more nuanced guidance. text-to-image generation, diffusion models, human feedback, iterative refinement, attention mechanisms
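Conceptually, the feedback conditioning can be viewed as extending self-attention with extra keys and values drawn from the feedback images, with reference tokens up-weighted for liked images and down-weighted for disliked ones. The toy single-head attention below illustrates that reweighting; the weight value and the exact injection point in the U-Net are assumptions rather than the paper's settings.

    import math
    import torch
    import torch.nn.functional as F

    def attention_with_feedback(q, k, v, k_ref, v_ref, ref_weight=1.5):
        # q, k, v:      (N, D) queries/keys/values of the image being generated
        # k_ref, v_ref: (M, D) keys/values extracted from feedback images
        # ref_weight > 1 emphasizes liked references; < 1 suppresses disliked ones.
        keys = torch.cat([k, k_ref], dim=0)
        values = torch.cat([v, v_ref], dim=0)
        bias = torch.zeros(keys.shape[0])
        bias[k.shape[0]:] = math.log(ref_weight)     # scales reference attention weights
        scores = q @ keys.T / math.sqrt(q.shape[-1]) + bias
        return F.softmax(scores, dim=-1) @ values    # (N, D)

    if __name__ == "__main__":
        N, M, D = 8, 4, 32
        q, k, v = torch.randn(N, D), torch.randn(N, D), torch.randn(N, D)
        out = attention_with_feedback(q, k, v, torch.randn(M, D), torch.randn(M, D))
        print(out.shape)  # torch.Size([8, 32])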
2307.09947 Report U-CE: Uncertainty-aware Cross-Entropy for Semantic Segmentation Steven Landgraf, Markus Hillemann, Kira Wursthorn, Markus Ulrich Deep neural networks have shown exceptional performance in various tasks, but their lack of robustness, reliability, and tendency to be overconfident pose challenges for their deployment in safety-critical applications like autonomous driving. In this regard, quantifying the uncertainty inherent to a model's prediction is a promising endeavour to address these shortcomings. In this work, we present a novel Uncertainty-aware Cross-Entropy loss (U-CE) that incorporates dynamic predictive uncertainties into the training process by pixel-wise weighting of the well-known cross-entropy loss (CE). Through extensive experimentation, we demonstrate the superiority of U-CE over regular CE training on two benchmark datasets, Cityscapes and ACDC, using two common backbone architectures, ResNet-18 and ResNet-101. With U-CE, we manage to train models that not only improve their segmentation performance but also provide meaningful uncertainties after training. Consequently, we contribute to the development of more robust and reliable segmentation models, ultimately advancing the state-of-the-art in safety-critical applications and beyond. This paper proposes U-CE, a novel uncertainty-aware cross-entropy loss function for semantic segmentation that incorporates predictive uncertainties into the training process. Quantifying predictive uncertainty is crucial for deploying deep learning models in safety-critical applications, as it provides insights into model reliability. U-CE integrates Monte Carlo Dropout during training to compute pixel-wise uncertainties, which are then used to weight the standard cross-entropy loss. U-CE consistently outperforms regular cross-entropy training in terms of mIoU across different dropout ratios, backbones, and datasets. Models trained with U-CE demonstrate the ability to predict meaningful uncertainties, aligning with segmentation performance. U-CE shows robustness to the choice of hyperparameters such as \alpha and the base learning rate. U-CE's effectiveness might be limited when densely annotated ground truth labels are unavailable. Further investigation is needed to understand the impact of U-CE on generalization performance. semantic segmentation, uncertainty quantification, monte carlo dropout, deep learning, computer vision
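A minimal sketch of an uncertainty-weighted cross-entropy of this kind, assuming predictive entropy from a few Monte Carlo Dropout passes and a simple 1 + alpha*u pixel weighting; the exact weighting and normalization used by U-CE may differ.

    import torch
    import torch.nn.functional as F

    def mc_dropout_entropy(model, x, n_samples=8):
        # Predictive entropy per pixel from stochastic forward passes (dropout kept on).
        model.train()                                    # keeps dropout layers active
        probs = torch.stack([F.softmax(model(x), dim=1) for _ in range(n_samples)]).mean(0)
        return -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)     # (B, H, W)

    def uncertainty_weighted_ce(logits, target, pixel_uncertainty, alpha=1.0):
        # Standard per-pixel CE, reweighted so uncertain pixels contribute more.
        ce = F.cross_entropy(logits, target, reduction="none")       # (B, H, W)
        u = pixel_uncertainty / (pixel_uncertainty.max() + 1e-8)     # normalize to [0, 1]
        return ((1.0 + alpha * u) * ce).mean()

    if __name__ == "__main__":
        model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3, padding=1),
                                    torch.nn.Dropout2d(0.5),
                                    torch.nn.Conv2d(16, 5, 1))       # toy 5-class segmenter
        x = torch.randn(2, 3, 32, 32)
        y = torch.randint(0, 5, (2, 32, 32))
        with torch.no_grad():
            u = mc_dropout_entropy(model, x)
        loss = uncertainty_weighted_ce(model(x), y, u, alpha=1.0)
        loss.backward()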
2307.09906 Report Implicit Identity Representation Conditioned Memory Compensation Network for Talking Head video Generation Fa-Ting Hong, Dan Xu Talking head video generation aims to animate a human face in a still image with dynamic poses and expressions using motion information derived from a target-driving video, while maintaining the person's identity in the source image. However, dramatic and complex motions in the driving video cause ambiguous generation, because the still source image cannot provide sufficient appearance information for occluded regions or delicate expression variations, which produces severe artifacts and significantly degrades the generation quality. To tackle this problem, we propose to learn a global facial representation space, and design a novel implicit identity representation conditioned memory compensation network, coined as MCNet, for high-fidelity talking head generation.~Specifically, we devise a network module to learn a unified spatial facial meta-memory bank from all training samples, which can provide rich facial structure and appearance priors to compensate warped source facial features for the generation. Furthermore, we propose an effective query mechanism based on implicit identity representations learned from the discrete keypoints of the source image. It can greatly facilitate the retrieval of more correlated information from the memory bank for the compensation. Extensive experiments demonstrate that MCNet can learn representative and complementary facial memory, and can clearly outperform previous state-of-the-art talking head generation methods on VoxCeleb1 and CelebV datasets. Please check our \href{https://github.com/harlanhong/ICCV2023-MCNET}{Project}. This paper proposes MCNet, an implicit identity representation conditioned memory compensation network for high-fidelity talking head video generation, addressing the ambiguity issue in existing methods when handling dramatic head motions. Existing talking head generation methods, while achieving progress in motion estimation, struggle to generate high-quality videos with large head motions due to limited appearance information in a single source image, leading to artifacts and quality degradation. The proposed MCNet learns a global facial meta-memory bank from all training samples to provide rich facial priors. It leverages an implicit identity representation learned from source image keypoints and warped features to query the meta-memory bank, obtaining identity-dependent memory for compensating ambiguous facial details in the warped source feature map. MCNet outperforms state-of-the-art methods on VoxCeleb1 and CelebV datasets for both same-identity and cross-identity reenactment. The learned global facial meta-memory effectively compensates for ambiguous regions in generated faces, especially under large head motions or occlusions. The proposed method demonstrates strong generalizability, improving the performance when incorporated into other talking head generation frameworks. The model's performance in handling unseen identities could be further improved. The computational cost associated with querying the large meta-memory bank is a limitation. talking head generation, memory compensation network, implicit identity representation, global facial meta-memory, deep learning
2307.09882 Report Adversarial Likelihood Estimation With One-Way Flows Omri Ben-Dov, Pravir Singh Gupta, Victoria Abrevaya, Michael J. Black, Partha Ghosh Generative Adversarial Networks (GANs) can produce high-quality samples, but do not provide an estimate of the probability density around the samples. However, it has been noted that maximizing the log-likelihood within an energy-based setting can lead to an adversarial framework where the discriminator provides unnormalized density (often called energy). We further develop this perspective, incorporate importance sampling, and show that 1) Wasserstein GAN performs a biased estimate of the partition function, and we propose instead to use an unbiased estimator; and 2) when optimizing for likelihood, one must maximize generator entropy. This is hypothesized to provide a better mode coverage. Different from previous works, we explicitly compute the density of the generated samples. This is the key enabler to designing an unbiased estimator of the partition function and computation of the generator entropy term. The generator density is obtained via a new type of flow network, called one-way flow network, that is less constrained in terms of architecture, as it does not require a tractable inverse function. Our experimental results show that our method converges faster, produces comparable sample quality to GANs with similar architecture, successfully avoids over-fitting to commonly used datasets and produces smooth low-dimensional latent representations of the training data. This paper proposes a new framework for adversarial generative modeling that combines the advantages of GANs (high-quality samples) with density estimation capabilities. Explicit density estimation in GANs allows for quantitative model comparison, likelihood-based training, and potentially mitigates issues like mode collapse. The authors leverage the connection between EBMs and GANs, introducing an unbiased estimator of the partition function by explicitly computing the generator density. They achieve this using a novel 'one-way flow' network for the generator. The model captures more modes and generates higher-quality samples than previous GANs on 2D datasets. On real datasets, it demonstrates faster convergence and comparable sample quality to GANs while exhibiting good generalization. The proposed method allows for practical computation of the partition function with a reasonable number of samples. The current implementation relies on an approximate Jacobian determinant computation, which introduces noise. Exploring architectures with closed-form Jacobian determinants is left for future work. Further investigation is needed to fully leverage the potential of using multiple samples for approximating the normalizing factor. generative adversarial networks, density estimation, energy-based models, normalizing flows, one-way flows
2307.09829 Report What do neural networks learn in image classification? A frequency shortcut perspective Shunxin Wang, Raymond Veldhuis, Christoph Brune, Nicola Strisciuglio Frequency analysis is useful for understanding the mechanisms of representation learning in neural networks (NNs). Most research in this area focuses on the learning dynamics of NNs for regression tasks, while little for classification. This study empirically investigates the latter and expands the understanding of frequency shortcuts. First, we perform experiments on synthetic datasets, designed to have a bias in different frequency bands. Our results demonstrate that NNs tend to find simple solutions for classification, and what they learn first during training depends on the most distinctive frequency characteristics, which can be either low- or high-frequencies. Second, we confirm this phenomenon on natural images. We propose a metric to measure class-wise frequency characteristics and a method to identify frequency shortcuts. The results show that frequency shortcuts can be texture-based or shape-based, depending on what best simplifies the objective. Third, we validate the transferability of frequency shortcuts on out-of-distribution (OOD) test sets. Our results suggest that frequency shortcuts can be transferred across datasets and cannot be fully avoided by larger model capacity and data augmentation. We recommend that future research should focus on effective training schemes mitigating frequency shortcut learning. This paper investigates what neural networks learn during image classification, focusing on their tendency to exploit frequency shortcuts – specific frequency sets leading to accurate but potentially oversimplified classification. Understanding how data frequency characteristics and simplicity bias in neural networks can lead to frequency shortcut learning is crucial for addressing the limitations of current models and improving their generalization abilities, especially in out-of-distribution scenarios. The authors conduct experiments on synthetic datasets with controlled frequency biases and natural images (ImageNet-10, ImageNet-SCT). They propose a metric (ADCS) to compare class-wise frequency distributions and a frequency culling method to identify frequency shortcuts. They analyze the effects of model capacity and data augmentation on shortcut learning. Neural networks for classification tasks can prioritize learning distinctive frequency characteristics over semantic features, leading to frequency shortcut learning, where specific frequency subsets are used for classification. Frequency shortcuts can be texture-based or shape-based, depending on the dataset characteristics and can hinder the learning of more meaningful semantic information. Frequency shortcuts can be transferred across datasets and models and cannot be entirely avoided by increasing model capacity or applying common data augmentation techniques. The ADCS metric, while insightful, cannot solely predict shortcut learning; further investigation of the relationship between frequency characteristics and learning dynamics is needed. Future work should focus on developing data augmentation strategies that explicitly target and mitigate frequency shortcut learning to improve the generalization capabilities of neural networks. frequency analysis, shortcut learning, image classification, generalization, data augmentation
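A simplified version of the kind of frequency probing used to expose such shortcuts: keep only a band of DFT frequencies of an image and check whether the classifier's prediction survives. The radius thresholds and per-channel filtering below are illustrative choices, not the paper's exact frequency-culling procedure.

    import numpy as np

    def bandpass_filter(image, r_low, r_high):
        # image: (H, W, C) array; keep DFT coefficients whose radius lies in [r_low, r_high).
        h, w = image.shape[:2]
        yy, xx = np.mgrid[0:h, 0:w]
        radius = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
        mask = (radius >= r_low) & (radius < r_high)
        out = np.empty_like(image, dtype=np.float64)
        for c in range(image.shape[2]):
            spectrum = np.fft.fftshift(np.fft.fft2(image[..., c]))
            out[..., c] = np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * mask)))
        return out

    if __name__ == "__main__":
        img = np.random.rand(64, 64, 3)
        low_only = bandpass_filter(img, 0, 8)      # low-frequency content only
        high_only = bandpass_filter(img, 8, 64)    # high-frequency content only
        # Feeding low_only / high_only to a trained classifier and comparing predictions
        # reveals which frequency bands it actually relies on.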
2307.09781 Report Text2Layer: Layered Image Generation using Latent Diffusion Model Xinyang Zhang, Wentian Zhao, Xin Lu, Jeff Chien Layer compositing is one of the most popular image editing workflows among both amateurs and professionals. Motivated by the success of diffusion models, we explore layer compositing from a layered image generation perspective. Instead of generating an image, we propose to generate background, foreground, layer mask, and the composed image simultaneously. To achieve layered image generation, we train an autoencoder that is able to reconstruct layered images and train diffusion models on the latent representation. One benefit of the proposed problem is to enable better compositing workflows in addition to the high-quality image output. Another benefit is producing higher-quality layer masks compared to masks produced by a separate step of image segmentation. Experimental results show that the proposed method is able to generate high-quality layered images and initiates a benchmark for future work. This paper proposes Text2Layer, a novel method for generating layered images from text prompts, composed of foreground, background, a layer mask, and a composited image. Layered image generation facilitates more controllable and intuitive image editing workflows compared to traditional text-to-image generation or text-guided editing methods. The authors create a 57.02M layered-image dataset ("LL2I") and train a Composition-Aware Two-Layer Autoencoder (CaT2I-AE). They then train a diffusion model on the latent representations learned by CaT2I-AE, enabling text-driven layered image generation. Text2Layer generates higher quality layered images compared to baselines using Stable Diffusion components. The generated layer masks demonstrate superior accuracy in capturing foreground objects. The generated images exhibit strong text-image relevance, indicating effective adherence to text prompts. The current LL2I dataset, while large, is still smaller than datasets used to train state-of-the-art text-to-image models, potentially limiting generation quality and diversity. Future work could explore conditional layer generation, enabling the generation of an arbitrary number of layers and more complex image compositions. layered image generation, text-to-image synthesis, diffusion models, image editing, computer vision
2307.09582 Report Guided Linear Upsampling Shuangbing Song, Fan Zhong, Tianju Wang, Xueying Qin, Changhe Tu Guided upsampling is an effective approach for accelerating high-resolution image processing. In this paper, we propose a simple yet effective guided upsampling method. Each pixel in the high-resolution image is represented as a linear interpolation of two low-resolution pixels, whose indices and weights are optimized to minimize the upsampling error. The downsampling can be jointly optimized in order to prevent missing small isolated regions. Our method can be derived from the color line model and local color transformations. Compared to previous methods, our method can better preserve detail effects while suppressing artifacts such as bleeding and blurring. It is efficient, easy to implement, and free of sensitive parameters. We evaluate the proposed method with a wide range of image operators, and show its advantages through quantitative and qualitative analysis. We demonstrate the advantages of our method for both interactive image editing and real-time high-resolution video processing. In particular, for interactive editing, the joint optimization can be precomputed, thus allowing for instant feedback without hardware acceleration. This paper introduces Guided Linear Upsampling (GLU), a novel guided upsampling technique for accelerating high-resolution image processing. Efficiently processing high-resolution images is crucial due to the increasing demand for high-quality visuals and the computational constraints of devices. GLU offers a simple yet powerful solution to address this challenge. GLU represents each high-resolution pixel as a linear interpolation of two optimized low-resolution pixels. It jointly optimizes downsampling and upsampling to minimize error and preserve details. This approach is inspired by the color line model and local color transformations but without explicit smoothness constraints. GLU outperforms previous methods (JBU, BGU) in quantitative and qualitative evaluations across various image processing tasks, especially for large upsampling ratios. The target-free optimization in GLU allows for pre-computation, enabling interactive editing with instant feedback and real-time video processing. Downsample optimization in GLU effectively preserves thin structures and small regions often lost in regular downsampling. GLU might exhibit limitations when handling new edges or drastic changes in local image structures, which are not present in the source image. Adapting existing image processing operators for optimal performance at low resolutions is crucial for maximizing GLU's effectiveness. guided upsampling, optimized downsampling, image processing, interactive image editing, real-time video processing
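Once the per-pixel source indices and blending weights have been optimized against the guide image, applying them to any processed low-resolution result is a single gather-and-lerp per high-resolution pixel, which is what makes the upsampling stage so cheap. A minimal single-channel sketch of that application step (the joint index/weight optimization itself, which is the paper's contribution, is not shown):

    import numpy as np

    def apply_glu(lowres, idx_a, idx_b, weights):
        # lowres : (h*w,) processed low-resolution image, flattened
        # idx_a, idx_b : (H, W) int arrays of the two chosen low-res pixel indices per HR pixel
        # weights: (H, W) blending weights in [0, 1]
        return weights * lowres[idx_a] + (1.0 - weights) * lowres[idx_b]

    if __name__ == "__main__":
        h, w, H, W = 8, 8, 32, 32
        lowres = np.random.rand(h * w)
        idx_a = np.random.randint(0, h * w, size=(H, W))
        idx_b = np.random.randint(0, h * w, size=(H, W))
        weights = np.random.rand(H, W)
        print(apply_glu(lowres, idx_a, idx_b, weights).shape)  # (32, 32)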
2307.09481 Report AnyDoor: Zero-shot Object-level Image Customization Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, Hengshuang Zhao This work presents AnyDoor, a diffusion-based image generator with the power to teleport target objects to new scenes at user-specified locations in a harmonious way. Instead of tuning parameters for each object, our model is trained only once and effortlessly generalizes to diverse object-scene combinations at the inference stage. Such a challenging zero-shot setting requires an adequate characterization of a certain object. To this end, we complement the commonly used identity feature with detail features, which are carefully designed to maintain texture details yet allow versatile local variations (e.g., lighting, orientation, posture, etc.), supporting the object in favorably blending with different surroundings. We further propose to borrow knowledge from video datasets, where we can observe various forms (i.e., along the time axis) of a single object, leading to stronger model generalizability and robustness. Extensive experiments demonstrate the superiority of our approach over existing alternatives as well as its great potential in real-world applications, such as virtual try-on and object moving. Project page is https://damo-vilab.github.io/AnyDoor-Page/. This paper proposes AnyDoor, a diffusion-based model that teleports objects from a source image to a target scene at user-specified locations with desired shapes in a zero-shot manner. Object teleportation is crucial for various applications like image composition, virtual try-on, and shape editing, but previous methods struggle to generate identity-consistent content, especially for untrained categories. AnyDoor uses an ID extractor (DINOv2) to capture object identity and a detail extractor (ControlNet-style UNet) to learn appearance details from a collage of high-frequency object maps and the scene. These features guide a pre-trained text-to-image diffusion model for generation. The model is trained on a dataset incorporating video and image pairs to capture object variations and diverse scenarios. AnyDoor outperforms existing reference-based methods in preserving object identity while generating high-quality compositions. It achieves superior multi-subject composition compared to finetuning-based methods without requiring parameter tuning. AnyDoor demonstrates strong potential for various applications like virtual try-on, object moving and swapping, and shape editing. AnyDoor might struggle with generating fine details like small characters or logos. Future work could focus on incorporating additional controls and exploring higher-resolution generation. image generation, diffusion models, object teleportation, zero-shot learning, image editing
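The "collage of high-frequency object maps" fed to the detail extractor can be approximated by a high-pass filter over the object crop pasted into the scene at the target location. The Sobel-magnitude filter and the naive pasting below are illustrative stand-ins for the paper's collage construction, not its exact recipe.

    import numpy as np
    from scipy import ndimage

    def high_freq_map(gray):
        # Sobel-magnitude high-pass response of a (H, W) grayscale object crop.
        gx = ndimage.sobel(gray, axis=1)
        gy = ndimage.sobel(gray, axis=0)
        return np.hypot(gx, gy)

    def detail_collage(scene_gray, object_gray, top, left):
        # Paste the object's high-frequency map into the scene at the user-chosen box.
        out = scene_gray.astype(np.float64).copy()
        hf = high_freq_map(object_gray)
        h, w = hf.shape
        out[top:top + h, left:left + w] = hf
        return out

    if __name__ == "__main__":
        scene = np.random.rand(64, 64)
        obj = np.random.rand(16, 16)
        print(detail_collage(scene, obj, top=10, left=20).shape)  # (64, 64)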
2307.09361 Report MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments Spyros Gidaris, Andrei Bursuc, Oriane Simeoni, Antonin Vobecky, Nikos Komodakis, Matthieu Cord, Patrick Pérez Self-supervised learning can be used for mitigating the greedy needs of Vision Transformer networks for very large fully-annotated datasets. Different classes of self-supervised learning offer representations with either good contextual reasoning properties, e.g., using masked image modeling strategies, or invariance to image perturbations, e.g., with contrastive methods. In this work, we propose a single-stage and standalone method, MOCA, which unifies both desired properties using novel mask-and-predict objectives defined with high-level features (instead of pixel-level details). Moreover, we show how to effectively employ both learning paradigms in a synergistic and computation-efficient way. Doing so, we achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols with a training that is at least 3 times faster than prior methods. This paper presents MOCA, a self-supervised representation learning method for Vision Transformers that leverages a novel masking strategy for predicting high-level online codebook assignments, thereby unifying the strengths of discriminative and hide-and-predict approaches. Vision Transformers typically require extensive annotated training data. MOCA addresses this challenge by effectively learning from unlabeled data, enabling robust representations with enhanced contextual reasoning and perturbation invariance. MOCA employs a teacher-student scheme where the teacher network, a momentum-updated version of the student, generates target codebook assignments from unmasked image views. The student network is trained to predict these assignments from masked views using two key objectives: masked same-view token assignment prediction (promoting contextual reasoning) and masked cross-view average assignment prediction (enhancing perturbation invariance). MOCA achieves state-of-the-art results in low-shot ImageNet classification, outperforming existing methods by a significant margin. It demonstrates strong performance in linear probing and fine-tuning evaluations for image classification and semantic segmentation tasks. MOCA exhibits superior computational efficiency, requiring significantly less training time compared to competing methods, while maintaining competitive performance. The paper explores the impact of decoder depth on performance but primarily focuses on ViT-B/16 architecture; investigating other architectures could be beneficial. While MOCA excels in low-shot learning, exploring its performance on a wider range of downstream tasks and datasets would provide a more comprehensive evaluation of its capabilities. self-supervised learning, vision transformers, representation learning, masked image modeling, low-shot learning
2307.09283 Report RepViT: Revisiting Mobile CNN From ViT Perspective Ao Wang, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding Recently, lightweight Vision Transformers (ViTs) demonstrate superior performance and lower latency, compared with lightweight Convolutional Neural Networks (CNNs), on resource-constrained mobile devices. Researchers have discovered many structural connections between lightweight ViTs and lightweight CNNs. However, the notable architectural disparities in the block structure, macro, and micro designs between them have not been adequately examined. In this study, we revisit the efficient design of lightweight CNNs from ViT perspective and emphasize their promising prospect for mobile devices. Specifically, we incrementally enhance the mobile-friendliness of a standard lightweight CNN, \ie, MobileNetV3, by integrating the efficient architectural designs of lightweight ViTs. This ends up with a new family of pure lightweight CNNs, namely RepViT. Extensive experiments show that RepViT outperforms existing state-of-the-art lightweight ViTs and exhibits favorable latency in various vision tasks. Notably, on ImageNet, RepViT achieves over 80\% top-1 accuracy with 1.0 ms latency on an iPhone 12, which is the first time for a lightweight model, to the best of our knowledge. Besides, when RepViT meets SAM, our RepViT-SAM can achieve nearly 10$\times$ faster inference than the advanced MobileSAM. Codes and models are available at \url{https://github.com/THU-MIG/RepViT}. This paper introduces RepViT, a family of pure lightweight Convolutional Neural Networks (CNNs) designed for mobile devices, achieving state-of-the-art performance by incorporating efficient architectural designs from lightweight Vision Transformers (ViTs). Lightweight ViTs, while demonstrating superior performance, face practical challenges due to inadequate hardware and computational library support. Lightweight CNNs, leveraging highly optimized convolution operations, prove advantageous for deployment on edge devices. The authors progressively enhance MobileNetV3-L by integrating efficient designs from lightweight ViTs, focusing on block structure, macro architecture (stem, downsampling layers, classifier, stage ratio), and micro design (kernel size, SE layer placement). RepViT consistently surpasses existing state-of-the-art lightweight ViTs and CNNs across diverse model sizes on ImageNet-1K, object detection, instance segmentation, and semantic segmentation benchmarks. RepViT-M1.0 achieves over 80% top-1 accuracy on ImageNet with 1.0 ms latency on an iPhone 12, marking a first for lightweight models. RepViT-SAM, integrating RepViT as the image encoder in the Segment Anything Model, exhibits exceptional efficiency on mobile devices while maintaining remarkable zero-shot transfer performance for downstream tasks. The study primarily focuses on iPhone 12 for latency measurement, potentially limiting generalizability to other mobile platforms. Future exploration could involve investigating the effectiveness of RepViT's design principles on alternative lightweight CNN architectures beyond MobileNetV3. lightweight cnn, vision transformer, mobile devices, efficient architecture design, computer vision
2307.09165 Report Towards Trustworthy Dataset Distillation Shijie Ma, Fei Zhu, Zhen Cheng, Xu-Yao Zhang Efficiency and trustworthiness are two eternal pursuits when applying deep learning in real-world applications. With regard to efficiency, dataset distillation (DD) endeavors to reduce training costs by distilling the large dataset into a tiny synthetic dataset. However, existing methods merely concentrate on in-distribution (InD) classification in a closed-world setting, disregarding out-of-distribution (OOD) samples. On the other hand, OOD detection aims to enhance models' trustworthiness, which is always inefficiently achieved in full-data settings. For the first time, we simultaneously consider both issues and propose a novel paradigm called Trustworthy Dataset Distillation (TrustDD). By distilling both InD samples and outliers, the condensed datasets are capable to train models competent in both InD classification and OOD detection. To alleviate the requirement of real outlier data and make OOD detection more practical, we further propose to corrupt InD samples to generate pseudo-outliers and introduce Pseudo-Outlier Exposure (POE). Comprehensive experiments on various settings demonstrate the effectiveness of TrustDD, and the proposed POE surpasses state-of-the-art method Outlier Exposure (OE). Compared with the preceding DD, TrustDD is more trustworthy and applicable to real open-world scenarios. Our code will be publicly available. The paper proposes Trustworthy Dataset Distillation (TrustDD), a novel paradigm that enhances dataset distillation by incorporating outlier exposure for improved out-of-distribution (OOD) detection. Existing dataset distillation methods focus only on in-distribution classification, neglecting the critical aspect of OOD detection crucial for real-world deployment where unknown data is expected. TrustDD extends the traditional dataset distillation framework by distilling both in-distribution samples and outliers, encouraging models to learn robust representations for both tasks. The authors further introduce Pseudo-Outlier Exposure (POE), a method for generating synthetic outliers from in-distribution data using corruption transformations. TrustDD significantly improves OOD detection performance without sacrificing in-distribution classification accuracy. POE achieves comparable or even superior performance to Outlier Exposure (OE), which relies on curated outlier datasets. TrustDD generalizes well across various network architectures and OOD detection scores. The current corruption transformations in POE are designed for natural images and might require adaptation for other data types. Further investigation on the optimal ratio of distilled in-distribution samples and outliers for balancing efficiency and trustworthiness is needed. dataset distillation, out-of-distribution detection, trustworthy deep learning, pseudo-outlier exposure, open-world learning
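The pseudo-outlier idea can be sketched as follows: corrupt in-distribution images (here by shuffling patches) to destroy their semantics, then train with the usual cross-entropy on clean samples plus an outlier-exposure term that pushes corrupted samples toward a uniform prediction. The patch-shuffle corruption and the loss weighting are illustrative choices; the paper studies several corruption families for POE.

    import torch
    import torch.nn.functional as F

    def patch_shuffle(images, patch=8):
        # images: (B, C, H, W); shuffle non-overlapping patches to destroy semantics
        # while keeping low-level statistics -- one way to build pseudo-outliers.
        b, c, h, w = images.shape
        patches = images.unfold(2, patch, patch).unfold(3, patch, patch)
        patches = patches.contiguous().view(b, c, -1, patch, patch)
        patches = patches[:, :, torch.randperm(patches.shape[2])]
        patches = patches.view(b, c, h // patch, w // patch, patch, patch)
        return patches.permute(0, 1, 2, 4, 3, 5).contiguous().view(b, c, h, w)

    def trustdd_style_loss(model, x_ind, y_ind, lam=0.5):
        ce = F.cross_entropy(model(x_ind), y_ind)            # in-distribution classification
        logits_out = model(patch_shuffle(x_ind))             # pseudo-outliers from InD data
        oe = -F.log_softmax(logits_out, dim=1).mean()        # cross-entropy to uniform
        return ce + lam * oe

    if __name__ == "__main__":
        net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
        x = torch.randn(4, 3, 32, 32); y = torch.randint(0, 10, (4,))
        trustdd_style_loss(net, x, y).backward()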
2307.08996 Report Towards Authentic Face Restoration with Iterative Diffusion Models and Beyond Yang Zhao, Tingbo Hou, Yu-Chuan Su, Xuhui Jia, Yandong Li, Matthias Grundmann An authentic face restoration system is increasingly in demand in many computer vision applications, e.g., image enhancement, video communication, and taking portraits. Most of the advanced face restoration models can recover high-quality faces from low-quality ones but usually fail to faithfully generate realistic and high-frequency details that are favored by users. To achieve authentic restoration, we propose IDM, an Iteratively learned face restoration system based on denoising Diffusion Models (DDMs). We define the criterion of an authentic face restoration system, and argue that denoising diffusion models are naturally endowed with this property from two aspects: intrinsic iterative refinement and extrinsic iterative enhancement. Intrinsic learning can preserve the content well and gradually refine the high-quality details, while extrinsic enhancement helps clean the data and improve the restoration task one step further. We demonstrate superior performance on blind face restoration tasks. Beyond restoration, we find the authentically cleaned data by the proposed restoration system is also helpful to image generation tasks in terms of training stabilization and sample quality. Without modifying the models, we achieve better quality than state-of-the-art on FFHQ and ImageNet generation using either GANs or diffusion models. This paper proposes IDM, an Iteratively learned face restoration system using Denoising Diffusion Models (DDMs) for authentic face restoration. Existing face restoration models often fail to generate realistic high-frequency details and struggle to preserve delicate identity features. This paper addresses these challenges by introducing a novel approach using DDMs. The proposed IDM leverages intrinsic iterative refinement within DDMs and extrinsic iterative enhancement of training data. Intrinsic learning gradually refines details while preserving content through the DDM's iterative denoising process. Extrinsic learning utilizes the trained DDM to enhance the training data itself, leading to improved restoration quality in the next iteration. IDM achieves superior quantitative results on blind face restoration benchmarks, outperforming state-of-the-art methods like GFPGAN and CodeFormer in terms of PSNR, SSIM, LPIPS, and Arcface identity score. Qualitative results demonstrate IDM's ability to generate more realistic and faithful face restorations, preserving high-frequency details and delicate identity features better than baselines. Beyond restoration, the enhanced training data from IDM benefits image generation tasks, improving FID, precision, and recall scores for both GANs (StyleGAN2, BigGAN) and DDMs on FFHQ and ImageNet datasets. The efficiency of IDM could be a limitation, as it requires multiple diffusion steps during inference, making it slower than single-forward pass methods. Further exploration of loss functions and optimizer settings for training DDMs could potentially address the observed color faithfulness issues with L2 loss. face restoration, denoising diffusion models, authentic restoration, image generation, iterative learning
2307.08727 Report Learning to Count without Annotations Lukas Knobel, Tengda Han, Yuki M. Asano While recent supervised methods for reference-based object counting continue to improve the performance on benchmark datasets, they have to rely on small datasets due to the cost associated with manually annotating dozens of objects in images. We propose UnCounTR, a model that can learn this task without requiring any manual annotations. To this end, we construct "Self-Collages", images with various pasted objects as training samples, that provide a rich learning signal covering arbitrary object types and counts. Our method builds on existing unsupervised representations and segmentation techniques to successfully demonstrate for the first time the ability of reference-based counting without manual supervision. Our experiments show that our method not only outperforms simple baselines and generic models such as FasterRCNN and DETR, but also matches the performance of supervised counting models in some domains. This paper proposes UnCounTR, the first method for reference-based object counting that does not require any manual annotations. Manually annotating object counts in images is expensive and limits the size of datasets. This paper explores whether it's possible to learn counting without relying on these annotations. The paper introduces Self-Collages, a self-supervised method that generates training data by pasting objects onto background images. It uses an off-the-shelf pretrained DINO ViT backbone to extract features and train the counting model. UnCounTR outperforms strong baselines like DETR and achieves comparable performance to supervised methods on CARPK and MSO. On FSC-147, UnCounTR outperforms the supervised method CounTR on low-count ranges and shows competitive results for medium counts. The paper demonstrates that UnCounTR can be extended to perform self-supervised semantic counting, where the model identifies exemplars and counts them without any prior. UnCounTR's performance degrades for images with counts significantly higher than the ones seen during training, suggesting limits to its generalization abilities. The paper primarily focuses on counting, leaving the exploration of using Self-Collages for related tasks like semantic instance segmentation as future work. object counting, self-supervised learning, few-shot learning, computer vision, unsupervised learning
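A minimal sketch of how a "Self-Collage"-style training pair could be assembled: paste a randomly chosen object crop onto a background image several times and use the paste count as the label. The crop source (e.g., crops extracted upstream with unsupervised segmentation), the size range, and the count range are assumptions here, not the paper's exact construction.

```python
import random
from PIL import Image

def make_self_collage(background: Image.Image, object_crops, min_count=1, max_count=20):
    """Return (collage_image, count). `object_crops` is a list of RGBA PIL crops."""
    canvas = background.copy().convert("RGB")
    n = random.randint(min_count, max_count)
    crop = random.choice(object_crops)          # one object type per collage
    W, H = canvas.size
    for _ in range(n):
        s = random.uniform(0.05, 0.2)           # paste size relative to the canvas
        w, h = max(1, int(W * s)), max(1, int(H * s))
        obj = crop.resize((w, h))
        x, y = random.randint(0, W - w), random.randint(0, H - h)
        canvas.paste(obj, (x, y), mask=obj)     # alpha channel acts as the paste mask
    return canvas, n                            # n is the self-supervised count label
```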
2307.08695 Report Neural Video Depth Stabilizer Yiran Wang, Min Shi, Jiaqi Li, Zihao Huang, Zhiguo Cao, Jianming Zhang, Ke Xian, Guosheng Lin Video depth estimation aims to infer temporally consistent depth. Some methods achieve temporal consistency by finetuning a single-image depth model during test time using geometry and re-projection constraints, which is inefficient and not robust. An alternative approach is to learn how to enforce temporal consistency from data, but this requires well-designed models and sufficient video depth data. To address these challenges, we propose a plug-and-play framework called Neural Video Depth Stabilizer (NVDS) that stabilizes inconsistent depth estimations and can be applied to different single-image depth models without extra effort. We also introduce a large-scale dataset, Video Depth in the Wild (VDW), which consists of 14,203 videos with over two million frames, making it the largest natural-scene video depth dataset to our knowledge. We evaluate our method on the VDW dataset as well as two public benchmarks and demonstrate significant improvements in consistency, accuracy, and efficiency compared to previous approaches. Our work serves as a solid baseline and provides a data foundation for learning-based video depth models. We will release our dataset and code for future research. This paper introduces NVDS, a plug-and-play framework for improving temporal consistency in video depth estimation, and VDW, a large-scale natural-scene video depth dataset. Existing video depth estimation methods suffer from limitations: test-time training methods are computationally expensive and not robust, while learning-based methods lack sufficient training data. NVDS uses a Stabilization Network with cross-attention to refine flickering disparity maps from any single-image depth model. VDW provides diverse video data for training robust models. NVDS significantly outperforms previous methods in terms of consistency, accuracy, and efficiency. VDW, with over 2 million frames, serves as the largest natural-scene video depth dataset to date. Experiments demonstrate the effectiveness of NVDS with different depth predictors and the benefits of VDW for training. The current implementation of NVDS focuses on a specific framework. Future work includes exploring alternative mechanisms and lightweight models for broader application. video depth estimation, temporal consistency, plug-and-play framework, dataset, deep learning
2307.08629 Report Deficiency-Aware Masked Transformer for Video Inpainting Yongsheng Yu, Heng Fan, Libo Zhang Recent video inpainting methods have made remarkable progress by utilizing explicit guidance, such as optical flow, to propagate cross-frame pixels. However, there are cases where cross-frame recurrence of the masked video is not available, resulting in a deficiency. In such a situation, instead of borrowing pixels from other frames, the focus of the model shifts towards addressing the inverse problem. In this paper, we introduce a dual-modality-compatible inpainting framework called Deficiency-aware Masked Transformer (DMT), which offers three key advantages. Firstly, we pretrain an image inpainting model, DMT_img, to serve as a prior for distilling the video model DMT_vid, thereby benefiting the hallucination of deficiency cases. Secondly, the self-attention module selectively incorporates spatiotemporal tokens to accelerate inference and remove noise signals. Thirdly, a simple yet effective Receptive Field Contextualizer is integrated into DMT, further improving performance. Extensive experiments conducted on YouTube-VOS and DAVIS datasets demonstrate that DMT_vid significantly outperforms previous solutions. The code and video demonstrations can be found at github.com/yeates/DMT. This paper introduces Deficiency-aware Masked Transformer (DMT), a novel video inpainting framework that leverages pre-trained image inpainting models to enhance performance in deficiency cases (where the masked content is absent throughout the video). Addresses the challenge of deficiency cases in video inpainting, where existing methods struggle to generate plausible content. The paper bridges the gap between image and video inpainting by transferring knowledge from a pre-trained image inpainting model. The authors propose a dual-modality-compatible framework with: (1) a pre-trained image inpainting model (DMT_img) serving as a prior for the video inpainting model (DMT_vid), (2) a Token Selection mechanism to focus on valid spatiotemporal tokens, (3) a Mask Activation strategy to iteratively hallucinate missing regions, and (4) a Receptive Field Contextualizer (RFC) to enhance spatial feature reconstruction. DMT_vid significantly outperforms state-of-the-art video inpainting methods on benchmark datasets, achieving higher PSNR and SSIM scores while reducing VFID. The proposed framework effectively leverages the pre-trained image inpainting model to handle deficiency cases, demonstrating the benefits of knowledge transfer between image and video domains. The Token Selection and Mask Activation mechanisms contribute to improved efficiency and performance by reducing computational complexity and enabling the reconstruction of missing tokens. The method's reliance on Transformers leads to high memory requirements when processing high-resolution videos. Training a unified model for both image and video inpainting tasks poses challenges due to the inherent differences in their objectives and requirements. video inpainting, image inpainting, transformer, deficiency-aware, receptive field
2307.08585 Report Identity-Preserving Aging of Face Images via Latent Diffusion Models Sudipta Banerjee, Govind Mittal, Ameya Joshi, Chinmay Hegde, Nasir Memon The performance of automated face recognition systems is inevitably impacted by the facial aging process. However, high-quality datasets of individuals collected over several years are typically small in scale. In this work, we propose, train, and validate the use of latent text-to-image diffusion models for synthetically aging and de-aging face images. Our models succeed with few-shot training, and have the added benefit of being controllable via intuitive textual prompting. We observe high degrees of visual realism in the generated images while maintaining biometric fidelity measured by commonly used metrics. We evaluate our method on two benchmark datasets (CelebA and AgeDB) and observe a significant reduction (~44%) in the False Non-Match Rate compared to existing state-of-the-art baselines. The paper proposes a novel method for age progression and regression of face images using latent text-to-image diffusion models, focusing on preserving biometric identity. Facial aging significantly impacts face recognition systems, and existing methods struggle to balance visual realism with biometric fidelity. This work addresses this gap by leveraging the power of diffusion models and identity-preserving techniques. The method adapts DreamBooth, a latent diffusion model, by incorporating biometric and contrastive losses during fine-tuning. This allows the model to learn identity-specific features while leveraging a regularization set of image-caption pairs to understand age progression concepts. The method generates visually compelling age-progressed and regressed images while maintaining high biometric fidelity, as demonstrated by user studies and quantitative metrics. The proposed approach outperforms state-of-the-art methods like IPCGAN, AttGAN, and Talk-to-Edit, showing significant reduction in FNMR. Fine-tuning face recognition models on the generated images leads to significant performance improvement, suggesting their potential for improving face recognition robustness. The method currently relies on fine-tuning for each individual, and exploring zero-shot learning is a future direction. Further research can investigate the use of composable diffusion models for more fine-grained control over age editing. age progression, age regression, face recognition, latent diffusion models, biometric identity preservation
2307.08581 Report BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs Yang Zhao, Zhijie Lin, Daquan Zhou, Zilong Huang, Jiashi Feng, Bingyi Kang LLMs have demonstrated remarkable abilities at interacting with humans through language, especially with the usage of instruction-following data. Recent advancements in LLMs, such as MiniGPT-4, LLaVA, and X-LLM, further enlarge their abilities by incorporating multi-modal inputs, including image, video, and speech. Despite their effectiveness at generating precise and detailed language understanding of the given modality signal, these LLMs give up the ability to ground specific parts of inputs, thus only constructing a coarse-grained mapping. However, explicit and informative correspondence between text and other modalities will not only improve the user experience but also help to expand the application scenario of multi-modal LLMs. Therefore, we propose BuboGPT, a multi-modal LLM with visual grounding that can perform cross-modal interaction between vision, audio and language, providing fine-grained understanding of visual objects and other given modalities. As a result, BuboGPT is able to point out the specific location of an object in the image, when it is generating response or description for that object. Our contributions are two-fold: 1) An off-the-shelf visual grounding module based on SAM that extracts entities in a sentence and find corresponding masks in the image. 2) A two-stage training scheme and instruction dataset to endow joint text-image-audio understanding. Our experiments show that BuboGPT achieves impressive multi-modality understanding and visual grounding abilities during the interaction with human. It performs consistently well when provided by arbitrary modality combinations (either aligned or unaligned). Our code, model and dataset are available at https://bubo-gpt.github.io . BuboGPT, a multi-modal large language model (LLM) that incorporates visual grounding for fine-grained understanding of visual objects and their relationships with other modalities like text and audio. Existing multi-modal LLMs lack the ability to ground understanding in specific parts of inputs, limiting their interpretability and application scenarios. BuboGPT addresses this by linking visual objects with other modalities for fine-grained understanding. A two-stage training scheme is used: 1) Single-modal pre-training aligns vision and audio encoders with the LLM. 2) Multi-modal instruct tuning on a curated dataset with image-text, audio-text, and image-audio-text pairs, including negative pairs for better semantic reasoning. BuboGPT achieves impressive visual grounding, accurately associating textual descriptions with image regions. The model demonstrates strong audio understanding, providing detailed descriptions even for subtle audio cues. BuboGPT excels in aligned and arbitrary audio-image understanding, identifying sound sources in images and reasoning about the relationship between audio and visual inputs. Inherits language hallucination limitations from the underlying LLM, potentially generating non-factual information. Grounding question answering (QA) capabilities are limited by the text-based connection between grounding results and modalities, requiring further improvement with fine-grained visual grounding datasets and spatial information integration. multi-modal learning, large language models, visual grounding, audio understanding, instruction tuning
2307.08526 Report Image Captions are Natural Prompts for Text-to-Image Models Shiye Lei, Hao Chen, Sen Zhang, Bo Zhao, Dacheng Tao With the rapid development of Artificial Intelligence Generated Content (AIGC), it has become common practice in many learning tasks to train or fine-tune large models on synthetic data due to the data-scarcity and privacy leakage problems. Albeit promising with unlimited data generation, owing to massive and diverse information conveyed in real images, it is challenging for text-to-image generative models to synthesize informative training data with hand-crafted prompts, which usually leads to inferior generalization performance when training downstream models. In this paper, we theoretically analyze the relationship between the training effect of synthetic data and the synthetic data distribution induced by prompts. Then we correspondingly propose a simple yet effective method that prompts text-to-image generative models to synthesize more informative and diverse training data. Specifically, we caption each real image with the advanced captioning model to obtain informative and faithful prompts that extract class-relevant information and clarify the polysemy of class names. The image captions and class names are concatenated to prompt generative models for training image synthesis. Extensive experiments on ImageNette, ImageNet-100, and ImageNet-1K verify that our method significantly improves the performance of models trained on synthetic training data, i.e., 10% classification accuracy improvements on average. This paper proposes Caption in Prompt (CiP), a training-free method to synthesize informative training data using large text-to-image (T2I) models. It involves captioning real images and combining these captions with class names to prompt T2I models for generating synthetic training samples. Existing methods for generating synthetic training data using T2I models rely on simple prompts, resulting in limited information and diversity in the synthetic data. This leads to inferior generalization performance when training downstream models. CiP addresses this issue by creating more informative prompts based on real data. CiP first uses an off-the-shelf image captioning model to generate captions for real images. Then, it concatenates these captions with class names to form prompts for the T2I model. Finally, the T2I model generates synthetic images based on these constructed prompts. Guidance scale, a parameter in Stable Diffusion, significantly impacts the training effect of synthetic data, with a suitable range between 1.5 and 2.0. CiP significantly improves the training effect of synthetic datasets, leading to a substantial increase (around 10%) in classification accuracy compared to basic prompts. The quality of the captioning model used in CiP affects the performance, with BLIP-2 generating captions that lead to better results than ViT-GPT2. Generating training data via large diffusion models requires substantial computational resources, limiting scalability and edge-computing applications. While CiP is more efficient than methods based on diffusion inversion, image editing, and fine-tuning, reducing synthesis cost remains important for broader adoption. synthetic data generation, text-to-image models, image captioning, stable diffusion, deep learning
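A hedged sketch of the Caption-in-Prompt pipeline: caption a real image, prepend the class name, and feed the combined prompt to a text-to-image model, using a guidance scale in the 1.5–2.0 range reported above. The model checkpoints and the prompt template are assumptions for illustration, not necessarily the paper's exact setup.

```python
import torch
from transformers import pipeline
from diffusers import StableDiffusionPipeline

# Assumed off-the-shelf checkpoints; swap in whichever captioner / T2I model is available.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
t2i = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def cip_synthesize(real_image, class_name, guidance_scale=2.0):
    """Generate one synthetic training image for `class_name` from a real exemplar."""
    caption = captioner(real_image)[0]["generated_text"]
    prompt = f"{class_name}, {caption}"   # class name clarifies polysemy, caption adds detail
    return t2i(prompt, guidance_scale=guidance_scale).images[0]
```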
2307.08504 Report BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization Chaoya Jiang, Haiyang Xu, Wei Ye, Qinghao Ye, Chenliang Li, Ming Yan, Bin Bi, Shikun Zhang, Fei Huang, Songfang Huang Vision Transformer (ViT) based Vision-Language Pre-training (VLP) models have demonstrated impressive performance in various tasks. However, the lengthy visual token sequences fed into ViT can lead to training inefficiency and ineffectiveness. Existing efforts address the challenge by either bottom-level patch extraction in the ViT backbone or top-level patch abstraction outside, not balancing training efficiency and effectiveness well. Inspired by text summarization in natural language processing, we propose a Bottom-Up Patch Summarization approach named BUS, coordinating bottom-level extraction and top-level abstraction to learn a concise summary of lengthy visual token sequences efficiently. Specifically, we incorporate a Text-Semantics-Aware Patch Selector (TSPS) into the ViT backbone to perform a coarse-grained visual token extraction and then attach a flexible Transformer-based Patch Abstraction Decoder (PAD) upon the backbone for top-level visual abstraction. This bottom-up collaboration enables our BUS to yield high training efficiency while maintaining or even improving effectiveness. We evaluate our approach on various visual-language understanding and generation tasks and show competitive downstream task performance while boosting the training efficiency by 50%. Additionally, our model achieves state-of-the-art performance on many downstream tasks by increasing input image resolution without increasing computational costs over baselines. This paper introduces BUS, a novel Vision-Language Pre-training (VLP) model that utilizes a bottom-up patch summarization approach for efficient and effective learning. Existing ViT-based VLP models suffer from training inefficiency and ineffectiveness due to lengthy visual token sequences. BUS addresses this by summarizing these sequences, balancing efficiency and effectiveness. The model employs a two-step process: 1) **Key Patch Extraction (KPE)** within the ViT backbone selects text-relevant patches using a Text Semantic-aware Patch Selector (TSPS). 2) **Text-Guided Patch Abstraction (TPA)** utilizes a lightweight Transformer-based Patch Abstraction Decoder (PAD) to further condense the selected patches into a concise visual summary. BUS achieves competitive or better performance on downstream tasks like VQA, image captioning, and retrieval while being significantly faster than previous VLP models. The model can process higher resolution images without increased computational cost, leading to state-of-the-art results on tasks like VQA. Ablation studies confirm the effectiveness of both KPE and TPA, highlighting their contribution to BUS's efficiency and accuracy. The paper primarily focuses on efficiency and effectiveness, leaving further exploration of the learned representations for future work. The impact of varying the number of selected patches on model performance could be further investigated. vision-language pre-training, vision transformer, patch summarization, cross-modal learning, efficiency
2307.08500 Report Cumulative Spatial Knowledge Distillation for Vision Transformers Borui Zhao, Renjie Song, Jiajun Liang Distilling knowledge from convolutional neural networks (CNNs) is a double-edged sword for vision transformers (ViTs). It boosts the performance since the image-friendly local-inductive bias of CNN helps ViT learn faster and better, but leading to two problems: (1) Network designs of CNN and ViT are completely different, which leads to different semantic levels of intermediate features, making spatial-wise knowledge transfer methods (e.g., feature mimicking) inefficient. (2) Distilling knowledge from CNN limits the network convergence in the later training period since ViT's capability of integrating global information is suppressed by CNN's local-inductive-bias supervision. To this end, we present Cumulative Spatial Knowledge Distillation (CSKD). CSKD distills spatial-wise knowledge to all patch tokens of ViT from the corresponding spatial responses of CNN, without introducing intermediate features. Furthermore, CSKD exploits a Cumulative Knowledge Fusion (CKF) module, which introduces the global response of CNN and increasingly emphasizes its importance during the training. Applying CKF leverages CNN's local inductive bias in the early training period and gives full play to ViT's global capability in the later one. Extensive experiments and analysis on ImageNet-1k and downstream datasets demonstrate the superiority of our CSKD. Code will be publicly available. This paper proposes Cumulative Spatial Knowledge Distillation (CSKD), a novel knowledge distillation technique for Vision Transformers (ViTs) that addresses limitations of distilling from Convolutional Neural Networks (CNNs). Distilling knowledge from CNNs to ViTs, while beneficial, presents challenges: 1) misaligned intermediate feature semantics due to architectural differences and 2) hindering ViT's global information integration capabilities in later training stages. CSKD transfers spatial knowledge by using dense predictions from CNN's last features to supervise corresponding ViT patch tokens, avoiding intermediate feature alignment issues. It incorporates a Cumulative Knowledge Fusion (CKF) module that progressively emphasizes CNN's global response, balancing local and global knowledge transfer throughout training. CSKD consistently outperforms DeiT and DearKD baselines on ImageNet-1k, achieving up to +1.8% top-1 accuracy improvement. The method demonstrates superior transfer learning performance on downstream datasets like CIFAR, Cars, and iNat19, indicating improved generalization. Visualizations of attention distances and heatmaps confirm that CSKD effectively leverages ViT's global modeling capacity. The study primarily focuses on image classification tasks, leaving its application to other vision tasks for future exploration. The current implementation relies on a pre-trained CNN teacher; exploring student-teacher co-training could further enhance performance. knowledge distillation, vision transformer, convolutional neural network, spatial knowledge transfer, cumulative knowledge fusion
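To make the cumulative fusion concrete, here is a rough sketch of a CSKD-style spatial distillation objective in which the weight on the CNN's global response grows over training. The linear ramp schedule, temperature, and tensor shapes are illustrative assumptions; the paper's exact fusion rule may differ.

```python
import torch
import torch.nn.functional as F

def cskd_style_loss(vit_patch_logits, cnn_dense_logits, cnn_global_logits,
                    step, total_steps, T=4.0):
    """
    vit_patch_logits:  (B, N, C) logits predicted from each ViT patch token
    cnn_dense_logits:  (B, N, C) CNN logits at the corresponding spatial positions
    cnn_global_logits: (B, C)    CNN logits from the globally pooled feature
    """
    alpha = min(1.0, step / total_steps)   # emphasis on the global response grows with training
    target = (1 - alpha) * cnn_dense_logits + alpha * cnn_global_logits.unsqueeze(1)
    p_teacher = F.softmax(target / T, dim=-1)
    log_p_student = F.log_softmax(vit_patch_logits / T, dim=-1)
    # Standard temperature-scaled KD loss applied per patch token.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
```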
2307.08448 Report Not All Steps are Created Equal: Selective Diffusion Distillation for Image Manipulation Luozhou Wang, Shuai Yang, Shu Liu, Ying-cong Chen Conditional diffusion models have demonstrated impressive performance in image manipulation tasks. The general pipeline involves adding noise to the image and then denoising it. However, this method faces a trade-off problem: adding too much noise affects the fidelity of the image while adding too little affects its editability. This largely limits their practical applicability. In this paper, we propose a novel framework, Selective Diffusion Distillation (SDD), that ensures both the fidelity and editability of images. Instead of directly editing images with a diffusion model, we train a feedforward image manipulation network under the guidance of the diffusion model. Besides, we propose an effective indicator to select the semantic-related timestep to obtain the correct semantic guidance from the diffusion model. This approach successfully avoids the dilemma caused by the diffusion process. Our extensive experiments demonstrate the advantages of our framework. Code is released at https://github.com/AndysonYs/Selective-Diffusion-Distillation. This paper proposes Selective Diffusion Distillation (SDD), a novel image manipulation framework that leverages a pre-trained text-guided diffusion model to supervise an efficient feedforward image manipulator, avoiding the editability-fidelity trade-off common in direct diffusion-based editing. Existing diffusion-based image manipulation methods suffer from a trade-off between editability and fidelity, limiting their practicality. This paper aims to overcome this limitation by introducing a new framework that separates the manipulation process from the diffusion process. The proposed SDD framework utilizes a pre-trained diffusion model as a supervisor to train a feedforward image manipulator (e.g., StyleGAN with a latent mapper). It introduces the Hybrid Quality Score (HQS) to select semantically relevant diffusion timesteps, ensuring the manipulator receives optimal guidance from the diffusion model. SDD successfully performs various image manipulations across different domains (faces, cats, cars) while preserving high fidelity to the input image. Compared to other diffusion-based methods, SDD achieves higher CLIP similarity (better semantic alignment with the text prompt) and lower FID (better image quality). SDD demonstrates superior efficiency compared to diffusion-based methods when manipulating a large number of images. The HQS selection strategy relies on empirical observations and might require further investigation for optimal performance. Future work can explore different architectures for the image manipulator and extend the method to a wider range of image manipulation tasks. image manipulation, diffusion models, knowledge distillation, text-guided image editing, hybrid quality score
2307.08436 Report DOT: A Distillation-Oriented Trainer Borui Zhao, Quan Cui, Renjie Song, Jiajun Liang Knowledge distillation transfers knowledge from a large model to a small one via task and distillation losses. In this paper, we observe a trade-off between task and distillation losses, i.e., introducing distillation loss limits the convergence of task loss. We believe that the trade-off results from the insufficient optimization of distillation loss. The reason is: The teacher has a lower task loss than the student, and a lower distillation loss drives the student more similar to the teacher, then a better-converged task loss could be obtained. To break the trade-off, we propose the Distillation-Oriented Trainer (DOT). DOT separately considers gradients of task and distillation losses, then applies a larger momentum to distillation loss to accelerate its optimization. We empirically prove that DOT breaks the trade-off, i.e., both losses are sufficiently optimized. Extensive experiments validate the superiority of DOT. Notably, DOT achieves a +2.59% accuracy improvement on ImageNet-1k for the ResNet50-MobileNetV1 pair. Conclusively, DOT greatly benefits the student's optimization properties in terms of loss convergence and model generalization. Code will be made publicly available. This paper proposes Distillation-Oriented Trainer (DOT) to address the trade-off between task and distillation losses in knowledge distillation, aiming to improve student model convergence and generalization. Knowledge distillation, while effective in transferring knowledge, often encounters a trade-off where improving distillation loss hinders task loss convergence, limiting the student model's performance. DOT tackles this trade-off by employing separate momentum buffers for task and distillation losses during optimization. It assigns a larger momentum to the distillation loss, ensuring its gradients dominate the training process and lead to better knowledge transfer. DOT successfully breaks the task-distillation loss trade-off, achieving lower values for both losses simultaneously. The method guides the student model towards flatter minima in the loss landscape, empirically demonstrating improved generalization ability. DOT consistently enhances the performance of various distillation methods across benchmarks like CIFAR-100, Tiny-ImageNet, and ImageNet-1k, achieving new state-of-the-art results. The paper primarily focuses on image classification tasks, and further investigation is needed to assess DOT's effectiveness in other domains like natural language processing. Future work could explore the optimal balance between task and distillation losses during different training stages for potential further improvements. knowledge distillation, optimization, deep learning, model compression, loss landscape
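A minimal sketch of the DOT idea: compute task and distillation gradients separately and accumulate them in independent momentum buffers, giving the distillation gradient the larger momentum (μ + Δ) and the task gradient the smaller one (μ − Δ). The hyperparameter values and the exact update form below are illustrative, not the paper's precise recipe.

```python
import torch

def dot_step(params, task_loss, distill_loss, buffers, lr=0.05, mu=0.9, delta=0.075):
    """One optimization step with separate momentum buffers for the two losses.
    `params` is a list of parameters; `buffers` is a dict persisted across steps."""
    g_task = torch.autograd.grad(task_loss, params, retain_graph=True)
    g_kd = torch.autograd.grad(distill_loss, params)
    mu_task, mu_kd = mu - delta, mu + delta   # distillation gradients get the larger momentum
    for i, (p, gt, gk) in enumerate(zip(params, g_task, g_kd)):
        b_task, b_kd = buffers.setdefault(i, (torch.zeros_like(p), torch.zeros_like(p)))
        b_task = mu_task * b_task + gt        # independent momentum accumulation per loss
        b_kd = mu_kd * b_kd + gk
        buffers[i] = (b_task, b_kd)
        with torch.no_grad():
            p.add_(b_task + b_kd, alpha=-lr)  # combined descent step
```

In a training loop one would call this with `params = list(student.parameters())` and a single `buffers = {}` dict reused across iterations.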
2307.08397 Report CLIP-Guided StyleGAN Inversion for Text-Driven Real Image Editing Ahmet Canberk Baykal, Abdul Basit Anees, Duygu Ceylan, Erkut Erdem, Aykut Erdem, Deniz Yuret Researchers have recently begun exploring the use of StyleGAN-based models for real image editing. One particularly interesting application is using natural language descriptions to guide the editing process. Existing approaches for editing images using language either resort to instance-level latent code optimization or map predefined text prompts to some editing directions in the latent space. However, these approaches have inherent limitations. The former is not very efficient, while the latter often struggles to effectively handle multi-attribute changes. To address these weaknesses, we present CLIPInverter, a new text-driven image editing approach that is able to efficiently and reliably perform multi-attribute changes. The core of our method is the use of novel, lightweight text-conditioned adapter layers integrated into pretrained GAN-inversion networks. We demonstrate that by conditioning the initial inversion step on the CLIP embedding of the target description, we are able to obtain more successful edit directions. Additionally, we use a CLIP-guided refinement step to make corrections in the resulting residual latent codes, which further improves the alignment with the text prompt. Our method outperforms competing approaches in terms of manipulation accuracy and photo-realism on various domains including human faces, cats, and birds, as shown by our qualitative and quantitative results. Presents CLIPInverter, a novel text-driven image editing approach that uses CLIP-guided adapter layers within pretrained GAN-inversion networks for efficient and reliable multi-attribute image manipulation. Existing methods for language-guided image editing are either inefficient (instance-level optimization) or struggle with multi-attribute changes (predefined text prompts). Integrates lightweight text-conditioned adapter layers (CLIPAdapter) into pretrained GAN-inversion networks. The initial inversion is conditioned on the CLIP embedding of the target description, and a CLIP-guided refinement step (CLIPRemapper) further improves alignment with the text prompt. Outperforms competing approaches in manipulation accuracy and photo-realism across various domains (faces, cats, birds). Enables smooth image manipulations through latent code interpolation, offering user control over the editing process. Demonstrates zero-shot capabilities by handling novel descriptions and using reference images as conditioning input. Inherits limitations of the underlying GAN inversion network, such as potential struggles with unusual poses or challenging lighting conditions. May be affected by biases present in the training data, which can lead to undesired manipulations. This can be mitigated by using more comprehensive textual descriptions. image manipulation, text-guided editing, stylegan, clip, gan inversion
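As a conceptual sketch of a text-conditioned adapter layer, the module below modulates an intermediate encoder feature map with the CLIP embedding of the target description using FiLM-style scale and shift. FiLM-style modulation here stands in for the paper's CLIPAdapter design, whose exact architecture may differ; shapes and the CLIP dimension are assumptions.

```python
import torch
import torch.nn as nn

class TextConditionedAdapter(nn.Module):
    """Lightweight adapter that injects a CLIP text embedding into inversion features."""
    def __init__(self, feat_channels, clip_dim=512):
        super().__init__()
        self.to_scale = nn.Linear(clip_dim, feat_channels)
        self.to_shift = nn.Linear(clip_dim, feat_channels)

    def forward(self, feat, text_emb):
        """feat: (B, C, H, W) encoder feature; text_emb: (B, clip_dim) CLIP text embedding."""
        scale = self.to_scale(text_emb)[:, :, None, None]
        shift = self.to_shift(text_emb)[:, :, None, None]
        return feat * (1 + scale) + shift   # text-dependent modulation of the inversion features
```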
2307.08199 Report Unbiased Image Synthesis via Manifold Guidance in Diffusion Models Xingzhe Su, Daixi Jia, Fengge Wu, Junsuo Zhao, Changwen Zheng, Wenwen Qiang Diffusion Models are a potent class of generative models capable of producing high-quality images. However, they often inadvertently favor certain data attributes, undermining the diversity of generated images. This issue is starkly apparent in skewed datasets like CelebA, where the initial dataset disproportionately favors females over males by 57.9%, this bias amplified in generated data where female representation outstrips males by 148%. In response, we propose a plug-and-play method named Manifold Guidance Sampling, which is also the first unsupervised method to mitigate bias issue in DDPMs. Leveraging the inherent structure of the data manifold, this method steers the sampling process towards a more uniform distribution, effectively dispersing the clustering of biased data. Without the need for modifying the existing model or additional training, it significantly mitigates data bias and enhances the quality and unbiasedness of the generated images. This paper proposes Manifold Guidance Sampling (MGS), a plug-and-play, unsupervised method to mitigate bias in Denoising Diffusion Probabilistic Models (DDPMs) by guiding the sampling process towards a more uniform distribution on the data manifold. DDPMs, despite their success in image synthesis, inherit and often amplify biases present in training data, leading to skewed and unrepresentative generated images. This underscores the need for methods like MGS to ensure fairness and diversity in generated data. MGS operates in two stages: (1) It evaluates the data manifold by learning an efficient mapping from high-dimensional image space to a low-dimensional feature space, capturing the intrinsic data structure. (2) It incorporates manifold constraints into the DDPM sampling process, promoting a uniform distribution of generated samples across the learned manifold. MGS effectively reduces bias in generated images, demonstrated through analysis of attribute distributions on the CelebA dataset. MGS enhances both the quality and diversity of generated images compared to standard DDPM sampling, evidenced by improved FID and sFID scores across multiple datasets. MGS is a versatile and adaptable method, compatible with various DDPM architectures and sampling schedules, and does not require model retraining or label information. While significantly mitigating bias, MGS doesn't completely eliminate it, suggesting room for further improvement. The effectiveness of MGS may vary across different datasets and bias types, necessitating further investigation into its generalizability and potential limitations. diffusion models, image synthesis, data bias, manifold learning, unsupervised learning
2307.08093 Report Cross-Ray Neural Radiance Fields for Novel-view Synthesis from Unconstrained Image Collections Yifan Yang, Shuhai Zhang, Zixiong Huang, Yubing Zhang, Mingkui Tan Neural Radiance Fields (NeRF) is a revolutionary approach for rendering scenes by sampling a single ray per pixel and it has demonstrated impressive capabilities in novel-view synthesis from static scene images. However, in practice, we usually need to recover NeRF from unconstrained image collections, which poses two challenges: 1) the images often have dynamic changes in appearance because of different capturing time and camera settings; 2) the images may contain transient objects such as humans and cars, leading to occlusion and ghosting artifacts. Conventional approaches seek to address these challenges by locally utilizing a single ray to synthesize a color of a pixel. In contrast, humans typically perceive appearance and objects by globally utilizing information across multiple pixels. To mimic the perception process of humans, in this paper, we propose Cross-Ray NeRF (CR-NeRF) that leverages interactive information across multiple rays to synthesize occlusion-free novel views with the same appearances as the images. Specifically, to model varying appearances, we first propose to represent multiple rays with a novel cross-ray feature and then recover the appearance by fusing global statistics, i.e., feature covariance of the rays and the image appearance. Moreover, to avoid occlusion introduced by transient objects, we propose a transient objects handler and introduce a grid sampling strategy for masking out the transient objects. We theoretically find that leveraging correlation across multiple rays promotes capturing more global information. Moreover, extensive experimental results on large real-world datasets verify the effectiveness of CR-NeRF. This paper proposes Cross-Ray NeRF (CR-NeRF), a novel method for synthesizing novel views from unconstrained image collections by leveraging interactions among multiple rays to address varying appearances and transient occlusions. Existing NeRF methods struggle with unconstrained image collections due to their static scene assumption, leading to inaccurate reconstructions with over-smoothing and ghosting artifacts. CR-NeRF aims to overcome these limitations and enable realistic novel view synthesis from diverse and dynamic scenes. CR-NeRF introduces a cross-ray paradigm with two key components: (1) Cross-ray appearance modeling: representing multiple rays with a cross-ray feature, fusing it with an appearance embedding using global statistics (feature covariance), and decoding it to obtain pixel colors simultaneously. (2) Cross-ray transient object handling: employing a segmentation network to generate a visibility map for transient objects and using grid sampling to pair the map with the input rays. CR-NeRF outperforms state-of-the-art methods like NeRF-W and Ha-NeRF on benchmark datasets, achieving higher PSNR, SSIM, and lower LPIPS values. CR-NeRF demonstrates superior ability in modeling varying appearances, especially for images with high-frequency information, and effectively removes transient objects like tourists and cars. CR-NeRF exhibits significant inference efficiency when handling multiple images with varying appearances but fixed camera positions, outperforming Ha-NeRF significantly. The definition of transient objects needs further exploration for more robust handling. The current method focuses on synthesizing static scenes and could be extended to dynamic scenes with moving objects in the future. novel view synthesis, neural radiance fields, unconstrained image collections, appearance modeling, transient object handling
2307.08076 Report Diffusion to Confusion: Naturalistic Adversarial Patch Generation Based on Diffusion Model for Object Detector Shuo-Yen Lin, Ernie Chu, Che-Hsien Lin, Jun-Cheng Chen, Jia-Ching Wang Many physical adversarial patch generation methods are widely proposed to protect personal privacy from malicious monitoring using object detectors. However, they usually fail to generate satisfactory patch images in terms of both stealthiness and attack performance without making huge efforts on careful hyperparameter tuning. To address this issue, we propose a novel naturalistic adversarial patch generation method based on the diffusion models (DM). Through sampling the optimal image from the DM model pretrained upon natural images, it allows us to stably craft high-quality and naturalistic physical adversarial patches to humans without suffering from serious mode collapse problems as other deep generative models. To the best of our knowledge, we are the first to propose DM-based naturalistic adversarial patch generation for object detectors. With extensive quantitative, qualitative, and subjective experiments, the results demonstrate the effectiveness of the proposed approach to generate better-quality and more naturalistic adversarial patches while achieving acceptable attack performance than other state-of-the-art patch generation methods. We also show various generation trade-offs under different conditions. This paper proposes a novel method for generating naturalistic adversarial patches for object detectors, leveraging diffusion models (DM) pre-trained on natural images. Existing adversarial patch generation methods often create visually conspicuous patterns or require extensive hyperparameter tuning to balance attack performance and natural appearance. This work addresses these limitations by utilizing the power of diffusion models in generating high-quality and diverse images. The method introduces a novel Adversarial Patch Sampling (APS) technique based on DDIM. It optimizes an initial patch generated from a text-conditioned LDM by backpropagating the object detector's loss into the LDM's sampling process. Text conditioning and strategic noise injection during optimization contribute to maintaining the naturalism of the generated patches. The proposed method demonstrates superior attack performance compared to previous state-of-the-art methods on various object detectors. Subjective evaluations through user studies confirm that the generated patches are perceived as more natural than those from previous methods, often even surpassing real-world images. The method exhibits robustness to existing defenses like SAC, showcasing its effectiveness in real-world scenarios. The generalization of adversarial patches across different datasets requires further investigation and potential improvement. The computational cost associated with the diffusion model sampling process, although mitigated by DDIM and LDM, remains a consideration for future work. adversarial patch, object detection, diffusion models, naturalistic patch, physical adversarial examples
2307.08041 Report Planting a SEED of Vision in Large Language Model Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, Ying Shan We present SEED, an elaborate image tokenizer that empowers Large Language Models (LLMs) with the emergent ability to SEE and Draw at the same time. Research on image tokenizers has previously reached an impasse, as frameworks employing quantized visual tokens have lost prominence due to subpar performance and convergence in multimodal comprehension (compared to BLIP-2, etc.) or generation (compared to Stable Diffusion, etc.). Despite the limitations, we remain confident in its natural capacity to unify visual and textual representations, facilitating scalable multimodal training with LLM's original recipe. In this study, we identify two crucial principles for the architecture and training of SEED that effectively ease subsequent alignment with LLMs. (1) Image tokens should be independent of 2D physical patch positions and instead be produced with a 1D causal dependency, exhibiting intrinsic interdependence that aligns with the left-to-right autoregressive prediction mechanism in LLMs. (2) Image tokens should capture high-level semantics consistent with the degree of semantic abstraction in words, and be optimized for both discriminativeness and reconstruction during the tokenizer training phase. As a result, the off-the-shelf LLM is able to perform both image-to-text and text-to-image generation by incorporating our SEED through efficient LoRA tuning. Comprehensive multimodal pretraining and instruction tuning, which may yield improved results, are reserved for future investigation. This version of SEED was trained in 5.7 days using only 64 V100 GPUs and 5M publicly available image-text pairs. Our preliminary study emphasizes the great potential of discrete visual tokens in versatile multimodal LLMs and the importance of proper image tokenizers in broader research. This paper introduces SEED, a novel image tokenizer designed to equip Large Language Models (LLMs) with the capacity for both visual understanding (seeing) and generation (drawing). Existing image tokenizers struggle to effectively bridge the gap between visual and textual representations, hindering the development of truly versatile multimodal LLMs. SEED leverages a VQ-based approach, employing a Causal Q-Former to produce discrete visual tokens with 1D causal dependency and high-level semantic information. It further incorporates a Reverse Q-Former to align visual tokens with the latent space of Stable Diffusion for image generation. SEED tokens demonstrate competitive performance in text-image retrieval tasks compared to BLIP-2. SEED facilitates efficient alignment with LLMs through LoRA tuning, enabling text-to-image and image-to-text generation. Preliminary experiments with SEED-OPT-2.7B show promising results in zero-shot image captioning, visual QA, and image generation. The current SEED implementation is limited by the scale of training data (5M image-text pairs) and the size of the LLM used (OPT-2.7B). Future work will explore more comprehensive multimodal pretraining and instruction tuning to further enhance SEED’s capabilities. image tokenization, multimodal llms, visual comprehension, image generation, causal dependency
2307.08012 Report Householder Projector for Unsupervised Latent Semantics Discovery Yue Song, Jichao Zhang, Nicu Sebe, Wei Wang Generative Adversarial Networks (GANs), especially the recent style-based generators (StyleGANs), have versatile semantics in the structured latent space. Latent semantics discovery methods emerge to move around the latent code such that only one factor varies during the traversal. Recently, an unsupervised method proposed a promising direction to directly use the eigenvectors of the projection matrix that maps latent codes to features as the interpretable directions. However, one overlooked fact is that the projection matrix is non-orthogonal and the number of eigenvectors is too large. The non-orthogonality would entangle semantic attributes in the top few eigenvectors, and the large dimensionality might result in meaningless variations among the directions even if the matrix is orthogonal. To avoid these issues, we propose Householder Projector, a flexible and general low-rank orthogonal matrix representation based on Householder transformations, to parameterize the projection matrix. The orthogonality guarantees that the eigenvectors correspond to disentangled interpretable semantics, while the low-rank property encourages that each identified direction has meaningful variations. We integrate our projector into pre-trained StyleGAN2/StyleGAN3 and evaluate the models on several benchmarks. Within only $1\%$ of the original training steps for fine-tuning, our projector helps StyleGANs to discover more disentangled and precise semantic attributes without sacrificing image fidelity. This paper introduces Householder Projector, a low-rank orthogonal matrix representation based on Householder transformations, to parameterize the projection matrix in StyleGANs for enhanced unsupervised latent semantics discovery. Existing unsupervised methods for discovering interpretable directions in StyleGANs suffer from entangled semantics due to imbalanced eigenvalues in the projection matrix. Additionally, enforcing vanilla orthogonality can lead to meaningless variations due to the high dimensionality of the projector. The proposed method decomposes the projection matrix into its SVD form and represents the orthogonal singular vectors using Householder reflectors. A low-rank identity matrix is employed for singular values, enabling control over the number of semantic concepts. The method leverages pre-trained weights for initialization and employs acceleration techniques for efficient computation. Householder Projector significantly improves latent semantics discovery in StyleGANs, leading to more precise attribute control without compromising image fidelity. The method outperforms other unsupervised baselines in terms of latent space smoothness (PPL and PIPL) and maintains competitive image quality (FID). Householder Projector enables the discovery of diverse and semantically consistent interpretable directions across different layers and datasets. Current experiments focus on fine-tuning pre-trained StyleGANs, and training from scratch could potentially further enhance performance. The number of semantics per layer is currently pre-defined, and exploring adaptive schemes for automatic semantic mining is an area for future work. generative adversarial networks, latent semantics discovery, stylegan, householder transformations, disentanglement
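For intuition, an orthogonal matrix can be parameterized as a product of Householder reflections, H_i = I − 2 v_i v_iᵀ / ‖v_i‖², and then truncated to a chosen rank. The sketch below follows that recipe; the square shape, the number of reflectors, and the initialization are simplifying assumptions rather than the paper's exact parameterization of the StyleGAN projection matrix.

```python
import torch
import torch.nn as nn

class HouseholderOrthogonal(nn.Module):
    """Orthogonal matrix parameterized as a product of Householder reflections."""
    def __init__(self, dim, n_reflectors=None):
        super().__init__()
        self.v = nn.Parameter(torch.randn(n_reflectors or dim, dim))

    def matrix(self):
        U = torch.eye(self.v.shape[1], device=self.v.device)
        for v in self.v:
            v = v / (v.norm() + 1e-8)
            U = U - 2.0 * torch.outer(U @ v, v)   # right-multiply by H = I - 2 v v^T
        return U


proj = HouseholderOrthogonal(dim=512, n_reflectors=64)
U = proj.matrix()
print(torch.allclose(U.T @ U, torch.eye(512), atol=1e-4))  # True: columns are orthonormal
# With a low-rank, identity-like singular value matrix, the top-r columns of U act as
# disentangled edit directions: w_edit = w + alpha * U[:, k] for k < r.
```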
2307.07790 Report Adaptive Nonlinear Latent Transformation for Conditional Face Editing Zhizhong Huang, Siteng Ma, Junping Zhang, Hongming Shan Recent works for face editing usually manipulate the latent space of StyleGAN via the linear semantic directions. However, they usually suffer from the entanglement of facial attributes, need to tune the optimal editing strength, and are limited to binary attributes with strong supervision signals. This paper proposes a novel adaptive nonlinear latent transformation for disentangled and conditional face editing, termed AdaTrans. Specifically, our AdaTrans divides the manipulation process into several finer steps; i.e., the direction and size at each step are conditioned on both the facial attributes and the latent codes. In this way, AdaTrans describes an adaptive nonlinear transformation trajectory to manipulate the faces into target attributes while keeping other attributes unchanged. Then, AdaTrans leverages a predefined density model to constrain the learned trajectory in the distribution of latent codes by maximizing the likelihood of transformed latent code. Moreover, we also propose a disentangled learning strategy under a mutual information framework to eliminate the entanglement among attributes, which can further relax the need for labeled data. Consequently, AdaTrans enables a controllable face editing with the advantages of disentanglement, flexibility with non-binary attributes, and high fidelity. Extensive experimental results on various facial attributes demonstrate the qualitative and quantitative effectiveness of the proposed AdaTrans over existing state-of-the-art methods, especially in the most challenging scenarios with a large age gap and few labeled examples. The source code is available at https://github.com/Hzzone/AdaTrans. Proposes AdaTrans, an adaptive nonlinear latent transformation method for disentangled and conditional face editing in StyleGAN, addressing limitations of linear interpolation methods. Existing linear methods suffer from attribute entanglement, require manual strength tuning, and are limited to binary attributes. AdaTrans aims to achieve disentanglement, flexibility with non-binary attributes, and high fidelity. Divides manipulation into finer steps with direction and size conditioned on attributes and latent codes, describing an adaptive nonlinear trajectory. Leverages a density model to constrain the trajectory within the latent space distribution. Achieves disentangled and controllable face editing, preserving unrelated attributes even with large age gaps. Outperforms state-of-the-art methods in terms of editing accuracy, attribute preservation, and identity preservation. Demonstrates flexibility by handling multi-attribute editing and maintaining performance with limited labeled data. Background preservation during editing is not addressed. Further exploration of intermediate StyleGAN features for background preservation is planned as future work. face editing, stylegan, disentanglement, nonlinear transformation, latent space
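A conceptual sketch of an AdaTrans-style nonlinear traversal: the edit is split into several small steps, and each step's direction and size are predicted from the current latent code and the target attributes. The network interface, step count, and bounding of the step size below are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveLatentTransform(nn.Module):
    def __init__(self, latent_dim, attr_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + attr_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, latent_dim + 1),        # per-step direction + scalar step size
        )

    def forward(self, w, target_attrs, n_steps=5):
        for _ in range(n_steps):
            out = self.net(torch.cat([w, target_attrs], dim=-1))
            direction = F.normalize(out[:, :-1], dim=-1)
            size = torch.tanh(out[:, -1:])            # bounded, latent-dependent step size
            w = w + size * direction                  # nonlinear, state-dependent trajectory
        return w
```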
2307.07710 Report ExposureDiffusion: Learning to Expose for Low-light Image Enhancement Yufei Wang, Yi Yu, Wenhan Yang, Lanqing Guo, Lap-Pui Chau, Alex C. Kot, Bihan Wen Previous raw image-based low-light image enhancement methods predominantly relied on feed-forward neural networks to learn deterministic mappings from low-light to normally-exposed images. However, they failed to capture critical distribution information, leading to visually undesirable results. This work addresses the issue by seamlessly integrating a diffusion model with a physics-based exposure model. Different from a vanilla diffusion model that has to perform Gaussian denoising, with the injected physics-based exposure model, our restoration process can directly start from a noisy image instead of pure noise. As such, our method obtains significantly improved performance and reduced inference time compared with vanilla diffusion models. To make full use of the advantages of different intermediate steps, we further propose an adaptive residual layer that effectively screens out the side-effect in the iterative refinement when the intermediate results have been already well-exposed. The proposed framework is compatible with real-paired datasets, real/synthetic noise models, and different backbone networks. We evaluate the proposed method on various public benchmarks, achieving promising results with consistent improvements using different exposure models and backbones. Besides, the proposed method achieves better generalization capacity for unseen amplifying ratios and better performance than a larger feedforward neural model when few parameters are adopted. This paper proposes ExposureDiffusion, a novel diffusion-based model for low-light image enhancement in raw image space, seamlessly integrating a diffusion model with a physics-based exposure model. Existing raw image enhancement methods rely on deterministic mappings, failing to capture distribution information and effectively incorporate noise models, leading to suboptimal results. The method simulates the exposure process using a progressive shared network, minimizing the divergence between the simulated process and the real physics-based exposure process. An adaptive residual layer dynamically fuses denoising strategies for areas with different noise levels. ExposureDiffusion achieves significant performance improvements over baseline methods on SID and ELD datasets. The method demonstrates compatibility with different noise models and backbone networks, consistently enhancing results. ExposureDiffusion exhibits better generalization ability for unseen amplification ratios compared to feedforward networks. The determination of optimal inference steps for varying noise levels needs further investigation. Future work could explore adaptive algorithms for automatically determining the number of inference steps. low-light image enhancement, diffusion models, raw image processing, physics-based modeling, adaptive residual learning
2307.07678 Report Both Spatial and Frequency Cues Contribute to High-Fidelity Image Inpainting Ze Lu, Yalei Lv, Wenqi Wang, Pengfei Xiong Deep generative approaches have obtained great success in image inpainting recently. However, most generative inpainting networks suffer from either over-smooth results or aliasing artifacts. The former lacks high-frequency details, while the latter lacks semantic structure. To address this issue, we propose an effective Frequency-Spatial Complementary Network (FSCN) by exploiting rich semantic information in both spatial and frequency domains. Specifically, we introduce an extra Frequency Branch and Frequency Loss on the spatial-based network to impose direct supervision on the frequency information, and propose a Frequency-Spatial Cross-Attention Block (FSCAB) to fuse multi-domain features and combine the corresponding characteristics. With our FSCAB, the inpainting network is capable of capturing frequency information and preserving visual consistency simultaneously. Extensive quantitative and qualitative experiments demonstrate that our inpainting network can effectively achieve superior results, outperforming previous state-of-the-art approaches with significantly fewer parameters and less computation cost. The code will be released soon. This paper proposes a Frequency-Spatial Complementary Network (FSCN) for high-fidelity image inpainting. Most existing image inpainting networks suffer from either over-smooth results (lacking high-frequency details) or aliasing artifacts (lacking semantic structure) due to focusing solely on spatial or frequency domain. FSCN utilizes a Frequency Branch and Frequency Loss to capture high-frequency details and a spatial branch for semantic structures. It employs a Frequency-Spatial Cross-Attention Block (FSCAB) to effectively fuse features from both domains. FSCN achieves state-of-the-art results on CelebA-HQ and Places datasets, outperforming previous methods in terms of FID, LPIPS, and SSIM. It recovers fine-grained details and preserves semantic consistency effectively. FSCN achieves superior results with significantly fewer parameters and less computational cost compared to previous SOTA methods. Performance on thick masks can be further improved, potentially by exploring more sophisticated mask-aware strategies. The network's generalization ability across diverse datasets and inpainting scenarios could be further enhanced. image inpainting, frequency domain, spatial domain, cross-attention, deep learning
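The frequency supervision described above can be illustrated with a short, generic loss that compares the 2D FFT of the prediction and the ground truth; the weighting and the exact spectral representation here are assumptions rather than FSCN's released implementation.

```python
# Illustrative frequency-domain supervision for inpainting (not the FSCN release code).
import torch
import torch.nn.functional as F

def frequency_loss(pred, target):
    """L1 distance between the 2D FFT spectra of predicted and ground-truth images."""
    pred_fft = torch.fft.fft2(pred, norm="ortho")
    target_fft = torch.fft.fft2(target, norm="ortho")
    # Compare real and imaginary parts so both amplitude and phase are supervised.
    return F.l1_loss(torch.view_as_real(pred_fft), torch.view_as_real(target_fft))

pred = torch.rand(2, 3, 64, 64)
target = torch.rand(2, 3, 64, 64)
total = F.l1_loss(pred, target) + 0.1 * frequency_loss(pred, target)  # weight is an assumption
print(total.item())
```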
2307.07663 Report INVE: Interactive Neural Video Editing Jiahui Huang, Leonid Sigal, Kwang Moo Yi, Oliver Wang, Joon-Young Lee We present Interactive Neural Video Editing (INVE), a real-time video editing solution, which can assist the video editing process by consistently propagating sparse frame edits to the entire video clip. Our method is inspired by the recent work on Layered Neural Atlas (LNA). LNA, however, suffers from two major drawbacks: (1) the method is too slow for interactive editing, and (2) it offers insufficient support for some editing use cases, including direct frame editing and rigid texture tracking. To address these challenges we leverage and adopt highly efficient network architectures, powered by hash-grids encoding, to substantially improve processing speed. In addition, we learn bi-directional functions between image-atlas and introduce vectorized editing, which collectively enables a much greater variety of edits in both the atlas and the frames directly. Compared to LNA, our INVE reduces the learning and inference time by a factor of 5, and supports various video editing operations that LNA cannot. We showcase the superiority of INVE over LNA in interactive video editing through a comprehensive quantitative and qualitative analysis, highlighting its numerous advantages and improved performance. For video results, please see https://gabriel-huang.github.io/inve/ This paper presents INVE, an interactive video editing tool that allows users to propagate single-frame edits consistently throughout a video using a layered neural atlas representation. Interactive video editing remains a challenging task due to the need for temporally consistent edits, robust object tracking, and real-time performance. Existing methods often fall short in one or more of these areas. INVE builds upon Layered Neural Atlases (LNA) but introduces several key innovations: it boosts training and inference speed with multi-resolution hash grids and a GPU-optimized MLP architecture; it learns bi-directional (inverse) mappings between frames and atlases to enable rigid texture tracking; it supports layered editing of sketches, textures, and local adjustments through separate layers; and it represents sketches as continuous vectorized strokes for artifact-free editing at the frame level. INVE achieves 5x faster training and inference speed compared to LNA. It introduces inverse mapping for more accurate and intuitive texture tracking. Layered editing and vectorized sketching enable a wider range of editing possibilities with improved consistency and reduced artifacts. The method's performance relies heavily on the quality of pre-computed optical flow and object masks. Future work could explore extending INVE to handle longer video sequences and more complex editing operations, such as object removal or insertion. video editing, neural atlas, interactive editing, texture tracking, deep learning
2307.07653 Report RFLA: A Stealthy Reflected Light Adversarial Attack in the Physical World Donghua Wang, Wen Yao, Tingsong Jiang, Chao Li, Xiaoqian Chen Physical adversarial attacks against deep neural networks (DNNs) have recently gained increasing attention. The current mainstream physical attacks use printed adversarial patches or camouflage to alter the appearance of the target object. However, these approaches generate conspicuous adversarial patterns that show poor stealthiness. Another physical deployable attack is the optical attack, featuring stealthiness while exhibiting weakly in the daytime with sunlight. In this paper, we propose a novel Reflected Light Attack (RFLA), featuring effective and stealthy in both the digital and physical world, which is implemented by placing the color transparent plastic sheet and a paper cut of a specific shape in front of the mirror to create different colored geometries on the target object. To achieve these goals, we devise a general framework based on the circle to model the reflected light on the target object. Specifically, we optimize a circle (composed of a coordinate and radius) to carry various geometrical shapes determined by the optimized angle. The fill color of the geometry shape and its corresponding transparency are also optimized. We extensively evaluate the effectiveness of RFLA on different datasets and models. Experiment results suggest that the proposed method achieves over 99% success rate on different datasets and models in the digital world. Additionally, we verify the effectiveness of the proposed method in different physical environments by using sunlight or a flashlight. This paper presents RFLA, a novel physical adversarial attack that exploits reflected light to mislead Deep Neural Networks (DNNs) in both digital and physical environments. Existing physical attacks lack stealth or struggle in strong light conditions. RFLA addresses these limitations by utilizing natural sunlight or artificial light sources like flashlights. RFLA uses a mirror, colored transparent plastic sheets, and paper cut-outs to manipulate reflected light. A circle-based framework optimizes the position, geometry, and color of the reflected light for maximum attack effectiveness. The optimization process leverages the Particle Swarm Optimization (PSO) algorithm. RFLA achieves high attack success rates (over 99%) against various image classification models in the digital world, significantly outperforming existing patch-based and line-based attacks. The attack demonstrates strong transferability across different DNN models, even in physical settings. Physical experiments using reflected sunlight and flashlight confirm RFLA's efficacy in real-world scenarios, successfully attacking image classification and traffic sign recognition models. RFLA's effectiveness might be compromised in adverse weather conditions like fog or rain. Future work will focus on enhancing RFLA's robustness against various environmental factors and exploring more sophisticated defenses against such attacks. adversarial attack, physical attack, reflected light, deep neural networks, particle swarm optimization
2307.07635 Report CoTracker: It is Better to Track Together Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, Christian Rupprecht We introduce CoTracker, a transformer-based model that tracks dense points in a frame jointly across a video sequence. This differs from most existing state-of-the-art approaches that track points independently, ignoring their correlation. We show that joint tracking results in a significantly higher tracking accuracy and robustness. We also provide several technical innovations, including the concept of virtual tracks, which allows CoTracker to track 70k points jointly and simultaneously. Furthermore, CoTracker operates causally on short windows (hence, it is suitable for online tasks), but is trained by unrolling the windows across longer video sequences, which enables and significantly improves long-term tracking. We demonstrate qualitatively impressive tracking results, where points can be tracked for a long time even when they are occluded or leave the field of view. Quantitatively, CoTracker outperforms all recent trackers on standard benchmarks, often by a substantial margin. Introduces CoTracker, a transformer-based model for jointly tracking dense points in videos, significantly improving accuracy and robustness by considering point correlations. Existing point trackers largely ignore correlations between points, leading to suboptimal performance, especially in challenging scenarios like occlusions. CoTracker utilizes a transformer architecture with novel virtual track tokens for efficiency, operates causally on short windows for online tasks, but leverages unrolled training on longer sequences to enhance long-term tracking. Achieves state-of-the-art results on multiple benchmarks (TAP-Vid-DAVIS, PointOdyssey, DynamicReplica), surpassing previous methods by significant margins. Demonstrates the importance of joint tracking, with performance gains observed even when tracking a single target point supported by additional points. Shows the effectiveness of virtual track tokens, allowing for near-dense point tracking on a single GPU. Despite improvements, tracking errors that humans easily avoid can still occur. Limited window size poses challenges for points occluded over long durations, suggesting potential benefits from incorporating global context or offline processing. point tracking, joint tracking, transformer, virtual tracks, unrolled training
2307.07397 Report Improving Zero-Shot Generalization for CLIP with Synthesized Prompts Zhengbo Wang, Jian Liang, Ran He, Nan Xu, Zilei Wang, Tieniu Tan With the growing interest in pretrained vision-language models like CLIP, recent research has focused on adapting these models to downstream tasks. Despite achieving promising results, most existing methods require labeled data for all classes, which may not hold in real-world applications due to the long tail and Zipf's law. For example, some classes may lack labeled data entirely, such as emerging concepts. To address this problem, we propose a plug-and-play generative approach called SyntHesIzed Prompts (SHIP) to improve existing fine-tuning methods. Specifically, we follow variational autoencoders to introduce a generator that reconstructs the visual features by inputting the synthesized prompts and the corresponding class names to the textual encoder of CLIP. In this manner, we easily obtain the synthesized features for the remaining label-only classes. Thereafter, we fine-tune CLIP with off-the-shelf methods by combining labeled and synthesized features. Extensive experiments on base-to-new generalization, cross-dataset transfer learning, and generalized zero-shot learning demonstrate the superiority of our approach. The code is available at https://github.com/mrflogs/SHIP. This paper introduces SHIP, a plug-and-play generative method, to enhance CLIP's performance in few-shot learning scenarios where some classes lack labeled data. Existing CLIP fine-tuning methods often falter when dealing with novel classes without labeled data, hindering their applicability in real-world settings. SHIP employs a VAE-based generator to synthesize prompts for novel classes by leveraging the pre-trained CLIP's language encoder. These synthesized prompts are then used to generate features for novel classes, enabling the use of off-the-shelf fine-tuning methods on both base and novel classes. SHIP consistently improves the performance of existing methods (CoOp, CLIP-Adapter, Tip-Adapter) in base-to-new generalization tasks across various datasets. In cross-dataset transfer learning, SHIP enhances CoOp's accuracy, demonstrating its effectiveness in transferring knowledge to new datasets. For generalized zero-shot learning, SHIP effectively handles the challenge of mixed base and novel classes during testing, surpassing previous methods in unseen class accuracy. SHIP requires additional training, leading to increased computational cost compared to zero-shot CLIP. The effectiveness of SHIP in dense prediction tasks remains unexplored. few-shot learning, vision-language models, clip, generative models, prompt learning
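A hedged sketch of the feature-synthesis step for label-only classes: sample a latent from the VAE prior, decode it into soft prompt vectors, prepend them to the embedded class name, and pass the result through the frozen text encoder to obtain synthetic feature proxies. The `text_encoder` and `class_name_embed` stand-ins below are placeholders, not CLIP's real API, and all shapes and module names are assumptions.

```python
# Hedged sketch of SHIP-style feature synthesis for label-only classes.
import torch
import torch.nn as nn

class PromptGenerator(nn.Module):
    """VAE-style decoder: latent vector -> a few soft prompt vectors."""
    def __init__(self, latent_dim=128, prompt_len=4, embed_dim=512):
        super().__init__()
        self.decode = nn.Linear(latent_dim, prompt_len * embed_dim)
        self.prompt_len, self.embed_dim = prompt_len, embed_dim

    def forward(self, z):
        return self.decode(z).view(-1, self.prompt_len, self.embed_dim)

def synthesize_features(class_name_embed, generator, text_encoder, n_samples=8):
    """Sample latents, decode soft prompts, and encode [prompts; class name] as features."""
    z = torch.randn(n_samples, 128)                    # sample from the VAE prior
    prompts = generator(z)                             # (n, prompt_len, 512)
    name = class_name_embed.expand(n_samples, -1, -1)  # (n, name_len, 512)
    return text_encoder(torch.cat([prompts, name], dim=1))

# Toy stand-ins for the frozen CLIP pieces (assumptions, not the real API):
text_encoder = lambda tokens: tokens.mean(dim=1)       # placeholder pooling "encoder"
class_name_embed = torch.randn(1, 3, 512)              # embedded tokens of a novel class name
generator = PromptGenerator()
fake_feats = synthesize_features(class_name_embed, generator, text_encoder)
print(fake_feats.shape)  # torch.Size([8, 512]) -- combined with real features for fine-tuning
```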
2307.06948 Report Self-regulating Prompts: Foundational Model Adaptation without Forgetting Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan Prompt learning has emerged as an efficient alternative for fine-tuning foundational models, such as CLIP, for various downstream tasks. Conventionally trained using the task-specific objective, i.e., cross-entropy loss, prompts tend to overfit downstream data distributions and find it challenging to capture task-agnostic general features from the frozen CLIP. This leads to the loss of the model's original generalization capability. To address this issue, our work introduces a self-regularization framework for prompting called PromptSRC (Prompting with Self-regulating Constraints). PromptSRC guides the prompts to optimize for both task-specific and task-agnostic general representations using a three-pronged approach by: (a) regulating prompted representations via mutual agreement maximization with the frozen model, (b) regulating with self-ensemble of prompts over the training trajectory to encode their complementary strengths, and (c) regulating with textual diversity to mitigate sample diversity imbalance with the visual branch. To the best of our knowledge, this is the first regularization framework for prompt learning that avoids overfitting by jointly attending to pre-trained model features, the training trajectory during prompting, and the textual diversity. PromptSRC explicitly steers the prompts to learn a representation space that maximizes performance on downstream tasks without compromising CLIP generalization. We perform extensive experiments on 4 benchmarks where PromptSRC overall performs favorably well compared to the existing methods. Our code and pre-trained models are publicly available at: https://github.com/muzairkhattak/PromptSRC. The paper introduces PromptSRC, a self-regularization framework for prompt learning that prevents overfitting and improves generalization in foundational vision-language models like CLIP. Existing prompt learning methods, while effective, tend to overfit to downstream task data, sacrificing the generalization ability of the pre-trained model. PromptSRC aims to address this issue by retaining task-agnostic knowledge while adapting to downstream tasks. PromptSRC utilizes a three-pronged approach: (a) maximizing mutual agreement between prompted and frozen model features, (b) employing a Gaussian-weighted self-ensemble of prompts learned across epochs, and (c) incorporating textual diversity by using multiple text augmentations for pre-trained features. PromptSRC significantly outperforms existing methods in base-to-novel generalization, particularly on novel classes. It shows consistent improvements in few-shot learning, especially in extremely low-data regimes. PromptSRC demonstrates superior performance in domain generalization tasks, indicating its robustness to domain shifts. The paper primarily focuses on image recognition tasks and evaluating its effectiveness on other downstream tasks like image captioning or visual question answering is left for future work. Exploring alternate regularization techniques or more sophisticated prompt aggregation strategies could further enhance performance. prompt learning, clip, regularization, generalization, vision-language models
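The Gaussian-weighted self-ensemble of prompts across the training trajectory can be sketched as a weighted average over per-epoch prompt checkpoints; the mean and standard deviation used below are illustrative, not the paper's settings.

```python
# Minimal sketch of a Gaussian-weighted self-ensemble of prompts saved across epochs
# (the aggregation idea behind PromptSRC; mean/std values here are illustrative).
import torch

def gaussian_weighted_ensemble(prompt_checkpoints, mean_epoch, std_epoch):
    """Average per-epoch prompt tensors with Gaussian weights centred on `mean_epoch`."""
    epochs = torch.arange(len(prompt_checkpoints), dtype=torch.float32)
    weights = torch.exp(-0.5 * ((epochs - mean_epoch) / std_epoch) ** 2)
    weights = weights / weights.sum()
    stacked = torch.stack(prompt_checkpoints)          # (num_epochs, ...)
    shape = (-1,) + (1,) * (stacked.dim() - 1)
    return (weights.view(shape) * stacked).sum(dim=0)

# e.g., 20 epochs of learned prompt vectors of shape (num_tokens, dim)
checkpoints = [torch.randn(4, 512) for _ in range(20)]
ensembled_prompt = gaussian_weighted_ensemble(checkpoints, mean_epoch=15.0, std_epoch=3.0)
print(ensembled_prompt.shape)  # torch.Size([4, 512])
```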
2307.06940 Report Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation Yingqing He, Menghan Xia, Haoxin Chen, Xiaodong Cun, Yuan Gong, Jinbo Xing, Yong Zhang, Xintao Wang, Chao Weng, Ying Shan, Qifeng Chen Generating videos for visual storytelling can be a tedious and complex process that typically requires either live-action filming or graphics animation rendering. To bypass these challenges, our key idea is to utilize the abundance of existing video clips and synthesize a coherent storytelling video by customizing their appearances. We achieve this by developing a framework comprised of two functional modules: (i) Motion Structure Retrieval, which provides video candidates with desired scene or motion context described by query texts, and (ii) Structure-Guided Text-to-Video Synthesis, which generates plot-aligned videos under the guidance of motion structure and text prompts. For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure. For the second module, we propose a controllable video generation model that offers flexible controls over structure and characters. The videos are synthesized by following the structural guidance and appearance instruction. To ensure visual consistency across clips, we propose an effective concept personalization approach, which allows the specification of the desired character identities through text prompts. Extensive experiments demonstrate that our approach exhibits significant advantages over various existing baselines. Introduces a novel retrieval-based pipeline for storytelling video synthesis, enabling better quality, layout/motion control, and character personalization for character-consistent storytelling videos. Addresses the limitations of current text-to-video generation techniques in creating engaging and coherent storytelling videos. Leverages existing video content for structure guidance in a text-to-video generation model. Employs a new personalization method, TimeInv, for consistent character rendering across video clips, and tackles the conflict between structure and character generation through adjustable depth control. Retrieval-augmented video generation significantly improves quality and controllability compared to text-only generation. TimeInv outperforms baseline personalization approaches in achieving consistent character appearance and compositionality. Adjustable depth control effectively mitigates the conflict between motion guidance and character fidelity. Exploration of a general character control mechanism without fine-tuning is needed. Further research on better cooperation strategies between character and structure control is crucial. story visualization, video diffusion models, retrieval-augmented generation, personalized generation, text-to-video synthesis
2307.06925 Report Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models Moab Arar, Rinon Gal, Yuval Atzmon, Gal Chechik, Daniel Cohen-Or, Ariel Shamir, Amit H. Bermano Text-to-image (T2I) personalization allows users to guide the creative image generation process by combining their own visual concepts in natural language prompts. Recently, encoder-based techniques have emerged as a new effective approach for T2I personalization, reducing the need for multiple images and long training times. However, most existing encoders are limited to a single-class domain, which hinders their ability to handle diverse concepts. In this work, we propose a domain-agnostic method that does not require any specialized dataset or prior information about the personalized concepts. We introduce a novel contrastive-based regularization technique to maintain high fidelity to the target concept characteristics while keeping the predicted embeddings close to editable regions of the latent space, by pushing the predicted tokens toward their nearest existing CLIP tokens. Our experimental results demonstrate the effectiveness of our approach and show how the learned tokens are more semantic than tokens predicted by unregularized models. This leads to a better representation that achieves state-of-the-art performance while being more flexible than previous methods. This paper presents a domain-agnostic tuning-encoder for fast personalization of text-to-image models, enabling one-shot inference-time tuning for diverse concepts. Existing encoder-based text-to-image personalization methods are limited to single-class domains, hindering their applicability to diverse concepts. The method leverages contrastive-based regularization to predict embeddings near semantically related words and employs a hyper-network to capture concept-specific features. A dual-path adaptation approach using hard and soft prompts is used during a brief inference-time tuning phase. The contrastive regularization improves embedding quality and prevents overfitting. The method achieves comparable quality to state-of-the-art methods using only a single image and fewer training steps. Ablation studies highlight the importance of regularization, fine-tuning, and the hyper-network. The method's performance is limited by the training data, potentially struggling with domains poorly represented in the dataset. While reduced, a tuning step is still required to enhance downstream similarity. text-to-image synthesis, personalization, domain-agnostic, tuning-encoder, contrastive learning
2307.06526 Report AvatarFusion: Zero-shot Generation of Clothing-Decoupled 3D Avatars Using 2D Diffusion Shuo Huang, Zongxin Yang, Liangting Li, Yi Yang, Jia Jia Large-scale pre-trained vision-language models allow for the zero-shot text-based generation of 3D avatars. The previous state-of-the-art method utilized CLIP to supervise neural implicit models that reconstructed a human body mesh. However, this approach has two limitations. Firstly, the lack of avatar-specific models can cause facial distortion and unrealistic clothing in the generated avatars. Secondly, CLIP only provides optimization direction for the overall appearance, resulting in less impressive results. To address these limitations, we propose AvatarFusion, the first framework to use a latent diffusion model to provide pixel-level guidance for generating human-realistic avatars while simultaneously segmenting clothing from the avatar's body. AvatarFusion includes the first clothing-decoupled neural implicit avatar model that employs a novel Dual Volume Rendering strategy to render the decoupled skin and clothing sub-models in one space. We also introduce a novel optimization method, called Pixel-Semantics Difference-Sampling (PS-DS), which semantically separates the generation of body and clothes, and generates a variety of clothing styles. Moreover, we establish the first benchmark for zero-shot text-to-avatar generation. Our experimental results demonstrate that our framework outperforms previous approaches, with significant improvements observed in all metrics. Additionally, since our model is clothing-decoupled, we can exchange the clothes of avatars. Code are available on our project page https://hansenhuang0823.github.io/AvatarFusion. AvatarFusion is the first zero-shot text-to-3D-avatar generation framework that decouples clothing from the avatar model, allowing for more realistic avatars and clothing exchange between avatars. Existing methods for generating 3D avatars from text suffer from facial distortion, unrealistic clothing, and limited detail due to the lack of avatar-specific models and the limitations of using CLIP for optimization. AvatarFusion leverages a clothing-decoupled neural implicit avatar model with a dual volume rendering strategy and a novel optimization method called Pixel-Semantics Difference-Sampling (PS-DS), which utilizes a latent diffusion model for pixel-level guidance. AvatarFusion outperforms baselines in both quantitative and qualitative evaluations on the newly proposed Famous-Character-50 benchmark. The generated avatars exhibit superior facial details, more realistic clothing, and better alignment with text prompts. The clothing-decoupled model enables clothing exchange between different avatars. Currently, the method cannot generate a realistic backside due to the limited responsiveness of vision-language models. Future work may focus on addressing the backside generation issue and improving the generation of loose clothing. 3d avatar generation, zero-shot learning, diffusion models, clothing decoupling, neural implicit surfaces
2307.06304 Report Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim Alabdulmohsin, Avital Oliver, Piotr Padlewski, Alexey Gritsenko, Mario Lučić, Neil Houlsby The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution before processing them with computer vision models has not yet been successfully challenged. However, models such as the Vision Transformer (ViT) offer flexible sequence-based modeling, and hence varying input sequence lengths. We take advantage of this with NaViT (Native Resolution ViT) which uses sequence packing during training to process inputs of arbitrary resolutions and aspect ratios. Alongside flexible model usage, we demonstrate improved training efficiency for large-scale supervised and contrastive image-text pretraining. NaViT can be efficiently transferred to standard tasks such as image and video classification, object detection, and semantic segmentation and leads to improved results on robustness and fairness benchmarks. At inference time, the input resolution flexibility can be used to smoothly navigate the test-time cost-performance trade-off. We believe that NaViT marks a departure from the standard, CNN-designed, input and modelling pipeline used by most computer vision models, and represents a promising direction for ViTs. The paper introduces NaViT, a Vision Transformer that processes images at their native resolution using a technique called "Patch n' Pack" which allows packing patches from multiple images into a single sequence, improving training efficiency and enabling flexible input resolutions and aspect ratios. Current computer vision models resize images to a fixed resolution before processing, which can harm performance and is computationally inefficient. NaViT addresses these limitations by allowing for variable input sizes. NaViT leverages the sequence-based nature of Vision Transformers and introduces masked self-attention, masked pooling, and factorized positional embeddings to handle variable resolution and aspect ratios. This allows packing patches from multiple images into a single sequence, significantly accelerating training. NaViT achieves superior training efficiency, matching the performance of top-performing ViT models with 4 times less compute. The model allows for variable-resolution finetuning, achieving comparable performance to fixed-resolution finetuning while providing greater flexibility. NaViT demonstrates improved out-of-distribution generalization, particularly on datasets with extreme aspect ratios. The paper mainly focuses on image classification and acknowledges the need for further exploration of NaViT's capabilities in downstream tasks like object detection and semantic segmentation. While NaViT demonstrates promising results, more research is needed to fully explore the potential of Patch n' Pack in other Vision Transformer architectures and applications. vision transformers, native resolution, sequence packing, variable input size, training efficiency
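A toy illustration of the Patch n' Pack idea: greedily pack variable-length patch sequences from different images into fixed-size token buffers and record a per-token example id so that self-attention can later be masked per image. The token budget and sequence lengths are made up.

```python
# Toy illustration of "Patch n' Pack"-style sequence packing.
import numpy as np

def pack_sequences(seq_lengths, budget):
    """First-fit packing: returns a list of packs, each a list of (image_idx, length)."""
    packs = []
    for idx, length in enumerate(seq_lengths):
        for pack in packs:
            if sum(l for _, l in pack) + length <= budget:
                pack.append((idx, length))
                break
        else:
            packs.append([(idx, length)])
    return packs

def example_ids(pack, budget, pad_id=-1):
    """Per-token example ids for one pack; padding gets pad_id (used to mask attention)."""
    ids = np.full(budget, pad_id, dtype=np.int64)
    cursor = 0
    for idx, length in pack:
        ids[cursor:cursor + length] = idx
        cursor += length
    return ids

lengths = [196, 77, 300, 144, 49]          # tokens per image at its native resolution
packs = pack_sequences(lengths, budget=512)
print(packs)                                # [[(0, 196), (1, 77), (3, 144), (4, 49)], [(2, 300)]]
print(example_ids(packs[0], budget=512)[:10])
```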
2307.06281 Report MMBench: Is Your Multi-modal Model an All-around Player? Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, Dahua Lin Large vision-language models have recently achieved remarkable progress, exhibiting great perception and reasoning abilities concerning visual information. However, how to effectively evaluate these large vision-language models remains a major obstacle, hindering future model development. Traditional benchmarks like VQAv2 or COCO Caption provide quantitative performance measurements but suffer from a lack of fine-grained ability assessment and non-robust evaluation metrics. Recent subjective benchmarks, such as OwlEval, offer comprehensive evaluations of a model's abilities by incorporating human labor, but they are not scalable and display significant bias. In response to these challenges, we propose MMBench, a novel multi-modality benchmark. MMBench methodically develops a comprehensive evaluation pipeline, primarily comprised of two elements. The first element is a meticulously curated dataset that surpasses existing similar benchmarks in terms of the number and variety of evaluation questions and abilities. The second element introduces a novel CircularEval strategy and incorporates the use of ChatGPT. This implementation is designed to convert free-form predictions into pre-defined choices, thereby facilitating a more robust evaluation of the model's predictions. MMBench is a systematically-designed objective benchmark for robustly evaluating the various abilities of vision-language models. We hope MMBench will assist the research community in better evaluating their models and encourage future advancements in this domain. Project page: https://opencompass.org.cn/mmbench. This paper introduces MMBench, a bilingual benchmark designed for robust and holistic evaluation of multi-modal capabilities of large vision-language models (VLMs). Evaluating VLMs effectively is crucial for further development, but existing benchmarks lack either fine-grained ability assessment or scalability and suffer from bias. MMBench utilizes a hierarchical ability taxonomy, rigorous quality control, and a novel circular evaluation strategy (CircularEval) with LLM-assisted choice extraction. MMBench surpasses existing benchmarks in the number and variety of evaluation questions and abilities, covering 20 fine-grained skills. GPT-4 achieves a 91.5% alignment rate with human evaluation in choice extraction, demonstrating its robustness in handling free-form VLM outputs. Comprehensive evaluation of various VLMs on MMBench reveals performance gaps and provides insights for future optimization, especially highlighting challenges in understanding low-level visual features, structuralized inputs, and spatial relationships. The paper acknowledges potential bias in the initial English-centric data collection of MMBench. Future work may involve expanding the benchmark with more challenging scenarios, such as incorporating video understanding or interactive tasks. vision-language models, multi-modal benchmark, evaluation, circulareval, llm-assisted choice extraction
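CircularEval can be summarized in a few lines: a multiple-choice question only counts as solved if the model selects the correct option under every circular shift of the answer choices. The `model_predict` callable below is a placeholder for a VLM plus MMBench's LLM-based choice extraction.

```python
# Simplified CircularEval (placeholder model; not the MMBench evaluation code).
from typing import Callable, List

def circular_eval(question: str, choices: List[str], answer_idx: int,
                  model_predict: Callable[[str, List[str]], int]) -> bool:
    n = len(choices)
    for shift in range(n):
        shifted = choices[shift:] + choices[:shift]          # rotate the options
        shifted_answer = (answer_idx - shift) % n            # where the answer moved to
        if model_predict(question, shifted) != shifted_answer:
            return False
    return True

# Toy model that always answers option index 1 -- fails CircularEval as expected.
always_b = lambda q, opts: 1
print(circular_eval("What color is the sky?", ["red", "blue", "green", "pink"], 1, always_b))
```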
2307.05977 Report Towards Safe Self-Distillation of Internet-Scale Text-to-Image Diffusion Models Sanghyun Kim, Seohyeon Jung, Balhae Kim, Moonseok Choi, Jinwoo Shin, Juho Lee Large-scale image generation models, with impressive quality made possible by the vast amount of data available on the Internet, raise social concerns that these models may generate harmful or copyrighted content. The biases and harmfulness arise throughout the entire training process and are hard to completely remove, which have become significant hurdles to the safe deployment of these models. In this paper, we propose a method called SDD to prevent problematic content generation in text-to-image diffusion models. We self-distill the diffusion model to guide the noise estimate conditioned on the target removal concept to match the unconditional one. Compared to the previous methods, our method eliminates a much greater proportion of harmful content from the generated images without degrading the overall image quality. Furthermore, our method allows the removal of multiple concepts at once, whereas previous works are limited to removing a single concept at a time. This paper introduces SDD, a self-distillation method for text-to-image diffusion models, to prevent the generation of harmful or copyrighted content. Large-scale image generation models, trained on vast internet data, risk generating harmful or copyrighted content, posing a significant challenge to their safe deployment. Existing detoxification methods are often insufficient and can degrade image quality. SDD fine-tunes the diffusion model using self-distillation, guiding the noise estimate conditioned on the target removal concept to match the unconditional one. An EMA teacher model is employed to mitigate catastrophic forgetting during fine-tuning. SDD effectively removes a greater proportion of harmful content from generated images compared to previous methods, as demonstrated by experiments on NSFW and artist concept removal. SDD exhibits minimal interference with other concepts in the generated images, preserving the overall image quality and user intent. The use of an EMA teacher model in SDD helps maintain image quality and details more effectively compared to directly fine-tuning the student model. The method may not completely remove all problematic content and could still have minor impact on image quality. The research primarily focuses on NSFW and artist concept removal, with limited exploration of other harmful content types. text-to-image generation, diffusion models, safe ai, content moderation, self-distillation
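The self-distillation objective can be sketched as matching the student's concept-conditioned noise prediction to the EMA teacher's unconditional prediction; the toy U-Net and tensor shapes below are stand-ins so the example runs, not the actual Stable Diffusion components.

```python
# Hedged sketch of an SDD-style self-distillation loss with an EMA teacher.
import copy
import torch
import torch.nn.functional as F

def sdd_loss(unet_student, unet_teacher, x_t, t, concept_emb, null_emb):
    with torch.no_grad():
        target = unet_teacher(x_t, t, null_emb)       # unconditional estimate (EMA teacher)
    pred = unet_student(x_t, t, concept_emb)          # conditioned on the concept to erase
    return F.mse_loss(pred, target)

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(decay).add_(p_s, alpha=1.0 - decay)

# Toy stand-in U-Net so the sketch runs end to end (not a real diffusion U-Net).
class ToyUNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(768, 4)
    def forward(self, x, t, cond):
        return x + self.proj(cond.mean(dim=1))[:, :, None, None]

student = ToyUNet()
teacher = copy.deepcopy(student)
x_t = torch.randn(2, 4, 8, 8)
t = torch.randint(0, 1000, (2,))
concept = torch.randn(2, 77, 768)                     # text embedding of the target concept
null = torch.zeros(2, 77, 768)                        # "empty" (unconditional) embedding
loss = sdd_loss(student, teacher, x_t, t, concept, null)
loss.backward()
ema_update(teacher, student)
print(loss.item())
```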
2307.05892 Report SC-NeuS: Consistent Neural Surface Reconstruction from Sparse and Noisy Views Shi-Sheng Huang, Zi-Xin Zou, Yi-Chi Zhang, Hua Huang The recent neural surface reconstruction by volume rendering approaches have made much progress by achieving impressive surface reconstruction quality, but are still limited to dense and highly accurate posed views. To overcome such drawbacks, this paper pays special attention on the consistent surface reconstruction from sparse views with noisy camera poses. Unlike previous approaches, the key difference of this paper is to exploit the multi-view constraints directly from the explicit geometry of the neural surface, which can be used as effective regularization to jointly learn the neural surface and refine the camera poses. To build effective multi-view constraints, we introduce a fast differentiable on-surface intersection to generate on-surface points, and propose view-consistent losses based on such differentiable points to regularize the neural surface learning. Based on this point, we propose a jointly learning strategy for neural surface and camera poses, named SC-NeuS, to perform geometry-consistent surface reconstruction in an end-to-end manner. With extensive evaluation on public datasets, our SC-NeuS can achieve consistently better surface reconstruction results with fine-grained details than previous state-of-the-art neural surface reconstruction approaches, especially from sparse and noisy camera views. This paper presents SC-NeuS, a novel learning framework for geometry-consistent neural surface reconstruction from sparse views with noisy camera poses, leveraging multi-view constraints derived directly from the explicit geometry of the neural surface. Existing neural surface reconstruction methods often struggle with sparse and noisy input, limiting their applicability in real-world scenarios where dense, high-quality data acquisition is challenging. The method introduces a fast differentiable on-surface intersection to sample points on the neural surface. These points are then used to define view-consistent losses, regularizing the joint learning of the neural surface representation and camera poses in an end-to-end manner. A coarse-to-fine learning strategy further enhances reconstruction accuracy. SC-NeuS achieves state-of-the-art surface reconstruction quality from sparse and noisy views, outperforming existing methods like BARF, IDR, and NeuS-BARF on public datasets like DTU and BlendedMVS. The proposed method demonstrates superior accuracy in both camera pose estimation and surface reconstruction compared to baselines. Ablation studies confirm the effectiveness of the view-consistent re-projection and patch-warping losses in improving both the geometric accuracy and fine-grained detail of the reconstructed surfaces. The method's performance depends on the quality of 2D feature matching, which can be challenging in low-texture or illumination-varying scenes. Large camera pose variations between sparse views may hinder effective joint optimization. neural surface reconstruction, sparse view reconstruction, camera pose estimation, multi-view constraints, differentiable rendering
2307.05707 Report MoP-CLIP: A Mixture of Prompt-Tuned CLIP Models for Domain Incremental Learning Julien Nicolas, Florent Chiaroni, Imtiaz Ziko, Ola Ahmad, Christian Desrosiers, Jose Dolz Despite the recent progress in incremental learning, addressing catastrophic forgetting under distributional drift is still an open and important problem. Indeed, while state-of-the-art domain incremental learning (DIL) methods perform satisfactorily within known domains, their performance largely degrades in the presence of novel domains. This limitation hampers their generalizability, and restricts their scalability to more realistic settings where train and test data are drawn from different distributions. To address these limitations, we present a novel DIL approach based on a mixture of prompt-tuned CLIP models (MoP-CLIP), which generalizes the paradigm of S-Prompting to handle both in-distribution and out-of-distribution data at inference. In particular, at the training stage we model the features distribution of every class in each domain, learning individual text and visual prompts to adapt to a given domain. At inference, the learned distributions allow us to identify whether a given test sample belongs to a known domain, selecting the correct prompt for the classification task, or from an unseen domain, leveraging a mixture of the prompt-tuned CLIP models. Our empirical evaluation reveals the poor performance of existing DIL methods under domain shift, and suggests that the proposed MoP-CLIP performs competitively in the standard DIL settings while outperforming state-of-the-art methods in OOD scenarios. These results demonstrate the superiority of MoP-CLIP, offering a robust and general solution to the problem of domain incremental learning. This paper introduces MoP-CLIP, an exemplar-free domain incremental learning (DIL) approach based on a mixture of prompt-tuned CLIP models, addressing the limitations of existing methods in handling distributional drift and generalizing to unseen domains. Current DIL methods struggle with performance degradation under distributional shifts between training and testing data, limiting their applicability in real-world scenarios where such shifts are common. MoP-CLIP learns class-wise feature distributions for each domain during training, enabling it to identify whether a test sample belongs to a known domain and select the appropriate prompt, or to an unseen domain, triggering a mixture of prompts for prediction. MoP-CLIP achieves competitive performance on known domains compared to state-of-the-art DIL methods. It significantly outperforms existing exemplar-free methods in scenarios with domain distributional shifts. The paper provides empirical evidence of the limitations of existing DIL methods under domain shift and demonstrates the effectiveness of the proposed approach through extensive experiments. The assumption of isotropic Gaussian distribution for features around prototypes, while simplifying the model, might not hold for all datasets and could be explored further. Further investigation into alternative distributions beyond Gaussian for modeling distances to prototypes, such as Weibull or Generalized Pareto, could be beneficial. domain incremental learning, prompt learning, distributional drift, out-of-distribution generalization, clip
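One way to picture the inference-time routing is sketched below: score a test feature against per-domain class prototypes modeled as isotropic Gaussians, use the best-matching domain's prompt when the feature is in-distribution, and otherwise fall back to a mixture over all prompt-tuned experts. The threshold, shapes, and the exact scoring rule are assumptions for illustration only.

```python
# Illustrative MoP-CLIP-style routing between known-domain prompts and a mixture fallback.
import numpy as np

def route(feature, domain_prototypes, domain_sigmas, threshold=2.5):
    """domain_prototypes: dict domain -> (num_classes, dim); returns a domain name or None."""
    best_domain, best_dist = None, np.inf
    for d, protos in domain_prototypes.items():
        dists = np.linalg.norm(protos - feature, axis=1) / domain_sigmas[d]
        if dists.min() < best_dist:
            best_domain, best_dist = d, dists.min()
    return best_domain if best_dist < threshold else None   # None -> treat as unseen domain

rng = np.random.default_rng(0)
protos = {"photo": rng.normal(0, 1, (10, 512)), "sketch": rng.normal(3, 1, (10, 512))}
sigmas = {"photo": 1.0, "sketch": 1.0}
test_feat = protos["photo"][0] + rng.normal(0, 0.05, 512)   # near a known "photo" prototype
domain = route(test_feat, protos, sigmas)
print("use prompt of:", domain if domain else "mixture of all prompts")
```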
2307.05473 Report Differentiable Blocks World: Qualitative 3D Decomposition by Rendering Primitives Tom Monnier, Jake Austin, Angjoo Kanazawa, Alexei A. Efros, Mathieu Aubry Given a set of calibrated images of a scene, we present an approach that produces a simple, compact, and actionable 3D world representation by means of 3D primitives. While many approaches focus on recovering high-fidelity 3D scenes, we focus on parsing a scene into mid-level 3D representations made of a small set of textured primitives. Such representations are interpretable, easy to manipulate and suited for physics-based simulations. Moreover, unlike existing primitive decomposition methods that rely on 3D input data, our approach operates directly on images through differentiable rendering. Specifically, we model primitives as textured superquadric meshes and optimize their parameters from scratch with an image rendering loss. We highlight the importance of modeling transparency for each primitive, which is critical for optimization and also enables handling varying numbers of primitives. We show that the resulting textured primitives faithfully reconstruct the input images and accurately model the visible 3D points, while providing amodal shape completions of unseen object regions. We compare our approach to the state of the art on diverse scenes from DTU, and demonstrate its robustness on real-life captures from BlendedMVS and Nerfstudio. We also showcase how our results can be used to effortlessly edit a scene or perform physical simulations. Code and video results are available at https://www.tmonnier.com/DBW . This paper introduces Differentiable Blocks World (DBW), an end-to-end method for reconstructing 3D scenes from calibrated images using a set of textured superquadric primitives. Existing multi-view modeling approaches, while highly accurate, often produce dense, uninterpretable representations. DBW addresses this by providing a compact, interpretable, and manipulable scene representation suitable for tasks like physics-based simulations and scene editing. DBW optimizes the parameters of superquadric meshes and their UV textures directly from images by minimizing a rendering loss. A key innovation is the modeling of primitive transparency, facilitating handling of varying primitive numbers and occlusions. DBW accurately reconstructs visible 3D points and faithfully reconstructs input images on DTU benchmark. Outperforms state-of-the-art 3D decomposition methods (EMS, MonteBoxFinder) applied on ground-truth point clouds in terms of interpretability and accuracy. Demonstrates robustness on real-life captures (Nerfstudio, BlendedMVS), enabling applications like amodal scene completion, scene editing, and physics-based simulations. DBW can sometimes converge to suboptimal solutions, missing parts or yielding unnatural decompositions. Automatic selection among multiple runs mitigates this but increases computational cost. 3d reconstruction, primitive-based representation, differentiable rendering, multi-view stereo, scene understanding
2307.05468 Report My3DGen: A Scalable Personalized 3D Generative Model Luchao Qi, Jiaye Wu, Annie N. Wang, Shengze Wang, Roni Sengupta In recent years, generative 3D face models (e.g., EG3D) have been developed to tackle the problem of synthesizing photo-realistic faces. However, these models are often unable to capture facial features unique to each individual, highlighting the importance of personalization. Some prior works have shown promise in personalizing generative face models, but these studies primarily focus on 2D settings. Also, these methods require both fine-tuning and storing a large number of parameters for each user, posing a hindrance to achieving scalable personalization. Another challenge of personalization is the limited number of training images available for each individual, which often leads to overfitting when using full fine-tuning methods. Our proposed approach, My3DGen, generates a personalized 3D prior of an individual using as few as 50 training images. My3DGen allows for novel view synthesis, semantic editing of a given face (e.g. adding a smile), and synthesizing novel appearances, all while preserving the original person's identity. We decouple the 3D facial features into global features and personalized features by freezing the pre-trained EG3D and training additional personalized weights through low-rank decomposition. As a result, My3DGen introduces only 240K personalized parameters per individual, leading to a 127× reduction in trainable parameters compared to the 30.6M required for fine-tuning the entire parameter space. Despite this significant reduction in storage, our model preserves identity features without compromising the quality of downstream applications. My3DGen is a novel approach for creating personalized 3D generative priors for individuals using as few as 50 training images. This allows for novel view synthesis, semantic editing, and novel appearance synthesis while preserving identity. Current 3D generative face models struggle to capture and manipulate individual facial features without distorting identity. Personalizing these models is crucial for enhancing realism in various applications but often faces scalability issues due to large parameter storage requirements. My3DGen uses a pre-trained EG3D model for global facial features and learns personalized features via low-rank adaptation (LoRA). This method decomposes convolutional and fully-connected layer weights, drastically reducing the number of trainable parameters compared to full fine-tuning. My3DGen outperforms pre-trained EG3D in 3D reconstruction, novel appearance synthesis, image enhancement, and semantic editing while preserving identity. Despite using significantly fewer trainable parameters, My3DGen achieves comparable results to fully fine-tuning a pre-trained model. Analysis shows that personalizing earlier layers of StyleGAN2, responsible for coarse facial features, has the most impact on quality and identity preservation. My3DGen faces difficulties reconstructing faces heavily obscured by objects. The model struggles with heavily cropped faces where boundaries are filled with padded values. personalization, 3d-gan, 3d face, lora, generative models
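The low-rank decomposition at the heart of the parameter-efficient personalization is standard LoRA; a generic sketch for a linear layer is shown below (My3DGen applies the same idea to EG3D's convolutional and fully-connected weights). Rank, sizes, and initialization are illustrative, not the authors' implementation.

```python
# Generic low-rank adaptation (LoRA) of a frozen linear layer.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                     # global, frozen weights
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank                       # only the low-rank factors are trainable

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(512, 512), rank=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 4*512 + 512*4 = 4096 personalized parameters vs. 262,656 in the base layer
```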
2307.05462 Report Efficient 3D Articulated Human Generation with Layered Surface Volumes Yinghao Xu, Wang Yifan, Alexander W. Bergman, Menglei Chai, Bolei Zhou, Gordon Wetzstein Access to high-quality and diverse 3D articulated digital human assets is crucial in various applications, ranging from virtual reality to social platforms. Generative approaches, such as 3D generative adversarial networks (GANs), are rapidly replacing laborious manual content creation tools. However, existing 3D GAN frameworks typically rely on scene representations that leverage either template meshes, which are fast but offer limited quality, or volumes, which offer high capacity but are slow to render, thereby limiting the 3D fidelity in GAN settings. In this work, we introduce layered surface volumes (LSVs) as a new 3D object representation for articulated digital humans. LSVs represent a human body using multiple textured mesh layers around a conventional template. These layers are rendered using alpha compositing with fast differentiable rasterization, and they can be interpreted as a volumetric representation that allocates its capacity to a manifold of finite thickness around the template. Unlike conventional single-layer templates that struggle with representing fine off-surface details like hair or accessories, our surface volumes naturally capture such details. LSVs can be articulated, and they exhibit exceptional efficiency in GAN settings, where a 2D generator learns to synthesize the RGBA textures for the individual layers. Trained on unstructured, single-view 2D image datasets, our LSV-GAN generates high-quality and view-consistent 3D articulated digital humans without the need for view-inconsistent 2D upsampling networks. This paper introduces Layered Surface Volumes (LSVs), a novel 3D representation for articulated digital humans, and uses it in a GAN framework (LSV-GAN) to generate high-quality, animatable human bodies from single-view images. High-quality 3D human assets are important for various applications, but existing generation methods struggle to balance realism, efficiency, and the ability to capture fine details like hair. This work aims to address these limitations. LSVs represent a human body using multiple textured mesh layers around a template mesh (SMPL). These layers, textured with color and transparency, are efficiently rendered using alpha compositing and differentiable rasterization. A 2D GAN generator learns to synthesize these textures from single-view images. LSV-GAN achieves state-of-the-art quality and diversity in generated 3D humans, outperforming baselines in FID and PCK metrics on multiple datasets. The method maintains excellent multi-view consistency, thanks to the use of rasterization and the absence of view-inconsistent upsampling networks. LSV-GAN is computationally efficient, achieving fast training and rendering times due to the use of LSVs and differentiable rasterization. The level of detail in generated results is limited by the image resolution. Realistic motion of hair and clothes is limited by the use of linear blend skinning. 3d human generation, generative adversarial networks (gans), layered surface volumes (lsvs), differentiable rasterization, articulated human body
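The rendering step behind layered surface volumes is front-to-back alpha compositing of the rasterized RGBA texture layers; a minimal sketch, assuming the per-layer RGBA maps have already been rasterized and ordered front to back, is shown below.

```python
# Minimal front-to-back alpha compositing of K rasterized RGBA layers.
import torch

def composite_layers(rgba):  # rgba: (K, 4, H, W) with values in [0, 1], front first
    color = torch.zeros_like(rgba[0, :3])
    transmittance = torch.ones_like(rgba[0, :1])
    for k in range(rgba.shape[0]):
        rgb, alpha = rgba[k, :3], rgba[k, 3:4]
        color = color + transmittance * alpha * rgb   # accumulate visible contribution
        transmittance = transmittance * (1.0 - alpha) # light remaining for layers behind
    return color

layers = torch.rand(3, 4, 64, 64)      # e.g., 3 textured mesh layers around the template
image = composite_layers(layers)
print(image.shape)                     # torch.Size([3, 64, 64])
```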
2307.05445 Report AutoDecoding Latent 3D Diffusion Models Evangelos Ntavelis, Aliaksandr Siarohin, Kyle Olszewski, Chaoyang Wang, Luc Van Gool, Sergey Tulyakov We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core. The 3D autodecoder framework embeds properties learned from the target dataset in the latent space, which can then be decoded into a volumetric representation for rendering view-consistent appearance and geometry. We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations to learn a 3D diffusion from 2D images or monocular videos of rigid or articulated objects. Our approach is flexible enough to use either existing camera supervision or no camera information at all -- instead efficiently learning it during training. Our evaluations demonstrate that our generation results outperform state-of-the-art alternatives on various benchmark datasets and metrics, including multi-view image datasets of synthetic objects, real in-the-wild videos of moving people, and a large-scale, real video dataset of static objects. This paper introduces 3DVADER, a novel two-stage approach for generating static and articulated 3D assets using a 3D autodecoder and a latent 3D diffusion model. Existing 3D generative methods struggle with the limitations of 2D training data and the lack of standard representations for 3D geometry. 3DVADER overcomes these by using a volumetric autodecoder to learn 3D representations from 2D images or videos, enabling generation of diverse and realistic 3D objects. The first stage trains a volumetric autodecoder to learn latent representations of objects from multi-view images or monocular videos. The second stage trains a 3D diffusion model in the compact latent space of the autodecoder, enabling efficient generation of diverse 3D content. 3DVADER outperforms state-of-the-art methods on benchmark datasets, including multi-view images of synthetic objects and real in-the-wild videos of humans. The method is scalable to large, multi-category datasets, exceeding the capacity of previous 3D diffusion models. Robust normalization and denormalization operations are introduced to identify and operate within the appropriate latent space of the autodecoder for optimal diffusion. The method currently focuses on single-object scenes, limiting its application to more complex multi-object scenarios. It requires multi-view images or video sequences for training, restricting its use with single-image datasets. 3d generation, diffusion models, autodecoders, volumetric rendering, neural rendering
2307.05222 Report Emu: Generative Pretraining in Multimodality Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, Xinlong Wang We present Emu, a Transformer-based multimodal foundation model, which can seamlessly generate images and texts in multimodal context. This omnivore model can take in any single-modality or multimodal data input indiscriminately (e.g., interleaved image, text and video) through a one-model-for-all autoregressive training process. First, visual signals are encoded into embeddings, and together with text tokens form an interleaved input sequence. Emu is then end-to-end trained with a unified objective of classifying the next text token or regressing the next visual embedding in the multimodal sequence. This versatile multimodality empowers the exploration of diverse pretraining data sources at scale, such as videos with interleaved frames and text, webpages with interleaved images and text, as well as web-scale image-text pairs and video-text pairs. Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks, and supports in-context image and text generation. Across a broad range of zero-shot/few-shot tasks including image captioning, visual question answering, video question answering and text-to-image generation, Emu demonstrates superb performance compared to state-of-the-art large multimodal models. Extended capabilities such as multimodal assistants via instruction tuning are also demonstrated with impressive performance. Introducing Emu, a large multimodal model trained to predict the next element in interleaved visual and textual sequences, enabling it to perform diverse multimodal tasks like image captioning, visual question answering, and text-to-image generation. Emu leverages the power of LLMs and diverse web-scale data, including a novel video-text interleaved dataset, to achieve strong performance in zero-shot and few-shot settings on various tasks, advancing the capabilities of multimodal models. Emu utilizes a unified autoregressive training objective with a visual encoder (EVA-CLIP), causal transformer for visual sequence modeling, multimodal modeling LLM (LLaMA), and a visual decoder (Stable Diffusion) for image generation. Emu demonstrates state-of-the-art performance on multiple zero-shot and few-shot benchmarks, outperforming existing large multimodal models. The model exhibits strong in-context learning abilities, improving performance with more in-context examples. Emu showcases impressive qualitative capabilities like image blending, in-context text and image generation, and real-world knowledge grounding. Emu is primarily trained on English-language data, limiting its proficiency in other languages. Like other LLMs and LMMs, Emu is susceptible to hallucinations, slow inference speed, and potential biases from training data. multimodal learning, large language models, image captioning, visual question answering, text-to-image generation
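A hedged sketch of the unified next-element objective: cross-entropy on positions whose target is a text token plus a regression loss on positions whose target is a visual embedding. The use of a mean-squared-error regression term and all tensor shapes are assumptions, not Emu's exact formulation.

```python
# Hedged sketch of a unified next-token / next-embedding objective on an interleaved sequence.
import torch
import torch.nn.functional as F

def unified_loss(text_logits, text_targets, visual_pred, visual_targets, reg_weight=1.0):
    """text_logits: (Nt, vocab), text_targets: (Nt,); visual_pred/visual_targets: (Nv, d)."""
    ce = F.cross_entropy(text_logits, text_targets)    # classify the next text token
    reg = F.mse_loss(visual_pred, visual_targets)      # regress the next visual embedding
    return ce + reg_weight * reg

text_logits = torch.randn(10, 32000)                   # predictions at text positions
text_targets = torch.randint(0, 32000, (10,))
visual_pred = torch.randn(4, 1024)                     # predictions at image-embedding positions
visual_targets = torch.randn(4, 1024)
print(unified_loss(text_logits, text_targets, visual_pred, visual_targets).item())
```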
2307.05134 Report TIAM -- A Metric for Evaluating Alignment in Text-to-Image Generation Paul Grimal, Hervé Le Borgne, Olivier Ferret, Julien Tourille The progress in the generation of synthetic images has made it crucial to assess their quality. While several metrics have been proposed to assess the rendering of images, it is crucial for Text-to-Image (T2I) models, which generate images based on a prompt, to consider additional aspects such as to which extent the generated image matches the important content of the prompt. Moreover, although the generated images usually result from a random starting point, the influence of this one is generally not considered. In this article, we propose a new metric based on prompt templates to study the alignment between the content specified in the prompt and the corresponding generated images. It allows us to better characterize the alignment in terms of the type of the specified objects, their number, and their color. We conducted a study on several recent T2I models about various aspects. An additional interesting result we obtained with our approach is that image quality can vary drastically depending on the noise used as a seed for the images. We also quantify the influence of the number of concepts in the prompt, their order as well as their (color) attributes. Finally, our method allows us to identify some seeds that produce better images than others, opening novel directions of research on this understudied topic. The paper introduces TIAM, a novel metric to quantify the alignment between generated images and text prompts in text-to-image synthesis. Existing image quality metrics fail to adequately assess the alignment between generated content and textual descriptions, particularly in complex scenarios involving multiple objects and attributes. TIAM utilizes prompt templates to systematically analyze the success rate of generating images containing specific objects and their attributes (e.g., color). It leverages object detection and segmentation to compare generated images with ground truth labels derived from the prompt. The alignment performance of text-to-image models significantly declines as the number of objects in the prompt increases. The initial objects mentioned in the prompt are more likely to be present and correctly attributed in the generated image. The study reveals the significant influence of the random seed used during image generation, indicating that certain seed values consistently produce higher-quality results. TIAM's computational cost increases with the number of objects and attributes, potentially limiting its scalability. The current implementation focuses on a limited set of attributes (primarily color) and object labels derived from the COCO dataset, requiring further work to extend its applicability to a broader range of attributes and open-vocabulary settings. text-to-image synthesis, image quality assessment, prompt engineering, semantic alignment, attribute binding
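The core of the metric can be written as a success rate over seeds: an image generated from a templated prompt counts as a success only if every requested object is detected. The detector interface below is a placeholder returning label sets, and the example numbers are made up.

```python
# Simplified TIAM-style scoring over seeds for one templated prompt.
def tiam_score(requested_labels, detections_per_image):
    """Fraction of generated images (one per seed) containing all requested objects."""
    hits = sum(1 for detected in detections_per_image
               if set(requested_labels) <= set(detected))
    return hits / len(detections_per_image)

# e.g., prompt "a photo of a dog and a cat", 4 seeds, detector outputs per image:
detections = [{"dog", "cat"}, {"dog"}, {"cat", "person"}, {"dog", "cat", "car"}]
print(tiam_score(["dog", "cat"], detections))  # 0.5 -- only 2 of 4 seeds contain both objects
```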
2307.05000 Report Neural Point-based Volumetric Avatar: Surface-guided Neural Points for Efficient and Photorealistic Volumetric Head Avatar Cong Wang, Di Kang, Yan-Pei Cao, Linchao Bao, Ying Shan, Song-Hai Zhang Rendering photorealistic and dynamically moving human heads is crucial for ensuring a pleasant and immersive experience in AR/VR and video conferencing applications. However, existing methods often struggle to model challenging facial regions (e.g., mouth interior, eyes, hair/beard), resulting in unrealistic and blurry results. In this paper, we propose Neural Point-based Volumetric Avatar (NPVA), a method that adopts the neural point representation as well as the neural volume rendering process and discards the predefined connectivity and hard correspondence imposed by mesh-based approaches. Specifically, the neural points are strategically constrained around the surface of the target expression via a high-resolution UV displacement map, achieving increased modeling capacity and more accurate control. We introduce three technical innovations to improve the rendering and training efficiency: a patch-wise depth-guided (shading point) sampling strategy, a lightweight radiance decoding process, and a Grid-Error-Patch (GEP) ray sampling strategy during training. By design, our NPVA is better equipped to handle topologically changing regions and thin structures while also ensuring accurate expression control when animating avatars. Experiments conducted on three subjects from the Multiface dataset demonstrate the effectiveness of our designs, outperforming previous state-of-the-art methods, especially in handling challenging facial regions. Proposes NPVA, a neural point-based volumetric representation for animatable head avatar creation that uses neural points constrained around a target expression's surface for efficient and photorealistic rendering. Existing mesh-based methods struggle to model challenging facial regions like mouths and beards, leading to unrealistic results. NPVA addresses this by using flexible neural points and neural volume rendering. NPVA uses a UV displacement map to guide neural points around a coarse target expression geometry. It introduces a patch-wise depth-guided sampling, lightweight radiance decoding, and Grid-Error-Patch training for efficiency. Outperforms state-of-the-art methods in rendering quality on novel expressions and views, especially in challenging regions. Achieves ~70x faster rendering speed than NeRF while producing comparable high-fidelity results. Demonstrates through ablation studies the effectiveness of its technical innovations like lightweight decoding and GEP training. Reliance on coarse mesh tracking limits handling of complex hairstyles. Relaxing displacement map constraints for unseen hairstyles can lead to blurry renderings. neural representation, volume rendering, head avatar, facial animation, point cloud
2307.04859 Report Articulated 3D Head Avatar Generation using Text-to-Image Diffusion Models Alexander W. Bergman, Wang Yifan, Gordon Wetzstein The ability to generate diverse 3D articulated head avatars is vital to a plethora of applications, including augmented reality, cinematography, and education. Recent work on text-guided 3D object generation has shown great promise in addressing these needs. These methods directly leverage pre-trained 2D text-to-image diffusion models to generate 3D-multi-view-consistent radiance fields of generic objects. However, due to the lack of geometry and texture priors, these methods have limited control over the generated 3D objects, making it difficult to operate inside a specific domain, e.g., human heads. In this work, we develop a new approach to text-guided 3D head avatar generation to address this limitation. Our framework directly operates on the geometry and texture of an articulable 3D morphable model (3DMM) of a head, and introduces novel optimization procedures to update the geometry and texture while keeping the 2D and 3D facial features aligned. The result is a 3D head avatar that is consistent with the text description and can be readily articulated using the deformation model of the 3DMM. We show that our diffusion-based articulated head avatars outperform state-of-the-art approaches for this task. The latter are typically based on CLIP, which is known to provide limited diversity of generation and accuracy for 3D object generation. Presents a novel method for generating 3D-view-consistent and articulable human head avatars from text prompts using pre-trained 2D text-to-image diffusion models. Addresses limitations of existing methods that struggle with control, diversity, and animation in text-guided 3D head avatar generation. Leverages score distillation loss to optimize shape and appearance of a 3D morphable model (3DMM) with a novel dual optimization procedure for geometry and texture, ensuring alignment and realism. Generates high-quality head avatars with diverse features, including fictional humanoids. Exhibits superior geometry-text consistency compared to baselines, capturing unique geometric attributes from prompts. Demonstrates realistic animation due to geometry-aware texture optimization and alignment with 3D facial landmarks. Generated images may exhibit cartoon-ish stylization and high color saturation, impacting realism. Inconsistency in upsampling across camera views can lead to flickering during animation. 3d head avatar generation, text-guided synthesis, diffusion models, 3d morphable models, articulated animation
2307.04787 Report Collaborative Score Distillation for Consistent Visual Synthesis Subin Kim, Kyungmin Lee, June Suk Choi, Jongheon Jeong, Kihyuk Sohn, Jinwoo Shin Generative priors of large-scale text-to-image diffusion models enable a wide range of new generation and editing applications on diverse visual modalities. However, when adapting these priors to complex visual modalities, often represented as multiple images (e.g., video), achieving consistency across a set of images is challenging. In this paper, we address this challenge with a novel method, Collaborative Score Distillation (CSD). CSD is based on the Stein Variational Gradient Descent (SVGD). Specifically, we propose to consider multiple samples as "particles" in the SVGD update and combine their score functions to distill generative priors over a set of images synchronously. Thus, CSD facilitates seamless integration of information across 2D images, leading to a consistent visual synthesis across multiple samples. We show the effectiveness of CSD in a variety of tasks, encompassing the visual editing of panorama images, videos, and 3D scenes. Our results underline the competency of CSD as a versatile method for enhancing inter-sample consistency, thereby broadening the applicability of text-to-image diffusion models. This paper proposes Collaborative Score Distillation (CSD), a novel method that extends text-to-image diffusion models for consistent visual synthesis and editing of complex visual data represented as a set of images. Existing text-to-image diffusion models struggle to maintain consistency across multiple images, limiting their application to complex visual modalities like videos and 3D scenes. CSD leverages Stein Variational Gradient Descent (SVGD) to distill generative priors over a set of images synchronously, ensuring consistency by sharing information among multiple samples during optimization. CSD enables spatially consistent panorama image editing, achieving a better balance between source-target consistency and instruction fidelity compared to baselines. CSD facilitates temporally consistent video editing, outperforming zero-shot methods and demonstrating comparable performance to a state-of-the-art video editing model trained on a large-scale dataset. CSD enhances 3D scene editing by encouraging multi-view consistency, leading to higher-quality edits and better preservation of source scene semantics compared to existing methods. The method inherits limitations from pre-trained text-to-image diffusion models, such as potential biases and difficulty in handling certain editing tasks (e.g., viewpoint changes). Patch-wise processing of high-resolution images can sometimes lead to artifacts at patch boundaries. text-to-image synthesis, score distillation sampling, stein variational gradient descent, video editing, 3d scene editing
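As a rough illustration of how CSD couples samples, the sketch below computes one SVGD-style update in which every image (treated as a particle) receives a kernel-weighted mix of all particles' score gradients plus a repulsive term. The `score_fn` interface and the RBF kernel bandwidth are assumptions; the authors' method operates on diffusion latents with an SDS-style gradient, which this sketch only approximates.

```python
import torch

def csd_style_update(particles, score_fn, bandwidth=1.0):
    """One SVGD-style update shared across a set of images.

    particles : (N, D) tensor, each row a flattened image or latent.
    score_fn  : callable (N, D) -> (N, D) giving per-particle score gradients,
                e.g. an SDS-style gradient from a frozen text-to-image diffusion
                model (assumed interface, not the authors' code).
    Returns the (N, D) update direction phi(x_i), which mixes every particle's
    score through the kernel; this sharing is what promotes consistency.
    """
    h2 = bandwidth ** 2
    diff = particles.unsqueeze(1) - particles.unsqueeze(0)      # diff[i, j] = x_i - x_j
    k = torch.exp(-(diff ** 2).sum(-1) / (2 * h2))              # k[i, j] = k(x_j, x_i)
    scores = score_fn(particles)                                # (N, D)
    attract = k @ scores                                        # sum_j k(x_j, x_i) * score(x_j)
    repulse = (k.unsqueeze(-1) * diff).sum(dim=1) / h2          # sum_j grad_{x_j} k(x_j, x_i)
    return (attract + repulse) / particles.shape[0]
```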
2307.04767 Report Semantic-SAM: Segment and Recognize Anything at Any Granularity Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Jianwei Yang, Chunyuan Li, Lei Zhang, Jianfeng Gao In this paper, we introduce Semantic-SAM, a universal image segmentation model to enable segment and recognize anything at any desired granularity. Our model offers two key advantages: semantic-awareness and granularity-abundance. To achieve semantic-awareness, we consolidate multiple datasets across three granularities and introduce decoupled classification for objects and parts. This allows our model to capture rich semantic information. For the multi-granularity capability, we propose a multi-choice learning scheme during training, enabling each click to generate masks at multiple levels that correspond to multiple ground-truth masks. Notably, this work represents the first attempt to jointly train a model on SA-1B, generic, and part segmentation datasets. Experimental results and visualizations demonstrate that our model successfully achieves semantic-awareness and granularity-abundance. Furthermore, combining SA-1B training with other segmentation tasks, such as panoptic and part segmentation, leads to performance improvements. We will provide code and a demo for further exploration and evaluation. This paper presents Semantic-SAM, a universal image segmentation model capable of segmenting and recognizing objects at any desired granularity with semantic awareness. A universal segmentation model is crucial for achieving human-level image understanding in various applications, going beyond the limitations of existing models with single-input-single-output pipelines and restricted training data. Semantic-SAM leverages a multi-choice learning design with multiple queries per click, enabling the prediction of multi-granularity masks. It uses a shared text encoder for decoupled object and part classification, trained on a unified data format from seven datasets with different semantic and granularity levels, including SA-1B, COCO, ADE20k, Pascal Part, PACO, PartImageNet, and Objects365. Semantic-SAM achieves state-of-the-art performance on various segmentation tasks, including generic, part, and interactive segmentation. Joint training with SA-1B significantly improves performance on COCO panoptic segmentation, demonstrating the benefit of multi-granularity learning. The model exhibits superior granularity completeness compared to SAM, generating more meaningful and higher-quality masks at multiple levels. The model currently relies on a fixed number of prompts (6), potentially limiting its ability to capture even finer granularities. Future work could explore incorporating a dynamic prompt generation mechanism based on image content and user intent. image segmentation, multi-granularity, semantic awareness, interactive segmentation, open-vocabulary segmentation
2307.04749 Report Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback Jaskirat Singh, Liang Zheng The field of text-conditioned image generation has made unparalleled progress with the recent advent of latent diffusion models. While remarkable, as the complexity of given text input increases, the state-of-the-art diffusion models may still fail in generating images which accurately convey the semantics of the given prompt. Furthermore, it has been observed that such misalignments are often left undetected by pretrained multi-modal models such as CLIP. To address these problems, in this paper we explore a simple yet effective decompositional approach towards both evaluation and improvement of text-to-image alignment. In particular, we first introduce a Decompositional-Alignment-Score which given a complex prompt decomposes it into a set of disjoint assertions. The alignment of each assertion with generated images is then measured using a VQA model. Finally, alignment scores for different assertions are combined a posteriori to give the final text-to-image alignment score. Experimental analysis reveals that the proposed alignment metric shows significantly higher correlation with human ratings than traditional CLIP and BLIP scores. Furthermore, we also find that the assertion level alignment scores provide a useful feedback which can then be used in a simple iterative procedure to gradually increase the expression of different assertions in the final image outputs. Human user studies indicate that the proposed approach surpasses previous state-of-the-art by 8.7% in overall text-to-image alignment accuracy. Project page for our paper is available at https://1jsingh.github.io/divide-evaluate-and-refine This paper introduces a novel decompositional framework for evaluating and refining text-to-image alignment in text-conditioned image generation models. Existing text-to-image generation models often fail to accurately convey the semantics of complex text prompts, and existing evaluation metrics like CLIP and BLIP scores often fail to detect these misalignments. The proposed framework, called Decompositional-Alignment-Score (DA-Score), decomposes complex prompts into disjoint assertions, evaluates the alignment of each assertion with the generated image using a VQA model, and then combines these scores to generate an overall text-to-image alignment score. This feedback is then used in an iterative refinement process to improve the generated image by increasing the expressiveness of the least aligned assertion. DA-Score shows significantly higher correlation with human ratings for text-to-image alignment compared to traditional metrics like CLIP, BLIP, and BLIP2. The iterative refinement process, guided by DA-Score, generates images with improved alignment to complex prompts, outperforming prior works in terms of alignment accuracy. Despite the iterative process, the proposed method maintains comparable inference times to other state-of-the-art techniques. The reliance on a pretrained BLIP-VQA model for assertion alignment evaluation introduces potential weaknesses based on the VQA model's limitations. The current approach treats all assertions as equally important, neglecting potential variations in user priorities and the visual verifiability of certain assertions. text-to-image generation, text-image alignment, vqa, iterative refinement, diffusion models
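A minimal sketch of the decompose-then-evaluate idea: score each assertion with a VQA model's probability of answering "yes", average the scores, and report the weakest assertion so an iterative refinement step can target it. The `vqa_yes_prob` callable, the question phrasing, and the averaging rule are assumptions standing in for the paper's BLIP-VQA-based pipeline.

```python
def da_score(image, assertions, vqa_yes_prob):
    """Decompositional alignment score for one generated image.

    assertions   : list of simple statements decomposed from the prompt,
                   e.g. by an LLM ("there is a red car", "the car is on grass").
    vqa_yes_prob : callable (image, question) -> probability of answer "yes"
                   from a VQA model (assumed interface).
    Returns the overall score, per-assertion scores, and the index of the
    weakest assertion (the one the iterative refinement would boost next).
    """
    per_assertion = [
        vqa_yes_prob(image, f"Is this statement true about the image? {a}")
        for a in assertions
    ]
    overall = sum(per_assertion) / len(per_assertion)  # simple a posteriori combination
    weakest = min(range(len(assertions)), key=lambda i: per_assertion[i])
    return overall, per_assertion, weakest
```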
2307.04725 Report AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, Bo Dai With the advance of text-to-image (T2I) diffusion models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable cost. However, adding motion dynamics to existing high-quality personalized T2Is and enabling them to generate animations remains an open challenge. In this paper, we present AnimateDiff, a practical framework for animating personalized T2I models without requiring model-specific tuning. At the core of our framework is a plug-and-play motion module that can be trained once and seamlessly integrated into any personalized T2Is originating from the same base T2I. Through our proposed training strategy, the motion module effectively learns transferable motion priors from real-world videos. Once trained, the motion module can be inserted into a personalized T2I model to form a personalized animation generator. We further propose MotionLoRA, a lightweight fine-tuning technique for AnimateDiff that enables a pre-trained motion module to adapt to new motion patterns, such as different shot types, at a low training and data collection cost. We evaluate AnimateDiff and MotionLoRA on several public representative personalized T2I models collected from the community. The results demonstrate that our approaches help these models generate temporally smooth animation clips while preserving the visual quality and motion diversity. Codes and pre-trained weights are available at https://github.com/guoyww/AnimateDiff. AnimateDiff, a practical framework for animating personalized text-to-image models without requiring model-specific tuning. Enables users to generate animations from personalized text-to-image models, which is desirable in various industries and for creative applications. Trains a plug-and-play motion module on real-world videos and integrates it into personalized T2I models. Also introduces MotionLoRA for adapting the module to new motion patterns with few reference videos. Generates temporally smooth animations while preserving the visual quality and motion diversity of personalized T2I models. Demonstrates that a Transformer architecture effectively captures motion priors. Shows that MotionLoRA successfully adapts pre-trained motion modules to new motion patterns with limited data and computation. Limited evaluation on controllable generation. Potential misuse for generating inappropriate content, although the paper proposes adding a content safety checker to mitigate this risk. text-to-image synthesis, animation generation, motion modeling, personalization, diffusion models
2307.04684 Report FreeDrag: Feature Dragging for Reliable Point-based Image Editing Pengyang Ling, Lin Chen, Pan Zhang, Huaian Chen, Yi Jin, Jinjin Zheng To serve the intricate and varied demands of image editing, precise and flexible manipulation of image content is indispensable. Recently, drag-based editing methods have achieved impressive performance. However, these methods predominantly center on point dragging, resulting in two noteworthy drawbacks, namely "miss tracking", where difficulties arise in accurately tracking the predetermined handle points, and "ambiguous tracking", where tracked points are potentially positioned in wrong regions that closely resemble the handle points. To address the above issues, we propose FreeDrag, a feature dragging methodology designed to relieve the burden of point tracking. FreeDrag incorporates two key designs, i.e., template feature via adaptive updating and line search with backtracking: the former improves stability against drastic content change by elaborately controlling the feature updating scale after each dragging, while the latter alleviates misguidance from similar points by actively restricting the search area to a line. These two technologies together contribute to a more stable semantic dragging with higher efficiency. Comprehensive experimental results substantiate that our approach significantly outperforms pre-existing methodologies, offering reliable point-based editing even in various complex scenarios. This paper introduces FreeDrag, a novel feature-dragging framework designed for robust and precise point-based image editing. Existing point-dragging methods suffer from limitations like 'miss tracking' and 'ambiguous tracking,' leading to inaccurate and unreliable editing outcomes. FreeDrag aims to address these issues and improve the quality of interactive image editing. FreeDrag utilizes two key mechanisms: 1) Adaptive template features, which dynamically adjust updating scales based on dragging quality, enhancing stability. 2) Line search with backtracking, constraining movements along a line to minimize ambiguity and employing backtracking for course correction. FreeDrag successfully mitigates point disappearance and content distortion, enabling precise detail editing. It exhibits robustness against similar points, leading to more reliable and accurate dragging outcomes. Quantitative evaluations demonstrate FreeDrag's superiority in achieving high editing accuracy while preserving image fidelity. The performance of FreeDrag is subject to the chosen parameters, requiring careful tuning for optimal results. Future work will explore the integration of FreeDrag with other generative models beyond StyleGAN2 and diffusion models, potentially expanding its applicability. image editing, generative models, point-based editing, feature dragging, interactive editing
2307.04455 Report SAM-IQA: Can Segment Anything Boost Image Quality Assessment? Xinpeng Li, Ting Jiang, Haoqiang Fan, Shuaicheng Liu Image Quality Assessment (IQA) is a challenging task that requires training on massive datasets to achieve accurate predictions. However, due to the lack of IQA data, deep learning-based IQA methods typically rely on pre-trained networks trained on massive datasets as feature extractors to enhance their generalization ability, such as the ResNet network trained on ImageNet. In this paper, we utilize the encoder of Segment Anything, a recently proposed segmentation model trained on a massive dataset, for high-level semantic feature extraction. Most IQA methods are limited to extracting spatial-domain features, while frequency-domain features have been shown to better represent noise and blur. Therefore, we leverage both spatial-domain and frequency-domain features by applying Fourier and standard convolutions on the extracted features, respectively. Extensive experiments are conducted to demonstrate the effectiveness of all the proposed components, and results show that our approach outperforms the state-of-the-art (SOTA) in four representative datasets, both qualitatively and quantitatively. Our experiments confirm the powerful feature extraction capabilities of Segment Anything and highlight the value of combining spatial-domain and frequency-domain features in IQA tasks. Code: https://github.com/Hedlen/SAM-IQA This paper introduces a novel IQA method leveraging the Segment Anything (SAM) model for feature extraction, incorporating both spatial and frequency domain features through a spatial-frequency feature extraction module (SFEM). Accurate IQA is crucial for various image processing tasks, but existing methods suffer from limited training data. This paper addresses this by utilizing the robust feature extraction capabilities of SAM, trained on a massive dataset. The method extracts features using the SAM encoder and then employs SFEM to capture both spatial and frequency domain information using regular and Fourier convolutions. For FR-IQA, L1 distance is used to compare features, while for NR-IQA, features are directly fed into a regression block. The method outperforms state-of-the-art approaches in both FR-IQA and NR-IQA tasks on various benchmark datasets. Ablation studies confirm the effectiveness of SAM encoder, Fourier convolution in SFEM, and L1 distance metric. The method achieves strong generalization ability and superior performance in image quality assessment. The method's reliance on pre-trained SAM encoder limits its applicability in scenarios where SAM's performance is compromised. Further exploration of advanced distance metric learning techniques could potentially enhance the model's accuracy. image quality assessment, segment anything, fourier convolution, spatial-frequency feature extraction, deep learning
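The frequency-domain branch described for SAM-IQA's spatial-frequency module can be illustrated with a small FFT-based block: transform features to the spectrum, mix channels with a 1x1 convolution on the real and imaginary parts, and transform back. Layer sizes and the exact block structure are assumptions; only the spatial/frequency split mirrors the description above.

```python
import torch
import torch.nn as nn

class FourierConvBlock(nn.Module):
    """Frequency-domain feature mixing (illustrative sketch, not the paper's code)."""

    def __init__(self, channels):
        super().__init__()
        # Real and imaginary parts are stacked along the channel axis.
        self.freq_conv = nn.Sequential(
            nn.Conv2d(2 * channels, 2 * channels, kernel_size=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):                                   # x: (B, C, H, W)
        spec = torch.fft.rfft2(x, norm="ortho")             # (B, C, H, W//2+1), complex
        z = torch.cat([spec.real, spec.imag], dim=1)        # (B, 2C, H, W//2+1)
        z = self.freq_conv(z)
        real, imag = z.chunk(2, dim=1)
        spec = torch.complex(real, imag)
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")
```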
2307.04157 Report DIFF-NST: Diffusion Interleaving For deFormable Neural Style Transfer Dan Ruta, Gemma Canet Tarrés, Andrew Gilbert, Eli Shechtman, Nicholas Kolkin, John Collomosse Neural Style Transfer (NST) is the field of study applying neural techniques to modify the artistic appearance of a content image to match the style of a reference style image. Traditionally, NST methods have focused on texture-based image edits, affecting mostly low level information and keeping most image structures the same. However, style-based deformation of the content is desirable for some styles, especially in cases where the style is abstract or the primary concept of the style is in its deformed rendition of some content. With the recent introduction of diffusion models, such as Stable Diffusion, we can access far more powerful image generation techniques, enabling new possibilities. In our work, we propose using this new class of models to perform style transfer while enabling deformable style transfer, an elusive capability in previous models. We show how leveraging the priors of these models can expose new artistic controls at inference time, and we document our findings in exploring this new direction for the field of style transfer. This paper proposes DIFF-NST, a novel Neural Style Transfer (NST) method leveraging diffusion models to enable deformable style transfer, going beyond texture-based edits to alter content shapes and structures according to the style image. Traditional NST methods primarily focus on texture transfer, neglecting style-based content deformation. This work explores the potential of diffusion models for achieving deformable style transfer, a capability previously elusive in NST. DIFF-NST freezes pre-trained diffusion model weights and trains MLPs within the UNet self-attention blocks. It interleaves content noise and style attention values during reverse diffusion to generate stylized images, enabling control over content deformation and stylization strength. DIFF-NST achieves deformable style transfer, successfully altering content shapes and structures based on the style image. User studies demonstrate a strong preference for DIFF-NST in terms of style transfer quality compared to traditional NST methods and the closest related work, PARASOL. The method offers inference-time control over the degree of content deformation and stylization strength. The method currently doesn't match textures to the style image with the same fidelity as traditional NST approaches. There are occasional instances where structure from the style image unintentionally influences the stylized image. neural style transfer, diffusion models, deformable style transfer, image generation, artistic style
2307.04028 Report Measuring the Success of Diffusion Models at Imitating Human Artists Stephen Casper, Zifan Guo, Shreya Mogulothu, Zachary Marinov, Chinmay Deshpande, Rui-Jie Yew, Zheng Dai, Dylan Hadfield-Menell Modern diffusion models have set the state-of-the-art in AI image generation. Their success is due, in part, to training on Internet-scale data which often includes copyrighted work. This prompts questions about the extent to which these models learn from, imitate, or copy the work of human artists. This work suggests that tying copyright liability to the capabilities of the model may be useful given the evolving ecosystem of generative models. Specifically, much of the legal analysis of copyright and generative systems focuses on the use of protected data for training. As a result, the connections between data, training, and the system are often obscured. In our approach, we consider simple image classification techniques to measure a model's ability to imitate specific artists. Specifically, we use Contrastive Language-Image Pretrained (CLIP) encoders to classify images in a zero-shot fashion. Our process first prompts a model to imitate a specific artist. Then, we test whether CLIP can be used to reclassify the artist (or the artist's work) from the imitation. If these tests match the imitation back to the original artist, this suggests the model can imitate that artist's expression. Our approach is simple and quantitative. Furthermore, it uses standard techniques and does not require additional training. We demonstrate our approach with an audit of Stable Diffusion's capacity to imitate 70 professional digital artists with copyrighted work online. When Stable Diffusion is prompted to imitate an artist from this set, we find that the artist can be identified from the imitation with an average accuracy of 81.0%. Finally, we also show that a sample of the artist's work can be matched to these imitation images with a high degree of statistical reliability. Overall, these results suggest that Stable Diffusion is broadly successful at imitating individual human artists. This paper presents a method to quantify a diffusion model's capacity to imitate human artists by employing CLIP-based image classification, which could inform discussions on copyright liability tied to model capabilities. As generative AI models, often trained on copyrighted works, become increasingly capable, it is crucial to develop objective measures of their ability to imitate specific artists, which could be relevant for copyright considerations. Current legal frameworks primarily focus on training data rather than model capabilities. The authors use CLIP encoders to classify images generated by Stable Diffusion. They prompt the model to imitate a specific artist's style and then assess if CLIP can correctly identify the artist from the generated image. They also compare the similarity of real artwork to generated imitations using CLIP embeddings. Stable Diffusion successfully imitates the style of 70 professional digital artists, as demonstrated by an average classification accuracy of 81.0% using their names as prompts. These results remain consistent when tested on a larger set of 250 artists, indicating the generalizability of the findings. Artwork generated by Stable Diffusion shows statistically significant similarity to real artwork by the artist it was prompted to imitate compared to other artists, further confirming its imitation capabilities. The study focuses on Stable Diffusion and a specific set of digital artists, potentially limiting the generalizability of the findings to other models or artistic domains. Future work could investigate the effectiveness of different image classification techniques and explore potential defenses against AI imitation of copyrighted works. diffusion models, copyright law, image classification, clip, ai art
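The audit described above reduces to zero-shot CLIP classification of an imitation image against a list of artist names. Below is a minimal sketch using the Hugging Face CLIP interface; the artist list and image path are placeholders, and the paper's protocol additionally compares embeddings of real artworks with statistical tests.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

artists = ["Artist A", "Artist B", "Artist C"]          # placeholder names
texts = [f"artwork in the style of {name}" for name in artists]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# An image generated while prompting a diffusion model to imitate one artist.
image = Image.open("imitation.png")                     # placeholder path
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    logits_per_image = model(**inputs).logits_per_image  # (1, num_artists)
probs = logits_per_image.softmax(dim=-1)[0]

predicted = artists[int(probs.argmax())]
print(predicted, probs.tolist())  # if this matches the prompted artist, the imitation "succeeded"
```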
2307.03869 Report Sketch-A-Shape: Zero-Shot Sketch-to-3D Shape Generation Aditya Sanghi, Pradeep Kumar Jayaraman, Arianna Rampini, Joseph Lambourne, Hooman Shayani, Evan Atherton, Saeid Asgari Taghanaki Significant progress has recently been made in creative applications of large pre-trained models for downstream tasks in 3D vision, such as text-to-shape generation. This motivates our investigation of how these pre-trained models can be used effectively to generate 3D shapes from sketches, which has largely remained an open challenge due to the limited sketch-shape paired datasets and the varying level of abstraction in the sketches. We discover that conditioning a 3D generative model on the features (obtained from a frozen large pre-trained vision model) of synthetic renderings during training enables us to effectively generate 3D shapes from sketches at inference time. This suggests that the large pre-trained vision model features carry semantic signals that are resilient to domain shifts, i.e., allowing us to use only RGB renderings, but generalizing to sketches at inference time. We conduct a comprehensive set of experiments investigating different design factors and demonstrate the effectiveness of our straightforward approach for generation of multiple 3D shapes per each input sketch regardless of their level of abstraction without requiring any paired datasets during training. The paper proposes Sketch-A-Shape, a zero-shot approach for sketch-to-3D shape generation using pre-trained vision models. Sketch-to-3D shape generation is challenging due to the limited availability of paired datasets and varying levels of abstraction in sketches. This method addresses these challenges by leveraging the semantic knowledge captured in pre-trained vision models. The method uses a two-stage training process: (1) train a VQ-VAE to obtain shape embeddings, (2) train a masked transformer conditioned on local features from pre-trained vision models applied to synthetic renderings of the 3D shapes. During inference, the transformer is conditioned on features extracted from the input sketch to generate the 3D shape. The method generates multiple plausible 3D shapes per sketch, even for highly abstract sketches. It generalizes well across different sketch datasets and 3D representations (voxels, implicit, CAD). The method outperforms existing supervised sketch-to-3D approaches on shape classification accuracy and achieves promising results in human evaluation. The method is limited by the diversity and quality of the 3D shape dataset used for training. It struggles to generate shapes with complex local details. sketch-to-3d, zero-shot learning, generative models, pre-trained vision models, clip
2307.03798 Report Fooling Contrastive Language-Image Pre-trained Models with CLIPMasterPrints Matthias Freiberger, Peter Kun, Christian Igel, Anders Sundnes Løvlie, Sebastian Risi Models leveraging both visual and textual data, such as Contrastive Language-Image Pre-training (CLIP), are the backbone of many recent advances in artificial intelligence. In this work, we show that despite their versatility, such models are vulnerable to what we refer to as fooling master images. Fooling master images are capable of maximizing the confidence score of a CLIP model for a significant number of widely varying prompts, while being either unrecognizable or unrelated to the attacked prompts for humans. The existence of such images is problematic as it could be used by bad actors to maliciously interfere with CLIP-trained image retrieval models in production with comparatively small effort as a single image can attack many different prompts. We demonstrate how fooling master images for CLIP (CLIPMasterPrints) can be mined using stochastic gradient descent, projected gradient descent, or blackbox optimization. Contrary to many common adversarial attacks, the blackbox optimization approach allows us to mine CLIPMasterPrints even when the weights of the model are not accessible. We investigate the properties of the mined images, and find that images trained on a small number of image captions generalize to a much larger number of semantically related captions. We evaluate possible mitigation strategies, where we increase the robustness of the model and introduce an approach to automatically detect CLIPMasterPrints to sanitize the input of vulnerable models. Finally, we find that vulnerability to CLIPMasterPrints is related to a modality gap in contrastive pre-trained multi-modal networks. Code available at https://github.com/matfrei/CLIPMasterPrints. This paper introduces fooling master images (CLIPMasterPrints), which are images capable of maximizing the confidence score of a CLIP model for a wide range of prompts while appearing meaningless or unrelated to humans. The existence of CLIPMasterPrints poses a security risk as they can be used to manipulate CLIP-trained image retrieval systems in production, enabling attacks like censorship and adversarial marketing. The authors exploit the modality gap in CLIP models and mine CLIPMasterPrints using stochastic gradient descent (SGD), black-box optimization based on Latent Variable Evolution (LVE), and projected gradient descent (PGD). CLIPMasterPrints can successfully fool CLIP models for a variety of prompts, achieving higher cosine similarity scores than actual images matching the prompts. The fooling effect generalizes to semantically related prompts that were not directly targeted during optimization. Mitigation strategies include bridging the modality gap in CLIP models and sanitizing model inputs by training classifiers to detect CLIPMasterPrints. The mitigation strategy of adding fooling images to the training set is only partially successful, as the model remains vulnerable to newly mined CLIPMasterPrints. Future work should focus on developing more effective mitigation strategies and investigating the vulnerability of other related models. clip, fooling images, adversarial attacks, multi-modal networks, modality gap
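A rough sketch of the white-box mining procedure: optimize raw pixels by gradient descent so the CLIP image embedding is simultaneously close to many text prompts. The prompt list, resolution, step count, learning rate, and the omission of CLIP's input normalization are simplifications for illustration; the paper also covers PGD and black-box (LVE) variants.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
model.requires_grad_(False)  # only the image is optimized
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a photo of a dog", "a famous painting", "a city at night"]  # placeholder targets
tokens = tokenizer(prompts, padding=True, return_tensors="pt").to(device)
with torch.no_grad():
    text_emb = model.get_text_features(**tokens)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Optimize pixel values directly (CLIP's mean/std normalization omitted for brevity).
image = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
opt = torch.optim.Adam([image], lr=0.01)

for step in range(200):
    img_emb = model.get_image_features(pixel_values=image)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    loss = -(img_emb @ text_emb.t()).mean()  # maximize mean similarity over all attacked prompts
    opt.zero_grad()
    loss.backward()
    opt.step()
    image.data.clamp_(0, 1)
```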
2307.03441 Report NOFA: NeRF-based One-shot Facial Avatar Reconstruction Wangbo Yu, Yanbo Fan, Yong Zhang, Xuan Wang, Fei Yin, Yunpeng Bai, Yan-Pei Cao, Ying Shan, Yang Wu, Zhongqian Sun, Baoyuan Wu 3D facial avatar reconstruction has been a significant research topic in computer graphics and computer vision, where photo-realistic rendering and flexible controls over poses and expressions are necessary for many related applications. Recently, its performance has been greatly improved with the development of neural radiance fields (NeRF). However, most existing NeRF-based facial avatars focus on subject-specific reconstruction and reenactment, requiring multi-shot images containing different views of the specific subject for training, and the learned model cannot generalize to new identities, limiting its further applications. In this work, we propose a one-shot 3D facial avatar reconstruction framework that only requires a single source image to reconstruct a high-fidelity 3D facial avatar. For the challenges of lacking generalization ability and missing multi-view information, we leverage the generative prior of 3D GAN and develop an efficient encoder-decoder network to reconstruct the canonical neural volume of the source image, and further propose a compensation network to complement facial details. To enable fine-grained control over facial dynamics, we propose a deformation field to warp the canonical volume into driven expressions. Through extensive experimental comparisons, we achieve superior synthesis results compared to several state-of-the-art methods. The paper introduces NOFA, a novel one-shot 3D facial avatar reconstruction framework using NeRF, enabling high-fidelity reconstruction and reenactment from a single image. Existing NeRF-based facial avatars are subject-specific, requiring extensive multi-shot data and lacking generalization ability. This work addresses these limitations by proposing a generalizable one-shot approach. The method leverages a 3D GAN's generative prior to synthesize neural volumes, trains an encoder-decoder network for mapping images to canonical volumes, and employs a deformation field for facial dynamics control guided by 3DMM parameters. Outperforms state-of-the-art 2D and NeRF-based methods in novel view synthesis and reenactment tasks. Achieves comparable performance to multi-shot methods while only requiring a single image. Demonstrates superior identity preservation and detail rendering. Background rotation is coupled with head rotation due to camera pose modeling. Potential misuse for creating deep-fakes. facial avatar reconstruction, neural radiance fields (nerf), one-shot learning, 3d generative adversarial networks (gans), facial reenactment
2307.03190 Report Text-Guided Synthesis of Eulerian Cinemagraphs Aniruddha Mahapatra, Aliaksandr Siarohin, Hsin-Ying Lee, Sergey Tulyakov, Jun-Yan Zhu We introduce Text2Cinemagraph, a fully automated method for creating cinemagraphs from text descriptions - an especially challenging task when prompts feature imaginary elements and artistic styles, given the complexity of interpreting the semantics and motions of these images. We focus on cinemagraphs of fluid elements, such as flowing rivers, and drifting clouds, which exhibit continuous motion and repetitive textures. Existing single-image animation methods fall short on artistic inputs, and recent text-based video methods frequently introduce temporal inconsistencies, struggling to keep certain regions static. To address these challenges, we propose an idea of synthesizing image twins from a single text prompt - a pair of an artistic image and its pixel-aligned corresponding natural-looking twin. While the artistic image depicts the style and appearance detailed in our text prompt, the realistic counterpart greatly simplifies layout and motion analysis. Leveraging existing natural image and video datasets, we can accurately segment the realistic image and predict plausible motion given the semantic information. The predicted motion can then be transferred to the artistic image to create the final cinemagraph. Our method outperforms existing approaches in creating cinemagraphs for natural landscapes as well as artistic and other-worldly scenes, as validated by automated metrics and user studies. Finally, we demonstrate two extensions: animating existing paintings and controlling motion directions using text. This paper introduces Text2Cinemagraph, the first fully automated method for generating cinemagraphs from text descriptions, capable of handling both artistic and natural scenes. This method allows content creators to easily generate cinemagraphs with a variety of styles and compositions, including those with imaginative elements, which are challenging to create with existing methods. The method leverages a twin image synthesis approach, generating an artistic image and a corresponding realistic image with similar semantic layout. A motion prediction model trained on real videos is applied to the realistic image, and the resulting motion is transferred to the artistic image to create the final cinemagraph. The method outperforms existing single-image animation techniques on both artistic and natural images, as measured by FVD scores and user studies. Text and mask conditioning in the flow prediction network are shown to be crucial for generating plausible motions. The method is extended to animate existing paintings and control motion directions using text. Limitations include potential inconsistencies between the artistic and realistic images and challenges in segmenting complex natural images. Future work involves exploring more fine-grained text-guided direction control and addressing artifacts in the generated videos. cinemagraph generation, text-to-video synthesis, single image animation, twin image synthesis, artistic style transfer
2307.03108 Report DIAGNOSIS: Detecting Unauthorized Data Usages in Text-to-image Diffusion Models Zhenting Wang, Chen Chen, Lingjuan Lyu, Dimitris N. Metaxas, Shiqing Ma Recent text-to-image diffusion models have shown surprising performance in generating high-quality images. However, concerns have arisen regarding the unauthorized data usage during the training or fine-tuning process. One example is when a model trainer collects a set of images created by a particular artist and attempts to train a model capable of generating similar images without obtaining permission and giving credit to the artist. To address this issue, we propose a method for detecting such unauthorized data usage by planting the injected memorization into the text-to-image diffusion models trained on the protected dataset. Specifically, we modify the protected images by adding unique content to these images using stealthy image warping functions that are nearly imperceptible to humans but can be captured and memorized by diffusion models. By analyzing whether the model has memorized the injected content (i.e., whether the generated images are processed by the injected post-processing function), we can detect models that had illegally utilized the unauthorized data. Experiments on Stable Diffusion and VQ Diffusion with different model training or fine-tuning methods (i.e., LoRA, DreamBooth, and standard training) demonstrate the effectiveness of our proposed method in detecting unauthorized data usages. Code: https://github.com/ZhentingWang/DIAGNOSIS. This paper proposes DIAGNOSIS, a method for detecting unauthorized data usage in text-to-image diffusion models by injecting element-level memorizations into models trained on protected datasets. The increasing use of text-to-image diffusion models raises concerns about unauthorized data usage, necessitating techniques to detect and prevent such misuse. DIAGNOSIS modifies protected images using stealthy warping functions (signal functions) before release. A signal classifier is trained to detect the presence of this warping in generated images. By analyzing the memorization strength of a given model on the signal function, unauthorized usage can be determined. DIAGNOSIS achieves 100.0% detection accuracy on various text-to-image diffusion models and training methods. The method has minimal impact on the generation quality of models trained on protected datasets. DIAGNOSIS is robust against adaptive infringers employing strong image augmentations. The current signal function focuses on image warping; exploring other stealthy modifications could be beneficial. Investigating the impact of infringers utilizing a portion of the dataset for training is an area for future work. unauthorized data usage, text-to-image diffusion models, memorization, data protection, copyright infringement
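One way to picture the "signal function" is a barely perceptible, deterministic spatial warp applied to every protected image before release; a model fine-tuned on the coated set then tends to reproduce the warp, which a trained classifier can detect in its generations. The sinusoidal displacement below is a plausible stand-in under stated assumptions, not necessarily the exact warp used in DIAGNOSIS.

```python
import torch
import torch.nn.functional as F

def warp_signal(images, strength=0.02, frequency=6.0):
    """Apply a subtle sinusoidal spatial warp as a coating 'signal function'.

    images: (B, C, H, W) tensor in [0, 1]. The displacement amplitude is tiny,
    so the warp is hard to notice but consistent across the protected set.
    `strength` and `frequency` are illustrative parameters.
    """
    b, _, h, w = images.shape
    ys = torch.linspace(-1, 1, h, device=images.device)
    xs = torch.linspace(-1, 1, w, device=images.device)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    # Small, smooth displacement field added to the identity sampling grid.
    dx = strength * torch.sin(frequency * torch.pi * grid_y)
    dy = strength * torch.sin(frequency * torch.pi * grid_x)
    grid = torch.stack((grid_x + dx, grid_y + dy), dim=-1)   # (H, W, 2), (x, y) order
    grid = grid.unsqueeze(0).expand(b, -1, -1, -1)
    return F.grid_sample(images, grid, mode="bilinear", align_corners=True)
```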
2307.02953 Report SegNetr: Rethinking the local-global interactions and skip connections in U-shaped networks Junlong Cheng, Chengrui Gao, Fengjie Wang, Min Zhu Recently, U-shaped networks have dominated the field of medical image segmentation due to their simple and easily tuned structure. However, existing U-shaped segmentation networks: 1) mostly focus on designing complex self-attention modules to compensate for the lack of long-term dependence based on convolution operation, which increases the overall number of parameters and computational complexity of the network; 2) simply fuse the features of encoder and decoder, ignoring the connection between their spatial locations. In this paper, we rethink the above problem and build a lightweight medical image segmentation network, called SegNetr. Specifically, we introduce a novel SegNetr block that can perform local-global interactions dynamically at any stage and with only linear complexity. At the same time, we design a general information retention skip connection (IRSC) to preserve the spatial location information of encoder features and achieve accurate fusion with the decoder features. We validate the effectiveness of SegNetr on four mainstream medical image segmentation datasets, with 59% and 76% fewer parameters and GFLOPs than vanilla U-Net, while achieving segmentation performance comparable to state-of-the-art methods. Notably, the components proposed in this paper can be applied to other U-shaped networks to improve their segmentation performance. This paper presents SegNetr, a lightweight U-shaped medical image segmentation network that improves local-global interaction and skip connections. Existing U-shaped networks often rely on computationally expensive self-attention modules or simplistic feature fusion, limiting their efficiency and performance. SegNetr introduces: (1) SegNetr blocks for dynamic local-global interaction with linear complexity using parallel processing and window displacement. (2) Information retention skip connections (IRSC) to preserve spatial information from the encoder and enhance feature fusion with the decoder. SegNetr achieves comparable or superior segmentation performance to state-of-the-art methods on four medical image datasets (ISIC2017, PH2, TNSCUI, ACDC). It significantly reduces computational cost, with 59% fewer parameters and 76% fewer GFLOPs than vanilla U-Net. Ablation studies demonstrate the effectiveness of the proposed SegNetr blocks and IRSC, indicating their potential applicability in other U-shaped networks. The paper primarily focuses on 2D medical image segmentation, leaving extensions to 3D data for future exploration. Further research could investigate the optimal patch size configuration for different datasets and segmentation tasks. medical image segmentation, u-net, local-global interaction, skip connections, deep learning
2307.02609 Report MRecGen: Multimodal Appropriate Reaction Generator Jiaqi Xu, Cheng Luo, Weicheng Xie, Linlin Shen, Xiaofeng Liu, Lu Liu, Hatice Gunes, Siyang Song Verbal and non-verbal human reaction generation is a challenging task, as different reactions could be appropriate for responding to the same behaviour. This paper proposes the first multiple and multimodal (verbal and nonverbal) appropriate human reaction generation framework that can generate appropriate and realistic human-style reactions (displayed in the form of synchronised text, audio and video streams) in response to an input user behaviour. This novel technique can be applied to various human-computer interaction scenarios by generating appropriate virtual agent/robot behaviours. Our demo is available at https://github.com/SSYSteve/MRecGen. This paper proposes MRecGen, the first multiple and multimodal appropriate human reaction generation framework that produces synchronized text, audio, and video streams of realistic reactions to user behavior. Generating realistic and appropriate reactions to human behavior is challenging due to the 'one-to-many mapping' problem where the same behavior can elicit various valid responses. Existing methods struggle with this and lack multimodal capabilities, limiting their applicability in human-computer interaction. MRecGen uses a four-module deep learning approach: (1) User behavior encoding (UBE) module encodes multimodal user input. (2) Appropriate reaction prediction (ARP) module predicts a distribution of suitable multimodal reactions. (3) Behaviour synchronisation (BS) module aligns and synchronizes user and reaction representations. (4) Reaction display (RD) module generates the final text, audio, and facial video output. MRecGen generates appropriate textual, audio, and facial reactions based on user studies. The framework achieves high lip sync quality in the generated videos. User study results demonstrate the effectiveness of MRecGen in generating appropriate and realistic reactions. The current demo primarily focuses on generating reactions from a single identity, limiting its generalizability. Future work will investigate incorporating personality traits into the reaction generation process to enhance personalization. human-computer interaction, multimodal generation, deep learning, reaction generation, virtual agents
2307.02321 Report MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers Jakob Drachmann Havtorn, Amelie Royer, Tijmen Blankevoort, Babak Ehteshami Bejnordi The input tokens to Vision Transformers carry little semantic meaning as they are defined as regular equal-sized patches of the input image, regardless of its content. However, processing uniform background areas of an image should not necessitate as much compute as dense, cluttered areas. To address this issue, we propose a dynamic mixed-scale tokenization scheme for ViT, MSViT. Our method introduces a conditional gating mechanism that selects the optimal token scale for every image region, such that the number of tokens is dynamically determined per input. In addition, to enhance the conditional behavior of the gate during training, we introduce a novel generalization of the batch-shaping loss. We show that our gating module is able to learn meaningful semantics despite operating locally at the coarse patch-level. The proposed gating module is lightweight, agnostic to the choice of transformer backbone, and trained within a few epochs with little training overhead. Furthermore, in contrast to token pruning, MSViT does not lose information about the input, thus can be readily applied for dense tasks. We validate MSViT on the tasks of classification and segmentation where it leads to improved accuracy-complexity trade-off. This paper proposes MSViT, a Vision Transformer that dynamically selects the optimal token scale for different image regions, thereby reducing the number of input tokens and computational cost. Standard ViTs process images with a fixed token size, leading to computational redundancy, especially in uniform background areas. Dynamically adjusting token scale based on image content can improve efficiency. A lightweight gating MLP is introduced to select between coarse and fine token scales for each image region. To optimize this conditional gating mechanism, a novel Generalized Batch-Shaping (GBaS) loss is proposed, and an adaptive trimming strategy reduces training overhead. MSViT consistently improves the accuracy-complexity trade-off compared to standard ViTs across different backbones, pretraining methods, and input sizes. The learned gating mechanism effectively captures meaningful semantics to distinguish background from foreground even with local information. The pretrained MSViT gate transfers well to other tasks like semantic segmentation and can be combined with token pruning in hierarchical ViTs for further efficiency gains. The current design explores two token scales. Investigating more scales might further improve performance but add complexity. The interaction between the gate's coarse scale, the base patch scale, and the attention window size in hierarchical ViTs requires further investigation to optimize token scale selection across layers. vision transformer, tokenization, mixed-scale, efficiency, conditional computing
2307.01831 Report DiT-3D: Exploring Plain Diffusion Transformers for 3D Shape Generation Shentong Mo, Enze Xie, Ruihang Chu, Lewei Yao, Lanqing Hong, Matthias Nießner, Zhenguo Li Recent Diffusion Transformers (e.g., DiT) have demonstrated their powerful effectiveness in generating high-quality 2D images. However, it is still being determined whether the Transformer architecture performs equally well in 3D shape generation, as previous 3D diffusion methods mostly adopted the U-Net architecture. To bridge this gap, we propose a novel Diffusion Transformer for 3D shape generation, namely DiT-3D, which can directly operate the denoising process on voxelized point clouds using plain Transformers. Compared to existing U-Net approaches, our DiT-3D is more scalable in model size and produces much higher quality generations. Specifically, the DiT-3D adopts the design philosophy of DiT but modifies it by incorporating 3D positional and patch embeddings to adaptively aggregate input from voxelized point clouds. To reduce the computational cost of self-attention in 3D shape generation, we incorporate 3D window attention into Transformer blocks, as the increased 3D token length resulting from the additional dimension of voxels can lead to high computation. Finally, linear and devoxelization layers are used to predict the denoised point clouds. In addition, our transformer architecture supports efficient fine-tuning from 2D to 3D, where the pre-trained DiT-2D checkpoint on ImageNet can significantly improve DiT-3D on ShapeNet. Experimental results on the ShapeNet dataset demonstrate that the proposed DiT-3D achieves state-of-the-art performance in high-fidelity and diverse 3D point cloud generation. In particular, our DiT-3D decreases the 1-Nearest Neighbor Accuracy of the state-of-the-art method by 4.59 and increases the Coverage metric by 3.51 when evaluated on Chamfer Distance. This paper introduces DiT-3D, a novel diffusion transformer architecture designed for 3D shape generation that leverages the denoising process of DDPM on 3D point clouds. Generating high-fidelity point clouds for 3D shape generation is a challenging and significant problem, and existing methods have limitations in terms of architecture and performance. The proposed DiT-3D model adapts the DiT framework by incorporating 3D positional and patch embeddings, 3D window attention, and devoxelized prediction to handle the unique characteristics of 3D point clouds. It also supports efficient fine-tuning from 2D to 3D using pre-trained DiT-2D weights. DiT-3D achieves state-of-the-art performance on the ShapeNet dataset, surpassing previous non-DDPM and DDPM-based 3D shape generation methods. The proposed 3D adaptations, including voxel diffusion, 3D positional embeddings, and 3D window attention, significantly contribute to improving the quality and diversity of generated shapes. DiT-3D exhibits strong scalability, allowing for flexible adjustments to patch sizes, voxel sizes, and model sizes. The model has yet to be explored on other 3D modalities like SDFs and meshes. Scaling DiT-3D to large-scale training on more extensive 3D shape datasets is a potential area for future work. 3d shape generation, diffusion models, transformers, point clouds, denoising
2307.01425 Report Consistent Multimodal Generation via A Unified GAN Framework Zhen Zhu, Yijun Li, Weijie Lyu, Krishna Kumar Singh, Zhixin Shu, Soeren Pirk, Derek Hoiem We investigate how to generate multimodal image outputs, such as RGB, depth, and surface normals, with a single generative model. The challenge is to produce outputs that are realistic, and also consistent with each other. Our solution builds on the StyleGAN3 architecture, with a shared backbone and modality-specific branches in the last layers of the synthesis network, and we propose per-modality fidelity discriminators and a cross-modality consistency discriminator. In experiments on the Stanford2D3D dataset, we demonstrate realistic and consistent generation of RGB, depth, and normal images. We also show a training recipe to easily extend our pretrained model on a new domain, even with a few pairwise data. We further evaluate the use of synthetically generated RGB and depth pairs for training or fine-tuning depth estimators. Code will be available at https://github.com/jessemelpolio/MultimodalGAN. This paper presents MultimodalGAN, a unified GAN framework for generating consistent multi-modal images (e.g., RGB, depth, surface normals) using a shared representation. Generating realistic and consistent multi-modal data is crucial for training vision models, especially when real data is scarce. The method builds on StyleGAN3, employing a shared backbone and modality-specific branches. It introduces fidelity discriminators (per modality) and a consistency discriminator to ensure realism and cross-modal consistency. A simple, unified data augmentation strategy is used across modalities. MultimodalGAN generates realistic and consistent RGB, depth, and normal images, outperforming previous methods on the Stanford2D3D dataset. The model can be effectively fine-tuned for new domains with limited paired data, enabling generation of missing modalities. Synthetic RGB and depth pairs generated by the model improve the performance of downstream depth estimation tasks. The current work focuses on three specific modalities (RGB, depth, normals). Future work could explore the generation of more diverse modalities or leverage other generative models like diffusion models. generative adversarial networks, multimodal generation, data augmentation, depth estimation, surface normal estimation
2307.01197 Report Segment Anything Meets Point Tracking Frano Rajič, Lei Ke, Yu-Wing Tai, Chi-Keung Tang, Martin Danelljan, Fisher Yu The Segment Anything Model (SAM) has established itself as a powerful zero-shot image segmentation model, enabled by efficient point-centric annotation and prompt-based models. While click and brush interactions are both well explored in interactive image segmentation, the existing methods on videos focus on mask annotation and propagation. This paper presents SAM-PT, a novel method for point-centric interactive video segmentation, empowered by SAM and long-term point tracking. SAM-PT leverages robust and sparse point selection and propagation techniques for mask generation. Compared to traditional object-centric mask propagation strategies, we uniquely use point propagation to exploit local structure information agnostic to object semantics. We highlight the merits of point-based tracking through direct evaluation on the zero-shot open-world Unidentified Video Objects (UVO) benchmark. Our experiments on popular video object segmentation and multi-object segmentation tracking benchmarks, including DAVIS, YouTube-VOS, and BDD100K, suggest that a point-based segmentation tracker yields better zero-shot performance and efficient interactions. We release our code that integrates different point trackers and video segmentation benchmarks at https://github.com/SysCV/sam-pt. This paper introduces SAM-PT, a novel method for interactive video segmentation that combines the Segment Anything Model (SAM) with long-term point tracking, enabling zero-shot video object segmentation. Existing interactive video segmentation methods struggle with unseen objects and rely on laborious mask annotations. SAM-PT addresses these limitations by leveraging SAM's generalization ability and the efficiency of point-based tracking. SAM-PT selects query points on the first frame, tracks them throughout the video using point trackers like CoTracker, and prompts SAM with the tracked points to generate per-frame segmentation masks. An optional point reinitialization strategy improves tracking accuracy over time. SAM-PT achieves state-of-the-art zero-shot video object segmentation performance, outperforming previous methods on DAVIS, YouTube-VOS, and BDD100K datasets. It also surpasses a fully-supervised method on the open-world UVO dataset, demonstrating its strong generalization ability. In interactive settings, SAM-PT significantly reduces annotation effort, approaching the performance of fully supervised methods. SAM-PT's performance depends on the accuracy of the underlying point tracker, which can be challenged by occlusions and fast-moving objects. The current implementation relies on high-resolution inputs for both SAM and point trackers, limiting real-time performance. video segmentation, interactive segmentation, zero-shot learning, point tracking, segment anything model
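To make the track-then-prompt loop concrete, here is a rough sketch assuming the `segment_anything` package; the checkpoint path is a placeholder and `track_points` stands in for a real long-term point tracker (e.g., CoTracker) whose signature here is an assumption, not the released SAM-PT code.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # placeholder path
predictor = SamPredictor(sam)

def segment_video(frames, query_points, track_points):
    """Track query points through the video, then prompt SAM on every frame."""
    tracks = track_points(frames, query_points)   # assumed: (T, N, 2) per-frame (x, y) locations
    masks = []
    for frame, pts in zip(frames, tracks):
        predictor.set_image(frame)                # frame: HxWx3 uint8 RGB image
        m, scores, _ = predictor.predict(
            point_coords=np.asarray(pts, dtype=np.float32),
            point_labels=np.ones(len(pts), dtype=np.int64),  # all positive point prompts
            multimask_output=False,
        )
        masks.append(m[0])                        # single mask per frame
    return masks
```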
2307.01187 Report SAMAug: Point Prompt Augmentation for Segment Anything Model Haixing Dai, Chong Ma, Zhiling Yan, Zhengliang Liu, Enze Shi, Yiwei Li, Peng Shu, Xiaozheng Wei, Lin Zhao, Zihao Wu, Fang Zeng, Dajiang Zhu, Wei Liu, Quanzheng Li, Lichao Sun, Shu Zhang, Tianming Liu, Xiang Li This paper introduces SAMAug, a novel visual point augmentation method for the Segment Anything Model (SAM) that enhances interactive image segmentation performance. SAMAug generates augmented point prompts to provide more information about the user's intention to SAM. Starting with an initial point prompt, SAM produces an initial mask, which is then fed into our proposed SAMAug to generate augmented point prompts. By incorporating these extra points, SAM can generate augmented segmentation masks based on both the augmented point prompts and the initial prompt, resulting in improved segmentation performance. We conducted evaluations using four different point augmentation strategies: random sampling, sampling based on maximum difference entropy, maximum distance, and saliency. Experimental results on the COCO, Fundus, COVID QU-Ex, and ISIC2018 datasets show that SAMAug can boost SAM's segmentation results, especially when using the maximum distance and saliency strategies. SAMAug demonstrates the potential of visual prompt augmentation for computer vision. Code for SAMAug is available at github.com/yhydhx/SAMAug This paper introduces SAMAug, a visual point augmentation method for the Segment Anything Model (SAM) to enhance interactive image segmentation. SAM, while powerful, can be ambiguous with limited prompt information like a single point. SAMAug addresses this by generating augmented point prompts to better guide the model. SAMAug leverages an initial point prompt and the resulting SAM mask to generate additional point prompts using four strategies: random sampling, maximum difference entropy, maximum distance, and saliency. SAMAug consistently improves SAM's segmentation performance across COCO, Fundus, COVID QU-Ex, and ISIC2018 datasets. Maximum distance and saliency-based augmentation strategies demonstrate superior performance. Bounding box prompts generally outperform point prompts, but their augmentation proves less effective. The optimal augmentation strategy appears dataset-dependent, requiring further investigation. Adding multiple augmented points does not necessarily yield better results, indicating a need for refined multi-point strategies. prompt augmentation, segment anything model, visual prompting, interactive segmentation, image segmentation
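A hedged sketch of one plausible reading of the "maximum distance" strategy above: pick the in-mask pixel farthest from the initial click as the extra point prompt. The exact sampling rule, thresholds, and mask handling in SAMAug may differ; this is an illustration, not the released code.

```python
import numpy as np

def max_distance_point(mask: np.ndarray, initial_point: tuple) -> tuple:
    """Return the (row, col) inside `mask` farthest from the initial click."""
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:                      # empty mask: fall back to the initial click
        return initial_point
    d2 = (ys - initial_point[0]) ** 2 + (xs - initial_point[1]) ** 2
    i = int(np.argmax(d2))
    return int(ys[i]), int(xs[i])

mask = np.zeros((64, 64), dtype=bool)
mask[20:50, 10:40] = True
print(max_distance_point(mask, (25, 15)))  # a far corner of the mask: (49, 39)
```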
2307.00997 Report RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation Yonglin Li, Jing Zhang, Xiao Teng, Long Lan The Segment Anything Model (SAM) has gained significant attention for its impressive performance in image segmentation. However, it lacks proficiency in referring video object segmentation (RVOS) due to the need for precise user-interactive prompts and a limited understanding of different modalities, such as language and vision. This paper presents the RefSAM model, which explores the potential of SAM for RVOS by incorporating multi-view information from diverse modalities and successive frames at different timestamps in an online manner. Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP that projects the text embedding of the referring expression into sparse and dense embeddings, serving as user-interactive prompts. Additionally, we have introduced the hierarchical dense attention module to fuse hierarchical visual semantic information with sparse embeddings in order to obtain fine-grained dense embeddings, and an implicit tracking module to generate a track token and provide historical information for the mask decoder. Furthermore, we employ a parameter-efficient tuning strategy to effectively align and fuse the language and vision features. Through comprehensive ablation studies, we demonstrate the practical and effective design choices of our model. Extensive experiments conducted on Ref-Youtube-VOS, Ref-DAVIS17, and three referring image segmentation datasets validate the superiority and effectiveness of our RefSAM model over existing methods. The code and models will be made publicly available at https://github.com/LancasterLi/RefSAM. This paper proposes RefSAM, an end-to-end framework adapting the Segment Anything Model (SAM) for referring video object segmentation (RVOS) by incorporating multi-view information from different modalities and video frames. SAM, while powerful in image segmentation, lacks proficiency in RVOS due to limitations in handling user prompts and understanding different modalities like language. RefSAM adapts SAM by employing a Cross-Modal MLP to project text embeddings into prompts, a hierarchical dense attention module to fuse visual and textual features, and an implicit tracking module for temporal consistency. RefSAM outperforms state-of-the-art methods on Ref-DAVIS17. RefSAM achieves competitive performance on Ref-Youtube-VOS while being more parameter-efficient. Ablation studies demonstrate the effectiveness of key modules and the parameter-efficient tuning strategy. RefSAM's performance on Ref-Youtube-VOS, while competitive, falls slightly short of the state-of-the-art. Future work could explore more advanced designs for enhanced cross-modal fusion. video object segmentation, vision transformer, language and vision, segment anything, referring video object segmentation
2307.00910 Report CoPL: Contextual Prompt Learning for Vision-Language Understanding Koustava Goswami, Srikrishna Karanam, Prateksha Udhayanan, K J Joseph, Balaji Vasan Srinivasan Recent advances in multimodal learning has resulted in powerful vision-language models, whose representations are generalizable across a variety of downstream tasks. Recently, their generalization ability has been further extended by incorporating trainable prompts, borrowed from the natural language processing literature. While such prompt learning techniques have shown impressive results, we identify that these prompts are trained based on global image features which limits itself in two aspects: First, by using global features, these prompts could be focusing less on the discriminative foreground image, resulting in poor generalization to various out-of-distribution test cases. Second, existing work weights all prompts equally whereas intuitively, prompts should be reweighed according to the semantics of the image. We address these as part of our proposed Contextual Prompt Learning (CoPL) framework, capable of aligning the prompts to the localized features of the image. Our key innovations over earlier works include using local image features as part of the prompt learning process, and more crucially, learning to weight these prompts based on local features that are appropriate for the task at hand. This gives us dynamic prompts that are both aligned to local image features as well as aware of local contextual relationships. Our extensive set of experiments on a variety of standard and few-shot datasets show that our method produces substantially improved performance when compared to the current state of the art methods. We also demonstrate both few-shot and out-of-distribution performance to establish the utility of learning dynamic prompts that are aligned to local image features. This paper introduces CoPL (Contextual Prompt Learning), a novel method for image classification that enhances the generalization of pre-trained vision-language models by aligning prompts with local image features and dynamically weighting them based on semantic relevance. Existing prompt-based methods often rely on global image features, neglecting discriminative local information and treating all prompts equally, limiting their adaptability to diverse tasks and datasets. CoPL utilizes local image features to determine semantically meaningful prompts. It employs an attention mechanism to generate context representations by comparing learnable prompt tokens with patch representations. These context vectors dynamically weight and update the prompts, making them contextually aware. CoPL consistently outperforms baselines, including CLIP, CoOp, and CoCoOp, on 11 image classification datasets, demonstrating superior generalization to unseen classes and few-shot scenarios. CoPL achieves state-of-the-art zero-shot performance, surpassing CLIP by 1.4% in accuracy on average across datasets. The method exhibits strong inter-dataset transferability, effectively classifying images from unseen datasets after training on a different dataset. CoPL's performance may be limited on datasets lacking salient local features, such as EuroSAT, where global context is more critical. Future work includes extending CoPL to incorporate user intents for local image editing tasks, leveraging its ability to understand and manipulate local image content. prompt learning, image classification, vision-language models, few-shot learning, zero-shot learning
2307.00764 Report Hierarchical Open-vocabulary Universal Image Segmentation Xudong Wang, Shufan Li, Konstantinos Kallidromitis, Yusuke Kato, Kazuki Kozuka, Trevor Darrell Open-vocabulary image segmentation aims to partition an image into semantic regions according to arbitrary text descriptions. However, complex visual scenes can be naturally decomposed into simpler parts and abstracted at multiple levels of granularity, introducing inherent segmentation ambiguity. Unlike existing methods that typically sidestep this ambiguity and treat it as an external factor, our approach actively incorporates a hierarchical representation encompassing different semantic-levels into the learning process. We propose a decoupled text-image fusion mechanism and representation learning modules for both "things" and "stuff". Additionally, we systematically examine the differences that exist in the textual and visual features between these types of categories. Our resulting model, named HIPIE, tackles HIerarchical, oPen-vocabulary, and unIvErsal segmentation tasks within a unified framework. Benchmarked on over 40 datasets, e.g., ADE20K, COCO, Pascal-VOC Part, RefCOCO/RefCOCOg, ODinW and SeginW, HIPIE achieves the state-of-the-art results at various levels of image comprehension, including semantic-level (e.g., semantic segmentation), instance-level (e.g., panoptic/referring segmentation and object detection), as well as part-level (e.g., part/subpart segmentation) tasks. Our code is released at https://github.com/berkeley-hipie/HIPIE. HIPIE, a novel hierarchical, open-vocabulary, and universal image segmentation and detection model, effectively addresses inherent segmentation ambiguity using a hierarchical representation encompassing different semantic levels. Existing open-vocabulary image segmentation methods often treat segmentation ambiguity as an external factor. This work directly tackles this issue by incorporating a hierarchical representation into the learning process, allowing for more comprehensive and nuanced image analysis. The model uses decoupled text-image fusion and representation learning modules for "things" (countable objects) and "stuff" (uncountable regions) based on observed discrepancies in their visual and textual features. It employs a pretrained BERT for text features and ResNet-50 or ViT for image features. It utilizes early fusion for things and late fusion for stuff during mask generation. For hierarchical segmentation, class names from different granularity levels are concatenated as prompts during training and inference. HIPIE achieves state-of-the-art performance on over 40 datasets across various segmentation tasks, including semantic, instance, panoptic, referring, and part segmentation. The decoupled representation learning and text-image fusion for things and stuff significantly improve performance compared to unified approaches. The model effectively generalizes to novel part classes, demonstrating its open-vocabulary hierarchical segmentation capability. Future work will focus on applying HIPIE to video-related tasks and further evaluating its performance on video object tracking and segmentation. Exploring the impact of additional pretraining of the vision encoder on large-scale datasets like SA-1B and incorporating supplementary hierarchical datasets will be beneficial. open-vocabulary segmentation, hierarchical segmentation, universal segmentation, text-image fusion, representation learning
2307.00716 Report JourneyDB: A Benchmark for Generative Image Understanding Keqiang Sun, Junting Pan, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, Jifeng Dai, Yu Qiao, Limin Wang, Hongsheng Li While recent advancements in vision-language models have had a transformative impact on multi-modal comprehension, the extent to which these models possess the ability to comprehend generated images remains uncertain. Synthetic images, in comparison to real data, encompass a higher level of diversity in terms of both content and style, thereby presenting significant challenges for the models to fully grasp. In light of this challenge, we introduce a comprehensive dataset, referred to as JourneyDB, that caters to the domain of generative images within the context of multi-modal visual understanding. Our meticulously curated dataset comprises 4 million distinct and high-quality generated images, each paired with the corresponding text prompts that were employed in their creation. Furthermore, we additionally introduce an external subset with results of another 22 text-to-image generative models, which makes JourneyDB a comprehensive benchmark for evaluating the comprehension of generated images. On our dataset, we have devised four benchmarks to assess the performance of generated image comprehension in relation to both content and style interpretation. These benchmarks encompass prompt inversion, style retrieval, image captioning, and visual question answering. Lastly, we evaluate the performance of state-of-the-art multi-modal models when applied to the JourneyDB dataset, providing a comprehensive analysis of their strengths and limitations in comprehending generated content. We anticipate that the proposed dataset and benchmarks will facilitate further research in the field of generative content understanding. The dataset is publicly available at https://journeydb.github.io. This paper introduces JourneyDB, a large-scale dataset of 4 million generated image-prompt pairs designed for evaluating multi-modal visual understanding in the context of AI-generated images. Existing vision-language models are primarily trained on real images and struggle to comprehend the unique characteristics of generated images, which often depict fictional scenes and complex styles. The dataset was created by collecting image-prompt pairs from Midjourney, a text-to-image generation platform. GPT-3.5 was used to generate captions, separate prompts into content and style categories, and create visual question answering annotations. Existing multi-modal models perform poorly on JourneyDB benchmarks compared to real image datasets, highlighting their limitations in understanding generated content. Fine-tuning models on JourneyDB significantly improves their performance, indicating the dataset's value for training models on generative content. Prompt inversion, style retrieval, and visual question answering tasks on JourneyDB reveal specific challenges for models in understanding content and style nuances within generated images. Potential misalignment between some images and prompts may introduce noise into the annotations. Future work could explore expanding JourneyDB with images from other text-to-image generation models and incorporating human feedback to refine annotations. generated images, multi-modal understanding, vision-language models, text-to-image generation, benchmark dataset
2307.00619 Report Solving Linear Inverse Problems Provably via Posterior Sampling with Latent Diffusion Models Litu Rout, Negin Raoof, Giannis Daras, Constantine Caramanis, Alexandros G. Dimakis, Sanjay Shakkottai We present the first framework to solve linear inverse problems leveraging pre-trained latent diffusion models. Previously proposed algorithms (such as DPS and DDRM) only apply to pixel-space diffusion models. We theoretically analyze our algorithm showing provable sample recovery in a linear model setting. The algorithmic insight obtained from our analysis extends to more general settings often considered in practice. Experimentally, we outperform previously proposed posterior sampling algorithms in a wide variety of problems including random inpainting, block inpainting, denoising, deblurring, destriping, and super-resolution. This paper presents the first framework to solve linear inverse problems using pre-trained latent diffusion models, enabling the use of foundation models like Stable Diffusion without finetuning. Existing algorithms for inverse problems are limited to pixel-space diffusion models, preventing the utilization of powerful latent-based foundation models. The method extends Diffusion Posterior Sampling (DPS) with a "gluing" objective that guides the diffusion process towards latents consistent with both measurements and the decoder-encoder mapping. Theoretical analysis proves sample recovery in a linear model setting with a two-step diffusion process. The proposed Posterior Sampling with Latent Diffusion (PSLD) algorithm achieves state-of-the-art results on inpainting, block inpainting, denoising, deblurring, destriping, and super-resolution tasks. PSLD outperforms DPS on both in-distribution (FFHQ 256) and out-of-distribution (ImageNet 256) datasets using Stable Diffusion. Theoretical analysis demonstrates PSLD's advantage in avoiding the curse of ambient dimension associated with pixel-space diffusion models. The evaluation is based on Stable Diffusion, inheriting potential biases from the LAION dataset. The paper focuses on linear inverse problems, leaving extension to non-linear problems for future work. latent diffusion models, inverse problems, posterior sampling, stable diffusion, image restoration
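A schematic sketch of the DPS-style measurement guidance that PSLD builds on, moved into latent space: after a standard denoising step, the latent is nudged by the gradient of a data-consistency term computed through the decoder. All callables (`predict_z0`, `scheduler_step`, `decode`, `A`) and the step scale `eta` are placeholders, and the paper's additional "gluing" objective is omitted here for brevity.

```python
import torch

def guided_latent_step(z_t, t, predict_z0, scheduler_step, decode, A, y, eta=0.5):
    """One denoising step with a measurement-consistency gradient correction."""
    z_t = z_t.detach().requires_grad_(True)
    z0_hat = predict_z0(z_t, t)                 # denoiser's estimate of the clean latent
    residual = y - A(decode(z0_hat))            # consistency with measurement y = A(x)
    loss = residual.pow(2).sum()
    grad = torch.autograd.grad(loss, z_t)[0]    # gradient flows through the decoder
    z_prev = scheduler_step(z_t, t, z0_hat)     # the usual DDIM/DDPM update
    return (z_prev - eta * grad).detach()       # nudge toward data consistency
```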
2307.00522 Report LEDITS: Real Image Editing with DDPM Inversion and Semantic Guidance Linoy Tsaban, Apolinário Passos Recent large-scale text-guided diffusion models provide powerful image-generation capabilities. Currently, a significant effort is given to enable the modification of these images using text only as a means to offer intuitive and versatile editing. However, editing proves to be difficult for these generative models due to the inherent nature of editing techniques, which involves preserving certain content from the original image. Conversely, in text-based models, even minor modifications to the text prompt frequently result in an entirely distinct result, making it exceedingly challenging to attain one-shot generation that accurately corresponds to the user's intent. In addition, to edit a real image using these state-of-the-art tools, one must first invert the image into the pre-trained model's domain, adding another factor affecting the edit quality, as well as latency. In this exploratory report, we propose LEDITS, a combined lightweight approach for real-image editing, incorporating the Edit Friendly DDPM inversion technique with Semantic Guidance, thus extending Semantic Guidance to real image editing, while harnessing the editing capabilities of DDPM inversion as well. This approach achieves versatile edits, both subtle and extensive as well as alterations in composition and style, while requiring no optimization nor extensions to the architecture. The paper proposes LEDITS, a novel approach for editing real images by combining Edit Friendly DDPM inversion with Semantic Guidance (SEGA). LEDITS enables intuitive and versatile editing of real images within the latent space of text-guided diffusion models, addressing the limitations of existing methods that struggle with preserving content and achieving fine-grained control. LEDITS first inverts a real image into the latent space using Edit Friendly DDPM inversion. Then, it applies SEGA during the denoising process, utilizing pre-computed noise vectors from the inversion step to guide the generation towards the desired edits specified by text prompts and semantic concepts. LEDITS achieves a balance between fidelity to the original image and the creativity of the edit, allowing for both subtle and extensive modifications. The method offers flexibility and versatility by enabling independent editing operations with DDPM inversion and SEGA, leading to more diverse outputs. LEDITS retains the strengths of both DDPM inversion and SEGA, such as preserving image semantics, achieving high fidelity to editing prompts, and demonstrating robustness and monotonicity in semantic guidance. The paper primarily focuses on qualitative analysis and leaves quantitative evaluations for future work. Further exploration of the interplay between DDPM inversion parameters and SEGA guidance scales is suggested. image editing, diffusion models, ddpm inversion, semantic guidance, text-guided image manipulation

2307.00430 Report WaveMixSR: A Resource-efficient Neural Network for Image Super-resolution Pranav Jeevan, Akella Srinidhi, Pasunuri Prathiba, Amit Sethi Image super-resolution research has recently been dominated by transformer models, which need higher computational resources than CNNs due to the quadratic complexity of self-attention. We propose a new neural network -- WaveMixSR -- for image super-resolution based on the WaveMix architecture, which uses a 2D-discrete wavelet transform for spatial token-mixing. Unlike transformer-based models, WaveMixSR does not unroll the image as a sequence of pixels/patches. It uses the inductive bias of convolutions along with the lossless token-mixing property of wavelet transform to achieve higher performance while requiring fewer resources and training data. We compare the performance of our network with other state-of-the-art methods for image super-resolution. Our experiments show that WaveMixSR achieves competitive performance on all datasets and reaches state-of-the-art performance on the BSD100 dataset on multiple super-resolution tasks. Our model is able to achieve this performance using less training data and computational resources while maintaining high parameter efficiency compared to current state-of-the-art models. Proposes WaveMixSR, a novel wavelet-based neural network architecture for image super-resolution, employing 2D discrete wavelet transform (DWT) for efficient spatial token mixing. Addresses limitations of transformer-based SR models, such as high computational cost and data requirements, by leveraging the efficiency and inductive bias of DWT and CNNs. Constructs a two-path network where the luminance (Y) channel undergoes upsampling, feature extraction using multiple WaveMix blocks (containing DWT, convolutions, and MLPs), and reconstruction, while the chrominance (CbCr) channels are upsampled separately. Achieves state-of-the-art performance on the BSD100 dataset for multiple SR tasks, outperforming transformer-based methods. Demonstrates high parameter efficiency and reduced computational complexity compared to transformer-based models, requiring fewer resources and training data. Successfully reconstructs high-frequency details and sharp images, as evidenced by visual results and quantitative metrics (PSNR, SSIM). Performance improvement potential by exploring larger training datasets (DF2K) and pre-training techniques. Further investigation into the benefits of adversarial training for potential enhancements. image super-resolution, wavelet transform, token mixing, deep learning, computer vision
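To illustrate the parameter-free token-mixing ingredient, here is a toy single-level 2D Haar DWT. The actual WaveMix block wraps this in convolutions and MLPs and may use a different wavelet implementation (e.g., pywt); the code below is an assumption-laden sketch, not the paper's implementation.

```python
import torch

def haar_dwt2d(x):                        # x: (B, C, H, W) with even H and W
    a = x[..., 0::2, 0::2]
    b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]
    d = x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 2              # low-frequency approximation
    lh = (a - b + c - d) / 2              # detail sub-band
    hl = (a + b - c - d) / 2              # detail sub-band
    hh = (a - b - c + d) / 2              # detail sub-band
    return torch.cat([ll, lh, hl, hh], dim=1)   # 4C channels at half spatial resolution

out = haar_dwt2d(torch.randn(1, 16, 64, 64))
print(out.shape)                           # torch.Size([1, 64, 32, 32])
```

The transform is lossless (the four sub-bands can be inverted back to the input) and needs no learned parameters, which is what lets the surrounding convolutions and MLPs stay small.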
2307.00407 Report WavePaint: Resource-efficient Token-mixer for Self-supervised Inpainting Pranav Jeevan, Dharshan Sampath Kumar, Amit Sethi Image inpainting, which refers to the synthesis of missing regions in an image, can help restore occluded or degraded areas and also serve as a precursor task for self-supervision. The current state-of-the-art models for image inpainting are computationally heavy as they are based on transformer or CNN backbones that are trained in adversarial or diffusion settings. This paper diverges from vision transformers by using a computationally-efficient WaveMix-based fully convolutional architecture -- WavePaint. It uses a 2D-discrete wavelet transform (DWT) for spatial and multi-resolution token-mixing along with convolutional layers. The proposed model outperforms the current state-of-the-art models for image inpainting on reconstruction quality while also using less than half the parameter count and considerably lower training and evaluation times. Our model even outperforms current GAN-based architectures in CelebA-HQ dataset without using an adversarially trainable discriminator. Our work suggests that neural architectures that are modeled after natural image priors require fewer parameters and computations to achieve generalization comparable to transformers. This paper presents WavePaint, a computationally-efficient, fully convolutional model based on WaveMix for high-quality image inpainting. Current state-of-the-art inpainting models heavily rely on computationally expensive transformers or CNNs, requiring substantial resources and training time. WavePaint leverages the power of 2D discrete wavelet transform (DWT) for spatial and multi-resolution token mixing, enabling efficient global context understanding. WavePaint achieves comparable, and in some cases superior, results to SOTA models on CelebA-HQ using fewer parameters and significantly faster training and inference. It outperforms larger models like LaMa in terms of FID score, parameter count, GPU memory usage, and speed. The model demonstrates the effectiveness of wavelet token mixing for realistic image generation from masked images without requiring adversarial or diffusion-based training. The study primarily focuses on large mask inpainting and doesn't address blind mask inpainting. Future work includes exploring WavePaint's potential for resource-efficient image generation in adversarial or diffusion settings. image inpainting, wavelet transform, token mixing, wavemix, image generation
2307.00398 Report ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models Uddeshya Upadhyay, Shyamgopal Karthik, Massimiliano Mancini, Zeynep Akata Large-scale vision-language models (VLMs) like CLIP successfully find correspondences between images and text. Through the standard deterministic mapping process, an image or a text sample is mapped to a single vector in the embedding space. This is problematic: as multiple samples (images or text) can abstract the same concept in the physical world, deterministic embeddings do not reflect the inherent ambiguity in the embedding space. We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained VLMs via inter/intra-modal alignment in a post-hoc manner without needing large-scale datasets or computing. On four challenging datasets, i.e., COCO, Flickr, CUB, and Oxford-flowers, we estimate the multi-modal embedding uncertainties for two VLMs, i.e., CLIP and BLIP, quantify the calibration of embedding uncertainties in retrieval tasks and show that ProbVLM outperforms other methods. Furthermore, we propose active learning and model selection as two real-world downstream tasks for VLMs and show that the estimated uncertainty aids both tasks. Lastly, we present a novel technique for visualizing the embedding distributions using a large-scale pre-trained latent diffusion model. Code is available at https://github.com/ExplainableML/ProbVLM. This paper introduces \texttt{ProbVLM}, a post-hoc probabilistic adapter that converts deterministic embeddings from pre-trained vision-language models (VLMs) into probabilistic embeddings. Existing large-scale VLMs provide deterministic embeddings that do not capture the inherent ambiguity in image-text mappings, limiting their ability to model uncertainty in downstream tasks. \texttt{ProbVLM} leverages pre-trained VLM encoders to predict parameters of a heteroscedastic generalized Gaussian distribution for each embedding. It is trained using a combination of intra-modal and cross-modal alignment objectives. \texttt{ProbVLM} provides well-calibrated uncertainties, with higher uncertainties correlating with lower performance on retrieval tasks. Uncertainty estimates from \texttt{ProbVLM} enable effective model selection from a set of fine-tuned VLMs on unlabeled target datasets. The uncertainties facilitate active learning by selecting the most informative samples for fine-tuning, leading to improved performance with limited labeled data. Exploration of more complex probability distributions beyond the generalized Gaussian distribution. Investigating the integration of \texttt{ProbVLM} into the training process of VLMs, rather than as a post-hoc adaptation. vision-language models, probabilistic embeddings, uncertainty estimation, active learning, model selection
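A hedged sketch of a post-hoc probabilistic adapter in the spirit of ProbVLM: a small head maps a frozen, deterministic embedding to the parameters (mu, alpha, beta) of a heteroscedastic generalized Gaussian. Layer sizes, activations, and the positivity trick are illustrative assumptions, not the released architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProbAdapter(nn.Module):
    """Maps a frozen VLM embedding to per-dimension (mu, alpha, beta) parameters."""
    def __init__(self, dim=512, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, dim)      # mean
        self.alpha_head = nn.Linear(hidden, dim)   # scale (kept positive below)
        self.beta_head = nn.Linear(hidden, dim)    # shape (kept positive below)

    def forward(self, z):
        h = self.trunk(z)
        mu = self.mu_head(h)
        alpha = F.softplus(self.alpha_head(h)) + 1e-4
        beta = F.softplus(self.beta_head(h)) + 1e-4
        return mu, alpha, beta

mu, alpha, beta = ProbAdapter()(torch.randn(8, 512))  # distribution over each embedding
```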
2307.00300 Report DreamIdentity: Improved Editability for Efficient Face-identity Preserved Image Generation Zhuowei Chen, Shancheng Fang, Wei Liu, Qian He, Mengqi Huang, Yongdong Zhang, Zhendong Mao While large-scale pre-trained text-to-image models can synthesize diverse and high-quality human-centric images, an intractable problem is how to preserve the face identity for conditioned face images. Existing methods either require time-consuming optimization for each face-identity or learning an efficient encoder at the cost of harming the editability of models. In this work, we present an optimization-free method for each face identity, meanwhile keeping the editability for text-to-image models. Specifically, we propose a novel face-identity encoder to learn an accurate representation of human faces, which applies multi-scale face features followed by a multi-embedding projector to directly generate the pseudo words in the text embedding space. Besides, we propose self-augmented editability learning to enhance the editability of models, which is achieved by constructing paired generated face and edited face images using celebrity names, aiming at transferring mature ability of off-the-shelf text-to-image models in celebrity faces to unseen faces. Extensive experiments show that our methods can generate identity-preserved images under different scenes at a much faster speed. This paper proposes DreamIdentity, an optimization-free method for preserving face identity in text-to-image models, enabling identity-preserved image generation under different scenes at a much faster speed. Existing methods for preserving face identity in text-to-image synthesis either require time-consuming optimization per identity or compromise model editability. This paper addresses these limitations by introducing an efficient and effective approach. DreamIdentity utilizes a novel Multi-word Multi-scale ID encoder (M2 ID encoder) to learn accurate representations of human faces by extracting multi-scale features and projecting them into multiple word embeddings. It also introduces Self-Augmented Editability Learning to enhance editability by training the encoder using a self-augmented dataset of celebrity faces and their edited versions. DreamIdentity outperforms existing optimization-based and efficient methods in terms of text-alignment, face similarity, and encoding speed. The M2 ID encoder, with its multi-scale features and multi-word embedding projection, significantly improves identity preservation compared to using a standard CLIP encoder. Self-Augmented Editability Learning effectively enhances the model's ability to generate images that adhere to editing prompts while preserving identity. The model's performance may be limited when presented with poor-quality or out-of-domain face images. Editability might be hindered when generating scenes that significantly deviate from the input identity's gender characteristics. text-to-image synthesis, face identity preservation, personalized image generation, multi-word embedding, self-augmented editability learning
2307.00154 Report Stitched ViTs are Flexible Vision Backbones Zizheng Pan, Jing Liu, Haoyu He, Jianfei Cai, Bohan Zhuang Large pretrained plain vision Transformers (ViTs) have been the workhorse for many downstream tasks. However, existing works utilizing off-the-shelf ViTs are inefficient in terms of training and deployment, because adopting ViTs with individual sizes requires separate trainings and is restricted by fixed performance-efficiency trade-offs. In this paper, we are inspired by stitchable neural networks (SN-Net), which is a new framework that cheaply produces a single model that covers rich subnetworks by stitching pretrained model families, supporting diverse performance-efficiency trade-offs at runtime. Building upon this foundation, we introduce SN-Netv2, a systematically improved model stitching framework to facilitate downstream task adaptation. Specifically, we first propose a two-way stitching scheme to enlarge the stitching space. We then design a resource-constrained sampling strategy that takes into account the underlying FLOPs distributions in the space for better sampling. Finally, we observe that learning stitching layers as a low-rank update plays an essential role on downstream tasks to stabilize training and ensure a good Pareto frontier. With extensive experiments on ImageNet-1K, ADE20K, COCO-Stuff-10K and NYUv2, SN-Netv2 demonstrates superior performance over SN-Netv1 on downstream dense predictions and shows strong ability as a flexible vision backbone, achieving great advantages in both training efficiency and deployment flexibility. Code is available at https://github.com/ziplab/SN-Netv2. This paper introduces SN-Netv2, an improved framework for adapting large pretrained vision transformers (ViTs) to downstream tasks like semantic segmentation and depth estimation. SN-Netv2 creates a single model encompassing numerous subnetworks with varying performance-efficiency trade-offs by stitching together pretrained ViTs of different sizes. Existing methods for adapting pretrained ViTs to downstream tasks are inefficient for training and deployment as they require separate training for each ViT size and lack flexibility in performance-efficiency trade-offs. SN-Netv2 introduces three key improvements: 1) Two-way Stitching (TWS) for a larger, more optimal stitching space, 2) Resource-constrained Sampling (ROS) for balanced training across varying FLOPs constraints, 3) Low-Rank Adaptation of Stitching Layers (LoRA SL) for stabilizing training and achieving smoother performance curves. SN-Netv2 outperforms its predecessor SN-Netv1 and achieves competitive performance with individually trained ViTs across benchmarks like ADE20K, COCO-Stuff-10K, and NYUv2. The framework offers significant training efficiency advantages, requiring less GPU hours than training individual ViT backbones separately. SN-Netv2 enables flexible deployment as a single model can adapt to various resource constraints at runtime. Exploration of parameter-efficient approaches within SN-Netv2 is left for future work. Future work can investigate better training strategies to further improve the performance of stitches at the Pareto frontier. vision transformers, model stitching, downstream task adaptation, semantic segmentation, depth estimation
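To show the "stitching layer as a low-rank update" idea in isolation: a frozen linear map between two ViT widths (in SN-Net typically initialized by least squares) plus a trainable low-rank correction that starts at zero. Dimensions, rank, and the random base initialization are arbitrary examples, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LoRAStitch(nn.Module):
    """Stitching layer between a narrow and a wide ViT with a low-rank trainable update."""
    def __init__(self, d_in=384, d_out=768, rank=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)        # full-rank map, frozen after initialization
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        self.down = nn.Linear(d_in, rank, bias=False)   # trainable low-rank factors
        self.up = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.up.weight)            # update starts at zero: output = base(x)

    def forward(self, x):                          # x: tokens from the smaller ViT
        return self.base(x) + self.up(self.down(x))

out = LoRAStitch()(torch.randn(2, 197, 384))       # (2, 197, 768) tokens for the larger ViT
```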
2307.00040 Report DisCo: Disentangled Control for Realistic Human Dance Generation Tan Wang, Linjie Li, Kevin Lin, Yuanhao Zhai, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang Generative AI has made significant strides in computer vision, particularly in text-driven image/video synthesis (T2I/T2V). Despite the notable advancements, human-centric content synthesis such as realistic dance generation remains challenging. Current methodologies, primarily tailored for human motion transfer, encounter difficulties when confronted with real-world dance scenarios (e.g., social media dance), which require generalizing across a wide spectrum of poses and intricate human details. In this paper, we depart from the traditional paradigm of human motion transfer and emphasize two additional critical attributes for the synthesis of human dance content in social media contexts: (i) Generalizability: the model should be able to generalize beyond generic human viewpoints as well as unseen human subjects, backgrounds, and poses; (ii) Compositionality: it should allow for the seamless composition of seen/unseen subjects, backgrounds, and poses from different sources. To address these challenges, we introduce DisCo, which includes a novel model architecture with disentangled control to improve the compositionality of dance synthesis, and an effective human attribute pre-training for better generalizability to unseen humans. Extensive qualitative and quantitative results demonstrate that DisCo can generate high-quality human dance images and videos with diverse appearances and flexible motions. Code is available at https://disco-dance.github.io/. This paper introduces DisCo, a novel approach for generating realistic human dance videos from a single image, particularly focusing on social media scenarios like TikTok. Existing methods for human motion transfer struggle with generalizability to unseen subjects, backgrounds, and poses, and lack compositionality for creating novel combinations. DisCo employs a disentangled control architecture with separate ControlNet branches for background and pose, and incorporates CLIP image embeddings for human foreground. It also utilizes a human attribute pre-training strategy on large-scale image datasets to enhance generalizability. DisCo demonstrates superior quantitative results on FID, FID-VID, and FVD metrics compared to state-of-the-art methods like DreamPose. It exhibits strong generalizability, successfully generating dance videos with unseen human subjects, backgrounds, and poses. Qualitative results and a user study confirm the generation of high-quality, faithful, and composable human dance videos with diverse appearances and motions. The model currently struggles with hand posture accuracy without fine-grained hand pose control. Extending the approach to multi-person scenarios and human-object interactions presents future challenges. Future work could explore motion pre-training via video data alongside attribute pre-training. human dance generation, disentangled control, diffusion models, controlnet, human attribute pre-training
2307.00038 Report Training-free Object Counting with Prompts Zenglin Shi, Ying Sun, Mengmi Zhang This paper tackles the problem of object counting in images. Existing approaches rely on extensive training data with point annotations for each object, making data collection labor-intensive and time-consuming. To overcome this, we propose a training-free object counter that treats the counting task as a segmentation problem. Our approach leverages the Segment Anything Model (SAM), known for its high-quality masks and zero-shot segmentation capability. However, the vanilla mask generation method of SAM lacks class-specific information in the masks, resulting in inferior counting accuracy. To overcome this limitation, we introduce a prior-guided mask generation method that incorporates three types of priors into the segmentation process, enhancing efficiency and accuracy. Additionally, we tackle the issue of counting objects specified through text by proposing a two-stage approach that combines reference object selection and prior-guided mask generation. Extensive experiments on standard datasets demonstrate the competitive performance of our training-free counter compared to learning-based approaches. This paper presents a promising solution for counting objects in various scenarios without the need for extensive data collection and counting-specific training. Code is available at \url{https://github.com/shizenglin/training-free-object-counter} This paper presents a training-free object counting model that leverages the Segment Anything Model (SAM) and incorporates prior information for accurate and efficient object counting using prompts like points, boxes, or text. Existing object counting methods rely heavily on extensive training data with point annotations, which is labor-intensive and time-consuming. This work addresses this limitation by proposing a training-free approach, making object counting more accessible and flexible. The method formulates counting as a segmentation problem. It leverages SAM for segmentation and introduces a prior-guided mask generation approach incorporating three priors: similarity prior, segment prior, and semantic prior. For text-based counting, a two-stage approach combining reference object selection and prior-guided mask generation is proposed. The training-free counter achieves competitive performance compared to learning-based approaches on standard datasets like FSC-147 and CARPK. The prior-guided mask generation method significantly improves counting efficiency and accuracy by effectively differentiating target objects from non-target objects. The reference object selection algorithm enhances text-specified counting by refining the similarity maps obtained from CLIP-Surgery. The model faces challenges in counting extremely small, occluded, or densely clustered objects. Future work will focus on addressing these limitations by developing more advanced adaptive thresholding methods or fine-tuning SAM with limited annotated data. object counting, training-free, segment anything model (sam), prior-guided segmentation, text-specified counting
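For orientation, the baseline "counting as segmentation" formulation can be sketched with the vanilla SAM automatic mask generator: count the generated masks above a small area threshold. The paper's prior-guided mask generation and reference-object selection are more selective than this; the checkpoint path and threshold below are placeholders.

```python
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # placeholder path
mask_generator = SamAutomaticMaskGenerator(sam)

def count_objects(image_rgb: np.ndarray, min_area: int = 100) -> int:
    """Count = number of generated masks with area above a small threshold."""
    masks = mask_generator.generate(image_rgb)   # list of dicts with 'segmentation', 'area', ...
    return sum(1 for m in masks if m["area"] >= min_area)
```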
2306.17843 Report Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, Bernard Ghanem We present Magic123, a two-stage coarse-to-fine approach for high-quality, textured 3D meshes generation from a single unposed image in the wild using both2D and 3D priors. In the first stage, we optimize a neural radiance field to produce a coarse geometry. In the second stage, we adopt a memory-efficient differentiable mesh representation to yield a high-resolution mesh with a visually appealing texture. In both stages, the 3D content is learned through reference view supervision and novel views guided by a combination of 2D and 3D diffusion priors. We introduce a single trade-off parameter between the 2D and 3D priors to control exploration (more imaginative) and exploitation (more precise) of the generated geometry. Additionally, we employ textual inversion and monocular depth regularization to encourage consistent appearances across views and to prevent degenerate solutions, respectively. Magic123 demonstrates a significant improvement over previous image-to-3D techniques, as validated through extensive experiments on synthetic benchmarks and diverse real-world images. Our code, models, and generated 3D assets are available at https://github.com/guochengqian/Magic123. Magic123: a novel two-stage coarse-to-fine approach for high-quality textured 3D mesh generation from a single unposed image, using both 2D and 3D diffusion priors. Single-image 3D reconstruction is a challenging, ill-posed problem in computer vision, with existing methods often limited in quality, generalization, or computational cost. This work combines the advantages of both 2D and 3D priors to generate high-fidelity 3D content with detailed geometry and appealing textures. The method utilizes a two-stage optimization: 1) A coarse stage optimizes a neural radiance field (NeRF) for initial geometry. 2) A fine stage employs a memory-efficient differentiable mesh (DMTet) to refine geometry and texture at high resolution. Both stages leverage 2D and 3D diffusion priors for novel view guidance, controlled by a trade-off parameter for exploration/exploitation. Magic123 demonstrates significant improvement over existing image-to-3D techniques on both synthetic and real-world images, achieving state-of-the-art performance in quantitative metrics (PSNR, LPIPS, CLIP-similarity). The method successfully balances geometry exploration and exploitation, generating faithful 3D reconstructions with high generalizability to diverse objects. The two-stage coarse-to-fine approach enables high-resolution (1K) output with disentangled geometry and texture. The current method assumes the reference image is captured from a front view, limiting its applicability to unposed images with significant viewpoint variations. The reliance on pre-processing steps like segmentation and depth estimation introduces potential error propagation to the 3D generation. 3d reconstruction, single image to 3d, diffusion models, neural radiance fields (nerf), deep learning
2306.17842 Report SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs Lijun Yu, Yong Cheng, Zhiruo Wang, Vivek Kumar, Wolfgang Macherey, Yanping Huang, David A. Ross, Irfan Essa, Yonatan Bisk, Ming-Hsuan Yang, Kevin Murphy, Alexander G. Hauptmann, Lu Jiang In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (or words) extracted from the LLM's vocabulary. The resulting tokens capture both the semantic meaning and the fine-grained details needed for visual reconstruction, effectively translating the visual content into a language comprehensible to the LLM, and empowering it to perform a wide array of multimodal tasks. Our approach is validated through in-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set of image understanding and generation tasks. Our method marks the first successful attempt to enable a frozen LLM to generate image content while surpassing state-of-the-art performance in image understanding tasks, under the same setting, by over 25%. This paper introduces Semantic Pyramid AutoEncoder (SPAE), which enables frozen LLMs to perform visual understanding and generation tasks through conversion between visual content and interpretable lexical tokens. This approach leverages the knowledge and generative capabilities of LLMs for multimodal tasks without requiring training on image-text pairs. SPAE uses a frozen language codebook and a pyramid token structure to capture semantic concepts and fine-grained details. It employs a semantic loss to encourage conceptually relevant tokens and utilizes in-context learning with a progressive denoising method for image generation. SPAE with PaLM 2 outperforms the best-published few-shot image classification accuracy by over 25%. The pyramid structure allows for representing semantic concepts with fewer tokens, improving efficiency. Frozen LLMs, when paired with SPAE, are capable of performing tasks like image captioning, visual question answering, and conditional image denoising. Reconstructing images with a frozen language codebook requires more tokens compared to models with learned codebooks. The in-context learning approach is limited by the acceptable sequence length, impacting image resolution and quality. multimodal learning, large language models, image generation, image understanding, in-context learning
2306.17723 Report FlipNeRF: Flipped Reflection Rays for Few-shot Novel View Synthesis Seunghyeon Seo, Yeonjin Chang, Nojun Kwak Neural Radiance Field (NeRF) has been a mainstream approach in novel view synthesis with its remarkable quality of rendered images and simple architecture. Although NeRF has been developed in various directions, continuously improving its performance, the necessity of a dense set of multi-view images still remains a stumbling block to practical application. In this work, we propose FlipNeRF, a novel regularization method for few-shot novel view synthesis by utilizing our proposed flipped reflection rays. The flipped reflection rays are explicitly derived from the input ray directions and estimated normal vectors, and serve as effective additional training rays while enabling the model to estimate more accurate surface normals and learn the 3D geometry effectively. Since the surface normal and the scene depth are both derived from the estimated densities along a ray, the accurate surface normal leads to more exact depth estimation, which is a key factor for few-shot novel view synthesis. Furthermore, with our proposed Uncertainty-aware Emptiness Loss and Bottleneck Feature Consistency Loss, FlipNeRF is able to estimate more reliable outputs while effectively reducing floating artifacts across different scene structures, and enhance the feature-level consistency between the pair of rays cast toward photo-consistent pixels without any additional feature extractor, respectively. Our FlipNeRF achieves SOTA performance on multiple benchmarks across all scenarios. FlipNeRF, a novel regularization method for few-shot novel view synthesis, utilizes flipped reflection rays as additional training cues. NeRF struggles with performance degradation when trained on sparse views. FlipNeRF addresses this by generating effective reflection rays, enabling more accurate surface normal and depth estimation. FlipNeRF generates flipped reflection rays based on input ray directions and estimated surface normals. It uses a masking strategy to filter ineffective rays and introduces Uncertainty-aware Emptiness Loss (UE Loss) and Bottleneck Feature Consistency Loss (BFC Loss) to improve the reliability and consistency of the model. Achieves state-of-the-art performance on Realistic Synthetic 360°, DTU, and LLFF benchmarks. Significantly outperforms baselines under extremely sparse settings (e.g., 3/4-view). Demonstrates the importance of accurate surface normal estimation in few-shot novel view synthesis. The improvement is less significant on LLFF due to its less dynamic camera poses. Exploring the combination of FlipNeRF with Ref-NeRF representation or further research on view-dependent appearance for few-shot novel view synthesis. novel view synthesis, neural radiance fields (nerf), few-shot learning, surface normal estimation, regularization
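The geometric ingredient behind the flipped reflection rays is the standard mirror-reflection formula about an estimated surface normal. The sketch below shows only that formula; the paper's exact sign and origin conventions, ray masking, and how the reflected rays are cast and supervised are not reproduced here.

```python
import torch
import torch.nn.functional as F

def flipped_reflection_dir(ray_dir, normal):
    """Mirror incoming ray directions about estimated unit surface normals.

    ray_dir, normal: (..., 3) tensors; returns (..., 3) reflected unit directions.
    """
    d = F.normalize(ray_dir, dim=-1)
    n = F.normalize(normal, dim=-1)
    return d - 2.0 * (d * n).sum(-1, keepdim=True) * n

r = flipped_reflection_dir(torch.tensor([[1.0, 0.0, -1.0]]),
                           torch.tensor([[0.0, 0.0, 1.0]]))
print(r)   # tensor([[0.7071, 0.0000, 0.7071]])
```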
2306.17643 Report Neural 3D Scene Reconstruction from Multiple 2D Images without 3D Supervision Yi Guo, Che Sun, Yunde Jia, Yuwei Wu Neural 3D scene reconstruction methods have achieved impressive performance when reconstructing complex geometry and low-textured regions in indoor scenes. However, these methods heavily rely on 3D data which is costly and time-consuming to obtain in real world. In this paper, we propose a novel neural reconstruction method that reconstructs scenes using sparse depth under the plane constraints without 3D supervision. We introduce a signed distance function field, a color field, and a probability field to represent a scene. We optimize these fields to reconstruct the scene by using differentiable ray marching with accessible 2D images as supervision. We improve the reconstruction quality of complex geometry scene regions with sparse depth obtained by using the geometric constraints. The geometric constraints project 3D points on the surface to similar-looking regions with similar features in different 2D images. We impose the plane constraints to make large planes parallel or vertical to the indoor floor. Both two constraints help reconstruct accurate and smooth geometry structures of the scene. Without 3D supervision, our method achieves competitive performance compared with existing methods that use 3D supervision on the ScanNet dataset. This paper proposes a novel neural 3D scene reconstruction method that reconstructs indoor scenes from 2D images without 3D supervision by using sparse depth under plane constraints. Existing neural 3D scene reconstruction methods heavily rely on 3D supervision, which is costly and time-consuming to obtain. This paper aims to address this challenge by reconstructing scenes using only 2D images. The method represents a scene as a signed distance function field, a color field, and a plane probability field. It uses differentiable volume rendering with 2D images to optimize these fields. The method utilizes geometry constraints to obtain sparse depth for reconstructing regions with complex geometry. It also imposes plane constraints to improve the reconstruction quality of large low-textured regions. The method achieves comparable results to Manhattan-SDF with dense depth while using only sparse depth. The method outperforms Manhattan-SDF when both use dense depth. The method achieves comparable results to existing methods that use 3D supervision on the ScanNet dataset. The method relies on the assumption that large planes in the scene are parallel or vertical to the floor, which may not hold for all indoor scenes. The plane estimation method used may not be robust to complex scenes with many small planes. 3d scene reconstruction, neural implicit representation, volume rendering, unsupervised learning, plane constraints
2306.17567 Report Counting Guidance for High Fidelity Text-to-Image Synthesis Wonjun Kang, Kevin Galim, Hyung Il Koo Recently, the quality and performance of text-to-image generation has significantly advanced due to the impressive results of diffusion models. However, text-to-image diffusion models still fail to generate high fidelity content with respect to the input prompt. One problem where text-to-image diffusion models struggle is generating the exact number of objects specified in the text prompt. E.g., given the prompt "five apples and ten lemons on a table", diffusion-generated images usually contain the wrong number of objects. In this paper, we propose a method to improve diffusion models to focus on producing the correct object count given the input prompt. We adopt a counting network that performs reference-less class-agnostic counting for any given image. We calculate the gradients of the counting network and refine the predicted noise for each step. To handle multiple types of objects in the prompt, we use novel attention map guidance to obtain high-fidelity masks for each object. Finally, we guide the denoising process by the calculated gradients for each object. Through extensive experiments and evaluation, we demonstrate that our proposed guidance method greatly improves the fidelity of diffusion models to object count. This paper proposes counting guidance, a novel method leveraging a counting network to guide Stable Diffusion in generating images with the precise number of objects specified in the text prompt. Current text-to-image diffusion models struggle to accurately depict the correct object count as per user instructions, limiting their ability to fulfill specific image generation requests. The method employs a pre-trained counting network (RCC) and uses its gradients to refine the noise prediction during the Stable Diffusion denoising process. For multiple object types, attention map guidance is introduced to prevent semantic information mixing and generate accurate object masks, enabling masked counting guidance for each object. The proposed method successfully generates the specified number of objects for both single and multiple object type prompts. Attention map guidance effectively mitigates the semantic information mixing problem in Stable Diffusion, leading to more accurate object representation. The approach demonstrates efficacy in handling a large number of objects, improving upon the limitations of the base Stable Diffusion model. Tuning the scale parameters of the counting network guidance is often required for different text prompts. Generating the exact number of complex objects remains challenging due to the early determination of image structure in the denoising process. text-to-image generation, diffusion models, stable diffusion, object counting, attention map guidance
2306.17560 Report Class-Incremental Learning using Diffusion Model for Distillation and Replay Quentin Jodelet, Xin Liu, Yin Jun Phua, Tsuyoshi Murata Class-incremental learning aims to learn new classes in an incremental fashion without forgetting the previously learned ones. Several research works have shown how additional data can be used by incremental models to help mitigate catastrophic forgetting. In this work, following the recent breakthrough in text-to-image generative models and their wide distribution, we propose the use of a pretrained Stable Diffusion model as a source of additional data for class-incremental learning. Compared to competitive methods that rely on external, often unlabeled, datasets of real images, our approach can generate synthetic samples belonging to the same classes as the previously encountered images. This allows us to use those additional data samples not only in the distillation loss but also for replay in the classification loss. Experiments on the competitive benchmarks CIFAR100, ImageNet-Subset, and ImageNet demonstrate how this new approach can be used to further improve the performance of state-of-the-art methods for class-incremental learning on large scale datasets. This paper proposes SDDR, a novel class-incremental learning method leveraging a pre-trained Stable Diffusion model to generate synthetic images for both knowledge distillation and replay. Existing CIL methods using additional data rely on external datasets of real images, limiting their use to distillation. SDDR overcomes this by generating labeled synthetic images of previously learned classes, allowing their use for both distillation and replay, leading to improved performance. SDDR generates synthetic images using class names and descriptions as prompts for Stable Diffusion. During training, it combines these images with real data for both classification and distillation losses. The approach is designed to be complementary and can be integrated with other CIL methods. SDDR significantly improves the average incremental accuracy of baselines like iCaRL and LUCIR on CIFAR100, ImageNet-Subset, and ImageNet. Combining SDDR with FOSTER achieves state-of-the-art performance on several benchmarks. SDDR shows significant improvements, especially in challenging scenarios with limited memory and a large number of incremental steps. The quality and diversity of synthetic images are limited by the pre-trained Stable Diffusion model. Future work includes exploring fine-tuning of the generative model during training and modifying losses to bridge the gap between synthetic and real data. class-incremental learning, stable diffusion, synthetic data, knowledge distillation, catastrophic forgetting
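A rough sketch of how synthetic replay and distillation can be combined in one training step, assuming the synthetic batch (x_syn, y_syn) was generated offline by prompting Stable Diffusion with the names of previously seen classes; the loss weighting and exact distillation form below are generic choices, not necessarily those used in the paper.

import torch
import torch.nn.functional as F

def incremental_step_with_synthetic(model, old_model, x_real, y_real, x_syn, y_syn, T=2.0, lam=1.0):
    # Mix real data of the new task with labeled synthetic samples of old classes.
    x = torch.cat([x_real, x_syn])
    y = torch.cat([y_real, y_syn])
    logits = model(x)
    ce = F.cross_entropy(logits, y)                        # replay: synthetic samples carry labels
    with torch.no_grad():
        old_logits = old_model(x)                          # frozen model from the previous task
    n_old = old_logits.shape[1]
    kd = F.kl_div(F.log_softmax(logits[:, :n_old] / T, dim=1),
                  F.softmax(old_logits / T, dim=1),
                  reduction="batchmean") * T * T           # distillation on old-class logits
    return ce + lam * kd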
2306.17391 Report EyeBAG: Accurate Control of Eye Blink and Gaze Based on Data Augmentation Leveraging Style Mixing Bryan S. Kim, Jeong Young Jeong, Wonjong Ryu Recent developments in generative models have enabled the generation of photo-realistic human face images, and downstream tasks utilizing face generation technology have advanced accordingly. However, models for downstream tasks are yet substandard at eye control (e.g. eye blink, gaze redirection). To overcome such eye control problems, we introduce a novel framework consisting of two distinct modules: a blink control module and a gaze redirection module. We also propose a novel data augmentation method to train each module, leveraging style mixing to obtain images with desired features. We show that our framework produces eye-controlled images of high quality, and demonstrate how it can be used to improve the performance of downstream tasks. Introduces EyeBAG, a novel framework for accurate control of eye blinks and gaze in face images using generative models. Current generative models struggle with realistic eye control, leading to awkwardness and a sense of alienation in generated images, particularly impacting downstream tasks like face swapping. Presents a two-module approach: 1) Blink control module: regulates eye blink degree using a U-Net architecture trained on paired open/closed eye images generated through a novel style mixing data augmentation technique. 2) Gaze redirection module: controls gaze direction by manipulating iris position, trained on a dataset augmented with diverse gaze directions also generated via style mixing. EyeBAG generates high-quality, photorealistic images of blinking and gaze-redirected faces. The framework's discriminator doubles as a highly accurate blink detection network for images and videos. Data augmentation using EyeBAG significantly improves the performance of downstream tasks, such as face swapping, particularly in scenarios with closed eyes or varying gazes. The current implementation focuses solely on eye control and does not address other facial expressions or head movements. Future work could explore the generalization of the style mixing data augmentation technique to other facial features and expressions, further enhancing the realism of generated faces. generative models, data augmentation, style mixing, eye blink control, gaze redirection
2306.17321 Report Training-Free Neural Matte Extraction for Visual Effects Sharif Elcott, J. P. Lewis, Nori Kanazawa, Christoph Bregler Alpha matting is widely used in video conferencing as well as in movies, television, and social media sites. Deep learning approaches to the matte extraction problem are well suited to video conferencing due to the consistent subject matter (front-facing humans), however training-based approaches are somewhat pointless for entertainment videos where varied subjects (spaceships, monsters, etc.) may appear only a few times in a single movie -- if a method of creating ground truth for training exists, just use that method to produce the desired mattes. We introduce a training-free high quality neural matte extraction approach that specifically targets the assumptions of visual effects production. Our approach is based on the deep image prior, which optimizes a deep neural network to fit a single image, thereby providing a deep encoding of the particular image. We make use of the representations in the penultimate layer to interpolate coarse and incomplete "trimap" constraints. Videos processed with this approach are temporally consistent. The algorithm is both very simple and surprisingly effective. This paper introduces a training-free neural matte extraction approach for visual effects using the deep image prior (DIP). This approach is specifically designed for visual effects production, where subject matter is diverse, training data is often impractical, and clean plates are undesirable or infeasible. The method utilizes a DIP network to reconstruct the target image and simultaneously inpaint the alpha matte in the trimap's unknown region, constrained by known regions. Separate networks reconstruct foreground and background, further coupled with the alpha output via the alpha-compositing equation. The method produces high-quality alpha mattes comparable to ground truth data. It effectively handles challenging cases like hair and objects with similar colors to the background. Temporal consistency is achieved by warm-starting optimization from previous frames in videos. The computational cost is high, limiting its use to offline applications. Objects with holes can pose challenges and require further investigation for robust handling. alpha matting, deep learning, visual effects, deep image prior, training-free
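The core objective can be written in a few lines. The sketch below assumes fg, bg and alpha are the current outputs of separate deep-image-prior networks for a single frame, and that the trimap encodes background/unknown/foreground as 0/0.5/1; loss weights and the feature-space coupling used in the paper are omitted.

import torch
import torch.nn.functional as F

def matting_losses(image, trimap, fg, bg, alpha):
    # Alpha-compositing equation: the composite must reproduce the observed image.
    comp = alpha * fg + (1.0 - alpha) * bg
    recon = F.l1_loss(comp, image)
    # Constrain alpha only where the trimap is confident; the unknown band is inpainted freely.
    known_fg = (trimap > 0.9).float()
    known_bg = (trimap < 0.1).float()
    known = known_fg + known_bg
    trimap_loss = (known * (alpha - known_fg).abs()).sum() / known.sum().clamp(min=1.0)
    return recon + trimap_loss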
2306.17319 Report ReMaX: Relaxing for Better Training on Efficient Panoptic Segmentation Shuyang Sun, Weijun Wang, Qihang Yu, Andrew Howard, Philip Torr, Liang-Chieh Chen This paper presents a new mechanism to facilitate the training of mask transformers for efficient panoptic segmentation, democratizing its deployment. We observe that due to its high complexity, the training objective of panoptic segmentation will inevitably lead to much higher false positive penalization. Such unbalanced loss makes the training process of the end-to-end mask-transformer based architectures difficult, especially for efficient models. In this paper, we present ReMaX that adds relaxation to mask predictions and class predictions during training for panoptic segmentation. We demonstrate that via these simple relaxation techniques during training, our model can be consistently improved by a clear margin without any extra computational cost on inference. By combining our method with efficient backbones like MobileNetV3-Small, our method achieves new state-of-the-art results for efficient panoptic segmentation on COCO, ADE20K and Cityscapes. Code and pre-trained checkpoints will be available at https://github.com/google-research/deeplab2. This paper introduces ReMaX, a novel mechanism to facilitate the training of mask transformers for efficient panoptic segmentation by adding relaxation to mask predictions and class predictions during training. The training objective of panoptic segmentation often leads to unbalanced loss with high false positive penalization, making it difficult to train efficient models. ReMaX aims to stabilize training and improve performance without incurring extra computational cost during inference. ReMaX consists of two relaxation techniques: (1) ReMask utilizes an auxiliary semantic segmentation branch during training to guide and calibrate panoptic predictions, suppressing false positive predictions. (2) ReClass softens one-hot class labels by considering the overlap between predicted masks and ground truth masks, accounting for potential multi-class regions in predictions. ReMaX significantly improves training convergence, achieving up to 3x faster training speed compared to baselines. ReMaX achieves state-of-the-art results for efficient panoptic segmentation on COCO, ADE20K, and Cityscapes datasets, outperforming previous methods in terms of accuracy and speed. Ablation studies validate the effectiveness of both ReMask and ReClass, showing their contribution to improved performance and stable training. The current implementation is limited to TensorFlow, which restricts the choice of baselines. The class weighting scheme in ReClass, based on mask size, might not be optimal and requires further investigation. panoptic segmentation, mask transformers, efficient training, relaxation techniques, computer vision
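As an illustration of the ReClass idea (label relaxation by mask overlap), a small sketch: a query's one-hot class target is blended with a soft distribution proportional to how much its predicted mask covers each ground-truth segment. The blending weight eta and the use of the largest-overlap segment as the assigned class are simplifications for this sketch, not the paper's exact assignment rule.

import torch

def reclass_soft_labels(pred_mask, gt_masks, gt_labels, num_classes, eta=0.1):
    # pred_mask: (H, W) in [0, 1]; gt_masks: (K, H, W) binary; gt_labels: (K,) long tensor.
    overlap = (gt_masks * pred_mask.unsqueeze(0)).flatten(1).sum(dim=1)      # (K,) overlap mass
    weights = overlap / overlap.sum().clamp(min=1e-6)
    soft = torch.zeros(num_classes).index_add_(0, gt_labels, weights)        # overlap-weighted classes
    onehot = torch.zeros(num_classes)
    onehot[gt_labels[overlap.argmax()]] = 1.0                                # nominal assigned class
    return (1.0 - eta) * onehot + eta * soft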
2306.17123 Report PVP: Personalized Video Prior for Editable Dynamic Portraits using StyleGAN Kai-En Lin, Alex Trevithick, Keli Cheng, Michel Sarkis, Mohsen Ghafoorian, Ning Bi, Gerhard Reitmayr, Ravi Ramamoorthi Portrait synthesis creates realistic digital avatars which enable users to interact with others in a compelling way. Recent advances in StyleGAN and its extensions have shown promising results in synthesizing photorealistic and accurate reconstruction of human faces. However, previous methods often focus on frontal face synthesis and most methods are not able to handle large head rotations due to the training data distribution of StyleGAN. In this work, our goal is to take as input a monocular video of a face, and create an editable dynamic portrait able to handle extreme head poses. The user can create novel viewpoints, edit the appearance, and animate the face. Our method utilizes pivotal tuning inversion (PTI) to learn a personalized video prior from a monocular video sequence. Then we can input pose and expression coefficients to MLPs and manipulate the latent vectors to synthesize different viewpoints and expressions of the subject. We also propose novel loss functions to further disentangle pose and expression in the latent space. Our algorithm shows much better performance over previous approaches on monocular video datasets, and it is also capable of running in real-time at 54 FPS on an RTX 3080. This paper presents a novel algorithm for creating editable dynamic portraits from monocular portrait videos using StyleGAN, allowing for manipulation of pose, expression, and appearance. Current methods for portrait synthesis either struggle with extreme head poses, lack editability, or require extensive multi-view input. This work aims to overcome these limitations and provide a comprehensive solution for creating interactive and personalized digital avatars. The method involves two stages: 1) Learning a personalized video prior by fine-tuning a StyleGAN generator on selected frames from the input video. 2) Training pose and expression mapping networks to control the rendering within the personalized manifold using pose and expression parameters. The method achieves state-of-the-art visual quality on monocular video datasets, outperforming existing 2D and 3D methods in terms of reconstruction accuracy and detail. It allows for direct control over head poses, enabling the synthesis of extreme viewpoints not achievable by previous 2D methods. The approach supports real-time rendering at 54 FPS on an RTX 3080 GPU, making it suitable for interactive applications. The current method is limited to the facial region and does not handle the back of the head or upper body. The personalization process requires a time-consuming optimization stage for each subject. Future work could explore meta-learning for faster adaptation. digital avatars, stylegan, personalized video prior, facial reenactment, portrait editing
2306.17115 Report Michelangelo: Conditional 3D Shape Generation based on Shape-Image-Text Aligned Latent Representation Zibo Zhao, Wen Liu, Xin Chen, Xianfang Zeng, Rui Wang, Pei Cheng, Bin Fu, Tao Chen, Gang Yu, Shenghua Gao We present a novel alignment-before-generation approach to tackle the challenging task of generating general 3D shapes based on 2D images or texts. Directly learning a conditional generative model from images or texts to 3D shapes is prone to producing inconsistent results with the conditions because 3D shapes have an additional dimension whose distribution significantly differs from that of 2D images and texts. To bridge the domain gap among the three modalities and facilitate multi-modal-conditioned 3D shape generation, we explore representing 3D shapes in a shape-image-text-aligned space. Our framework comprises two models: a Shape-Image-Text-Aligned Variational Auto-Encoder (SITA-VAE) and a conditional Aligned Shape Latent Diffusion Model (ASLDM). The former model encodes the 3D shapes into the shape latent space aligned to the image and text and reconstructs the fine-grained 3D neural fields corresponding to given shape embeddings via the transformer-based decoder. The latter model learns a probabilistic mapping function from the image or text space to the latent shape space. Our extensive experiments demonstrate that our proposed approach can generate higher-quality and more diverse 3D shapes that better semantically conform to the visual or textural conditional inputs, validating the effectiveness of the shape-image-text-aligned space for cross-modality 3D shape generation. This paper introduces an innovative "alignment-before-generation" approach for generating 3D shapes from 2D images or text descriptions, aiming to enhance the consistency between generated 3D shapes and their corresponding conditions. Generating 3D shapes from 2D images or text is challenging due to the inherent domain gap between these modalities. Existing methods often struggle to produce consistent and high-quality results due to this gap. The proposed framework utilizes two key components: a Shape-Image-Text-Aligned Variational Auto-Encoder (SITA-VAE) and an Aligned Shape Latent Diffusion Model (ASLDM). The SITA-VAE learns a shared representation space for 3D shapes, images, and texts using contrastive learning. The ASLDM, operating in this aligned space, learns a probabilistic mapping from images or texts to 3D shape embeddings. The proposed method outperforms baseline methods in terms of reconstruction accuracy and generation quality, as evidenced by metrics like IoU, shape-image score (SI-S), and shape-text score (ST-S). The generated 3D shapes demonstrate a high degree of fidelity to the input conditions, exhibiting smoother surfaces, finer details, and better semantic consistency. The framework exhibits robustness in handling out-of-domain images and complex text descriptions, showcasing its generalization capabilities. The method's reliance on ground-truth 3D shapes during training poses a limitation, as 3D data is often scarce. Exploring unsupervised or weakly-supervised learning approaches could mitigate this issue. Representing 3D shapes as occupancy fields necessitates converting meshes into watertight ones, potentially leading to a loss of geometric detail in the original mesh. Investigating alternative shape representations could address this limitation. 3d shape generation, cross-modal learning, contrastive learning, latent diffusion model, shape-image-text alignment
2306.16928 Report One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, Hao Su Single image 3D reconstruction is an important but challenging task that requires extensive knowledge of our natural world. Many existing methods solve this problem by optimizing a neural radiance field under the guidance of 2D diffusion models but suffer from lengthy optimization time, 3D inconsistency results, and poor geometry. In this work, we propose a novel method that takes a single image of any object as input and generates a full 360-degree 3D textured mesh in a single feed-forward pass. Given a single image, we first use a view-conditioned 2D diffusion model, Zero123, to generate multi-view images for the input view, and then aim to lift them up to 3D space. Since traditional reconstruction methods struggle with inconsistent multi-view predictions, we build our 3D reconstruction module upon an SDF-based generalizable neural surface reconstruction method and propose several critical training strategies to enable the reconstruction of 360-degree meshes. Without costly optimizations, our method reconstructs 3D shapes in significantly less time than existing methods. Moreover, our method favors better geometry, generates more 3D consistent results, and adheres more closely to the input image. We evaluate our approach on both synthetic data and in-the-wild images and demonstrate its superiority in terms of both mesh quality and runtime. In addition, our approach can seamlessly support the text-to-3D task by integrating with off-the-shelf text-to-image diffusion models. This paper proposes One-2-3-45, a novel method that reconstructs a full 360-degree textured 3D mesh from a single image in a feed-forward manner. Existing optimization-based methods for single image 3D reconstruction are time-consuming, memory intensive, and often produce 3D inconsistent results with poor geometry. This work aims to address these limitations. The method leverages a view-conditioned 2D diffusion model (Zero123) to generate multi-view images. It then estimates the elevation of the input view and utilizes a cost-volume-based neural surface reconstruction module trained on inconsistent multi-view predictions to reconstruct the 3D mesh. Reconstructs 3D shapes significantly faster than existing optimization-based methods (45 seconds). Produces higher quality geometry and more 3D consistent results due to the use of SDF representation and camera-conditioned multi-view predictions. Exhibits better adherence to the input image compared to existing methods. The method's performance depends on the quality of multi-view images generated by Zero123, which can be inconsistent in cases of limited input information or ambiguous structures. Minor artifacts on the backside of generated results suggest room for improvement in reconstruction techniques and regularization. 3d reconstruction, single image 3d reconstruction, diffusion models, neural surface reconstruction, zero-shot learning
2306.16894 Report PFB-Diff: Progressive Feature Blending Diffusion for Text-driven Image Editing Wenjing Huang, Shikui Tu, Lei Xu Diffusion models have showcased their remarkable capability to synthesize diverse and high-quality images, sparking interest in their application for real image editing. However, existing diffusion-based approaches for local image editing often suffer from undesired artifacts due to the pixel-level blending of the noised target images and diffusion latent variables, which lack the necessary semantics for maintaining image consistency. To address these issues, we propose PFB-Diff, a Progressive Feature Blending method for Diffusion-based image editing. Unlike previous methods, PFB-Diff seamlessly integrates text-guided generated content into the target image through multi-level feature blending. The rich semantics encoded in deep features and the progressive blending scheme from high to low levels ensure semantic coherence and high quality in edited images. Additionally, we introduce an attention masking mechanism in the cross-attention layers to confine the impact of specific words to desired regions, further improving the performance of background editing. PFB-Diff can effectively address various editing tasks, including object/background replacement and object attribute editing. Our method demonstrates its superior performance in terms of image fidelity, editing accuracy, efficiency, and faithfulness to the original image, without the need for fine-tuning or training. This paper introduces PFB-Diff, a novel method for text-driven image editing using diffusion models, which leverages progressive feature blending and attention masking to enable seamless and consistent edits. Existing diffusion-based image editing methods often suffer from artifacts and inconsistencies due to pixel-level blending, especially when handling complex edits or rough masks. This method aims to address these issues and achieve more natural and accurate results. PFB-Diff operates by progressively blending features of the input image with generated features at different layers of the diffusion model's U-Net. It also employs an attention masking mechanism to restrict the influence of specific words to the target regions, ensuring semantic consistency. PFB-Diff demonstrates superior performance compared to existing state-of-the-art methods, achieving higher CLIP scores and Local CLIP scores, indicating better image-text alignment and accurate local editing. The method effectively tackles various editing tasks, including object/background replacement and object property changes, while maintaining high image quality and faithfulness to the original image. User studies confirm that PFB-Diff produces more favorable results compared to other methods, indicating higher user satisfaction in terms of editing accuracy, realism, and faithfulness. PFB-Diff currently requires user-provided masks, which can be a limitation in certain scenarios compared to mask-free methods. The method is not currently applicable to style transfer tasks. image editing, diffusion models, text-to-image synthesis, feature blending, attention masking
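The blending operation at the heart of the method is compact. The sketch below assumes f_edit comes from the text-guided branch and f_orig from encoding the noised original image, with the blend applied progressively from deep to shallow U-Net layers; the attention-masking component is not shown.

import torch
import torch.nn.functional as F

def blend_features(f_edit, f_orig, mask):
    # f_edit, f_orig: (B, C, h, w) U-Net features; mask: (B, 1, H, W) with 1 marking the edit region.
    m = F.interpolate(mask, size=f_edit.shape[-2:], mode="nearest")
    # Inside the mask keep text-guided features, outside keep the original image's features,
    # so the edit is injected at the semantic (feature) level rather than by pixel blending.
    return m * f_edit + (1.0 - m) * f_orig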
2306.15876 Report Hybrid Distillation: Connecting Masked Autoencoders with Contrastive Learners Bowen Shi, Xiaopeng Zhang, Yaoming Wang, Jin Li, Wenrui Dai, Junni Zou, Hongkai Xiong, Qi Tian Representation learning has been evolving from traditional supervised training to Contrastive Learning (CL) and Masked Image Modeling (MIM). Previous works have demonstrated their pros and cons in specific scenarios, i.e., CL and supervised pre-training excel at capturing longer-range global patterns and enabling better feature discrimination, while MIM can introduce more local and diverse attention across all transformer layers. In this paper, we explore how to obtain a model that combines their strengths. We start by examining previous feature distillation and mask feature reconstruction methods and identify their limitations. We find that their increasing diversity mainly derives from the asymmetric designs, but these designs may in turn compromise the discrimination ability. In order to better obtain both discrimination and diversity, we propose a simple but effective Hybrid Distillation strategy, which utilizes both the supervised/CL teacher and the MIM teacher to jointly guide the student model. Hybrid Distill imitates the token relations of the MIM teacher to alleviate attention collapse, as well as distills the feature maps of the supervised/CL teacher to enable discrimination. Furthermore, a progressive redundant token masking strategy is also utilized to reduce the distilling costs and avoid falling into local optima. Experiment results prove that Hybrid Distill can achieve superior performance on different benchmarks. This paper proposes Hybrid Distill, a novel framework for representation learning that combines the strengths of Contrastive Learning (CL) and Masked Image Modeling (MIM) by distilling knowledge from both CL/supervised and MIM pre-trained teachers into a student model. Discrimination and diversity are both crucial for downstream adaptation of representation learning models. However, existing methods for combining CL and MIM, such as feature distillation and mask feature reconstruction, have limitations in effectively incorporating both properties. Hybrid Distill addresses these limitations by leveraging the strengths of both CL and MIM teachers. Hybrid Distill utilizes a supervised/CL teacher (e.g., DeiT, CLIP) and an MIM teacher (e.g., MAE). It distills token relations from the MIM teacher in later layers to enhance diversity and feature maps from the supervised/CL teacher in the final layer to enhance discrimination. Additionally, a progressive redundant token masking strategy is employed to reduce computational cost and prevent local optima. Hybrid Distill effectively combines discrimination from supervised/CL models with the diversity of MIM models, as demonstrated through property analysis using average head distance, normalized mutual information, and attention visualization. Hybrid Distill outperforms single-teacher distillation baselines and previous methods using asymmetric designs on various downstream tasks, including image classification, object detection, instance segmentation, and semantic segmentation. The progressive redundant token masking strategy successfully reduces computational cost while maintaining performance and even provides regularization benefits, preventing the model from falling into local optima. The use of two teacher models introduces additional overhead, although the increase in training time is relatively small (around 1.2 times). The performance improvement when using CLIP as a teacher is less significant than with DeiT, possibly due to the gap in pre-training capacity between CLIP and the MIM teacher (MAE). representation learning, knowledge distillation, contrastive learning, masked image modeling, vision transformer
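A sketch of the two distillation terms, with placeholder tensors for student and teacher token features; the paper distills attention-derived token relations and applies this only at selected later layers, which the plain affinity matrix below only approximates.

import torch
import torch.nn.functional as F

def token_relation(tokens):
    # tokens: (B, N, D) -> normalized token-to-token affinity, a stand-in for attention relations.
    t = F.normalize(tokens, dim=-1)
    return t @ t.transpose(1, 2)                              # (B, N, N)

def hybrid_distill_loss(student_mid, student_last, mim_teacher_mid, cl_teacher_last, alpha=1.0):
    # Diversity: imitate MIM-teacher token relations at an intermediate layer.
    rel_loss = F.mse_loss(token_relation(student_mid), token_relation(mim_teacher_mid))
    # Discrimination: match the supervised/CL teacher's feature maps at the last layer.
    feat_loss = F.smooth_l1_loss(student_last, cl_teacher_last)
    return feat_loss + alpha * rel_loss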
2306.15832 Report Easing Color Shifts in Score-Based Diffusion Models Katherine Deck, Tobias Bischoff Generated images of score-based models can suffer from errors in their spatial means, an effect, referred to as a color shift, which grows for larger images. This paper investigates a previously-introduced approach to mitigate color shifts in score-based diffusion models. We quantify the performance of a nonlinear bypass connection in the score network, designed to process the spatial mean of the input and to predict the mean of the score function. We show that this network architecture substantially improves the resulting quality of the generated images, and that this improvement is approximately independent of the size of the generated images. As a result, this modified architecture offers a simple solution for the color shift problem across image sizes. We additionally discuss the origin of color shifts in an idealized setting in order to motivate the approach. This paper investigates and quantifies the performance of a nonlinear bypass connection in the score network, which processes the spatial mean of the input and predicts the mean of the score function, to mitigate color shifts (errors in spatial means) in score-based diffusion models. Color shifts are a common problem in score-based diffusion models, especially for large images, and this paper offers a simple and effective solution to address this issue. The authors employ a modified score network architecture with a mean-bypass layer that predicts the spatial mean of the score independently from the spatial variations. They compare this approach to a baseline U-net model with and without exponential moving average (EMA) smoothing on FashionMNIST and 2D turbulence datasets. The mean-bypass layer significantly reduces color shifts across different image sizes compared to the baseline model with or without EMA. The modified network architecture achieves superior optimization of the spatial mean loss term, leading to more accurate spatial mean predictions. EMA smoothing alone is insufficient to mitigate color shifts in large images, particularly when training data is limited. The mean-bypass layer architecture does not leverage potential correlations between image means and spatial variations. Future work could explore incorporating information about spatial variations into the mean-bypass layer to capture potential correlations. score-based diffusion models, color shift, image generation, mean-bypass layer, spatial mean prediction
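A minimal sketch of the mean-bypass design, assuming backbone is any image-to-image score network; the time conditioning of the bypass branch and the exact loss split used in the paper are omitted, and the wiring here is illustrative rather than definitive.

import torch
import torch.nn as nn

class MeanBypassScoreNet(nn.Module):
    def __init__(self, backbone, channels):
        super().__init__()
        self.backbone = backbone
        # Small nonlinear branch that predicts the spatial mean of the score from the input's mean.
        self.mean_mlp = nn.Sequential(nn.Linear(channels, 4 * channels), nn.GELU(),
                                      nn.Linear(4 * channels, channels))

    def forward(self, x):
        mean = x.mean(dim=(2, 3))                                     # (B, C) spatial mean per channel
        spatial = self.backbone(x - mean[:, :, None, None])           # main net sees zero-mean input
        spatial = spatial - spatial.mean(dim=(2, 3), keepdim=True)    # keep main branch zero-mean
        return spatial + self.mean_mlp(mean)[:, :, None, None]        # add predicted score mean back

The point of the split is that errors in the score's spatial mean (the source of color shifts) no longer depend on how well the convolutional backbone happens to track a global statistic.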
2306.15769 Report What Makes ImageNet Look Unlike LAION Ali Shirali, Moritz Hardt ImageNet was famously created from Flickr image search results. What if we recreated ImageNet instead by searching the massive LAION dataset based on image captions alone? In this work, we carry out this counterfactual investigation. We find that the resulting ImageNet recreation, which we call LAIONet, looks distinctly unlike the original. Specifically, the intra-class similarity of images in the original ImageNet is dramatically higher than it is for LAIONet. Consequently, models trained on ImageNet perform significantly worse on LAIONet. We propose a rigorous explanation for the discrepancy in terms of a subtle, yet important, difference in two plausible causal data-generating processes for the respective datasets, that we support with systematic experimentation. In a nutshell, searching based on an image caption alone creates an information bottleneck that mitigates the selection bias otherwise present in image-based filtering. Our explanation formalizes a long-held intuition in the community that ImageNet images are stereotypical, unnatural, and overly simple representations of the class category. At the same time, it provides a simple and actionable takeaway for future dataset creation efforts. This paper introduces LAIONet, a recreation of ImageNet using the LAION dataset and text-based image selection, and investigates the differences between LAIONet and ImageNet. The research aims to understand the impact of different data collection methodologies on dataset bias and model performance. The authors created LAIONet by searching LAION for images matching ImageNet synsets based on text descriptions. They then compared LAIONet and ImageNet in terms of CLIP zero-shot accuracy, model performance, and intra-class similarity. LAIONet images are more diverse than ImageNet images, exhibiting lower intra-class similarity. ImageNet-trained models experience a significant performance drop on LAIONet, particularly on more frequent classes. The authors provide evidence that the image-to-selection link in ImageNet's creation process is responsible for its lower diversity and the observed performance drop. The study is limited by the availability of accurate captions for only a subset of ImageNet. Scaling the analysis to LAION-5B could potentially provide a more comprehensive comparison. imagenet, laion, dataset bias, intra-class similarity, information bottleneck
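A sketch of the text-only selection rule, assuming caption and synset-definition embeddings were precomputed with some text encoder (e.g., a CLIP text tower); the lemma check and similarity threshold are illustrative stand-ins for the paper's exact filtering, and the key property is that no image content enters the decision.

import torch
import torch.nn.functional as F

def select_by_caption(caption_emb, caption_texts, synset_emb, synset_lemmas, thresh=0.82):
    # caption_emb: (N, D); synset_emb: (D,); caption_texts: list of N strings.
    sims = F.cosine_similarity(caption_emb, synset_emb.unsqueeze(0), dim=-1)
    keep = []
    for i, (text, s) in enumerate(zip(caption_texts, sims.tolist())):
        # Keep a candidate only if the caption mentions the synset and is semantically close to it.
        if s >= thresh and any(lemma.lower() in text.lower() for lemma in synset_lemmas):
            keep.append(i)
    return keep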
2306.15706 Report Approximated Prompt Tuning for Vision-Language Pre-trained Models Qiong Wu, Shubin Huang, Yiyi Zhou, Pingyang Dai, Annan Shu, Guannan Jiang, Rongrong Ji Prompt tuning is a parameter-efficient way to deploy large-scale pre-trained models to downstream tasks by adding task-specific tokens. In terms of vision-language pre-trained (VLP) models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks, which greatly exacerbates the already high computational overhead. In this paper, we revisit the principle of prompt tuning for Transformer-based VLP models, and reveal that the impact of soft prompt tokens can be actually approximated via independent information diffusion steps, thereby avoiding the expensive global attention modeling and reducing the computational complexity to a large extent. Based on this finding, we propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning. To validate APT, we apply it to two representative VLP models, namely ViLT and METER, and conduct extensive experiments on a bunch of downstream tasks. Meanwhile, the generalization of APT is also validated on CLIP for image classification and StableDiffusion for text-to-image generation. The experimental results not only show the superior performance gains and computation efficiency of APT against the conventional prompt tuning methods, e.g., +7.01% accuracy and -82.30% additional computation overhead on METER, but also confirm its merits over other parameter-efficient transfer learning approaches. This paper proposes Approximated Prompt Tuning (APT), a novel method for parameter- and computation-efficient adaptation of Vision-Language Pre-trained (VLP) models to downstream tasks. Existing prompt tuning methods applied to VLP models often suffer from high computational overhead and inefficient adaptation due to the large gap between pre-training and downstream tasks. APT approximates the influence of prompt tokens on the input sequence by separating them from the expensive global self-attention mechanism and aggregating them with low-rank transformations. APT achieves superior performance gains over conventional prompt tuning methods on VLP models, with up to +8.30% accuracy improvement on VQA2.0 for METER. APT significantly reduces computational overhead compared to existing prompt tuning methods, saving up to 82.30% additional computations for ViLT. APT demonstrates better performance than other Parameter Efficient Transfer Learning (PETL) approaches on various VLP models and VL tasks, and its generalization is validated on CLIP for image classification and StableDiffusion for text-to-image generation. The performance of APT is still slightly inferior to full fine-tuning, indicating room for further improvement. Future work includes exploring more effective information diffusion strategies and extending APT to other multimodal pre-trained models. prompt tuning, vision-language pre-training, parameter efficient transfer learning, multimodal learning, approximation methods
2306.15658 Report CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy Xianhang Li, Zeyu Wang, Cihang Xie The recent work CLIPA presents an inverse scaling law for CLIP training -- whereby the larger the image/text encoders used, the shorter the sequence length of image/text tokens that can be applied in training. This finding enables us to train high-performance CLIP models with significantly reduced computations. Building upon this work, we hereby present CLIPA-v2 with two key contributions. Technically, we find this inverse scaling law is also applicable in the finetuning stage, enabling further reduction in computational needs. Empirically, we explore CLIPA at scale, extending the experiments up to the H/14 model with ~13B image-text pairs seen during training. Our results are exciting -- by only allocating a budget of $10,000, our CLIP model achieves an impressive zero-shot ImageNet accuracy of 81.1%, surpassing the prior best CLIP model (from OpenCLIP, 80.1%) by 1.0% and meanwhile reducing the computational cost by ~39X. Moreover, with an additional investment of $4,000, we can further elevate the zero-shot ImageNet accuracy to 81.8%. Our code and models are available at https://github.com/UCSC-VLAA/CLIPA. This paper introduces CLIPA-v2, building on CLIPA, which leverages an inverse scaling law to train high-performance CLIP models efficiently. CLIPA-v2 demonstrates that this law also applies to the fine-tuning stage, further reducing computation needs. Training CLIP models is computationally expensive. CLIPA-v2 provides a solution to reduce training costs while achieving state-of-the-art zero-shot performance. The authors scale CLIPA to larger models (up to H/14), datasets (LAION-2B, DataComp-1B), and training schedules (13B samples). They explore the inverse scaling law in fine-tuning and analyze different masking strategies. CLIPA-v2 achieves 81.1% zero-shot ImageNet accuracy within a $10,000 budget, outperforming the previous best model (OpenCLIP) by 1.0% while being 39x faster. An additional $4,000 investment further boosts the accuracy to 81.8%, setting a new performance record. The inverse scaling law, allowing for training with fewer tokens for larger models, also proves effective during fine-tuning. CLIPA-v2 lags behind in zero-shot retrieval tasks on COCO and Flickr30k compared to OpenCLIP's best model. The impact of different pre-training datasets on downstream tasks needs further investigation. clip, clipa, zero-shot learning, vision-language model, efficient training
2306.15419 Report Freestyle 3D-Aware Portrait Synthesis Based on Compositional Generative Priors Tianxiang Ma, Kang Zhao, Jianxin Sun, Yingya Zhang, Jing Dong Efficiently generating a freestyle 3D portrait with high quality and 3D-consistency is a promising yet challenging task. The portrait styles generated by most existing methods are usually restricted by their 3D generators, which are learned in specific facial datasets, such as FFHQ. To get the diverse 3D portraits, one can build a large-scale multi-style database to retrain a 3D-aware generator, or use an off-the-shelf tool to do the style translation. However, the former is time-consuming due to data collection and training process, the latter may destroy the multi-view consistency. To tackle this problem, we propose a novel text-driven 3D-aware portrait synthesis framework that can generate out-of-distribution portrait styles. Specifically, for a given portrait style prompt, we first composite two generative priors, a 3D-aware GAN generator and a text-guided image editor, to quickly construct a few-shot stylized portrait set. Then we map the special style domain of this set to our proposed 3D latent feature generator and obtain a 3D representation containing the given style information. Finally we use a pre-trained 3D renderer to generate view-consistent stylized portraits from the 3D representation. Extensive experimental results show that our method is capable of synthesizing high-quality 3D portraits with specified styles in a few minutes, outperforming the state-of-the-art. This paper proposes a novel freestyle 3D-aware portrait synthesis framework based on compositional generative priors to efficiently generate high-quality 3D portraits with specified styles. Existing 3D portrait synthesis methods are usually restricted by the training data, limiting their ability to generate diverse freestyle 3D portraits. This work composites a 3D-aware GAN generator (EG3D) and a text-guided image editor (Instruct-pix2pix) to construct a few-shot stylized portrait dataset. Then, a proposed 3D latent feature generator maps the style information from this dataset to a 3D representation, which is used by a pre-trained 3D renderer to synthesize the final stylized 3D portrait. The method can generate high-quality and 3D-consistent portraits with diverse styles specified by text prompts. The approach outperforms baselines in both qualitative and quantitative comparisons, demonstrating its superiority in generating freestyle 3D portraits. The framework is efficient, enabling the generation of a stylized 3D portrait model in approximately 3 minutes. The method relies on two pre-trained generative priors, which may limit its performance when synthesizing styles that significantly deviate from human portrait shapes. Achieving perfect 3D-consistent portrait stylization across different viewpoints remains a challenge due to limitations of the text-guided image editor. 3d portrait synthesis, generative adversarial networks, text-guided image editing, few-shot learning, neural rendering
2306.15111 Report Self-Supervised Image Captioning with CLIP Chuanyang Jin Image captioning, a fundamental task in vision-language understanding, seeks to generate accurate natural language descriptions for provided images. Current image captioning approaches heavily rely on high-quality image-caption pairs, which can be hard to obtain for many domains. To address this, we introduce a self-supervised image captioning method. After learning an initial signal from a small labeled dataset, our method transitions to self-supervised learning on unlabeled data, leveraging the auxiliary task of enhancing the CLIP relevance between images and generated captions. Remarkably, despite utilizing less than 2% of the labeled COCO dataset, our method delivers a performance comparable to state-of-the-art models trained on the complete dataset. Human evaluations further reveal that our method produces captions with greater distinctiveness and informativeness, two attributes inherently challenging to achieve through supervised learning. This paper introduces a self-supervised image captioning method that leverages CLIP relevance to generate captions, reducing the reliance on labeled image-caption pairs. Current image captioning approaches depend on large, high-quality labeled datasets, which are difficult and expensive to create. Additionally, relying on reference captions limits the quality of generated captions. The method employs a two-stage approach: 1) **Supervised Training:** Train on a small labeled dataset to establish an initial signal. 2) **Self-Supervised Training:** Utilize unlabeled data and train the model to maximize CLIP relevance between generated captions and corresponding images. The method achieves comparable performance to state-of-the-art models on standard metrics while using significantly less labeled data. The generated captions are found to be more distinctive and informative than those from supervised methods based on human evaluation. The proposed RefCompare Score, based on CLIP relevance, reveals that the generated captions are often better than the reference captions. The model's performance depends on the quality of the initial signal obtained during supervised training. Further improvements might be achieved by exploring different language models or alternative self-supervised objectives. image captioning, self-supervised learning, clip, vision-language understanding, natural language generation
2306.14644 Report PTVD: A Large-Scale Plot-Oriented Multimodal Dataset Based on Television Dramas Chen Li, Xutan Peng, Teng Wang, Yixiao Ge, Mengyang Liu, Xuyuan Xu, Yexin Wang, Ying Shan Art forms such as movies and television (TV) dramas are reflections of the real world, which have attracted much attention from the multimodal learning community recently. However, existing corpora in this domain share three limitations: (1) annotated in a scene-oriented fashion, they ignore the coherence within plots; (2) their text lacks empathy and seldom mentions situational context; (3) their video clips fail to cover long-form relationship due to short duration. To address these fundamental issues, using 1,106 TV drama episodes and 24,875 informative plot-focused sentences written by professionals, with the help of 449 human annotators, we constructed PTVD, the first plot-oriented multimodal dataset in the TV domain. It is also the first non-English dataset of its kind. Additionally, PTVD contains more than 26 million bullet screen comments (BSCs), powering large-scale pre-training. Next, aiming to open-source a strong baseline for follow-up works, we developed the multimodal algorithm that attacks different cinema/TV modelling problems with a unified architecture. Extensive experiments on three cognitive-inspired tasks yielded a number of novel observations (some of them being quite counter-intuitive), further validating the value of PTVD in promoting multimodal research. The dataset and codes are released at https://ptvd.github.io/. This paper introduces PTVD, a novel plot-oriented multimodal dataset for TV dramas, addressing limitations of existing scene-oriented datasets. Existing movie/TV datasets are limited by scene-oriented annotations, lack of empathy in text, and short clip durations, hindering research on modeling complex narratives and long-form relationships. PTVD tackles these limitations, enabling research on higher cognitive tasks in multimodal learning. Researchers constructed PTVD using 1,106 TV drama episodes, 24,875 plot-focused sentences, and 26M+ Bullet Screen Comments (BSCs). 449 annotators aligned clips with plot descriptions, resulting in a dataset rich in contextual and emotional information, suitable for tasks beyond scene understanding. Multimodal data, especially plot text, significantly improves Genre Classification, with a bias towards frequent genres observed. Plot Retrieval performance is enhanced by fine-tuning with plot text and pre-training with BSCs. Video input consistently outperforms image input for plot retrieval, demonstrating the dataset's ability to assess models' capacity to capture long-form relationships. Pre-training with BSCs surprisingly hinders BSC generation while benefiting plot text generation, suggesting potential differences in text distribution and complexity. PTVD currently includes 83 TV dramas, potentially limiting diversity and generalizability. The dataset is in Chinese, potentially introducing cultural and linguistic biases. The proposed framework, while scalable, utilizes established techniques and lacks manual evaluation for Plot Text Generation, potentially overlooking nuanced insights. multimodal learning, tv drama analysis, plot understanding, dataset creation, bullet screen comments
2306.14544 Report A-STAR: Test-time Attention Segregation and Retention for Text-to-image Synthesis Aishwarya Agarwal, Srikrishna Karanam, K J Joseph, Apoorv Saxena, Koustava Goswami, Balaji Vasan Srinivasan While recent developments in text-to-image generative models have led to a suite of high-performing methods capable of producing creative imagery from free-form text, there are several limitations. By analyzing the cross-attention representations of these models, we notice two key issues. First, for text prompts that contain multiple concepts, there is a significant amount of pixel-space overlap (i.e., same spatial regions) among pairs of different concepts. This eventually leads to the model being unable to distinguish between the two concepts and one of them being ignored in the final generation. Next, while these models attempt to capture all such concepts during the beginning of denoising (e.g., first few steps) as evidenced by cross-attention maps, this knowledge is not retained by the end of denoising (e.g., last few steps). Such loss of knowledge eventually leads to inaccurate generation outputs. To address these issues, our key innovations include two test-time attention-based loss functions that substantially improve the performance of pretrained baseline text-to-image diffusion models. First, our attention segregation loss reduces the cross-attention overlap between attention maps of different concepts in the text prompt, thereby reducing the confusion/conflict among various concepts and the eventual capture of all concepts in the generated output. Next, our attention retention loss explicitly forces text-to-image diffusion models to retain cross-attention information for all concepts across all denoising time steps, thereby leading to reduced information loss and the preservation of all concepts in the generated output. This paper proposes A-STAR, a training-free method using two new attention-based loss functions during inference to improve the semantic accuracy of pretrained text-to-image diffusion models. Existing text-to-image diffusion models often fail to accurately represent all concepts from the input text prompt in the generated images. The method introduces attention segregation loss to minimize overlap between cross-attention maps of different concepts and attention retention loss to enforce information retention across denoising steps. A-STAR successfully reduces attention overlap and decay, leading to generated images that better capture all concepts in the input prompt. Quantitative evaluation using CLIP similarity and text-text similarity demonstrates significant improvement over baseline models and existing methods. User study confirms that A-STAR generates images that are semantically more faithful to the input text. A-STAR currently does not explicitly model relationships between concepts, which can limit its ability to generate images with complex compositions. Integrating A-STAR with techniques for controlling camera pose and viewpoint could further enhance the quality of generated images. text-to-image synthesis, diffusion models, attention mechanism, semantic accuracy, image generation
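Both test-time losses operate directly on per-concept cross-attention maps. The sketch below uses an IoU-style overlap for segregation and an overlap-with-previous-step term for retention, which approximates (but is not necessarily identical to) the paper's formulation; the resulting scalar would typically drive a gradient step on the latent at each denoising iteration.

import torch

def attention_segregation_loss(attn_maps):
    # attn_maps: (K, H, W) cross-attention maps for K concepts, values in [0, 1].
    K = attn_maps.shape[0]
    loss = attn_maps.new_zeros(())
    for i in range(K):
        for j in range(i + 1, K):
            inter = torch.minimum(attn_maps[i], attn_maps[j]).sum()
            union = torch.maximum(attn_maps[i], attn_maps[j]).sum().clamp(min=1e-6)
            loss = loss + inter / union                      # penalize pixel-space overlap
    return loss

def attention_retention_loss(attn_maps, prev_maps):
    # Encourage each concept to keep the regions it attended to at earlier denoising steps.
    kept = torch.minimum(attn_maps, prev_maps).sum(dim=(1, 2))
    return (1.0 - kept / prev_maps.sum(dim=(1, 2)).clamp(min=1e-6)).mean()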
2306.14408 Report Decompose and Realign: Tackling Condition Misalignment in Text-to-Image Diffusion Models Luozhou Wang, Guibao Shen, Wenhang Ge, Guangyong Chen, Yijun Li, Ying-cong Chen Text-to-image diffusion models have advanced towards more controllable generation via supporting various additional conditions (e.g., depth map, bounding box) beyond text. However, these models are learned based on the premise of perfect alignment between the text and extra conditions. If this alignment is not satisfied, the final output could be either dominated by one condition, or ambiguity may arise, failing to meet user expectations. To address this issue, we present a training-free approach called "Decompose and Realign" to further improve the controllability of existing models when provided with partially aligned conditions. The "Decompose" phase separates conditions based on pair relationships, computing the result individually for each pair. This ensures that each pair no longer has conflicting conditions. The "Realign" phase aligns these independently calculated results via a cross-attention mechanism to avoid new conflicts when combining them back. Both qualitative and quantitative results demonstrate the effectiveness of our approach in handling unaligned conditions, which performs favorably against recent methods and more importantly adds flexibility to the controllable image generation process. Our code will be available at: https://github.com/EnVision-Research/Decompose-and-Realign. Presents "Decompose and Realign," a training-free approach to address misalignment between text and image conditions in multi-condition controllable image generation, aiming for more flexibility. Existing controllable generation models struggle with misaligned conditions, resulting in either one condition dominating the output or ambiguity in object correspondence. The "Decompose" phase separates conditions into aligned pairs to compute individual scores. The "Realign" phase aligns these scores with the unified text score via cross-attention to avoid conflicts during merging. Effectively handles unaligned conditions, generating all objects from the text while respecting image guidance. Outperforms baselines in qualitative comparisons, achieving better object correspondence and reducing dominance/ambiguity. Quantitative evaluation demonstrates improved image-text similarity and better adherence to image conditions compared to other methods. Effectiveness of "Realign" relies on the model's cross-attention control capability, which might be affected by model bias. Data bias in training data can lead to unexpected disentanglement or entanglement in specific object combinations. image generation, controllable generation, diffusion models, cross-attention, condition misalignment
2306.14153 Report DomainStudio: Fine-Tuning Diffusion Models for Domain-Driven Image Generation using Limited Data Jingyuan Zhu, Huimin Ma, Jiansheng Chen, Jian Yuan Denoising diffusion probabilistic models (DDPMs) have been proven capable of synthesizing high-quality images with remarkable diversity when trained on large amounts of data. Typical diffusion models and modern large-scale conditional generative models like text-to-image generative models are vulnerable to overfitting when fine-tuned on extremely limited data. Existing works have explored subject-driven generation using a reference set containing a few images. However, few prior works explore DDPM-based domain-driven generation, which aims to learn the common features of target domains while maintaining diversity. This paper proposes a novel DomainStudio approach to adapt DDPMs pre-trained on large-scale source datasets to target domains using limited data. It is designed to keep the diversity of subjects provided by source domains and get high-quality and diverse adapted samples in target domains. We propose to keep the relative distances between adapted samples to achieve considerable generation diversity. In addition, we further enhance the learning of high-frequency details for better generation quality. Our approach is compatible with both unconditional and conditional diffusion models. This work makes the first attempt to realize unconditional few-shot image generation with diffusion models, achieving better quality and greater diversity than current state-of-the-art GAN-based approaches. Moreover, this work also significantly relieves overfitting for conditional generation and realizes high-quality domain-driven generation, further expanding the applicable scenarios of modern large-scale text-to-image models. This paper introduces DomainStudio, a novel approach to achieve few-shot domain-driven image generation with diffusion models, by preserving relative distances between generated samples and enhancing high-frequency details. Existing diffusion models and large-scale conditional generative models often overfit when fine-tuned on limited data, resulting in poor quality and limited diversity. This work addresses this challenge for both unconditional and conditional image generation. DomainStudio adapts pre-trained diffusion models to target domains using: 1) a pairwise similarity loss to maintain relative distances between generated samples, and 2) techniques to enhance high-frequency details by preserving details from the source model and learning from limited target data. DomainStudio achieves better generation quality and diversity than state-of-the-art unconditional GAN-based approaches. It successfully adapts pre-trained text-to-image diffusion models to generate diverse samples in target domains with different subjects and contexts, outperforming existing subject-driven methods. Quantitative evaluations using Intra-LPIPS and FID demonstrate superior diversity and quality compared to baselines. The current implementation faces challenges in scaling to higher image resolutions due to memory constraints. While the high-frequency details enhancement shows promising results, there is room for improvement, particularly when target domains contain significantly more high-frequency components than source domains. image generation, diffusion models, few-shot learning, domain adaptation, text-to-image generation
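The relative-distance preservation term can be sketched as follows, with feat_adapted and feat_source being features (or intermediate generations) produced by the adapted and frozen source diffusion models for the same batch of noise inputs; the softmax-over-similarities construction follows the common cross-domain-consistency recipe rather than the paper's exact probabilistic form, and the high-frequency enhancement terms are not shown.

import torch
import torch.nn.functional as F

def pairwise_similarity_loss(feat_adapted, feat_source):
    def sim_dist(feat):                           # feat: (B, D)
        f = F.normalize(feat, dim=-1)
        sim = f @ f.t()                           # (B, B) cosine similarities
        B = sim.shape[0]
        mask = ~torch.eye(B, dtype=torch.bool, device=sim.device)
        # Turn each sample's similarities to the rest of the batch into a distribution.
        return F.softmax(sim[mask].view(B, B - 1), dim=-1)
    p_src = sim_dist(feat_source).detach()        # frozen source model defines the target layout
    p_ada = sim_dist(feat_adapted)
    return F.kl_div(p_ada.log(), p_src, reduction="batchmean")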
2306.13776 Report Swin-Free: Achieving Better Cross-Window Attention and Efficiency with Size-varying Window Jinkyu Koo, John Yang, Le An, Gwenaelle Cunha Sergio, Su Inn Park Transformer models have shown great potential in computer vision, following their success in language tasks. Swin Transformer is one of them that outperforms convolution-based architectures in terms of accuracy, while improving efficiency when compared to Vision Transformer (ViT) and its variants, which have quadratic complexity with respect to the input size. Swin Transformer features shifting windows that allows cross-window connection while limiting self-attention computation to non-overlapping local windows. However, shifting windows introduces memory copy operations, which account for a significant portion of its runtime. To mitigate this issue, we propose Swin-Free in which we apply size-varying windows across stages, instead of shifting windows, to achieve cross-connection among local windows. With this simple design change, Swin-Free runs faster than the Swin Transformer at inference with better accuracy. Furthermore, we also propose a few of Swin-Free variants that are faster than their Swin Transformer counterparts. This paper proposes Swin-Free, a Transformer-based vision model that improves both latency and accuracy over the Swin Transformer by replacing the memory-intensive shifted window scheme with size-varying windows across stages. Swin Transformer's shifted window scheme, while effective, introduces significant memory movement overhead, impacting its efficiency, especially on GPUs. Swin-Free utilizes size-varying windows across stages to achieve cross-window connections without shifting windows. The authors also explored replacing LayerNorm and GELU layers with BatchNorm and ReLU and reducing the model depth to further enhance latency. Swin-Free consistently achieves better accuracy than Swin Transformer on ImageNet-1K classification tasks. Swin-Free demonstrates reduced latency compared to Swin Transformer, thanks to less memory movement and better GPU utilization with larger matrix multiplications. Further optimizations like BatchNorm/ReLU replacement and depth reduction lead to even faster variants of Swin-Free with competitive accuracy. The paper mainly focuses on image classification, leaving its application to other vision tasks and larger input resolutions for future work. Exploring dynamic window size adjustment across stages for further GPU utilization improvement is another potential direction. transformer, computer vision, model efficiency, swin transformer, image classification
2306.13653 Report ProRes: Exploring Degradation-aware Visual Prompt for Universal Image Restoration Jiaqi Ma, Tianheng Cheng, Guoli Wang, Qian Zhang, Xinggang Wang, Lefei Zhang Image restoration aims to reconstruct degraded images, e.g., denoising or deblurring. Existing works focus on designing task-specific methods and there are inadequate attempts at universal methods. However, simply unifying multiple tasks into one universal architecture suffers from uncontrollable and undesired predictions. To address those issues, we explore prompt learning in universal architectures for image restoration tasks. In this paper, we present Degradation-aware Visual Prompts, which encode various types of image degradation, e.g., noise and blur, into unified visual prompts. These degradation-aware prompts provide control over image processing and allow weighted combinations for customized image restoration. We then leverage degradation-aware visual prompts to establish a controllable and universal model for image restoration, called ProRes, which is applicable to an extensive range of image restoration tasks. ProRes leverages the vanilla Vision Transformer (ViT) without any task-specific designs. Furthermore, the pre-trained ProRes can easily adapt to new tasks through efficient prompt tuning with only a few images. Without bells and whistles, ProRes achieves competitive performance compared to task-specific methods and experiments demonstrate its ability for controllable restoration and adaptation for new tasks. The code and models will be released at https://github.com/leonmakise/ProRes. This paper introduces ProRes, a universal image restoration framework based on degradation-aware visual prompts. These prompts, encoding specific degradation types, provide control over image processing within a unified architecture, eliminating the need for task-specific designs. Existing image restoration methods are often task-specific, limiting their applicability to multiple degradation types. ProRes addresses this by offering a universal approach that handles diverse image restoration tasks within a single model, simplifying training and improving efficiency. ProRes employs a vanilla Vision Transformer (ViT) as its backbone and incorporates degradation-aware visual prompts. These image-like prompts, added to degraded images, guide the restoration process. The model is trained on a joint dataset encompassing denoising, low-light enhancement, deraining, and deblurring tasks. ProRes achieves competitive performance compared to task-specific methods on various benchmarks. The degradation-aware prompts enable controllable restoration by combining prompts for different degradation types. ProRes exhibits strong transferability, adapting effectively to new tasks or datasets via prompt tuning. The performance of ProRes on certain tasks may benefit from further optimization compared to highly specialized methods. Future work can explore the impact of larger and more diverse datasets on ProRes's capabilities, particularly for complex or unseen degradation types. image restoration, universal model, visual prompt learning, prompt tuning, vision transformer
2306.13455 Report DreamEditor: Text-Driven 3D Scene Editing with Neural Fields Jingyu Zhuang, Chen Wang, Lingjie Liu, Liang Lin, Guanbin Li Neural fields have achieved impressive advancements in view synthesis and scene reconstruction. However, editing these neural fields remains challenging due to the implicit encoding of geometry and texture information. In this paper, we propose DreamEditor, a novel framework that enables users to perform controlled editing of neural fields using text prompts. By representing scenes as mesh-based neural fields, DreamEditor allows localized editing within specific regions. DreamEditor utilizes the text encoder of a pretrained text-to-Image diffusion model to automatically identify the regions to be edited based on the semantics of the text prompts. Subsequently, DreamEditor optimizes the editing region and aligns its geometry and texture with the text prompts through score distillation sampling [29]. Extensive experiments have demonstrated that DreamEditor can accurately edit neural fields of real-world scenes according to the given text prompts while ensuring consistency in irrelevant areas. DreamEditor generates highly realistic textures and geometry, significantly surpassing previous works in both quantitative and qualitative evaluations. DreamEditor is a novel framework for text-driven 3D scene editing using neural fields, enabling localized modifications based on text prompts while preserving consistency in irrelevant areas. Editing neural fields is challenging due to the implicit encoding of geometry and texture. DreamEditor offers an intuitive and precise way to modify 3D scenes using simple text descriptions. DreamEditor represents scenes as mesh-based neural fields for localized editing. It utilizes a pretrained text-to-image diffusion model to automatically identify editing regions based on text prompts, then optimizes geometry and texture through score distillation sampling. DreamEditor achieves accurate and high-quality editing of real-world neural fields based on text prompts. The method preserves irrelevant regions unchanged, ensuring consistency between original and edited scenes. Quantitative and qualitative evaluations demonstrate DreamEditor's superiority over existing methods in editing precision, visual fidelity, and user satisfaction. DreamEditor inherits the Janus problem from DreamFusion, where objects may appear as front views from different viewpoints. The method currently focuses on object-centric editing in the foreground, limited by the challenges of reconstructing backgrounds in unbounded scenes. neural fields, 3d scene editing, text-guided editing, score distillation sampling, mesh-based neural fields
2306.13078 Report Continuous Layout Editing of Single Images with Diffusion Models Zhiyuan Zhang, Zhitong Huang, Jing Liao Recent advancements in large-scale text-to-image diffusion models have enabled many applications in image editing. However, none of these methods have been able to edit the layout of single existing images. To address this gap, we propose the first framework for layout editing of a single image while preserving its visual properties, thus allowing for continuous editing on a single image. Our approach is achieved through two key modules. First, to preserve the characteristics of multiple objects within an image, we disentangle the concepts of different objects and embed them into separate textual tokens using a novel method called masked textual inversion. Next, we propose a training-free optimization method to perform layout control for a pre-trained diffusion model, which allows us to regenerate images with learned concepts and align them with user-specified layouts. As the first framework to edit the layout of existing images, we demonstrate that our method is effective and outperforms other baselines that were modified to support this task. Our code will be freely available for public use upon acceptance. This paper presents the first framework for continuous layout editing of single images using diffusion models, allowing users to rearrange object positions while preserving visual properties. Existing layout control methods for image generation cannot edit the layout of existing images, limiting users' ability to experiment with different object arrangements within a given scene. The framework employs two key modules: (1) Masked Textual Inversion, which disentangles and embeds concepts of multiple objects within a single image into separate tokens, and (2) Training-free Layout Editing, which optimizes cross-attention during the diffusion process to align objects with user-specified layouts. The proposed method effectively edits image layouts while preserving visual fidelity, outperforming baselines in qualitative and quantitative comparisons. A user study confirms the superiority of the method in terms of visual similarity, layout alignment, image quality, and overall quality. The framework enables continuous layout editing, allowing users to experiment with various object arrangements within a single image. The method may struggle to preserve visual details when object sizes differ significantly between input and edited images, and with recovering the full body of heavily occluded objects. The layout editing process is not real-time due to the iterative nature of diffusion models. layout editing, diffusion models, textual inversion, image manipulation, single image editing
2306.12929 Report Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing Yelysei Bondarenko, Markus Nagel, Tijmen Blankevoort Transformer models have been widely adopted in various domains over the last years, and especially large language models have advanced the field of AI significantly. Due to their size, the capability of these networks has increased tremendously, but this has come at the cost of a significant increase in necessary compute. Quantization is one of the most effective ways to reduce the computational time and memory consumption of neural networks. Many studies have shown, however, that modern transformer models tend to learn strong outliers in their activations, making them difficult to quantize. To retain acceptable performance, the existence of these outliers requires activations to be in higher bitwidth or the use of different numeric formats, extra fine-tuning, or other workarounds. We show that strong outliers are related to very specific behavior of attention heads that try to learn a "no-op" or just a partial update of the residual. To achieve the exact zeros needed in the attention matrix for a no-update, the input to the softmax is pushed to be larger and larger during training, causing outliers in other parts of the network. Based on these observations, we propose two simple (independent) modifications to the attention mechanism - clipped softmax and gated attention. We empirically show that models pre-trained using our methods learn significantly smaller outliers while maintaining and sometimes even improving the floating-point task performance. This enables us to quantize transformers to full INT8 quantization of the activations without any additional effort. We demonstrate the effectiveness of our methods on both language models (BERT, OPT) and vision transformers. This paper proposes two simple modifications to the attention mechanism of transformer models: clipped softmax and gated attention, to address the problem of outliers in activations that hinder quantization. Quantization is crucial for reducing the computational time and memory consumption of transformer models, especially large language models. However, outliers in activations make it difficult to quantize these models effectively without significant performance degradation. The authors analyze the outlier problem and find that it stems from attention heads trying to learn a "no-op" or partial update of the residual. To achieve this, the input to the softmax is pushed to extreme values during training, causing outliers. The proposed methods, clipped softmax and gated attention, modify the attention mechanism to allow for small or zero attention outputs without requiring extreme softmax inputs. Clipped softmax and gated attention significantly reduce the magnitude of outliers in activations and the kurtosis of activation distributions. The proposed methods enable effective quantization of transformers to full INT8 quantization without performance loss, achieving performance close to the original FP16/32 models. In some cases, the proposed methods even improve the floating-point performance of the models, potentially by facilitating the learning of "no-op" updates. The scalability of the methods to very large transformers trained for extended periods needs further investigation. The methods introduce additional hyperparameters, although they demonstrate robustness to these parameters. quantization, transformers, outliers, attention mechanism, clipped softmax, gated attention
2306.12624 Report DreamEdit: Subject-driven Image Editing Tianle Li, Max Ku, Cong Wei, Wenhu Chen Subject-driven image generation aims at generating images containing customized subjects, which has recently drawn enormous attention from the research community. However, the previous works cannot precisely control the background and position of the target subject. In this work, we aspire to fill the void and propose two novel subject-driven sub-tasks, i.e., Subject Replacement and Subject Addition. The new tasks are challenging in multiple aspects: replacing a subject with a customized one can change its shape, texture, and color, while adding a target subject to a designated position in a provided scene necessitates a context-aware posture. To conquer these two novel tasks, we first manually curate a new dataset DreamEditBench containing 22 different types of subjects, and 440 source images with different difficulty levels. We plan to host DreamEditBench as a platform and hire trained evaluators for standard human evaluation. We also devise an innovative method DreamEditor to resolve these tasks by performing iterative generation, which enables a smooth adaptation to the customized subject. In this project, we conduct automatic and human evaluations to understand the performance of DreamEditor and baselines on DreamEditBench. For Subject Replacement, we found that the existing models are sensitive to the shape and color of the original subject. The model failure rate will dramatically increase when the source and target subjects are highly different. For Subject Addition, we found that the existing models cannot easily blend the customized subjects into the background smoothly, leading to noticeable artifacts in the generated image. We hope DreamEditBench can become a standard platform to enable future investigations toward building more controllable subject-driven image editing. Our project homepage is https://dreameditbenchteam.github.io/. This paper introduces two novel subject-driven image editing tasks: Subject Replacement and Subject Addition, aiming to replace or add a customized subject to an image while maintaining background integrity and subject realism. Existing subject-driven image generation methods lack control over subject placement and background, while image editing methods struggle with subject fidelity. This work aims to bridge this gap. A new benchmark dataset called DreamEditBench is curated with various subjects and backgrounds for the tasks. A novel iterative method, DreamEditor, is proposed. This method fine-tunes a text-to-image model with target subject images, iteratively in-paints the target subject onto the source image guided by segmentation masks and text prompts. DreamEditor achieves better overall scores compared to baselines in both automatic and human evaluations. The proposed tasks pose significant challenges to existing models, especially when source and target subjects differ significantly or require complex contextual interaction. Human evaluation reveals significant discrepancies with automatic metrics, highlighting the need for rigorous human assessment in this field. DreamEditor struggles with large discrepancies between source and target subjects. Iterative generation can lead to blurry backgrounds, and the model's success relies heavily on the performance of the segmentation and in-painting models. image editing, subject-driven generation, iterative generation, dreambooth, human evaluation
2306.12570 Report Local 3D Editing via 3D Distillation of CLIP Knowledge Junha Hyung, Sungwon Hwang, Daejin Kim, Hyunji Lee, Jaegul Choo 3D content manipulation is an important computer vision task with many real-world applications (e.g., product design, cartoon generation, and 3D Avatar editing). Recently proposed 3D GANs can generate diverse photorealistic 3D-aware contents using Neural Radiance fields (NeRF). However, manipulation of NeRF still remains a challenging problem since the visual quality tends to degrade after manipulation and suboptimal control handles such as 2D semantic maps are used for manipulations. While text-guided manipulations have shown potential in 3D editing, such approaches often lack locality. To overcome these problems, we propose Local Editing NeRF (LENeRF), which only requires text inputs for fine-grained and localized manipulation. Specifically, we present three add-on modules of LENeRF, the Latent Residual Mapper, the Attention Field Network, and the Deformation Network, which are jointly used for local manipulations of 3D features by estimating a 3D attention field. The 3D attention field is learned in an unsupervised way, by distilling the zero-shot mask generation capability of CLIP to the 3D space with multi-view guidance. We conduct diverse experiments and thorough evaluations both quantitatively and qualitatively. Proposes LENeRF, a framework for localized editing of 3D scenes using text prompts for manipulation and region specification, enabling real-time, high-fidelity edits. Addresses limitations of existing 3D editing methods that lack locality, rely on suboptimal 2D guidance, or struggle with photorealism and multi-view consistency. Combines a pretrained NeRF generator with three trainable modules: Latent Residual Mapper for generating target features, Attention Field Network for estimating 3D masks, and Deformation Network for handling geometric changes. Trained with CLIP guidance and pseudo-labels from CLIP-generated relevance maps. Achieves localized editing with minimal unintended changes, as demonstrated by quantitative metrics and qualitative comparisons. Exhibits robustness to out-of-distribution editing scenarios. Enables sequential editing while preserving identity and content quality. Relies on pretrained models (EG3D, CLIP) and may be limited by their capabilities. Generation of accurate 3D masks from 2D relevance maps remains challenging, potentially leading to artifacts. 3d editing, nerf, clip, text-guided editing, 3d mask generation
2306.12511 Report Semi-Implicit Denoising Diffusion Models (SIDDMs) Yanwu Xu, Mingming Gong, Shaoan Xie, Wei Wei, Matthias Grundmann, Kayhan Batmanghelich, Tingbo Hou Despite the proliferation of generative models, achieving fast sampling during inference without compromising sample diversity and quality remains challenging. Existing models such as Denoising Diffusion Probabilistic Models (DDPM) deliver high-quality, diverse samples but are slowed by an inherently high number of iterative steps. The Denoising Diffusion Generative Adversarial Networks (DDGAN) attempted to circumvent this limitation by integrating a GAN model for larger jumps in the diffusion process. However, DDGAN encountered scalability limitations when applied to large datasets. To address these limitations, we introduce a novel approach that tackles the problem by matching implicit and explicit factors. More specifically, our approach involves utilizing an implicit model to match the marginal distributions of noisy data and the explicit conditional distribution of the forward diffusion. This combination allows us to effectively match the joint denoising distributions. Unlike DDPM but similar to DDGAN, we do not enforce a parametric distribution for the reverse step, enabling us to take large steps during inference. Similar to the DDPM but unlike DDGAN, we take advantage of the exact form of the diffusion process. We demonstrate that our proposed method obtains comparable generative performance to diffusion-based models and vastly superior results to models with a small number of sampling steps. This paper introduces Semi-Implicit Denoising Diffusion Models (SIDDMs), a novel approach for fast sampling in generative models without compromising sample quality and diversity, addressing limitations in existing DDPM and DDGAN models. Achieving fast sampling, high-quality samples, and mode coverage simultaneously in generative models is challenging. Existing methods struggle to address all three aspects effectively, especially for large-scale datasets. SIDDMs decompose the denoising distribution into marginal and conditional distributions, leveraging both implicit GAN objectives for marginal matching and explicit L2 reconstruction loss for conditional matching (Auxiliary Forward Diffusion, AFD). This approach enables fast sampling similar to DDGANs while maintaining high generation quality comparable to DDPMs. Additionally, the paper introduces a novel discriminator regularization technique using an auxiliary denoising task. SIDDMs demonstrate superior quantitative results over DDGANs on CIFAR10, CelebA-HQ, and ImageNet datasets. The proposed method achieves comparable generative performance to DDPMs while requiring significantly fewer sampling steps. Ablation studies confirm the effectiveness of the proposed decomposition and the discriminator regularization technique. While SIDDMs show promising results, there is still a small quality gap compared to state-of-the-art diffusion-based models. Future work could explore further improvements in the discriminator regularization and investigate the application of SIDDMs to other generative tasks. generative models, diffusion models, fast sampling, gans, image generation
2306.12423 Report Benchmarking and Analyzing 3D-aware Image Synthesis with a Modularized Codebase Qiuyu Wang, Zifan Shi, Kecheng Zheng, Yinghao Xu, Sida Peng, Yujun Shen Despite the rapid advance of 3D-aware image synthesis, existing studies usually adopt a mixture of techniques and tricks, leaving it unclear how each part contributes to the final performance in terms of generality. Following the most popular and effective paradigm in this field, which incorporates a neural radiance field (NeRF) into the generator of a generative adversarial network (GAN), we build a well-structured codebase, dubbed Carver, through modularizing the generation process. Such a design allows researchers to develop and replace each module independently, and hence offers an opportunity to fairly compare various approaches and recognize their contributions from the module perspective. The reproduction of a range of cutting-edge algorithms demonstrates the availability of our modularized codebase. We also perform a variety of in-depth analyses, such as the comparison across different types of point feature, the necessity of the tailing upsampler in the generator, the reliance on the camera pose prior, etc., which deepen our understanding of existing methods and point out some further directions of the research work. We release code and models at https://github.com/qiuyu96/Carver to facilitate the development and evaluation of this field. This paper introduces Carver, a modular codebase for 3D-aware image synthesis, enabling researchers to readily develop and replace individual modules within the generation pipeline. Existing 3D-aware image synthesis methods often rely on entangled implementations, making it challenging to isolate and compare the contributions of different techniques. Carver addresses this limitation by offering a modular framework. Carver decomposes the generation process into independent modules: pose sampler, stochasticity mapper, point sampler, point embedder, feature decoder, volume renderer, and upsampler. This modular design allows for flexible configuration and integration of various techniques. Different point embedders (MLP, volume, tri-plane) show comparable performance when combined with an upsampler. SIREN-based MLPs excel without an upsampler, while ReLU-based MLPs benefit from upsampling. Exploiting SDF-based geometric representations generally yields inferior results compared to density-based representations. Training 3D GANs remains computationally expensive, especially with MLP and volume-based point embedders. The paper primarily focuses on object-level datasets, and future work should explore extending 3D GANs to more diverse and complex scenes. 3d-aware image synthesis, generative adversarial networks (gans), neural radiance fields (nerfs), modular codebase, 3d representation learning
2306.12321 Report Dynamic Implicit Image Function for Efficient Arbitrary-Scale Image Representation Zongyao He, Zhi Jin Recent years have witnessed the remarkable success of implicit neural representation methods. The recent work Local Implicit Image Function (LIIF) has achieved satisfactory performance for continuous image representation, where pixel values are inferred from a neural network in a continuous spatial domain. However, the computational cost of such implicit arbitrary-scale super-resolution (SR) methods increases rapidly as the scale factor increases, which makes arbitrary-scale SR time-consuming. In this paper, we propose Dynamic Implicit Image Function (DIIF), which is a fast and efficient method to represent images with arbitrary resolution. Instead of taking an image coordinate and the nearest 2D deep features as inputs to predict its pixel value, we propose a coordinate grouping and slicing strategy, which enables the neural network to perform decoding from coordinate slices to pixel value slices. We further propose a Coarse-to-Fine Multilayer Perceptron (C2F-MLP) to perform decoding with dynamic coordinate slicing, where the number of coordinates in each slice varies as the scale factor varies. With dynamic coordinate slicing, DIIF significantly reduces the computational cost when encountering arbitrary-scale SR. Experimental results demonstrate that DIIF can be integrated with implicit arbitrary-scale SR methods and achieves SOTA SR performance with significantly superior computational efficiency, thereby opening a path for real-time arbitrary-scale image representation. Our code can be found at https://github.com/HeZongyao/DIIF. Proposes Dynamic Implicit Image Function (DIIF), a fast and efficient arbitrary-resolution image representation method, significantly reducing computational cost in arbitrary-scale super-resolution. Arbitrary-scale super-resolution methods based on implicit neural representations are computationally expensive, limiting their practical use despite offering continuous image representation. Introduces coordinate grouping and slicing strategy for efficient pixel value prediction, and a Coarse-to-Fine Multilayer Perceptron (C2F-MLP) for dynamic coordinate slicing based on scale factor. DIIF significantly reduces the computational cost of arbitrary-scale super-resolution, achieving up to 87% lower cost compared to previous methods. DIIF, when integrated with existing implicit methods like LIIF and LTE, improves their super-resolution performance while enhancing efficiency. DIIF demonstrates state-of-the-art super-resolution performance with superior computational efficiency, enabling faster and higher-quality results. Limited effectiveness in reducing the number of parameters, though most originate from the encoder. Future work to focus on improving image representation and exploring more efficient decoding function architectures. image representation, super-resolution, implicit neural representation, arbitrary-scale, computational efficiency
2306.11719 Report Diffusion with Forward Models: Solving Stochastic Inverse Problems Without Direct Supervision Ayush Tewari, Tianwei Yin, George Cazenavette, Semon Rezchikov, Joshua B. Tenenbaum, Frédo Durand, William T. Freeman, Vincent Sitzmann Denoising diffusion models are a powerful type of generative models used to capture complex distributions of real-world signals. However, their applicability is limited to scenarios where training samples are readily available, which is not always the case in real-world applications. For example, in inverse graphics, the goal is to generate samples from a distribution of 3D scenes that align with a given image, but ground-truth 3D scenes are unavailable and only 2D images are accessible. To address this limitation, we propose a novel class of denoising diffusion probabilistic models that learn to sample from distributions of signals that are never directly observed. Instead, these signals are measured indirectly through a known differentiable forward model, which produces partial observations of the unknown signal. Our approach involves integrating the forward model directly into the denoising process. This integration effectively connects the generative modeling of observations with the generative modeling of the underlying signals, allowing for end-to-end training of a conditional generative model over signals. During inference, our approach enables sampling from the distribution of underlying signals that are consistent with a given partial observation. We demonstrate the effectiveness of our method on three challenging computer vision tasks. For instance, in the context of inverse graphics, our model enables direct sampling from the distribution of 3D scenes that align with a single 2D input image. This paper introduces a novel method that integrates differentiable forward models with conditional denoising diffusion models, enabling the sampling of distributions of signals never observed directly, but only through partial observations generated by a known forward model. This approach addresses the limitation of existing generative models that require direct access to training samples from the output distribution, which is often not feasible for real-world tasks like inverse graphics. The method integrates the forward model into the denoising process of a conditional diffusion model. It trains on pairs of partial observations of the same signal, using one as context and the other as the target for denoising. By iteratively denoising the target observation conditioned on the context and forward model, the model learns to sample underlying signals consistent with the observations. The paper provides a formal proof demonstrating that the proposed model asymptotically learns the true conditional distribution over signals as the number of observations per signal increases. The method is successfully applied to three challenging computer vision tasks: inverse graphics, single-image motion prediction, and GAN inversion, demonstrating its efficacy in generating diverse and plausible samples consistent with partial observations. In inverse graphics, the method enables direct sampling from the distribution of 3D scenes consistent with a single 2D image, outperforming previous state-of-the-art approaches in generating realistic and diverse 3D scenes. The method can be computationally expensive, particularly for tasks like inverse graphics that involve computationally intensive forward models (e.g., volume rendering). The current implementation requires multi-view observations for training in the inverse graphics application, limiting its applicability to scenarios with single-view data. generative modeling, diffusion models, inverse problems, computer vision, inverse graphics
2306.11510 Report Pushing the Limits of 3D Shape Generation at Scale Yu Wang, Xuelin Qian, Jingyang Huo, Tiejun Huang, Bo Zhao, Yanwei Fu We present a significant breakthrough in 3D shape generation by scaling it to unprecedented dimensions. Through the adaptation of the Auto-Regressive model and the utilization of large language models, we have developed a remarkable model with an astounding 3.6 billion trainable parameters, establishing it as the largest 3D shape generation model to date, named Argus-3D. Our approach addresses the limitations of existing methods by enhancing the quality and diversity of generated 3D shapes. To tackle the challenges of high-resolution 3D shape generation, our model incorporates tri-plane features as latent representations, effectively reducing computational complexity. Additionally, we introduce a discrete codebook for efficient quantization of these representations. Leveraging the power of transformers, we enable multi-modal conditional generation, facilitating the production of diverse and visually impressive 3D shapes. To train our expansive model, we leverage an ensemble of publicly-available 3D datasets, consisting of a comprehensive collection of approximately 900,000 objects from renowned repositories such as ModelNet40, ShapeNet, Pix3D, 3D-Future, and Objaverse. This diverse dataset empowers our model to learn from a wide range of object variations, bolstering its ability to generate high-quality and diverse 3D shapes. Extensive experiments demonstrate the remarkable efficacy of our approach in significantly improving the visual quality of generated 3D shapes. By pushing the boundaries of 3D generation, introducing novel methods for latent representation learning, and harnessing the power of transformers for multi-modal conditional generation, our contributions pave the way for substantial advancements in the field. Our work unlocks new possibilities for applications in gaming, virtual reality, product design, and other domains that demand high-quality and diverse 3D objects. This paper presents Argus-3D, a novel 3D shape generation model boasting 3.6 billion trainable parameters, making it the largest of its kind. This model leverages auto-regressive techniques and large language models to enhance the quality and diversity of generated shapes, outperforming previous state-of-the-art methods. Existing 3D shape generation methods struggle to produce high-resolution shapes with both quality and diversity. This work addresses these limitations, pushing the boundaries of 3D generation and opening doors for applications in gaming, VR, and product design. The methodology involves a two-stage process: 1) Learning discrete representations by encoding point clouds into tri-plane features and quantizing them with a discrete codebook. 2) Training a transformer to generate these quantized representations autoregressively, allowing for multi-modal conditional generation based on inputs like text or images. Argus-3D significantly improves visual quality of generated 3D shapes, evidenced by quantitative metrics like IoU, MMD, and FPD. The model exhibits high diversity in generated shapes, surpassing previous methods in metrics like TMD and COV. Argus-3D demonstrates strong capability in multi-modal conditional generation, successfully producing 3D shapes guided by class labels, images, and even text prompts. The model's effectiveness relies heavily on the availability of large-scale 3D datasets, which can be costly and complex to create. The transformer architecture demands significant computational resources, limiting accessibility and inference speed. 3d shape generation, auto-regressive model, large language models, multi-modal generation, deep learning
2306.11363 Report Masked Diffusion Models Are Fast Distribution Learners Jiachen Lei, Qinglong Wang, Peng Cheng, Zhongjie Ba, Zhan Qin, Zhibo Wang, Zhenguang Liu, Kui Ren The diffusion model has emerged as the de facto model for image generation, yet the heavy training overhead hinders its broader adoption in the research community. We observe that diffusion models are commonly trained to learn all fine-grained visual information from scratch. This paradigm may cause unnecessary training costs hence requiring in-depth investigation. In this work, we show that it suffices to train a strong diffusion model by first pre-training the model to learn some primer distribution that loosely characterizes the unknown real image distribution. Then the pre-trained model can be fine-tuned for various generation tasks efficiently. In the pre-training stage, we propose to mask a high proportion (e.g., up to 90%) of input images to approximately represent the primer distribution and introduce a masked denoising score matching objective to train a model to denoise visible areas. In the subsequent fine-tuning stage, we efficiently train the diffusion model without masking. Utilizing the two-stage training framework, we achieve significant training acceleration and a new FID score record of 6.27 on CelebA-HQ 256x256 for ViT-based diffusion models. The generalizability of a pre-trained model further helps build models that perform better than ones trained from scratch on different downstream datasets. For instance, a diffusion model pre-trained on VGGFace2 attains a 46% quality improvement when fine-tuned on a different dataset that contains only 3000 images. Our code is available at https://github.com/jiachenlei/maskdm. This paper proposes Masked Diffusion Models (MaskDM), a two-stage training framework for diffusion models that significantly reduces training time and improves performance in image generation. Training diffusion models for image generation is computationally expensive, hindering broader research adoption. This work aims to improve the efficiency of the training process. The authors employ masked pre-training, where a model learns from masked images to approximate a "primer" distribution that captures salient image features. This pre-trained model is then fine-tuned on full images using a standard denoising score matching objective. MaskDM achieves a new FID score record of 6.27 on CelebA-HQ 256x256 for ViT-based diffusion models. Masked pre-training accelerates training across various datasets and shows superior performance even when fine-tuned with limited data. The training efficiency gains of MaskDM become increasingly significant as image resolution increases. The study primarily focuses on U-ViT architecture; further exploration of other ViT variants is needed. While manual adjustment of mask rates during training demonstrates improved performance, future work could explore automated dynamic training schedules. diffusion models, image generation, vision transformer (vit), masked pre-training, training efficiency
2306.10959 Report RaViTT: Random Vision Transformer Tokens Felipe A. Quezada, Carlos F. Navarro, Cristian Muñoz, Manuel Zamorano, Jorge Jara-Wilde, Violeta Chang, Cristóbal A. Navarro, Mauricio Cerda Vision Transformers (ViTs) have successfully been applied to image classification problems where large annotated datasets are available. On the other hand, when fewer annotations are available, such as in biomedical applications, image augmentation techniques like introducing image variations or combinations have been proposed. However, regarding ViT patch sampling, less has been explored outside grid-based strategies. In this work, we propose Random Vision Transformer Tokens (RaViTT), a random patch sampling strategy that can be incorporated into existing ViTs. We experimentally evaluated RaViTT for image classification, comparing it with a baseline ViT and state-of-the-art (SOTA) augmentation techniques in 4 datasets, including ImageNet-1k and CIFAR-100. Results show that RaViTT increases the accuracy of the baseline in all datasets and outperforms the SOTA augmentation techniques in 3 out of 4 datasets by a significant margin +1.23% to +4.32%. Interestingly, RaViTT accuracy improvements can be achieved even with fewer tokens, thus reducing the computational load of any ViT model for a given accuracy value. This paper introduces Random Vision Transformer Tokens (RaViTT), a random patch sampling strategy for Vision Transformer (ViT) models that enhances image classification performance, especially for datasets with limited training samples. In image classification, especially biomedical applications, limited annotated datasets hinder the training of deep learning models. While augmentation techniques exist, exploring patch sampling beyond grid-based methods in ViTs is limited, creating a need for alternative approaches. Instead of using the standard regular grid-like patch sampling, RaViTT randomly selects patches from the input image, potentially increasing the diversity of training samples and improving feature extraction. This random sampling allows for overlapping patches and employs a sampling factor (r) to control the number of patches extracted. RaViTT increases the accuracy of the baseline ViT model in all four evaluated datasets (ImageNet-1k, CIFAR-100, G. CANCER-3, and DeFungi). RaViTT outperforms state-of-the-art (SOTA) augmentation techniques (RandAugment and MixUp) in three out of four datasets. RaViTT can achieve accuracy improvements even with fewer tokens than the baseline, indicating the potential for reducing computational load without sacrificing accuracy. The performance gain of RaViTT is limited on the CIFAR-100 dataset, potentially due to the small image size and the resulting high overlap between randomly sampled patches. Future work can explore optimizing random distributions for patch sampling to further enhance RaViTT's efficiency, especially when the sampling factor is high (r>1). vision transformers, image classification, random patch sampling, data augmentation, computational efficiency
2306.10730 Report UniG3D: A Unified 3D Object Generation Dataset Qinghong Sun, Yangguang Li, ZeXiang Liu, Xiaoshui Huang, Fenggang Liu, Xihui Liu, Wanli Ouyang, Jing Shao The field of generative AI has a transformative impact on various areas, including virtual reality, autonomous driving, the metaverse, gaming, and robotics. Among these applications, 3D object generation techniques are of utmost importance. This technique has unlocked fresh avenues in the realm of creating, customizing, and exploring 3D objects. However, the quality and diversity of existing 3D object generation methods are constrained by the inadequacies of existing 3D object datasets, including issues related to text quality, the incompleteness of multi-modal data representation encompassing 2D rendered images and 3D assets, as well as the size of the dataset. In order to resolve these issues, we present UniG3D, a unified 3D object generation dataset constructed by employing a universal data transformation pipeline on Objaverse and ShapeNet datasets. This pipeline converts each raw 3D model into comprehensive multi-modal data representation by employing rendering engines and multi-modal models. These modules ensure the richness of textual information and the comprehensiveness of data representation. Remarkably, the universality of our pipeline refers to its ability to be applied to any 3D dataset, as it only requires raw 3D data. The selection of data sources for our dataset is based on their scale and quality. Subsequently, we assess the effectiveness of our dataset by employing Point-E and SDFusion, two widely recognized methods for object generation, tailored to the prevalent 3D representations of point clouds and signed distance functions. Our dataset is available at: https://unig3d.github.io. This paper introduces UniG3D, a unified large-scale 3D object generation dataset with rich textual descriptions and comprehensive multi-modal data (mesh, point cloud, image). Existing 3D object generation methods are limited by the inadequacies of current datasets, including issues related to text quality, the lack of multi-modal data (e.g., 2D rendered images, 3D assets), and dataset size. The authors construct UniG3D by developing a universal data transformation pipeline that converts raw 3D models from ShapeNet and Objaverse into the unified multi-modal representation using a rendering engine (Blender) and multi-modal models (CLIP and BLIP). Using both text and images as conditioning inputs leads to better 3D object generation than using either modality alone. Increasing data sources and incorporating multi-view data improve the diversity and quality of generated 3D objects. Generating and leveraging richer textual descriptions beyond object categories significantly improves the controllability and quality of text-conditioned 3D object generation. The experiments are limited by computational resources, preventing the use of the full UniG3D-Objaverse dataset. Future work will explore a wider range of 3D generation methods and incorporate 3D understanding tasks. 3d object generation, dataset, multi-modal, text-to-3d, image-to-3d
2306.10533 Report Point-Cloud Completion with Pretrained Text-to-image Diffusion Models Yoni Kasten, Ohad Rahamim, Gal Chechik Point-cloud data collected in real-world applications are often incomplete. Data is typically missing due to objects being observed from partial viewpoints, which only capture a specific perspective or angle. Additionally, data can be incomplete due to occlusion and low-resolution sampling. Existing completion approaches rely on datasets of predefined objects to guide the completion of noisy and incomplete, point clouds. However, these approaches perform poorly when tested on Out-Of-Distribution (OOD) objects, that are poorly represented in the training dataset. Here we leverage recent advances in text-guided image generation, which lead to major breakthroughs in text-guided shape generation. We describe an approach called SDS-Complete that uses a pre-trained text-to-image diffusion model and leverages the text semantics of a given incomplete point cloud of an object, to obtain a complete surface representation. SDS-Complete can complete a variety of objects using test-time optimization without expensive collection of 3D information. We evaluate SDS Complete on incomplete scanned objects, captured by real-world depth sensors and LiDAR scanners. We find that it effectively reconstructs objects that are absent from common datasets, reducing Chamfer loss by 50% on average compared with current methods. Project page: https://sds-complete.github.io/ Presents SDS-Complete, a method for completing point clouds into complete surface representations using pre-trained text-to-image diffusion models and test-time optimization. Addresses limitations of existing point cloud completion methods that struggle with out-of-distribution (OOD) objects not well-represented in training datasets. Leverages the semantic prior of pre-trained text-to-image diffusion models through the SDS loss, combined with an SDF surface representation and constraints to enforce consistency with input points and sensor observations. Achieves state-of-the-art completion results for OOD objects. Demonstrates robustness to variations in text prompts. Maintains comparable performance to existing methods on in-domain objects. Limited by low-resolution image rendering for SDS loss due to GPU memory constraints. Struggles with objects containing components with disc topology due to SDF initialization. point cloud completion, diffusion models, text-to-image synthesis, signed distance function, out-of-distribution generalization
2306.10441 Report Image Harmonization with Diffusion Model Jiajie Li, Jian Wang, Chen Wang, Jinjun Xiong Image composition in image editing involves merging a foreground image with a background image to create a composite. Inconsistent lighting conditions between the foreground and background often result in unrealistic composites. Image harmonization addresses this challenge by adjusting illumination and color to achieve visually appealing and consistent outputs. In this paper, we present a novel approach for image harmonization by leveraging diffusion models. We conduct a comparative analysis of two conditional diffusion models, namely Classifier-Guidance and Classifier-Free. Our focus is on addressing the challenge of adjusting illumination and color in foreground images to create visually appealing outputs that seamlessly blend with the background. Through this research, we establish a solid groundwork for future investigations in the realm of diffusion model-based image harmonization. This paper presents a novel image harmonization method utilizing diffusion models, focusing on adjusting foreground illumination and color for realistic integration with the background. Image composition often suffers from inconsistent lighting between foreground and background, leading to unrealistic composites. This method leverages diffusion models to address this challenge and create visually appealing, consistent outputs. The approach utilizes both classifier-guided and classifier-free conditional diffusion models, including DDPM and LDM. It introduces an appearance consistency discriminator and a color transfer method to maintain visual coherence throughout the harmonization process. The method achieves superior performance compared to existing state-of-the-art approaches on the iHarmony4 dataset. Experiments on real composite images from Open Image Dataset V6 and Flick Dataset also demonstrate its effectiveness. The inherent stochasticity of the diffusion model allows generating multiple diverse harmonization results for a single input, providing users with flexibility and control. The discrepancy between the iHarmony4 dataset's synthesized composite images and real-world scenarios might limit the model's generalizability. Future work will focus on addressing real-world image harmonization challenges with more complex and diverse lighting conditions. image harmonization, diffusion models, image editing, deep learning, computer vision
2306.10128 Report Systematic Architectural Design of Scale Transformed Attention Condenser DNNs via Multi-Scale Class Representational Response Similarity Analysis Andre Hryniowski, Alexander Wong Self-attention mechanisms are commonly included in convolutional neural networks to achieve an improved efficiency-performance balance. However, adding self-attention mechanisms adds additional hyperparameters to tune for the application at hand. In this work we propose a novel type of DNN analysis called Multi-Scale Class Representational Response Similarity Analysis (ClassRepSim) which can be used to identify specific design interventions that lead to more efficient self-attention convolutional neural network architectures. Using insights gained from ClassRepSim, we propose the Spatial Transformed Attention Condenser (STAC) module, a novel attention-condenser based self-attention module. We show that adding STAC modules to ResNet style architectures can result in up to a 1.6% increase in top-1 accuracy compared to vanilla ResNet models and up to a 0.5% increase in top-1 accuracy compared to SENet models on the ImageNet64x64 dataset, at the cost of up to 1.7% increase in FLOPs and 2x the number of parameters. In addition, we demonstrate that results from ClassRepSim analysis can be used to select an effective parameterization of the STAC module resulting in competitive performance compared to an extensive parameter search. This paper proposes the Spatial Transformed Attention Condenser (STAC) module, an efficient self-attention module for convolutional neural networks informed by a novel analysis method called Multi-Scale Class Representational Response Similarity Analysis (ClassRepSim). Adding self-attention mechanisms can improve deep neural network efficiency and performance, but it introduces additional hyperparameters. This work aims to guide the design of more efficient self-attention architectures. The authors introduce ClassRepSim, which analyzes the class-wise similarity of data representations at different spatial scales within a DNN. They use insights from this analysis to design the STAC module. Experiments are conducted on CIFAR10, ImageNet64x64-50, and ImageNet64x64 datasets using ResNet architectures. Adding STAC modules to ResNet architectures consistently improves top-1 accuracy compared to vanilla ResNet models. STAC modules outperform existing self-attention modules like SENet and BAM in most cases, achieving higher accuracy with a smaller increase in computational cost. ClassRepSim analysis effectively guides the selection of STAC module parameters, leading to competitive performance compared to extensive parameter search. Further experiments are needed to assess the generalizability of STAC modules across a wider range of datasets and model architectures. Future work could explore the relationship between ClassRepSim and other representational response metrics, such as intrinsic dimensionality. self-attention, deep neural networks, computer vision, image classification, representational similarity analysis
2306.10012 Report MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, Yu Su Text-guided image editing is widely needed in daily life, ranging from personal use to professional applications such as Photoshop. However, existing methods are either zero-shot or trained on an automatically synthesized dataset, which contains a high volume of noise. Thus, they still require lots of manual tuning to produce desirable outcomes in practice. To address this issue, we introduce MagicBrush (https://osu-nlp-group.github.io/MagicBrush/), the first large-scale, manually annotated dataset for instruction-guided real image editing that covers diverse scenarios: single-turn, multi-turn, mask-provided, and mask-free editing. MagicBrush comprises over 10K manually annotated triplets (source image, instruction, target image), which supports training large-scale text-guided image editing models. We fine-tune InstructPix2Pix on MagicBrush and show that the new model can produce much better images according to human evaluation. We further conduct extensive experiments to evaluate current image editing baselines from multiple dimensions including quantitative, qualitative, and human evaluations. The results reveal the challenging nature of our dataset and the gap between current baselines and real-world editing needs. This paper introduces MagicBrush, the first large-scale, manually annotated dataset specifically designed for instruction-guided real image editing, covering diverse scenarios like single-turn, multi-turn, mask-provided, and mask-free editing. Existing methods rely on zero-shot learning or training on synthetic data with noise, limiting their effectiveness for real-world editing. MagicBrush addresses this gap by providing high-quality, human-annotated data to facilitate the development and evaluation of more robust and user-friendly image editing models. The dataset was created using a rigorous crowdsourcing process involving qualified workers on Amazon Mechanical Turk. Workers proposed edit instructions and utilized the DALL-E 2 image editing platform to interactively synthesize target images. The process involved single and multi-turn edits with and without mask guidance, ensuring diversity in editing scenarios. MagicBrush comprises over 10K manually annotated triplets (source image, instruction, target image), making it suitable for training large-scale text-guided image editing models. Fine-tuning InstructPix2Pix on MagicBrush significantly improved its performance compared to the original model and other baselines, as demonstrated by quantitative, qualitative, and human evaluations. Existing image editing models, even with additional guidance like masks, struggle to match the quality and consistency of human-annotated edits in MagicBrush, highlighting the challenging nature of the dataset and the need for more advanced models. While efforts were made to ensure diversity, MagicBrush's reliance on DALL-E 2 for ground truth generation may introduce inherent biases. The dataset primarily focuses on local editing tasks and does not cover global editing operations like style transfer, which could be explored in future work. image editing, text-guided image editing, dataset, instruction following, human evaluation
2306.09864 Report AvatarBooth: High-Quality and Customizable 3D Human Avatar Generation Yifei Zeng, Yuanxun Lu, Xinya Ji, Yao Yao, Hao Zhu, Xun Cao We introduce AvatarBooth, a novel method for generating high-quality 3D avatars using text prompts or specific images. Unlike previous approaches that can only synthesize avatars based on simple text descriptions, our method enables the creation of personalized avatars from casually captured face or body images, while still supporting text-based model generation and editing. Our key contribution is the precise avatar generation control by using dual fine-tuned diffusion models separately for the human face and body. This enables us to capture intricate details of facial appearance, clothing, and accessories, resulting in highly realistic avatar generations. Furthermore, we introduce pose-consistent constraint to the optimization process to enhance the multi-view consistency of synthesized head images from the diffusion model and thus eliminate interference from uncontrolled human poses. In addition, we present a multi-resolution rendering strategy that facilitates coarse-to-fine supervision of 3D avatar generation, thereby enhancing the performance of the proposed system. The resulting avatar model can be further edited using additional text descriptions and driven by motion sequences. Experiments show that AvatarBooth outperforms previous text-to-3D methods in terms of rendering and geometric quality from either text prompts or specific images. Please check our project website at https://zeng-yifei.github.io/avatarbooth_page/. AvatarBooth, a novel method for generating high-quality, customizable 3D avatars from text prompts or specific images, enabling personalized avatar creation with intricate details of facial appearance, clothing, and accessories. Creating 3D human avatars from text or images is crucial for various applications, but existing methods struggle to synthesize high-quality shapes and appearances, especially for personalized avatars. The method uses dual fine-tuned diffusion models for the face and body, a pose-consistent constraint for multi-view consistency, and a multi-resolution rendering strategy for coarse-to-fine supervision of 3D avatar generation. Generates high-quality 3D avatars matching text prompts or specific images. Enables personalized avatar creation with detailed facial features, clothing, and accessories. Outperforms previous text-to-3D methods in rendering and geometric quality. Accuracy and speed of model generation can be further improved. Leveraging existing 3D human datasets could enhance avatar quality. avatar creation, diffusion model, neural implicit field, model fine-tuning, 3d human avatar generation
2306.09683 Report Scaling Open-Vocabulary Object Detection Matthias Minderer, Alexey Gritsenko, Neil Houlsby Open-vocabulary object detection has benefited greatly from pretrained vision-language models, but is still limited by the amount of available detection training data. While detection training data can be expanded by using Web image-text pairs as weak supervision, this has not been done at scales comparable to image-level pretraining. Here, we scale up detection data with self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. Major challenges in scaling self-training are the choice of label space, pseudo-annotation filtering, and training efficiency. We present the OWLv2 model and OWL-ST self-training recipe, which address these challenges. OWLv2 surpasses the performance of previous state-of-the-art open-vocabulary detectors already at comparable training scales (~10M examples). However, with OWL-ST, we can scale to over 1B examples, yielding further large improvement: With an L/14 architecture, OWL-ST improves AP on LVIS rare classes, for which the model has seen no human box annotations, from 31.2% to 44.6% (43% relative improvement). OWL-ST unlocks Web-scale training for open-world localization, similar to what has been seen for image classification and language modelling. This paper introduces OWL-ST, a self-training approach for open-vocabulary object detection that leverages web-scale image-text pairs, and OWLv2, an architecture optimized for efficient training. Open-vocabulary object detection, despite benefiting from pre-trained vision-language models, is still hampered by limited detection training data. This work aims to address this bottleneck using self-training on a massive scale, comparable to image-level pre-training. The authors utilize OWL-ViT for generating bounding box pseudo-annotations on the WebLI dataset. They introduce a simple yet effective self-training recipe focusing on: (1) employing all N-grams from image captions as detection prompts, (2) utilizing weak confidence filtering for pseudo-labels, and (3) enhancing training efficiency through token dropping, instance selection, and image mosaics. OWLv2 coupled with OWL-ST surpasses previous state-of-the-art open-vocabulary detectors even at moderate training scales. Scaling self-training to billions of examples further boosts performance, significantly improving AP on LVIS rare classes (unseen by human annotators). The authors demonstrate a trade-off between fine-tuned and open-vocabulary performance, suggesting potential for future research in robust generalization. The self-training process demands significant compute and data resources, making further scaling increasingly challenging. The trade-off between performance on fine-tuned classes and open-vocabulary robustness needs further investigation for improved generalization. open-vocabulary object detection, self-training, weak supervision, web-scale data, vision-language models
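Two ingredients of the self-training recipe described above (caption N-grams as detection prompts, and weak confidence filtering of pseudo-boxes) can be illustrated with a short Python sketch; the box format and the 0.3 threshold are assumptions for illustration, not the values used by OWL-ST.

```python
# Illustrative sketch of two ingredients of an OWL-ST-style self-training
# recipe: (1) using caption N-grams as detection prompts and (2) keeping only
# pseudo-boxes above a weak confidence threshold. The pseudo-box format and
# the threshold value are assumptions for illustration.
from typing import List, Tuple

def caption_ngrams(caption: str, max_n: int = 3) -> List[str]:
    """Return all 1..max_n word N-grams of a caption as candidate prompts."""
    words = caption.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams

def filter_pseudo_boxes(
    boxes: List[Tuple[List[float], str, float]],  # (xyxy box, prompt, score)
    min_score: float = 0.3,                       # weak filter, assumed value
) -> List[Tuple[List[float], str, float]]:
    """Keep pseudo-annotations whose detector confidence exceeds min_score."""
    return [b for b in boxes if b[2] >= min_score]

if __name__ == "__main__":
    prompts = caption_ngrams("a brown dog catching a frisbee")
    # A detector such as OWL-ViT would score each prompt against the image;
    # here we fake two pseudo-boxes to show the filtering step.
    fake = [([10, 20, 120, 200], "brown dog", 0.71),
            ([5, 5, 50, 60], "frisbee", 0.12)]
    print(len(prompts), "prompts;", filter_pseudo_boxes(fake))
```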
2306.09551 Report Edit-DiffNeRF: Editing 3D Neural Radiance Fields using 2D Diffusion Model Lu Yu, Wei Xiang, Kang Han Recent research has demonstrated that the combination of pretrained diffusion models with neural radiance fields (NeRFs) has emerged as a promising approach for text-to-3D generation. Simply coupling NeRF with diffusion models will result in cross-view inconsistency and degradation of stylized view syntheses. To address this challenge, we propose the Edit-DiffNeRF framework, which is composed of a frozen diffusion model, a proposed delta module to edit the latent semantic space of the diffusion model, and a NeRF. Instead of training the entire diffusion for each scene, our method focuses on editing the latent semantic space in frozen pretrained diffusion models by the delta module. This fundamental change to the standard diffusion framework enables us to make fine-grained modifications to the rendered views and effectively consolidate these instructions in a 3D scene via NeRF training. As a result, we are able to produce an edited 3D scene that faithfully aligns to input text instructions. Furthermore, to ensure semantic consistency across different viewpoints, we propose a novel multi-view semantic consistency loss that extracts a latent semantic embedding from the input view as a prior, and aim to reconstruct it in different views. Our proposed method has been shown to effectively edit real-world 3D scenes, resulting in 25% improvement in the alignment of the performed 3D edits with text instructions compared to prior work. This paper proposes Edit-DiffNeRF, a framework for editing pretrained NeRF scenes using text instructions by manipulating the latent space of frozen, pretrained diffusion models. Existing methods for editing NeRFs with text instructions often result in inconsistencies across views and struggle to faithfully apply edits to the 3D scene. Edit-DiffNeRF uses a delta module to learn edits in the latent space of a frozen diffusion model, guided by text instructions. A multi-view semantic consistency loss ensures consistent edits across different viewpoints during NeRF training. Edit-DiffNeRF achieves 25% better alignment of 3D edits with text instructions compared to prior work. The method exhibits improved CLIP Direction Consistency, indicating better temporal stability of edits across multiple views. Edit-DiffNeRF maintains high visual fidelity after editing, as evidenced by FID scores comparable to pre-edit scenes. The model's performance depends on the quality of the initial NeRF reconstruction and the diffusion model's generalization ability. Editing results can be negatively impacted by low-resolution or blurry input images. neural radiance fields (nerfs), diffusion models, text-to-3d generation, 3d scene editing, multi-view consistency
2306.09349 Report UrbanIR: Large-Scale Urban Scene Inverse Rendering from a Single Video Zhi-Hao Lin, Bohan Liu, Yi-Ting Chen, David Forsyth, Jia-Bin Huang, Anand Bhattad, Shenlong Wang We show how to build a model that allows realistic, free-viewpoint renderings of a scene under novel lighting conditions from video. Our method, UrbanIR (Urban Scene Inverse Rendering), computes an inverse graphics representation from the video. UrbanIR jointly infers shape, albedo, visibility, and sun and sky illumination from a single video of unbounded outdoor scenes with unknown lighting. UrbanIR uses videos from cameras mounted on cars (in contrast to many views of the same points in typical NeRF-style estimation). As a result, standard methods produce poor geometry estimates (for example, roofs), and there are numerous "floaters". Errors in inverse graphics inference can result in strong rendering artifacts. UrbanIR uses novel losses to control these and other sources of error. UrbanIR uses a novel loss to make very good estimates of shadow volumes in the original scene. The resulting representations facilitate controllable editing, delivering photorealistic free-viewpoint renderings of relit scenes and inserted objects. Qualitative evaluation demonstrates strong improvements over the state-of-the-art. UrbanIR enables realistic, free-viewpoint renderings of large-scale urban scenes under novel lighting conditions from a single video. Existing methods struggle with poor geometry estimates and rendering artifacts when applied to unbounded outdoor scenes from car-mounted cameras. UrbanIR combines monocular intrinsic decomposition and inverse rendering with a neural scene model. It uses novel losses to ensure consistency between scene geometry, detected shadows, and deshadowed images. Significantly improved geometry estimates and shadow rendering compared to baselines. Enables realistic relighting effects, including changes in sun position and nighttime simulations. Facilitates accurate object insertion with realistic shadow casting. Relies on multiple 2D priors during optimization, leading to occasional shadow removal imperfections. Large changes in sun direction can lead to inaccurate shadows due to limitations in geometry refinement. inverse rendering, neural rendering, scene relighting, shadow modeling, urban scenes
2306.09344 Report DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, Phillip Isola Current perceptual similarity metrics operate at the level of pixels and patches. These metrics compare images in terms of their low-level colors and textures, but fail to capture mid-level similarities and differences in image layout, object pose, and semantic content. In this paper, we develop a perceptual metric that assesses images holistically. Our first step is to collect a new dataset of human similarity judgments over image pairs that are alike in diverse ways. Critical to this dataset is that judgments are nearly automatic and shared by all observers. To achieve this we use recent text-to-image models to create synthetic pairs that are perturbed along various dimensions. We observe that popular perceptual metrics fall short of explaining our new data, and we introduce a new metric, DreamSim, tuned to better align with human perception. We analyze how our metric is affected by different visual attributes, and find that it focuses heavily on foreground objects and semantic content while also being sensitive to color and layout. Notably, despite being trained on synthetic data, our metric generalizes to real images, giving strong results on retrieval and reconstruction tasks. Furthermore, our metric outperforms both prior learned metrics and recent large vision models on these tasks. This paper introduces a new perceptual metric, DreamSim, trained on a novel dataset of synthetic image triplets (NIGHTS), designed to capture mid-level visual similarities. Existing perceptual similarity metrics fail to capture mid-level similarities like object pose, layout, and semantic content, which are crucial for human perception. The authors collect human similarity judgments on synthetic image triplets generated with Stable Diffusion, ensuring cognitive impenetrability. They then train DreamSim by ensembling and fine-tuning large vision models (DINO, CLIP, OpenCLIP) on NIGHTS. DreamSim achieves high agreement with human judgments on NIGHTS (96.16%) and generalizes well to real images, outperforming existing metrics in retrieval and reconstruction tasks. Analysis reveals DreamSim's sensitivity to foreground objects, color, and layout, surpassing prior metrics in capturing mid-level similarities. Despite being trained on synthetic data, DreamSim demonstrates improved performance on low-level similarity benchmarks (BAPPS, TID2013, KADID-10k) compared to base models. The dataset predominantly focuses on object-centric domains, limiting the generalizability of DreamSim to other aspects of human similarity perception. The model might inherit biases from the pretrained backbones (Stable Diffusion, CLIP, OpenCLIP, DINO) and the generative process. perceptual similarity, metric learning, synthetic data, image retrieval, feature inversion
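As a rough illustration of an ensemble-and-compare metric like DreamSim, the sketch below concatenates L2-normalized backbone embeddings and scores a triplet by cosine distance; the real model additionally fine-tunes the backbones on the NIGHTS human judgments, which is omitted here, and all feature dimensions are placeholders.

```python
# Minimal sketch of a DreamSim-style distance: concatenate L2-normalized
# embeddings from several backbones (e.g. DINO/CLIP/OpenCLIP) and compare
# images by cosine distance; a triplet is decided 2AFC-style by which
# candidate is closer to the reference. Fine-tuning on human judgments is
# omitted; the random vectors stand in for backbone features.
import numpy as np

def ensemble_embed(per_model_feats: list) -> np.ndarray:
    """Normalize each backbone's feature, then concatenate."""
    normed = [f / (np.linalg.norm(f) + 1e-8) for f in per_model_feats]
    return np.concatenate(normed)

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def two_afc(ref, img_a, img_b) -> str:
    """Return which of the two candidates is perceptually closer to ref."""
    return "A" if cosine_distance(ref, img_a) < cosine_distance(ref, img_b) else "B"

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = ensemble_embed([rng.normal(size=768), rng.normal(size=512)])
    a = ensemble_embed([rng.normal(size=768), rng.normal(size=512)])
    b = ref + 0.05 * ensemble_embed([rng.normal(size=768), rng.normal(size=512)])
    print(two_afc(ref, a, b))  # "B": the slightly perturbed copy of ref is closer
```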
2306.09341 Report Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, Hongsheng Li Recent text-to-image generative models can generate high-fidelity images from text inputs, but the quality of these generated images cannot be accurately evaluated by existing evaluation metrics. To address this issue, we introduce Human Preference Dataset v2 (HPD v2), a large-scale dataset that captures human preferences on images from a wide range of sources. HPD v2 comprises 798,090 human preference choices on 433,760 pairs of images, making it the largest dataset of its kind. The text prompts and images are deliberately collected to eliminate potential bias, which is a common issue in previous datasets. By fine-tuning CLIP on HPD v2, we obtain Human Preference Score v2 (HPS v2), a scoring model that can more accurately predict human preferences on generated images. Our experiments demonstrate that HPS v2 generalizes better than previous metrics across various image distributions and is responsive to algorithmic improvements of text-to-image generative models, making it a preferable evaluation metric for these models. We also investigate the design of the evaluation prompts for text-to-image generative models, to make the evaluation stable, fair and easy-to-use. Finally, we establish a benchmark for text-to-image generative models using HPS v2, which includes a set of recent text-to-image models from academia, the community, and industry. The code and dataset are available at https://github.com/tgxs002/HPSv2. This paper introduces Human Preference Dataset v2 (HPD v2), a large-scale dataset for evaluating human preferences in text-to-image generation, and Human Preference Score v2 (HPS v2), a model fine-tuned on HPD v2 to predict human preferences. Existing evaluation metrics fail to accurately assess the quality of images generated by text-to-image models, necessitating a method aligned with human perception. HPD v2 is built by collecting prompts, generating images from various models, and gathering human preference annotations. HPS v2 is then trained by fine-tuning a CLIP model on HPD v2. HPD v2 is larger and less biased than previous datasets, containing 798k human preference comparisons. HPS v2 outperforms previous preference prediction models, achieving 83.3% accuracy on the HPD v2 test set. The authors establish a benchmark for text-to-image models using HPS v2, comparing models across different styles. The prompts and images in HPD v2 are sourced from specific databases (DiffusionDB, COCO Captions) which may not cover all aspects of image generation. The use of ChatGPT for prompt cleaning, while mitigating bias, could potentially introduce new biases. text-to-image generation, human preference, evaluation metrics, benchmark, clip
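A hedged sketch of the two building blocks described above: a CLIP-style preference score (cosine similarity between image and prompt embeddings) and a pairwise softmax cross-entropy that rewards scoring the human-preferred image higher. Embedding sizes and the temperature are illustrative, not the values used for HPS v2.

```python
# Sketch of a CLIP-style preference score and a pairwise training signal:
# the score is the cosine similarity between image and prompt embeddings, and
# a pair of images for the same prompt is supervised with a softmax
# cross-entropy so the human-preferred image gets the higher score.
# Temperature and embedding sizes are illustrative placeholders.
import numpy as np

def preference_score(img_emb: np.ndarray, txt_emb: np.ndarray) -> float:
    img = img_emb / np.linalg.norm(img_emb)
    txt = txt_emb / np.linalg.norm(txt_emb)
    return float(img @ txt)

def pairwise_preference_loss(score_pref: float, score_other: float,
                             temperature: float = 0.01) -> float:
    """Cross-entropy that pushes the preferred image's score above the other's."""
    logits = np.array([score_pref, score_other]) / temperature
    log_probs = logits - np.log(np.sum(np.exp(logits - logits.max()))) - logits.max()
    return -log_probs[0]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    txt = rng.normal(size=512)
    good, bad = txt + 0.1 * rng.normal(size=512), rng.normal(size=512)
    s_good, s_bad = preference_score(good, txt), preference_score(bad, txt)
    print(round(pairwise_preference_loss(s_good, s_bad), 4))  # near 0: pair ranked correctly
```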
2306.09329 Report DreamHuman: Animatable 3D Avatars from Text Nikos Kolotouros, Thiemo Alldieck, Andrei Zanfir, Eduard Gabriel Bazavan, Mihai Fieraru, Cristian Sminchisescu We present DreamHuman, a method to generate realistic animatable 3D human avatar models solely from textual descriptions. Recent text-to-3D methods have made considerable strides in generation, but are still lacking in important aspects. Control and often spatial resolution remain limited, existing methods produce fixed rather than animated 3D human models, and anthropometric consistency for complex structures like people remains a challenge. DreamHuman connects large text-to-image synthesis models, neural radiance fields, and statistical human body models in a novel modeling and optimization framework. This makes it possible to generate dynamic 3D human avatars with high-quality textures and learned, instance-specific, surface deformations. We demonstrate that our method is capable to generate a wide variety of animatable, realistic 3D human models from text. Our 3D models have diverse appearance, clothing, skin tones and body shapes, and significantly outperform both generic text-to-3D approaches and previous text-based 3D avatar generators in visual fidelity. For more results and animations please check our website at https://dream-human.github.io. Presents DreamHuman, a method to generate realistic, animatable 3D human avatars solely from textual descriptions, by combining large text-to-image models, neural radiance fields, and statistical human body models. Existing text-to-3D methods lack control, spatial resolution, animation capabilities, and anthropometric consistency, particularly for complex structures like human bodies. DreamHuman addresses these limitations. Combines text-to-image diffusion models, neural radiance fields (using mip-NeRF 360 architecture), and the imGHUM statistical human body model. Employs a novel modeling and optimization framework with semantic zooming and refining prompts for detail, and incorporates multiple losses to ensure quality in structure, appearance, and deformation. Generates high-quality, animatable 3D human avatars with diverse appearances, clothing, skin tones, and body shapes from text prompts. Learns instance-specific, pose-dependent geometric deformations, enabling realistic clothing representation, including loose garments. Outperforms generic text-to-3D approaches and previous text-based 3D avatar generators in visual fidelity, as demonstrated by qualitative comparisons and CLIP-based evaluation. Fine details like wrinkles are sometimes drawn using the albedo map instead of geometry due to the lack of 3D training data. Occasional disentanglement issues between albedo and shading can result in baked reflections and shadows. text-to-3d, 3d human avatar generation, neural radiance fields, diffusion models, human body model
2306.09316 Report Diffusion Models for Zero-Shot Open-Vocabulary Segmentation Laurynas Karazija, Iro Laina, Andrea Vedaldi, Christian Rupprecht The variety of objects in the real world is nearly unlimited and is thus impossible to capture using models trained on a fixed set of categories. As a result, in recent years, open-vocabulary methods have attracted the interest of the community. This paper proposes a new method for zero-shot open-vocabulary segmentation. Prior work largely relies on contrastive training using image-text pairs, leveraging grouping mechanisms to learn image features that are both aligned with language and well-localised. This however can introduce ambiguity as the visual appearance of images with similar captions often varies. Instead, we leverage the generative properties of large-scale text-to-image diffusion models to sample a set of support images for a given textual category. This provides a distribution of appearances for a given text circumventing the ambiguity problem. We further propose a mechanism that considers the contextual background of the sampled images to better localise objects and segment the background directly. We show that our method can be used to ground several existing pre-trained self-supervised feature extractors in natural language and provide explainable predictions by mapping back to regions in the support set. Our proposal is training-free, relying on pre-trained components only, yet, shows strong performance on a range of open-vocabulary segmentation benchmarks, obtaining a lead of more than 10% on the Pascal VOC benchmark. This paper introduces OVdiff, a training-free method for zero-shot open-vocabulary segmentation that leverages text-to-image diffusion models to generate visual prototypes for grounding pre-trained feature extractors. Existing open-vocabulary segmentation methods rely on extensive training with image-text pairs or labeled datasets, which can introduce ambiguity and limit scalability to new categories. OVdiff overcomes these limitations by utilizing the generative power of diffusion models and pre-trained feature extractors. OVdiff samples a support set of images for a given textual category using a text-to-image diffusion model. It then extracts visual prototypes at class, instance, and part levels from these images using an off-the-shelf feature extractor. These prototypes are used in a nearest-neighbor lookup scheme for segmenting any image. OVdiff achieves state-of-the-art performance on open-vocabulary segmentation benchmarks, outperforming existing methods by a significant margin. The method effectively leverages contextual priors by encoding background prototypes, leading to improved object localization and boundary delineation. OVdiff provides a degree of explainability by mapping back segmentation decisions to specific regions in the support set images. The resolution of segmentation masks is limited by the resolution of the employed feature extractor. Sampling support images for a large number of categories can be computationally expensive, although this cost can be amortized over multiple images. open-vocabulary segmentation, zero-shot learning, diffusion models, feature grounding, explainable ai
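The nearest-prototype lookup at the heart of this kind of approach can be sketched in a few lines; in OVdiff the prototypes would come from features of diffusion-sampled support images (including background prototypes), whereas here they are random placeholders.

```python
# Simplified sketch of prototype-based open-vocabulary segmentation: class
# prototypes would normally be extracted from support images sampled by a
# text-to-image diffusion model (plus background prototypes); each pixel
# feature is then assigned to its nearest prototype by cosine similarity.
# Feature dimensions and the random inputs below are placeholders.
import numpy as np

def assign_pixels(pixel_feats: np.ndarray, prototypes: np.ndarray) -> np.ndarray:
    """pixel_feats: (H, W, D); prototypes: (C, D). Returns (H, W) class ids."""
    p = pixel_feats / (np.linalg.norm(pixel_feats, axis=-1, keepdims=True) + 1e-8)
    c = prototypes / (np.linalg.norm(prototypes, axis=-1, keepdims=True) + 1e-8)
    sims = np.einsum("hwd,cd->hwc", p, c)   # cosine similarity to each prototype
    return sims.argmax(axis=-1)             # nearest-prototype label per pixel

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    protos = rng.normal(size=(3, 64))       # e.g. "cat", "dog", "background"
    feats = rng.normal(size=(4, 4, 64))
    feats[:2] += protos[0]                  # top rows made to resemble class 0
    print(assign_pixels(feats, protos))
```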
2306.09305 Report Fast Training of Diffusion Models with Masked Transformers Hongkai Zheng, Weili Nie, Arash Vahdat, Anima Anandkumar We propose an efficient approach to train large diffusion models with masked transformers. While masked transformers have been extensively explored for representation learning, their application to generative learning is less explored in the vision domain. Our work is the first to exploit masked training to reduce the training cost of diffusion models significantly. Specifically, we randomly mask out a high proportion (e.g., 50%) of patches in diffused input images during training. For masked training, we introduce an asymmetric encoder-decoder architecture consisting of a transformer encoder that operates only on unmasked patches and a lightweight transformer decoder on full patches. To promote a long-range understanding of full patches, we add an auxiliary task of reconstructing masked patches to the denoising score matching objective that learns the score of unmasked patches. Experiments on ImageNet-256x256 and ImageNet-512x512 show that our approach achieves competitive and even better generative performance than the state-of-the-art Diffusion Transformer (DiT) model, using only around 30% of its original training time. Thus, our method shows a promising way of efficiently training large transformer-based diffusion models without sacrificing the generative performance. This paper introduces MaskDiT, a novel approach for training diffusion models efficiently using masked transformers. Training large diffusion models is computationally expensive. This work aims to significantly reduce the training cost without compromising image generation quality. The authors propose an asymmetric encoder-decoder architecture where the encoder processes only unmasked patches while the lightweight decoder handles all patches. They also introduce a new training objective combining denoising score matching on unmasked tokens and an auxiliary masked patch reconstruction task. MaskDiT achieves competitive image generation quality compared to state-of-the-art models on ImageNet 256x256 and 512x512 benchmarks. The method significantly reduces training time and memory consumption compared to previous transformer-based diffusion models (DiT and MDT). An ablation study reveals that the success of MaskDiT comes from the interplay of image masking, the asymmetric architecture, and the dual training objective. The current method requires a few steps of unmasking tuning to achieve the best FID scores with classifier-free guidance. Future work could focus on improving unconditional image generation performance with masked training. diffusion models, masked image modeling, transformers, image generation, efficient training
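A rough sketch of the masked-training objective described above, assuming a 50% mask ratio and an MSE-style reconstruction term; the encoder and decoder are replaced by trivial stand-ins so only the masking and loss bookkeeping are shown.

```python
# Rough sketch of masked diffusion training: hide a high fraction of patch
# tokens from the encoder, and combine denoising score matching on the
# unmasked tokens with an auxiliary reconstruction loss on the masked ones.
# The networks are stand-ins (identity / zeros); only the masking and loss
# bookkeeping are illustrated, not MaskDiT's actual architecture.
import numpy as np

def random_mask(num_tokens: int, mask_ratio: float, rng) -> np.ndarray:
    """Boolean mask: True = token is masked (hidden from the encoder)."""
    ids = rng.permutation(num_tokens)
    mask = np.zeros(num_tokens, dtype=bool)
    mask[ids[: int(num_tokens * mask_ratio)]] = True
    return mask

def masked_training_loss(noisy_tokens, noise, clean_tokens, mask, lam=0.1):
    visible = ~mask
    pred_noise = noisy_tokens[visible]              # stand-in for denoiser output
    pred_recon = np.zeros_like(clean_tokens[mask])  # stand-in for decoder output
    dsm = np.mean((pred_noise - noise[visible]) ** 2)        # score matching, unmasked
    recon = np.mean((pred_recon - clean_tokens[mask]) ** 2)  # MAE-style, masked
    return dsm + lam * recon

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D = 256, 32                                  # 256 patch tokens, dim 32
    clean = rng.normal(size=(T, D))
    noise = rng.normal(size=(T, D))
    noisy = clean + noise
    mask = random_mask(T, mask_ratio=0.5, rng=rng)
    print(round(masked_training_loss(noisy, noise, clean, mask), 3))
```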
2306.09117 Report UniOcc: Unifying Vision-Centric 3D Occupancy Prediction with Geometric and Semantic Rendering Mingjie Pan, Li Liu, Jiaming Liu, Peixiang Huang, Longlong Wang, Shanghang Zhang, Shaoqing Xu, Zhiyi Lai, Kuiyuan Yang In this technical report, we present our solution, named UniOcc, for the Vision-Centric 3D occupancy prediction track in the nuScenes Open Dataset Challenge at CVPR 2023. Existing methods for occupancy prediction primarily focus on optimizing projected features on 3D volume space using 3D occupancy labels. However, the generation process of these labels is complex and expensive (relying on 3D semantic annotations), and limited by voxel resolution, they cannot provide fine-grained spatial semantics. To address this limitation, we propose a novel Unifying Occupancy (UniOcc) prediction method, explicitly imposing a spatial geometry constraint and complementing fine-grained semantic supervision through volume ray rendering. Our method significantly enhances model performance and demonstrates promising potential in reducing human annotation costs. Given the laborious nature of annotating 3D occupancy, we further introduce a Depth-aware Teacher Student (DTS) framework to enhance prediction accuracy using unlabeled data. Our solution achieves 51.27% mIoU on the official leaderboard with a single model, placing 3rd in this challenge. Introduces UniOcc, a novel method for unifying 2D and 3D representation supervision in multi-camera occupancy prediction by leveraging volume rendering to generate 2D semantic and depth maps for fine-grained supervision. Addresses the limitations of existing 3D occupancy prediction methods that rely on expensive and complex 3D annotations by utilizing readily available 2D annotations and potentially reducing annotation costs. Employs volume rendering to generate 2D semantic and depth maps from 3D occupancy predictions, enabling fine-grained supervision with 2D pixels and enforcing geometric and semantic consistency through explicit occlusion relationships. Achieves comparable performance to methods using 3D labels without relying on them, highlighting the potential to reduce annotation costs. Integrating temporal frames as supplementary perspectives significantly enhances rendering supervision by considering occlusion relationships between voxels. The proposed Depth-aware Teacher Student (DTS) framework, utilizing unlabeled data and LiDAR information, effectively improves prediction accuracy. Limited overlap between surrounding cameras hinders multi-view consistency in rendering supervision. Reliance on visibility masks during training, while improving evaluation metrics, may lead to overlooking occluded areas and affect visualization quality. occupancy prediction, volume rendering, autonomous driving, semi-supervised learning, multi-view consistency
2306.09109 Report NAVI: Category-Agnostic Image Collections with High-Quality 3D Shape and Pose Annotations Varun Jampani, Kevis-Kokitsi Maninis, Andreas Engelhardt, Arjun Karpur, Karen Truong, Kyle Sargent, Stefan Popov, André Araujo, Ricardo Martin-Brualla, Kaushal Patel, Daniel Vlasic, Vittorio Ferrari, Ameesh Makadia, Ce Liu, Yuanzhen Li, Howard Zhou Recent advances in neural reconstruction enable high-quality 3D object reconstruction from casually captured image collections. Current techniques mostly analyze their progress on relatively simple image collections where Structure-from-Motion (SfM) techniques can provide ground-truth (GT) camera poses. We note that SfM techniques tend to fail on in-the-wild image collections such as image search results with varying backgrounds and illuminations. To enable systematic research progress on 3D reconstruction from casual image captures, we propose NAVI: a new dataset of category-agnostic image collections of objects with high-quality 3D scans along with per-image 2D-3D alignments providing near-perfect GT camera parameters. These 2D-3D alignments allow us to extract accurate derivative annotations such as dense pixel correspondences, depth and segmentation maps. We demonstrate the use of NAVI image collections on different problem settings and show that NAVI enables more thorough evaluations that were not possible with existing datasets. We believe NAVI is beneficial for systematic research progress on 3D reconstruction and correspondence estimation. Project page: https://navidataset.github.io The paper introduces NAVI, a novel dataset of image collections featuring precise 2D-3D alignments with high-quality 3D scans, aimed at advancing 3D reconstruction from casual captures, including in-the-wild scenarios. Existing datasets often rely on SfM for camera poses, limiting capture setups and hindering research on in-the-wild reconstruction where SfM struggles. NAVI addresses this by providing accurate ground truth for challenging scenarios. NAVI uses high-quality 3D scanners for ground truth shapes and employs a rigorous manual 2D-3D alignment process with interactive tools and expert verification, ensuring near-perfect annotations. NAVI's ground truth poses significantly improve multiview reconstruction quality compared to using SfM (COLMAP). For in-the-wild reconstruction, NAVI enables camera pose analysis, revealing performance differences among techniques (SAMURAI, NeRS, NeROIC) under varying noise levels. NAVI's dense correspondence annotations highlight the limitations of existing methods in achieving comprehensive coverage, particularly in in-the-wild scenarios. NAVI's main limitation is its scale, consisting of 36 objects and ~10K images due to the meticulous annotation process. Future work involves expanding NAVI to include video sequences. 3d reconstruction, dataset, in-the-wild, correspondence estimation, camera pose estimation
2306.08904 Report Enhancing Neural Rendering Methods with Image Augmentations Juan C. Pérez, Sara Rojas, Jesus Zarzar, Bernard Ghanem Faithfully reconstructing 3D geometry and generating novel views of scenes are critical tasks in 3D computer vision. Despite the widespread use of image augmentations across computer vision applications, their potential remains underexplored when learning neural rendering methods (NRMs) for 3D scenes. This paper presents a comprehensive analysis of the use of image augmentations in NRMs, where we explore different augmentation strategies. We found that introducing image augmentations during training presents challenges such as geometric and photometric inconsistencies for learning NRMs from images. Specifically, geometric inconsistencies arise from alterations in shapes, positions, and orientations from the augmentations, disrupting spatial cues necessary for accurate 3D reconstruction. On the other hand, photometric inconsistencies arise from changes in pixel intensities introduced by the augmentations, affecting the ability to capture the underlying 3D structures of the scene. We alleviate these issues by focusing on color manipulations and introducing learnable appearance embeddings that allow NRMs to explain away photometric variations. Our experiments demonstrate the benefits of incorporating augmentations when learning NRMs, including improved photometric quality and surface reconstruction, as well as enhanced robustness against data quality issues, such as reduced training data and image degradations. This paper presents a comprehensive analysis of the use of image augmentations in neural rendering methods for 3D scenes, focusing on addressing challenges and benefits for both static and dynamic augmentation strategies. Image augmentations are widely used in computer vision but their potential in neural rendering is underexplored. This work investigates how to effectively incorporate them and analyzes their impact on performance and robustness. The authors propose two methods: Static Image Augmentations (SIA) and Dynamic Image Augmentations (DIA). They address geometric inconsistencies by using color manipulations and photometric inconsistencies by introducing learnable appearance embeddings. Experiments are conducted on NeRF, NGP (for photometric quality) and NeuS (for surface reconstruction) using Blender and DTU datasets, respectively. Both SIA and DIA, particularly SIA, improve photometric quality (PSNR, SSIM, LPIPS) on Blender dataset. SIA consistently outperforms other setups in surface reconstruction quality (Chamfer distance) on DTU dataset. SIA enhances robustness against reduced training data and image degradations for both photometric and geometric quality. The focus on color manipulations as augmentations might limit diversity and robustness. Reliance on geometry-preserving augmentations restricts applicability to complex transformations involving shape or viewpoint changes. neural rendering, image augmentation, 3d scene reconstruction, novel view synthesis, data augmentation
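The learnable appearance embedding idea can be sketched as a per-image latent code concatenated to the color-branch input, so the model can explain away per-image photometric augmentations; the tiny MLP and all dimensions below are placeholders rather than the paper's architecture.

```python
# Minimal sketch of a per-image "learnable appearance embedding": each
# training image i gets a small latent code that is concatenated to the
# colour-branch input of a neural rendering model. The two-layer MLP and all
# sizes below are placeholders, not the paper's actual architecture.
import numpy as np

rng = np.random.default_rng(0)
NUM_IMAGES, APP_DIM, FEAT_DIM, HIDDEN = 100, 8, 32, 64

appearance_codes = 0.01 * rng.normal(size=(NUM_IMAGES, APP_DIM))  # learnable
W1 = rng.normal(size=(FEAT_DIM + APP_DIM, HIDDEN)) * 0.1
W2 = rng.normal(size=(HIDDEN, 3)) * 0.1

def predict_color(point_features: np.ndarray, image_index: int) -> np.ndarray:
    """point_features: (N, FEAT_DIM) features of sampled points along rays."""
    code = np.broadcast_to(appearance_codes[image_index],
                           (point_features.shape[0], APP_DIM))
    x = np.concatenate([point_features, code], axis=-1)
    h = np.maximum(x @ W1, 0.0)                 # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ W2)))      # RGB in [0, 1]

print(predict_color(rng.normal(size=(4, FEAT_DIM)), image_index=7).shape)  # (4, 3)
```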
2306.08768 Report Generalizable One-shot Neural Head Avatar Xueting Li, Shalini De Mello, Sifei Liu, Koki Nagano, Umar Iqbal, Jan Kautz We present a method that reconstructs and animates a 3D head avatar from a single-view portrait image. Existing methods either involve time-consuming optimization for a specific person with multiple images, or they struggle to synthesize intricate appearance details beyond the facial region. To address these limitations, we propose a framework that not only generalizes to unseen identities based on a single-view image without requiring person-specific optimization, but also captures characteristic details within and beyond the face area (e.g. hairstyle, accessories, etc.). At the core of our method are three branches that produce three tri-planes representing the coarse 3D geometry, detailed appearance of a source image, as well as the expression of a target image. By applying volumetric rendering to the combination of the three tri-planes followed by a super-resolution module, our method yields a high fidelity image of the desired identity, expression and pose. Once trained, our model enables efficient 3D head avatar reconstruction and animation via a single forward pass through a network. Experiments show that the proposed approach generalizes well to unseen validation datasets, surpassing SOTA baseline methods by a large margin on head avatar reconstruction and animation. This paper proposes a novel framework for reconstructing and animating 3D head avatars from single-view portrait images, capturing intricate details while generalizing to unseen identities without test-time optimization. Existing methods for head avatar animation are either inefficient, requiring per-person optimization, or lack fidelity in synthesizing detailed appearances beyond the face. This work addresses these limitations with a practical and efficient solution for high-quality avatar creation. The framework uses three branches: 1) a canonical branch reconstructs coarse 3D geometry with a neutral expression, 2) an appearance branch captures detailed texture by mapping image pixels to the canonical 3D space, and 3) an expression branch modifies the reconstruction to match the target expression using a 3DMM rendering. A super-resolution module enhances the final output. The method achieves state-of-the-art performance on 3D portrait reconstruction, surpassing baselines on fidelity metrics. It exhibits superior performance in cross-identity reenactment, accurately transferring expressions and head poses while preserving identity and details. The framework is highly efficient, reconstructing and animating avatars with a single forward pass, significantly faster than optimization-based methods. The model currently struggles to accurately reconstruct teeth and pupils, often relying on hallucination which can lead to discrepancies with the source image. Future work includes addressing these limitations by developing mechanisms to better handle open/closed mouth and eye states during reconstruction. 3d head avatar, neural rendering, one-shot learning, facial animation, generative adversarial networks
2306.08757 Report InfoDiffusion: Representation Learning Using Information Maximizing Diffusion Models Yingheng Wang, Yair Schiff, Aaron Gokaslan, Weishen Pan, Fei Wang, Christopher De Sa, Volodymyr Kuleshov While diffusion models excel at generating high-quality samples, their latent variables typically lack semantic meaning and are not suitable for representation learning. Here, we propose InfoDiffusion, an algorithm that augments diffusion models with low-dimensional latent variables that capture high-level factors of variation in the data. InfoDiffusion relies on a learning objective regularized with the mutual information between observed and hidden variables, which improves latent space quality and prevents the latents from being ignored by expressive diffusion-based decoders. Empirically, we find that InfoDiffusion learns disentangled and human-interpretable latent representations that are competitive with state-of-the-art generative and contrastive methods, while retaining the high sample quality of diffusion models. Our method enables manipulating the attributes of generated images and has the potential to assist tasks that require exploring a learned latent space to generate quality samples, e.g., generative design. InfoDiffusion, an algorithm that augments diffusion models with low-dimensional latent variables to capture high-level factors of variation in the data. Diffusion models, while excellent at generating high-quality samples, typically lack semantic meaning in their latent variables, making them unsuitable for representation learning. InfoDiffusion uses variational inference and maximizes the mutual information between the observed data and the hidden variables. It also incorporates a prior regularization term to prevent the latent space from being ignored by the decoder. InfoDiffusion learns disentangled and human-interpretable latent representations. The latent representations are competitive with state-of-the-art generative and contrastive methods. InfoDiffusion retains the high sample quality of diffusion models. Investigating the impact of different divergence measures on the prior regularization term. Exploring alternative architectures for the encoder and decoder networks. diffusion models, representation learning, mutual information, variational inference, disentanglement
2306.08707 Report VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing Paul Couairon, Clément Rambour, Jean-Emmanuel Haugeard, Nicolas Thome Recently, diffusion-based generative models have achieved remarkable success for image generation and editing. However, existing diffusion-based video editing approaches lack the ability to offer precise control over generated content that maintains temporal consistency in long-term videos. On the other hand, atlas-based methods provide strong temporal consistency but are costly to edit a video and lack spatial control. In this work, we introduce VidEdit, a novel method for zero-shot text-based video editing that guarantees robust temporal and spatial consistency. In particular, we combine an atlas-based video representation with a pre-trained text-to-image diffusion model to provide a training-free and efficient video editing method, which by design fulfills temporal smoothness. To grant precise user control over generated content, we utilize conditional information extracted from off-the-shelf panoptic segmenters and edge detectors which guides the diffusion sampling process. This method ensures a fine spatial control on targeted regions while strictly preserving the structure of the original video. Our quantitative and qualitative experiments show that VidEdit outperforms state-of-the-art methods on the DAVIS dataset, regarding semantic faithfulness, image preservation, and temporal consistency metrics. With this framework, processing a single video only takes approximately one minute, and it can generate multiple compatible edits based on a unique text prompt. Project web-page at https://videdit.github.io VidEdit is a novel, lightweight, zero-shot text-based video editing method that leverages the power of pre-trained text-to-image diffusion models and the temporal consistency of atlas-based video representations. Existing diffusion-based video editing approaches struggle with precise control over generated content and maintaining temporal consistency in long videos. Atlas-based methods, while offering strong temporal consistency, are computationally expensive and lack spatial editing control. VidEdit bridges this gap by combining the strengths of both approaches. VidEdit decomposes a video into 2D atlas representations using Neural Layered Atlases (NLA). It then utilizes a pre-trained text-to-image diffusion model, guided by conditional information from a panoptic segmenter and edge detector, to perform spatially controlled edits on the atlases. These edits are then mapped back onto the original video frames, ensuring temporal consistency. VidEdit demonstrates superior performance compared to state-of-the-art video editing methods in terms of semantic faithfulness to the target text prompt, preservation of original video content, and temporal consistency. The method offers a significant speed-up in editing time, capable of processing a full video in approximately one minute. By leveraging the probabilistic nature of diffusion models, VidEdit enables the generation of diverse and creative edits from a single text prompt. The performance of VidEdit is reliant on the quality of the underlying atlas representations, which can be limited for videos with complex motions or long durations. Future work could focus on enhancing the robustness of atlas construction methods to broaden the applicability of VidEdit to a wider range of videos. video editing, text-driven editing, diffusion models, neural layered atlases, zero-shot learning
2306.08687 Report Norm-guided latent space exploration for text-to-image generation Dvir Samuel, Rami Ben-Ari, Nir Darshan, Haggai Maron, Gal Chechik Text-to-image diffusion models show great potential in synthesizing a large variety of concepts in new compositions and scenarios. However, the latent space of initial seeds is still not well understood and its structure was shown to impact the generation of various concepts. Specifically, simple operations like interpolation and finding the centroid of a set of seeds perform poorly when using standard Euclidean or spherical metrics in the latent space. This paper makes the observation that, in current training procedures, diffusion models observed inputs with a narrow range of norm values. This has strong implications for methods that rely on seed manipulation for image generation, with applications to few-shot and long-tail learning tasks. To address this issue, we propose a novel method for interpolating between two seeds and demonstrate that it defines a new non-Euclidean metric that takes into account a norm-based prior on seeds. We describe a simple yet efficient algorithm for approximating this interpolation procedure and use it to further define centroids in the latent seed space. We show that our new interpolation and centroid techniques significantly enhance the generation of rare concept images. This further leads to state-of-the-art performance on few-shot and long-tail benchmarks, improving prior approaches in terms of generation speed, image quality, and semantic content. This paper proposes Norm-Aware Optimization (NAO), a novel method for interpolating between seeds and finding centroids in the latent space of text-to-image diffusion models, by leveraging a norm-based prior derived from the Chi distribution. Current diffusion models exhibit poor performance in latent space interpolation and centroid finding due to a training bias towards specific seed norm values. This limits their ability to generate rare concepts and perform well in few-shot and long-tail learning tasks. NAO defines a new distance metric based on the likelihood of a seed under a Chi distribution prior. It then leverages this metric to find optimal interpolation paths and centroids by minimizing the total distance between points in latent space. NAO generates higher-quality images with better semantic content compared to baseline interpolation and centroid methods. Using NAO for seed initialization significantly improves the performance of SeedSelect in rare concept generation, achieving state-of-the-art results on few-shot and long-tail learning benchmarks. NAO significantly reduces the runtime of SeedSelect by providing a better starting point for optimization. NAO involves an additional optimization step compared to standard interpolation and centroid calculation. While NAO improves seed initialization, it might still require further optimization using methods like SeedSelect for optimal results. diffusion models, latent space exploration, rare concept generation, few-shot learning, long-tail learning
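The norm observation underlying NAO is easy to verify numerically: i.i.d. Gaussian seeds concentrate around norm sqrt(d) (a Chi distribution), while their Euclidean midpoint falls to roughly sqrt(d/2), i.e. into a region the model never saw during training. The sketch below shows this plus a crude norm-rescaling fix; the actual NAO method instead optimizes an interpolation path under a Chi-based prior.

```python
# Illustration of the norm prior behind NAO: i.i.d. Gaussian seeds in a
# d-dimensional latent space have norms tightly concentrated around sqrt(d)
# (a Chi distribution), but the Euclidean midpoint of two seeds has a much
# smaller norm. The crude fix shown here rescales the interpolant back to the
# typical norm; NAO itself optimizes a path under a Chi-based prior instead.
import numpy as np

d = 4 * 64 * 64                                # e.g. a 4x64x64 latent seed
rng = np.random.default_rng(0)
a, b = rng.normal(size=d), rng.normal(size=d)

lerp_mid = 0.5 * (a + b)
norm_aware_mid = lerp_mid / np.linalg.norm(lerp_mid) * np.sqrt(d)

print(f"typical seed norm      ~ {np.sqrt(d):.1f}")
print(f"seed norms             : {np.linalg.norm(a):.1f}, {np.linalg.norm(b):.1f}")
print(f"Euclidean midpoint norm: {np.linalg.norm(lerp_mid):.1f}")   # roughly sqrt(d/2)
print(f"norm-aware midpoint    : {np.linalg.norm(norm_aware_mid):.1f}")
```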
2306.08659 Report Explore In-Context Learning for 3D Point Cloud Understanding Zhongbin Fang, Xiangtai Li, Xia Li, Joachim M. Buhmann, Chen Change Loy, Mengyuan Liu With the rise of large-scale models trained on broad data, in-context learning has become a new learning paradigm that has demonstrated significant potential in natural language processing and computer vision tasks. Meanwhile, in-context learning is still largely unexplored in the 3D point cloud domain. Although masked modeling has been successfully applied for in-context learning in 2D vision, directly extending it to 3D point clouds remains a formidable challenge. In the case of point clouds, the tokens themselves are the point cloud positions (coordinates) that are masked during inference. Moreover, position embedding in previous works may inadvertently introduce information leakage. To address these challenges, we introduce a novel framework, named Point-In-Context, designed especially for in-context learning in 3D point clouds, where both inputs and outputs are modeled as coordinates for each task. Additionally, we propose the Joint Sampling module, carefully designed to work in tandem with the general point sampling operator, effectively resolving the aforementioned technical issues. We conduct extensive experiments to validate the versatility and adaptability of our proposed methods in handling a wide range of tasks. This paper presents Point-In-Context (PIC), the first framework to explore in-context learning for 3D point cloud understanding. In-context learning, showing promise in NLP and 2D vision, remains unexplored for 3D point clouds. This work establishes a baseline for this novel research direction. The authors create a new benchmark dataset with four tasks: reconstruction, denoising, registration, and part segmentation. They propose PIC with a Joint Sampling module to address information leakage issues inherent in adapting existing methods. PIC achieves state-of-the-art performance on the benchmark, outperforming multitask models. The method generalizes to out-of-distribution data and unseen tasks. Prompt selection significantly impacts performance, suggesting future research directions. The model struggles to reconstruct fine details in complex point clouds. Future work includes exploring higher-quality prompts for improved performance. in-context learning, 3d point cloud, masked point modeling, joint sampling, prompt engineering
2306.08645 Report Training-free Diffusion Model Adaptation for Variable-Sized Text-to-Image Synthesis Zhiyu Jin, Xuli Shen, Bin Li, Xiangyang Xue Diffusion models (DMs) have recently gained attention with state-of-the-art performance in text-to-image synthesis. Abiding by the tradition in deep learning, DMs are trained and evaluated on images with fixed sizes. However, users demand images of various sizes and aspect ratios. This paper focuses on adapting text-to-image diffusion models to handle such variety while maintaining visual fidelity. First we observe that, during the synthesis, lower resolution images suffer from incomplete object portrayal, while higher resolution images exhibit repetitively disordered presentation. Next, we establish a statistical relationship indicating that attention entropy changes with token quantity, suggesting that models aggregate spatial information in proportion to image resolution. The subsequent interpretation of our observations is that objects are incompletely depicted due to limited spatial information for low resolutions, while repetitively disorganized presentation arises from redundant spatial information for high resolutions. From this perspective, we propose a scaling factor to alleviate the change of attention entropy and mitigate the defective pattern observed. Extensive experimental results validate the efficacy of the proposed scaling factor, enabling models to achieve better visual effects, image quality, and text alignment. Notably, these improvements are achieved without additional training or fine-tuning techniques. This paper proposes a novel scaling factor for visual attention layers in text-to-image diffusion models, enabling them to synthesize high-fidelity images of varying sizes without additional training. Existing diffusion models struggle to maintain visual fidelity when synthesizing images at resolutions different from their training resolution. This limits their practical application and requires costly training of specialized models. The authors establish a statistical relationship between attention entropy and token quantity, demonstrating that attention entropy changes proportionally to the logarithm of the token number. They then propose a scaling factor to mitigate these entropy fluctuations during image synthesis. The proposed scaling factor significantly improves FID scores across various resolutions, indicating improved image quality and diversity. It enhances the semantic alignment between generated images and text prompts, resulting in higher CLIP scores. Qualitative results demonstrate that the method alleviates issues of incomplete objects in low-resolution images and repetitive patterns in high-resolution images. The paper lacks a dedicated metric for evaluating image fidelity across different resolutions. Further investigation is needed to assess the generalizability of the proposed scaling factor to other diffusion-based models. diffusion models, text-to-image synthesis, attention mechanism, entropy, variable-sized image generation
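The entropy argument can be demonstrated with a toy attention computation: the entropy of a softmax attention row grows with the number of key tokens, and sharpening the logits counteracts it. The sqrt(log n_test / log n_train) adjustment below is an illustrative choice consistent with the paper's analysis, not necessarily its exact formula.

```python
# Toy demonstration of the entropy observation: softmax attention entropy
# grows with the number of key tokens, and a larger logit scale sharpens the
# softmax, pushing the entropy back down toward its training-time value. The
# adjustment sqrt(log(n_test) / log(n_train)) on top of 1/sqrt(d) is used
# here as an illustrative choice, not necessarily the paper's exact factor.
import numpy as np

def attention_entropy(q, k, scale):
    logits = (q @ k.T) * scale
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return float(np.mean(-np.sum(probs * np.log(probs + 1e-12), axis=-1)))

rng = np.random.default_rng(0)
d, n_train, n_test = 64, 1024, 4096            # e.g. 32x32 vs 64x64 latent tokens
q = rng.normal(size=(16, d))
k_train, k_test = rng.normal(size=(n_train, d)), rng.normal(size=(n_test, d))

base = 1.0 / np.sqrt(d)
adjusted = base * np.sqrt(np.log(n_test) / np.log(n_train))

print("train-res entropy   :", round(attention_entropy(q, k_train, base), 3))
print("test-res, base scale:", round(attention_entropy(q, k_test, base), 3))
print("test-res, adjusted  :", round(attention_entropy(q, k_test, adjusted), 3))
```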
2306.08637 Report TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, Andrew Zisserman We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence. Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations. The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS. Our model facilitates fast inference on long and high-resolution video sequences. On a modern GPU, our implementation has the capacity to track points faster than real-time, and can be flexibly extended to higher-resolution videos. Given the high-quality trajectories extracted from a large dataset, we demonstrate a proof-of-concept diffusion model which generates trajectories from static images, enabling plausible animations. Visualizations, source code, and pretrained models can be found on our project webpage. Presents TAPIR, a novel model for long-term point tracking that combines per-frame matching with temporal refinement, significantly improving performance on the TAP-Vid benchmark. Addresses limitations of prior methods in handling occlusions and leveraging temporal continuity for accurate and robust point tracking in videos. Combines a TAP-Net-like matching stage for robust initialization with a PIPs-inspired refinement stage using depthwise convolutional networks for efficient temporal smoothing. Achieves state-of-the-art results on the TAP-Vid benchmark, with a 10.6% absolute improvement on Kinetics and 19.3% on DAVIS over previous best methods. Demonstrates robust performance even on high-resolution videos by employing an image pyramid approach. Enables a proof-of-concept diffusion model for animating still images by generating plausible motion trajectories. Performance on the RGB-Stacking dataset, while improved, suggests further research is needed for tracking points on textureless objects. Exploring more sophisticated temporal integration methods beyond RNNs could further enhance performance. point tracking, video understanding, deep learning, computer vision, motion analysis
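A conceptual sketch of the two stages follows, with per-frame matching reduced to a correlation argmax and refinement reduced to a moving-average smoother; the real model uses learned iterative updates from local correlations, so this is only an illustration of the overall structure.

```python
# Conceptual sketch of a matching-then-refinement point tracker: (1) per-frame
# matching by taking the argmax of the correlation between the query-point
# feature and each frame's feature map, then (2) refinement, reduced here to a
# simple centered moving average over the trajectory. Shapes are placeholders.
import numpy as np

def per_frame_matching(query_feat, video_feats):
    """query_feat: (D,); video_feats: (T, H, W, D). Returns (T, 2) positions."""
    T, H, W, _ = video_feats.shape
    corr = np.einsum("thwd,d->thw", video_feats, query_feat)
    flat = corr.reshape(T, -1).argmax(axis=1)
    return np.stack([flat // W, flat % W], axis=1).astype(float)  # (row, col)

def temporal_refine(track, window=3):
    """Smooth the raw per-frame track with a centered moving average."""
    pad = window // 2
    padded = np.pad(track, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    return np.stack([np.convolve(padded[:, c], kernel, mode="valid")
                     for c in range(2)], axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, H, W, D = 8, 16, 16, 32
    video = rng.normal(size=(T, H, W, D))
    query = video[0, 5, 5].copy()            # track the point at (5, 5) in frame 0
    raw = per_frame_matching(query, video)
    print(raw[0], temporal_refine(raw)[0])   # frame-0 match recovers (5, 5)
```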
2306.08571 Report GenImage: A Million-Scale Benchmark for Detecting AI-Generated Image Mingjian Zhu, Hanting Chen, Qiangyu Yan, Xudong Huang, Guanyu Lin, Wei Li, Zhijun Tu, Hailin Hu, Jie Hu, Yunhe Wang The extraordinary ability of generative models to generate photographic images has intensified concerns about the spread of disinformation, thereby leading to the demand for detectors capable of distinguishing between AI-generated fake images and real images. However, the lack of large datasets containing images from the most advanced image generators poses an obstacle to the development of such detectors. In this paper, we introduce the GenImage dataset, which has the following advantages: 1) Plenty of Images, including over one million pairs of AI-generated fake images and collected real images. 2) Rich Image Content, encompassing a broad range of image classes. 3) State-of-the-art Generators, synthesizing images with advanced diffusion models and GANs. The aforementioned advantages allow the detectors trained on GenImage to undergo a thorough evaluation and demonstrate strong applicability to diverse images. We conduct a comprehensive analysis of the dataset and propose two tasks for evaluating the detection method in resembling real-world scenarios. The cross-generator image classification task measures the performance of a detector trained on one generator when tested on the others. The degraded image classification task assesses the capability of the detectors in handling degraded images such as low-resolution, blurred, and compressed images. With the GenImage dataset, researchers can effectively expedite the development and evaluation of superior AI-generated image detectors in comparison to prevailing methodologies. This paper introduces GenImage, a large-scale dataset designed for detecting fake images generated by both diffusion models and GANs. The proliferation of highly realistic AI-generated images necessitates robust detectors, and existing datasets are limited in scale, content diversity, or the use of advanced generators, hindering detector development. The authors generate over one million fake images across 1000 ImageNet classes using eight state-of-the-art diffusion models and GANs, paired with real ImageNet images. The dataset enables cross-generator image classification, showing that detectors struggle to generalize to unseen generators. Detectors are evaluated on degraded images (low resolution, compression, blur), revealing performance drops under real-world conditions. Analysis shows that diffusion models pose a greater challenge for detection than GANs due to fewer spectral artifacts. The study primarily focuses on ResNet-based detectors, leaving room to explore more specialized architectures. Future work can investigate the impact of different prompts and generation parameters on detector performance. ai-generated image detection, fake image detection, diffusion models, generative adversarial networks, dataset
2306.08498 Report Extending CLIP's Image-Text Alignment to Referring Image Segmentation Seoyeon Kim, Minguk Kang, Dongwon Kim, Jaesik Park, Suha Kwak Referring Image Segmentation (RIS) is a cross-modal task that aims to segment an instance described by a natural language expression. Recent methods leverage large-scale pretrained unimodal models as backbones along with fusion techniques for joint reasoning across modalities. However, the inherent cross-modal nature of RIS raises questions about the effectiveness of unimodal backbones. We propose RISCLIP, a novel framework that effectively leverages the cross-modal nature of CLIP for RIS. Observing CLIP's inherent alignment between image and text features, we capitalize on this starting point and introduce simple but strong modules that enhance unimodal feature extraction and leverage rich alignment knowledge in CLIP's image-text shared-embedding space. RISCLIP exhibits outstanding results on all three major RIS benchmarks and also outperforms previous CLIP-based methods, demonstrating the efficacy of our strategy in extending CLIP's image-text alignment to RIS. RISCLIP, a novel framework that leverages the cross-modal alignment capabilities of CLIP for Referring Image Segmentation (RIS). Existing RIS methods often rely on unimodal backbones and fusion techniques, but the inherent cross-modal nature of RIS suggests that a model like CLIP, pretrained on a massive dataset of image-text pairs, could be more effective. The method freezes the CLIP backbone and introduces three key components: (1) Adapters to refine CLIP features for segmentation; (2) Cross-modal Feature Extraction (CFE) modules to align image and text features at candidate regions; (3) Shared-space Knowledge Exploitation (SKE) modules to leverage the rich alignment knowledge in CLIP's shared embedding space for target discernment. Finally, a decoder transforms the patch-level grounding into a pixel-wise segmentation. RISCLIP achieves state-of-the-art performance on three major RIS benchmarks: RefCOCO, RefCOCO+, and RefCOCOg. The ablation study demonstrates that freezing CLIP and adapting its features with the proposed modules is crucial for optimal performance. The method excels in handling complex referring expressions, particularly on the challenging RefCOCOg dataset. RISCLIP currently exhibits limitations in recognizing alphanumeric characters and comprehending expressions that describe target objects based on the absence of specific attributes. Future work will explore the adaptation of other image-text alignment backbones like ALIGN and Florence to RIS. referring image segmentation, cross-modal learning, clip, image-text alignment, deep learning
2306.08276 Report TryOnDiffusion: A Tale of Two UNets Luyang Zhu, Dawei Yang, Tyler Zhu, Fitsum Reda, William Chan, Chitwan Saharia, Mohammad Norouzi, Ira Kemelmacher-Shlizerman Given two images depicting a person and a garment worn by another person, our goal is to generate a visualization of how the garment might look on the input person. A key challenge is to synthesize a photorealistic detail-preserving visualization of the garment, while warping the garment to accommodate a significant body pose and shape change across the subjects. Previous methods either focus on garment detail preservation without effective pose and shape variation, or allow try-on with the desired shape and pose but lack garment details. In this paper, we propose a diffusion-based architecture that unifies two UNets (referred to as Parallel-UNet), which allows us to preserve garment details and warp the garment for significant pose and body change in a single network. The key ideas behind Parallel-UNet include: 1) garment is warped implicitly via a cross attention mechanism, 2) garment warp and person blend happen as part of a unified process as opposed to a sequence of two separate tasks. Experimental results indicate that TryOnDiffusion achieves state-of-the-art performance both qualitatively and quantitatively. TryOnDiffusion, a diffusion-based model using a novel architecture called Parallel-UNet, synthesizes high-resolution (1024x1024) virtual try-on images, realistically warping garments onto target individuals while preserving intricate garment details even with significant pose and shape variations. Virtual try-on enhances online shopping experiences, but existing methods struggle to balance garment detail preservation with accurate warping across different body shapes and poses, particularly in high resolution. This work tackles this challenge, aiming to improve realism and detail in virtual try-on. The method uses cascaded diffusion models with a novel architecture called Parallel-UNet. Parallel-UNet consists of two sub-UNets, one handling the person and the other the garment. It implicitly warps the garment onto the target person using cross-attention between their features. The model is trained on a massive dataset of 4 million paired images and further enhanced by super-resolution diffusion models for high-quality output. TryOnDiffusion achieves state-of-the-art performance, quantitatively outperforming baselines like TryOnGAN, SDAFN, and HR-VITON in FID and KID metrics. Extensive user studies confirm TryOnDiffusion's superiority, with participants consistently ranking its results as the most realistic. The method excels in preserving garment details like patterns, text, and textures, even under challenging conditions of occlusion and pose variation, surpassing the capabilities of existing techniques. Limitations include potential garment leaking artifacts due to errors in preprocessing steps like segmentation and pose estimation, and challenges in fully representing individual identity using clothing-agnostic RGB. Future work will focus on addressing limitations, extending the model to full-body try-on, incorporating more complex backgrounds, and exploring its application to videos and general image editing. virtual try-on, diffusion models, image synthesis, deep learning, computer vision
2306.08257 Report On the Robustness of Latent Diffusion Models Jianping Zhang, Zhuoer Xu, Shiwen Cui, Changhua Meng, Weibin Wu, Michael R. Lyu Latent diffusion models achieve state-of-the-art performance on a variety of generative tasks, such as image synthesis and image editing. However, the robustness of latent diffusion models is not well studied. Previous works only focus on the adversarial attacks against the encoder or the output image under white-box settings, regardless of the denoising process. Therefore, in this paper, we aim to analyze the robustness of latent diffusion models more thoroughly. We first study the influence of the components inside latent diffusion models on their white-box robustness. In addition to white-box scenarios, we evaluate the black-box robustness of latent diffusion models via transfer attacks, where we consider both prompt-transfer and model-transfer settings and possible defense mechanisms. However, all these explorations need a comprehensive benchmark dataset, which is missing in the literature. Therefore, to facilitate the research of the robustness of latent diffusion models, we propose two automatic dataset construction pipelines for two kinds of image editing models and release the whole dataset. Our code and dataset are available at \url{https://github.com/jpzhang1810/LDM-Robustness}. This paper investigates the robustness of latent diffusion models, particularly in image editing, against adversarial attacks. Assessing the robustness of latent diffusion models is crucial for ensuring their reliable deployment in real-world applications, especially given their increasing use in image editing. The authors propose two automatic dataset construction pipelines for image variation and inpainting models. They then evaluate the models' robustness by launching adversarial attacks under both white-box and black-box settings, analyzing the effects of attacks on different model components. The denoising process, especially the Resnet module, is identified as the most vulnerable component in latent diffusion models. Instruct-pix2pix demonstrates greater robustness compared to standard stable diffusion models. Adversarial examples exhibit transferability across different prompts (prompt-transfer) and models (model-transfer), raising concerns about the vulnerability of newer diffusion model versions. The attacking strategy, which destroys all internal features of the target module in the denoising process, may not be optimal. Future work could explore attacking specific steps in the denoising process or developing more robust defense mechanisms. adversarial attacks, latent diffusion models, image editing, robustness, transfer attacks
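The white-box attacks on the denoising components can be pictured as a PGD-style perturbation that maximizes the distortion of a chosen module's activations. The sketch below assumes a hypothetical `get_target_features` hook (e.g. returning a ResNet block's output inside the UNet) and standard L_inf settings; it is not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def pgd_feature_attack(x, get_target_features, eps=8 / 255, alpha=2 / 255, steps=10):
    """Find a small image perturbation that pushes the target module's internal
    features as far as possible from their clean values."""
    with torch.no_grad():
        clean_feats = get_target_features(x)
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.mse_loss(get_target_features(x + delta), clean_feats)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()           # ascend on feature distortion
            delta.clamp_(-eps, eps)                      # stay inside the L_inf budget
            delta.copy_((x + delta).clamp(0, 1) - x)     # keep pixels in [0, 1]
        delta.grad.zero_()
    return (x + delta).detach()
```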
2306.08247 Report Diffusion in Diffusion: Cyclic One-Way Diffusion for Text-Vision-Conditioned Generation Ruoyu Wang, Yongqi Yang, Zhihao Qian, Ye Zhu, Yu Wu Originating from the diffusion phenomenon in physics that describes particle movement, the diffusion generative models inherit the characteristics of stochastic random walk in the data space along the denoising trajectory. However, the intrinsic mutual interference among image regions contradicts the need for practical downstream application scenarios where the preservation of low-level pixel information from given conditioning is desired (e.g., customization tasks like personalized generation and inpainting based on a user-provided single image). In this work, we investigate the diffusion (physics) in diffusion (machine learning) properties and propose our Cyclic One-Way Diffusion (COW) method to control the direction of diffusion phenomenon given a pre-trained frozen diffusion model for versatile customization application scenarios, where the low-level pixel information from the conditioning needs to be preserved. Notably, unlike most current methods that incorporate additional conditions by fine-tuning the base text-to-image diffusion model or learning auxiliary networks, our method provides a novel perspective to understand the task needs and is applicable to a wider range of customization scenarios in a learning-free manner. Extensive experiment results show that our proposed COW can achieve more flexible customization based on strict visual conditions in different application settings. Project page: https://wangruoyu02.github.io/cow.github.io/. This paper proposes Cyclic One-Way Diffusion (COW), a training-free method that controls information diffusion in pre-trained diffusion models for image customization. Existing methods for customizing diffusion models with visual conditions often rely on computationally expensive fine-tuning and struggle to balance fidelity to both visual and textual inputs. This work aims to address these limitations by controlling the direction of information diffusion during generation. COW employs three main components: (1) Seed Initialization, where the visual condition is embedded in a neutral background, (2) Cyclic One-Way Diffusion, where the visual condition's latent representation is gradually injected during generation to promote unidirectional information flow, and (3) Visual Condition Preservation, where the visual condition is re-introduced at a later stage to maintain fidelity. COW achieves high fidelity to both textual and visual conditions, outperforming baselines in quantitative metrics and human evaluations. The cyclic one-way diffusion strategy effectively propagates information from the visual condition while adapting to the textual prompt. COW is efficient, generating images in 6 seconds compared to minutes or more for fine-tuning based approaches. The model can struggle with extreme conflicts between visual and textual conditions. Future work could explore extending COW to handle multiple visual conditions more robustly. diffusion models, image generation, customization, training-free, visual conditioning
2306.08226 Report CLIPXPlore: Coupled CLIP and Shape Spaces for 3D Shape Exploration Jingyu Hu, Ka-Hei Hui, Zhengzhe Liu, Hao Zhang, Chi-Wing Fu This paper presents CLIPXPlore, a new framework that leverages a vision-language model to guide the exploration of the 3D shape space. Many recent methods have been developed to encode 3D shapes into a learned latent shape space to enable generative design and modeling. Yet, existing methods lack effective exploration mechanisms, despite the rich information. To this end, we propose to leverage CLIP, a powerful pre-trained vision-language model, to aid the shape-space exploration. Our idea is threefold. First, we couple the CLIP and shape spaces by generating paired CLIP and shape codes through sketch images and training a mapper network to connect the two spaces. Second, to explore the space around a given shape, we formulate a co-optimization strategy to search for the CLIP code that better matches the geometry of the shape. Third, we design three exploration modes, binary-attribute-guided, text-guided, and sketch-guided, to locate suitable exploration trajectories in shape space and induce meaningful changes to the shape. We perform a series of experiments to quantitatively and visually compare CLIPXPlore with different baselines in each of the three exploration modes, showing that CLIPXPlore can produce many meaningful exploration results that cannot be achieved by the existing solutions. CLIPXPlore, a framework that leverages the CLIP vision-language model to guide the exploration of a pre-trained 3D shape latent space. Existing shape exploration methods lack fine-grained semantic control and struggle to connect to user-friendly interfaces like language or sketching. The framework connects CLIP and shape spaces by training a mapper network on paired CLIP and shape codes generated from sketch images. It then co-optimizes these codes for accurate shape representation and provides three exploration modes: binary-attribute-guided, text-guided, and sketch-guided, to locate suitable exploration trajectories. CLIPXPlore produces meaningful shape variations based on different conditions. Quantitative and qualitative evaluations show CLIPXPlore outperforms existing methods in shape exploration. Model analysis confirms the effectiveness of the space connection and the co-optimization strategy. Exploring the latent space may lead to unexpected shape changes beyond the given condition. Identifying the optimal step size along the exploration trajectory remains a challenge. 3d shape exploration, clip, vision-language model, latent space exploration, multi-modal shape modeling
2306.07969 Report GeneCIS: A Benchmark for General Conditional Image Similarity Sagar Vaze, Nicolas Carion, Ishan Misra We argue that there are many notions of 'similarity' and that models, like humans, should be able to adapt to these dynamically. This contrasts with most representation learning methods, supervised or self-supervised, which learn a fixed embedding function and hence implicitly assume a single notion of similarity. For instance, models trained on ImageNet are biased towards object categories, while a user might prefer the model to focus on colors, textures or specific elements in the scene. In this paper, we propose the GeneCIS ('genesis') benchmark, which measures models' ability to adapt to a range of similarity conditions. Extending prior work, our benchmark is designed for zero-shot evaluation only, and hence considers an open-set of similarity conditions. We find that baselines from powerful CLIP models struggle on GeneCIS and that performance on the benchmark is only weakly correlated with ImageNet accuracy, suggesting that simply scaling existing methods is not fruitful. We further propose a simple, scalable solution based on automatically mining information from existing image-caption datasets. We find our method offers a substantial boost over the baselines on GeneCIS, and further improves zero-shot performance on related image retrieval benchmarks. In fact, though evaluated zero-shot, our model surpasses state-of-the-art supervised models on MIT-States. Project page at https://sgvaze.github.io/genecis/. The paper introduces GeneCIS, a benchmark designed to measure a model's ability to adapt to various notions of image similarity given explicit conditions. Most existing representation learning methods, supervised or self-supervised, learn a fixed embedding function and implicitly assume a single notion of similarity, which is insufficient for real-world applications. GeneCIS is constructed by re-purposing existing datasets (VAW, COCO) to create four retrieval tasks: Focus on an Attribute, Change an Attribute, Focus on an Object, and Change an Object. The authors propose a method to automatically mine training data for conditional image similarity from large-scale image-caption datasets by extracting Subject-Predicate-Object relationships. Baselines using only image or text information struggle on GeneCIS, indicating the benchmark effectively evaluates conditional similarity. The proposed method, trained on mined triplets, outperforms CLIP-only baselines and even surpasses a model trained on manually annotated data (CIRR). Performance on GeneCIS is weakly correlated with ImageNet accuracy, suggesting that the benchmark measures different aspects of model capability compared to traditional vision tasks. The benchmark currently relies on potentially noisy annotations from source datasets and requires manual verification. Future work could explore mining triplets from even larger image-caption datasets like LAION-5B. image similarity, conditional similarity, benchmarking, representation learning, zero-shot learning
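The triplet-mining step relies on pulling Subject-Predicate-Object relations out of captions. A rough stand-in using spaCy's dependency parse is sketched below; the paper's actual parser and filtering are more involved, and the function and model names here are illustrative:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_svo(caption: str):
    """Collect (subject, predicate, object) candidates from a caption."""
    triples = []
    for token in nlp(caption):
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "attr")]
            triples += [(s.text, token.lemma_, o.text) for s in subjects for o in objects]
    return triples

# e.g. extract_svo("A man rides a brown horse on the beach")
```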
2306.07967 Report One-for-All: Generalized LoRA for Parameter-Efficient Fine-tuning Arnav Chavan, Zhuang Liu, Deepak Gupta, Eric Xing, Zhiqiang Shen We present Generalized LoRA (GLoRA), an advanced approach for universal parameter-efficient fine-tuning tasks. Enhancing Low-Rank Adaptation (LoRA), GLoRA employs a generalized prompt module to optimize pre-trained model weights and adjust intermediate activations, providing more flexibility and capability across diverse tasks and datasets. Moreover, GLoRA facilitates efficient parameter adaptation by employing a scalable, modular, layer-wise structure search that learns individual adapter of each layer. Originating from a unified mathematical formulation, GLoRA exhibits strong transfer learning, few-shot learning and domain generalization abilities, as it adapts to new tasks through not only weights but also additional dimensions like activations. Comprehensive experiments demonstrate that GLoRA outperforms all previous methods in natural, specialized, and structured vision benchmarks, achieving superior accuracy with fewer parameters and computations. The proposed method on LLaMA-1 and LLaMA-2 also show considerable enhancements compared to the original LoRA in the language domain. Furthermore, our structural re-parameterization design ensures that GLoRA incurs no extra inference cost, rendering it a practical solution for resource-limited applications. Code and models are available at: https://github.com/Arnav0400/ViT-Slim/tree/master/GLoRA. This paper introduces Generalized LoRA (GLoRA), a universal parameter-efficient fine-tuning method that improves upon LoRA by optimizing pre-trained model weights and adjusting intermediate activations. GLoRA addresses limitations of existing parameter-efficient fine-tuning methods, offering more flexibility and capability across diverse tasks and datasets while avoiding extra inference costs. GLoRA employs a generalized prompt module and facilitates efficient parameter adaptation using a scalable, modular, layer-wise structure search with a unified mathematical formulation. GLoRA outperforms previous PEFT methods on VTAB-1K, achieving state-of-the-art accuracy with fewer parameters. It shows superior few-shot learning abilities on fine-grained visual recognition datasets. GLoRA exhibits strong domain generalization capabilities, outperforming existing methods on out-of-domain datasets. The search process in GLoRA, while automated, can increase training time compared to methods requiring manual hyperparameter tuning. The paper primarily focuses on vision tasks, with limited exploration of GLoRA's potential in other domains like NLP. parameter-efficient fine-tuning, low-rank adaptation (lora), transfer learning, few-shot learning, domain generalization
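For reference, the low-rank update that GLoRA generalizes is ordinary LoRA; a minimal sketch follows. GLoRA additionally learns per-layer scalings and shifts for weights, biases and activations via a structure search, which is not shown here:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus a trainable low-rank residual B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # pre-trained weight stays frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

    @torch.no_grad()
    def merge(self):
        """Fold the update into the base weight so inference cost is unchanged,
        mirroring the structural re-parameterization GLoRA relies on."""
        self.base.weight += self.scale * (self.B @ self.A)
```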
2306.07954 Report Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation Shuai Yang, Yifan Zhou, Ziwei Liu, Chen Change Loy Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos. The framework includes two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures and colors. The second part propagates the key frames to other frames with temporal-aware patch matching and frame blending. Our framework achieves global style and local texture temporal consistency at a low cost (without re-training or optimization). The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, such as customizing a specific subject with LoRA, and introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally-coherent videos. This paper introduces a novel zero-shot framework for text-guided video-to-video translation, capable of rendering temporally consistent videos by adapting pre-trained image diffusion models to videos. Existing text-to-image diffusion models struggle to maintain temporal consistency when applied to videos. This work addresses this challenge by proposing a method that leverages the strengths of both diffusion models and frame interpolation for high-quality, efficient, and temporally consistent video translation. The framework comprises two stages: key frame translation and full video translation. Key frame translation utilizes hierarchical cross-frame constraints, including style-aware cross-frame attention, shape-aware latent fusion, pixel-aware latent fusion with a novel fidelity-oriented image encoding method, and color-aware adaptive latent adjustment, to ensure temporal consistency at different levels. Full video translation propagates the rendered key frames to other frames using temporal-aware patch matching and frame blending (adapted from EbSynth). The proposed framework outperforms existing zero-shot video translation methods in terms of visual quality and temporal consistency, as demonstrated by both qualitative and quantitative evaluations. The hierarchical cross-frame constraints effectively enforce temporal consistency at different levels, from global style to local texture. The fidelity-oriented image encoding significantly reduces error accumulation during iterative encoding and decoding, crucial for preserving details in the pixel-aware latent fusion. The framework relies on accurate optical flow estimation, which may be challenging for large motions or significant appearance changes. Uniform key frame sampling may not be optimal for all videos, and future work could explore content-aware key frame selection or user-interactive translation. video-to-video translation, text-guided video editing, diffusion models, temporal consistency, zero-shot learning
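The style-aware cross-frame constraint amounts to letting the current frame's queries attend to keys and values from reference key frames rather than only its own tokens. A generic sketch, not the authors' exact module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossFrameAttention(nn.Module):
    """Queries from the current frame attend to tokens of the anchor (first)
    and previous key frames, tying style and texture across frames."""
    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)

    def forward(self, x_cur, x_anchor, x_prev):        # each (B, N, C)
        q = self.to_q(x_cur)
        ctx = torch.cat([x_anchor, x_prev], dim=1)     # shared reference tokens
        return F.scaled_dot_product_attention(q, self.to_k(ctx), self.to_v(ctx))
```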
2306.07881 Report Viewset Diffusion: (0-)Image-Conditioned 3D Generative Models from 2D Data Stanislaw Szymanowicz, Christian Rupprecht, Andrea Vedaldi We present Viewset Diffusion, a diffusion-based generator that outputs 3D objects while only using multi-view 2D data for supervision. We note that there exists a one-to-one mapping between viewsets, i.e., collections of several 2D views of an object, and 3D models. Hence, we train a diffusion model to generate viewsets, but design the neural network generator to reconstruct internally corresponding 3D models, thus generating those too. We fit a diffusion model to a large number of viewsets for a given category of objects. The resulting generator can be conditioned on zero, one or more input views. Conditioned on a single view, it performs 3D reconstruction accounting for the ambiguity of the task and allowing to sample multiple solutions compatible with the input. The model performs reconstruction efficiently, in a feed-forward manner, and is trained using only rendering losses using as few as three views per viewset. Project page: szymanowiczs.github.io/viewset-diffusion. Introduces Viewset Diffusion, a diffusion-based generative model for 3D objects that learns from multi-view 2D data, enabling probabilistic single-view 3D reconstruction and unconditional generation. Addresses the limitations of deterministic 3D reconstruction methods in handling ambiguity and leverages the abundance of 2D data for learning 3D object priors. Trains a DDPM to generate viewsets by reconstructing a 3D radiance field internally, allowing for conditional generation based on varying noise levels across input views. Achieves state-of-the-art single-view reconstruction results on ShapeNet-SRN Cars in terms of PSNR. Demonstrates superior perceptual quality and sharpness in reconstructions compared to deterministic baselines, particularly in ambiguous settings. Enables unconditional 3D generation with higher visual detail than previous diffusion-based methods trained on single views. Reconstruction quality on complex objects with high ambiguity can be further improved, potentially by exploring larger sample sizes during inference. Exploring alternative 3D representations beyond radiance fields could enhance the model's efficiency and expressiveness. 3d reconstruction, diffusion models, generative models, single-view reconstruction, computer vision
2306.07754 Report Generative Watermarking Against Unauthorized Subject-Driven Image Synthesis Yihan Ma, Zhengyu Zhao, Xinlei He, Zheng Li, Michael Backes, Yang Zhang Large text-to-image models have shown remarkable performance in synthesizing high-quality images. In particular, the subject-driven model makes it possible to personalize the image synthesis for a specific subject, e.g., a human face or an artistic style, by fine-tuning the generic text-to-image model with a few images from that subject. Nevertheless, misuse of subject-driven image synthesis may violate the authority of subject owners. For example, malicious users may use subject-driven synthesis to mimic specific artistic styles or to create fake facial images without authorization. To protect subject owners against such misuse, recent attempts have commonly relied on adversarial examples to indiscriminately disrupt subject-driven image synthesis. However, this essentially prevents any benign use of subject-driven synthesis based on protected images. In this paper, we take a different angle and aim at protection without sacrificing the utility of protected images for general synthesis purposes. Specifically, we propose GenWatermark, a novel watermark system based on jointly learning a watermark generator and a detector. In particular, to help the watermark survive the subject-driven synthesis, we incorporate the synthesis process in learning GenWatermark by fine-tuning the detector with synthesized images for a specific subject. This operation is shown to largely improve the watermark detection accuracy and also ensure the uniqueness of the watermark for each individual subject. Extensive experiments validate the effectiveness of GenWatermark, especially in practical scenarios with unknown models and text prompts (74% Acc.), as well as partial data watermarking (80% Acc. for 1/4 watermarking). We also demonstrate the robustness of GenWatermark to two potential countermeasures that substantially degrade the synthesis quality. This paper introduces GenWatermark, a novel generative watermarking method designed to safeguard images from unauthorized subject-driven synthesis while preserving their usability for authorized purposes. The rise of subject-driven image synthesis models raises concerns about the potential misuse of personal images, such as replicating an artist's style or generating fake facial images without consent. Existing protection methods often disrupt both malicious and benign uses, hindering authorized applications. GenWatermark employs a two-phase learning approach. The first phase involves jointly training a watermark generator and detector on a large-scale dataset. In the second phase, the detector is fine-tuned for each subject using images synthesized from both clean and watermarked versions of their images. GenWatermark achieves high detection accuracy (above 98%) in scenarios with known models and prompts, and maintains reasonable accuracy (around 74%) even with unknown models and prompts. Injecting watermarks has minimal impact on image synthesis quality, with FID scores changing by less than 1%. GenWatermark demonstrates robustness against partial watermarking, watermark forgery with random noise, and watermark removal attempts using image transformations. The cross-model transferability of GenWatermark could be further enhanced, potentially by incorporating model-specific properties during detector fine-tuning. While GenWatermark exhibits substantial watermark uniqueness, there is potential for improvement by fine-tuning both the generator and detector based on subject-specific images. image watermarking, subject-driven synthesis, generative models, image protection, digital copyright
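The first-phase joint training of the watermark generator and detector could look roughly like the one-step sketch below; the loss weighting and the distortion penalty are illustrative assumptions, not the paper's objective:

```python
import torch
import torch.nn.functional as F

def joint_step(gen, det, images, opt):
    """Generator adds a (small) watermark residual; detector separates
    watermarked from clean images; both are optimized together."""
    wm_images = (images + gen(images)).clamp(0, 1)
    logits = det(torch.cat([images, wm_images]))
    labels = torch.cat([
        torch.zeros(len(images), dtype=torch.long, device=images.device),
        torch.ones(len(images), dtype=torch.long, device=images.device),
    ])
    loss = F.cross_entropy(logits, labels) + 0.1 * (wm_images - images).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```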
2306.07716 Report Dynamically Masked Discriminator for Generative Adversarial Networks Wentian Zhang, Haozhe Liu, Bing Li, Jinheng Xie, Yawen Huang, Yuexiang Li, Yefeng Zheng, Bernard Ghanem Training Generative Adversarial Networks (GANs) remains a challenging problem. The discriminator trains the generator by learning the distribution of real/generated data. However, the distribution of generated data changes throughout the training process, which is difficult for the discriminator to learn. In this paper, we propose a novel method for GANs from the viewpoint of online continual learning. We observe that the discriminator model, trained on historically generated data, often slows down its adaptation to the changes in the new arrival generated data, which accordingly decreases the quality of generated results. By treating the generated data in training as a stream, we propose to detect whether the discriminator slows down the learning of new knowledge in generated data. Therefore, we can explicitly enforce the discriminator to learn new knowledge fast. Particularly, we propose a new discriminator, which automatically detects its retardation and then dynamically masks its features, such that the discriminator can adaptively learn the temporally-vary distribution of generated data. Experimental results show our method outperforms the state-of-the-art approaches. This paper proposes DMD, a novel method for training GANs that tackles the challenge of time-varying generated data distributions by viewing it as an online continual learning problem. Training GANs is difficult due to the discriminator's struggle to adapt to the evolving distribution of generated data throughout the training process, leading to subpar generated results. DMD employs two key modules: (1) discriminator retardation detection, which identifies when the discriminator relies too heavily on past data, and (2) dynamic discriminator adjustment, which utilizes dynamic feature masking to force the discriminator to learn new knowledge rapidly. DMD achieves state-of-the-art FID scores on FFHQ, AFHQ-V2, and LSUN-Church datasets, outperforming both traditional GAN methods and diffusion models. Ablation studies demonstrate that DMD's dynamic masking strategy is superior to fixed interval masking or dropout. Analysis reveals that DMD effectively reduces the discriminator's reliance on historical knowledge while improving its adaptation to new data distributions. The paper lacks theoretical analysis of the proposed method. Further investigation is needed to explore the integration of DMD with Transformer-based GAN models. generative adversarial networks (gans), online continual learning, dynamic feature masking, discriminator regularization, image generation
2306.07684 Report Lookaround Optimizer: $k$ steps around, 1 step average Jiangtao Zhang, Shunyu Liu, Jie Song, Tongtian Zhu, Zhengqi Xu, Mingli Song Weight Average (WA) is an active research topic due to its simplicity in ensembling deep networks and the effectiveness in promoting generalization. Existing weight average approaches, however, are often carried out along only one training trajectory in a post-hoc manner (i.e., the weights are averaged after the entire training process is finished), which significantly degrades the diversity between networks and thus impairs the effectiveness. In this paper, inspired by weight average, we propose Lookaround, a straightforward yet effective SGD-based optimizer leading to flatter minima with better generalization. Specifically, Lookaround iterates two steps during the whole training period: the around step and the average step. In each iteration, 1) the around step starts from a common point and trains multiple networks simultaneously, each on transformed data by a different data augmentation, and 2) the average step averages these trained networks to get the averaged network, which serves as the starting point for the next iteration. The around step improves the functionality diversity while the average step guarantees the weight locality of these networks during the whole training, which is essential for WA to work. We theoretically explain the superiority of Lookaround by convergence analysis, and make extensive experiments to evaluate Lookaround on popular benchmarks including CIFAR and ImageNet with both CNNs and ViTs, demonstrating clear superiority over state-of-the-arts. Our code is available at https://github.com/Ardcy/Lookaround. This paper proposes Lookaround, an SGD-based optimizer that leverages data augmentation and iterative weight averaging throughout training to find flatter minima, enhancing generalization performance in deep neural networks. Existing weight averaging techniques for deep network ensembling often struggle to balance model diversity and weight locality, limiting their effectiveness in finding flat minima and improving generalization. Lookaround iterates two steps: 1) "around" trains multiple networks from a common starting point using diverse data augmentations for higher functional diversity, and 2) "average" averages these network weights to maintain weight locality and guide training towards flatter minima. Theoretical analysis demonstrates Lookaround achieves lower variance and faster convergence than SGD and Lookahead. Empirical evaluations on CIFAR and ImageNet datasets, with both CNNs and ViTs, show consistent performance improvements and improved training stability compared to baselines and ensemble methods. Lookaround's performance is robust across varying data augmentation counts and around step sizes. Increased training time proportional to the number of networks trained due to multiple forward and backward passes. Future work could explore learning rate schedulers tailored to Lookaround's iterative averaging for optimal performance. deep learning, optimization, weight averaging, generalization, data augmentation
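One Lookaround iteration is easy to state in code. The sketch below assumes a list of data loaders (one per augmentation) and a hypothetical `train_k_steps` helper; it is not the released implementation:

```python
import copy
import torch

def lookaround_iteration(model, loaders, train_k_steps, k: int = 5):
    """'Around' step: train one copy of the current network per augmented loader.
    'Average' step: average the trained copies back into the shared network."""
    replicas = [copy.deepcopy(model) for _ in loaders]
    for replica, loader in zip(replicas, loaders):        # around: functional diversity
        train_k_steps(replica, loader, k)
    with torch.no_grad():                                  # average: weight locality
        for name, param in model.named_parameters():
            param.copy_(torch.stack(
                [dict(r.named_parameters())[name] for r in replicas]).mean(0))
    return model
```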
2306.07596 Report Paste, Inpaint and Harmonize via Denoising: Subject-Driven Image Editing with Pre-Trained Diffusion Model Xin Zhang, Jiaxian Guo, Paul Yoo, Yutaka Matsuo, Yusuke Iwasawa Text-to-image generative models have attracted rising attention for flexible image editing via user-specified descriptions. However, text descriptions alone are not enough to elaborate the details of subjects, often compromising the subjects' identity or requiring additional per-subject fine-tuning. We introduce a new framework called \textit{Paste, Inpaint and Harmonize via Denoising} (PhD), which leverages an exemplar image in addition to text descriptions to specify user intentions. In the pasting step, an off-the-shelf segmentation model is employed to identify a user-specified subject within an exemplar image which is subsequently inserted into a background image to serve as an initialization capturing both scene context and subject identity in one. To guarantee the visual coherence of the generated or edited image, we introduce an inpainting and harmonizing module to guide the pre-trained diffusion model to seamlessly blend the inserted subject into the scene naturally. As we keep the pre-trained diffusion model frozen, we preserve its strong image synthesis ability and text-driven ability, thus achieving high-quality results and flexible editing with diverse texts. In our experiments, we apply PhD to both subject-driven image editing tasks and explore text-driven scene generation given a reference subject. Both quantitative and qualitative comparisons with baseline methods demonstrate that our approach achieves state-of-the-art performance in both tasks. More qualitative results can be found at \url{https://sites.google.com/view/phd-demo-page}. This paper introduces PhD, a novel framework that leverages pre-trained diffusion models and an inpainting and harmonizing module for subject-driven image editing and generation. Existing text-to-image methods struggle to accurately portray user-specific subjects, and subject-driven editing methods often compromise subject identity or model flexibility. PhD uses a two-step process: 1) pasting a segmented subject from an exemplar image onto a background scene, 2) using an inpainting and harmonizing module to guide a frozen pre-trained diffusion model in generating context-consistent and photorealistic images. PhD achieves state-of-the-art performance on subject-driven image editing tasks, surpassing baselines in metrics like FID and CLIP scores. The method effectively preserves both subject identity and background scene quality, as shown in quantitative and qualitative evaluations. PhD demonstrates promising results for subject-driven scene generation and style transfer by leveraging the text-guided capabilities of the frozen diffusion model. While PhD excels in harmonizing subjects into scenes, it can struggle with generating unseen regions of the subject. Future work will explore incorporating 3D information to enhance the generation of complete and consistent subjects. image editing, image generation, diffusion models, subject-driven synthesis, image harmonization
2306.07470 Report Reviving Shift Equivariance in Vision Transformers Peijian Ding, Davit Soselia, Thomas Armstrong, Jiahao Su, Furong Huang Shift equivariance is a fundamental principle that governs how we perceive the world - our recognition of an object remains invariant with respect to shifts. Transformers have gained immense popularity due to their effectiveness in both language and vision tasks. While the self-attention operator in vision transformers (ViT) is permutation-equivariant and thus shift-equivariant, patch embedding, positional encoding, and subsampled attention in ViT variants can disrupt this property, resulting in inconsistent predictions even under small shift perturbations. Although there is a growing trend in incorporating the inductive bias of convolutional neural networks (CNNs) into vision transformers, it does not fully address the issue. We propose an adaptive polyphase anchoring algorithm that can be seamlessly integrated into vision transformer models to ensure shift-equivariance in patch embedding and subsampled attention modules, such as window attention and global subsampled attention. Furthermore, we utilize depth-wise convolution to encode positional information. Our algorithms enable ViT, and its variants such as Twins to achieve 100% consistency with respect to input shift, demonstrate robustness to cropping, flipping, and affine transformations, and maintain consistent predictions even when the original models lose 20 percentage points on average when shifted by just a few pixels with Twins' accuracy dropping from 80.57% to 62.40%. This paper introduces an adaptive polyphase anchoring algorithm to address the lack of shift equivariance in vision transformers (ViT), improving their robustness to image translations. Shift equivariance is a crucial property for consistent object recognition regardless of its position. Existing ViT models often lack this, leading to inconsistent predictions even with small shifts in the input image. The authors propose replacing non-shift-equivariant modules in ViTs, like patch embedding and subsampled attention, with their polyphase anchoring counterparts. Additionally, they utilize depth-wise convolution with circular padding for positional encoding, further enhancing shift equivariance. The modified ViT models achieve 100% consistency in image classification under shift perturbations. The approach provides significant improvements in accuracy under challenging transformations like cropping, flipping, and affine transformations. The proposed method leads to a substantial gain in robustness, as demonstrated by a 20% average improvement under a worst-of-30 shift attack. Due to computational limitations, this work focuses on controlled experiments rather than achieving state-of-the-art accuracy. Future work will explore using larger-scale computing resources to optimize the proposed method for state-of-the-art performance on ViT models. vision transformers, shift equivariance, polyphase anchoring, robustness, image classification
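The core of polyphase anchoring is choosing, per input, which of the stride-squared sampling grids to keep so that the grid follows the input when it shifts. A minimal sketch for stride-2 feature maps, using a max-norm criterion:

```python
import torch

def adaptive_polyphase_downsample(x, stride: int = 2):
    """Pick, for each sample, the polyphase component (sampling-grid offset)
    with the largest L2 norm; the chosen anchor moves with the input, so the
    subsampling becomes shift-equivariant up to the sampling grid."""
    phases = [x[:, :, i::stride, j::stride]
              for i in range(stride) for j in range(stride)]
    norms = torch.stack([p.flatten(1).norm(dim=1) for p in phases], dim=1)  # (B, s*s)
    best = norms.argmax(dim=1)
    return torch.stack([phases[best[b]][b] for b in range(x.size(0))])
```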
2306.07346 Report Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training Lorenzo Baraldi, Roberto Amoroso, Marcella Cornia, Lorenzo Baraldi, Andrea Pilzer, Rita Cucchiara The use of self-supervised pre-training has emerged as a promising approach to enhance the performance of visual tasks such as image classification. In this context, recent approaches have employed the Masked Image Modeling paradigm, which pre-trains a backbone by reconstructing visual tokens associated with randomly masked image patches. This masking approach, however, introduces noise into the input data during pre-training, leading to discrepancies that can impair performance during the fine-tuning phase. Furthermore, input masking neglects the dependencies between corrupted patches, increasing the inconsistencies observed in downstream fine-tuning tasks. To overcome these issues, we propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT), that employs autoregressive and permuted predictions to capture intra-patch dependencies. In addition, MaPeT employs auxiliary positional information to reduce the disparity between the pre-training and fine-tuning phases. In our experiments, we employ a fair setting to ensure reliable and meaningful comparisons and conduct investigations on multiple visual tokenizers, including our proposed $k$-CLIP which directly employs discretized CLIP features. Our results demonstrate that MaPeT achieves competitive performance on ImageNet, compared to baselines and competitors under the same model setting. Source code and trained models are publicly available at: https://github.com/aimagelab/MaPeT. The paper introduces Masked and Permuted Vision Transformer (MaPeT), a self-supervised pre-training approach for vision tasks, and k-CLIP, a novel visual tokenizer employing discretized CLIP features. Addresses limitations of Masked Image Modeling (MIM) in self-supervised pre-training for vision tasks, aiming to improve performance by capturing inter-token dependencies and mitigating pre-training/fine-tuning discrepancies. Combines masked and permuted image modeling strategies, incorporating auxiliary position embeddings to provide full patch position information. Leverages two-stream self-attention mechanism and attention masking for capturing bidirectional context. k-CLIP tokenizer directly utilizes discretized CLIP features for visual token generation. MaPeT consistently outperforms MIM and permutation-based image pre-training (PIM) in image classification tasks. k-CLIP tokenizer demonstrates superior performance compared to VQ-KD and DALL-E, especially with smaller models, due to its ability to capture rich semantic information. MaPeT exhibits strong cross-domain transfer learning capabilities, showcasing its potential for real-world applications. High computational requirements during training pose challenges for widespread adoption. Further research is needed to assess scalability and adaptability to more diverse and complex domains. self-supervised learning, vision transformers, masked image modeling, visual tokenizers, clip
2306.07280 Report Controlling Text-to-Image Diffusion by Orthogonal Finetuning Zeju Qiu, Weiyang Liu, Haiwen Feng, Yuxuan Xue, Yao Feng, Zhen Liu, Dan Zhang, Adrian Weller, Bernhard Schölkopf Large text-to-image diffusion models have impressive capabilities in generating photorealistic images from text prompts. How to effectively guide or control these powerful models to perform different downstream tasks becomes an important open problem. To tackle this challenge, we introduce a principled finetuning method -- Orthogonal Finetuning (OFT), for adapting text-to-image diffusion models to downstream tasks. Unlike existing methods, OFT can provably preserve hyperspherical energy which characterizes the pairwise neuron relationship on the unit hypersphere. We find that this property is crucial for preserving the semantic generation ability of text-to-image diffusion models. To improve finetuning stability, we further propose Constrained Orthogonal Finetuning (COFT) which imposes an additional radius constraint to the hypersphere. Specifically, we consider two important finetuning text-to-image tasks: subject-driven generation where the goal is to generate subject-specific images given a few images of a subject and a text prompt, and controllable generation where the goal is to enable the model to take in additional control signals. We empirically show that our OFT framework outperforms existing methods in generation quality and convergence speed. This paper presents Orthogonal Finetuning (OFT), a novel method for adapting large text-to-image diffusion models to downstream tasks while preserving their generative capabilities. Fine-tuning diffusion models for tasks like subject-driven and controllable generation is crucial for extending their utility, but existing methods often fail to preserve or struggle to balance generation quality and task-specific control. OFT learns layer-wise orthogonal transformations of neuron weights, provably preserving hyperspherical energy, which is argued to be key to retaining the semantic knowledge of the pretrained model. A constrained variant (COFT) further enhances stability by limiting deviation from pretrained weights. OFT demonstrates significantly improved stability and convergence speed over DreamBooth and LoRA in subject-driven generation. For controllable generation, OFT achieves superior control accuracy compared to ControlNet, T2I-Adapter, and LoRA, often with fewer training data and parameters. Experiments on various control tasks (Canny edges, segmentation maps, landmarks) validate OFT's effectiveness, showcasing its ability to generate high-fidelity images with accurate control. The reliance on Cayley parametrization for orthogonality introduces a matrix inversion step that can be computationally expensive for large models. While block-diagonal parametrization improves efficiency, it introduces biases and limits flexibility; exploring alternative efficient parametrizations is crucial. text-to-image generation, diffusion models, fine-tuning, controllable generation, subject-driven generation
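The orthogonal transform is commonly parameterized via the Cayley map. A per-layer sketch is below; the block-diagonal structure, the COFT radius constraint, and the exact multiplication convention follow the paper and are omitted here:

```python
import torch
import torch.nn as nn

class CayleyOrthogonal(nn.Module):
    """Orthogonal matrix via the Cayley map: Q = S - S^T is skew-symmetric, so
    R = (I + Q)(I - Q)^{-1} is orthogonal and equals I at init (S = 0)."""
    def __init__(self, dim: int):
        super().__init__()
        self.S = nn.Parameter(torch.zeros(dim, dim))

    def forward(self):
        Q = self.S - self.S.t()
        I = torch.eye(Q.size(0), device=Q.device, dtype=Q.dtype)
        return (I + Q) @ torch.linalg.inv(I - Q)

def finetuned_weight(frozen_weight, rot: CayleyOrthogonal):
    """Apply the same orthogonal transform to every neuron vector (rows of the
    frozen (out, in) weight), preserving their pairwise angles and hence the
    hyperspherical energy the paper argues is key to retaining semantics."""
    return frozen_weight @ rot()          # (out, in) @ (in, in)
```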
2306.07257 Report MovieFactory: Automatic Movie Creation from Text using Large Generative Models for Language and Images Junchen Zhu, Huan Yang, Huiguo He, Wenjing Wang, Zixi Tuo, Wen-Huang Cheng, Lianli Gao, Jingkuan Song, Jianlong Fu In this paper, we present MovieFactory, a powerful framework to generate cinematic-picture (3072$\times$1280), film-style (multi-scene), and multi-modality (sounding) movies on the demand of natural languages. As the first fully automated movie generation model to the best of our knowledge, our approach empowers users to create captivating movies with smooth transitions using simple text inputs, surpassing existing methods that produce soundless videos limited to a single scene of modest quality. To facilitate this distinctive functionality, we leverage ChatGPT to expand user-provided text into detailed sequential scripts for movie generation. Then we bring scripts to life visually and acoustically through vision generation and audio retrieval. To generate videos, we extend the capabilities of a pretrained text-to-image diffusion model through a two-stage process. Firstly, we employ spatial finetuning to bridge the gap between the pretrained image model and the new video dataset. Subsequently, we introduce temporal learning to capture object motion. In terms of audio, we leverage sophisticated retrieval models to select and align audio elements that correspond to the plot and visual content of the movie. Extensive experiments demonstrate that our MovieFactory produces movies with realistic visuals, diverse scenes, and seamlessly fitting audio, offering users a novel and immersive experience. Generated samples can be found in YouTube or Bilibili (1080P). MovieFactory: a novel framework for generating high-definition, cinematic-style (ultrawide format), multi-scene, and sounding movies from text inputs. Automatic movie generation is a challenging task with the potential to democratize filmmaking and empower individuals to bring their stories to life. The framework leverages ChatGPT for script generation, a two-stage fine-tuned diffusion model for video generation, and retrieval models for synchronized audio. MovieFactory produces high-quality videos with clear visuals and smooth object motion. The two-stage training strategy effectively addresses the domain shift between image and video datasets. The framework successfully combines large-scale AI models to create engaging and immersive movie experiences. Current limitations include reliance on retrieval-based audio and the potential for further enhancing the quality of generated content. Future work may explore end-to-end audio generation, improved temporal consistency, and the incorporation of user feedback for interactive movie creation. movie generation, diffusion model, text-to-video, chatgpt, multi-modal generation
2306.07200 Report Fill-Up: Balancing Long-Tailed Data with Generative Models Joonghyuk Shin, Minguk Kang, Jaesik Park Modern text-to-image synthesis models have achieved an exceptional level of photorealism, generating high-quality images from arbitrary text descriptions. In light of the impressive synthesis ability, several studies have exhibited promising results in exploiting generated data for image recognition. However, directly supplementing data-hungry situations in the real-world (e.g. few-shot or long-tailed scenarios) with existing approaches result in marginal performance gains, as they suffer to thoroughly reflect the distribution of the real data. Through extensive experiments, this paper proposes a new image synthesis pipeline for long-tailed situations using Textual Inversion. The study demonstrates that generated images from textual-inverted text tokens effectively aligns with the real domain, significantly enhancing the recognition ability of a standard ResNet50 backbone. We also show that real-world data imbalance scenarios can be successfully mitigated by filling up the imbalanced data with synthetic images. In conjunction with techniques in the area of long-tailed recognition, our method achieves state-of-the-art results on standard long-tailed benchmarks when trained from scratch. This paper proposes a new image synthesis pipeline for long-tailed image recognition using Textual Inversion, which fills up imbalanced datasets with synthetic images aligned with the real domain, improving recognition accuracy. Real-world data often exhibits imbalanced distributions, posing challenges for long-tailed recognition tasks, and existing synthetic data generation methods struggle to reflect real data distributions effectively. The authors evaluate various image generation strategies, finding Textual Inversion most effective. They optimize per-class text tokens to generate images that align with the real domain and adopt a two-stage training procedure with Balanced Softmax loss. Textual Inversion-based image generation outperforms other methods in terms of diversity and alignment with real data. The proposed pipeline achieves state-of-the-art results on standard long-tailed benchmarks (ImageNet-LT, Places-LT, iNaturalist2018) when trained from scratch. The method demonstrates significant improvements in few-shot scenarios, particularly for classes with fewer than 20 samples. Generating synthetic images through diffusion models demands extensive computational resources. Despite the efficiency of the approach, generating images with features on par with real samples remains a challenge. long-tailed recognition, text-to-image synthesis, textual inversion, synthetic data generation, data imbalance
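Balanced Softmax, the long-tail loss the pipeline pairs with the synthetic fill-up images, only requires shifting the logits by the log class frequencies; a minimal sketch (assuming `class_counts` lives on the same device as the logits):

```python
import torch
import torch.nn.functional as F

def balanced_softmax_loss(logits, targets, class_counts):
    """Cross-entropy on prior-corrected logits: adding log n_c compensates for
    label imbalance so rare (tail) classes are not drowned out by head classes."""
    prior = torch.log(class_counts.float().clamp(min=1))
    return F.cross_entropy(logits + prior.unsqueeze(0), targets)
```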
2306.07180 Report Diffusion Models for Black-Box Optimization Siddarth Krishnamoorthy, Satvik Mehul Mashkaria, Aditya Grover The goal of offline black-box optimization (BBO) is to optimize an expensive black-box function using a fixed dataset of function evaluations. Prior works consider forward approaches that learn surrogates to the black-box function and inverse approaches that directly map function values to corresponding points in the input domain of the black-box function. These approaches are limited by the quality of the offline dataset and the difficulty in learning one-to-many mappings in high dimensions, respectively. We propose Denoising Diffusion Optimization Models (DDOM), a new inverse approach for offline black-box optimization based on diffusion models. Given an offline dataset, DDOM learns a conditional generative model over the domain of the black-box function conditioned on the function values. We investigate several design choices in DDOM, such as re-weighting the dataset to focus on high function values and the use of classifier-free guidance at test-time to enable generalization to function values that can even exceed the dataset maxima. Empirically, we conduct experiments on the Design-Bench benchmark and show that DDOM achieves results competitive with state-of-the-art baselines. Presents Denoising Diffusion Optimization Models (DDOM), a novel inverse method for offline black-box optimization that leverages conditional diffusion models to learn a mapping from function values to input points. Addresses limitations of existing forward (surrogate-based) and inverse approaches for offline black-box optimization, particularly in handling limited dataset coverage and challenges in learning one-to-many mappings in high-dimensional spaces. Trains a conditional diffusion model on an offline dataset of input-value pairs, employing loss reweighting to prioritize high function values and classifier-free guidance during sampling to enhance conditioning and enable generalization beyond dataset maxima. DDOM successfully learns the inverse mapping, generating points with function values closely matching the conditioned values. Outperforms existing forward and inverse baselines on the Design-Bench suite, achieving the best average rank and demonstrating robustness to initialization compared to alternatives. Effectiveness of loss reweighting and classifier-free guidance is validated through ablation studies, highlighting their contribution to DDOM's performance. Sampling speed in diffusion models can be a limitation for some real-time applications. Potential for misuse in optimizing for undesirable outcomes necessitates careful consideration during real-world deployment. black-box optimization, diffusion models, offline optimization, generative models, conditional generation
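The classifier-free guidance used at DDOM's sampling time blends a conditional and an unconditional noise prediction. A sketch with an assumed model interface `model(x_t, t, y)`; the guidance weight and the null token are illustrative:

```python
def guided_noise(model, x_t, t, y_target, y_null, w: float = 2.0):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the one conditioned on the desired function value y_target, which is
    what lets sampling target values at or even beyond the dataset maximum."""
    eps_cond = model(x_t, t, y_target)     # conditioned on the target value
    eps_uncond = model(x_t, t, y_null)     # y_null: learned "no condition" token
    return (1 + w) * eps_cond - w * eps_uncond
```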
2306.06991 Report Fast Diffusion Model Zike Wu, Pan Zhou, Kenji Kawaguchi, Hanwang Zhang Diffusion models (DMs) have been adopted across diverse fields with its remarkable abilities in capturing intricate data distributions. In this paper, we propose a Fast Diffusion Model (FDM) to significantly speed up DMs from a stochastic optimization perspective for both faster training and sampling. We first find that the diffusion process of DMs accords with the stochastic optimization process of stochastic gradient descent (SGD) on a stochastic time-variant problem. Then, inspired by momentum SGD that uses both gradient and an extra momentum to achieve faster and more stable convergence than SGD, we integrate momentum into the diffusion process of DMs. This comes with a unique challenge of deriving the noise perturbation kernel from the momentum-based diffusion process. To this end, we frame the process as a Damped Oscillation system whose critically damped state -- the kernel solution -- avoids oscillation and yields a faster convergence speed of the diffusion process. Empirical results show that our FDM can be applied to several popular DM frameworks, e.g., VP, VE, and EDM, and reduces their training cost by about 50% with comparable image synthesis performance on CIFAR-10, FFHQ, and AFHQv2 datasets. Moreover, FDM decreases their sampling steps by about 3x to achieve similar performance under the same samplers. The code is available at https://github.com/sail-sg/FDM. The paper proposes Fast Diffusion Model (FDM) which integrates momentum into the diffusion process of Diffusion Models (DMs) to accelerate both training and sampling. DMs, while powerful in generative tasks, suffer from slow and costly training and sampling, hindering broader applications. FDM tackles this limitation by fundamentally improving the diffusion process. The authors establish a connection between DMs' diffusion process and stochastic gradient descent (SGD). Leveraging the faster convergence of momentum SGD, they incorporate momentum into the diffusion process and derive a tractable perturbation kernel for efficient training and sampling. FDM reduces training cost by about 50% compared to popular DM frameworks (VP, VE, EDM) while maintaining comparable image synthesis performance. FDM achieves similar image generation quality with 3 times fewer sampling steps compared to baselines. The momentum-based diffusion process shows stable and faster convergence towards the target distribution both theoretically and empirically. Verification is limited to three popular DMs (VP, VE, EDM). Further validation on a wider range of DMs is needed. Evaluation is conducted on a limited set of datasets. Testing on diverse tasks is necessary to fully understand FDM's potential. diffusion models, generative models, momentum sgd, fast sampling, efficient training
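The SGD analogy driving FDM is easiest to see on the optimizer side: the standard diffusion process corresponds to plain SGD on a time-varying objective, and the velocity term below is the momentum that FDM transplants into the forward diffusion (there solved in closed form as a critically damped oscillator). A sketch of the optimizer-side update only:

```python
def heavy_ball_step(x, grad, velocity, lr: float = 0.1, beta: float = 0.9):
    """Momentum (heavy-ball) update: the velocity accumulates past gradients,
    giving faster, less oscillatory convergence than plain SGD."""
    velocity = beta * velocity - lr * grad
    return x + velocity, velocity
```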
2306.06899 Report Augmenting Zero-Shot Detection Training with Image Labels Katharina Kornmeier, Ulla Scheler, Pascal Herrmann Zero-shot detection (ZSD), i.e., detection on classes not seen during training, is essential for real-world detection use cases, but remains a difficult task. Recent research attempts ZSD with detection models that output embeddings instead of direct class labels. To this end, the output of the detection model must be aligned to a learned embedding space such as CLIP. However, this alignment is hindered by detection datasets, which are expensive to produce compared to image classification annotations, and the resulting lack of category diversity in the training data. We address this challenge by leveraging the CLIP embedding space in combination with image labels from ImageNet. Our results show that image labels are able to better align the detector output to the embedding space and thus have a high potential for ZSD. Compared to training on detection data only, adding image label data yields a significant gain of 3.3 mAP on the unseen classes for the 65/15 split on COCO, i.e., we more than double the gain of related work. This paper proposes a method to improve zero-shot detection (ZSD) performance by augmenting the training of object detectors with image labels from a large-scale image classification dataset (ImageNet). ZSD is crucial for real-world applications as it allows detectors to identify objects not present in the training data. Existing ZSD methods suffer from limited category diversity due to the expensive nature of object detection annotations. The authors modify a single-stage detector (YOLOX) to predict embedding vectors instead of class probabilities. They align the model to the CLIP embedding space using both object detection data (COCO) and image classification data (ImageNet). A key aspect is the filtering and selection of appropriate bounding box predictions from ImageNet data for backpropagation. Adding ImageNet image labels significantly improves ZSD performance, more than doubling the gain of previous work using image embeddings. This approach outperforms methods aligning to both text and image embeddings from COCO, highlighting the benefit of diverse category information from ImageNet. The authors identify a new failure mode related to the underlying structure of label embeddings in embedding spaces. The study uses a smaller YOLOX model and fewer data augmentations due to resource constraints, potentially limiting performance. Future work could explore alternative training losses or embedding spaces for improved alignment. zero-shot detection, object detection, clip embedding space, image labels, data augmentation
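A hedged sketch of the alignment idea behind this entry: once the detector head emits embedding vectors, classification reduces to cosine similarity against CLIP text embeddings of arbitrary class names, including ones never seen in the detection data. The tensor shapes and the temperature value are assumptions; in practice the embeddings would come from the detector and a CLIP text encoder.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(box_embeddings: torch.Tensor,
                       class_text_embeddings: torch.Tensor,
                       temperature: float = 0.01):
    """Assign each predicted box to the closest class in CLIP space.

    box_embeddings:        (num_boxes, d) vectors predicted by the detector head.
    class_text_embeddings: (num_classes, d) CLIP text embeddings of class names,
                           which may include classes never seen during training.
    """
    boxes = F.normalize(box_embeddings, dim=-1)
    classes = F.normalize(class_text_embeddings, dim=-1)
    logits = boxes @ classes.t() / temperature   # scaled cosine similarity
    probs = logits.softmax(dim=-1)
    return probs.argmax(dim=-1), probs

# Toy usage with random stand-ins for real detector / CLIP outputs.
pred_labels, pred_probs = zero_shot_classify(torch.randn(5, 512), torch.randn(80, 512))
```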
2306.06684 Report Happy People -- Image Synthesis as Black-Box Optimization Problem in the Discrete Latent Space of Deep Generative Models Steffen Jung, Jan Christian Schwedhelm, Claudia Schillings, Margret Keuper In recent years, optimization in the learned latent space of deep generative models has been successfully applied to black-box optimization problems such as drug design, image generation, or neural architecture search. Existing models thereby leverage the ability of neural models to learn the data distribution from a limited number of samples such that new samples from the distribution can be drawn. In this work, we propose a novel image generative approach that optimizes the generated sample with respect to a continuously quantifiable property. While we anticipate absolutely no practically meaningful application for the proposed framework, it is theoretically principled and allows us to quickly propose samples at the mere boundary of the training data distribution. Specifically, we propose to use tree-based ensemble models as mathematical programs over the discrete latent space of vector quantized VAEs, which can be globally solved. Subsequent weighted retraining on these queries induces a distribution shift. Lacking a practically relevant problem, we consider a visually appealing application: the generation of happily smiling faces (where the training distribution contains only less happy people), and show the principled behavior of our approach in terms of improved FID and higher smile degree over baseline approaches. This paper presents a novel method for black-box optimization in the discrete latent space of VQ-VAEs, aiming to generate high-quality images with desired properties by optimizing a continuously quantifiable objective. This approach addresses the limitations of existing latent space optimization (LSO) methods, particularly in situations where the global optimum lies far from the training data distribution, by enabling efficient optimization in discrete latent spaces and inducing distribution shifts via weighted retraining. The methodology involves training a tree-based ensemble model as a surrogate for the black-box objective function in the discrete latent space. This model's predictions are then encoded as a mixed-integer optimization problem, solved globally to determine the next query point for image generation. The VQ-VAE is iteratively fine-tuned on the weighted data acquired during optimization, inducing a distribution shift towards the desired properties. The proposed method significantly outperforms continuous LSO with VAEs in a smiling face generation task, achieving higher objective function values. Weighted retraining is shown to effectively induce a distribution shift, leading to improved results compared to optimization without retraining. The use of a discrete latent space through VQ-VAE allows for the generation of higher-quality images compared to standard VAEs. The method is computationally more expensive than continuous LSO approaches due to the need for global optimization in the discrete latent space. The quality of generated images, although better than those from VAEs, can still be further improved, potentially by exploring more sophisticated generative models or optimization strategies. Future work could explore the application of this method to other domains beyond image synthesis, such as drug discovery or neural architecture search. black-box optimization, latent space optimization, vq-vae, image synthesis, distribution shift
2306.06638 Report Face0: Instantaneously Conditioning a Text-to-Image Model on a Face Dani Valevski, Danny Wasserman, Yossi Matias, Yaniv Leviathan We present Face0, a novel way to instantaneously condition a text-to-image generation model on a face, in sample time, without any optimization procedures such as fine-tuning or inversions. We augment a dataset of annotated images with embeddings of the included faces and train an image generation model, on the augmented dataset. Once trained, our system is practically identical at inference time to the underlying base model, and is therefore able to generate images, given a user-supplied face image and a prompt, in just a couple of seconds. Our method achieves pleasing results, is remarkably simple, extremely fast, and equips the underlying model with new capabilities, like controlling the generated images both via text or via direct manipulation of the input face embeddings. In addition, when using a fixed random vector instead of a face embedding from a user supplied image, our method essentially solves the problem of consistent character generation across images. Finally, while requiring further research, we hope that our method, which decouples the model's textual biases from its biases on faces, might be a step towards some mitigation of biases in future text-to-image models. Presents Face0, a novel method for instantaneously conditioning a text-to-image model on a face without fine-tuning or inversions. Addresses the challenge of generating images depicting a person from a user-supplied image instantly and efficiently. Augments a dataset with face embeddings, trains a projection module to map embeddings to CLIP space, and jointly fine-tunes a diffusion model (Stable Diffusion) on text and projected embeddings. Enables instant generation of images resembling a person from a single photo. Allows control over generated faces through text prompts and direct manipulation of face embeddings. Facilitates consistent character generation across multiple images using fixed random embedding vectors. May not perfectly preserve provided identities, sometimes generating "look-alike" characters. Relies on a face embedding mechanism that primarily fixes pose and expression, limiting control over these aspects. text-to-image synthesis, face embedding, diffusion models, stable diffusion, personalized image generation
2306.06577 Report Semantically-aware Mask CycleGAN for Translating Artistic Portraits to Photo-realistic Visualizations Zhuohao Yin Image-to-image translation (I2I) is defined as a computer vision task where the aim is to transfer images in a source domain to a target domain with minimal loss or alteration of the content representations. Major progress has been made since I2I was proposed, with the invention of a variety of revolutionary generative models. Among them, GAN-based models perform exceptionally well as they are mostly tailor-made for specific domains or tasks. However, few works have proposed a tailor-made method for the artistic domain. In this project, I propose the Semantic-aware Mask CycleGAN (SMCycleGAN) architecture, which can translate artistic portraits to photo-realistic visualizations. This model can generate realistic human portraits by feeding the discriminators semantically masked fake samples, thus forcing them to make discriminative decisions with partial information so that the generators can be optimized to synthesize more realistic human portraits instead of increasing the similarity of other irrelevant components, such as the background. Experiments have shown that SMCycleGAN generates images with significantly increased realism and minimal loss of content representations. This paper proposes Semantic-aware Mask CycleGAN (SMCycleGAN) to translate artistic portraits to photo-realistic visualizations. This work aims to restore the realistic appearances of subjects in art portraits, bridging the gap between painted and photorealistic representations. SMCycleGAN utilizes semantic segmentation to mask generated images, focusing the discriminator on human subjects and improving realism. SMCycleGAN generates portraits with high realism, adjusting skin color and smoothing textures. It reduces artifacts and maintains facial details better than baseline models like CycleGAN and Art2Real. Quantitative evaluation using Fréchet Inception Distance shows SMCycleGAN generates images closest to realistic portraits. The model struggles with diverse ethnicities due to training data imbalance. Highly abstract or artifact-heavy artworks pose challenges for realistic image generation. image-to-image translation, generative adversarial networks, cyclegan, semantic segmentation, art portrait
2306.06513 Report Learning Image-Adaptive Codebooks for Class-Agnostic Image Restoration Kechun Liu, Yitong Jiang, Inchang Choi, Jinwei Gu Recent work on discrete generative priors, in the form of codebooks, has shown exciting performance for image reconstruction and restoration, as the discrete prior space spanned by the codebooks increases the robustness against diverse image degradations. Nevertheless, these methods require separate training of codebooks for different image categories, which limits their use to specific image categories only (e.g. face, architecture, etc.), and fail to handle arbitrary natural images. In this paper, we propose AdaCode for learning image-adaptive codebooks for class-agnostic image restoration. Instead of learning a single codebook for each image category, we learn a set of basis codebooks. For a given input image, AdaCode learns a weight map with which we compute a weighted combination of these basis codebooks for adaptive image restoration. Intuitively, AdaCode is a more flexible and expressive discrete generative prior than previous work. Experimental results demonstrate that AdaCode achieves state-of-the-art performance on image reconstruction and restoration tasks, including image super-resolution and inpainting. This paper presents AdaCode, a novel image-adaptive codebook learning method for class-agnostic image restoration. Existing methods using discrete generative priors (codebooks) often require separate training for different image categories, limiting their applicability to arbitrary natural images. AdaCode learns a set of basis codebooks, each trained on a specific image category. For a given image, it then learns a weight map to combine these basis codebooks into an image-adaptive representation for improved restoration. AdaCode achieves state-of-the-art performance on image reconstruction, outperforming methods using single general codebooks or merged codebooks. AdaCode demonstrates superior performance in super-resolution tasks compared to existing methods, showing better detail preservation and fewer artifacts. AdaCode exhibits state-of-the-art results in image inpainting, effectively recovering missing regions with high fidelity across various scenes. The optimal number of basis codebooks and code entries per codebook requires further investigation. The current method could benefit from incorporating high-level semantic information for improved restoration. image restoration, generative priors, codebook learning, class-agnostic, image super-resolution, image inpainting
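A rough sketch of AdaCode's central operation as described above: quantize an encoder feature against each basis codebook and mix the results with a predicted per-position weight map. The shapes, the softmax over codebooks, and the nearest-neighbour lookup are assumptions made for illustration; the released code should be treated as authoritative.

```python
import torch

def adaptive_quantize(z, codebooks, weight_logits):
    """Weighted combination of per-codebook quantizations (illustrative sketch).

    z:             (B, D, H, W) encoder features.
    codebooks:     (K, N, D) K basis codebooks with N entries each.
    weight_logits: (B, K, H, W) predicted per-position mixing weights.
    """
    B, D, H, W = z.shape
    K, N, _ = codebooks.shape
    flat = z.permute(0, 2, 3, 1).reshape(-1, D)                # (B*H*W, D)
    weights = weight_logits.softmax(dim=1)                     # normalize over codebooks

    quantized = torch.zeros(B, D, H, W, device=z.device)
    for k in range(K):
        dist = torch.cdist(flat, codebooks[k])                 # (B*H*W, N)
        idx = dist.argmin(dim=-1)                              # nearest codebook entry
        z_q = codebooks[k][idx].reshape(B, H, W, D).permute(0, 3, 1, 2)
        quantized = quantized + weights[:, k:k+1] * z_q        # broadcast over channels
    return quantized
```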
2306.06189 Report FasterViT: Fast Vision Transformers with Hierarchical Attention Ali Hatamizadeh, Greg Heinrich, Hongxu Yin, Andrew Tao, Jose M. Alvarez, Jan Kautz, Pavlo Molchanov We design a new family of hybrid CNN-ViT neural networks, named FasterViT, with a focus on high image throughput for computer vision (CV) applications. FasterViT combines the benefits of fast local representation learning in CNNs and global modeling properties in ViT. Our newly introduced Hierarchical Attention (HAT) approach decomposes global self-attention with quadratic complexity into a multi-level attention with reduced computational costs. We benefit from efficient window-based self-attention. Each window has access to dedicated carrier tokens that participate in local and global representation learning. At a high level, global self-attentions enable the efficient cross-window communication at lower costs. FasterViT achieves a SOTA Pareto-front in terms of accuracy and image throughput. We have extensively validated its effectiveness on various CV tasks including classification, object detection and segmentation. We also show that HAT can be used as a plug-and-play module for existing networks and enhance them. We further demonstrate significantly faster and more accurate performance than competitive counterparts for images with high resolution. Code is available at https://github.com/NVlabs/FasterViT. This paper proposes FasterViT, a novel hybrid CNN-ViT architecture designed for high image throughput in computer vision tasks. It leverages a new Hierarchical Attention (HAT) approach to reduce the computational cost of global self-attention while maintaining accuracy. FasterViT addresses the limitations of ViTs in terms of computational complexity and the need for efficient global modeling, particularly for high-resolution images, making it suitable for real-world applications requiring fast image processing. The methodology involves combining CNNs for early-stage feature extraction with HAT-based transformer blocks in later stages. HAT decomposes global self-attention into a multi-level approach using learnable carrier tokens to summarize local windows and facilitate efficient cross-window communication. FasterViT achieves state-of-the-art performance in terms of image throughput and accuracy trade-off on ImageNet-1k classification, outperforming both convolutional and transformer-based counterparts. It demonstrates competitive performance on dense prediction tasks like object detection, instance segmentation (MS COCO), and semantic segmentation (ADE20K). The effectiveness of HAT as a plug-and-play module is demonstrated by its ability to enhance the performance of existing architectures like Swin-T on various tasks with minimal overhead. The paper acknowledges the potential for further exploration of joint optimization with acceleration methods like compression. Further investigation into the scalability of HAT to even higher image resolutions and its impact on performance is left for future work. vision transformer, cnn, hierarchical attention, image throughput, computer vision
2306.06093 Report HyP-NeRF: Learning Improved NeRF Priors using a HyperNetwork Bipasha Sen, Gaurav Singh, Aditya Agarwal, Rohith Agaram, K Madhava Krishna, Srinath Sridhar Neural Radiance Fields (NeRF) have become an increasingly popular representation to capture high-quality appearance and shape of scenes and objects. However, learning generalizable NeRF priors over categories of scenes or objects has been challenging due to the high dimensionality of network weight space. To address the limitations of existing work on generalization and multi-view consistency, and to improve quality, we propose HyP-NeRF, a latent conditioning method for learning generalizable category-level NeRF priors using hypernetworks. Rather than using hypernetworks to estimate only the weights of a NeRF, we estimate both the weights and the multi-resolution hash encodings, resulting in significant quality gains. To improve quality even further, we incorporate a denoise and finetune strategy that denoises images rendered from NeRFs estimated by the hypernetwork and finetunes it while retaining multiview consistency. These improvements enable us to use HyP-NeRF as a generalizable prior for multiple downstream tasks including NeRF reconstruction from single-view or cluttered scenes and text-to-NeRF. We provide qualitative comparisons and evaluate HyP-NeRF on three tasks: generalization, compression, and retrieval, demonstrating our state-of-the-art results. HyP-NeRF is a latent conditioning method that learns generalizable, category-level NeRF priors using hypernetworks. It generates both instance-specific multi-resolution hash encodings and neural network weights, significantly improving quality. It also employs a denoising and finetuning strategy for further improvement, enabling applications like single-view NeRF reconstruction, text-to-NeRF, and reconstruction from cluttered scenes. Existing methods struggle to learn generalizable NeRF priors due to the high dimensionality of network weight space, often resulting in lower quality or inconsistent representations. This work addresses these limitations by combining the advantages of instance-specific representations with the generalization capabilities of hypernetworks. HyP-NeRF utilizes a two-step process: 1) A hypernetwork is trained to predict both the multi-resolution hash encodings and weights of a NeRF model, conditioned on an instance code. 2) A denoising network improves the rendered views, and the NeRF is finetuned using these enhanced images to achieve higher quality while retaining multiview consistency. HyP-NeRF significantly outperforms baselines like PixelNeRF in single-view novel NeRF generation, achieving higher PSNR, SSIM, and lower LPIPS and FID scores. It demonstrates effective compression by learning from thousands of NeRF instances, achieving 60x compression gain compared to instance-specific methods with minimal quality degradation. The learned prior enables retrieval of novel NeRFs from various modalities like single-view images, segmented images, and text prompts, showcasing its generalizability. Test-time optimization requires known poses, limiting its application in unposed image scenarios. The learned prior is non-standard, making unconditional generation challenging and requiring mapping from known distributions. neural radiance fields, nerf, hypernetworks, generative models, 3d reconstruction
2306.06092 Report Realistic Saliency Guided Image Enhancement S. Mahdi H. Miangoleh, Zoya Bylinskii, Eric Kee, Eli Shechtman, Yağız Aksoy Common editing operations performed by professional photographers include the cleanup operations: de-emphasizing distracting elements and enhancing subjects. These edits are challenging, requiring a delicate balance between manipulating the viewer's attention while maintaining photo realism. While recent approaches can boast successful examples of attention attenuation or amplification, most of them also suffer from frequent unrealistic edits. We propose a realism loss for saliency-guided image enhancement to maintain high realism across varying image types, while attenuating distractors and amplifying objects of interest. Evaluations with professional photographers confirm that we achieve the dual objective of realism and effectiveness, and outperform the recent approaches on their own datasets, while requiring a smaller memory footprint and runtime. We thus offer a viable solution for automating image enhancement and photo cleanup operations. This paper introduces a novel saliency-guided image enhancement method that leverages a realism loss to maintain photorealism while attenuating distracting elements or enhancing subjects in an image. This work addresses the limitations of existing saliency-guided image editing techniques, which often produce unrealistic edits. It offers a viable solution for automating image enhancement and photo cleanup operations while preserving realism. The method utilizes a realism network trained on a dataset of realistic and unrealistic edits. This network learns to estimate the realism of local image edits. This realism score is then incorporated into a saliency-guided image editing pipeline that optimizes for both saliency and realism. The method outperforms state-of-the-art approaches in terms of realism and effectiveness, as confirmed by evaluations with professional photographers. The realism network effectively learns a continuous measure of realism for various editing operations despite being trained on binary data (realistic vs. unrealistic). The approach generalizes well to multiple image regions and masks, allowing for iterative editing. The reliance on global edits within a mask can lead to artifacts at mask boundaries, especially with imperfect masks. Future work could explore incorporating pixel-wise optimization to address boundary artifacts and further enhance realism. image enhancement, saliency detection, realism estimation, photo editing, deep learning
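A hedged sketch of the dual objective described above: push saliency inside the edit mask up (or down for distractor attenuation) while a learned realism score penalizes implausible edits. `saliency_net` and `realism_net` are placeholders standing in for the paper's pretrained models, and the loss weighting is an assumption.

```python
import torch

def editing_objective(edited_image, mask, saliency_net, realism_net,
                      boost=True, realism_weight=1.0):
    """Combined objective for saliency-guided editing with a realism penalty (sketch).

    edited_image: (B, 3, H, W) current edited image.
    mask:         (B, 1, H, W) binary region being amplified or attenuated.
    """
    sal = saliency_net(edited_image)                          # (B, 1, H, W) saliency map
    region_sal = (sal * mask).sum() / mask.sum().clamp(min=1.0)
    saliency_loss = -region_sal if boost else region_sal      # raise or suppress attention
    realism_loss = -realism_net(edited_image, mask).mean()    # higher score = more realistic
    return saliency_loss + realism_weight * realism_loss
```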
2306.05720 Report Beyond Surface Statistics: Scene Representations in a Latent Diffusion Model Yida Chen, Fernanda Viégas, Martin Wattenberg Latent diffusion models (LDMs) exhibit an impressive ability to produce realistic images, yet the inner workings of these models remain mysterious. Even when trained purely on images without explicit depth information, they typically output coherent pictures of 3D scenes. In this work, we investigate a basic interpretability question: does an LDM create and use an internal representation of simple scene geometry? Using linear probes, we find evidence that the internal activations of the LDM encode linear representations of both 3D depth data and a salient-object / background distinction. These representations appear surprisingly early in the denoising process, well before a human can easily make sense of the noisy images. Intervention experiments further indicate these representations play a causal role in image synthesis, and may be used for simple high-level editing of an LDM's output. Project page: https://yc015.github.io/scene-representation-diffusion-model/ This paper investigates whether Latent Diffusion Models (LDMs) develop internal representations of 3D scene geometry despite being trained only on 2D images. Understanding how LDMs generate realistic images, particularly the emergence of 3D understanding, is crucial for interpretability and potential applications in image editing. The authors use linear probes to analyze the internal activations of a pre-trained LDM (Stable Diffusion) and conduct intervention experiments to study the causal role of these representations. LDMs encode linear representations of both continuous depth maps and a salient object/background distinction. These representations appear early in the denoising process, well before a human can perceive coherent structures in the noisy images. Intervention experiments demonstrate a causal relationship between these internal representations and the final output image, enabling simple high-level editing. The study primarily focuses on a single LDM (Stable Diffusion) and a limited set of scene attributes. Future work could explore the representation of other scene attributes like lighting, texture, and semantic aspects. latent diffusion models, interpretability, 3d scene understanding, linear probing, causal intervention
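A small sketch of the linear-probing setup: given activations hooked from one U-Net layer at a fixed denoising step (extraction code not shown), a 1x1 convolution acts as a per-pixel linear probe for depth. The layer choice, training details, and use of MSE are assumptions for illustration.

```python
import torch
import torch.nn as nn

def train_depth_probe(activations, depth_maps, epochs=100, lr=1e-3):
    """Fit a per-pixel linear probe from intermediate diffusion activations to depth.

    activations: (num_samples, C, H, W) features hooked from one U-Net layer
                 at a fixed denoising step.
    depth_maps:  (num_samples, 1, H, W) pseudo ground-truth depth at the same resolution.
    """
    C = activations.shape[1]
    probe = nn.Conv2d(C, 1, kernel_size=1)      # a 1x1 conv == per-pixel linear map
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        pred = probe(activations)
        loss = nn.functional.mse_loss(pred, depth_maps)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return probe
```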
2306.05544 Report BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Lingjie Liu, Josh Susskind Diffusion models have demonstrated excellent potential for generating diverse images. However, their performance often suffers from slow generation due to iterative denoising. Knowledge distillation has been recently proposed as a remedy that can reduce the number of inference steps to one or a few without significant quality degradation. However, existing distillation methods either require significant amounts of offline computation for generating synthetic training data from the teacher model or need to perform expensive online learning with the help of real data. In this work, we present a novel technique called BOOT, that overcomes these limitations with an efficient data-free distillation algorithm. The core idea is to learn a time-conditioned model that predicts the output of a pre-trained diffusion model teacher given any time step. Such a model can be efficiently trained based on bootstrapping from two consecutive sampled steps. Furthermore, our method can be easily adapted to large-scale text-to-image diffusion models, which are challenging for conventional methods given the fact that the training sets are often large and difficult to access. We demonstrate the effectiveness of our approach on several benchmark datasets in the DDIM setting, achieving comparable generation quality while being orders of magnitude faster than the diffusion teacher. The text-to-image results show that the proposed approach is able to handle highly complex distributions, shedding light on more efficient generative modeling. This paper proposes BOOT, a data-free knowledge distillation method to boost the inference speed of diffusion models by distilling them into single-step models. Diffusion models excel in generating diverse images but suffer from slow generation speed due to iterative denoising. Existing distillation methods either demand extensive offline computation or require real data for online learning, making them impractical for large models. BOOT learns a time-conditioned model to predict the output of a pre-trained diffusion model for any given time-step. It utilizes a novel signal-ODE derived from the original probability-flow ODE for efficient training based on bootstrapping from consecutive sampled steps. This eliminates the reliance on real data during distillation. BOOT achieves comparable image generation quality to multi-step diffusion models (around 10 steps) with a 10x speedup on standard benchmarks (FFHQ, LSUN, ImageNet). It successfully distills large-scale text-to-image models like Stable Diffusion and DeepFloyd IF, maintaining good generation quality with significant speed improvements. The method enables controllable generation by interpolating in the learned latent space or modifying text prompts while keeping the noise input fixed. BOOT's sampling quality depends on the pre-trained teacher model and might be lower than data-dependent distillation methods. The current implementation uses a similar architecture for the student and teacher models. Exploring different architectures could further improve performance. knowledge distillation, diffusion models, generative models, text-to-image generation, fast inference
2306.05493 Report Multi-Modal Classifiers for Open-Vocabulary Object Detection Prannay Kaul, Weidi Xie, Andrew Zisserman The goal of this paper is open-vocabulary object detection (OVOD): building a model that can detect objects beyond the set of categories seen at training, thus enabling the user to specify categories of interest at inference without the need for model retraining. We adopt a standard two-stage object detector architecture, and explore three ways for specifying novel categories: via language descriptions, via image exemplars, or via a combination of the two. We make three contributions: first, we prompt a large language model (LLM) to generate informative language descriptions for object classes, and construct powerful text-based classifiers; second, we employ a visual aggregator on image exemplars that can ingest any number of images as input, forming vision-based classifiers; and third, we provide a simple method to fuse information from language descriptions and image exemplars, yielding a multi-modal classifier. When evaluating on the challenging LVIS open-vocabulary benchmark, we demonstrate that: (i) our text-based classifiers outperform all previous OVOD works; (ii) our vision-based classifiers perform as well as text-based classifiers in prior work; (iii) multi-modal classifiers perform better than either modality alone; and finally, (iv) our text-based and multi-modal classifiers yield better performance than a fully-supervised detector. This paper presents a multi-modal open-vocabulary object detector that uses language descriptions, image exemplars, or a combination of both to specify novel categories. This is important because it enables users to specify categories of interest at inference time without retraining the model, overcoming limitations of previous methods that rely solely on class names. The authors propose three methods: 1) prompting an LLM for rich category descriptions to generate text-based classifiers, 2) using a visual aggregator on image exemplars to form vision-based classifiers, and 3) fusing both language and image information for multi-modal classifiers. Text-based classifiers using LLM descriptions outperform previous OVOD methods. Vision-based classifiers achieve comparable performance to prior text-based methods. Multi-modal classifiers outperform single-modality methods and even a fully-supervised detector on LVIS. Vision-based classifiers still lag behind text-based classifiers, requiring further research. Exploration of more sophisticated multi-modal fusion techniques could further enhance performance. open-vocabulary object detection, multi-modal learning, large language models, vision-language models, zero-shot learning
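A hedged sketch of the classifier construction: average CLIP embeddings of LLM-generated descriptions per class, average image-exemplar embeddings per class, and fuse the two into a single classifier matrix. The plain averaging of exemplars and the convex fusion weight are simplifications; the paper uses a learned visual aggregator.

```python
import torch
import torch.nn.functional as F

def build_classifiers(description_embeddings, exemplar_embeddings, alpha=0.5):
    """Fuse per-class text and vision embeddings into a single classifier matrix.

    description_embeddings: (num_classes, num_descriptions, d) CLIP text embeddings
                            of LLM-generated descriptions for each class.
    exemplar_embeddings:    (num_classes, num_exemplars, d) CLIP image embeddings of
                            image exemplars (averaged here as a simplifying assumption).
    """
    text_cls = F.normalize(description_embeddings.mean(dim=1), dim=-1)
    vision_cls = F.normalize(exemplar_embeddings.mean(dim=1), dim=-1)
    fused = F.normalize(alpha * text_cls + (1 - alpha) * vision_cls, dim=-1)
    return fused   # (num_classes, d): dot with region embeddings to score detections
```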
2306.05427 Report Grounded Text-to-Image Synthesis with Attention Refocusing Quynh Phung, Songwei Ge, Jia-Bin Huang Driven by the scalable diffusion models trained on large-scale datasets, text-to-image synthesis methods have shown compelling results. However, these models still fail to precisely follow the text prompt involving multiple objects, attributes, or spatial compositions. In this paper, we reveal the potential causes in the diffusion model's cross-attention and self-attention layers. We propose two novel losses to refocus attention maps according to a given spatial layout during sampling. Creating the layouts manually requires additional effort and can be tedious. Therefore, we explore using large language models (LLM) to produce these layouts for our method. We conduct extensive experiments on the DrawBench, HRS, and TIFA benchmarks to evaluate our proposed method. We show that our proposed attention refocusing effectively improves the controllability of existing approaches. This paper introduces an attention-refocusing approach to enhance the controllability of layout-conditioned text-to-image synthesis using diffusion models by regulating both cross- and self-attention layers during sampling, guided by explicit layout representations. Existing text-to-image models often struggle to accurately represent the spatial relationships, quantities, and attributes of multiple objects described in text prompts. This work aims to improve the controllability of these models, allowing for more accurate and user-intended image generation. The proposed method uses a two-stage pipeline: 1) text-to-layout: utilize a Large Language Model (LLM) like GPT-4 to generate object bounding boxes from the text prompt. 2) grounded text-to-image generation: introduce Cross-Attention Refocusing (CAR) and Self-Attention Refocusing (SAR) losses to guide the diffusion model's attention towards the desired regions within the generated layout during the sampling process. The proposed attention-refocusing method consistently improves performance across various text-to-image models and benchmarks, including HRS, DrawBench, and TIFA, particularly in spatial accuracy and object counting. Utilizing LLMs like GPT-4 for layout generation demonstrates strong spatial reasoning ability and allows for flexible integration with existing text-to-image models without requiring retraining. The framework enables novel capabilities such as chatGPT-based iterative image refinement by instructing layout modifications. The LLM-based layout generation may still struggle with complex prompts involving a large number of objects or unusual spatial compositions. The grounded text-to-image model may not always perfectly adhere to out-of-distribution layouts generated by the LLM, requiring further research in layout-conditional generation. text-to-image synthesis, diffusion models, attention mechanisms, layout generation, large language models
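A rough sketch of a layout-guided attention loss in the spirit of the cross-attention refocusing described above: each object token is encouraged to place its attention mass inside its box and is penalized for mass outside. The exact loss terms in the paper may differ; this formulation is an assumption.

```python
import torch

def attention_refocusing_loss(cross_attn, box_masks):
    """Encourage each object token to attend inside its box and not outside.

    cross_attn: (num_tokens, H, W) cross-attention maps for the object tokens.
    box_masks:  (num_tokens, H, W) binary layout masks (1 inside the box).
    """
    # Normalize each token's attention map to sum to 1.
    attn = cross_attn / cross_attn.flatten(1).sum(dim=-1, keepdim=True).view(-1, 1, 1)
    inside = (attn * box_masks).flatten(1).sum(dim=-1)
    outside = (attn * (1 - box_masks)).flatten(1).sum(dim=-1)
    # Perfect focus gives inside=1, outside=0 for every token.
    return ((1 - inside) ** 2 + outside ** 2).mean()
```

During sampling, the gradient of such a loss with respect to the noisy latent is what nudges the latent toward the layout at each denoising step.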
2306.05422 Report Tracking Everything Everywhere All at Once Qianqian Wang, Yen-Yu Chang, Ruojin Cai, Zhengqi Li, Bharath Hariharan, Aleksander Holynski, Noah Snavely We present a new test-time optimization method for estimating dense and long-range motion from a video sequence. Prior optical flow or particle video tracking algorithms typically operate within limited temporal windows, struggling to track through occlusions and maintain global consistency of estimated motion trajectories. We propose a complete and globally consistent motion representation, dubbed OmniMotion, that allows for accurate, full-length motion estimation of every pixel in a video. OmniMotion represents a video using a quasi-3D canonical volume and performs pixel-wise tracking via bijections between local and canonical space. This representation allows us to ensure global consistency, track through occlusions, and model any combination of camera and object motion. Extensive evaluations on the TAP-Vid benchmark and real-world footage show that our approach outperforms prior state-of-the-art methods by a large margin both quantitatively and qualitatively. See our project page for more results: http://omnimotion.github.io/ Proposes OmniMotion, a test-time optimization method using a quasi-3D representation for estimating dense, long-range, globally consistent motion trajectories in videos, even through occlusions. Existing methods struggle to estimate both dense and long-range pixel trajectories accurately and consistently, especially through occlusions. Represents a video as a canonical 3D volume and per-frame local volumes, connected by learned bijections. Optimizes the representation using noisy pairwise correspondences (e.g., optical flow) and photometric consistency. Achieves state-of-the-art performance on the TAP-Vid benchmark, outperforming previous methods in position accuracy, occlusion accuracy, and temporal coherence. Successfully tracks points through long occlusions and provides plausible locations even during occlusion. Demonstrates robustness to varying camera setups and scene dynamics. Struggles with rapid, highly non-rigid motions and thin structures due to reliance on reliable pairwise correspondence input. Can be computationally expensive, particularly in the flow collection and optimization stages. motion estimation, dense correspondence, occlusion handling, video representation, test-time optimization
2306.05414 Report Improving Tuning-Free Real Image Editing with Proximal Guidance Ligong Han, Song Wen, Qi Chen, Zhixing Zhang, Kunpeng Song, Mengwei Ren, Ruijiang Gao, Anastasis Stathopoulos, Xiaoxiao He, Yuxiao Chen, Di Liu, Qilong Zhangli, Jindong Jiang, Zhaoyang Xia, Akash Srivastava, Dimitris Metaxas DDIM inversion has revealed the remarkable potential of real image editing within diffusion-based methods. However, the accuracy of DDIM reconstruction degrades as larger classifier-free guidance (CFG) scales being used for enhanced editing. Null-text inversion (NTI) optimizes null embeddings to align the reconstruction and inversion trajectories with larger CFG scales, enabling real image editing with cross-attention control. Negative-prompt inversion (NPI) further offers a training-free closed-form solution of NTI. However, it may introduce artifacts and is still constrained by DDIM reconstruction quality. To overcome these limitations, we propose proximal guidance and incorporate it to NPI with cross-attention control. We enhance NPI with a regularization term and reconstruction guidance, which reduces artifacts while capitalizing on its training-free nature. Additionally, we extend the concepts to incorporate mutual self-attention control, enabling geometry and layout alterations in the editing process. Our method provides an efficient and straightforward approach, effectively addressing real image editing tasks with minimal computational overhead. This paper introduces "proximal guidance," a technique for improving tuning-free real image editing in diffusion models. It enhances both Negative-Prompt Inversion (NPI) and Mutual Self-Attention Control by regularizing the editing process and optionally aligning with inversion latents. Existing methods for real image editing with diffusion models often struggle with identity preservation or require time-consuming per-image optimization. This method addresses these limitations, enabling high-quality edits with minimal computational overhead. The proposed method incorporates a proximal function, akin to proximal gradient methods, to constrain the noise difference between target and source prompts during image generation. It optionally uses inversion guidance, performing a gradient descent step towards the inversion latent to further refine the editing process. Proximal guidance enhances NPI, achieving better reconstruction and editing quality compared to NTI and baseline NPI. When applied to Mutual Self-Attention Control, it improves stability and preserves desired details, addressing limitations of direct NPI integration. The method allows for simultaneous editing of both texture and geometry by sequentially applying proximal guidance to NPI and MasaCtrl. The performance of proximal guidance can be sensitive to hyperparameters like threshold and step size, necessitating careful tuning. Future work could explore heuristics or automated methods for hyperparameter selection, improving usability and generalizability. image editing, diffusion models, negative-prompt inversion, mutual self-attention control, proximal guidance
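A hedged sketch of the proximal step: the difference between target- and source-prompt noise predictions is passed through a thresholding operator (the soft version is the proximal operator of the L1 norm), so only edit components above a threshold survive. The threshold value and the hard/soft switch are illustrative assumptions.

```python
import torch

def proximal_guidance(eps_source, eps_target, threshold=0.5, hard=False):
    """Regularize the edit direction: keep only components of the noise difference
    that exceed a threshold (soft or hard), then add them back to the source noise."""
    diff = eps_target - eps_source
    if hard:
        diff = torch.where(diff.abs() >= threshold, diff, torch.zeros_like(diff))
    else:
        # Soft-thresholding, i.e. the proximal operator of the L1 norm.
        diff = torch.sign(diff) * torch.clamp(diff.abs() - threshold, min=0.0)
    return eps_source + diff
```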
2306.05410 Report LU-NeRF: Scene and Pose Estimation by Synchronizing Local Unposed NeRFs Zezhou Cheng, Carlos Esteves, Varun Jampani, Abhishek Kar, Subhransu Maji, Ameesh Makadia A critical obstacle preventing NeRF models from being deployed broadly in the wild is their reliance on accurate camera poses. Consequently, there is growing interest in extending NeRF models to jointly optimize camera poses and scene representation, which offers an alternative to off-the-shelf SfM pipelines which have well-understood failure modes. Existing approaches for unposed NeRF operate under limited assumptions, such as a prior pose distribution or coarse pose initialization, making them less effective in a general setting. In this work, we propose a novel approach, LU-NeRF, that jointly estimates camera poses and neural radiance fields with relaxed assumptions on pose configuration. Our approach operates in a local-to-global manner, where we first optimize over local subsets of the data, dubbed mini-scenes. LU-NeRF estimates local pose and geometry for this challenging few-shot task. The mini-scene poses are brought into a global reference frame through a robust pose synchronization step, where a final global optimization of pose and scene can be performed. We show our LU-NeRF pipeline outperforms prior attempts at unposed NeRF without making restrictive assumptions on the pose prior. This allows us to operate in the general SE(3) pose setting, unlike the baselines. Our results also indicate our model can be complementary to feature-based SfM pipelines as it compares favorably to COLMAP on low-texture and low-resolution images. LU-NeRF: a novel local-to-global pipeline that jointly estimates camera poses in general configurations and neural radiance fields from unposed image sets. Existing NeRF models often rely on accurate camera poses from SfM pipelines, which can fail in challenging conditions like low-texture scenes. LU-NeRF aims to address this limitation by directly optimizing poses within the NeRF framework. LU-NeRF partitions the scene into mini-scenes, optimizing local poses and geometry for each. It utilizes a novel two-stage training process to address the mirror symmetry ambiguity inherent in few-shot unposed settings. These local estimations are then aligned into a global frame via robust pose synchronization, followed by joint refinement of global poses and scene representation. Outperforms prior unposed NeRF methods without relying on pose priors, operating in the general SE(3) pose setting. Demonstrates robustness to outliers in mini-scene construction. Shows complementarity to feature-based SfM (COLMAP) by achieving better pose estimation in low-texture or low-resolution settings. Computational cost is high, but can be potentially addressed by recent advances in neural rendering. Building reliable graphs for unordered image collections remains challenging and requires further investigation. neural radiance fields, pose estimation, structure from motion, 3d scene reconstruction, few-shot learning
2306.05399 Report Matting Anything Jiachen Li, Jitesh Jain, Humphrey Shi In this paper, we propose the Matting Anything Model (MAM), an efficient and versatile framework for estimating the alpha matte of any instance in an image with flexible and interactive visual or linguistic user prompt guidance. MAM offers several significant advantages over previous specialized image matting networks: (i) MAM is capable of dealing with various types of image matting, including semantic, instance, and referring image matting with only a single model; (ii) MAM leverages the feature maps from the Segment Anything Model (SAM) and adopts a lightweight Mask-to-Matte (M2M) module to predict the alpha matte through iterative refinement, which has only 2.7 million trainable parameters. (iii) By incorporating SAM, MAM simplifies the user intervention required for the interactive use of image matting from the trimap to the box, point, or text prompt. We evaluate the performance of MAM on various image matting benchmarks, and the experimental results demonstrate that MAM achieves comparable performance to the state-of-the-art specialized image matting models under different metrics on each benchmark. Overall, MAM shows superior generalization ability and can effectively handle various image matting tasks with fewer parameters, making it a practical solution for unified image matting. Our code and models are open-sourced at https://github.com/SHI-Labs/Matting-Anything. This paper introduces Matting Anything Model (MAM), a versatile framework for estimating alpha mattes of any instance in an image, utilizing flexible and interactive visual or linguistic user prompts. Existing image matting methods are often specialized for specific tasks and lack the flexibility to handle various scenarios with a single model. MAM addresses this limitation by providing a unified and efficient solution for different matting types. MAM leverages the Segment Anything Model (SAM) for instance segmentation and incorporates a lightweight Mask-to-Matte (M2M) module to refine SAM's binary masks into high-quality alpha mattes through multi-scale predictions and iterative refinement. MAM achieves comparable performance to state-of-the-art specialized models on various benchmarks for semantic, instance, and referring image matting. The model exhibits superior generalization ability and effectively handles different matting tasks with fewer parameters compared to previous approaches. MAM demonstrates significant improvements in refining alpha matte predictions, particularly in transition areas, without requiring trimap guidance. The performance of referring image matting using text prompts is significantly lower than using bounding box prompts. Future work could focus on improving text-guided matting performance and exploring more complex prompt engineering techniques. image matting, interactive segmentation, instance segmentation, referring image matting, segment anything model (sam)
2306.05390 Report HQ-50K: A Large-scale, High-quality Dataset for Image Restoration Qinhong Yang, Dongdong Chen, Zhentao Tan, Qiankun Liu, Qi Chu, Jianmin Bao, Lu Yuan, Gang Hua, Nenghai Yu This paper introduces a new large-scale image restoration dataset, called HQ-50K, which contains 50,000 high-quality images with rich texture details and semantic diversity. We analyze existing image restoration datasets from five different perspectives, including data scale, resolution, compression rates, texture details, and semantic coverage. However, we find that all of these datasets are deficient in some aspects. In contrast, HQ-50K considers all of these five aspects during the data curation process and meets all requirements. We also present a new Degradation-Aware Mixture of Expert (DAMoE) model, which enables a single model to handle multiple corruption types and unknown levels. Our extensive experiments demonstrate that HQ-50K consistently improves the performance on various image restoration tasks, such as super-resolution, denoising, dejpeg, and deraining. Furthermore, our proposed DAMoE, trained on our HQ-50K dataset, outperforms existing state-of-the-art unified models designed for multiple restoration tasks and levels. The dataset and code are available at https://github.com/littleYaang/HQ-50K. This paper introduces HQ-50K, a large-scale, high-quality dataset for image restoration containing 50,000 high-quality images, and proposes DAMoE, a Degradation-Aware Mixture of Expert model for unified image restoration. Existing image restoration datasets are limited in scale, resolution, compression rates, texture details, or semantic coverage, hindering the development of robust and generalizable restoration models. HQ-50K is curated by selecting high-quality images from the internet and existing datasets, considering five key aspects: scale, resolution, compression, texture details, and semantic coverage. DAMoE leverages Mixture of Expert (MoE) layers within a transformer-based architecture to handle various restoration tasks and degradation levels with shared modules and task/degradation-specific experts. HQ-50K consistently improves performance on various image restoration tasks, including super-resolution, denoising, dejpeg, and deraining. Models trained on HQ-50K demonstrate better generalization across different semantic categories of images. DAMoE, trained on HQ-50K, outperforms existing state-of-the-art unified models designed for multiple restoration tasks and levels. While larger than existing restoration datasets, HQ-50K is still smaller than datasets for high-level vision tasks, limiting its potential for training even larger models. Further research is needed to explore different MoE block designs and routing strategies for optimal performance in DAMoE. image restoration, dataset, deep learning, mixture of experts, unified model
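A minimal sketch of a mixture-of-experts feed-forward block with top-1 routing, the kind of layer DAMoE builds on; the number of experts, the router design, and the routing rule are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SimpleMoEFFN(nn.Module):
    """Token-wise mixture-of-experts feed-forward block with top-1 routing (sketch)."""
    def __init__(self, dim, hidden, num_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                          # x: (num_tokens, dim)
        gates = self.router(x).softmax(dim=-1)     # (num_tokens, num_experts)
        top_gate, top_idx = gates.max(dim=-1)      # pick one expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = top_idx == e
            if sel.any():
                out[sel] = top_gate[sel].unsqueeze(-1) * expert(x[sel])
        return out
```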
2306.05382 Report Image Blending Algorithm with Automatic Mask Generation Haochen Xue, Mingyu Jin, Chong Zhang, Yuxuan Huang, Qian Weng, Xiaobo Jin In recent years, image blending has gained popularity for its ability to create visually stunning content. However, current image blending algorithms have two main problems: manually creating image blending masks is labor- and resource-intensive, and existing algorithms cannot effectively resolve brightness distortion and low resolution. To this end, we propose a new image blending method with automatic mask generation: it combines semantic object detection and segmentation for mask generation, and uses our proposed saturation loss together with a two-stage iteration of the PAN algorithm to fix brightness distortion and low-resolution issues in the blended images. Results on publicly available datasets show that our method outperforms other classical image blending algorithms on various performance metrics, including PSNR and SSIM. This paper introduces an automatic two-stage image blending method utilizing DINO and SAM for mask generation and a novel saturation loss with PAN for enhanced image quality. Existing image blending algorithms suffer from manual mask creation effort, brightness distortion, and low resolution in blended images. This work aims to automate the process and address these quality issues. The method employs DINO for object detection and SAM for accurate mask generation. Erosion and dilation refine the mask. A two-stage blending process uses gradient, content, style, and a novel saturation loss, further enhanced by PAN for high-resolution output. The proposed automatic mask generation surpasses traditional RCNN in accuracy and generalizability. The introduction of saturation loss effectively mitigates brightness discrepancies at blended image seams. The method outperforms classic algorithms like GP-GAN and Poisson Blending in PSNR and SSIM, demonstrating superior visual quality. The evaluation currently relies on standard metrics (PSNR, SSIM, MSE), which may not perfectly capture human perception of image quality. Future research could focus on addressing challenges posed by object occlusion in image blending scenarios. image blending, mask generation, image segmentation, object detection, deep learning
2306.05356 Report ReliableSwap: Boosting General Face Swapping Via Reliable Supervision Ge Yuan, Maomao Li, Yong Zhang, Huicheng Zheng Almost all advanced face swapping approaches use reconstruction as the proxy task, i.e., supervision only exists when the target and source belong to the same person. Otherwise, lacking pixel-level supervision, these methods struggle with source identity preservation. This paper proposes to construct reliable supervision, dubbed cycle triplets, which serves as the image-level guidance when the source identity differs from the target one during training. Specifically, we use face reenactment and blending techniques to synthesize the swapped face from real images in advance, where the synthetic face preserves source identity and target attributes. However, there may be some artifacts in such a synthetic face. To avoid the potential artifacts and drive the distribution of the network output close to the natural one, we reverse the roles, taking synthetic images as input and using the real face as reliable supervision during the training stage of face swapping. Besides, we empirically find that existing methods tend to lose lower-face details like face shape and mouth from the source. This paper additionally designs a FixerNet, providing discriminative embeddings of lower faces as an enhancement. Our face swapping framework, named ReliableSwap, can boost the performance of any existing face swapping network with negligible overhead. Extensive experiments demonstrate the efficacy of our ReliableSwap, especially in identity preservation. The project page is https://reliable-swap.github.io/. The paper proposes ReliableSwap, a general face swapping framework that improves identity preservation in existing methods by using synthetically generated "cycle triplets" as reliable supervision during training and introducing a FixerNet to enhance lower face details. Existing face swapping methods often struggle to maintain the source identity, resulting in swapped faces that appear as an interpolation between the source and target. This is due to the lack of pixel-level supervision when the source and target identities differ. The method involves 1) synthesizing "naive triplets" of images using face reenactment and blending to have controlled identity and attributes, 2) constructing "cycle triplets" from these naive triplets to provide reliable supervision during training, and 3) incorporating a FixerNet that extracts discriminative features of the lower face to enhance detail preservation. ReliableSwap enhances identity preservation in swapped faces, as demonstrated by quantitative metrics and qualitative comparisons on datasets like FaceForensics++ and CelebA-HQ. The proposed cycle triplets effectively address the issue of lacking supervision for different identities during training. The FixerNet successfully improves the consistency of lower face details like mouth and face shape in swapped results. The face reenactment model used in constructing cycle triplets may not perfectly transfer pose, leading to potential artifacts. Limited evaluation on high-resolution images (512×512 or 1024×1024) due to the scarcity of publicly available training code for such methods. face swapping, identity preservation, cycle triplet, fixernet, reliable supervision
2306.05178 Report SyncDiffusion: Coherent Montage via Synchronized Joint Diffusions Yuseung Lee, Kunho Kim, Hyunjin Kim, Minhyuk Sung The remarkable capabilities of pretrained image diffusion models have been utilized not only for generating fixed-size images but also for creating panoramas. However, naive stitching of multiple images often results in visible seams. Recent techniques have attempted to address this issue by performing joint diffusions in multiple windows and averaging latent features in overlapping regions. However, these approaches, which focus on seamless montage generation, often yield incoherent outputs by blending different scenes within a single image. To overcome this limitation, we propose SyncDiffusion, a plug-and-play module that synchronizes multiple diffusions through gradient descent from a perceptual similarity loss. Specifically, we compute the gradient of the perceptual loss using the predicted denoised images at each denoising step, providing meaningful guidance for achieving coherent montages. Our experimental results demonstrate that our method produces significantly more coherent outputs compared to previous methods (66.35% vs. 33.65% in our user study) while still maintaining fidelity (as assessed by GIQA) and compatibility with the input prompt (as measured by CLIP score). We further demonstrate the versatility of our method across three plug-and-play applications: layout-guided image generation, conditional image generation and 360-degree panorama generation. Our project page is at https://syncdiffusion.github.io. This paper proposes SyncDiffusion, a plug-and-play module for synchronizing multiple diffusions to enhance global coherence in image montage generation, particularly panoramas. Existing panorama generation methods using diffusion models often result in either visible seams or incoherent outputs, lacking global semantic consistency. SyncDiffusion utilizes gradient descent from a perceptual similarity loss (LPIPS or Style Loss) calculated between predicted denoised images at each denoising step to synchronize multiple diffusions across different image regions. SyncDiffusion produces significantly more coherent panoramas compared to previous methods, as demonstrated by lower Intra-LPIPS and Intra-Style-L scores. The method maintains fidelity to the input prompt (Mean-CLIP-S) and image quality (Mean-GIQA), comparable to single-image generations. User studies confirm a strong preference for SyncDiffusion (66.35%) over previous methods (33.65%) in terms of coherence, image quality, and prompt compatibility. Generating realistic panoramas relies on suitable input prompts. The gradient descent computation in SyncDiffusion introduces additional computational overhead. image generation, diffusion models, panorama generation, coherence, perceptual similarity
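A hedged sketch of one synchronization step: recover the predicted clean image from the current latent via the standard diffusion identity, measure a perceptual distance to an anchor window's prediction, and take a gradient step on the window latent. `perceptual_loss` is a placeholder (the paper uses LPIPS or a style loss), and for brevity the noise prediction is treated as fixed rather than backpropagated through the denoiser.

```python
import torch

def predict_x0(x_t, eps, alpha_bar_t):
    """Standard diffusion identity: recover the predicted clean image from x_t and eps."""
    return (x_t - (1 - alpha_bar_t) ** 0.5 * eps) / alpha_bar_t ** 0.5

def sync_step(x_t_window, eps_window, x0_anchor, alpha_bar_t, perceptual_loss, step_size=20.0):
    """Nudge one window's latent so its predicted denoised image matches the anchor window."""
    x_t = x_t_window.detach().requires_grad_(True)
    x0_pred = predict_x0(x_t, eps_window, alpha_bar_t)
    loss = perceptual_loss(x0_pred, x0_anchor)
    grad, = torch.autograd.grad(loss, x_t)
    return (x_t - step_size * grad).detach()
```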
2306.04988 Report StreetSurf: Extending Multi-view Implicit Surface Reconstruction to Street Views Jianfei Guo, Nianchen Deng, Xinyang Li, Yeqi Bai, Botian Shi, Chiyu Wang, Chenjing Ding, Dongliang Wang, Yikang Li We present a novel multi-view implicit surface reconstruction technique, termed StreetSurf, that is readily applicable to street view images in widely-used autonomous driving datasets, such as Waymo-perception sequences, without necessarily requiring LiDAR data. As neural rendering research expands rapidly, its integration into street views has started to draw interests. Existing approaches on street views either mainly focus on novel view synthesis with little exploration of the scene geometry, or rely heavily on dense LiDAR data when investigating reconstruction. Neither of them investigates multi-view implicit surface reconstruction, especially under settings without LiDAR data. Our method extends prior object-centric neural surface reconstruction techniques to address the unique challenges posed by the unbounded street views that are captured with non-object-centric, long and narrow camera trajectories. We delimit the unbounded space into three parts, close-range, distant-view and sky, with aligned cuboid boundaries, and adapt cuboid/hyper-cuboid hash-grids along with road-surface initialization scheme for finer and disentangled representation. To further address the geometric errors arising from textureless regions and insufficient viewing angles, we adopt geometric priors that are estimated using general purpose monocular models. Coupled with our implementation of efficient and fine-grained multi-stage ray marching strategy, we achieve state of the art reconstruction quality in both geometry and appearance within only one to two hours of training time with a single RTX3090 GPU for each street view sequence. Furthermore, we demonstrate that the reconstructed implicit surfaces have rich potential for various downstream tasks, including ray tracing and LiDAR simulation. StreetSurf, a novel multi-view implicit surface reconstruction framework for street views, achieving state-of-the-art geometry and appearance quality within a short training time without requiring LiDAR data. Existing methods for street view reconstruction either focus on novel view synthesis without exploring scene geometry or rely heavily on LiDAR data. StreetSurf addresses these limitations, enabling accurate surface reconstruction from camera images alone. The method divides the scene into close-range, distant-view, and sky regions, each modeled by a dedicated neural network. It uses aligned cuboid boundaries and hash-grids for efficient representation and employs road-surface initialization and entropy regularization for disentangling close-range and distant-view models. Geometric priors from monocular estimations further enhance reconstruction accuracy. StreetSurf reconstructs high-quality surfaces from street view images without LiDAR data. The disentanglement of close-range and distant-view models improves reconstruction quality and enables accurate LiDAR simulation. The reconstructed implicit surfaces can be used for various downstream tasks, such as ray tracing and occupancy grid extraction. StreetSurf currently ignores dynamic foreground objects in street views. The method faces challenges in handling complex lighting conditions and long-tail environmental variations. neural rendering, implicit surface reconstruction, street views, multi-view reconstruction, autonomous driving
2306.04865 Report MyStyle++: A Controllable Personalized Generative Prior Libing Zeng, Lele Chen, Yi Xu, Nima Kalantari In this paper, we propose an approach to obtain a personalized generative prior with explicit control over a set of attributes. We build upon MyStyle, a recently introduced method, that tunes the weights of a pre-trained StyleGAN face generator on a few images of an individual. This system allows synthesizing, editing, and enhancing images of the target individual with high fidelity to their facial features. However, MyStyle does not demonstrate precise control over the attributes of the generated images. We propose to address this problem through a novel optimization system that organizes the latent space in addition to tuning the generator. Our key contribution is to formulate a loss that arranges the latent codes, corresponding to the input images, along a set of specific directions according to their attributes. We demonstrate that our approach, dubbed MyStyle++, is able to synthesize, edit, and enhance images of an individual with great control over the attributes, while preserving the unique facial characteristics of that individual. This paper introduces MyStyle++, a novel optimization system that enhances personalized generative priors, like MyStyle, by enabling explicit control over attributes in synthesized images while preserving individual facial characteristics. Existing methods for personalized image synthesis often lack precise control over attributes or fail to maintain identity during editing. MyStyle++ aims to address these issues. MyStyle++ builds upon StyleGAN and employs a two-pronged approach: 1) It organizes the latent space by optimizing anchor latent codes based on their attributes, and 2) It tunes the generator to ensure fidelity to the target individual. MyStyle++ demonstrates superior control over attributes like expression, yaw, pitch, and age compared to baseline methods. Quantitative evaluations reveal lower standard deviation in desired attributes and better preservation of identity during editing. The method proves effective for controllable image enhancement tasks such as inpainting and super-resolution. The number of images required for MyStyle++ grows with the number of attributes, posing a limitation for highly controllable synthesis. While attribute control is precise, reconstructions for attributes like view lack physical accuracy, suggesting an area for future improvement. generative adversarial networks, personalized image synthesis, controllable gans, few-shot learning, semantic image editing
2306.04849 Report ScaleDet: A Scalable Multi-Dataset Object Detector Yanbei Chen, Manchen Wang, Abhay Mittal, Zhenlin Xu, Paolo Favaro, Joseph Tighe, Davide Modolo Multi-dataset training provides a viable solution for exploiting heterogeneous large-scale datasets without extra annotation cost. In this work, we propose a scalable multi-dataset detector (ScaleDet) that can scale up its generalization across datasets when increasing the number of training datasets. Unlike existing multi-dataset learners that mostly rely on manual relabelling efforts or sophisticated optimizations to unify labels across datasets, we introduce a simple yet scalable formulation to derive a unified semantic label space for multi-dataset training. ScaleDet is trained by visual-textual alignment to learn the label assignment with label semantic similarities across datasets. Once trained, ScaleDet can generalize well on any given upstream and downstream datasets with seen and unseen classes. We conduct extensive experiments using LVIS, COCO, Objects365, OpenImages as upstream datasets, and 13 datasets from Object Detection in the Wild (ODinW) as downstream datasets. Our results show that ScaleDet achieves compelling strong model performance with an mAP of 50.7 on LVIS, 58.8 on COCO, 46.8 on Objects365, 76.2 on OpenImages, and 71.8 on ODinW, surpassing state-of-the-art detectors with the same backbone. ScaleDet is a novel multi-dataset object detector that effectively scales its generalization as the number of training datasets increases. Training on multiple datasets is crucial for building robust and generalizable object detectors, but unifying diverse label spaces across datasets is challenging. ScaleDet unifies labels from different datasets into a single semantic space using text embeddings and learns via hard and soft label assignments for visual-textual alignment. Scaling up the number of training datasets significantly improves performance on both upstream and downstream datasets. ScaleDet outperforms state-of-the-art multi-dataset detectors like UniDet and Detic, even when trained on fewer datasets. The method achieves strong transferability, as demonstrated by its state-of-the-art results on the challenging ODinW benchmark. The reliance on text embeddings might limit performance if the text encoder doesn't adequately capture certain visual concepts. Future work can explore incorporating weakly-supervised or semi-supervised learning techniques to further leverage the large amounts of available unlabeled or partially labeled data. object detection, multi-dataset learning, visual-textual alignment, zero-shot detection, generalization
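As a rough illustration of the unified-label-space idea, the sketch below embeds every dataset's class names with a placeholder `text_encoder` and trains region features against both a hard cross-entropy target and a soft target derived from label-label similarities; the dimensions, temperatures, and equal loss weighting are assumptions rather than the paper's configuration.

```python
# Sketch of unifying detection labels across datasets via text embeddings,
# in the spirit of ScaleDet. `text_encoder` is a placeholder that maps a list
# of class names to a (C, D) tensor (e.g., a CLIP text tower in practice).
import torch
import torch.nn.functional as F

def build_unified_label_space(class_names_per_dataset, text_encoder):
    """Concatenate every dataset's class names and embed them once."""
    all_names = [n for names in class_names_per_dataset for n in names]
    with torch.no_grad():
        text_emb = F.normalize(text_encoder(all_names), dim=-1)  # (C, D)
    return all_names, text_emb

def label_assignment_loss(region_feats, gt_idx, text_emb, tau=0.01, tau_soft=0.1):
    """Hard assignment to the ground-truth class plus a soft target built from
    label-label semantic similarities."""
    region_feats = F.normalize(region_feats, dim=-1)            # (N, D)
    logits = region_feats @ text_emb.t() / tau                  # (N, C)
    hard = F.cross_entropy(logits, gt_idx)
    soft_targets = ((text_emb[gt_idx] @ text_emb.t()) / tau_soft).softmax(dim=-1)
    soft = F.kl_div(logits.log_softmax(dim=-1), soft_targets, reduction="batchmean")
    return hard + soft

# dummy usage with a random stand-in text encoder
names = [["person", "car"], ["cat", "dog", "car"]]
enc = lambda texts: torch.randn(len(texts), 512)
labels, emb = build_unified_label_space(names, enc)
loss = label_assignment_loss(torch.randn(4, 512), torch.tensor([0, 2, 3, 1]), emb)
```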
2306.04848 Report Interpreting and Improving Diffusion Models Using the Euclidean Distance Function Frank Permenter, Chenyang Yuan Denoising is intuitively related to projection. Indeed, under the manifold hypothesis, adding random noise is approximately equivalent to orthogonal perturbation. Hence, learning to denoise is approximately learning to project. In this paper, we use this observation to reinterpret denoising diffusion models as approximate gradient descent applied to the Euclidean distance function. We then provide straight-forward convergence analysis of the DDIM sampler under simple assumptions on the projection-error of the denoiser. Finally, we propose a new sampler based on two simple modifications to DDIM using insights from our theoretical results. In as few as 5-10 function evaluations, our sampler achieves state-of-the-art FID scores on pretrained CIFAR-10 and CelebA models and can generate high quality samples on latent diffusion models. This paper presents a novel interpretation of denoising diffusion models as performing approximate gradient descent on the Euclidean distance function to the data manifold, providing theoretical analysis and a new improved sampler. Diffusion models achieve state-of-the-art results in generative tasks, but their understanding is mainly probabilistic. This work offers a deterministic analysis, enabling new insights and algorithmic improvements. The authors analyze DDIM sampling under a relative-error model, showing its equivalence to gradient descent with error. They leverage this to design a second-order sampler that reduces error in denoiser output by combining previous estimates. The paper validates the proposed relative error model both theoretically and empirically on image datasets. It provides convergence analysis of DDIM under the error model, linking error parameters to the noise schedule. The proposed second-order sampler achieves state-of-the-art FID scores on pretrained CIFAR-10 and CelebA models, outperforming existing samplers. The analysis assumes existence of admissible noise schedules, which are characterized but their optimality remains unexplored. The relative error model, while empirically validated, is not guaranteed to hold in all cases, suggesting future work on tighter bounds or alternative models. denoising diffusion models, generative models, distance functions, sampling algorithms, gradient descent
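To make the gradient-descent reading concrete, the sketch below writes a deterministic DDIM step in terms of the current clean-image estimate and adds a toy second-order variant that blends the current and previous noise predictions; the blending weight `w` and the function names are illustrative and should not be read as the paper's exact sampler.

```python
# DDIM step in "move toward the current clean-image estimate" form, plus a
# simple second-order variant that averages successive denoiser outputs.
import torch

def ddim_step(x_t, eps_hat, a_t, a_prev):
    """a_t, a_prev: cumulative alphas (alpha_bar) at the current and next step."""
    x0_hat = (x_t - torch.sqrt(1 - a_t) * eps_hat) / torch.sqrt(a_t)
    return torch.sqrt(a_prev) * x0_hat + torch.sqrt(1 - a_prev) * eps_hat

def second_order_step(x_t, eps_hat, eps_prev, a_t, a_prev, w=0.5):
    """Blend the current and previous noise estimates before stepping, which
    damps the projection error when individual estimates are noisy."""
    eps_mix = eps_hat if eps_prev is None else w * eps_hat + (1 - w) * eps_prev
    return ddim_step(x_t, eps_mix, a_t, a_prev)

x = torch.randn(1, 3, 32, 32)
x_prev = ddim_step(x, torch.randn_like(x), torch.tensor(0.5), torch.tensor(0.6))
```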
2306.04744 Report WOUAF: Weight Modulation for User Attribution and Fingerprinting in Text-to-Image Diffusion Models Changhoon Kim, Kyle Min, Maitreya Patel, Sheng Cheng, Yezhou Yang The rapid advancement of generative models, facilitating the creation of hyper-realistic images from textual descriptions, has concurrently escalated critical societal concerns such as misinformation. Although providing some mitigation, traditional fingerprinting mechanisms fall short in attributing responsibility for the malicious use of synthetic images. This paper introduces a novel approach to model fingerprinting that assigns responsibility for the generated images, thereby serving as a potential countermeasure to model misuse. Our method modifies generative models based on each user's unique digital fingerprint, imprinting a unique identifier onto the resultant content that can be traced back to the user. This approach, incorporating fine-tuning into Text-to-Image (T2I) tasks using the Stable Diffusion Model, demonstrates near-perfect attribution accuracy with a minimal impact on output quality. Through extensive evaluation, we show that our method outperforms baseline methods with an average improvement of 11\% in handling image post-processes. Our method presents a promising and novel avenue for accountable model distribution and responsible use. Our code is available in \url{https://github.com/kylemin/WOUAF}. This paper introduces WOUAF, a novel distributor-centered weight modulation method for fingerprinting text-to-image diffusion models, enabling user attribution for generated images. The rise of hyper-realistic image generation raises concerns about misuse, like misinformation. WOUAF provides a way to attribute generated images to their source, combating malicious use. WOUAF embeds user-specific fingerprints directly into the model weights of the Stable Diffusion decoder via a mapping network and affine transformations during the fine-tuning process. WOUAF achieves near-perfect attribution accuracy while minimally impacting the quality of generated images. The method demonstrates robustness against various image post-processing techniques, outperforming baseline methods. WOUAF proves resilient to deliberate fingerprint removal attempts like auto-encoder obfuscation and model purification. The fingerprint capacity, while supporting over 4 billion users, shows a trade-off with increasing fingerprint dimensions. Future work aims to extend the methodology beyond image data to encompass diverse data types like text, audio, and video. model fingerprinting, user attribution, text-to-image synthesis, diffusion models, stable diffusion
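A toy version of fingerprint-conditioned weight modulation is sketched below: a small mapping network turns a binary fingerprint into per-channel scales applied ahead of a decoder convolution, which is equivalent to scaling that convolution's weights along the input-channel dimension. The layer sizes and the 32-bit fingerprint are assumptions for illustration, not the paper's architecture.

```python
# Toy fingerprint-conditioned modulation in the spirit of WOUAF.
import torch
import torch.nn as nn

class ModulatedConv(nn.Module):
    def __init__(self, in_ch, out_ch, fp_dim):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.mapping = nn.Sequential(
            nn.Linear(fp_dim, 128), nn.ReLU(), nn.Linear(128, in_ch))

    def forward(self, x, fingerprint):
        # Per-input-channel scale derived from the user's fingerprint; scaling
        # the activations here is equivalent to scaling the conv weights over
        # their input channels.
        scale = 1.0 + self.mapping(fingerprint)          # (B, in_ch)
        x = x * scale.unsqueeze(-1).unsqueeze(-1)
        return self.conv(x)

fp = torch.randint(0, 2, (1, 32)).float()                # 32-bit fingerprint
layer = ModulatedConv(64, 64, fp_dim=32)
out = layer(torch.randn(1, 64, 16, 16), fp)
```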
2306.04695 Report ConceptBed: Evaluating Concept Learning Abilities of Text-to-Image Diffusion Models Maitreya Patel, Tejas Gokhale, Chitta Baral, Yezhou Yang The ability to understand visual concepts and replicate and compose these concepts from images is a central goal for computer vision. Recent advances in text-to-image (T2I) models have led to high definition and realistic image quality generation by learning from large databases of images and their descriptions. However, the evaluation of T2I models has focused on photorealism and limited qualitative measures of visual understanding. To quantify the ability of T2I models in learning and synthesizing novel visual concepts (a.k.a. personalized T2I), we introduce ConceptBed, a large-scale dataset that consists of 284 unique visual concepts, and 33K composite text prompts. Along with the dataset, we propose an evaluation metric, Concept Confidence Deviation (CCD), that uses the confidence of oracle concept classifiers to measure the alignment between concepts generated by T2I generators and concepts contained in target images. We evaluate visual concepts that are either objects, attributes, or styles, and also evaluate four dimensions of compositionality: counting, attributes, relations, and actions. Our human study shows that CCD is highly correlated with human understanding of concepts. Our results point to a trade-off between learning the concepts and preserving the compositionality which existing approaches struggle to overcome. The data, code, and interactive demo are available at: https://conceptbed.github.io/ The paper introduces ConceptBed, a large-scale dataset and evaluation framework for assessing the ability of text-to-image models to learn and synthesize novel visual concepts. Existing evaluation methods for text-to-image models primarily focus on photorealism and lack robust measures for visual understanding, particularly in concept learning (personalized T2I). ConceptBed consists of 284 unique visual concepts and 33K composite text prompts. The authors propose Concept Confidence Deviation (CCD), a metric leveraging oracle concept classifiers to measure the alignment between generated and target images. There's a trade-off between concept alignment and composition alignment: methods excelling at one often struggle with the other. CCD strongly correlates with human preferences for concept and composition alignment, outperforming prior metrics. Using a pre-trained CLIP textual encoder helps maintain compositionality but hinders learning complex concepts. While large-scale, ConceptBed doesn't encompass all possible concepts; future work should combine it with qualitative examples. The focus is on Stable Diffusion-based models; extending to other text-conditioned concept learners is important. concept learning, text-to-image synthesis, personalized t2i, evaluation metrics, diffusion models
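One plausible instantiation of a CCD-style score is sketched below: compare the oracle classifier's confidence in the target concept on generated images against its confidence on real images of that concept, normalized by the real-image spread. The normalization and the synthetic confidence arrays are illustrative, not the paper's exact definition.

```python
# Hedged sketch of a Concept Confidence Deviation-style score.
import numpy as np

def concept_confidence_deviation(conf_real, conf_gen):
    """conf_real, conf_gen: oracle softmax confidences for the target concept
    on real and generated images. Positive values mean the generated images
    are less recognizable as the concept than real ones."""
    mu, sigma = conf_real.mean(), conf_real.std() + 1e-8
    return float((mu - conf_gen.mean()) / sigma)

# toy usage with synthetic confidence samples
ccd = concept_confidence_deviation(np.random.beta(8, 2, 500),
                                   np.random.beta(5, 3, 200))
```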
2306.04654 Report DenseDINO: Boosting Dense Self-Supervised Learning with Token-Based Point-Level Consistency Yike Yuan, Xinghe Fu, Yunlong Yu, Xi Li In this paper, we propose a simple yet effective transformer framework for self-supervised learning called DenseDINO to learn dense visual representations. To exploit the spatial information that the dense prediction tasks require but neglected by the existing self-supervised transformers, we introduce point-level supervision across views in a novel token-based way. Specifically, DenseDINO introduces some extra input tokens called reference tokens to match the point-level features with the position prior. With the reference token, the model could maintain spatial consistency and deal with multi-object complex scene images, thus generalizing better on dense prediction tasks. Compared with the vanilla DINO, our approach obtains competitive performance when evaluated on classification in ImageNet and achieves a large margin (+7.2% mIoU) improvement in semantic segmentation on PascalVOC under the linear probing protocol for segmentation. The paper proposes DenseDINO, a simple yet effective transformer framework for self-supervised learning of dense visual representations, introducing point-level supervision across views in a novel token-based way. Existing self-supervised transformers struggle to produce high-quality vision representations that generalize well to diverse downstream tasks, particularly dense prediction tasks like segmentation that require spatial information neglected by image-level approaches. DenseDINO introduces extra input tokens called reference tokens, defined as positional embeddings of randomly sampled point pairs across views, enabling the model to maintain spatial consistency and attend to multiple objects in complex scenes. The framework minimizes both image-level and point-level distillation losses using a modified masked-attention module to disentangle reference tokens. DenseDINO achieves competitive performance on ImageNet classification compared to the state-of-the-art DINO. DenseDINO significantly surpasses DINO on PascalVOC semantic segmentation, demonstrating superior dense prediction capabilities. Analysis reveals that multi-crop augmentation, while beneficial for classification, can hurt segmentation due to object misalignment between views, an issue mitigated by DenseDINO's point-level consistency and modified view generation. The selection and generation of reference tokens can be further optimized for improved object localization and supervision accuracy. Exploring the framework's applicability to other dense prediction tasks beyond segmentation. self-supervised learning, transformer, dense visual representation, point-level supervision, object misalignment
2306.04642 Report DiffusionShield: A Watermark for Copyright Protection against Generative Diffusion Models Yingqian Cui, Jie Ren, Han Xu, Pengfei He, Hui Liu, Lichao Sun, Yue Xing, Jiliang Tang Recently, Generative Diffusion Models (GDMs) have showcased their remarkable capabilities in learning and generating images. A large community of GDMs has naturally emerged, further promoting the diversified applications of GDMs in various fields. However, this unrestricted proliferation has raised serious concerns about copyright protection. For example, artists including painters and photographers are becoming increasingly concerned that GDMs could effortlessly replicate their unique creative works without authorization. In response to these challenges, we introduce a novel watermarking scheme, DiffusionShield, tailored for GDMs. DiffusionShield protects images from copyright infringement by GDMs through encoding the ownership information into an imperceptible watermark and injecting it into the images. Its watermark can be easily learned by GDMs and will be reproduced in their generated images. By detecting the watermark from generated images, copyright infringement can be exposed with evidence. Benefiting from the uniformity of the watermarks and the joint optimization method, DiffusionShield ensures low distortion of the original image, high watermark detection performance, and the ability to embed lengthy messages. We conduct rigorous and comprehensive experiments to show the effectiveness of DiffusionShield in defending against infringement by GDMs and its superiority over traditional watermarking methods. The code for DiffusionShield is accessible in https://github.com/Yingqiancui/DiffusionShield. This paper proposes DiffusionShield, a novel watermarking scheme designed to protect image copyright against infringement by Generative Diffusion Models (GDMs). The rise of GDMs raises concerns about copyright infringement, as these models can easily replicate creative works without permission. Existing watermarking techniques are not designed for GDMs and often fail to be reproduced in generated images. DiffusionShield enhances "pattern uniformity" by employing a blockwise watermarking approach, where the same watermark pattern is applied to all images from the same owner. It also uses a joint optimization method to optimize the watermark patterns and the watermark detector simultaneously. DiffusionShield achieves high bit accuracy (close to 100%) in watermark detection on generated images, even with very small watermark budgets. It demonstrates robustness against image corruptions and variations in GDM training hyperparameters. The method is flexible for multiple-user scenarios, allowing new users to adopt the scheme without retraining. The current version of DiffusionShield focuses on protecting image data and may need further adaptation for other data modalities. Future work could explore more advanced encoding and decoding techniques to further improve the capacity and robustness of the watermarking scheme. watermark, copyright protection, generative diffusion models, pattern uniformity, joint optimization
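The blockwise encoding can be pictured with the toy sketch below: every image from one owner receives the same per-block patterns, one pattern per message bit, added under a small pixel budget. In the actual scheme the patterns are jointly optimized with a detector; here they are fixed random sign tensors, and the block size, budget, and divisible image dimensions are assumptions.

```python
# Toy blockwise watermark embedding in the spirit of DiffusionShield.
import torch

def embed_blockwise_watermark(images, bits, patterns, block=8, budget=4 / 255):
    """images: (B,3,H,W) in [0,1] with H, W divisible by `block`;
    bits: list of 0/1 message bits, repeated across block positions;
    patterns: dict {0: (3,block,block), 1: (3,block,block)}."""
    wm = images.clone()
    _, _, H, W = images.shape
    idx = 0
    for y in range(0, H, block):
        for x in range(0, W, block):
            p = patterns[bits[idx % len(bits)]]
            wm[:, :, y:y + block, x:x + block] += budget * p
            idx += 1
    return wm.clamp(0, 1)

patterns = {b: torch.randn(3, 8, 8).sign() for b in (0, 1)}
marked = embed_blockwise_watermark(torch.rand(2, 3, 64, 64), [1, 0, 1, 1], patterns)
```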
2306.04619 Report ARTIC3D: Learning Robust Articulated 3D Shapes from Noisy Web Image Collections Chun-Han Yao, Amit Raj, Wei-Chih Hung, Yuanzhen Li, Michael Rubinstein, Ming-Hsuan Yang, Varun Jampani Estimating 3D articulated shapes like animal bodies from monocular images is inherently challenging due to the ambiguities of camera viewpoint, pose, texture, lighting, etc. We propose ARTIC3D, a self-supervised framework to reconstruct per-instance 3D shapes from a sparse image collection in-the-wild. Specifically, ARTIC3D is built upon a skeleton-based surface representation and is further guided by 2D diffusion priors from Stable Diffusion. First, we enhance the input images with occlusions/truncation via 2D diffusion to obtain cleaner mask estimates and semantic features. Second, we perform diffusion-guided 3D optimization to estimate shape and texture that are of high-fidelity and faithful to input images. We also propose a novel technique to calculate more stable image-level gradients via diffusion models compared to existing alternatives. Finally, we produce realistic animations by fine-tuning the rendered shape and texture under rigid part transformations. Extensive evaluations on multiple existing datasets as well as newly introduced noisy web image collections with occlusions and truncation demonstrate that ARTIC3D outputs are more robust to noisy images, higher quality in terms of shape and texture details, and more realistic when animated. Project page: https://chhankyao.github.io/artic3d/ ARTIC3D, a self-supervised framework for reconstructing 3D articulated animal shapes from sparse, noisy in-the-wild images, guided by 2D diffusion priors from Stable Diffusion. Creating articulated animal models is crucial for various applications, but existing methods struggle with noisy, real-world images. This work leverages the power of 2D diffusion priors to reconstruct high-quality, animatable 3D animals from limited, imperfect data. ARTIC3D uses a skeleton-based surface representation. It preprocesses images with diffusion to improve mask and feature estimates. It employs a novel Decoder-based Accumulative Score Sampling (DASS) for stable gradient calculation during 3D optimization. Finally, it refines animations using a temporal consistency loss. ARTIC3D demonstrates robustness to occlusions and truncation in images, outperforming baselines on keypoint transfer accuracy, particularly on the newly introduced noisy E-LASSIE dataset. Qualitative results showcase detailed, realistic 3D shapes and textures, faithful to input images from both input and novel viewpoints. The framework allows for realistic animation and applications like texture transfer due to its explicit part representation. ARTIC3D's reliance on accurate skeleton initialization can be limiting for heavily occluded images or animals with ambiguous skeletal structures. The front-facing bias inherent in diffusion models can sometimes lead to unrealistic textures. 3d reconstruction, diffusion models, articulated shapes, animal modeling, sparse image optimization
2306.04396 Report Improving Diffusion-based Image Translation using Asymmetric Gradient Guidance Gihyun Kwon, Jong Chul Ye Diffusion models have shown significant progress in image translation tasks recently. However, due to their stochastic nature, there's often a trade-off between style transformation and content preservation. Current strategies aim to disentangle style and content, preserving the source image's structure while successfully transitioning from a source to a target domain under text or one-shot image conditions. Yet, these methods often require computationally intense fine-tuning of diffusion models or additional neural networks. To address these challenges, here we present an approach that guides the reverse process of diffusion sampling by applying asymmetric gradient guidance. This results in quicker and more stable image manipulation for both text-guided and image-guided image translation. Our model's adaptability allows it to be implemented with both image- and latent-diffusion models. Experiments show that our method outperforms various state-of-the-art models in image translation tasks. This paper presents Asymmetric Gradient Guidance (AGG), a novel sampling approach for efficient and flexible image translation in both image- and latent-diffusion models. Existing image translation methods using diffusion models often struggle to balance style transformation with content preservation and can be computationally expensive. This work aims to address these limitations. AGG combines the strengths of MCG and DDS methods. It guides the reverse diffusion sampling process by first applying a single step of MCG for initial update, followed by computationally efficient DDS update using the Adam optimizer. A simpler structural regularization term based on intermediate products of the DDIM forward step helps preserve source image structure. AGG outperforms state-of-the-art models in text-guided image translation on Animals and Landscapes datasets, achieving better image quality (SFID, CSFID) and comparable content preservation (LPIPS) with faster sampling. For image-guided translation, AGG demonstrates superior perceptual quality compared to existing appearance and style transfer methods. AGG effectively adapts to latent diffusion models, enabling fast and accurate semantic image manipulation with better content preservation compared to methods like P2P and PnP. The method's performance can be limited when there's a large semantic gap between the source image and target text in the CLIP space (e.g., lion to building). Future work could explore integrating better text embedding models to address this limitation. image translation, diffusion models, gradient guidance, text-to-image synthesis, image manipulation
2306.04356 Report Fine-Grained Visual Prompting Lingfeng Yang, Yueze Wang, Xiang Li, Xinlong Wang, Jian Yang Vision-Language Models (VLMs), such as CLIP, have demonstrated impressive zero-shot transfer capabilities in image-level visual perception. However, these models have shown limited performance in instance-level tasks that demand precise localization and recognition. Previous works have suggested that incorporating visual prompts, such as colorful boxes or circles, can improve the ability of models to recognize objects of interest. Nonetheless, compared to language prompting, visual prompting designs are rarely explored. Existing approaches, which employ coarse visual cues such as colorful boxes or circles, often result in sub-optimal performance due to the inclusion of irrelevant and noisy pixels. In this paper, we carefully study the visual prompting designs by exploring more fine-grained markings, such as segmentation masks and their variations. In addition, we introduce a new zero-shot framework that leverages pixel-level annotations acquired from a generalist segmentation model for fine-grained visual prompting. Consequently, our investigation reveals that a straightforward application of blur outside the target mask, referred to as the Blur Reverse Mask, exhibits exceptional effectiveness. This proposed prompting strategy leverages the precise mask annotations to reduce focus on weakly related regions while retaining spatial coherence between the target and the surrounding background. Our Fine-Grained Visual Prompting (FGVP) demonstrates superior performance in zero-shot comprehension of referring expressions on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks. It outperforms prior methods by an average margin of 3.0% to 4.6%, with a maximum improvement of 12.5% on the RefCOCO+ testA subset. Code is available at https://github.com/ylingfeng/FGVP. This paper proposes Fine-Grained Visual Prompting (FGVP), which uses precise semantic masks to guide Vision-Language Models (VLMs) for improved zero-shot instance-level understanding. Existing VLMs struggle with precise localization in tasks like referring expression comprehension and part detection, often relying on coarse visual prompts (e.g., boxes, circles). FGVP aims to overcome this limitation by providing fine-grained guidance. FGVP leverages a robust segmentation model (Segment Anything Model, SAM) to generate accurate semantic masks. These masks are then used to prompt VLMs, particularly focusing on a 'Blur Reverse Mask' strategy where the background is blurred to highlight the target. Blur Reverse Mask prompting consistently outperforms other visual prompting methods across multiple datasets. FGVP achieves state-of-the-art zero-shot results on referring expression comprehension benchmarks (RefCOCO, RefCOCO+, RefCOCOg), surpassing previous methods by significant margins. The proposed zero-shot pipeline also demonstrates superior performance in part detection on the PACO dataset compared to existing techniques. FGVP's reliance on a segmentation model increases inference time compared to methods using simpler prompts. The current implementation does not yet explore joint optimization of visual and language prompts for potentially enhanced performance. visual prompting, vision-language models, referring expression comprehension, part detection, zero-shot learning
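The Blur Reverse Mask prompt is simple to reproduce: keep the masked target sharp and blur everything else before handing the image to the vision-language model. The sketch below shows the compositing with PIL and NumPy; the blur radius is an arbitrary choice, and the downstream CLIP scoring is left abstract.

```python
# Minimal sketch of Blur Reverse Mask visual prompting.
import numpy as np
from PIL import Image, ImageFilter

def blur_reverse_mask(image, mask, radius=20):
    """image: PIL RGB image; mask: HxW boolean array (True = target region)."""
    blurred = image.filter(ImageFilter.GaussianBlur(radius))
    img, blr = np.asarray(image), np.asarray(blurred)
    m = mask[..., None]                        # broadcast over RGB channels
    out = np.where(m, img, blr).astype(np.uint8)
    return Image.fromarray(out)

# prompted = blur_reverse_mask(img, sam_mask)   # then score candidates with CLIP
```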
2306.04180 Report FusedRF: Fusing Multiple Radiance Fields Rahul Goel, Dhawal Sirikonda, Rajvi Shah, PJ Narayanan Radiance Fields (RFs) have shown great potential to represent scenes from casually captured discrete views. Compositing parts or whole of multiple captured scenes could greatly interest several XR applications. Prior works can generate new views of such scenes by tracing each scene in parallel. This increases the render times and memory requirements with the number of components. In this work, we provide a method to create a single, compact, fused RF representation for a scene composited using multiple RFs. The fused RF has the same render times and memory utilizations as a single RF. Our method distills information from multiple teacher RFs into a single student RF while also facilitating further manipulations like addition and deletion into the fused representation. This paper introduces FusedRF, a method to fuse multiple Radiance Fields (RFs) into a single, compact RF representation for efficient rendering of composited scenes. Compositing scenes from multiple RFs currently requires parallel tracing, which increases rendering time and memory proportionally to the number of scenes. FusedRF addresses this by creating a single representation with the same efficiency as a single RF. The method distills information from multiple teacher RFs into a single student RF. It iteratively fuses source RFs with affine composition, using supervised losses on density and color values at sampled points. The process is sped up by pruning low-density points and initializing with the dominant scene's weights. FusedRF significantly reduces rendering time and memory consumption compared to rendering composited scenes with existing methods. Quantitative results demonstrate that FusedRF maintains comparable visual quality to naive composition. The method is applicable to various RF representations using explicit 3D lattices, including TensoRF, InstantNGP, DVGO, and Plenoxels. The paper primarily demonstrates fusion with TensoRF; further evaluation with other RF representations is needed. Exploration of more complex compositions beyond affine transformations is a potential avenue for future work. radiance fields, scene composition, 3d reconstruction, neural rendering, distillation
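Conceptually, the distillation loop queries the (transformed) teacher fields at sampled 3D points, composites their densities and colors, and regresses the student onto that composite. The sketch below uses a deliberately simplified compositing rule and placeholder field interfaces (each field maps points to a density and an RGB value), so it should be read as an assumption-laden illustration rather than the paper's procedure.

```python
# Illustrative distillation step: many teacher radiance fields -> one student.
import torch

def fuse_step(student, teachers, transforms, points, optimizer):
    """teachers/transforms are paired lists; each field maps (N,3) points to
    (sigma: (N,), rgb: (N,3)); transforms hold per-scene affine placement."""
    with torch.no_grad():
        sigmas, rgbs = [], []
        for field, T in zip(teachers, transforms):
            local = (points - T["translation"]) / T["scale"]   # affine composition
            s, c = field(local)
            sigmas.append(s)
            rgbs.append(c)
        sigmas = torch.stack(sigmas)                            # (K, N)
        weights = sigmas / (sigmas.sum(0, keepdim=True) + 1e-8)
        target_sigma = sigmas.sum(0)
        target_rgb = (weights.unsqueeze(-1) * torch.stack(rgbs)).sum(0)
    s_pred, c_pred = student(points)
    loss = (s_pred - target_sigma).abs().mean() + (c_pred - target_rgb).abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```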
2306.03881 Report Emergent Correspondence from Image Diffusion Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, Bharath Hariharan Finding correspondences between images is a fundamental problem in computer vision. In this paper, we show that correspondence emerges in image diffusion models without any explicit supervision. We propose a simple strategy to extract this implicit knowledge out of diffusion networks as image features, namely DIffusion FeaTures (DIFT), and use them to establish correspondences between real images. Without any additional fine-tuning or supervision on the task-specific data or annotations, DIFT is able to outperform both weakly-supervised methods and competitive off-the-shelf features in identifying semantic, geometric, and temporal correspondences. Particularly for semantic correspondence, DIFT from Stable Diffusion is able to outperform DINO and OpenCLIP by 19 and 14 accuracy points respectively on the challenging SPair-71k benchmark. It even outperforms the state-of-the-art supervised methods on 9 out of 18 categories while remaining on par for the overall performance. Project page: https://diffusionfeatures.github.io This paper discovers and leverages the implicit correspondence learning capability within image diffusion models, introducing a novel feature extractor called DIFT (DIffusion FeaTures). Discovering implicit correspondence learning in diffusion models offers a new path towards robust and accurate correspondence estimation without the need for explicit supervision, which is crucial for tasks like 3D reconstruction, object tracking, and image editing. DIFT extracts correspondence information from pre-trained diffusion models by adding noise to real images to simulate the forward diffusion process. Then, intermediate layer activations from the model's U-Net are used as feature maps for correspondence matching. Without explicit supervision, DIFT outperforms weakly-supervised methods and other self-supervised features on semantic, geometric, and temporal correspondence benchmarks. DIFT achieves state-of-the-art performance on semantic correspondence, even rivaling supervised methods on PF-WILLOW and certain SPair-71k categories. The choice of time step during feature extraction significantly influences the type of correspondence captured, with larger time steps favoring semantic relationships. The reliance on potentially biased datasets like LAION for training diffusion models might lead to uneven performance across different image types. DIFT's performance could be further enhanced through sophisticated adaptation mechanisms, such as combining features from various time steps and layers or fine-tuning with task-specific supervision. diffusion models, correspondence learning, self-supervision, feature extraction, image editing
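Extracting DIFT-style features amounts to noising a real image to some timestep, running one U-Net pass, and reading off an intermediate activation. The rough sketch below does this with the diffusers Stable Diffusion pipeline and a forward hook; the model id, up-block index, and timestep are illustrative choices and are not claimed to match the paper's settings.

```python
# Rough sketch of DIFT-style feature extraction with the diffusers library.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

feats = {}
def hook(_module, _inputs, output):
    feats["act"] = output                      # intermediate U-Net activation

pipe.unet.up_blocks[1].register_forward_hook(hook)   # illustrative block choice

@torch.no_grad()
def extract_dift(image_tensor, prompt="", t=261):
    # image_tensor: (1, 3, 512, 512) in [-1, 1]
    latents = pipe.vae.encode(image_tensor.half().cuda()).latent_dist.sample()
    latents = latents * pipe.vae.config.scaling_factor
    noise = torch.randn_like(latents)
    timesteps = torch.tensor([t], device=latents.device)
    noisy = pipe.scheduler.add_noise(latents, noise, timesteps)
    tokens = pipe.tokenizer(prompt, return_tensors="pt", padding="max_length",
                            truncation=True,
                            max_length=pipe.tokenizer.model_max_length)
    text_emb = pipe.text_encoder(tokens.input_ids.to(latents.device))[0]
    pipe.unet(noisy, timesteps, encoder_hidden_states=text_emb)
    return feats["act"]        # spatial feature map used for correspondence matching
```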
2306.03436 Report Intellectual Property Protection of Diffusion Models via the Watermark Diffusion Process Sen Peng, Yufei Chen, Cong Wang, Xiaohua Jia Diffusion models have rapidly become a vital part of deep generative architectures, given today's increasing demands. Obtaining large, high-performance diffusion models demands significant resources, highlighting their importance as intellectual property worth protecting. However, existing watermarking techniques for ownership verification are insufficient when applied to diffusion models. Very recent research in watermarking diffusion models either exposes watermarks during task generation, which harms the imperceptibility, or is developed for conditional diffusion models that require prompts to trigger the watermark. This paper introduces WDM, a novel watermarking solution for diffusion models without imprinting the watermark during task generation. It involves training a model to concurrently learn a Watermark Diffusion Process (WDP) for embedding watermarks alongside the standard diffusion process for task generation. We provide a detailed theoretical analysis of WDP training and sampling, relating it to a shifted Gaussian diffusion process via the same reverse noise. Extensive experiments are conducted to validate the effectiveness and robustness of our approach in various trigger and watermark data configurations. This paper introduces WDM, a novel watermarking solution for diffusion models that embeds watermarks without affecting the task generation process. Protecting intellectual property of large diffusion models is crucial due to the significant resources required to train them and the potential for misuse, such as generating disinformation. WDM trains a model to learn a Watermark Diffusion Process (WDP) for embedding watermarks alongside the standard diffusion process for task generation. The WDP utilizes a trigger to generate distinct data distributions, enabling watermark extraction and verification. WDM achieves high watermark fidelity, allowing effective extraction and verification of embedded watermarks. WDM demonstrates robustness against model compression and weight perturbation attacks. The watermark remains detectable even when using DDIM architecture or varying watermark extraction timesteps. WDM's robustness against model fine-tuning attacks is limited, especially when a large amount of data is used during fine-tuning. Future work can explore methods to improve the robustness of WDM against more sophisticated watermark removal attacks. watermarking, diffusion models, intellectual property protection, deep generative models, watermark diffusion process
2306.03253 Report Zero-Shot 3D Shape Correspondence Ahmed Abdelreheem, Abdelrahman Eldesokey, Maks Ovsjanikov, Peter Wonka We propose a novel zero-shot approach to computing correspondences between 3D shapes. Existing approaches mainly focus on isometric and near-isometric shape pairs (e.g., human vs. human), but less attention has been given to strongly non-isometric and inter-class shape matching (e.g., human vs. cow). To this end, we introduce a fully automatic method that exploits the exceptional reasoning capabilities of recent foundation models in language and vision to tackle difficult shape correspondence problems. Our approach comprises multiple stages. First, we classify the 3D shapes in a zero-shot manner by feeding rendered shape views to a language-vision model (e.g., BLIP2) to generate a list of class proposals per shape. These proposals are unified into a single class per shape by employing the reasoning capabilities of ChatGPT. Second, we attempt to segment the two shapes in a zero-shot manner, but in contrast to the co-segmentation problem, we do not require a mutual set of semantic regions. Instead, we propose to exploit the in-context learning capabilities of ChatGPT to generate two different sets of semantic regions for each shape and a semantic mapping between them. This enables our approach to match strongly non-isometric shapes with significant differences in geometric structure. Finally, we employ the generated semantic mapping to produce coarse correspondences that can further be refined by the functional maps framework to produce dense point-to-point maps. Our approach, despite its simplicity, produces highly plausible results in a zero-shot manner, especially between strongly non-isometric shapes. Project webpage: https://samir55.github.io/3dshapematch/. This paper proposes a novel zero-shot approach for computing correspondences between 3D shapes, particularly targeting strongly non-isometric and inter-class shape matching. Existing methods struggle with matching shapes across different classes and with significant geometric variations, limiting their application to understanding relationships between diverse 3D shapes. The method leverages foundation models in language and vision (BLIP2, ChatGPT, DINO, Segment-Anything) for: - Zero-shot 3D shape classification - Generating shape-specific semantic regions and mappings between them - Zero-shot 3D semantic segmentation (SAM-3D) - Dense correspondence refinement using functional maps. Achieves high zero-shot 3D object classification accuracy using BLIP2 and ChatGPT reasoning. Generates accurate semantic regions and mappings using ChatGPT in-context learning, outperforming BLIP2. Proposed SAM-3D outperforms existing zero-shot 3D segmentation methods in terms of region IoU and keypoint label matching. Current method focuses on coarse semantic regions, with finer-grained segmentation being a challenge for future work. Developing foundation models specifically for 3D shapes and adapting functional maps for strongly non-isometric cases are potential future directions. 3d shape correspondence, zero-shot learning, semantic segmentation, foundation models, non-isometric shape matching
2306.03092 Report Neuralangelo: High-Fidelity Neural Surface Reconstruction Zhaoshuo Li, Thomas Müller, Alex Evans, Russell H. Taylor, Mathias Unberath, Ming-Yu Liu, Chen-Hsuan Lin Neural surface reconstruction has been shown to be powerful for recovering dense 3D surfaces via image-based neural rendering. However, current methods struggle to recover detailed structures of real-world scenes. To address the issue, we present Neuralangelo, which combines the representation power of multi-resolution 3D hash grids with neural surface rendering. Two key ingredients enable our approach: (1) numerical gradients for computing higher-order derivatives as a smoothing operation and (2) coarse-to-fine optimization on the hash grids controlling different levels of details. Even without auxiliary inputs such as depth, Neuralangelo can effectively recover dense 3D surface structures from multi-view images with fidelity significantly surpassing previous methods, enabling detailed large-scale scene reconstruction from RGB video captures. Neuralangelo, a framework for high-fidelity 3D surface reconstruction from multi-view images, even without auxiliary data like depth or segmentation. Current neural surface reconstruction methods struggle to recover fine details. Neuralangelo addresses this by combining multi-resolution 3D hash grids with neural surface rendering. Leverages representation power of multi-resolution hash grids with two key components: 1) Numerical gradients for smoothing higher-order derivatives and 2) Coarse-to-fine optimization on hash grids for capturing different detail levels. Significantly surpasses previous methods in fidelity on DTU and Tanks and Temples benchmarks. Enables detailed large-scale scene reconstruction from RGB video captures. Progressively recovers more details as hash grid resolution increases during optimization. Sampling strategy could be improved for faster training. Robustness for highly reflective scenes can be improved. 3d reconstruction, neural rendering, hash grids, surface reconstruction, neuralangelo
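The numerical-gradient ingredient can be shown in a few lines: replace autograd gradients of the SDF with central finite differences, so that higher-order terms such as an eikonal penalty average information over a neighborhood whose size is set by the step `eps` (annealed coarse-to-fine in the paper). The sketch assumes an `sdf` callable returning one value per query point.

```python
# Finite-difference SDF gradients and an eikonal penalty built on them.
import torch

def numerical_gradient(sdf, x, eps=1e-2):
    """x: (N,3) query points; returns (N,3) central-difference gradients."""
    offsets = eps * torch.eye(3, device=x.device)          # (3, 3)
    grads = []
    for i in range(3):
        grads.append((sdf(x + offsets[i]) - sdf(x - offsets[i])) / (2 * eps))
    return torch.stack(grads, dim=-1)                       # (N, 3)

def eikonal_loss(sdf, x, eps=1e-2):
    g = numerical_gradient(sdf, x, eps)
    return ((g.norm(dim=-1) - 1.0) ** 2).mean()

# sanity check on a unit-sphere SDF: loss should be close to zero
loss = eikonal_loss(lambda p: p.norm(dim=-1) - 1.0, torch.randn(1024, 3))
```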
2306.02949 Report INDigo: An INN-Guided Probabilistic Diffusion Algorithm for Inverse Problems Di You, Andreas Floros, Pier Luigi Dragotti Recently it has been shown that using diffusion models for inverse problems can lead to remarkable results. However, these approaches require a closed-form expression of the degradation model and can not support complex degradations. To overcome this limitation, we propose a method (INDigo) that combines invertible neural networks (INN) and diffusion models for general inverse problems. Specifically, we train the forward process of INN to simulate an arbitrary degradation process and use the inverse as a reconstruction process. During the diffusion sampling process, we impose an additional data-consistency step that minimizes the distance between the intermediate result and the INN-optimized result at every iteration, where the INN-optimized image is composed of the coarse information given by the observed degraded image and the details generated by the diffusion process. With the help of INN, our algorithm effectively estimates the details lost in the degradation process and is no longer limited by the requirement of knowing the closed-form expression of the degradation model. Experiments demonstrate that our algorithm obtains competitive results compared with recently leading methods both quantitatively and visually. Moreover, our algorithm performs well on more complex degradation models and real-world low-quality images. This paper introduces INDigo, an algorithm for inverse problems that combines Invertible Neural Networks (INN) and diffusion models. INDigo leverages INN to simulate the degradation process and uses a diffusion model to generate detailed reconstructions, effectively handling complex degradations without requiring a closed-form expression of the degradation model. Existing diffusion-based methods for inverse problems often struggle with complex or unknown degradation processes and can blur reconstructed details. This new method addresses these limitations, enabling high-quality reconstruction in challenging scenarios. The method trains a Wavelet-inspired INN (WINN) to decompose images into a coarse representation similar to degraded observations and lost details. During diffusion sampling, WINN guides the process by replacing intermediate coarse representations with observed data, ensuring data consistency while the diffusion model generates missing details. INDigo achieves state-of-the-art results compared to recent diffusion-based methods on super-resolution tasks, both with and without noise. The method effectively handles complex degradation models, such as combined downsampling and JPEG compression, producing high-quality reconstructions with realistic details. INDigo demonstrates strong performance on real-world image restoration tasks, reconstructing high-quality images from real degraded images with unknown degradation processes. The current implementation of INDigo focuses on image restoration tasks, and its extension to other inverse problems requires further investigation. The computational complexity of INDigo remains relatively high due to the iterative diffusion process, and exploring optimization strategies for faster inference is a promising direction. inverse problems, diffusion models, invertible neural networks, image restoration, deep learning
2306.02903 Report Instruct-Video2Avatar: Video-to-Avatar Generation with Instructions Shaoxu Li We propose a method for synthesizing edited photo-realistic digital avatars with text instructions. Given a short monocular RGB video and text instructions, our method uses an image-conditioned diffusion model to edit one head image and uses the video stylization method to accomplish the editing of other head images. Through iterative training and update (three times or more), our method synthesizes edited photo-realistic animatable 3D neural head avatars with a deformable neural radiance field head synthesis method. In quantitative and qualitative studies on various subjects, our method outperforms state-of-the-art methods. This paper introduces Instruct-Video2Avatar, a novel approach for generating customizable, photorealistic, and animatable 3D head avatars from a short RGB video and text instructions. The method addresses the growing demand for personalized and stylized avatars for various applications, including VR/AR, by simplifying the avatar creation process and enabling users to customize avatars with text instructions. The method employs a three-step process: (1) edits an exemplar head image using an image-conditioned diffusion model (InstructPix2Pix) guided by text instructions, (2) propagates the edits to other frames in the video using a video stylization technique (EbSynth), (3) iteratively trains and updates a 3D neural head avatar (using INSTA) based on the edited images. The method generates high-quality, stylized avatars that preserve facial expressions and outperform existing techniques in terms of visual fidelity and temporal consistency. The iterative dataset update strategy effectively minimizes inconsistencies and artifacts in the final rendered avatar. A perceptual study confirms the superiority of Instruct-Video2Avatar compared to baseline approaches. Limitations include potential expression inconsistencies when applying large spatial manipulations and difficulties handling edits that introduce new objects. Future work involves exploring techniques to address these limitations and improve the method's robustness and versatility. 3d head avatar, text-guided editing, neural radiance fields, diffusion models, video stylization
2306.02854 Report Asymmetric Patch Sampling for Contrastive Learning Chengchao Shen, Jianzhong Chen, Shu Wang, Hulin Kuang, Jin Liu, Jianxin Wang Asymmetric appearance between positive pairs effectively reduces the risk of representation degradation in contrastive learning. However, there are still a mass of appearance similarities between positive pairs constructed by the existing methods, which inhibits the further representation improvement. In this paper, we propose a novel asymmetric patch sampling strategy for contrastive learning, to further boost the appearance asymmetry for better representations. Specifically, dual patch sampling strategies are applied to the given image, to obtain asymmetric positive pairs. First, sparse patch sampling is conducted to obtain the first view, which reduces spatial redundancy of the image and allows a more asymmetric view. Second, a selective patch sampling is proposed to construct another view with large appearance discrepancy relative to the first one. Due to the inappreciable appearance similarity between positive pairs, the trained model is encouraged to capture the similarity on semantics, instead of low-level ones. Experimental results demonstrate that our proposed method significantly outperforms the existing self-supervised methods on both ImageNet-1K and CIFAR datasets, e.g., 2.5% finetune accuracy improvement on CIFAR100. Furthermore, our method achieves state-of-the-art performance on downstream tasks, object detection and instance segmentation on COCO. Additionally, compared to other self-supervised methods, our method is more efficient in both memory and computation during training. The source code is available at https://github.com/visresearch/aps. This paper proposes Asymmetric Patch Sampling (APS), a novel strategy for contrastive learning that constructs positive pairs with significant appearance differences but consistent semantics. Existing contrastive learning methods suffer from appearance similarities in positive pairs, hindering representation learning. This paper addresses this by maximizing appearance asymmetry while preserving semantic consistency. APS employs dual patch sampling strategies: sparse sampling for reduced spatial redundancy and selective sampling for minimizing overlapping patches between views. This encourages the model to learn semantic representations to minimize the contrastive objective. APS significantly outperforms previous state-of-the-art self-supervised methods on ImageNet-1K and CIFAR datasets. The method achieves state-of-the-art performance on downstream tasks like object detection and instance segmentation on COCO. APS demonstrates greater efficiency in memory and computation compared to other self-supervised methods. The paper primarily focuses on spatial asymmetry and could explore other forms of asymmetry. Future work can investigate extending APS to other self-supervised learning paradigms beyond contrastive learning. contrastive learning, self-supervised learning, asymmetric patch sampling, image representation learning, computer vision
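A toy version of the dual sampling strategy is sketched below for a ViT-style patch grid: the first view keeps a sparse random subset of patches, and the second view is drawn only from patches the first view did not keep, so the positive pair shares essentially no low-level appearance. The ratios and grid size are illustrative, and the real method's selective sampling is more involved than this disjoint split.

```python
# Toy asymmetric patch sampling for a ViT-style contrastive setup.
import torch

def asymmetric_patch_indices(num_patches, sparse_ratio=0.25, second_ratio=0.5,
                             generator=None):
    perm = torch.randperm(num_patches, generator=generator)
    n1 = int(num_patches * sparse_ratio)
    view1 = perm[:n1]                              # sparse first view
    remaining = perm[n1:]                          # patches unseen by view 1
    n2 = int(num_patches * second_ratio)
    view2 = remaining[torch.randperm(len(remaining), generator=generator)[:n2]]
    return view1, view2

v1, v2 = asymmetric_patch_indices(14 * 14)         # 196 patches for a 224px ViT
overlap = len(set(v1.tolist()) & set(v2.tolist()))  # == 0 by construction
```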
2306.02850 Report TRACE: 5D Temporal Regression of Avatars with Dynamic Cameras in 3D Environments Yu Sun, Qian Bao, Wu Liu, Tao Mei, Michael J. Black Although the estimation of 3D human pose and shape (HPS) is rapidly progressing, current methods still cannot reliably estimate moving humans in global coordinates, which is critical for many applications. This is particularly challenging when the camera is also moving, entangling human and camera motion. To address these issues, we adopt a novel 5D representation (space, time, and identity) that enables end-to-end reasoning about people in scenes. Our method, called TRACE, introduces several novel architectural components. Most importantly, it uses two new "maps" to reason about the 3D trajectory of people over time in camera, and world, coordinates. An additional memory unit enables persistent tracking of people even during long occlusions. TRACE is the first one-stage method to jointly recover and track 3D humans in global coordinates from dynamic cameras. By training it end-to-end, and using full image information, TRACE achieves state-of-the-art performance on tracking and HPS benchmarks. The code and dataset are released for research purposes. TRACE is a novel one-stage method for tracking and recovering 3D human motion from videos captured by moving cameras. Recovering the 3D motion of humans in a global coordinate frame is critical for applications like computer graphics, sports analysis, and XR. TRACE introduces a holistic 5D representation (space, time, identity) and leverages novel "maps" to reason about human trajectories across time in both camera and world coordinates. A memory unit is incorporated to handle long-term occlusions. TRACE outperforms previous methods in estimating global 3D human trajectories from videos with dynamic cameras. It achieves state-of-the-art results in tracking people, particularly under long-term occlusions. TRACE demonstrates the effectiveness of learning a holistic 5D representation for this task. The synthetic camera motion used to generate the DynaCam dataset may not fully capture the complexity of real-world camera movement. Future work should investigate explicitly estimating camera motion for improved global trajectory recovery. 3d human pose estimation, human motion tracking, dynamic cameras, 5d representation, temporal reasoning
2306.02741 Report ZIGNeRF: Zero-shot 3D Scene Representation with Invertible Generative Neural Radiance Fields Kanghyeok Ko, Minhyeok Lee Generative Neural Radiance Fields (NeRFs) have demonstrated remarkable proficiency in synthesizing multi-view images by learning the distribution of a set of unposed images. Despite the aptitude of existing generative NeRFs in generating 3D-consistent high-quality random samples within data distribution, the creation of a 3D representation of a singular input image remains a formidable challenge. In this manuscript, we introduce ZIGNeRF, an innovative model that executes zero-shot Generative Adversarial Network (GAN) inversion for the generation of multi-view images from a single out-of-domain image. The model is underpinned by a novel inverter that maps out-of-domain images into the latent code of the generator manifold. Notably, ZIGNeRF is capable of disentangling the object from the background and executing 3D operations such as 360-degree rotation or depth and horizontal translation. The efficacy of our model is validated using multiple real-image datasets: Cats, AFHQ, CelebA, CelebA-HQ, and CompCars. Presents ZIGNeRF, a novel approach for generating multi-view images from single, out-of-domain images using a 3D-aware zero-shot GAN inversion technique. Existing generative NeRF models struggle to create 3D representations of single, out-of-domain images without computationally expensive fine-tuning. Combines a 3D generation module (based on GIRAFFE with enhancements) with a 3D-aware GAN inversion module trained on synthesized images to map input images to the generator's latent space. Successfully generates multi-view consistent images from out-of-domain images across various datasets (Cats, AFHQ, CelebA, CelebA-HQ, CompCars). Demonstrates 3D controllability, including 360-degree rotation and object disentanglement. Shows robust adaptation capabilities, generating plausible multi-view images from FFHQ images using a model trained on CelebA-HQ. Explored generating multiple objects in a single scene only with CompCars dataset. Future work includes enabling image editing by manipulating the inverted latent code. generative neural radiance fields, gan inversion, multi-view synthesis, zero-shot learning, 3d image representation
2306.02583 Report Stable Diffusion is Unstable Chengbin Du, Yanxi Li, Zhongwei Qiu, Chang Xu Recently, text-to-image models have been thriving. Despite their powerful generative capacity, our research has uncovered a lack of robustness in this generation process. Specifically, the introduction of small perturbations to the text prompts can result in the blending of primary subjects with other categories or their complete disappearance in the generated images. In this paper, we propose Auto-attack on Text-to-image Models (ATM), a gradient-based approach, to effectively and efficiently generate such perturbations. By learning a Gumbel Softmax distribution, we can make the discrete process of word replacement or extension continuous, thus ensuring the differentiability of the perturbation generation. Once the distribution is learned, ATM can sample multiple attack samples simultaneously. These attack samples can prevent the generative model from generating the desired subjects without compromising image quality. ATM has achieved a 91.1% success rate in short-text attacks and an 81.2% success rate in long-text attacks. Further empirical analysis revealed four attack patterns based on: 1) the variability in generation speed, 2) the similarity of coarse-grained characteristics, 3) the polysemy of words, and 4) the positioning of words. The paper proposes ATM (Auto-attack on Text-to-image Models), a gradient-based approach to generate attack prompts against text-to-image models, causing them to fail in generating desired subjects. This is important for revealing vulnerabilities in text-to-image generation pipelines and inspiring research on attack/defense mechanisms to improve their robustness. ATM uses a Gumbel Softmax distribution to enable differentiable word replacements or extensions in text prompts. It incorporates fluency and semantic similarity constraints during optimization and utilizes a margin loss to minimize the classifier's confidence in the true class. ATM achieves a 91.1% success rate in short-text attacks and 81.2% in long-text attacks. Four attack patterns are identified: variability in generation speed, similarity of coarse-grained characteristics, polysemy of words, and positioning of words. Generated attack prompts are transferable to other models like DALL·E2 and Midjourney (black-box attacks). The paper mainly focuses on attacking Stable Diffusion; applying ATM to other text-to-image models is left for future work. The impact of different decoding strategies on attack performance could be further explored. text-to-image generation, adversarial attacks, stable diffusion, vulnerability analysis, robustness
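As an aside, the core trick that makes discrete word replacement differentiable can be illustrated with a short PyTorch sketch using the Gumbel-Softmax relaxation. The vocabulary size, embedding table, and attack objective below are placeholders, not ATM's actual pipeline or losses.

```python
import torch
import torch.nn.functional as F

vocab_size, embed_dim = 1000, 64                     # hypothetical sizes
embedding = torch.nn.Embedding(vocab_size, embed_dim)

# Learnable logits over the vocabulary for one replaceable token position.
logits = torch.zeros(vocab_size, requires_grad=True)
optimizer = torch.optim.Adam([logits], lr=0.1)

def attack_objective(token_embedding):
    # Placeholder for the attack objective (e.g., a margin loss that lowers the
    # classifier's confidence in the true class, plus fluency constraints).
    return token_embedding.pow(2).mean()

for step in range(100):
    # Gumbel-Softmax yields a differentiable, near-one-hot sample over words,
    # so gradients flow through the discrete word-replacement choice.
    one_hot = F.gumbel_softmax(logits, tau=0.5, hard=True)
    token_embedding = one_hot @ embedding.weight     # soft embedding lookup
    loss = attack_objective(token_embedding)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Once the distribution is learned, multiple discrete attack tokens can be sampled.
samples = [int(F.gumbel_softmax(logits, tau=0.5, hard=True).argmax()) for _ in range(5)]
```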
2306.02245 Report SAM3D: Zero-Shot 3D Object Detection via Segment Anything Model Dingyuan Zhang, Dingkang Liang, Hongcheng Yang, Zhikang Zou, Xiaoqing Ye, Zhe Liu, Xiang Bai With the development of large language models, many remarkable linguistic systems like ChatGPT have thrived and achieved astonishing success on many tasks, showing the incredible power of foundation models. In the spirit of unleashing the capability of foundation models on vision tasks, the Segment Anything Model (SAM), a vision foundation model for image segmentation, has been proposed recently and presents strong zero-shot ability on many downstream 2D tasks. However, whether SAM can be adapted to 3D vision tasks has yet to be explored, especially 3D object detection. With this inspiration, we explore adapting the zero-shot ability of SAM to 3D object detection in this paper. We propose a SAM-powered BEV processing pipeline to detect objects and get promising results on the large-scale Waymo open dataset. As an early attempt, our method takes a step toward 3D object detection with vision foundation models and presents the opportunity to unleash their power on 3D vision tasks. The code is released at https://github.com/DYZhang09/SAM3D. This paper presents SAM3D, a method for zero-shot 3D object detection using the Segment Anything Model (SAM) by leveraging Bird's Eye View (BEV) representations of LiDAR data. Exploring zero-shot 3D object detection is crucial for practical applications due to the high cost of 3D data annotation. This work investigates the potential of powerful vision foundation models like SAM for 3D vision tasks. SAM3D projects LiDAR points to BEV images, enhances them to better fit SAM's training domain, employs SAM for segmentation with mesh grid prompts, applies rule-based post-processing to filter noisy masks, and finally predicts 3D bounding boxes by leveraging depth information from BEV and LiDAR points. SAM3D demonstrates the capability of SAM to segment objects in BEV images without any 3D training data, showcasing promising zero-shot detection ability. Using reflection intensity and a predefined color palette for BEV generation significantly improves the performance compared to binary or grayscale BEV representations. Post-processing techniques, including morphological dilation for BEV and area/aspect ratio filtering for masks, are crucial for bridging the domain gap and enhancing detection results. SAM3D's reliance on BEV might limit its applicability in indoor scenes with vertical object stacking. The inference speed, while improved, is still limited by SAM's complexity, especially with a large number of prompts. zero-shot learning, 3d object detection, lidar, segment anything model (sam), bird's eye view (bev)
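The BEV rasterization step at the heart of this pipeline can be sketched with NumPy alone; the point cloud, range, and resolution below are invented, and the SAM prompting and post-processing stages are only indicated in comments.

```python
import numpy as np

# Hypothetical LiDAR sweep: N points with (x, y, z, reflection intensity).
points = np.random.rand(100_000, 4) * [100.0, 100.0, 4.0, 1.0] - [50.0, 50.0, 2.0, 0.0]

bev_range, resolution = 50.0, 0.2                    # metres, metres per pixel
size = int(2 * bev_range / resolution)
bev = np.zeros((size, size), dtype=np.float32)

# Keep points inside the square region and rasterize intensity into a top-down grid.
keep = (np.abs(points[:, 0]) < bev_range) & (np.abs(points[:, 1]) < bev_range)
cols = ((points[keep, 0] + bev_range) / resolution).astype(int)
rows = ((points[keep, 1] + bev_range) / resolution).astype(int)
np.maximum.at(bev, (rows, cols), points[keep, 3])    # max intensity per BEV cell

# 'bev' would then be colorized with a palette, dilated, and passed to SAM with
# mesh-grid point prompts; the resulting masks are filtered by area/aspect ratio.
```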
2306.02236 Report Detector Guidance for Multi-Object Text-to-Image Generation Luping Liu, Zijian Zhang, Yi Ren, Rongjie Huang, Xiang Yin, Zhou Zhao Diffusion models have demonstrated impressive performance in text-to-image generation. They utilize a text encoder and cross-attention blocks to infuse textual information into images at a pixel level. However, their capability to generate images with text containing multiple objects is still restricted. Previous works identify the problem of information mixing in the CLIP text encoder and introduce the T5 text encoder or incorporate strong prior knowledge to assist with the alignment. We find that mixing problems also occur on the image side and in the cross-attention blocks. The noisy images can cause different objects to appear similar, and the cross-attention blocks inject information at a pixel level, leading to leakage of global object understanding and resulting in object mixing. In this paper, we introduce Detector Guidance (DG), which integrates a latent object detection model to separate different objects during the generation process. DG first performs latent object detection on cross-attention maps (CAMs) to obtain object information. Based on this information, DG then masks conflicting prompts and enhances related prompts by manipulating the following CAMs. We evaluate the effectiveness of DG using Stable Diffusion on COCO, CC, and a novel multi-related object benchmark, MRO. Human evaluations demonstrate that DG provides an 8-22% advantage in preventing the amalgamation of conflicting concepts and ensuring that each object possesses its unique region without any human involvement or additional iterations. Our implementation is available at https://github.com/luping-liu/Detector-Guidance. This paper introduces Detector Guidance (DG), a method that integrates a latent object detection model into pre-trained diffusion models to improve the generation of images with multiple objects. Existing text-to-image diffusion models struggle with generating images containing multiple objects due to information mixing problems, leading to attribute mixing, object mixing, and object disappearance. DG uses a latent object detection model trained on cross-attention maps (CAMs) to identify objects during the image generation process. It then leverages object information to correct CAMs by masking conflicting prompts and enhancing related prompts, improving object separation and attribute alignment. DG achieves 8-22% improvement in preventing the mixing of conflicting concepts and ensures each object has its unique region. The latent object detection model, trained on COCO, exhibits good generalization to unseen categories. DG shows improvement in both FID and CLIP-score when the guidance scale is larger than 3. Limited improvement in practice despite the theoretical importance of Smooth Involvement. Reliance on language parsers, which can sometimes introduce errors. diffusion models, text-to-image generation, object detection, cross-attention, multi-object generation
2306.02083 Report Efficient Text-Guided 3D-Aware Portrait Generation with Score Distillation Sampling on Distribution Yiji Cheng, Fei Yin, Xiaoke Huang, Xintong Yu, Jiaxiang Liu, Shikun Feng, Yujiu Yang, Yansong Tang Text-to-3D is an emerging task that allows users to create 3D content with infinite possibilities. Existing works tackle the problem by optimizing a 3D representation with guidance from pre-trained diffusion models. An apparent drawback is that they need to optimize from scratch for each prompt, which is computationally expensive and often yields poor visual fidelity. In this paper, we propose DreamPortrait, which aims to generate text-guided 3D-aware portraits in a single forward pass for efficiency. To achieve this, we extend Score Distillation Sampling from datapoint to distribution formulation, which injects semantic prior into a 3D distribution. However, the direct extension will lead to the mode collapse problem since the objective only pursues semantic alignment. Hence, we propose to optimize a distribution with hierarchical condition adapters and GAN loss regularization. For better 3D modeling, we further design a 3D-aware gated cross-attention mechanism to explicitly let the model perceive the correspondence between the text and the 3D-aware space. These elaborated designs enable our model to generate portraits with robust multi-view semantic consistency, eliminating the need for optimization-based methods. Extensive experiments demonstrate our model's highly competitive performance and significant speed boost against existing methods. Proposes DreamPortrait, a method for efficient text-guided 3D-aware portrait generation by extending Score Distillation Sampling (SDS) to a distribution formulation. Existing text-to-3D methods are computationally expensive, requiring optimization from scratch for each prompt, and often yield poor visual fidelity, especially in multi-view consistency. Extends SDS to optimize a 3D-aware distribution using hierarchical condition adapters to inject textual information and GAN loss regularization to prevent mode collapse. Employs a 3D-aware gated cross-attention mechanism to enhance multi-view consistency. Significantly faster than optimization-based methods, achieving ~15 FPS generation speed. Generates higher-quality 3D portraits with better multi-view semantic consistency compared to two-stage methods. Demonstrates robust generalization ability by effectively handling out-of-distribution text prompts. Limited to generating avatars and cannot handle general 3D scenes or objects. Future work includes expanding modeling capabilities to broader applications beyond avatars. text-to-3d, score distillation sampling, 3d-aware portrait generation, multi-view consistency, generative adversarial networks
2306.02080 Report Benchmarking Robustness of Adaptation Methods on Pre-trained Vision-Language Models Shuo Chen, Jindong Gu, Zhen Han, Yunpu Ma, Philip Torr, Volker Tresp Various adaptation methods, such as LoRA, prompts, and adapters, have been proposed to enhance the performance of pre-trained vision-language models in specific domains. The robustness of these adaptation methods against distribution shifts has not been studied. In this study, we assess the robustness of 11 widely-used adaptation methods across 4 vision-language datasets under multimodal corruptions. Concretely, we introduce 7 benchmark datasets, including 96 visual and 87 textual corruptions, to investigate the robustness of different adaptation methods, the impact of available adaptation examples, and the influence of trainable parameter size during adaptation. Our analysis reveals that: 1) Adaptation methods are more sensitive to text corruptions than visual corruptions. 2) Full fine-tuning does not consistently provide the highest robustness; instead, adapters can achieve better robustness with comparable clean performance. 3) Contrary to expectations, our findings indicate that increasing the number of adaptation data and parameters does not guarantee enhanced robustness; instead, it can even result in lower robustness. We hope this study could benefit future research in the development of robust multimodal adaptation methods. The benchmark, code, and dataset used in this study can be accessed at https://adarobustness.github.io. This paper introduces a large-scale benchmark to evaluate the robustness of different adaptation methods for pre-trained vision-language models under multimodal corruptions, including variations in lighting conditions in images and typos in texts. Robustness against distribution shifts in vision-language models is crucial for real-world applications, especially in safety-critical domains like self-driving systems and clinical diagnostics. The authors introduce a benchmark with 96 visual and 87 textual corruptions across 4 VL datasets. They evaluate 11 widely-used adaptation methods, analyzing the impact of adaptation examples and trainable parameter size. Adaptation methods are more sensitive to text corruptions than visual corruptions. Full fine-tuning doesn't guarantee the highest robustness; adapters can achieve better robustness with comparable clean performance. Increasing adaptation data and parameters doesn't guarantee enhanced robustness; it can even lead to lower robustness. The analysis covers only a limited number of multimodal models due to the availability of usable code and model weights. Future work includes investigating more diverse VL models and designing more robust adaptation methods. robustness, vision-language models, adaptation methods, multimodal corruptions, benchmarking
2306.02064 Report Flew Over Learning Trap: Learn Unlearnable Samples by Progressive Staged Training Pucheng Dang, Xing Hu, Kaidi Xu, Jinhao Duan, Di Huang, Husheng Han, Rui Zhang, Zidong Du, Qi Guo, Yunji Chen Unlearning techniques are proposed to prevent third parties from exploiting unauthorized data; they generate unlearnable samples by adding imperceptible perturbations to data before public release. These unlearnable samples effectively misguide model training to learn perturbation features while ignoring image semantic features. We conduct an in-depth analysis and observe that models can learn both image features and perturbation features of unlearnable samples at an early stage, but rapidly enter an overfitting stage because the shallow layers tend to overfit on perturbation features. Based on these observations, we propose Progressive Staged Training to effectively prevent models from overfitting on perturbation features. We evaluate our method on multiple model architectures over diverse datasets, e.g., CIFAR-10, CIFAR-100, and ImageNet-mini. Our method circumvents the unlearnability of all state-of-the-art methods in the literature and provides a reliable baseline for further evaluation of unlearnable techniques. This paper proposes a novel training framework called Progressive Staged Training (ST) to defeat unlearnable samples, a data protection method. Unlearnable samples, by injecting imperceptible perturbations into training data, mislead models into learning perturbation features instead of valuable semantic features, limiting their practical use. This work aims to circumvent this protection and provide a reliable baseline for evaluating unlearnable techniques. ST utilizes an Activation Cluster Measurement (ACM) to identify model overfitting on perturbation features. It then adjusts learning rates progressively, slowing down shallow layer learning to resist overfitting. The authors also investigate the effectiveness of color-jitter and gray-scale augmentation (CG). ST significantly improves model accuracy on unlearnable samples across various datasets (CIFAR-10, CIFAR-100, ImageNet-mini) and model architectures (ResNet, VGG, DenseNet, WideResNet). The augmentation CG further enhances ST's performance, making perturbations less effective even when CG is used during their generation. Analysis of activation patterns and loss landscape demonstrates that ST effectively prevents overfitting on perturbation features. The paper lacks a clear explanation of why CG augmentation weakens the protection of unlearnable perturbations. Further investigation into the effectiveness of ST under different mixing ratios of unlearnable samples and clean data is required. unlearnable samples, data protection, overfitting, staged training, data augmentation
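A minimal sketch of the underlying mechanism, slowing shallow layers so they cannot latch onto perturbation features, using PyTorch optimizer parameter groups. The toy model and learning rates are placeholders, and the ACM-triggered staging schedule from the paper is omitted.

```python
import torch
import torch.nn as nn

model = nn.Sequential(                               # stand-in for a CIFAR classifier
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10),
)

shallow_params = list(model[0].parameters())         # earliest conv layer
deep_params = [p for name, p in model.named_parameters() if not name.startswith("0.")]

# Shallow layers learn at a much smaller rate, resisting overfitting on
# perturbation features, while deeper layers keep learning semantic features.
optimizer = torch.optim.SGD(
    [{"params": shallow_params, "lr": 1e-3},
     {"params": deep_params, "lr": 1e-1}],
    momentum=0.9,
)
```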
2306.02000 Report Context-PIPs: Persistent Independent Particles Demands Spatial Context Features Weikang Bian, Zhaoyang Huang, Xiaoyu Shi, Yitong Dong, Yijin Li, Hongsheng Li We tackle the problem of Persistent Independent Particles (PIPs), also called Tracking Any Point (TAP), in videos, which specifically aims at estimating persistent long-term trajectories of query points in videos. Previous methods attempted to estimate these trajectories independently to incorporate longer image sequences, therefore ignoring the potential benefits of incorporating spatial context features. We argue that independent video point tracking also demands spatial context features. To this end, we propose a novel framework, Context-PIPs, which effectively improves point trajectory accuracy by aggregating spatial context features in videos. Context-PIPs contains two main modules: 1) a SOurce Feature Enhancement (SOFE) module, and 2) a TArget Feature Aggregation (TAFA) module. Context-PIPs improves PIPs across the board, reducing Average Trajectory Error of Occluded Points (ATE-Occ) by 11.4% on CroHD and increasing Average Percentage of Correct Keypoint (A-PCK) by 11.8% on TAP-Vid-Kinetics. Demos are available at https://wkbian.github.io/Projects/Context-PIPs/. The paper introduces Context-PIPs, a novel framework for enhancing Persistent Independent Particles (PIPs) in video point tracking by incorporating spatial context features from both source and target frames. Existing methods for video point tracking, like PIPs, primarily focus on temporal information while neglecting valuable spatial context, limiting accuracy and robustness, especially in challenging scenarios like occlusions or texture-less regions. Context-PIPs extends PIPs with two key modules: SOFE (Source Feature Enhancement) and TAFA (Target Feature Aggregation). SOFE leverages self-similarity in the source frame to guide the sampling of auxiliary features, enriching the representation of the query point. TAFA utilizes cross-attention between augmented correlation features and target frame features to aggregate relevant context, further improving point trajectory refinement. Context-PIPs achieves state-of-the-art performance on four benchmarks: FlyingThings++, CroHD, TAP-Vid-DAVIS, and TAP-Vid-Kinetics, demonstrating significant improvements over previous methods like PIPs and TAP-Net. The ablation study confirms the effectiveness of both SOFE and TAFA modules in enhancing point tracking accuracy. Context-PIPs exhibits high efficiency, achieving superior results even with fewer parameters and computations compared to PIPs. Context-PIPs currently relies on a sliding window approach for tracking, limiting its ability to re-identify points that become lost. Future work will explore methods for re-identifying lost points when they reappear in the video. video point tracking, persistent independent particles (pips), spatial context, source feature enhancement, target feature aggregation
2306.01923 Report The Surprising Effectiveness of Diffusion Models for Optical Flow and Monocular Depth Estimation Saurabh Saxena, Charles Herrmann, Junhwa Hur, Abhishek Kar, Mohammad Norouzi, Deqing Sun, David J. Fleet Denoising diffusion probabilistic models have transformed image generation with their impressive fidelity and diversity. We show that they also excel in estimating optical flow and monocular depth, surprisingly, without task-specific architectures and loss functions that are predominant for these tasks. Compared to the point estimates of conventional regression-based methods, diffusion models also enable Monte Carlo inference, e.g., capturing uncertainty and ambiguity in flow and depth. With self-supervised pre-training, the combined use of synthetic and real data for supervised training, technical innovations (infilling and step-unrolled denoising diffusion training) to handle noisy, incomplete training data, and a simple form of coarse-to-fine refinement, one can train state-of-the-art diffusion models for depth and optical flow estimation. Extensive experiments focus on quantitative performance against benchmarks, ablations, and the model's ability to capture uncertainty and multimodality, and impute missing values. Our model, DDVM (Denoising Diffusion Vision Model), obtains a state-of-the-art relative depth error of 0.074 on the indoor NYU benchmark and an Fl-all outlier rate of 3.26% on the KITTI optical flow benchmark, about 25% better than the best published method. For an overview see https://diffusion-vision.github.io. The paper introduces a denoising diffusion model for optical flow and monocular depth estimation, using an image-to-image translation framework without task-specific architectures or loss functions. This approach offers several advantages over traditional regression-based methods, including the ability to capture uncertainty and multimodality, and impute missing values. The model is trained using a combination of self-supervised pretraining and supervised training on synthetic and real data. Key technical innovations include infilling, step-unrolled denoising diffusion training, and coarse-to-fine refinement. The model achieves state-of-the-art results on optical flow benchmarks, with a 3.26% Fl-all outlier rate on KITTI, significantly outperforming the best published method. For monocular depth estimation, the model obtains a state-of-the-art relative depth error of 0.074 on the indoor NYU benchmark. The diffusion model effectively captures multimodality and uncertainty in both depth and optical flow, enabling it to handle challenging cases such as transparent, reflective, and occluded regions. Diffusion models are computationally expensive compared to traditional methods, requiring many denoising steps during inference, which leads to longer inference times. While achieving state-of-the-art on zero-shot optical flow estimation for both Sintel and KITTI, the model falls behind FlowFormer on Sintel after finetuning. Possible reasons include a fine-tuning procedure better suited for KITTI, and a significant domain gap between training and testing data for Sintel compared to KITTI. diffusion models, optical flow estimation, monocular depth estimation, image-to-image translation, generative models
2306.01900 Report Conditional Generation from Unconditional Diffusion Models using Denoiser Representations Alexandros Graikos, Srikar Yellapragada, Dimitris Samaras Denoising diffusion models have gained popularity as a generative modeling technique for producing high-quality and diverse images. Applying these models to downstream tasks requires conditioning, which can take the form of text, class labels, or other forms of guidance. However, providing conditioning information to these models can be challenging, particularly when annotations are scarce or imprecise. In this paper, we propose adapting pre-trained unconditional diffusion models to new conditions using the learned internal representations of the denoiser network. We demonstrate the effectiveness of our approach on various conditional generation tasks, including attribute-conditioned generation and mask-conditioned generation. Additionally, we show that augmenting the Tiny ImageNet training set with synthetic images generated by our approach improves the classification accuracy of ResNet baselines by up to 8%. Our approach provides a powerful and flexible way to adapt diffusion models to new conditions and generate high-quality augmented data for various conditional generation tasks. This paper proposes a method to adapt pre-trained unconditional diffusion models to new conditions (e.g., attributes, masks) using the learned internal representations of the denoiser network. This is important because it allows for conditional image generation even when annotations are scarce or imprecise, eliminating the need for extensive labeled data for training guidance classifiers. The method leverages the denoiser's intermediate features to train a guidance network, exploiting the denoiser's robustness to noisy inputs and ability to learn from limited data. This guidance network then modifies the diffusion process to generate images aligned with the desired conditions. For larger datasets, the method combines this guidance with fine-tuning and rejection sampling to further enhance image quality. The method achieves comparable FID scores to state-of-the-art methods for few-shot attribute-conditioned generation on CelebA-64. It outperforms baseline approaches in few-shot segmentation-conditioned generation on CelebA-Mask, achieving better mIoU and FID scores. Augmenting Tiny ImageNet with synthetic images generated by this method significantly improves classification accuracy (up to 8%) over ResNet baselines, demonstrating its potential for data augmentation. The method relies on the assumption that the denoiser's estimates of the final image become accurate relatively early in the denoising process. Future work could explore methods for controlling the guidance strength during sampling to better balance image quality and diversity. diffusion models, conditional image generation, few-shot learning, data augmentation, image classification
2306.01721 Report Denoising Diffusion Semantic Segmentation with Mask Prior Modeling Zeqiang Lai, Yuchen Duan, Jifeng Dai, Ziheng Li, Ying Fu, Hongsheng Li, Yu Qiao, Wenhai Wang The evolution of semantic segmentation has long been dominated by learning more discriminative image representations for classifying each pixel. Despite the prominent advancements, the priors of segmentation masks themselves, e.g., geometric and semantic constraints, are still under-explored. In this paper, we propose to ameliorate the semantic segmentation quality of existing discriminative approaches with a mask prior modeled by a recently-developed denoising diffusion generative model. Beginning with a unified architecture that adapts diffusion models for mask prior modeling, we focus this work on a specific instantiation with discrete diffusion and identify a variety of key design choices for its successful application. Our exploratory analysis revealed several important findings, including: (1) a simple integration of diffusion models into semantic segmentation is not sufficient, and a poorly-designed diffusion process might lead to degradation in segmentation performance; (2) during training, the object to which noise is added is more important than the type of noise; (3) during inference, the strict diffusion denoising scheme may not be essential and can be relaxed to a simpler scheme that even works better. We evaluate the proposed prior modeling with several off-the-shelf segmentors, and our experimental results on ADE20K and Cityscapes demonstrate that our approach could achieve competitive quantitative performance and more appealing visual quality. This paper proposes DDPS, a novel framework utilizing denoising diffusion generative models to enhance semantic segmentation by modeling segmentation mask priors, such as geometric and semantic constraints. Existing semantic segmentation methods primarily focus on discriminative feature learning, often overlooking the intrinsic properties and priors of segmentation masks themselves, which limits their performance. DDPS employs a two-stage pipeline. First, an off-the-shelf segmentation model generates initial predictions. Then, a denoising diffusion model, specifically a discrete diffusion model in this work, refines these predictions by aligning them with the learned mask prior distribution. DDPS consistently enhances the performance of various base segmentation models, including DeepLabV3+ and Segformer, on ADE20K and Cityscapes datasets. The method demonstrates significant gains in boundary IoU, indicating its effectiveness in modeling geometric constraints. Key design choices, such as noise applied to the first prediction and free re-noising during inference, are crucial for DDPS's success. The impact of DDPS on datasets with less inherent structure, like Cityscapes, is less pronounced compared to datasets like ADE20K. Exploration of more sophisticated mask representation codecs and alternative diffusion models beyond discrete diffusion could be interesting future directions. semantic segmentation, denoising diffusion models, mask prior modeling, generative models for segmentation, deep learning
2306.01667 Report Towards In-context Scene Understanding Ivana Balažević, David Steiner, Nikhil Parthasarathy, Relja Arandjelović, Olivier J. Hénaff In-context learning, the ability to configure a model's behavior with different prompts, has revolutionized the field of natural language processing, alleviating the need for task-specific models and paving the way for generalist models capable of assisting with any query. Computer vision, in contrast, has largely stayed in the former regime: specialized decoders and finetuning protocols are generally required to perform dense tasks such as semantic segmentation and depth estimation. In this work we explore a simple mechanism for in-context learning of such scene understanding tasks: nearest neighbor retrieval from a prompt of annotated features. We propose a new pretraining protocol, leveraging attention within and across images, which yields representations particularly useful in this regime. The resulting Hummingbird model, suitably prompted, performs various scene understanding tasks without modification while approaching the performance of specialists that have been finetuned for each task. Moreover, Hummingbird can be configured to perform new tasks much more efficiently than finetuned models, raising the possibility of scene understanding in the interactive assistant regime. This paper explores in-context learning for scene understanding tasks like semantic segmentation and depth estimation using a simple nearest neighbor retrieval mechanism from annotated features. This approach is significant because it eliminates the need for task-specific decoders and finetuning, paving the way for general-purpose vision models. The authors propose Hummingbird, a pretraining method that uses attention across and within images. The model retrieves nearest neighbors from a prompt of annotated features to make predictions on new images. Hummingbird representations perform well on semantic segmentation and depth estimation using NN retrieval without modification. The approach achieves performance comparable to fully finetuned specialist models on some tasks. Hummingbird with NN retrieval adapts to new tasks faster and more data-efficiently than finetuned models. The absolute performance in the low-data regime (fewer than 100 examples) needs improvement. Expanding the evaluation to other tasks like object detection is left for future work. in-context learning, scene understanding, nearest neighbor retrieval, self-supervised learning, computer vision
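The nearest-neighbor label-transfer mechanism described above is simple enough to sketch directly; the feature dimensions, prompt size, number of classes, and temperature below are assumptions, not Hummingbird's actual configuration.

```python
import torch
import torch.nn.functional as F

# Hypothetical prompt: N annotated patch features with per-patch class labels.
num_prompt, dim, num_classes = 4096, 256, 21
prompt_feats = F.normalize(torch.randn(num_prompt, dim), dim=-1)
prompt_labels = torch.randint(0, num_classes, (num_prompt,))

# Query image: M patch features from the frozen encoder.
num_query = 1024
query_feats = F.normalize(torch.randn(num_query, dim), dim=-1)

k, temperature = 30, 0.1
sim = query_feats @ prompt_feats.t()                 # cosine similarities (M, N)
topk_sim, topk_idx = sim.topk(k, dim=-1)             # k nearest prompt patches per query
weights = F.softmax(topk_sim / temperature, dim=-1)  # similarity-weighted voting

# Scatter the neighbours' one-hot labels and take a weighted vote per class.
neighbor_labels = F.one_hot(prompt_labels[topk_idx], num_classes).float()  # (M, k, C)
class_scores = (weights.unsqueeze(-1) * neighbor_labels).sum(dim=1)        # (M, C)
pred = class_scores.argmax(dim=-1)                   # per-patch class prediction
```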
2306.01293 Report LoCoOp: Few-Shot Out-of-Distribution Detection via Prompt Learning Atsuyuki Miyai, Qing Yu, Go Irie, Kiyoharu Aizawa We present a novel vision-language prompt learning approach for few-shot out-of-distribution (OOD) detection. Few-shot OOD detection aims to detect OOD images from classes that are unseen during training using only a few labeled in-distribution (ID) images. While prompt learning methods such as CoOp have shown effectiveness and efficiency in few-shot ID classification, they still face limitations in OOD detection due to the potential presence of ID-irrelevant information in text embeddings. To address this issue, we introduce a new approach called Local regularized Context Optimization (LoCoOp), which performs OOD regularization that utilizes the portions of CLIP local features as OOD features during training. CLIP's local features have a lot of ID-irrelevant nuisances (e.g., backgrounds), and by learning to push them away from the ID class text embeddings, we can remove the nuisances in the ID class text embeddings and enhance the separation between ID and OOD. Experiments on the large-scale ImageNet OOD detection benchmarks demonstrate the superiority of our LoCoOp over zero-shot, fully supervised detection methods and prompt learning methods. Notably, even in a one-shot setting (just one label per class), LoCoOp outperforms existing zero-shot and fully supervised detection methods. The code will be available via https://github.com/AtsuMiyai/LoCoOp. This paper tackles few-shot out-of-distribution (OOD) detection with vision-language models and proposes a novel prompt learning method called LoCoOp. Existing OOD detection methods for vision-language models are limited to zero-shot or fully supervised settings, which either suffer from domain gaps or require large training costs. Few-shot OOD detection offers a balanced solution. LoCoOp leverages CLIP's local features to identify ID-irrelevant regions (treated as OOD) and pushes them away from ID class text embeddings during training, effectively removing irrelevant information from text embeddings. LoCoOp outperforms existing zero-shot, few-shot, and fully supervised OOD detection methods on ImageNet benchmarks. Remarkably, LoCoOp surpasses existing methods with only one label per class (one-shot setting). Experiments show LoCoOp's effectiveness with both ViT and ResNet architectures. The application of LoCoOp to other light-weight tuning methods (e.g., Tip-Adapter, visual prompt methods) is left for future work. LoCoOp requires models with strong local visual-text alignment and may not be readily applicable to models lacking such capabilities. out-of-distribution detection, few-shot learning, prompt learning, vision-language models, clip
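A rough sketch of the kind of OOD regularization described above: patches whose ground-truth class ranks poorly are treated as ID-irrelevant, and their entropy over ID classes is maximized so they move away from all ID text embeddings. The shapes, the rank threshold, and the exact selection rule are assumptions rather than LoCoOp's precise formulation.

```python
import torch
import torch.nn.functional as F

num_classes, dim, num_patches = 100, 512, 196
text_embeds = F.normalize(torch.randn(num_classes, dim), dim=-1)   # learned prompt embeddings
local_feats = F.normalize(torch.randn(num_patches, dim), dim=-1)   # CLIP-like local (patch) features

probs = (local_feats @ text_embeds.t() / 0.01).softmax(dim=-1)     # per-patch class probabilities

# Patches where the ground-truth ID class ranks outside the top-K are treated
# as ID-irrelevant (background-like) regions.
true_class, K = 3, 20                                               # hypothetical label and threshold
rank = (probs > probs[:, true_class:true_class + 1]).sum(dim=-1)
ood_probs = probs[rank >= K]

# Push these regions away from all ID text embeddings by maximizing their entropy.
entropy = -(ood_probs * ood_probs.clamp_min(1e-8).log()).sum(dim=-1)
loss_ood_reg = -entropy.mean()
```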
2306.01272 Report DeepfakeArt Challenge: A Benchmark Dataset for Generative AI Art Forgery and Data Poisoning Detection Hossein Aboutalebi, Dayou Mao, Rongqi Fan, Carol Xu, Chris He, Alexander Wong The tremendous recent advances in generative artificial intelligence techniques have led to significant successes and promise in a wide range of different applications ranging from conversational agents and textual content generation to voice and visual synthesis. Amid the rise in generative AI and its increasing widespread adoption, there has been significant growing concern over the use of generative AI for malicious purposes. In the realm of visual content synthesis using generative AI, key areas of significant concern have been image forgery (e.g., generation of images containing or derived from copyrighted content) and data poisoning (i.e., generation of adversarially contaminated images). Motivated to address these key concerns to encourage responsible generative AI, we introduce the DeepfakeArt Challenge, a large-scale challenge benchmark dataset designed specifically to aid in the building of machine learning algorithms for generative AI art forgery and data poisoning detection. Comprising over 32,000 records across a variety of generative forgery and data poisoning techniques, each entry consists of a pair of images that are either forgeries / adversarially contaminated or not. Each of the generated images in the DeepfakeArt Challenge benchmark dataset (dataset link: http://anon_for_review.com) has been quality checked in a comprehensive manner. Introduces DeepfakeArt Challenge, a large-scale benchmark dataset for detecting art forgery and data poisoning in generative AI. Addresses growing concerns of copyright infringement and adversarial data poisoning in AI-generated visual content to encourage responsible generative AI. Creates over 32,000 image pairs using four generative forgery and data poisoning techniques: Inpainting, Style Transfer, Adversarial Data Poisoning, and Cutmix, based on modifications of source images from the WikiArt dataset. DINO-v2 ViT-L/14 model achieves the best overall performance for detecting similar and dissimilar image pairs. Models generally show high precision but low recall, indicating a high rate of false negatives. The high false negative rate highlights the need for more robust detection tools to identify and mitigate copyright infringements in generative AI models. The dataset currently focuses on four specific generative techniques and may not encompass the full spectrum of potential forgery methods. Future work could explore expanding the dataset with additional techniques and exploring more sophisticated detection algorithms. generative ai, copyright infringement, data poisoning, deep learning, computer vision
2306.00987 Report StyleGAN knows Normal, Depth, Albedo, and More Anand Bhattad, Daniel McKee, Derek Hoiem, D. A. Forsyth Intrinsic images, in the original sense, are image-like maps of scene properties like depth, normal, albedo or shading. This paper demonstrates that StyleGAN can easily be induced to produce intrinsic images. The procedure is straightforward. We show that, if StyleGAN produces G(w) from latents w, then for each type of intrinsic image, there is a fixed offset d_c so that G(w + d_c) is that type of intrinsic image for G(w). Here d_c is independent of w. The StyleGAN we used was pretrained by others, so this property is not some accident of our training regime. We show that there are image transformations StyleGAN will not produce in this fashion, so StyleGAN is not a generic image regression engine. It is conceptually exciting that an image generator should "know" and represent intrinsic images. There may also be practical advantages to using a generative model to produce intrinsic images. The intrinsic images obtained from StyleGAN compare well both qualitatively and quantitatively with those obtained by using SOTA image regression techniques; but StyleGAN's intrinsic images are robust to relighting effects, unlike SOTA methods. This paper reveals that StyleGAN, despite not being trained on intrinsic images, can be prompted to generate them by applying specific offsets to its latent codes. This finding is significant as it suggests that intrinsic image representations might be inherently encoded within StyleGAN, indicating a natural alignment with these scene properties. The authors search for fixed offsets in the StyleGAN latent space that correspond to different intrinsic images (normals, depth, albedo, shading, segmentation). This is achieved by minimizing the L1 distance between StyleGAN outputs and predictions from pre-trained, state-of-the-art intrinsic image prediction models. StyleGAN generates intrinsic images comparable in quality to those produced by state-of-the-art supervised methods, even though it was never explicitly trained on such data. StyleGAN-derived intrinsic images exhibit remarkable robustness to lighting variations, outperforming current leading methods which show sensitivity to such changes. A control experiment demonstrates that StyleGAN is not simply a generic image processing engine, as it cannot perform tasks unrelated to intrinsic image manipulation, like swapping image halves. The reliance on accurate GAN inversion methods to apply this technique to real images is currently a limiting factor. Further investigation into whether this capability extends to other generative models and exploration of potentially undiscovered intrinsic representations within StyleGAN is warranted. stylegan, intrinsic images, generative models, image editing, representation learning
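The offset-search procedure lends itself to a compact sketch: a single offset d_c is optimized with an L1 loss so that G(w + d_c) matches an off-the-shelf predictor applied to G(w). The toy generator and depth predictor below are stand-ins for the pretrained StyleGAN and the SOTA regressors, and the shapes are arbitrary.

```python
import torch

torch.manual_seed(0)
latent_dim, batch = 512, 8

# Stand-ins: a frozen "generator" and an off-the-shelf "depth predictor".
G = torch.nn.Sequential(torch.nn.Linear(latent_dim, 3 * 32 * 32), torch.nn.Tanh())
for p in G.parameters():
    p.requires_grad_(False)

def depth_predictor(img):                            # placeholder pretrained regressor
    return img.view(-1, 3, 32, 32).mean(dim=1, keepdim=True).expand(-1, 3, -1, -1)

# A single offset d_c, shared across all latents w.
d_c = torch.zeros(1, latent_dim, requires_grad=True)
opt = torch.optim.Adam([d_c], lr=1e-2)

for step in range(200):
    w = torch.randn(batch, latent_dim)
    target = depth_predictor(G(w))                   # pseudo ground-truth intrinsic
    pred = G(w + d_c).view(-1, 3, 32, 32)            # "intrinsic image" from the offset latent
    loss = (pred - target).abs().mean()              # L1, as in the offset search
    opt.zero_grad()
    loss.backward()
    opt.step()
```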
2306.00984 Report StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners Yonglong Tian, Lijie Fan, Phillip Isola, Huiwen Chang, Dilip Krishnan We investigate the potential of learning visual representations using synthetic images generated by text-to-image models. This is a natural question in the light of the excellent performance of such models in generating high-quality images. We consider specifically the Stable Diffusion, one of the leading open source text-to-image models. We show that (1) when the generative model is configured with proper classifier-free guidance scale, training self-supervised methods on synthetic images can match or beat the real image counterpart; (2) by treating the multiple images generated from the same text prompt as positives for each other, we develop a multi-positive contrastive learning method, which we call StableRep. With solely synthetic images, the representations learned by StableRep surpass the performance of representations learned by SimCLR and CLIP using the same set of text prompts and corresponding real images, on large scale datasets. When we further add language supervision, StableRep trained with 20M synthetic images achieves better accuracy than CLIP trained with 50M real images. This paper explores using synthetic images from text-to-image models, especially Stable Diffusion, for visual representation learning. Collecting and curating large real-world datasets for training AI models is costly and prone to bias. Leveraging generative models as data sources offers a promising alternative. The authors investigate training standard self-supervised methods (SimCLR, MAE) on Stable Diffusion images, finding optimal guidance scales for image generation. They further propose StableRep, a novel multi-positive contrastive learning approach leveraging the unique property of generative models to create diverse positive samples from a single text prompt. Training self-supervised methods on synthetic images from Stable Diffusion with an appropriate guidance scale often surpasses the performance achieved by training on an equivalent amount of real data. StableRep, solely trained on synthetic data, outperforms state-of-the-art methods like CLIP trained on real data, achieving 76.7% linear accuracy on ImageNet with ViT-B/16. Adding language supervision to StableRep exhibits a 5x improvement in caption efficiency compared to CLIP trained on real images. Current image generation speed is slow, hindering online image synthesis during training. Semantic mismatch between prompts and generated images remains an open problem, impacting data quality. representation learning, synthetic data, text-to-image generation, stable diffusion, contrastive learning
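The multi-positive contrastive objective can be written in a few lines: images generated from the same caption share a caption id, and the target assignment is spread uniformly over those positives. The batch size, feature dimension, temperature, and exact target construction are illustrative assumptions, not StableRep's exact recipe.

```python
import torch
import torch.nn.functional as F

batch, dim, temperature = 32, 128, 0.1
feats = F.normalize(torch.randn(batch, dim, requires_grad=True), dim=-1)  # encoder outputs
caption_ids = torch.randint(0, 8, (batch,))          # images from the same caption share an id

sim = feats @ feats.t() / temperature
self_mask = torch.eye(batch, dtype=torch.bool)
sim = sim.masked_fill(self_mask, float("-inf"))      # drop self-comparisons

# Ground-truth match distribution: uniform over all other images from the same caption.
positives = (caption_ids[:, None] == caption_ids[None, :]) & ~self_mask
target = positives.float() / positives.float().sum(dim=-1, keepdim=True).clamp_min(1)

# Cross-entropy between the target assignment and the contrastive softmax.
loss = -(target * F.log_softmax(sim, dim=-1)).sum(dim=-1).mean()
```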
2306.00977 Report AGILE3D: Attention Guided Interactive Multi-object 3D Segmentation Yuanwen Yue, Sabarinath Mahadevan, Jonas Schult, Francis Engelmann, Bastian Leibe, Konrad Schindler, Theodora Kontogianni During interactive segmentation, a model and a user work together to delineate objects of interest in a 3D point cloud. In an iterative process, the model assigns each data point to an object (or the background), while the user corrects errors in the resulting segmentation and feeds them back into the model. The current best practice formulates the problem as binary classification and segments objects one at a time. The model expects the user to provide positive clicks to indicate regions wrongly assigned to the background and negative clicks on regions wrongly assigned to the object. Sequentially visiting objects is wasteful since it disregards synergies between objects: a positive click for a given object can, by definition, serve as a negative click for nearby objects. Moreover, a direct competition between adjacent objects can speed up the identification of their common boundary. We introduce AGILE3D, an efficient, attention-based model that (1) supports simultaneous segmentation of multiple 3D objects, (2) yields more accurate segmentation masks with fewer user clicks, and (3) offers faster inference. Our core idea is to encode user clicks as spatial-temporal queries and enable explicit interactions between click queries as well as between them and the 3D scene through a click attention module. Every time new clicks are added, we only need to run a lightweight decoder that produces updated segmentation masks. In experiments with four different 3D point cloud datasets, AGILE3D sets a new state-of-the-art. Moreover, we also verify its practicality in real-world setups with real user studies. AGILE3D is introduced, an attention-based deep learning model for interactive segmentation of multiple objects in 3D point clouds, overcoming the limitations of sequential, single-object segmentation. Existing methods for interactive 3D segmentation are inefficient, disregarding synergies between objects, and limiting the segmentation to one object at a time. AGILE3D encodes user clicks as spatial-temporal queries, enabling interaction between clicks and the 3D scene via a click attention module. It employs a pre-computed backbone for efficiency and trains using an iterative strategy simulating user behavior. AGILE3D outperforms state-of-the-art methods in both single- and multi-object segmentation benchmarks. The model effectively segments multiple objects simultaneously, requiring fewer clicks for higher-quality masks. Real-user studies confirm AGILE3D's efficiency and the effectiveness of the iterative training strategy. AGILE3D may require more clicks to accurately segment fine-grained object parts. The model currently doesn't provide semantic labels along with the segmented masks. 3d point clouds, interactive segmentation, multi-object segmentation, attention mechanism, deep learning
2306.00973 Report Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion Models Chang Liu, Haoning Wu, Yujie Zhong, Xiaoyun Zhang, Yanfeng Wang, Weidi Xie Generative models have recently exhibited exceptional capabilities in text-to-image generation, but still struggle to generate image sequences coherently. In this work, we focus on a novel, yet challenging task of generating a coherent image sequence based on a given storyline, denoted as open-ended visual storytelling. We make the following three contributions: (i) to fulfill the task of visual storytelling, we propose a learning-based auto-regressive image generation model, termed as StoryGen, with a novel vision-language context module, that enables to generate the current frame by conditioning on the corresponding text prompt and preceding image-caption pairs; (ii) to address the data shortage of visual storytelling, we collect paired image-text sequences by sourcing from online videos and open-source E-books, establishing processing pipeline for constructing a large-scale dataset with diverse characters, storylines, and artistic styles, named StorySalon; (iii) Quantitative experiments and human evaluations have validated the superiority of our StoryGen, where we show StoryGen can generalize to unseen characters without any optimization, and generate image sequences with coherent content and consistent character. Code, dataset, and models are available at https://haoningwu3639.github.io/StoryGen_Webpage/ This paper proposes StoryGen, a novel learning-based auto-regressive image generation model for open-ended visual storytelling, enabling the generation of coherent image sequences from given storylines, even with unseen characters. Open-ended visual storytelling holds significant potential in education by offering an engaging way for children to learn visual concepts and fostering imagination, creativity, and language skills. StoryGen builds upon a pre-trained stable diffusion model and incorporates a novel vision-language context module to condition image generation on both text prompts and preceding image-caption pairs, ensuring both content coherence and character consistency. The model is trained on StorySalon, a newly constructed large-scale dataset of storybooks with diverse characters, storylines, and artistic styles. StoryGen generates visually coherent stories with unseen characters without requiring character-specific optimization. Quantitative experiments show StoryGen outperforms baselines in terms of image quality and text-image alignment, validated by FID and CLIP scores. Human evaluations confirm StoryGen's superiority in generating coherent and engaging visual stories, as evidenced by higher scores in consistency, quality, and user preference. StoryGen inherits limitations from the underlying stable diffusion model, such as inaccuracies in limb counts and reduced quality with multiple objects. Future work will explore more robust architectures like DALL-E 3 or consistency models to address these limitations. visual storytelling, latent diffusion models, open-ended generation, storysalon dataset, character consistency
2306.00968 Report GRES: Generalized Referring Expression Segmentation Chang Liu, Henghui Ding, Xudong Jiang Referring Expression Segmentation (RES) aims to generate a segmentation mask for the object described by a given language expression. Existing classic RES datasets and methods commonly support single-target expressions only, i.e., one expression refers to one target object. Multi-target and no-target expressions are not considered. This limits the usage of RES in practice. In this paper, we introduce a new benchmark called Generalized Referring Expression Segmentation (GRES), which extends the classic RES to allow expressions to refer to an arbitrary number of target objects. Towards this, we construct the first large-scale GRES dataset called gRefCOCO that contains multi-target, no-target, and single-target expressions. GRES and gRefCOCO are designed to be well-compatible with RES, facilitating extensive experiments to study the performance gap of the existing RES methods on the GRES task. In the experimental study, we find that one of the big challenges of GRES is complex relationship modeling. Based on this, we propose a region-based GRES baseline ReLA that adaptively divides the image into regions with sub-instance clues, and explicitly models the region-region and region-language dependencies. The proposed approach ReLA achieves new state-of-the-art performance on the both newly proposed GRES and classic RES tasks. The proposed gRefCOCO dataset and method are available at https://henghuiding.github.io/GRES. This paper introduces Generalized Referring Expression Segmentation (GRES), a new benchmark extending classic Referring Expression Segmentation (RES) to handle expressions referring to an arbitrary number of target objects, including multi-target and no-target expressions. Classic RES suffers from limitations as it only supports single-target expressions, hindering its practical usage in scenarios with multiple or no target objects. GRES addresses this by supporting a wider range of expressions, enabling greater flexibility and robustness in real-world applications. The authors create gRefCOCO, a large-scale dataset for GRES, by augmenting RefCOCO with multi-target and no-target expressions. They also propose ReLA, a region-based GRES baseline method that leverages sub-instance clues to explicitly model region-region and region-language dependencies. Models trained solely on single-target RES datasets generalize poorly to GRES, highlighting the need for gRefCOCO. Explicit modeling of region-region and region-language interactions significantly improves performance on GRES. ReLA achieves state-of-the-art results on both classic RES and the newly proposed GRES tasks. No-target expression identification, while improved, still presents challenges due to the deceptive nature of some expressions. Future research should focus on addressing complex relationships, such as possession and fine-grained attribute understanding. referring expression segmentation, multi-target segmentation, no-target identification, relationship modeling, computer vision
2306.00965 Report BUOL: A Bottom-Up Framework with Occupancy-aware Lifting for Panoptic 3D Scene Reconstruction From A Single Image Tao Chu, Pan Zhang, Qiong Liu, Jiaqi Wang Understanding and modeling the 3D scene from a single image is a practical problem. A recent advance proposes a panoptic 3D scene reconstruction task that performs both 3D reconstruction and 3D panoptic segmentation from a single image. Although having made substantial progress, recent works only focus on top-down approaches that fill 2D instances into 3D voxels according to estimated depth, which hinders their performance by two ambiguities. (1) instance-channel ambiguity: The variable ids of instances in each scene lead to ambiguity during filling voxel channels with 2D information, confusing the following 3D refinement. (2) voxel-reconstruction ambiguity: 2D-to-3D lifting with estimated single view depth only propagates 2D information onto the surface of 3D regions, leading to ambiguity during the reconstruction of regions behind the frontal view surface. In this paper, we propose BUOL, a Bottom-Up framework with Occupancy-aware Lifting to address the two issues for panoptic 3D scene reconstruction from a single image. For instance-channel ambiguity, a bottom-up framework lifts 2D information to 3D voxels based on deterministic semantic assignments rather than arbitrary instance id assignments. The 3D voxels are then refined and grouped into 3D instances according to the predicted 2D instance centers. For voxel-reconstruction ambiguity, the estimated multi-plane occupancy is leveraged together with depth to fill the whole regions of things and stuff. Our method shows a tremendous performance advantage over state-of-the-art methods on synthetic dataset 3D-Front and real-world dataset Matterport3D. Code and models are available in https://github.com/chtsy/buol. This paper presents BUOL, a novel bottom-up framework with occupancy-aware lifting, for panoptic 3D scene reconstruction from a single RGB image. Existing top-down methods suffer from instance-channel ambiguity (inconsistent instance ID assignment) and voxel-reconstruction ambiguity (limited 2D information propagation to 3D). BUOL uses a 2D model to predict semantic maps, instance centers, depth, and multi-plane occupancy. It then lifts 2D semantics to 3D using occupancy-aware lifting and refines them with a 3D model, predicting occupancy, semantics, and offsets for 3D instance grouping. BUOL outperforms previous state-of-the-art methods on 3D-Front and Matterport3D datasets by significant margins (+11.81% and +7.46% in PRQ). The bottom-up framework effectively addresses instance-channel ambiguity by utilizing semantic information for lifting. Occupancy-aware lifting alleviates voxel-reconstruction ambiguity, enabling accurate 3D reconstruction. The performance on the Matterport3D dataset is relatively low due to noisy ground truth data. The computational cost of 3D models limits the use of complex architectures like 3D UNet. panoptic 3d scene reconstruction, single image 3d reconstruction, occupancy-aware lifting, bottom-up framework, 3d instance segmentation
2306.00956 Report The ObjectFolder Benchmark: Multisensory Learning with Neural and Real Objects Ruohan Gao, Yiming Dou, Hao Li, Tanmay Agarwal, Jeannette Bohg, Yunzhu Li, Li Fei-Fei, Jiajun Wu We introduce the ObjectFolder Benchmark, a benchmark suite of 10 tasks for multisensory object-centric learning, centered around object recognition, reconstruction, and manipulation with sight, sound, and touch. We also introduce the ObjectFolder Real dataset, including the multisensory measurements for 100 real-world household objects, building upon a newly designed pipeline for collecting the 3D meshes, videos, impact sounds, and tactile readings of real-world objects. We conduct systematic benchmarking on both the 1,000 multisensory neural objects from ObjectFolder, and the real multisensory data from ObjectFolder Real. Our results demonstrate the importance of multisensory perception and reveal the respective roles of vision, audio, and touch for different object-centric learning tasks. By publicly releasing our dataset and benchmark suite, we hope to catalyze and enable new research in multisensory object-centric learning in computer vision, robotics, and beyond. Project page: https://objectfolder.stanford.edu This paper introduces the ObjectFolder Benchmark, a suite of 10 tasks for multisensory object-centric learning, and the ObjectFolder Real dataset, which includes multisensory measurements for 100 real-world household objects. Modeling the complete multisensory profile of objects is important for applications in computer vision, robotics, graphics, and VR/AR, but existing datasets and benchmarks are limited. The authors designed a data collection pipeline for capturing 3D meshes, videos, impact sounds, and tactile readings of real objects. They also standardized 10 tasks and developed baseline approaches for each. Vision and audio are more reliable than touch for object recognition. Fusing multiple sensory modalities achieves the best results for object reconstruction. Vision and touch are both crucial for robotic manipulation tasks. Sim-to-real transfer remains challenging for some tasks. Future work includes exploring more robust sim-to-real calibration methods. multisensory learning, object-centric learning, benchmarking, dataset, robotics
2306.00943 Report Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance Jinbo Xing, Menghan Xia, Yuxin Liu, Yuechen Zhang, Yong Zhang, Yingqing He, Hanyuan Liu, Haoxin Chen, Xiaodong Cun, Xintao Wang, Ying Shan, Tien-Tsin Wong Creating a vivid video from the event or scenario in our imagination is a truly fascinating experience. Recent advancements in text-to-video synthesis have unveiled the potential to achieve this with prompts only. While text is convenient in conveying the overall scene context, it may be insufficient to control precisely. In this paper, we explore customized video generation by utilizing text as context description and motion structure (e.g. frame-wise depth) as concrete guidance. Our method, dubbed Make-Your-Video, involves joint-conditional video generation using a Latent Diffusion Model that is pre-trained for still image synthesis and then promoted for video generation with the introduction of temporal modules. This two-stage learning scheme not only reduces the computing resources required, but also improves the performance by transferring the rich concepts available in image datasets solely into video generation. Moreover, we use a simple yet effective causal attention mask strategy to enable longer video synthesis, which mitigates the potential quality degradation effectively. Experimental results show the superiority of our method over existing baselines, particularly in terms of temporal coherence and fidelity to users' guidance. In addition, our model enables several intriguing applications that demonstrate potential for practical usage. This paper introduces Make-Your-Video, an efficient approach for customized video generation using both textual descriptions and motion structures (e.g., frame-wise depth) as guidance. The method aims to address the limitations of text-only video generation, where precise control over video content can be challenging. By incorporating motion structures, the model offers enhanced controllability and enables users to create videos that closely align with their specific vision. The method leverages a Latent Diffusion Model (LDM) pre-trained for still image synthesis and adapts it for video generation. It introduces temporal modules while keeping the pre-trained spatial modules frozen to maintain visual richness. A causal attention mask strategy is also employed to enhance temporal coherence, especially in longer video synthesis. Make-Your-Video outperforms existing text-to-video generation baselines in terms of temporal coherence and fidelity to user guidance, as demonstrated by quantitative metrics like FVD and KVD. The method enables various applications, including generating videos from real-life scene setups, 3D scene modeling, and video re-rendering with different styles. Ablation studies confirm the importance of the proposed adapting strategy and the causal attention mask for improving performance. The model currently lacks precise control over visual details, such as synthesizing videos featuring specific individuals or objects. Relying on frame-wise depth guidance can be demanding; exploring sparse keyframe guidance could broaden applicability. video generation, text-to-video synthesis, latent diffusion models, motion structures, conditional video generation
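A small sketch of a causal temporal attention mask of the kind the paper's strategy relies on for longer video synthesis: each frame attends only to itself and to earlier frames. The optional sliding-window limit and the function name are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def causal_temporal_mask(num_frames: int, context: int = 4) -> torch.Tensor:
    """
    Boolean mask of shape (num_frames, num_frames) for temporal self-attention:
    entry (i, j) is True when frame i may attend to frame j, i.e. j <= i and,
    optionally, within the last `context` frames.
    """
    idx = torch.arange(num_frames)
    causal = idx[None, :] <= idx[:, None]             # only past (and current) frames
    window = (idx[:, None] - idx[None, :]) < context  # limit how far back we look
    return causal & window

mask = causal_temporal_mask(8, context=4)
print(mask.int())
# Such a mask can be passed as attn_mask to
# torch.nn.functional.scaled_dot_product_attention inside a temporal module.
```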
2306.00926 Report Inserting Anybody in Diffusion Models via Celeb Basis Ge Yuan, Xiaodong Cun, Yong Zhang, Maomao Li, Chenyang Qi, Xintao Wang, Ying Shan, Huicheng Zheng Exquisite demand exists for customizing the pretrained large text-to-image model, $\textit{e.g.}$, Stable Diffusion, to generate innovative concepts, such as the users themselves. However, the newly-added concept from previous customization methods often shows weaker combination abilities than the original ones even given several images during training. We thus propose a new personalization method that allows for the seamless integration of a unique individual into the pre-trained diffusion model using just $\textbf{one facial photograph}$ and only $\textbf{1024 learnable parameters}$ under $\textbf{3 minutes}$. So as we can effortlessly generate stunning images of this person in any pose or position, interacting with anyone and doing anything imaginable from text prompts. To achieve this, we first analyze and build a well-defined celeb basis from the embedding space of the pre-trained large text encoder. Then, given one facial photo as the target identity, we generate its own embedding by optimizing the weight of this basis and locking all other parameters. Empowered by the proposed celeb basis, the new identity in our customized model showcases a better concept combination ability than previous personalization methods. Besides, our model can also learn several new identities at once and interact with each other where the previous customization model fails to. The code will be released. This paper presents a novel personalization method for pre-trained text-to-image models, enabling the seamless integration of a unique individual into the model using only one facial photograph and 1024 learnable parameters. Existing methods for customizing pre-trained text-to-image models often struggle to generate text description-aligned images with newly learned concepts, especially for fine-grained concepts like human identities. This limits users' ability to generate images of themselves or others in diverse scenarios. The method leverages a 'celeb basis' constructed from the embedding space of celebrity names in the pre-trained model. Given a facial photo, the method optimizes coefficients for this basis to represent the new identity. This personalized embedding then drives image generation, preserving the model's original composition abilities. The method produces high-quality images of new identities that maintain consistency with text prompts. It surpasses previous personalization methods in terms of identity preservation and concept combination abilities. The approach is efficient, requiring only 1024 learnable parameters and 3 minutes of training time per identity. The quality of generated images is limited by the pre-trained model's inherent biases and artifacts. The current work focuses on human faces, and exploring the applicability to other concept classes remains for future work. text-to-image generation, personalization, diffusion models, identity representation, celeb basis
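A minimal sketch of the representation described above: a new identity is expressed as a learnable weighted combination of a fixed basis derived from celebrity-name embeddings, so only the combination weights are optimized while the rest of the model stays frozen. The sizes, the random basis, and the MSE stand-in for the diffusion reconstruction loss are assumptions for illustration only.

```python
import torch

# Illustrative sizes (assumptions): a basis of 512 directions in a 768-dim
# text-embedding space. The paper reports roughly 1024 learnable parameters
# in total; the exact basis construction is not reproduced here.
num_basis, embed_dim = 512, 768

# Fixed "celeb basis": in the paper it is built from celebrity-name embeddings
# of the frozen text encoder; random unit vectors stand in for this sketch.
celeb_basis = torch.randn(num_basis, embed_dim)
celeb_basis = celeb_basis / celeb_basis.norm(dim=-1, keepdim=True)

# The only trainable parameters: coefficients over the basis.
coeffs = torch.nn.Parameter(torch.zeros(num_basis))
optimizer = torch.optim.Adam([coeffs], lr=5e-3)

def identity_embedding() -> torch.Tensor:
    """Personalized token embedding = weighted sum of the fixed basis."""
    return coeffs @ celeb_basis  # (embed_dim,)

# Dummy target standing in for the diffusion reconstruction objective on the
# user's facial photo (the real loss back-propagates through the frozen
# text encoder and denoising U-Net).
target = torch.randn(embed_dim)
for _ in range(100):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(identity_embedding(), target)
    loss.backward()
    optimizer.step()
print(float(loss))
```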
2306.00905 Report T2IAT: Measuring Valence and Stereotypical Biases in Text-to-Image Generation Jialu Wang, Xinyue Gabby Liu, Zonglin Di, Yang Liu, Xin Eric Wang Warning: This paper contains several contents that may be toxic, harmful, or offensive. In the last few years, text-to-image generative models have gained remarkable success in generating images with unprecedented quality accompanied by a breakthrough of inference speed. Despite their rapid progress, human biases that manifest in the training examples, particularly with regard to common stereotypical biases, like gender and skin tone, still have been found in these generative models. In this work, we seek to measure more complex human biases exist in the task of text-to-image generations. Inspired by the well-known Implicit Association Test (IAT) from social psychology, we propose a novel Text-to-Image Association Test (T2IAT) framework that quantifies the implicit stereotypes between concepts and valence, and those in the images. We replicate the previously documented bias tests on generative models, including morally neutral tests on flowers and insects as well as demographic stereotypical tests on diverse social attributes. The results of these experiments demonstrate the presence of complex stereotypical behaviors in image generations. The paper proposes Text-to-Image Association Test (T2IAT), a novel framework to quantify implicit stereotypes in text-to-image generation models, going beyond simple demographic biases. Text-to-image models, trained on massive datasets, can perpetuate harmful stereotypes. Existing bias detection methods are limited in capturing nuanced associations between visual concepts and attributes. T2IAT adapts the Implicit Association Test from social psychology. It measures the distance between images generated with neutral prompts and those generated with attribute-guided prompts (e.g., gender, valence). Statistical tests determine the significance of observed biases. Generative models exhibit human-like biases even for non-demographic concepts (e.g., flowers are associated with pleasantness, insects with unpleasantness). Significant biases were found in areas like race, sexuality, and gender roles, aligning with documented societal biases. The model amplifies implicit stereotypes present in textual prompts, exacerbating existing biases in generated images. The verbal stimuli used, while aligned with prior IAT tests, might not fully represent all nuances of a concept. The image encoder used to measure distance between images might introduce its own biases. bias detection, text-to-image generation, implicit association test, stereotype amplification, ai ethics
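An illustrative sketch of an IAT-style differential association score over image embeddings, the kind of statistic the framework builds on: it compares how strongly images generated for concept X vs. concept Y associate with two attribute (valence) image sets. The exact statistic, encoders, and prompts used by T2IAT may differ; everything below is a toy with random embeddings.

```python
import numpy as np

def mean_cosine(a, b):
    """Mean cosine similarity between two sets of embeddings (rows)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a @ b.T).mean()

def association_effect(x_imgs, y_imgs, attr_a, attr_b):
    """
    IAT-style differential association: how much more concept X is associated
    with attribute A (vs. B) than concept Y is, normalized by the pooled
    spread. This mirrors the classic IAT effect size; T2IAT's statistic may differ.
    """
    def s(w):
        return mean_cosine(w[None, :], attr_a) - mean_cosine(w[None, :], attr_b)
    sx = np.array([s(w) for w in x_imgs])
    sy = np.array([s(w) for w in y_imgs])
    pooled_std = np.concatenate([sx, sy]).std()
    return (sx.mean() - sy.mean()) / (pooled_std + 1e-8)

# Toy example with random "image embeddings" (e.g., from a CLIP-like encoder):
rng = np.random.default_rng(0)
flowers    = rng.normal(size=(32, 512))  # images generated for concept X
insects    = rng.normal(size=(32, 512))  # images generated for concept Y
pleasant   = rng.normal(size=(16, 512))  # attribute-guided images (valence A)
unpleasant = rng.normal(size=(16, 512))  # attribute-guided images (valence B)
print(association_effect(flowers, insects, pleasant, unpleasant))
```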
2306.00783 Report FaceDNeRF: Semantics-Driven Face Reconstruction, Prompt Editing and Relighting with Diffusion Models Hao Zhang, Yanbo Xu, Tianyuan Dai, Yu-Wing Tai, Chi-Keung Tang The ability to create high-quality 3D faces from a single image has become increasingly important with wide applications in video conferencing, AR/VR, and advanced video editing in movie industries. In this paper, we propose Face Diffusion NeRF (FaceDNeRF), a new generative method to reconstruct high-quality Face NeRFs from single images, complete with semantic editing and relighting capabilities. FaceDNeRF utilizes high-resolution 3D GAN inversion and expertly trained 2D latent-diffusion model, allowing users to manipulate and construct Face NeRFs in zero-shot learning without the need for explicit 3D data. With carefully designed illumination and identity preserving loss, as well as multi-modal pre-training, FaceDNeRF offers users unparalleled control over the editing process enabling them to create and edit face NeRFs using just single-view images, text prompts, and explicit target lighting. The advanced features of FaceDNeRF have been designed to produce more impressive results than existing 2D editing approaches that rely on 2D segmentation maps for editable attributes. Experiments show that our FaceDNeRF achieves exceptionally realistic results and unprecedented flexibility in editing compared with state-of-the-art 3D face reconstruction and editing methods. Our code will be available at https://github.com/BillyXYB/FaceDNeRF. Proposes FaceDNeRF, a novel method for reconstructing high-quality 3D face NeRFs from single images, enabling semantic editing and relighting. Addresses limitations of existing 3D face generation and editing methods, which lack photorealism, flexibility, and ease of control. Leverages a pre-trained EG3D generator and a stable diffusion model. Optimizes latent codes in EG3D's latent space via a combination of reconstruction, identity, diffusion, and illumination losses. Achieves high-fidelity 3D face reconstruction and editing from single images using text prompts. Enables explicit and view-consistent control over illumination. Demonstrates generalizability across different data domains (faces, cats, cars) and backbone architectures (EG3D, PanoHead). Performance is limited by the capabilities of the chosen GAN and diffusion models. Generation from rare-sampled latent codes can produce unrealistic results. 3d face reconstruction, nerf, semantic editing, relighting, diffusion models
2306.00738 Report ReFACT: Updating Text-to-Image Models by Editing the Text Encoder Dana Arad, Hadas Orgad, Yonatan Belinkov Our world is marked by unprecedented technological, global, and socio-political transformations, posing a significant challenge to text-to-image generative models. These models encode factual associations within their parameters that can quickly become outdated, diminishing their utility for end-users. To that end, we introduce ReFACT, a novel approach for editing factual associations in text-to-image models without relaying on explicit input from end-users or costly re-training. ReFACT updates the weights of a specific layer in the text encoder, modifying only a tiny portion of the model's parameters and leaving the rest of the model unaffected. We empirically evaluate ReFACT on an existing benchmark, alongside a newly curated dataset. Compared to other methods, ReFACT achieves superior performance in both generalization to related concepts and preservation of unrelated concepts. Furthermore, ReFACT maintains image generation quality, making it a practical tool for updating and correcting factual information in text-to-image models. This paper introduces ReFACT, a novel method for revising factual knowledge in text-to-image models without retraining or explicit user input. Text-to-image models can encode outdated or incorrect factual associations, limiting their utility. ReFACT provides an efficient way to update these models and keep them current. ReFACT modifies weights in the text encoder's MLP layer, viewing it as a key-value store. It optimizes a vector to align the representation of an edit prompt with a target prompt while contrasting against negative examples. ReFACT effectively edits various factual associations, including implicit model assumptions and object appearances. It outperforms previous methods in efficacy, generalization to related concepts, and specificity, minimizing impact on unrelated concepts. ReFACT maintains the model's image generation quality, as demonstrated by comparable FID and CLIP scores to the unedited model. ReFACT is slower than the compared editing method (TIME), requiring an optimization process. The method exhibits limitations in editing facial features and occasional specificity failures, prompting further investigation into knowledge encoding and layer-specific editing effects. text-to-image generation, knowledge editing, model updating, factual consistency, clip
2306.00693 Report GPT4Image: Can Large Pre-trained Models Help Vision Models on Perception Tasks? Ning Ding, Yehui Tang, Zhongqian Fu, Chao Xu, Kai Han, Yunhe Wang The recent upsurge in pre-trained large models (e.g. GPT-4) has swept across the entire deep learning community. Such powerful large language models (LLMs) demonstrate advanced generative ability and multimodal understanding capability, which quickly achieve new state-of-the-art performances on a variety of benchmarks. The pre-trained LLM usually plays the role as a universal AI model that can conduct various tasks, including context reasoning, article analysis and image content comprehension. However, considering the prohibitively high memory and computational cost for implementing such a large model, the conventional models (such as CNN and ViT), are still essential for many visual perception tasks. In this paper, we propose to enhance the representation ability of ordinary vision models for perception tasks (e.g. image classification) by taking advantage of large pre-trained models. We present a new learning paradigm in which the knowledge extracted from large pre-trained models are utilized to help models like CNN and ViT learn enhanced representations and achieve better performance. Firstly, we curate a high quality description set by prompting a multimodal LLM to generate descriptive text for all training images. Furthermore, we feed these detailed descriptions into a pre-trained encoder to extract text embeddings with rich semantic information that encodes the content of images. During training, text embeddings will serve as extra supervising signals and be aligned with image representations learned by vision models. The alignment process helps vision models learn better and achieve higher accuracy with the assistance of pre-trained LLMs. We conduct extensive experiments to verify that the proposed algorithm consistently improves the performance for various vision models with heterogeneous architectures. This paper proposes GPT4Image, a novel supervised learning framework where conventional vision models learn enhanced representations by leveraging the knowledge and multimodal capabilities of large pre-trained models (LLMs) for improved performance in perception tasks like image classification. This approach allows smaller companies with limited resources to benefit from the power of LLMs without the need for the high computational cost of training and deploying these models themselves. The method involves curating a text description set for training images using a pre-trained multimodal LLM. Embeddings of these descriptions are then extracted with a text encoder and aligned with image representations learned by the vision models through a distance loss minimization process. GPT4Image consistently improved performance across various vision models (ResNet, ViT, ConvNeXt) on CIFAR and ImageNet-1K benchmarks. The framework utilizes cross-modality knowledge from LLMs as a supervisory signal to enhance the training of vision models. Short image descriptions focusing on salient objects proved more effective than long descriptions. The reliance on pre-generated descriptions limits flexibility during training. The effectiveness depends on the quality and relevance of the LLM generated descriptions. image classification, large language models, multimodal learning, representation learning, supervised learning
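A minimal sketch of the training signal described above: an ordinary vision model is trained with cross-entropy plus an extra alignment term pulling its (projected) image features toward pre-computed embeddings of LLM-generated image descriptions. The stand-in backbone, the cosine-distance choice, and the loss weight are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionWithAlignment(nn.Module):
    """Ordinary vision backbone + a projection head used only for alignment."""
    def __init__(self, backbone_dim=2048, text_dim=768, num_classes=1000):
        super().__init__()
        # Stand-in backbone; in practice this would be a ResNet/ViT/ConvNeXt.
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, backbone_dim))
        self.classifier = nn.Linear(backbone_dim, num_classes)
        self.proj = nn.Linear(backbone_dim, text_dim)  # maps image features to text space

    def forward(self, images):
        feats = self.backbone(images)
        return self.classifier(feats), self.proj(feats)

def training_loss(logits, labels, img_proj, text_emb, alpha=0.5):
    """Cross-entropy + cosine-distance alignment to the description embedding."""
    ce = F.cross_entropy(logits, labels)
    align = 1.0 - F.cosine_similarity(img_proj, text_emb, dim=-1).mean()
    return ce + alpha * align

# Toy batch: text_emb would come from encoding LLM-generated descriptions offline.
model = VisionWithAlignment()
images = torch.randn(8, 3, 32, 32)
labels = torch.randint(0, 1000, (8,))
text_emb = torch.randn(8, 768)
logits, img_proj = model(images)
loss = training_loss(logits, labels, img_proj, text_emb)
loss.backward()
```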
2306.00637 Report Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models Pablo Pernias, Dominic Rampas, Mats L. Richter, Christopher J. Pal, Marc Aubreville We introduce W\"urstchen, a novel architecture for text-to-image synthesis that combines competitive performance with unprecedented cost-effectiveness for large-scale text-to-image diffusion models. A key contribution of our work is to develop a latent diffusion technique in which we learn a detailed but extremely compact semantic image representation used to guide the diffusion process. This highly compressed representation of an image provides much more detailed guidance compared to latent representations of language and this significantly reduces the computational requirements to achieve state-of-the-art results. Our approach also improves the quality of text-conditioned image generation based on our user preference study. The training requirements of our approach consists of 24,602 A100-GPU hours - compared to Stable Diffusion 2.1's 200,000 GPU hours. Our approach also requires less training data to achieve these results. Furthermore, our compact latent representations allows us to perform inference over twice as fast, slashing the usual costs and carbon footprint of a state-of-the-art (SOTA) diffusion model significantly, without compromising the end performance. In a broader comparison against SOTA models our approach is substantially more efficient and compares favorably in terms of image quality. We believe that this work motivates more emphasis on the prioritization of both performance and computational accessibility. This paper introduces "Würstchen", a novel three-stage text-to-image synthesis architecture that achieves competitive performance with significantly reduced computational cost compared to existing large-scale diffusion models. State-of-the-art text-to-image models, while impressive, are computationally demanding and expensive to train. Würstchen addresses this limitation by achieving high-quality image synthesis with a fraction of the computational resources. The method employs a three-stage architecture: (1) a VQGAN compresses images into a latent space. (2) a latent diffusion model (LDM) operates on this compressed space, guided by a "Semantic Compressor" that provides highly compressed semantic image representations. (3) A final text-conditional LDM generates images in the compressed latent space, guided by text embeddings. Würstchen achieves a comparable performance to Stable Diffusion 2.1, while requiring 8x less training compute. Human evaluation and PickScore metrics show that Würstchen consistently outperforms existing models of similar computational cost and even surpasses some larger models in image quality. The architecture allows for fast inference, significantly reducing the cost and carbon footprint associated with large-scale image generation. While computationally efficient, Würstchen's FID score, although exceeding some larger models, is lower compared to other state-of-the-art models. This is attributed to smoother image features compared to other models. The paper acknowledges the potential for further optimization, such as removing text conditioning from Stage B in future iterations. text-to-image synthesis, latent diffusion models, vqgan, efficient ai, computational efficiency
2306.00547 Report AvatarStudio: Text-driven Editing of 3D Dynamic Human Head Avatars Mohit Mendiratta, Xingang Pan, Mohamed Elgharib, Kartik Teotia, Mallikarjun B R, Ayush Tewari, Vladislav Golyanik, Adam Kortylewski, Christian Theobalt Capturing and editing full head performances enables the creation of virtual characters with various applications such as extended reality and media production. The past few years witnessed a steep rise in the photorealism of human head avatars. Such avatars can be controlled through different input data modalities, including RGB, audio, depth, IMUs and others. While these data modalities provide effective means of control, they mostly focus on editing the head movements such as the facial expressions, head pose and/or camera viewpoint. In this paper, we propose AvatarStudio, a text-based method for editing the appearance of a dynamic full head avatar. Our approach builds on existing work to capture dynamic performances of human heads using neural radiance field (NeRF) and edits this representation with a text-to-image diffusion model. Specifically, we introduce an optimization strategy for incorporating multiple keyframes representing different camera viewpoints and time stamps of a video performance into a single diffusion model. Using this personalized diffusion model, we edit the dynamic NeRF by introducing view-and-time-aware Score Distillation Sampling (VT-SDS) following a model-based guidance approach. Our method edits the full head in a canonical space, and then propagates these edits to remaining time steps via a pretrained deformation network. We evaluate our method visually and numerically via a user study, and results show that our method outperforms existing approaches. Our experiments validate the design choices of our method and highlight that our edits are genuine, personalized, as well as 3D- and time-consistent. AvatarStudio is the first text-driven method for editing the appearance of dynamic 3D human head avatars represented as dynamic NeRFs, enabling a wide range of personalized, 3D- and time-consistent edits. Existing methods for editing digital faces mainly focus on motion (e.g., facial expressions, head pose), while appearance editing is limited to relighting or non-photorealistic edits. Text-driven editing offers a user-friendly way to control and personalize dynamic avatars. The method fine-tunes a pre-trained text-to-image diffusion model on multiple keyframes from a multi-view video, capturing the identity from various viewpoints and time stamps. It then introduces a view- and time-aware Score Distillation Sampling (VT-SDS) approach to edit the dynamic NeRF based on the target text prompt while preserving identity and coherence. Generates a diverse range of photorealistic and non-photorealistic text-based edits, including changes to appearance and geometry. Maintains the integrity of the input identity while adhering to the text prompt. Produces 3D-consistent edits viewable from arbitrary camera angles and ensures temporal coherence for smooth video editing. Requires multi-view data captured in uniform illumination, limiting its application to controlled environments. Computationally expensive, taking about 60 minutes to train on a single A100 GPU. text-driven editing, neural rendering, 3d dynamic human head avatar, diffusion model, nerf
2306.00450 Report Exploring Open-Vocabulary Semantic Segmentation without Human Labels Jun Chen, Deyao Zhu, Guocheng Qian, Bernard Ghanem, Zhicheng Yan, Chenchen Zhu, Fanyi Xiao, Mohamed Elhoseiny, Sean Chang Culatana Semantic segmentation is a crucial task in computer vision that involves segmenting images into semantically meaningful regions at the pixel level. However, existing approaches often rely on expensive human annotations as supervision for model training, limiting their scalability to large, unlabeled datasets. To address this challenge, we present ZeroSeg, a novel method that leverages the existing pretrained vision-language (VL) model (e.g. CLIP) to train open-vocabulary zero-shot semantic segmentation models. Although acquired extensive knowledge of visual concepts, it is non-trivial to exploit knowledge from these VL models to the task of semantic segmentation, as they are usually trained at an image level. ZeroSeg overcomes this by distilling the visual concepts learned by VL models into a set of segment tokens, each summarizing a localized region of the target image. We evaluate ZeroSeg on multiple popular segmentation benchmarks, including PASCAL VOC 2012, PASCAL Context, and COCO, in a zero-shot manner (i.e., no training or adaption on target segmentation datasets). Our approach achieves state-of-the-art performance when compared to other zero-shot segmentation methods under the same training data, while also performing competitively compared to strongly supervised methods. Finally, we also demonstrated the effectiveness of ZeroSeg on open-vocabulary segmentation, through both human studies and qualitative visualizations. Introduces ZeroSeg, a model for open-vocabulary zero-shot semantic segmentation that eliminates the need for human annotations by distilling knowledge from pre-trained vision-language models. Addresses the limitations of traditional supervised methods, which are expensive, time-consuming, and struggle to generalize to new visual concepts. This enables more flexible and efficient semantic segmentation learning. Utilizes a masked encoder-decoder architecture and a segmentation head that groups pixels into semantically meaningful segments. Employs multi-scale image feature distillation and a segment matching loss to transfer knowledge from a pre-trained CLIP visual encoder without relying on text annotations. Achieves competitive performance compared to supervised and weakly-supervised methods on PASCAL VOC 2012, PASCAL Context, and COCO datasets despite not using any segmentation labels during training. Outperforms existing zero-shot segmentation methods, even those trained on significantly larger datasets. Demonstrates superior performance in open-vocabulary segmentation tasks, as evidenced by human studies and qualitative visualizations. Potential biases present in the pre-trained vision-language models may perpetuate in ZeroSeg, requiring mitigation strategies. Limited exploration of performance scaling with even larger datasets due to the inaccessibility of certain datasets like YFCC100M. semantic segmentation, zero-shot learning, open-vocabulary, vision-language models, knowledge distillation
2306.00354 Report Addressing Negative Transfer in Diffusion Models Hyojun Go, JinYoung Kim, Yunsung Lee, Seunghyun Lee, Shinhyeok Oh, Hyeongdon Moon, Seungtaek Choi Diffusion-based generative models have achieved remarkable success in various domains. It trains a shared model on denoising tasks that encompass different noise levels simultaneously, representing a form of multi-task learning (MTL). However, analyzing and improving diffusion models from an MTL perspective remains under-explored. In particular, MTL can sometimes lead to the well-known phenomenon of negative transfer, which results in the performance degradation of certain tasks due to conflicts between tasks. In this paper, we first aim to analyze diffusion training from an MTL standpoint, presenting two key observations: (O1) the task affinity between denoising tasks diminishes as the gap between noise levels widens, and (O2) negative transfer can arise even in diffusion training. Building upon these observations, we aim to enhance diffusion training by mitigating negative transfer. To achieve this, we propose leveraging existing MTL methods, but the presence of a huge number of denoising tasks makes this computationally expensive to calculate the necessary per-task loss or gradient. To address this challenge, we propose clustering the denoising tasks into small task clusters and applying MTL methods to them. Specifically, based on (O2), we employ interval clustering to enforce temporal proximity among denoising tasks within clusters. We show that interval clustering can be solved using dynamic programming, utilizing signal-to-noise ratio, timestep, and task affinity for clustering objectives. Through this, our approach addresses the issue of negative transfer in diffusion models by allowing for efficient computation of MTL methods. We validate the efficacy of proposed clustering and its integration with MTL methods through various experiments, demonstrating 1) improved generation quality and 2) faster training convergence of diffusion models. This paper investigates the presence of negative transfer in diffusion model training, where learning denoising tasks at different noise levels can negatively impact each other, and proposes a strategy using interval clustering and multi-task learning methods to mitigate it. Diffusion models, while successful, are inherently multi-task learners, and understanding and addressing the potential negative transfer between denoising tasks can lead to significant improvements in generation quality and training efficiency. The paper analyzes task affinity between different denoising tasks and observes negative transfer by comparing models trained on specific timestep intervals to models trained on all tasks. It then proposes to cluster denoising tasks into intervals based on timesteps, SNR, or task affinity, and applies MTL methods (PCGrad, NashMTL, Uncertainty Weighting) to these task clusters to reduce negative transfer. Incorporating MTL methods with interval clustering significantly improves image generation quality (FID, precision) compared to vanilla diffusion training across different datasets (FFHQ, CelebA-HQ, ImageNet) and architectures (ADM, LDM, DiT). The proposed method achieves faster convergence compared to vanilla training. Uncertainty Weighting (UW) generally achieves better sample quality, while NashMTL shows better distribution coverage, and PCGrad presents a balanced performance. Negative transfer is not completely resolved, suggesting room for further improvement by enabling the model to learn entire denoising tasks more harmoniously. The study does not explore architectural designs specific to MTL for diffusion models, which could be a promising direction for future work. diffusion models, multi-task learning, negative transfer, interval clustering, image generation
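A compact sketch of interval clustering of denoising timesteps by dynamic programming, one of the clustering objectives named above: contiguous timestep intervals are chosen to minimize within-interval variance of the log signal-to-noise ratio. The cost function, the linear beta schedule, and the O(k·T²) DP are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def interval_clustering(values, k):
    """
    Partition the 1-D sequence `values` (indexed by timestep) into k contiguous
    intervals minimizing the total within-interval sum of squared deviations,
    via dynamic programming. Returns the k+1 interval boundaries.
    """
    n = len(values)
    prefix = np.concatenate([[0.0], np.cumsum(values)])
    prefix_sq = np.concatenate([[0.0], np.cumsum(values ** 2)])

    def cost(i, j):  # within-interval sum of squared deviations for values[i:j]
        s, s2, m = prefix[j] - prefix[i], prefix_sq[j] - prefix_sq[i], j - i
        return s2 - s * s / m

    dp = np.full((k + 1, n + 1), np.inf)
    arg = np.zeros((k + 1, n + 1), dtype=int)
    dp[0, 0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                val = dp[c - 1, i] + cost(i, j)
                if val < dp[c, j]:
                    dp[c, j], arg[c, j] = val, i
    bounds, j = [n], n           # backtrack interval boundaries
    for c in range(k, 0, -1):
        j = arg[c, j]
        bounds.append(j)
    return list(reversed(bounds))

# Example: cluster T=1000 denoising tasks into 5 intervals by the log-SNR of a
# linear beta schedule (the schedule choice is illustrative, not the paper's).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)
log_snr = np.log(alphas_bar / (1.0 - alphas_bar))
print(interval_clustering(log_snr, k=5))
```

Each resulting interval can then be treated as one "task" when computing the per-task losses or gradients required by PCGrad, NashMTL, or Uncertainty Weighting.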
2306.00241 Report Balancing Reconstruction and Editing Quality of GAN Inversion for Real Image Editing with StyleGAN Prior Latent Space Kai Katsumata, Duc Minh Vo, Bei Liu, Hideki Nakayama The exploration of the latent space in StyleGANs and GAN inversion exemplify impressive real-world image editing, yet the trade-off between reconstruction quality and editing quality remains an open problem. In this study, we revisit StyleGANs' hyperspherical prior $\mathcal{Z}$ and $\mathcal{Z}^+$ and integrate them into seminal GAN inversion methods to improve editing quality. Besides faithful reconstruction, our extensions achieve sophisticated editing quality with the aid of the StyleGAN prior. We project the real images into the proposed space to obtain the inverted codes, by which we then move along $\mathcal{Z}^{+}$, enabling semantic editing without sacrificing image quality. Comprehensive experiments show that $\mathcal{Z}^{+}$ can replace the most commonly-used $\mathcal{W}$, $\mathcal{W}^{+}$, and $\mathcal{S}$ spaces while preserving reconstruction quality, resulting in reduced distortion of edited images. This paper revisits the use of StyleGAN's hyperspherical prior spaces, $\mathcal{Z}$ and $\mathcal{Z}^+$, for GAN inversion to enhance editing quality without sacrificing reconstruction quality. Existing GAN inversion methods struggle to balance high-fidelity reconstruction with the ability to perform semantic image edits without introducing artifacts. This work addresses this trade-off by leveraging the desirable properties of $\mathcal{Z}$ and $\mathcal{Z}^+$. The authors integrate $\mathcal{Z}^+$ into established GAN inversion techniques like BDInvert, SAM, and PTI, replacing unbounded latent spaces like $\mathcal{W}^+$ with the bounded $\mathcal{Z}^+$. They introduce the $\mathcal{F}/\mathcal{Z}^+$ space, combining $\mathcal{Z}^+$ with a feature space ($\mathcal{F}$) for improved reconstruction. Optimization retracts latent codes to the hypersphere surface during each iteration. $\mathcal{F}/\mathcal{Z}^+$ achieves reconstruction quality comparable to state-of-the-art methods like $\mathcal{F}/\mathcal{W}^+$ using both qualitative and quantitative metrics (LPIPS, MSE, SSIM). Editing operations in $\mathcal{F}/\mathcal{Z}^+$ preserve image quality and identity significantly better than methods relying on $\mathcal{W}^+$, as shown with GANSpace and InterfaceGAN directions. Integrating $\mathcal{Z}^+$ into other GAN inversion methods like PTI and SAM demonstrates consistent improvement in editing quality without hindering reconstruction. The authors primarily focus on evaluating their approach on face datasets, leaving exploration of other domains for future work. Further investigation into the impact of different editing techniques and their compatibility with $\mathcal{Z}^+$ is warranted. gan inversion, stylegan, image editing, latent space, hyperspherical prior
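A small sketch of the retraction step mentioned above: after every optimizer update during inversion, each per-layer latent code in $\mathcal{Z}^+$ is rescaled back onto the hypersphere on which StyleGAN's Gaussian prior concentrates (radius √d for d-dimensional z). The radius convention, the 18×512 shape, and the MSE stand-in for the image reconstruction loss are assumptions.

```python
import torch

def retract_to_hypersphere(z_plus: torch.Tensor, radius: float | None = None) -> torch.Tensor:
    """
    Rescale each per-layer latent code in a Z+ tensor of shape (num_layers, dim)
    back onto the hypersphere. StyleGAN samples z from a standard Gaussian in
    R^d, whose mass concentrates near radius sqrt(d), so that radius is used by
    default (a common convention; an assumption here).
    """
    dim = z_plus.shape[-1]
    if radius is None:
        radius = dim ** 0.5
    return radius * z_plus / z_plus.norm(dim=-1, keepdim=True)

# Toy inversion loop: optimize z_plus against some reconstruction loss,
# retracting to the sphere after every step (the generator is omitted).
z_plus = torch.nn.Parameter(torch.randn(18, 512))
optimizer = torch.optim.Adam([z_plus], lr=0.01)
target = torch.randn(18, 512)
for _ in range(50):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(z_plus, target)  # stand-in for LPIPS/MSE on images
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        z_plus.copy_(retract_to_hypersphere(z_plus))
print(float(z_plus.norm(dim=-1)[0]))  # ~= sqrt(512)
```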
2306.00219 Report Diffusion Brush: A Latent Diffusion Model-based Editing Tool for AI-generated Images Peyman Gholami, Robert Xiao Text-to-image generative models have made remarkable advancements in generating high-quality images. However, generated images often contain undesirable artifacts or other errors due to model limitations. Existing techniques to fine-tune generated images are time-consuming (manual editing), produce poorly-integrated results (inpainting), or result in unexpected changes across the entire image (variation selection and prompt fine-tuning). In this work, we present Diffusion Brush, a Latent Diffusion Model-based (LDM) tool to efficiently fine-tune desired regions within an AI-synthesized image. Our method introduces new random noise patterns at targeted regions during the reverse diffusion process, enabling the model to efficiently make changes to the specified regions while preserving the original context for the rest of the image. We evaluate our method's usability and effectiveness through a user study with artists, comparing our technique against other state-of-the-art image inpainting techniques and editing software for fine-tuning AI-generated imagery. This paper presents Diffusion Brush, a Latent Diffusion Model-based (LDM) tool for efficiently fine-tuning user-specified regions within AI-generated images. Generated images often contain undesirable artifacts, and existing remedies are either time-consuming (manual editing), poorly integrated (inpainting), or cause unexpected changes across the entire image (variation selection and prompt fine-tuning). The method introduces new random noise patterns at targeted regions during the reverse diffusion process, letting the model regenerate the specified regions while preserving the original context for the rest of the image. Usability and effectiveness are evaluated through a user study with artists. The technique is compared against state-of-the-art image inpainting methods and editing software for fine-tuning AI-generated imagery. Edits remain localized, leaving the rest of the image unchanged. image editing, latent diffusion models, ai-generated images, inpainting, user study
2306.00180 Report FlowCam: Training Generalizable 3D Radiance Fields without Camera Poses via Pixel-Aligned Scene Flow Cameron Smith, Yilun Du, Ayush Tewari, Vincent Sitzmann Reconstruction of 3D neural fields from posed images has emerged as a promising method for self-supervised representation learning. The key challenge preventing the deployment of these 3D scene learners on large-scale video data is their dependence on precise camera poses from structure-from-motion, which is prohibitively expensive to run at scale. We propose a method that jointly reconstructs camera poses and 3D neural scene representations online and in a single forward pass. We estimate poses by first lifting frame-to-frame optical flow to 3D scene flow via differentiable rendering, preserving locality and shift-equivariance of the image processing backbone. SE(3) camera pose estimation is then performed via a weighted least-squares fit to the scene flow field. This formulation enables us to jointly supervise pose estimation and a generalizable neural scene representation via re-rendering the input video, and thus, train end-to-end and fully self-supervised on real-world video datasets. We demonstrate that our method performs robustly on diverse, real-world video, notably on sequences traditionally challenging to optimization-based pose estimation techniques. Presents FlowCam, a method for jointly training a feed-forward generalizable 3D neural scene representation and camera trajectory estimation, self-supervised by re-rendering losses on video frames, without ground-truth camera poses or depth maps. Unlocks orders of magnitude more training data for 3D scene learners by removing dependence on expensive structure-from-motion for camera pose estimation, paving the way for large-scale 3D representation learning. Leverages single-image neural scene representations and differentiable rendering to lift frame-to-frame optical flow to 3D scene flow. Estimates SE(3) camera poses via a robust, weighted least-squares solver on the scene flow field. Jointly supervises pose estimation and neural scene representation via re-rendering the input video with RGB and flow losses. Outperforms state-of-the-art unposed methods on novel view synthesis. Demonstrates robust pose estimation on sequences challenging for conventional SLAM approaches (e.g., ORB-SLAM3). Generalizes to out-of-distribution scenes and supports fine-tuning for improved accuracy. As an odometry method, it suffers from drift and lacks loop closure. Currently does not model scene dynamics. 3d scene representation, camera pose estimation, self-supervised learning, differentiable rendering, neural radiance fields
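A hedged sketch of the kind of closed-form pose solve the paper performs on its scene-flow field: a weighted least-squares rigid fit (weighted Kabsch) between corresponding 3D points before and after the flow. The actual method derives correspondences and weights from pixel-aligned differentiable rendering; here they are synthetic, and the formulation may differ in detail.

```python
import numpy as np

def weighted_rigid_fit(src, dst, weights):
    """
    Solve for R, t minimizing sum_i w_i * ||R @ src_i + t - dst_i||^2 via the
    weighted Kabsch algorithm (SVD of the weighted cross-covariance).
    src, dst: (N, 3) corresponding 3D points; weights: (N,) confidences.
    """
    w = weights / weights.sum()
    mu_src = (w[:, None] * src).sum(axis=0)
    mu_dst = (w[:, None] * dst).sum(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = (w[:, None] * src_c).T @ dst_c
    U, _, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_dst - R @ mu_src
    return R, t

# Toy check: recover a known rotation/translation from noiseless correspondences.
rng = np.random.default_rng(0)
pts = rng.normal(size=(200, 3))                      # surface points from frame t
angle = 0.1
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.05, -0.02, 0.1])
flowed = pts @ R_true.T + t_true                     # points displaced by scene flow
R, t = weighted_rigid_fit(pts, flowed, np.ones(len(pts)))
print(np.allclose(R, R_true, atol=1e-6), np.allclose(t, t_true, atol=1e-6))
```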
2305.20091 Report Humans in 4D: Reconstructing and Tracking Humans with Transformers Shubham Goel, Georgios Pavlakos, Jathushan Rajasegaran, Angjoo Kanazawa, Jitendra Malik We present an approach to reconstruct humans and track them over time. At the core of our approach, we propose a fully "transformerized" version of a network for human mesh recovery. This network, HMR 2.0, advances the state of the art and shows the capability to analyze unusual poses that have in the past been difficult to reconstruct from single images. To analyze video, we use 3D reconstructions from HMR 2.0 as input to a tracking system that operates in 3D. This enables us to deal with multiple people and maintain identities through occlusion events. Our complete approach, 4DHumans, achieves state-of-the-art results for tracking people from monocular video. Furthermore, we demonstrate the effectiveness of HMR 2.0 on the downstream task of action recognition, achieving significant improvements over previous pose-based action recognition approaches. Our code and models are available on the project website: https://shubham-goel.github.io/4dhumans/. This paper presents a novel approach, 4DHumans, for reconstructing and tracking humans in videos using a fully transformer-based architecture, HMR 2.0, for 3D human mesh recovery. This work pushes the limits of analyzable videos with 3D human reconstruction techniques and achieves state-of-the-art results for human mesh recovery and tracking. The authors introduce HMR 2.0, a transformer-based network for reconstructing 3D human meshes from single images, and integrate it into a modified PHALP tracking system that operates in 3D. They train their models on a combination of datasets, leveraging pseudo-ground truth annotations for unlabeled data. HMR 2.0 surpasses previous methods in 2D and 3D pose accuracy metrics, particularly for unusual and challenging poses. 4DHumans achieves state-of-the-art tracking performance on the PoseTrack dataset, showing robustness to occlusions and complex scenes. The accuracy of HMR 2.0's 3D pose estimations translates to superior performance on the downstream task of action recognition on the AVA dataset. The reliance on the SMPL model limits the system's ability to capture finer details such as hand poses, facial expressions, and variations in age and body shape. Reconstructions are performed in the camera frame, neglecting a common world coordinate frame which is essential for comprehensive action understanding in videos. Future work could address camera motion and multi-person interactions. human mesh recovery, 3d human pose estimation, tracking, transformers, action recognition
2305.20087 Report Too Large; Data Reduction for Vision-Language Pre-Training Alex Jinpeng Wang, Kevin Qinghong Lin, David Junhao Zhang, Stan Weixian Lei, Mike Zheng Shou This paper examines the problems of severe image-text misalignment and high redundancy in the widely-used large-scale Vision-Language Pre-Training (VLP) datasets. To address these issues, we propose an efficient and straightforward Vision-Language learning algorithm called TL;DR, which aims to compress the existing large VLP data into a small, high-quality set. Our approach consists of two major steps. First, a codebook-based encoder-decoder captioner is developed to select representative samples. Second, a new caption is generated to complement the original captions for selected samples, mitigating the text-image misalignment problem while maintaining uniqueness. As the result, TL;DR enables us to reduce the large dataset into a small set of high-quality data, which can serve as an alternative pre-training dataset. This algorithm significantly speeds up the time-consuming pretraining process. Specifically, TL;DR can compress the mainstream VLP datasets at a high ratio, e.g., reduce well-cleaned CC3M dataset from 2.82M to 0.67M ($\sim$24\%) and noisy YFCC15M from 15M to 2.5M ($\sim$16.7\%). Extensive experiments with three popular VLP models over seven downstream tasks show that VLP model trained on the compressed dataset provided by TL;DR can perform similar or even better results compared with training on the full-scale dataset. The code will be made available at \url{https://github.com/showlab/datacentric.vlp}. This paper introduces TL;DR, a novel algorithm designed to compress large-scale, noisy Vision-Language Pre-training (VLP) datasets into smaller, high-quality datasets. Large VLP datasets are computationally expensive to train and often contain significant image-text misalignment and redundancy. TL;DR addresses these issues by creating smaller, more efficient datasets without sacrificing performance. TL;DR uses a two-stage approach: 1) Training a codebook-based captioner to cluster and select representative image-text pairs, and 2) Refining the captions of selected samples to reduce misalignment. Training VLP models on TL;DR-compressed datasets (10%-25% of the original size) achieves comparable or better performance than training on the full datasets. The codebook-based clustering effectively groups semantically similar image-text pairs. TL;DR successfully reduces image-text misalignment, as evidenced by improved Image-Text Matching (ITM) scores. The current implementation relies on manually selecting the compression ratio. Further exploration is needed to achieve even higher compression ratios, potentially leveraging text-to-image generation models. vision-language pre-training, data reduction, dataset compression, image-text misalignment, codebook learning
2305.20082 Report Control4D: Efficient 4D Portrait Editing with Text Ruizhi Shao, Jingxiang Sun, Cheng Peng, Zerong Zheng, Boyao Zhou, Hongwen Zhang, Yebin Liu We introduce Control4D, an innovative framework for editing dynamic 4D portraits using text instructions. Our method addresses the prevalent challenges in 4D editing, notably the inefficiencies of existing 4D representations and the inconsistent editing effect caused by diffusion-based editors. We first propose GaussianPlanes, a novel 4D representation that makes Gaussian Splatting more structured by applying plane-based decomposition in 3D space and time. This enhances both efficiency and robustness in 4D editing. Furthermore, we propose to leverage a 4D generator to learn a more continuous generation space from inconsistent edited images produced by the diffusion-based editor, which effectively improves the consistency and quality of 4D editing. Comprehensive evaluation demonstrates the superiority of Control4D, including significantly reduced training time, high-quality rendering, and spatial-temporal consistency in 4D portrait editing. The link to our project website is https://control4darxiv.github.io. Control4D is a novel framework for efficient, high-quality, and temporally consistent editing of dynamic 4D portraits using text instructions. Existing 4D editing techniques lack interactivity and struggle with spatial-temporal consistency and quality. Control4D addresses these limitations by introducing an efficient 4D representation and a novel editing framework. The method utilizes GaussianPlanes, a 4D representation built upon Gaussian Splatting with plane-based decomposition for efficiency and robustness. It integrates a 4D generator with a 2D diffusion-based editor to learn a continuous generation space and mitigate inconsistencies in edited images. Significantly reduced training time compared to previous methods. Achieves high-quality rendering of dynamic portraits with intricate details. Ensures spatiotemporal consistency in editing, resulting in coherent and realistic 4D edits. Challenges in handling rapid and extensive non-rigid movements due to reliance on flow learning. Limited editing granularity due to ControlNet constraints, preventing precise expression or action edits. 4d portrait editing, text-guided editing, gaussian splatting, generative adversarial networks (gans), diffusion models
2305.20049 Report A Unified Conditional Framework for Diffusion-based Image Restoration Yi Zhang, Xiaoyu Shi, Dasong Li, Xiaogang Wang, Jian Wang, Hongsheng Li Diffusion Probabilistic Models (DPMs) have recently shown remarkable performance in image generation tasks, which are capable of generating highly realistic images. When adopting DPMs for image restoration tasks, the crucial aspect lies in how to integrate the conditional information to guide the DPMs to generate accurate and natural output, which has been largely overlooked in existing works. In this paper, we present a unified conditional framework based on diffusion models for image restoration. We leverage a lightweight UNet to predict initial guidance and the diffusion model to learn the residual of the guidance. By carefully designing the basic module and integration module for the diffusion model block, we integrate the guidance and other auxiliary conditional information into every block of the diffusion model to achieve spatially-adaptive generation conditioning. To handle high-resolution images, we propose a simple yet effective inter-step patch-splitting strategy to produce arbitrary-resolution images without grid artifacts. We evaluate our conditional framework on three challenging tasks: extreme low-light denoising, deblurring, and JPEG restoration, demonstrating its significant improvements in perceptual quality and the generalization to restoration tasks. This paper presents a unified conditional framework based on diffusion models for image restoration, focusing on effectively integrating conditional information (e.g., degraded image, noise level) into the diffusion process. Existing image restoration methods using diffusion models often lack effective integration of conditional information, limiting their ability to generate accurate and natural outputs. The framework utilizes a lightweight UNet to predict an initial guidance image and employs a diffusion model to learn the residual details. It introduces an Adaptive Kernel Guidance Block (AKGB) to adaptively fuse conditional information into each diffusion block for spatially-adaptive generation. An inter-step patch-splitting strategy is proposed for high-resolution image generation without grid artifacts. The method achieves state-of-the-art perceptual quality on extreme low-light denoising (SID dataset), outperforming existing regression and diffusion-based methods. It demonstrates superior performance on image deblurring (GoPro dataset), surpassing previous deblurring methods in perceptual metrics. The framework generalizes well to JPEG restoration, showing significant improvements over regression-based methods and previous diffusion-based approaches. The current implementation uses a simple uniform noise schedule for faster sampling, which could be further optimized with advanced sampling techniques. While the method excels in generating realistic textures, it may occasionally produce unnatural details like incorrect characters, requiring further exploration of generation control mechanisms. image restoration, diffusion models, conditional image generation, adaptive kernel guidance, high-resolution image synthesis
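A minimal sketch of an inter-step patch-splitting pass: at a given diffusion step, the full latent is processed as overlapping tiles and the overlapping predictions are averaged back together; doing this inside every step (rather than stitching finished images) is what avoids visible grid seams. Patch size, overlap, and the identity stand-in denoiser are assumptions; the paper's exact blending may differ.

```python
import torch

def denoise_patchwise(x, denoise_fn, patch=64, overlap=16):
    """
    Run a denoiser on overlapping tiles of a full-resolution latent and blend
    the tile outputs by averaging where they overlap.
    x: (B, C, H, W) latent at the current diffusion step.
    """
    _, _, H, W = x.shape
    assert H >= patch and W >= patch, "latent must be at least one patch large"
    out, weight = torch.zeros_like(x), torch.zeros_like(x)
    stride = patch - overlap
    ys = list(range(0, H - patch + 1, stride))
    xs = list(range(0, W - patch + 1, stride))
    if ys[-1] != H - patch:
        ys.append(H - patch)          # make sure the bottom edge is covered
    if xs[-1] != W - patch:
        xs.append(W - patch)          # make sure the right edge is covered
    for y in ys:
        for x0 in xs:
            tile = x[:, :, y:y + patch, x0:x0 + patch]
            out[:, :, y:y + patch, x0:x0 + patch] += denoise_fn(tile)
            weight[:, :, y:y + patch, x0:x0 + patch] += 1.0
    return out / weight

# Toy usage with an identity "denoiser" on a 1x4x96x160 latent.
latent = torch.randn(1, 4, 96, 160)
blended = denoise_patchwise(latent, denoise_fn=lambda t: t)
print(torch.allclose(blended, latent))
```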
2305.19858 Report Enhancing image quality prediction with self-supervised visual masking Uğur Çoğalan, Mojtaba Bemana, Hans-Peter Seidel, Karol Myszkowski Full-reference image quality metrics (FR-IQMs) aim to measure the visual differences between a pair of reference and distorted images, with the goal of accurately predicting human judgments. However, existing FR-IQMs, including traditional ones like PSNR and SSIM and even perceptual ones such as HDR-VDP, LPIPS, and DISTS, still fall short in capturing the complexities and nuances of human perception. In this work, rather than devising a novel IQM model, we seek to improve upon the perceptual quality of existing FR-IQM methods. We achieve this by considering visual masking, an important characteristic of the human visual system that changes its sensitivity to distortions as a function of local image content. Specifically, for a given FR-IQM metric, we propose to predict a visual masking model that modulates reference and distorted images in a way that penalizes the visual errors based on their visibility. Since the ground truth visual masks are difficult to obtain, we demonstrate how they can be derived in a self-supervised manner solely based on mean opinion scores (MOS) collected from an FR-IQM dataset. Our approach results in enhanced FR-IQM metrics that are more in line with human prediction both visually and quantitatively. This paper introduces a self-supervised visual masking approach to enhance the perceptual quality prediction of existing full-reference image quality metrics (FR-IQMs). Existing FR-IQMs, both traditional and learning-based, often fail to accurately capture the complexities of human perception, particularly the phenomenon of visual masking. A lightweight CNN is trained to predict a visual mask for a given reference and distorted image pair. This mask, learned in a self-supervised manner using MOS data, modulates the input images or features to emphasize perceptually important distortions. The proposed method significantly improves the correlation of various classic and learning-based FR-IQMs with human judgments on standard benchmark datasets. The generated error maps better align with human perception of distortion visibility compared to the original metrics. The enhanced metrics show promise as loss functions for image restoration tasks, improving perceptual quality in denoising and deblurring. The effectiveness of the visual masking model is limited to the specific viewing conditions and display setup of the training dataset. Integrating the masking model into complex end-to-end deep learning-based metrics might be challenging. image quality assessment, visual masking, perceptual metrics, deep learning, self-supervised learning
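A minimal sketch of the mechanism described above: a tiny CNN predicts a per-pixel visual mask from the reference/distorted pair, both images are modulated by the mask, and an existing FR-IQM (here a plain L1 error as a stand-in) is applied to the modulated pair. In the paper the mask network is trained self-supervised against MOS; the architecture and base metric below are placeholders.

```python
import torch
import torch.nn as nn

class MaskNet(nn.Module):
    """Tiny stand-in for the lightweight CNN that predicts a per-pixel mask."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),  # mask values in (0, 1)
        )

    def forward(self, ref, dist):
        return self.net(torch.cat([ref, dist], dim=1))

def masked_metric(ref, dist, mask_net, base_metric):
    """Modulate both images with the predicted mask, then apply the base FR-IQM."""
    m = mask_net(ref, dist)
    return base_metric(ref * m, dist * m)

# Toy usage; in training, the mask net would be fit so that the masked metric
# correlates with MOS (the self-supervised objective described above).
mask_net = MaskNet()
ref, dist = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
score = masked_metric(ref, dist, mask_net, base_metric=lambda a, b: (a - b).abs().mean())
score.backward()
```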
2305.19599 Report RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment Guian Fang, Zutao Jiang, Jianhua Han, Guansong Lu, Hang Xu, Shengcai Liao, Xiaodan Liang Recent advances in text-to-image diffusion models have achieved remarkable success in generating high-quality, realistic images from textual descriptions. However, these approaches have faced challenges in precisely aligning the generated visual content with the textual concepts described in the prompts. In this paper, we propose a two-stage coarse-to-fine semantic re-alignment method, named RealignDiff, aimed at improving the alignment between text and images in text-to-image diffusion models. In the coarse semantic re-alignment phase, a novel caption reward, leveraging the BLIP-2 model, is proposed to evaluate the semantic discrepancy between the generated image caption and the given text prompt. Subsequently, the fine semantic re-alignment stage employs a local dense caption generation module and a re-weighting attention modulation module to refine the previously generated images from a local semantic view. Experimental results on the MS-COCO benchmark demonstrate that the proposed two-stage coarse-to-fine semantic re-alignment method outperforms other baseline re-alignment techniques by a substantial margin in both visual quality and semantic similarity with the input prompt. This paper proposes RealignDiff, a two-stage coarse-to-fine semantic re-alignment method for text-to-image diffusion models, aiming to improve the alignment between generated images and input textual descriptions. Existing text-to-image diffusion models often struggle to precisely align the generated visual content with the concepts described in textual prompts, particularly in capturing object attributes and relationships. The method consists of: (1) **Coarse Semantic Re-alignment:** Fine-tuning the diffusion model using a novel caption reward, which leverages the BLIP-2 model to evaluate the semantic similarity between the generated image caption and the input prompt. (2) **Fine Semantic Re-alignment:** Refining the generated image from a local semantic view using a local dense caption generation module and a re-weighting attention modulation module. RealignDiff significantly outperforms other baseline re-alignment techniques on the MS-COCO benchmark in terms of visual quality and semantic similarity with the input prompt. The proposed caption reward proves to be more effective than traditional reward functions like CLIP reward, BLIP reward, and ImageReward. Ablation studies demonstrate that both the coarse and fine semantic re-alignment modules contribute to the improved performance. The fine semantic re-alignment stage may fail if the large language model fails to provide accurate intermediate results, requiring further research to address this limitation. Future work will explore the dynamic learning from multiple reward functions, incorporating both semantic and aesthetic considerations, within the diffusion model. text-to-image generation, diffusion models, semantic alignment, caption reward, attention modulation
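A conceptual sketch of the caption reward described above: caption the generated image, embed both the caption and the original prompt with a text encoder, and use their similarity as the reward for fine-tuning. The captioner and encoder below are dummy placeholders (not the BLIP-2 interfaces), so the snippet only illustrates the plumbing.

```python
import torch
import torch.nn.functional as F

def caption_reward(image, prompt, caption_fn, embed_fn):
    """
    Reward = similarity between the prompt and a caption of the generated
    image. `caption_fn` stands in for a BLIP-2-style captioner and `embed_fn`
    for a text encoder; both are assumptions for this sketch.
    """
    caption = caption_fn(image)
    e_prompt, e_caption = embed_fn(prompt), embed_fn(caption)
    return F.cosine_similarity(e_prompt, e_caption, dim=-1)

# Toy placeholders so the sketch runs end to end.
def dummy_caption_fn(image):
    return "a red cube on a blue table"

def dummy_embed_fn(text):
    torch.manual_seed(abs(hash(text)) % (2 ** 31))  # deterministic fake embedding per string
    return torch.randn(512)

image = torch.rand(3, 512, 512)
reward = caption_reward(image, "a red cube on a blue table", dummy_caption_fn, dummy_embed_fn)
print(float(reward))  # identical strings -> identical fake embeddings -> reward 1.0
```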
2305.19412 Report Are Large Kernels Better Teachers than Transformers for ConvNets? Tianjin Huang, Lu Yin, Zhenyu Zhang, Li Shen, Meng Fang, Mykola Pechenizkiy, Zhangyang Wang, Shiwei Liu This paper reveals a new appeal of the recently emerged large-kernel Convolutional Neural Networks (ConvNets): as the teacher in Knowledge Distillation (KD) for small-kernel ConvNets. While Transformers have led state-of-the-art (SOTA) performance in various fields with ever-larger models and labeled data, small-kernel ConvNets are considered more suitable for resource-limited applications due to the efficient convolution operation and compact weight sharing. KD is widely used to boost the performance of small-kernel ConvNets. However, previous research shows that it is not quite effective to distill knowledge (e.g., global information) from Transformers to small-kernel ConvNets, presumably due to their disparate architectures. We hereby carry out a first-of-its-kind study unveiling that modern large-kernel ConvNets, a compelling competitor to Vision Transformers, are remarkably more effective teachers for small-kernel ConvNets, due to more similar architectures. Our findings are backed up by extensive experiments on both logit-level and feature-level KD ``out of the box", with no dedicated architectural nor training recipe modifications. Notably, we obtain the \textbf{best-ever pure ConvNet} under 30M parameters with \textbf{83.1\%} top-1 accuracy on ImageNet, outperforming current SOTA methods including ConvNeXt V2 and Swin V2. We also find that beneficial characteristics of large-kernel ConvNets, e.g., larger effective receptive fields, can be seamlessly transferred to students through this large-to-small kernel distillation. Code is available at: \url{https://github.com/VITA-Group/SLaK}. This paper investigates the effectiveness of using large-kernel Convolutional Neural Networks (ConvNets) as teachers for knowledge distillation into small-kernel ConvNets, finding them superior to Vision Transformers in this role. Small-kernel ConvNets are preferable for resource-constrained applications, but struggle to match the performance of large-scale Vision Transformers. Knowledge distillation offers a path to improve their performance without increasing model size, but previous work showed limited effectiveness when distilling from Transformers to ConvNets. The authors conduct systematic experiments on ImageNet, distilling various large-kernel ConvNets (ConvNeXt, SLaK) and Vision Transformers (ViT, Swin, CSWin) into small-kernel ConvNets (ResNet-50, ConvNeXt-T), using both logit-level (KD, NKD) and feature-level (FD) distillation methods. They analyze performance gains, effective receptive fields (ERF), and robustness of the distilled models. Large-kernel ConvNets consistently outperform Vision Transformers as teachers for small-kernel ConvNets across different distillation methods. Students distilled from larger kernel teachers achieve better performance than those trained on smaller kernels, indicating successful transfer of the benefits of large kernels. Students distilled from large-kernel ConvNets inherit their advantageous properties, exhibiting larger and denser ERF and improved robustness compared to those distilled from Transformers. The study primarily focuses on ImageNet classification; further investigation is needed for other tasks and datasets. Future work could explore optimal distillation techniques and training recipes tailored for large-to-small kernel knowledge transfer. knowledge distillation, convolutional neural networks, vision transformers, large kernels, robustness
2305.19327 Report Cones 2: Customizable Image Synthesis with Multiple Subjects Zhiheng Liu, Yifei Zhang, Yujun Shen, Kecheng Zheng, Kai Zhu, Ruili Feng, Yu Liu, Deli Zhao, Jingren Zhou, Yang Cao Synthesizing images with user-specified subjects has received growing attention due to its practical applications. Despite the recent success in single subject customization, existing algorithms suffer from high training cost and low success rate along with increased number of subjects. Towards controllable image synthesis with multiple subjects as the constraints, this work studies how to efficiently represent a particular subject as well as how to appropriately compose different subjects. We find that the text embedding regarding the subject token already serves as a simple yet effective representation that supports arbitrary combinations without any model tuning. Through learning a residual on top of the base embedding, we manage to robustly shift the raw subject to the customized subject given various text conditions. We then propose to employ layout, a very abstract and easy-to-obtain prior, as the spatial guidance for subject arrangement. By rectifying the activations in the cross-attention map, the layout appoints and separates the location of different subjects in the image, significantly alleviating the interference across them. Both qualitative and quantitative experimental results demonstrate our superiority over state-of-the-art alternatives under a variety of settings for multi-subject customization. This paper proposes Cones 2, a novel approach for multi-subject customization using a pre-trained text-to-image diffusion model that utilizes a simple yet effective representation to register a subject and enables the arbitrary composition of various subjects without requiring any model retraining. Existing algorithms for user-specified subject image synthesis, despite success in single subject customization, suffer from high training cost and low success rate when multiple subjects are introduced. This work addresses the need for controllable image synthesis with multiple subjects as constraints. The approach decomposes the task into two components: 1) efficiently representing a subject, achieved by fine-tuning the text encoder with subject-specific images and deriving residual token embeddings, and 2) effectively combining different subjects, addressed by a layout guidance method that controls the generation process by rectifying activations in cross-attention maps based on a user-defined layout. The method demonstrates superior performance over existing baselines in multi-subject customization, particularly with three or more subjects, as evidenced by quantitative metrics and user studies. It effectively mitigates attribute confusion among subjects with high semantic similarity, a challenge faced by other methods. The approach allows for the customization of image synthesis with a relatively large number of subjects (e.g., six subjects). The approach may not consistently generate satisfactory results when combining more than six subjects. The user-provided layout needs to be roughly consistent with the textual description to achieve desired generation results. image synthesis, text-to-image generation, multi-subject customization, diffusion models, layout guidance
2305.19270 Report Learning without Forgetting for Vision-Language Models Da-Wei Zhou, Yuanhan Zhang, Jingyi Ning, Han-Jia Ye, De-Chuan Zhan, Ziwei Liu Class-Incremental Learning (CIL) or continual learning is a desired capability in the real world, which requires a learning system to adapt to new tasks without forgetting former ones. While traditional CIL methods focus on visual information to grasp core features, recent advances in Vision-Language Models (VLM) have shown promising capabilities in learning generalizable representations with the aid of textual information. However, when continually trained with new classes, VLMs often suffer from catastrophic forgetting of former knowledge. Applying VLMs to CIL poses two major challenges: 1) how to adapt the model without forgetting; and 2) how to make full use of the multi-modal information. To this end, we propose PROjectiOn Fusion (PROOF) that enables VLMs to learn without forgetting. To handle the first challenge, we propose training task-specific projections based on the frozen image/text encoders. When facing new tasks, new projections are expanded and former projections are fixed, alleviating the forgetting of old concepts. For the second challenge, we propose the fusion module to better utilize the cross-modality information. By jointly adjusting visual and textual features, the model can capture semantic information with stronger representation ability. Extensive experiments on nine benchmark datasets validate PROOF achieves state-of-the-art performance. This paper presents PROjectiOn Fusion (PROOF), a novel approach to address catastrophic forgetting in Vision-Language Models (VLMs) for Class-Incremental Learning (CIL). The paper addresses the limitations of existing CIL methods that either rely solely on visual information or suffer from catastrophic forgetting when adapting VLMs incrementally. VLMs offer the potential for learning generalizable representations by leveraging textual information, making them suitable for CIL. PROOF utilizes a two-fold strategy: 1) **Expandable Feature Projection:** Freezing pre-trained image/text backbones and appending task-specific linear projections to capture new concepts without overwriting old ones. 2) **Contextualizing Projections with Projection Fusion:** Employing self-attention to fuse and adapt query instance embeddings with visual and textual context, including prototypes and learnable prompts, for robust classification. PROOF achieves state-of-the-art performance on nine benchmark CIL datasets, consistently outperforming existing methods. The ablation study validates the contribution of both expandable projections and cross-modal fusion to the model's performance. A variation of PROOF is proposed to address the zero-shot performance degradation in CIL, striking a balance between adapting to new tasks and preserving generalization ability. The current implementation of PROOF relies on exemplars for rehearsal, which may raise storage and privacy concerns. Future work includes extending PROOF to exemplar-free scenarios and exploring its application in other VLMs and vision-language tasks. class-incremental learning, vision-language models, catastrophic forgetting, projection fusion, cross-modal learning
2305.19193 Report Video ControlNet: Towards Temporally Consistent Synthetic-to-Real Video Translation Using Conditional Image Diffusion Models Ernie Chu, Shuo-Yen Lin, Jun-Cheng Chen In this study, we present an efficient and effective approach for achieving temporally consistent synthetic-to-real video translation in videos of varying lengths. Our method leverages off-the-shelf conditional image diffusion models, allowing us to perform multiple synthetic-to-real image generations in parallel. By utilizing the available optical flow information from the synthetic videos, our approach seamlessly enforces temporal consistency among corresponding pixels across frames. This is achieved through joint noise optimization, effectively minimizing spatial and temporal discrepancies. To the best of our knowledge, our proposed method is the first to accomplish diverse and temporally consistent synthetic-to-real video translation using conditional image diffusion models. Furthermore, our approach does not require any training or fine-tuning of the diffusion models. Extensive experiments conducted on various benchmarks for synthetic-to-real video translation demonstrate the effectiveness of our approach, both quantitatively and qualitatively. Finally, we show that our method outperforms other baseline methods in terms of both temporal consistency and visual quality. This paper presents Video ControlNet, a novel approach that leverages pre-trained conditional image diffusion models, like ControlNet, to achieve temporally consistent synthetic-to-real video translation. Existing image-to-image translation methods often produce temporally inconsistent videos with flickering artifacts. Video ControlNet addresses this issue by enforcing temporal consistency through a novel optimization process. The method uses optical flow information from synthetic videos to guide a joint noise optimization process, minimizing discrepancies between corresponding pixels across frames. This ensures both spatial and temporal consistency in the generated videos. Video ControlNet significantly reduces temporal inconsistency compared to vanilla ControlNet, as evidenced by lower average endpoint errors in optical flow estimation. It demonstrates improved instance-level temporal consistency, resulting in fewer ID switches and fragmentation in multi-object tracking. The paper introduces effective acceleration techniques that significantly speed up the optimization process without compromising temporal consistency. The optimization process might lead to slightly blurry videos due to the emphasis on temporal consistency. Further exploration of different interpolation techniques for in-between frame generation could enhance overall quality and efficiency. video generation, diffusion models, temporal consistency, synthetic-to-real, controlnet
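The temporal objective described above boils down to penalizing, for each pair of neighboring frames, the difference between a frame and its successor warped back along the known optical flow. The following is a minimal PyTorch sketch of such a flow-warped consistency loss, assuming a backward-warping convention and no occlusion masking; the paper's joint noise optimization and acceleration techniques are not shown.

```python
import torch
import torch.nn.functional as F


def warp_with_flow(frame, flow):
    """Backward-warp `frame` (B,C,H,W) with per-pixel flow (B,2,H,W) in pixels.

    Assumes flow channel 0 is the x-displacement and channel 1 the y-displacement.
    """
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(frame.device)   # (2,H,W)
    coords = grid.unsqueeze(0) + flow                               # (B,2,H,W)
    # Normalize sampling coordinates to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)         # (B,H,W,2)
    return F.grid_sample(frame, sample_grid, align_corners=True)


def temporal_consistency_loss(frames, flows):
    """Penalize differences between each frame and its flow-warped successor.

    frames: (T,B,C,H,W) generated frames; flows: (T-1,B,2,H,W) flow from t to t+1.
    """
    loss = 0.0
    for t in range(len(flows)):
        warped_next = warp_with_flow(frames[t + 1], flows[t])
        loss = loss + F.l1_loss(frames[t], warped_next)
    return loss / len(flows)
```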
2305.19129 Report Key-Value Transformer Ali Borji Transformers have emerged as the prevailing standard solution for various AI tasks, including computer vision and natural language processing. The widely adopted Query, Key, and Value formulation (QKV) has played a significant role in this. Nevertheless, no research has examined the essentiality of these three components for transformer performance. Therefore, we conducted an evaluation of the key-value formulation (KV), which generates symmetric attention maps, along with an asymmetric version that incorporates a 2D positional encoding into the attention matrix. Remarkably, this transformer requires fewer parameters and computation than the original one. Through experiments encompassing three task types -- synthetics (such as reversing or sorting a list), vision (mnist or cifar classification), and NLP (character generation and translation) -- we discovered that the KV transformer occasionally outperforms the QKV transformer. However, it also exhibits instances of underperformance compared to QKV, making it challenging to draw a definitive conclusion. Nonetheless, we consider the reported results to be encouraging and anticipate that they may pave the way for more efficient transformers in the future. This paper explores the necessity of the Query, Key, and Value (QKV) formulation in transformers, proposing two alternative models: Key-Value (KV) and KV with 2D positional encoding (KV+Pos). Investigating the essentiality of QKV can lead to more efficient transformer architectures with reduced computational complexity and parameters. The authors empirically evaluate KV and KV+Pos against QKV on 13 tasks spanning synthetics (list manipulation), vision (classification, anomaly detection), and NLP (character/number generation, translation). KV and KV+Pos demonstrate competitive performance, occasionally outperforming QKV, particularly in synthetic and vision tasks. KV+Pos often surpasses KV, suggesting the importance of asymmetry in attention for certain tasks. The effectiveness of KV attention varies across tasks, indicating a need for further investigation into the role of symmetric attention. The study primarily relies on empirical evaluation without delving into the theoretical underpinnings of why KV might succeed or fail. Future work could explore the impact of KV attention on larger, more complex tasks and datasets. transformers, attention mechanism, key-value attention, symmetric attention, model efficiency
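The core architectural change studied in this paper is easy to state: drop the separate query projection so that keys attend to keys, making the pre-softmax attention scores symmetric. Below is a minimal single-head PyTorch sketch of that KV variant under the usual scaled dot-product formulation; the KV+Pos variant additionally adds a 2D positional term to the score matrix, which is omitted here.

```python
import torch
import torch.nn as nn


class KVAttention(nn.Module):
    """Single-head attention without a query projection: keys play both roles,
    so the pre-softmax score matrix K K^T is symmetric. A sketch of the idea
    described in the paper, not its exact implementation."""

    def __init__(self, dim):
        super().__init__()
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** -0.5

    def forward(self, x):                        # x: (batch, tokens, dim)
        k = self.to_k(x)
        v = self.to_v(x)
        scores = k @ k.transpose(-2, -1) * self.scale   # symmetric logits
        attn = torch.softmax(scores, dim=-1)
        return attn @ v
```

Relative to standard QKV attention, this removes one projection matrix per attention layer, which is where the parameter and compute savings come from.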
2305.19094 Report Diffusion Model for Dense Matching Jisu Nam, Gyuseong Lee, Sunwoo Kim, Hyeonsu Kim, Hyoungwon Cho, Seyeon Kim, Seungryong Kim The objective for establishing dense correspondence between paired images consists of two terms: a data term and a prior term. While conventional techniques focused on defining hand-designed prior terms, which are difficult to formulate, recent approaches have focused on learning the data term with deep neural networks without explicitly modeling the prior, assuming that the model itself has the capacity to learn an optimal prior from a large-scale dataset. The performance improvement was obvious, however, they often fail to address inherent ambiguities of matching, such as textureless regions, repetitive patterns, and large displacements. To address this, we propose DiffMatch, a novel conditional diffusion-based framework designed to explicitly model both the data and prior terms. Unlike previous approaches, this is accomplished by leveraging a conditional denoising diffusion model. DiffMatch consists of two main components: conditional denoising diffusion module and cost injection module. We stabilize the training process and reduce memory usage with a stage-wise training strategy. Furthermore, to boost performance, we introduce an inference technique that finds a better path to the accurate matching field. Our experimental results demonstrate significant performance improvements of our method over existing approaches, and the ablation studies validate our design choices along with the effectiveness of each component. Project page is available at https://ku-cvlab.github.io/DiffMatch/. This paper introduces DiffMatch, a novel diffusion-based framework for dense correspondence that explicitly learns both data and prior terms of matching field distribution. Existing methods for dense correspondence often struggle with inherent ambiguities like textureless regions and repetitive patterns, as they mainly focus on the data term without explicitly modeling the matching prior. The proposed method leverages a conditional denoising diffusion model conditioned on initial correspondence and local matching cost. Additionally, a cascaded pipeline with a super-resolution diffusion model is used for upsampling the matching field. DiffMatch achieves state-of-the-art performance on standard benchmarks like HPatches and ETH3D. The method shows robustness to image corruptions, outperforming previous approaches on ImageNet-C corrupted benchmarks. Analysis demonstrates the efficacy of the generative prior in capturing the matching field manifold and handling challenging matching scenarios. The performance on ETH3D with small displacements is slightly lower, potentially due to lower input resolution compared to some prior works. Future work includes exploring higher resolution, advanced feature extractors beyond VGG-16, and incorporating techniques like zoom-in and patch-match for detailed matching. dense correspondence, diffusion models, generative prior, matching field, image corruptions
2305.19066 Report Nested Diffusion Processes for Anytime Image Generation Noam Elata, Bahjat Kawar, Tomer Michaeli, Michael Elad Diffusion models are the current state-of-the-art in image generation, synthesizing high-quality images by breaking down the generation process into many fine-grained denoising steps. Despite their good performance, diffusion models are computationally expensive, requiring many neural function evaluations (NFEs). In this work, we propose an anytime diffusion-based method that can generate viable images when stopped at arbitrary times before completion. Using existing pretrained diffusion models, we show that the generation scheme can be recomposed as two nested diffusion processes, enabling fast iterative refinement of a generated image. In experiments on ImageNet and Stable Diffusion-based text-to-image generation, we show, both qualitatively and quantitatively, that our method's intermediate generation quality greatly exceeds that of the original diffusion model, while the final generation result remains comparable. We illustrate the applicability of Nested Diffusion in several settings, including for solving inverse problems, and for rapid text-based content creation by allowing user intervention throughout the sampling process. This paper introduces Nested Diffusion, an anytime sampling algorithm for pre-trained diffusion models, enabling the generation of plausible images even with early termination. Diffusion models (DMs) excel in image generation but are computationally expensive. Existing methods struggle to produce high-quality intermediate images during sampling, hindering user intervention and real-time applications. Nested Diffusion embeds an inner diffusion process within each step of an outer diffusion process. The inner diffusion generates plausible images iteratively, providing progressively refined intermediate outputs. Nested Diffusion generates superior intermediate images compared to vanilla DMs, as demonstrated by FID scores on ImageNet. Text-to-image generation using Stable Diffusion shows that Nested Diffusion provides semantically meaningful intermediate outputs, unlike vanilla Stable Diffusion. Nested Diffusion successfully generalizes to inverse problems, enabling anytime solutions for tasks like inpainting, super-resolution, and colorization. Nested Diffusion requires careful tuning of the ratio between outer and inner diffusion steps. Further exploration of dynamic allocation for the number of inner steps per outer step is left for future work. diffusion models, anytime algorithms, image generation, inverse problems, human-in-the-loop learning
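The anytime behavior comes from restructuring sampling as an outer process whose every step internally runs a short inner diffusion, so that a plausible image is always available. The schematic sketch below uses placeholder callables `inner_sample` and `outer_update` that are assumptions for illustration; the actual schedules, guidance, and step allocation of Nested Diffusion are not reproduced.

```python
def nested_diffusion_sample(x_T, outer_steps, inner_sample, outer_update):
    """Anytime sampling sketch: each outer step runs a short inner diffusion to
    obtain a plausible clean image, which is yielded as an intermediate result.

    `inner_sample(x_t, t)` and `outer_update(x_t, x0_estimate, t)` are
    placeholders for the inner sampler and the outer-step transition.
    """
    x_t = x_T
    for t in outer_steps:                        # coarse, decreasing schedule
        x0_estimate = inner_sample(x_t, t)       # plausible image at this point
        yield x0_estimate                        # usable if generation stops now
        x_t = outer_update(x_t, x0_estimate, t)  # continue the outer process
```

Stopping the loop early simply keeps the most recently yielded estimate, which is what makes the scheme "anytime".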
2305.19012 Report StyleAvatar3D: Leveraging Image-Text Diffusion Models for High-Fidelity 3D Avatar Generation Chi Zhang, Yiwen Chen, Yijun Fu, Zhenglin Zhou, Gang YU, Billzb Wang, Bin Fu, Tao Chen, Guosheng Lin, Chunhua Shen The recent advancements in image-text diffusion models have stimulated research interest in large-scale 3D generative models. Nevertheless, the limited availability of diverse 3D resources presents significant challenges to learning. In this paper, we present a novel method for generating high-quality, stylized 3D avatars that utilizes pre-trained image-text diffusion models for data generation and a Generative Adversarial Network (GAN)-based 3D generation network for training. Our method leverages the comprehensive priors of appearance and geometry offered by image-text diffusion models to generate multi-view images of avatars in various styles. During data generation, we employ poses extracted from existing 3D models to guide the generation of multi-view images. To address the misalignment between poses and images in data, we investigate view-specific prompts and develop a coarse-to-fine discriminator for GAN training. We also delve into attribute-related prompts to increase the diversity of the generated avatars. Additionally, we develop a latent diffusion model within the style space of StyleGAN to enable the generation of avatars based on image inputs. Our approach demonstrates superior performance over current state-of-the-art methods in terms of visual quality and diversity of the produced avatars. This paper proposes a novel framework for generating high-fidelity 3D avatars by leveraging pre-trained text-to-image diffusion models for data generation and training a 3D GAN. Existing 3D generative models are limited by the scarcity and lack of diversity in 3D training data. This work leverages the rich priors of image-text diffusion models to address this challenge. The proposed method uses ControlNet with StableDiffusion for generating multi-view stylized avatar images guided by poses and text prompts. A coarse-to-fine discriminator is introduced to handle image-pose misalignment during 3D GAN training. Finally, a latent diffusion model in the StyleGAN latent space enables image-conditioned 3D avatar generation. The coarse-to-fine discriminator significantly outperforms existing methods, achieving a FID of 5.6 compared to 7.8 for EG3D and 20.9 for PoF3D. The framework successfully generates diverse and high-quality 3D avatars with various styles defined by text prompts or example images. The latent diffusion model effectively captures facial features and allows for conditional 3D avatar generation even with large pose angles or out-of-domain input images. The reliance on pre-trained pose estimators for guidance can introduce inaccuracies in synthesized images, especially for complex styles. The current implementation focuses on avatar generation and may require further exploration for general 3D object generation. 3d avatar generation, text-to-3d, image-to-3d, diffusion models, generative adversarial networks
2305.18980 Report Multi-modal Queried Object Detection in the Wild Yifan Xu, Mengdan Zhang, Chaoyou Fu, Peixian Chen, Xiaoshan Yang, Ke Li, Changsheng Xu We introduce MQ-Det, an efficient architecture and pre-training strategy design to utilize both textual description with open-set generalization and visual exemplars with rich description granularity as category queries, namely, Multi-modal Queried object Detection, for real-world detection with both open-vocabulary categories and various granularity. MQ-Det incorporates vision queries into existing well-established language-queried-only detectors. A plug-and-play gated class-scalable perceiver module upon the frozen detector is proposed to augment category text with class-wise visual information. To address the learning inertia problem brought by the frozen detector, a vision conditioned masked language prediction strategy is proposed. MQ-Det's simple yet effective architecture and training strategy design is compatible with most language-queried object detectors, thus yielding versatile applications. Experimental results demonstrate that multi-modal queries largely boost open-world detection. For instance, MQ-Det significantly improves the state-of-the-art open-set detector GLIP by +7.8% AP on the LVIS benchmark via multi-modal queries without any downstream finetuning, and averagely +6.3% AP on 13 few-shot downstream tasks, with merely additional 3% modulating time required by GLIP. Code is available at https://github.com/YifanXu74/MQ-Det. This paper introduces MQ-Det, a novel approach for object detection that utilizes both textual descriptions and visual exemplars as category queries. Existing object detectors often struggle with insufficient description granularity when using only textual queries. MQ-Det addresses this limitation by incorporating visual information, leading to more accurate and versatile detection, especially for fine-grained categories. MQ-Det proposes a plug-and-play Gated Class-scalable Perceiver (GCP) module that augments textual category queries with class-wise visual information extracted from exemplars. It employs a vision-conditioned masked language prediction strategy to overcome the learning inertia caused by the frozen pre-trained detector. MQ-Det significantly boosts open-world detection performance, achieving state-of-the-art results on LVIS and ODinW benchmarks. It demonstrates strong finetuning-free transferability, enabling detection of customized objects without any finetuning. MQ-Det requires significantly less training time than previous leading detectors while exhibiting strong few-shot learning capabilities. The contribution of multi-modal queries diminishes with sufficient training data per category. The application of MQ-Det to other dense prediction tasks like segmentation needs further exploration. object detection, multi-modal learning, open-vocabulary detection, few-shot learning, vision-language models
2305.18832 Report ReTR: Modeling Rendering Via Transformer for Generalizable Neural Surface Reconstruction Yixun Liang, Hao He, Ying-cong Chen Generalizable neural surface reconstruction techniques have attracted great attention in recent years. However, they encounter limitations of low confidence depth distribution and inaccurate surface reasoning due to the oversimplified volume rendering process employed. In this paper, we present Reconstruction TRansformer (ReTR), a novel framework that leverages the transformer architecture to redesign the rendering process, enabling complex render interaction modeling. It introduces a learnable meta-ray token and utilizes the cross-attention mechanism to simulate the interaction of rendering process with sampled points and render the observed color. Meanwhile, by operating within a high-dimensional feature space rather than the color space, ReTR mitigates sensitivity to projected colors in source views. Such improvements result in accurate surface assessment with high confidence. We demonstrate the effectiveness of our approach on various datasets, showcasing how our method outperforms the current state-of-the-art approaches in terms of reconstruction quality and generalization ability. Our code is available at https://github.com/YixunLiang/ReTR. This paper proposes Reconstruction Transformer (ReTR), a novel framework for generalizable neural surface reconstruction that leverages the transformer architecture to redesign the rendering process for improved surface modeling. Existing generalizable neural surface reconstruction techniques suffer from limitations of low confidence depth distribution and inaccurate surface reasoning due to the oversimplified volume rendering process. ReTR introduces a learnable meta-ray token and utilizes the cross-attention mechanism to simulate the interaction of the rendering process with sampled points. It operates within a high-dimensional feature space rather than the color space and introduces a unidirectional transformer and continuous positional encoding to simulate photon-medium interaction. ReTR outperforms state-of-the-art approaches in terms of reconstruction quality and generalization ability on various datasets including DTU, BlendedMVS, ETH3D, and Tanks & Temples. The method achieves more accurate surface assessment with higher confidence, resulting in sharper depth distribution and reduced noise. ReTR demonstrates robustness to different sampling strategies and can provide reliable depth estimations even with fewer samples. The method has limitations in terms of efficiency, requiring around 30 seconds to render a depth map and image with a resolution of 600x800. While learning-based rendering enhances capabilities, it introduces additional training parameters compared to traditional volume rendering, increasing training time. neural surface reconstruction, transformer, volume rendering, computer vision, 3d reconstruction
2305.18766 Report HiFA: High-fidelity Text-to-3D Generation with Advanced Diffusion Guidance Junzhe Zhu, Peiye Zhuang, Sanmi Koyejo The advancements in automatic text-to-3D generation have been remarkable. Most existing methods use pre-trained text-to-image diffusion models to optimize 3D representations like Neural Radiance Fields (NeRFs) via latent-space denoising score matching. Yet, these methods often result in artifacts and inconsistencies across different views due to their suboptimal optimization approaches and limited understanding of 3D geometry. Moreover, the inherent constraints of NeRFs in rendering crisp geometry and stable textures usually lead to a two-stage optimization to attain high-resolution details. This work proposes holistic sampling and smoothing approaches to achieve high-quality text-to-3D generation, all in a single-stage optimization. We compute denoising scores in the text-to-image diffusion model's latent and image spaces. Instead of randomly sampling timesteps (also referred to as noise levels in denoising score matching), we introduce a novel timestep annealing approach that progressively reduces the sampled timestep throughout optimization. To generate high-quality renderings in a single-stage optimization, we propose regularization for the variance of z-coordinates along NeRF rays. To address texture flickering issues in NeRFs, we introduce a kernel smoothing technique that refines importance sampling weights coarse-to-fine, ensuring accurate and thorough sampling in high-density regions. Extensive experiments demonstrate the superiority of our method over previous approaches, enabling the generation of highly detailed and view-consistent 3D assets through a single-stage training process. This paper introduces a novel approach for generating high-quality, view-consistent 3D assets from text prompts using a single-stage optimization process, leveraging pre-trained text-to-image diffusion models. Existing text-to-3D generation methods often suffer from artifacts, inconsistencies across views, and require multi-stage optimization for high-resolution details. This work aims to overcome these limitations by improving both the optimization process and the 3D representation. The method combines score distillation from the latent and image spaces of a pre-trained Stable Diffusion model with a novel timestep annealing strategy for improved optimization. Additionally, it introduces a variance regularization loss for sharper geometry and a kernel smoothing technique for coarse-to-fine importance sampling to mitigate flickering artifacts in NeRFs. The proposed method generates 3D assets with superior photorealism, detailed textures, and more natural lighting compared to existing methods. A novel timestep annealing approach effectively addresses divergence issues and enhances the guidance provided by the text-to-image diffusion prior, leading to improved generation quality. The introduction of z-variance regularization and kernel smoothing techniques significantly enhances the quality of NeRF representations, ensuring both sharp geometry and view-consistent appearance. The current implementation relies on low-resolution guidance (64×64) from the Deep Floyd IF model; future work will explore utilizing the full model for high-resolution guidance. The method currently focuses on generating 3D assets from text prompts; future research will explore extending it to handle other input modalities, such as images or sketches. text-to-3d generation, neural radiance fields (nerfs), diffusion models, score distillation sampling (sds), timestep annealing
2305.18729 Report Real-World Image Variation by Aligning Diffusion Inversion Chain Yuechen Zhang, Jinbo Xing, Eric Lo, Jiaya Jia Recent diffusion model advancements have enabled high-fidelity images to be generated using text prompts. However, a domain gap exists between generated images and real-world images, which poses a challenge in generating high-quality variations of real-world images. Our investigation uncovers that this domain gap originates from a latents' distribution gap in different diffusion processes. To address this issue, we propose a novel inference pipeline called Real-world Image Variation by ALignment (RIVAL) that utilizes diffusion models to generate image variations from a single image exemplar. Our pipeline enhances the generation quality of image variations by aligning the image generation process to the source image's inversion chain. Specifically, we demonstrate that step-wise latent distribution alignment is essential for generating high-quality variations. To attain this, we design a cross-image self-attention injection for feature interaction and a step-wise distribution normalization to align the latent features. Incorporating these alignment processes into a diffusion model allows RIVAL to generate high-quality image variations without further parameter optimization. Our experimental results demonstrate that our proposed approach outperforms existing methods concerning semantic similarity and perceptual quality. This generalized inference pipeline can be easily applied to other diffusion-based generation tasks, such as image-conditioned text-to-image generation and stylization. This paper proposes RIVAL, a training-free inference pipeline for generating high-quality image variations from a single real-world image exemplar using diffusion models. Bridging the domain gap between generated and real-world images for high-quality image variation generation with diffusion models is crucial but challenging. RIVAL aligns the image generation process to the source image's inversion chain using two key components: (i) cross-image self-attention injection for feature interaction and (ii) step-wise latent normalization for latent distribution alignment. RIVAL generates high-quality image variations maintaining semantic and style consistency with the exemplar image. RIVAL outperforms existing methods in terms of semantic similarity and perceptual quality. RIVAL can be applied to other image generation tasks like text-driven generation with image conditions and inpainting. RIVAL relies on text prompts, potentially introducing semantic biases. Generating complex scenes with RIVAL can be challenging due to limitations of the base diffusion model. Future work could explore refining diffusion models and novel input modalities beyond text prompts. image variation generation, diffusion models, latent space alignment, cross-image attention, real-world image editing
2305.18726 Report Diffusion-Stego: Training-free Diffusion Generative Steganography via Message Projection Daegyu Kim, Chaehun Shin, Jooyoung Choi, Dahuin Jung, Sungroh Yoon Generative steganography is the process of hiding secret messages in generated images instead of cover images. Existing studies on generative steganography use GAN or Flow models to obtain high hiding message capacity and anti-detection ability over cover images. However, they create relatively unrealistic stego images because of the inherent limitations of generative models. We propose Diffusion-Stego, a generative steganography approach based on diffusion models which outperform other generative models in image generation. Diffusion-Stego projects secret messages into latent noise of diffusion models and generates stego images with an iterative denoising process. Since the naive hiding of secret messages into noise boosts visual degradation and decreases extracted message accuracy, we introduce message projection, which hides messages into noise space while addressing these issues. We suggest three options for message projection to adjust the trade-off between extracted message accuracy, anti-detection ability, and image quality. Diffusion-Stego is a training-free approach, so we can apply it to pre-trained diffusion models which generate high-quality images, or even large-scale text-to-image models, such as Stable diffusion. Diffusion-Stego achieved a high capacity of messages (3.0 bpp of binary messages with 98% accuracy, and 6.0 bpp with 90% accuracy) as well as high quality (with a FID score of 2.77 for 1.0 bpp on the FFHQ 64×64 dataset) that makes it challenging to distinguish from real images in the PNG format. This paper presents Diffusion-Stego, a novel generative steganography approach based on diffusion models and deterministic samplers for hiding messages within generated images, achieving high message capacity and quality. Generative steganography enhances traditional steganography by hiding messages within generated images instead of cover images, making it more resistant to steganalysis. Diffusion-Stego leverages the invertible property of deterministic samplers in diffusion models to embed secret messages into the noise of the generative process. The authors introduce message projection techniques to address challenges like image collapse and extraction errors. Achieves high message capacity, hiding up to 3.0 bpp of binary messages with 98% accuracy, and 6.0 bpp with 90% accuracy. Generates high-quality stego images, achieving a FID score of 2.77 for 1.0 bpp on the FFHQ 64×64 dataset, making it difficult to distinguish from real PNG images. Demonstrates applicability to large-scale text-to-image models like Stable diffusion, allowing for message hiding based on text prompts. Trade-off exists between image quality, anti-detection ability, and extracted message accuracy, requiring further optimization. Reliance on pre-trained diffusion models raises concerns about potential misuse for malicious purposes, necessitating research on safeguards and steganalysis techniques. generative steganography, diffusion models, deterministic sampling, message projection, image steganalysis
2305.18676 Report LayerDiffusion: Layered Controlled Image Editing with Diffusion Models Pengzhi Li, QInxuan Huang, Yikang Ding, Zhiheng Li Text-guided image editing has recently experienced rapid development. However, simultaneously performing multiple editing actions on a single image, such as background replacement and specific subject attribute changes, while maintaining consistency between the subject and the background remains challenging. In this paper, we propose LayerDiffusion, a semantic-based layered controlled image editing method. Our method enables non-rigid editing and attribute modification of specific subjects while preserving their unique characteristics and seamlessly integrating them into new backgrounds. We leverage a large-scale text-to-image model and employ a layered controlled optimization strategy combined with layered diffusion training. During the diffusion process, an iterative guidance strategy is used to generate a final image that aligns with the textual description. Experimental results demonstrate the effectiveness of our method in generating highly coherent images that closely align with the given textual description. The edited images maintain a high similarity to the features of the input image and surpass the performance of current leading image editing methods. LayerDiffusion opens up new possibilities for controllable image editing. This paper introduces LayerDiffusion, a novel semantic-based layered controlled image editing method that enables simultaneous editing of both the background and specific subjects within an image using a single input image. Current text-guided image editing methods struggle to maintain consistency between edited subjects and backgrounds, especially when performing multiple editing actions simultaneously. This new method aims to address these limitations and enhance controllable image editing. The method leverages a large-scale text-to-image model and employs a layered controlled optimization strategy to refine text embeddings. It then uses a layered diffusion training strategy to fine-tune the model and an iterative guidance strategy during inference to generate images consistent with the textual description. LayerDiffusion enables non-rigid editing and attribute modification of specific subjects while preserving their unique characteristics and seamlessly integrating them into new backgrounds. The method generates images with highly similar features to the input images, surpassing the performance of current leading image editing methods. User studies confirm that LayerDiffusion's output aligns more closely with human perception compared to other methods. The method faces challenges in dealing with fine-grained tasks, such as preserving intricate textures or facial features, due to potential overfitting during model fine-tuning. Significant disparities in camera angles between the input reference image and the desired edited image can lead to visually inconsistent scenes. image editing, text-guided synthesis, diffusion models, layered control, semantic editing
2305.18670 Report SAVE: Spectral-Shift-Aware Adaptation of Image Diffusion Models for Text-driven Video Editing Nazmul Karim, Umar Khalid, Mohsen Joneidi, Chen Chen, Nazanin Rahnavard Text-to-Image (T2I) diffusion models have achieved remarkable success in synthesizing high-quality images conditioned on text prompts. Recent methods have tried to replicate the success by either training text-to-video (T2V) models on a very large number of text-video pairs or adapting T2I models on text-video pairs independently. Although the latter is computationally less expensive, it still takes a significant amount of time for per-video adaption. To address this issue, we propose SAVE, a novel spectral-shift-aware adaptation framework, in which we fine-tune the spectral shift of the parameter space instead of the parameters themselves. Specifically, we take the spectral decomposition of the pre-trained T2I weights and only update the singular values while freezing the corresponding singular vectors. In addition, we introduce a spectral shift regularizer aimed at placing tighter constraints on larger singular values compared to smaller ones. This form of regularization enables the model to grasp finer details within the video that align with the provided textual descriptions. We also offer theoretical justification for our proposed regularization technique. Since we are only dealing with spectral shifts, the proposed method reduces the adaptation time significantly (approx. 10 times) and has fewer resource constraints for training. Such attributes posit SAVE to be more suitable for real-world applications, e.g. editing undesirable content during video streaming. We validate the effectiveness of SAVE with an extensive experimental evaluation under different settings, e.g. style transfer, object replacement, privacy preservation, etc. Proposes SAVE, a novel spectral-shift-aware adaptation framework for text-guided video editing that fine-tunes the spectral shift of a pre-trained T2I diffusion model instead of its parameters for efficient adaptation. Existing text-to-video generation methods are computationally expensive, lack temporal awareness, and require large-scale datasets, while this method leverages existing T2I models for efficiency and addresses temporal modeling for improved video editing. Leverages a pre-trained T2I model and fine-tunes its spectral shifts by updating singular values of weight matrices while freezing singular vectors, incorporating a spectral shift regularizer to prioritize learning finer video details, and exploring different spatiotemporal attention mechanisms for temporal coherence. Significantly reduces adaptation time (approximately 10x faster) compared to traditional fine-tuning methods. Achieves state-of-the-art performance in text-guided video editing tasks, including style transfer, object replacement, and local attribute editing, as demonstrated through quantitative and qualitative evaluations. Shows promising results in zero-shot text-to-video generation by incorporating pre-trained T2I adapters for motion modeling and frame attention for temporal consistency. Struggles with editing long video sequences with irregular actions, indicating potential for further exploration of temporal modeling techniques. Relies on pre-trained T2I models, which might limit its capacity to learn novel concepts beyond the knowledge captured in the pre-trained models, suggesting investigation into incorporating external knowledge sources or few-shot learning strategies. video editing, diffusion models, text-to-video generation, spectral shift, temporal modeling
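Since the row above centers on fine-tuning only singular values, a minimal sketch of that parameterization may help: take the SVD of a frozen pretrained weight, keep the singular vectors fixed, and learn only an additive shift on the spectrum. The module name and shapes below are illustrative assumptions; the paper's spectral-shift regularizer and its video-adaptation pipeline are omitted.

```python
import torch
import torch.nn as nn


class SpectralShiftLinear(nn.Module):
    """Wrap a pretrained linear weight so that only a shift on its singular
    values is trainable, keeping the singular vectors frozen. A sketch of the
    spectral-shift idea; the regularizer and training loop are not shown."""

    def __init__(self, pretrained_weight: torch.Tensor):
        super().__init__()
        u, s, vh = torch.linalg.svd(pretrained_weight, full_matrices=False)
        self.register_buffer("u", u)              # frozen left singular vectors
        self.register_buffer("s", s)              # frozen singular values
        self.register_buffer("vh", vh)            # frozen right singular vectors
        self.delta = nn.Parameter(torch.zeros_like(s))   # trainable spectral shift

    def forward(self, x):
        weight = self.u @ torch.diag(self.s + self.delta) @ self.vh
        return x @ weight.T
```

Only `delta` receives gradients, which is why the per-video adaptation described above is so much cheaper than full fine-tuning.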
2305.18439 Report Alteration-free and Model-agnostic Origin Attribution of Generated Images Zhenting Wang, Chen Chen, Yi Zeng, Lingjuan Lyu, Shiqing Ma Recently, there has been a growing attention in image generation models. However, concerns have emerged regarding potential misuse and intellectual property (IP) infringement associated with these models. Therefore, it is necessary to analyze the origin of images by inferring if a specific image was generated by a particular model, i.e., origin attribution. Existing methods are limited in their applicability to specific types of generative models and require additional steps during training or generation. This restricts their use with pre-trained models that lack these specific operations and may compromise the quality of image generation. To overcome this problem, we first develop an alteration-free and model-agnostic origin attribution method via input reverse-engineering on image generation models, i.e., inverting the input of a particular model for a specific image. Given a particular model, we first analyze the differences in the hardness of reverse-engineering tasks for the generated images of the given model and other images. Based on our analysis, we propose a method that utilizes the reconstruction loss of reverse-engineering to infer the origin. Our proposed method effectively distinguishes between generated images from a specific generative model and other images, including those generated by different models and real images. This paper introduces a novel, alteration-free, and model-agnostic method for attributing the origin of AI-generated images, determining if a specific image was generated by a particular model. With the increasing concerns about misuse and intellectual property infringement related to AI-generated images, verifying the origin of these images is crucial for copyright protection, tracing malicious content, and ensuring fairness. The proposed method leverages the concept of input reverse-engineering on image generation models. It analyzes the reconstruction loss during reverse-engineering, comparing the loss for the examined image to the distribution of losses observed in images genuinely generated by the model in question. To mitigate the influence of inherent image complexities, the method calibrates the reconstruction loss using a reference model trained on a different dataset. The method effectively distinguishes between images generated by a specific model and real images, achieving an average accuracy of 94.2%. It successfully differentiates between images generated by a particular model and those generated by other models, regardless of architectural differences or variations in training datasets, with an average accuracy exceeding 95%. The method demonstrates robustness against adaptive attacks, such as image editing, maintaining an accuracy above 90% even when malicious modifications are applied. The computational cost of the method, primarily due to the reverse-engineering process, is acknowledged as a limitation compared to watermarking or classifier-based approaches. Future work aims to explore techniques for accelerating this process. The current focus of the method is on image generation models. Expanding its applicability to other domains, such as video, language, and graph generation models, is identified as a direction for future research. origin attribution, ai-generated images, reverse-engineering, image generation models, intellectual property protection
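The attribution test described above reduces to reverse-engineering an input for the model in question and thresholding the calibrated reconstruction error. Below is a minimal sketch assuming a generic differentiable `generator(z)` interface, a zero-initialized latent, and a user-chosen threshold; the calibration against a reference model follows the row's description, but every hyperparameter here is illustrative.

```python
import torch
import torch.nn.functional as F


def reconstruction_loss(generator, image, steps=200, lr=0.05, latent_dim=512):
    """Reverse-engineer a latent input for `image` and return the final
    reconstruction error. `generator` maps a (1, latent_dim) latent to an image
    tensor shaped like `image`; all settings here are illustrative."""
    z = torch.zeros(1, latent_dim, requires_grad=True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.mse_loss(generator(z), image)
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        return F.mse_loss(generator(z), image).item()


def belongs_to_model(model, reference_model, image, threshold):
    """Attribute `image` to `model` when its calibrated reconstruction loss
    (model loss minus reference-model loss) falls below a chosen threshold."""
    calibrated = (reconstruction_loss(model, image)
                  - reconstruction_loss(reference_model, image))
    return calibrated < threshold
```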
2305.18295 Report RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths Zeyue Xue, Guanglu Song, Qiushan Guo, Boxiao Liu, Zhuofan Zong, Yu Liu, Ping Luo Text-to-image generation has recently witnessed remarkable achievements. We introduce a text-conditional image diffusion model, termed RAPHAEL, to generate highly artistic images, which accurately portray the text prompts, encompassing multiple nouns, adjectives, and verbs. This is achieved by stacking tens of mixture-of-experts (MoEs) layers, i.e., space-MoE and time-MoE layers, enabling billions of diffusion paths (routes) from the network input to the output. Each path intuitively functions as a "painter" for depicting a particular textual concept onto a specified image region at a diffusion timestep. Comprehensive experiments reveal that RAPHAEL outperforms recent cutting-edge models, such as Stable Diffusion, ERNIE-ViLG 2.0, DeepFloyd, and DALL-E 2, in terms of both image quality and aesthetic appeal. Firstly, RAPHAEL exhibits superior performance in switching images across diverse styles, such as Japanese comics, realism, cyberpunk, and ink illustration. Secondly, a single model with three billion parameters, trained on 1,000 A100 GPUs for two months, achieves a state-of-the-art zero-shot FID score of 6.61 on the COCO dataset. Furthermore, RAPHAEL significantly surpasses its counterparts in human evaluation on the ViLG-300 benchmark. We believe that RAPHAEL holds the potential to propel the frontiers of image generation research in both academia and industry, paving the way for future breakthroughs in this rapidly evolving field. More details can be found on a webpage: https://raphael-painter.github.io/. RAPHAEL is a novel text-conditional image diffusion model that leverages a large-scale mixture of diffusion paths to generate highly artistic and text-aligned images. Existing text-to-image models often fail to accurately preserve all textual concepts within generated images due to limitations in the cross-attention mechanism for text-image integration. RAPHAEL employs a U-Net architecture with stacked space-MoE and time-MoE layers to enable billions of diffusion paths, each acting as a 'painter' for specific concepts and image regions. It also incorporates edge-supervised learning to enhance image quality and aesthetics. RAPHAEL exhibits superior performance in generating images across diverse artistic styles, surpassing models like Stable Diffusion and DALL-E 2. It achieves state-of-the-art zero-shot FID-30k score of 6.61 on the COCO dataset, demonstrating high image quality and diversity. RAPHAEL significantly outperforms competitors in human evaluations on the ViLG-300 benchmark for both image-text alignment and aesthetic quality. Potential misuse for creating misleading or false information, requiring prompt filtering and ethical considerations. Computational complexity increases with the number of experts, necessitating a trade-off between image fidelity and inference speed. text-to-image generation, diffusion models, mixture-of-experts (moe), edge-supervised learning, artistic image synthesis
2305.18292 Report Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models Yuchao Gu, Xintao Wang, Jay Zhangjie Wu, Yujun Shi, Yunpeng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning Chang, Weijia Wu, Yixiao Ge, Ying Shan, Mike Zheng Shou Public large-scale text-to-image diffusion models, such as Stable Diffusion, have gained significant attention from the community. These models can be easily customized for new concepts using low-rank adaptations (LoRAs). However, the utilization of multiple concept LoRAs to jointly support multiple customized concepts presents a challenge. We refer to this scenario as decentralized multi-concept customization, which involves single-client concept tuning and center-node concept fusion. In this paper, we propose a new framework called Mix-of-Show that addresses the challenges of decentralized multi-concept customization, including concept conflicts resulting from existing single-client LoRA tuning and identity loss during model fusion. Mix-of-Show adopts an embedding-decomposed LoRA (ED-LoRA) for single-client tuning and gradient fusion for the center node to preserve the in-domain essence of single concepts and support theoretically limitless concept fusion. Additionally, we introduce regionally controllable sampling, which extends spatially controllable sampling (e.g., ControlNet and T2I-Adaptor) to address attribute binding and missing object problems in multi-concept sampling. Extensive experiments demonstrate that Mix-of-Show is capable of composing multiple customized concepts with high fidelity, including characters, objects, and scenes. Mix-of-Show, a novel framework for decentralized multi-concept customization in text-to-image diffusion models, enabling the merging of multiple independently fine-tuned concept models while preserving individual concept identity and fidelity. Existing methods struggle to combine multiple customized concepts effectively due to concept conflicts and identity loss during model fusion, limiting the potential of large-scale text-to-image models for complex compositions. Mix-of-Show utilizes ED-LoRA for single-client concept tuning, which enhances embedding expressiveness to preserve concept essence, and employs gradient fusion at the center node to align inference behavior of individual concepts, minimizing identity loss. ED-LoRA effectively captures concept identity while mitigating concept conflicts observed in vanilla LoRA. Gradient fusion outperforms weight fusion in preserving individual concept fidelity after model merging. Regionally controllable sampling, introduced for multi-concept generation, addresses attribute binding issues and enables complex compositions with accurate attribute assignment. Regionally controllable sampling may exhibit attribute leakage between regions. Center-node fusion using gradient descent can be time-consuming, especially for large spatial features in Unet layers. text-to-image generation, diffusion models, concept customization, decentralized learning, low-rank adaptation
2305.18286 Report Photoswap: Personalized Subject Swapping in Images Jing Gu, Yilin Wang, Nanxuan Zhao, Tsu-Jui Fu, Wei Xiong, Qing Liu, Zhifei Zhang, He Zhang, Jianming Zhang, HyunJoon Jung, Xin Eric Wang In an era where images and visual content dominate our digital landscape, the ability to manipulate and personalize these images has become a necessity. Envision seamlessly substituting a tabby cat lounging on a sunlit window sill in a photograph with your own playful puppy, all while preserving the original charm and composition of the image. We present Photoswap, a novel approach that enables this immersive image editing experience through personalized subject swapping in existing images. Photoswap first learns the visual concept of the subject from reference images and then swaps it into the target image using pre-trained diffusion models in a training-free manner. We establish that a well-conceptualized visual subject can be seamlessly transferred to any image with appropriate self-attention and cross-attention manipulation, maintaining the pose of the swapped subject and the overall coherence of the image. Comprehensive experiments underscore the efficacy and controllability of Photoswap in personalized subject swapping. Furthermore, Photoswap significantly outperforms baseline methods in human ratings across subject swapping, background preservation, and overall quality, revealing its vast application potential, from entertainment to professional editing. Presents *Photoswap*, a novel, training-free method for personalized subject swapping in images using pre-trained diffusion models. It allows users to replace subjects in an image with a user-specified subject, while maintaining the original pose and composition. Personalized subject swapping has broad applications in entertainment, advertising, and professional editing. Existing methods lack the capability to seamlessly integrate new subjects into existing images while preserving their pose and the image composition. Photoswap first learns the visual concept of the target subject from reference images using techniques like DreamBooth. Then, it leverages a training-free attention swapping mechanism that manipulates the self-attention and cross-attention maps and outputs during the target image generation process. Photoswap demonstrates impressive capabilities in swapping subjects in various images while preserving the original composition and subject pose. It significantly outperforms baseline methods in human evaluations for subject identity preservation, background preservation, and overall quality. The method provides control over the subject's appearance by adjusting the attention swapping steps. The model sometimes struggles with accurately reproducing hands and complex background information. Future work aims to address limitations and enhance performance for intricate hand gestures or complex abstract information. image editing, subject swapping, diffusion models, attention mechanisms, personalized image manipulation
2305.18264 Report Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising Fu-Yun Wang, Wenshuo Chen, Guanglu Song, Han-Jia Ye, Yu Liu, Hongsheng Li Leveraging large-scale image-text datasets and advancements in diffusion models, text-driven generative models have made remarkable strides in the field of image generation and editing. This study explores the potential of extending the text-driven ability to the generation and editing of multi-text conditioned long videos. Current methodologies for video generation and editing, while innovative, are often confined to extremely short videos (typically less than 24 frames) and are limited to a single text condition. These constraints significantly limit their applications given that real-world videos usually consist of multiple segments, each bearing different semantic information. To address this challenge, we introduce a novel paradigm dubbed as Gen-L-Video, capable of extending off-the-shelf short video diffusion models for generating and editing videos comprising hundreds of frames with diverse semantic segments without introducing additional training, all while preserving content consistency. We have implemented three mainstream text-driven video generation and editing methodologies and extended them to accommodate longer videos imbued with a variety of semantic segments with our proposed paradigm. Our experimental outcomes reveal that our approach significantly broadens the generative and editing capabilities of video diffusion models, offering new possibilities for future research and applications. The code is available at https://github.com/G-U-N/Gen-L-Video. Introduces Gen-L-Video, a novel paradigm that extends off-the-shelf short video diffusion models to generate and edit long videos with multiple semantic segments, all without additional training. Addresses limitations of current text-driven video generation and editing methods that struggle with long durations (typically under 24 frames) and single text conditions, hindering their real-world applicability. Treats long videos as overlapping short clips, jointly denoising them with existing models while ensuring consistency and coherence via a weighted merging process. Integrates with pretrained, tuning-free, and one-shot-tuning paradigms, and incorporates techniques like bidirectional cross-frame attention. Significantly enhances frame consistency and textual alignment compared to isolated denoising. Successfully generates and edits videos with hundreds of frames and diverse semantic segments, as demonstrated qualitatively and quantitatively. Demonstrates versatility by integrating with personalized diffusion models, layout control mechanisms, and open-set detection/segmentation for arbitrary object editing. The framework’s potential to integrate different video diffusion models with varying lengths remains unexplored. Further research can explore using diverse short video diffusion models concurrently. video generation, video editing, diffusion models, long video synthesis, multi-text conditioned generation
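The co-denoising idea summarized above amounts to splitting the long latent video into overlapping windows, denoising each with an off-the-shelf short-clip model, and merging the overlapping predictions by weighted averaging. The sketch below uses uniform merge weights and an assumed `denoise_clip` callable; the actual per-frame weighting, cross-frame attention, and scheduler details are not reproduced.

```python
import torch


def co_denoise_step(latents, denoise_clip, window=16, stride=8):
    """One denoising step over a long latent video (T, C, H, W) by averaging the
    predictions of a short-clip model over overlapping windows.

    `denoise_clip` is a placeholder for an off-the-shelf short-video denoiser;
    uniform merge weights are an illustrative simplification.
    """
    num_frames = latents.shape[0]
    merged = torch.zeros_like(latents)
    counts = torch.zeros(num_frames, 1, 1, 1, device=latents.device)
    start = 0
    while start < num_frames:
        end = min(start + window, num_frames)
        merged[start:end] += denoise_clip(latents[start:end])
        counts[start:end] += 1
        if end == num_frames:
            break
        start += stride
    return merged / counts   # frames covered by several windows are averaged
```

Because overlapping windows are averaged at every step, neighboring clips are pushed toward agreeing on shared frames, which is what keeps the long video coherent across segments.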
2305.18203 Report Concept Decomposition for Visual Exploration and Inspiration Yael Vinker, Andrey Voynov, Daniel Cohen-Or, Ariel Shamir A creative idea is often born from transforming, combining, and modifying ideas from existing visual examples capturing various concepts. However, one cannot simply copy the concept as a whole, and inspiration is achieved by examining certain aspects of the concept. Hence, it is often necessary to separate a concept into different aspects to provide new perspectives. In this paper, we propose a method to decompose a visual concept, represented as a set of images, into different visual aspects encoded in a hierarchical tree structure. We utilize large vision-language models and their rich latent space for concept decomposition and generation. Each node in the tree represents a sub-concept using a learned vector embedding injected into the latent space of a pretrained text-to-image model. We use a set of regularizations to guide the optimization of the embedding vectors encoded in the nodes to follow the hierarchical structure of the tree. Our method allows to explore and discover new concepts derived from the original one. The tree provides the possibility of endless visual sampling at each node, allowing the user to explore the hidden sub-concepts of the object of interest. The learned aspects in each node can be combined within and across trees to create new visual ideas, and can be used in natural language sentences to apply such aspects to new designs. This paper introduces a method for decomposing visual concepts into distinct aspects, creating a hierarchical tree structure for exploration and inspiration. This approach supports creative design by enabling the exploration of nuanced aspects within a concept and facilitating the generation of novel ideas through combination. Leveraging large vision-language models, the method constructs a binary tree where each node represents a learned vector embedding of a sub-concept. This learning process is guided by a binary reconstruction loss and a coherency constraint ensuring meaningful and distinct aspect representation. The method successfully decomposes complex visual concepts into coherent sub-concepts, as demonstrated through qualitative examples and a perceptual study. The generated tree structure facilitates the exploration and combination of aspects, both within a concept (intra-tree) and across different concepts (inter-tree), to foster new design ideas. The learned aspects can be effectively integrated into natural language sentences, enabling aspect-based image generation using pre-trained text-to-image models. The method may struggle with specific image conditions (e.g., background leakage, dominant sub-concepts) impacting decomposition quality. Generating deeper trees with multiple levels remains challenging due to potential drift towards out-of-distribution embeddings, requiring further investigation. concept decomposition, visual exploration, design inspiration, text-to-image generation, vision-language models
2305.18009 Report Multi-Modal Face Stylization with a Generative Prior Mengtian Li, Yi Dong, Minxuan Lin, Haibin Huang, Pengfei Wan, Chongyang Ma In this work, we introduce a new approach for face stylization. Despite existing methods achieving impressive results in this task, there is still room for improvement in generating high-quality artistic faces with diverse styles and accurate facial reconstruction. Our proposed framework, MMFS, supports multi-modal face stylization by leveraging the strengths of StyleGAN and integrates it into an encoder-decoder architecture. Specifically, we use the mid-resolution and high-resolution layers of StyleGAN as the decoder to generate high-quality faces, while aligning its low-resolution layer with the encoder to extract and preserve input facial details. We also introduce a two-stage training strategy, where we train the encoder in the first stage to align the feature maps with StyleGAN and enable a faithful reconstruction of input faces. In the second stage, the entire network is fine-tuned with artistic data for stylized face generation. To enable the fine-tuned model to be applied in zero-shot and one-shot stylization tasks, we train an additional mapping network from the large-scale Contrastive-Language-Image-Pre-training (CLIP) space to a latent $w+$ space of fine-tuned StyleGAN. Qualitative and quantitative experiments show that our framework achieves superior performance in both one-shot and zero-shot face stylization tasks, outperforming state-of-the-art methods by a large margin. This paper introduces MMFS, a novel framework for multi-modal face stylization that leverages StyleGAN2 within an encoder-decoder architecture for high-quality stylized face generation. Existing face stylization methods struggle to balance high-quality artistic generation with diverse style support, accurate facial reconstruction, and flexible control mechanisms (one-shot, zero-shot). The framework uses a two-stage training strategy. Stage I aligns a convolution-based encoder with StyleGAN2 for accurate reconstruction. Stage II fine-tunes the entire network on artistic data for stylization. A mapping network from CLIP feature space to StyleGAN2's latent space enables guided stylization. MMFS achieves state-of-the-art performance on random stylization, outperforming baselines in quality, diversity, and identity preservation. The method demonstrates superior visual quality in both one-shot and zero-shot settings, effectively transferring styles from reference images or text prompts while preserving facial details. Ablation studies validate the effectiveness of the two-stage training, projection loss, fine-tuning step, and CLIP feature integration. The current implementation has limitations in handling significant geometric deformations (e.g., caricatures). The generated images are limited to the cropped region of FFHQ and struggle with out-of-distribution inputs with large pose variations or occlusions. face stylization, generative adversarial networks, stylegan2, clip, image-to-image translation
2305.17929 Report Factored-NeuS: Reconstructing Surfaces, Illumination, and Materials of Possibly Glossy Objects Yue Fan, Ivan Skorokhodov, Oleg Voynov, Savva Ignatyev, Evgeny Burnaev, Peter Wonka, Yiqun Wang We develop a method that recovers the surface, materials, and illumination of a scene from its posed multi-view images. In contrast to prior work, it does not require any additional data and can handle glossy objects or bright lighting. It is a progressive inverse rendering approach, which consists of three stages. First, we reconstruct the scene radiance and signed distance function (SDF) with our novel regularization strategy for specular reflections. Our approach considers both the diffuse and specular colors, which allows for handling complex view-dependent lighting effects for surface reconstruction. Second, we distill light visibility and indirect illumination from the learned SDF and radiance field using learnable mapping functions. Third, we design a method for estimating the ratio of incoming direct light represented via Spherical Gaussians reflected in a specular manner and then reconstruct the materials and direct illumination of the scene. Experimental results demonstrate that the proposed method outperforms the current state-of-the-art in recovering surfaces, materials, and lighting without relying on any additional data. This paper presents Factored-NeuS, a novel method for reconstructing surfaces, materials, and illumination from posed multi-view images, even for scenes with glossy objects and complex lighting. Existing methods struggle to accurately reconstruct glossy surfaces and disentangle specular reflections from diffuse color, leading to inaccurate geometry and material estimations. This work addresses these limitations, particularly for real-world data. The method employs a three-stage progressive inverse rendering approach: (1) Joint reconstruction of surface SDF and radiance with diffuse and specular color decomposition. (2) Learning direct lighting visibility and indirect illumination from the SDF and radiance. (3) Recovering BRDF and direct illumination using a novel specular albedo network and continuous light visibility. Outperforms state-of-the-art methods in surface reconstruction quality, particularly for glossy objects, as demonstrated on DTU, SK3D, and Shiny datasets. Achieves superior material and lighting decomposition compared to existing techniques, evidenced by improved PSNR metrics and visual fidelity on the IndiSG dataset. Effectiveness of the proposed components, including specular albedo network and continuous light visibility, is validated through ablation studies, showing improvements in both quantitative metrics and qualitative results. Challenges remain in reconstructing fine geometric details and materials for objects with complex structures. Future work includes extending the method to dynamic scenes and incorporating additional data modalities. inverse rendering, surface reconstruction, material reconstruction, illumination reconstruction, glossy surfaces
2305.17916 Report Volume Feature Rendering for Fast Neural Radiance Field Reconstruction Kang Han, Wei Xiang, Lu Yu Neural radiance fields (NeRFs) are able to synthesize realistic novel views from multi-view images captured from distinct positions and perspectives. In NeRF's rendering pipeline, neural networks are used to represent a scene independently or transform queried learnable feature vector of a point to the expected color or density. With the aid of geometry guides either in occupancy grids or proposal networks, the number of neural network evaluations can be reduced from hundreds to dozens in the standard volume rendering framework. Instead of rendering yielded color after neural network evaluation, we propose to render the queried feature vectors of a ray first and then transform the rendered feature vector to the final pixel color by a neural network. This fundamental change to the standard volume rendering framework requires only one single neural network evaluation to render a pixel, which substantially lowers the high computational complexity of the rendering framework attributed to a large number of neural network evaluations. Consequently, we can use a comparably larger neural network to achieve a better rendering quality while maintaining the same training and rendering time costs. Our model achieves the state-of-the-art rendering quality on both synthetic and real-world datasets while requiring a training time of several minutes. This paper proposes Volume Feature Rendering (VFR), a novel method that achieves state-of-the-art view synthesis quality with significantly reduced training time compared to standard volume rendering techniques. Existing neural rendering methods, while capable of high-fidelity view synthesis, suffer from high computational complexity due to numerous neural network evaluations per pixel. This limits the use of larger networks for better quality and increases training time. VFR addresses this limitation by enabling the use of larger networks without sacrificing training speed. Instead of rendering colors directly, VFR renders queried feature vectors of sample points along a ray. These vectors are then integrated and transformed into the final pixel color using a single neural network evaluation. This reduces computational complexity and allows for larger, more expressive networks. VFR achieves state-of-the-art rendering quality on both synthetic (NeRF synthetic dataset) and real-world (360 dataset) benchmarks. The method significantly reduces training time compared to existing fast methods, achieving comparable quality in just minutes. Ablation studies demonstrate the contribution of individual components like GELU activation and SH feature encoding to the improved performance. VFR currently requires per-scene optimization, limiting its applicability in real-time 3D video applications. While offering high quality, VFR's rendering speed needs further improvement to compete with real-time rendering methods like BakedSDF and MobileNeRF, potentially by reducing the number of feature queries. neural rendering, view synthesis, neural radiance fields (nerf), volume rendering, feature integration
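The change to the rendering pipeline is easy to see in code: the compositing weights are computed exactly as in standard volume rendering, but they composite feature vectors instead of colors, and a single network call per ray maps the composited feature to RGB. The dimensions and the small MLP below are placeholders, not the model from the paper.

    import torch, torch.nn as nn

    def volume_feature_render(features, sigmas, deltas, to_rgb):
        # features: (R, S, D) per-sample features; sigmas, deltas: (R, S); to_rgb: D -> 3 network.
        alphas = 1.0 - torch.exp(-sigmas * deltas)                               # (R, S)
        trans = torch.cumprod(torch.cat([torch.ones_like(alphas[:, :1]),
                                         1.0 - alphas + 1e-10], dim=1), dim=1)[:, :-1]
        weights = alphas * trans                                                 # standard alpha compositing
        ray_feat = (weights.unsqueeze(-1) * features).sum(dim=1)                 # (R, D) integrated feature
        return to_rgb(ray_feat)                                                  # ONE network evaluation per ray

    mlp = nn.Sequential(nn.Linear(32, 64), nn.GELU(), nn.Linear(64, 3))
    rgb = volume_feature_render(torch.randn(1024, 48, 32),
                                torch.rand(1024, 48), torch.full((1024, 48), 0.01), mlp)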
2305.17624 Report SimpSON: Simplifying Photo Cleanup with Single-Click Distracting Object Segmentation Network Chuong Huynh, Yuqian Zhou, Zhe Lin, Connelly Barnes, Eli Shechtman, Sohrab Amirghodsi, Abhinav Shrivastava In photo editing, it is common practice to remove visual distractions to improve the overall image quality and highlight the primary subject. However, manually selecting and removing these small and dense distracting regions can be a laborious and time-consuming task. In this paper, we propose an interactive distractor selection method that is optimized to achieve the task with just a single click. Our method surpasses the precision and recall achieved by the traditional method of running panoptic segmentation and then selecting the segments containing the clicks. We also showcase how a transformer-based module can be used to identify more distracting regions similar to the user's click position. Our experiments demonstrate that the model can effectively and accurately segment unknown distracting objects interactively and in groups. By significantly simplifying the photo cleaning and retouching process, our proposed model provides inspiration for exploring rare object segmentation and group selection with a single click. This paper introduces SimpSON, an interactive single-click distractor segmentation network for simplifying photo cleanup. Removing small and dense distracting objects from photos is a common yet time-consuming task in photo editing. SimpSON allows users to select and remove these distractions with a single click, potentially reducing editing time from hours to minutes. The method utilizes a three-stage pipeline: 1) a single-click Distractor Segmentation Network (1C-DSN) segments objects based on a single click, 2) a Click Proposal Network (CPN) identifies similar distractors and proposes their click positions, and 3) a Proposal Verification Module (PVM) verifies the similarity of proposed clicks to reduce false positives. This process can be run iteratively for more thorough selection. The 1C-DSN outperforms existing interactive segmentation methods in segmenting small and medium objects with a single click. The CPN effectively identifies similar distractors within an image, enabling group selection. The iterative selection process with PVM significantly improves group selection accuracy. The group selection pipeline relies on synthetic data due to the lack of labeled datasets with repeated distractors. Further exploration of image harmonization techniques could improve the realism of synthetic data and potentially enhance performance. interactive segmentation, distractor removal, photo retouching, single-click segmentation, group selection
2305.17431 Report Towards Consistent Video Editing with Text-to-Image Diffusion Models Zicheng Zhang, Bonan Li, Xuecheng Nie, Congying Han, Tiande Guo, Luoqi Liu Existing works have advanced Text-to-Image (TTI) diffusion models for video editing in a one-shot learning manner. Despite their low requirements of data and computation, these methods might produce results of unsatisfactory consistency with the text prompt as well as the temporal sequence, limiting their applications in the real world. In this paper, we propose to address the above issues with a novel EI$^2$ model towards Enhancing vIdeo Editing consIstency of TTI-based frameworks. Specifically, we analyze and find that the inconsistency problem is caused by newly added modules into TTI models for learning temporal information. These modules lead to covariate shift in the feature space, which harms the editing capability. Thus, we design EI$^2$ to tackle the above drawbacks with two classical modules: Shift-restricted Temporal Attention Module (STAM) and Fine-coarse Frame Attention Module (FFAM). First, through theoretical analysis, we demonstrate that covariate shift is highly related to Layer Normalization, thus STAM employs an Instance Centering layer in its place to preserve the distribution of temporal features. In addition, STAM employs an attention layer with normalized mapping to transform temporal features while constraining the variance shift. As the second part, we incorporate STAM with a novel FFAM, which efficiently leverages fine-coarse spatial information of overall frames to further enhance temporal consistency. Extensive experiments demonstrate the superiority of the proposed EI$^2$ model for text-driven video editing. This paper presents EI$^2$, a novel approach that enhances the consistency of text-driven video editing using pre-trained Text-to-Image (TTI) diffusion models. Existing methods for adapting TTI models to video editing often suffer from temporal inconsistencies (e.g., flickering) and semantic disparity (inconsistency between edits and text prompts), limiting their real-world applicability. EI$^2$ introduces two novel modules: (1) Shift-restricted Temporal Attention Module (STAM), theoretically grounded to address semantic disparity by mitigating covariate shift in feature space caused by temporal attention. (2) Fine-coarse Frame Attention Module (FFAM), which enhances temporal consistency by efficiently incorporating global spatial-temporal information. EI$^2$ effectively addresses semantic disparity, leading to edits that better align with text prompts. EI$^2$ enhances temporal consistency, producing smoother and more coherent video edits. Extensive experiments demonstrate EI$^2$'s superiority over state-of-the-art methods in terms of visual quality, user preference, and resource consumption. The theoretical analysis relies on a Gaussian assumption for feature distributions, which may not hold perfectly in practice. While demonstrating strong editing capabilities, EI$^2$ may still exhibit temporal inconsistencies in certain challenging scenarios (e.g., object replacement with dissimilar attributes). video editing, diffusion models, text-to-image synthesis, temporal consistency, semantic alignment
2305.17423 Report Accelerating Text-to-Image Editing via Cache-Enabled Sparse Diffusion Inference Zihao Yu, Haoyang Li, Fangcheng Fu, Xupeng Miao, Bin Cui Due to the recent success of diffusion models, text-to-image generation is becoming increasingly popular and achieves a wide range of applications. Among them, text-to-image editing, or continuous text-to-image generation, attracts lots of attention and can potentially improve the quality of generated images. It's common to see that users may want to slightly edit the generated image by making minor modifications to their input textual descriptions for several rounds of diffusion inference. However, such an image editing process suffers from the low inference efficiency of many existing diffusion models even using GPU accelerators. To solve this problem, we introduce Fast Image Semantically Edit (FISEdit), a cache-enabled sparse diffusion model inference engine for efficient text-to-image editing. The key intuition behind our approach is to utilize the semantic mapping between the minor modifications on the input text and the affected regions on the output image. For each text editing step, FISEdit can automatically identify the affected image regions and utilize the cached unchanged regions' feature map to accelerate the inference process. Extensive empirical results show that FISEdit can be $3.4\times$ and $4.4\times$ faster than existing methods on NVIDIA TITAN RTX and A100 GPUs respectively, and even generates more satisfactory images. This paper introduces Fast Image Semantically Edit (FISEdit), a cache-enabled sparse diffusion model inference engine designed for efficient text-to-image editing. The method addresses the inefficiency of existing text-to-image editing techniques that regenerate the entire image even when only minor modifications are desired. FISEdit leverages the semantic mapping between textual changes and affected image regions to enable sparse computation. It involves a mask generation algorithm to identify affected areas, sparse computation techniques for efficient feature map updates, and a cache-based editing pipeline for managing intermediate data. FISEdit achieves up to 4.9x reduction in computational cost compared to baselines. It offers a speedup of up to 3.4x on NVIDIA TITAN RTX and 4.4x on NVIDIA A100 GPUs. The method generates high-quality edited images comparable to existing approaches while being significantly faster. FISEdit's performance degrades when editing low-resolution images due to limited sparsity. Future work includes extending the caching mechanism to support real-world text-to-image services for improved throughput. text-to-image editing, diffusion models, sparse computation, cache-enabled inference, semantic image editing
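A rough sketch of the cache-reuse idea, under the assumption that affected regions can be detected by thresholding the change in a layer's input; the threshold, tensor shapes, and the dense recomputation are illustrative. FISEdit has its own mask-generation algorithm and uses sparse kernels so that only masked regions are actually recomputed.

    import torch

    def cached_sparse_update(new_in, old_in, cached_out, layer, thresh=0.1):
        # Compare current and previous inputs of a layer, recompute only where they differ,
        # and reuse the cached output elsewhere. Dense recompute here; sparse kernels in practice.
        diff = (new_in - old_in).abs().mean(dim=0)              # (H, W) per-location change
        mask = (diff > thresh).float()                          # 1 = region affected by the text edit
        fresh = layer(new_in)
        return mask * fresh + (1.0 - mask) * cached_out

    layer = lambda x: x * 2.0
    old_in = torch.randn(64, 32, 32)
    new_in = old_in.clone(); new_in[:, 8:16, 8:16] += 1.0       # a small, localized edit
    blended = cached_sparse_update(new_in, old_in, layer(old_in), layer)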
2305.17235 Report COMCAT: Towards Efficient Compression and Customization of Attention-Based Vision Models Jinqi Xiao, Miao Yin, Yu Gong, Xiao Zang, Jian Ren, Bo Yuan Attention-based vision models, such as Vision Transformer (ViT) and its variants, have shown promising performance in various computer vision tasks. However, these emerging architectures suffer from large model sizes and high computational costs, calling for efficient model compression solutions. To date, pruning ViTs has been well studied, while other compression strategies that have been widely applied in CNN compression, e.g., model factorization, is little explored in the context of ViT compression. This paper explores an efficient method for compressing vision transformers to enrich the toolset for obtaining compact attention-based vision models. Based on the new insight on the multi-head attention layer, we develop a highly efficient ViT compression solution, which outperforms the state-of-the-art pruning methods. For compressing DeiT-small and DeiT-base models on ImageNet, our proposed approach can achieve 0.45% and 0.76% higher top-1 accuracy even with fewer parameters. Our finding can also be applied to improve the customization efficiency of text-to-image diffusion models, with much faster training (up to $2.6\times$ speedup) and lower extra storage cost (up to $1927.5\times$ reduction) than the existing works. This paper presents ComCAT, a novel model compression technique for attention-based vision models like Vision Transformer (ViT) by effectively exploring the inherent low-rankness within the multi-head attention mechanism. Large model sizes and high computational costs of attention-based models necessitate efficient compression techniques, and exploring low-rankness offers an alternative to existing pruning methods. The authors analyze singular value distributions in ViT layers and propose a head-level low-rank approximation strategy. They further introduce an automatic rank selection method, leveraging differentiable neural architecture search (NAS) to find optimal rank combinations for compression. ComCAT outperforms state-of-the-art pruning methods, achieving 0.45% and 0.76% higher top-1 accuracy with fewer parameters for DeiT-small and DeiT-base on ImageNet. ComCAT demonstrates significant practical speedups on various hardware platforms, including GPUs, mobile processors, ASIC accelerators, and FPGAs. Applied to text-to-image diffusion model customization, ComCAT improves efficiency with faster training (up to 2.6x speedup) and lower storage costs (up to 1927.5x reduction). Exploration of alternative low-rank decomposition methods beyond SVD for specific layers or tasks could be beneficial. Further investigation into the trade-off between compression ratio, accuracy, and hardware efficiency is crucial for practical deployment. model compression, vision transformer, low-rank approximation, text-to-image diffusion, neural architecture search
2305.17223 Report Do We Really Need a Large Number of Visual Prompts? Youngeun Kim, Yuhang Li, Abhishek Moitra, Ruokai Yin, Priyadarshini Panda Due to increasing interest in adapting models on resource-constrained edges, parameter-efficient transfer learning has been widely explored. Among various methods, Visual Prompt Tuning (VPT), prepending learnable prompts to input space, shows competitive fine-tuning performance compared to training of full network parameters. However, VPT increases the number of input tokens, resulting in additional computational overhead. In this paper, we analyze the impact of the number of prompts on fine-tuning performance and self-attention operation in a vision transformer architecture. Through theoretical and empirical analysis we show that adding more prompts does not lead to linear performance improvement. Further, we propose a Prompt Condensation (PC) technique that aims to prevent performance degradation from using a small number of prompts. We validate our methods on FGVC and VTAB-1k tasks and show that our approach reduces the number of prompts by ~70% while maintaining accuracy. This paper analyzes the impact of the number of visual prompts on the performance of Visual Prompt Tuning (VPT) and proposes a Prompt Condensation (PC) technique to reduce the number of prompts while maintaining accuracy. VPT, while memory-efficient, can lead to increased computational cost due to the use of additional prompts. This paper investigates this trade-off to improve the efficiency of VPT. The paper provides empirical analysis on the correlation between the number of prompts and accuracy. It mathematically analyzes the impact of prompts on self-attention operation. It proposes Prompt Condensation (PC) which involves calculating the importance score of prompts and selecting the most important ones for fine-tuning. Reducing the number of prompts by 50% does not lead to a significant drop in accuracy. The self-attention matrix remains low-rank even with the addition of prompts. Proposed Prompt Condensation (PC) can reduce the number of prompts by ~70% while maintaining accuracy. The analysis primarily focuses on VPT-Deep and might not be directly applicable to other VPT variants. Further investigation into more efficient and accurate prompt scoring methods is needed. visual prompt tuning, parameter-efficient transfer learning, prompt condensation, vision transformers, self-attention
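A toy rendition of prompt condensation. The attention-based importance score used below is a stand-in for the paper's scoring function, and the keep ratio, prompt count, and token count are arbitrary; the point is only that scoring the prompts and keeping a small top fraction is cheap.

    import torch

    def condense_prompts(prompts, attn, keep_ratio=0.3):
        # prompts: (P, D) learnable prompt tokens; attn: (N_image_tokens, P) attention paid to prompts.
        scores = attn.mean(dim=0)                         # (P,) average attention per prompt
        k = max(1, int(keep_ratio * prompts.shape[0]))
        keep = scores.topk(k).indices
        return prompts[keep], keep

    prompts = torch.randn(50, 768)                        # e.g. 50 prompts at one ViT layer
    attn = torch.softmax(torch.randn(196, 50), dim=-1)    # 14x14 image tokens attending to prompts
    kept, idx = condense_prompts(prompts, attn, keep_ratio=0.3)   # roughly 70% of prompts removed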
2305.16965 Report Accelerating Diffusion Models for Inverse Problems through Shortcut Sampling Gongye Liu, Haoze Sun, Jiayi Li, Fei Yin, Yujiu Yang Diffusion models have recently demonstrated an impressive ability to address inverse problems in an unsupervised manner. While existing methods primarily focus on modifying the posterior sampling process, the potential of the forward process remains largely unexplored. In this work, we propose Shortcut Sampling for Diffusion(SSD), a novel approach for solving inverse problems in a zero-shot manner. Instead of initiating from random noise, the core concept of SSD is to find a specific transitional state that bridges the measurement image y and the restored image x. By utilizing the shortcut path of "input - transitional state - output", SSD can achieve precise restoration with fewer steps. To derive the transitional state during the forward process, we introduce Distortion Adaptive Inversion. Moreover, we apply back projection as additional consistency constraints during the generation process. Experimentally, we demonstrate SSD's effectiveness on multiple representative IR tasks. Our method achieves competitive results with only 30 NFEs compared to state-of-the-art zero-shot methods(100 NFEs) and outperforms them with 100 NFEs in certain tasks. Code is available at https://github.com/GongyeLiu/SSD This paper proposes Shortcut Sampling for Diffusion (SSD), a novel approach for solving inverse problems in a zero-shot manner by finding a specific transitional state that bridges the input and restored images, enabling faster and more accurate restoration with fewer steps. Existing diffusion-based inverse problem solvers are slow due to their reliance on lengthy sampling processes starting from random noise, neglecting the potential of modifying the forward process. SSD uses Distortion Adaptive Inversion (DA Inversion) to find the transitional state by adding controllable random perturbations during the inversion process. It then applies back projection during the generation process to ensure faithfulness to the input image. SSD achieves competitive results with only 30 NFEs compared to state-of-the-art zero-shot methods using 100 NFEs. SSD outperforms existing methods in certain IR tasks when using 100 NFEs. SSD demonstrates strong performance on various inverse problems, including super-resolution, colorization, inpainting, and deblurring, on both CelebA and ImageNet datasets. The reliance on back projection may limit the performance when the degradation operator estimation is inaccurate. The paper mainly focuses on simple degradation operators; handling more complex real-world degradation remains unexplored. diffusion models, inverse problems, zero-shot learning, image restoration, shortcut sampling
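The back-projection consistency constraint can be sketched for a linear degradation such as 4x downsampling, where the pseudo-inverse is approximated by upsampling; the operator choice and interpolation mode here are assumptions rather than the paper's exact setup, and Distortion Adaptive Inversion itself is omitted (it perturbs the deterministic inversion path with controlled noise to reach the transitional state).

    import torch
    import torch.nn.functional as F

    def back_project(x0_hat, y, scale=4):
        # Enforce consistency with the measurement y during generation:
        # x0_hat <- x0_hat + A^+(y - A x0_hat), with A = bicubic downsample, A^+ ~ bicubic upsample.
        Ax = F.interpolate(x0_hat, scale_factor=1.0 / scale, mode="bicubic", align_corners=False)
        return x0_hat + F.interpolate(y - Ax, scale_factor=float(scale), mode="bicubic", align_corners=False)

    x0_hat = torch.randn(1, 3, 256, 256)          # current clean-image estimate from the sampler
    y = torch.randn(1, 3, 64, 64)                 # low-resolution measurement
    x0_hat = back_project(x0_hat, y)              # applied after denoising steps as a consistency step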
2305.16936 Report CRoSS: Diffusion Model Makes Controllable, Robust and Secure Image Steganography Jiwen Yu, Xuanyu Zhang, Youmin Xu, Jian Zhang Current image steganography techniques are mainly focused on cover-based methods, which commonly have the risk of leaking secret images and poor robustness against degraded container images. Inspired by recent developments in diffusion models, we discovered that two properties of diffusion models, the ability to achieve translation between two images without training, and robustness to noisy data, can be used to improve security and natural robustness in image steganography tasks. For the choice of diffusion model, we selected Stable Diffusion, a type of conditional diffusion model, and fully utilized the latest tools from open-source communities, such as LoRAs and ControlNets, to improve the controllability and diversity of container images. In summary, we propose a novel image steganography framework, named Controllable, Robust and Secure Image Steganography (CRoSS), which has significant advantages in controllability, robustness, and security compared to cover-based image steganography methods. These benefits are obtained without additional training. To our knowledge, this is the first work to introduce diffusion models to the field of image steganography. In the experimental section, we conducted detailed experiments to demonstrate the advantages of our proposed CRoSS framework in controllability, robustness, and security. Proposes CRoSS, a novel coverless image steganography framework leveraging diffusion models for enhanced security, controllability, and robustness. Addresses limitations of existing steganography methods, which often leak secret image information, lack robustness to degradation, and offer limited control over container image content. Utilizes DDIM Inversion with conditional diffusion models (Stable Diffusion) to enable invertible image translation between secret and container images. Different conditions (prompts, LoRAs, ControlNets) act as keys for hiding and revealing. CRoSS demonstrates higher security against steganalysis attacks and visual suspicion compared to existing methods. Offers flexible control over container image content while maintaining high visual quality. Exhibits strong robustness to various image degradations, including real-world scenarios like transmission via messaging apps and phone captures. While subjectively acceptable, the pixel-wise fidelity of revealed images is lower than cover-based methods. Current implementation focuses on single-subject modifications within the secret image, limiting its ability to hide global image content. image steganography, diffusion models, ddim inversion, coverless steganography, stable diffusion
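The hide/reveal round trip can be written down abstractly; invert and generate below are placeholders for deterministic DDIM inversion and sampling with Stable Diffusion under a given condition (prompt, LoRA, or ControlNet input), which is how the two "keys" enter. The stubs only show the call shape.

    def hide(secret, key_private, key_public, invert, generate):
        # Invert the secret image to a transitional latent under the private key,
        # then sample under the public key to obtain a natural-looking container image.
        z = invert(secret, condition=key_private)
        return generate(z, condition=key_public)

    def reveal(container, key_private, key_public, invert, generate):
        # Reverse the same path: invert under the public key, sample under the private key.
        z = invert(container, condition=key_public)
        return generate(z, condition=key_private)

    # trivial stubs just to make the sketch executable; real use wraps a Stable Diffusion DDIM sampler
    identity = lambda x, condition: x
    container = hide("secret.png", "a photo of a cat", "a photo of a dog", identity, identity)
    secret_hat = reveal(container, "a photo of a cat", "a photo of a dog", identity, identity)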
2305.16835 Report OpenVIS: Open-vocabulary Video Instance Segmentation Pinxue Guo, Tony Huang, Peiyang He, Xuefeng Liu, Tianjun Xiao, Zhaoyu Chen, Wenqiang Zhang Open-vocabulary Video Instance Segmentation (OpenVIS) can simultaneously detect, segment, and track arbitrary object categories in a video, without being constrained to categories seen during training. In this work, we propose an OpenVIS framework called InstFormer that achieves powerful open vocabulary capability through lightweight fine-tuning on a limited-category labeled dataset. Specifically, InstFormer comes in three steps: a) Open-world Mask Proposal: we utilize a query-based transformer, which is encouraged to propose all potential object instances, to obtain class-agnostic instance masks; b) Open-vocabulary Instance Representation and Classification: we propose InstCLIP, adapted from pre-trained CLIP with Instance Guidance Attention. InstCLIP generates the instance token capable of representing each open-vocabulary instance. These instance tokens not only enable open-vocabulary classification for multiple instances with a single CLIP forward pass but have also been proven effective for subsequent open-vocabulary instance tracking. c) Rollout Association: we introduce a class-agnostic rollout tracker to predict rollout tokens from the tracking tokens of previous frames to enable open-vocabulary instance association across frames in the video. The experimental results demonstrate that the proposed InstFormer achieves state-of-the-art capabilities on a comprehensive OpenVIS evaluation benchmark, while also achieving competitive performance in the fully supervised VIS task. This paper presents InstFormer, a novel open-vocabulary video instance segmentation (OpenVIS) framework that segments, detects, and tracks arbitrary object categories in a video without being limited to categories seen during training. Current video instance segmentation models are limited to identifying objects from categories present in their training data, hindering their ability to understand target videos comprehensively. This necessitates retraining with new data for novel categories, which is time-consuming and resource-intensive. OpenVIS addresses this limitation by enabling the identification of objects from arbitrary categories, even unseen during training. InstFormer utilizes a three-step approach: 1) Open-world Mask Proposal using a query-based transformer to generate class-agnostic instance masks. 2) Open-vocabulary Instance Representation and Classification through InstCLIP, an enhanced version of CLIP with Instance Guidance Attention, to embed each instance with an instance token for classification and tracking. 3) Rollout Association leveraging a class-independent rollout tracker with temporal contrastive learning to associate instances across frames. InstFormer achieves state-of-the-art OpenVIS performance, surpassing existing open-vocabulary methods even when trained on fewer categories. The proposed InstCLIP effectively maintains the zero-shot capability of the pre-trained CLIP model, demonstrating strong performance in zero-shot instance classification. InstFormer demonstrates competitive performance in fully supervised VIS tasks, highlighting its ability to excel in both open-set and closed-set scenarios. The reliance on pre-trained VLMs like CLIP introduces a dependence on the capabilities and limitations of these models. Future work could explore improving the rollout tracker by incorporating more advanced temporal modeling techniques for enhanced instance association. open-vocabulary, video instance segmentation, openvis, instance guidance clip, contrastive learning
2305.16807 Report Negative-prompt Inversion: Fast Image Inversion for Editing with Text-guided Diffusion Models Daiki Miyake, Akihiro Iohara, Yu Saito, Toshiyuki Tanaka In image editing employing diffusion models, it is crucial to preserve the reconstruction quality of the original image while changing its style. Although existing methods ensure reconstruction quality through optimization, a drawback of these is the significant amount of time required for optimization. In this paper, we propose negative-prompt inversion, a method capable of achieving equivalent reconstruction solely through forward propagation without optimization, thereby enabling much faster editing processes. We experimentally demonstrate that the reconstruction quality of our method is comparable to that of existing methods, allowing for inversion at a resolution of 512 pixels and with 50 sampling steps within approximately 5 seconds, which is more than 30 times faster than null-text inversion. Reduction of the computation time by the proposed method further allows us to use a larger number of sampling steps in diffusion models to improve the reconstruction quality with a moderate increase in computation time. The paper proposes "negative-prompt inversion," a novel method for fast reconstruction of real images with diffusion models without requiring optimization. Existing image editing methods using diffusion models rely on optimization for reconstruction quality, leading to high computational costs and slow processing. This new method significantly accelerates the editing process. The method leverages the observation that the optimal null-text embedding in null-text inversion can be approximated by the embedding of the input text prompt. It replaces the iterative optimization process of null-text inversion with a single forward pass using the prompt embedding. Negative-prompt inversion achieves reconstruction quality comparable to null-text inversion but is more than 30 times faster. The method enables real-time image editing when combined with existing editing techniques like prompt-to-prompt. Increasing the number of sampling steps in negative-prompt inversion further improves reconstruction quality while remaining computationally faster than optimization-based methods. The average reconstruction quality of negative-prompt inversion, while visually similar, does not fully match the quality of null-text inversion. The method may exhibit failures, particularly in reconstructing human figures, potentially due to limitations in the employed AutoEncoder. diffusion models, image editing, image reconstruction, negative-prompt inversion, real-time editing
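The core observation reduces to one line of classifier-free guidance: null-text inversion optimizes the unconditional ("null") embedding per step, whereas negative-prompt inversion simply sets it to the source prompt embedding, at which point the guided prediction collapses to the unguided one and reconstruction is preserved with no optimization. The stubbed unet and random embeddings below only show the call pattern; they are not the Stable Diffusion API.

    import torch

    def guided_eps(unet, x_t, t, cond_emb, neg_emb, scale=7.5):
        # Classifier-free guidance: eps = eps_neg + scale * (eps_cond - eps_neg).
        eps_cond = unet(x_t, t, cond_emb)
        eps_neg = unet(x_t, t, neg_emb)
        return eps_neg + scale * (eps_cond - eps_neg)

    # with cond_emb == neg_emb (the source prompt), the guided prediction equals the
    # plain conditional prediction, so the DDIM trajectory retraces the inversion path
    unet = lambda x, t, emb: x * 0.0 + emb.mean()   # stand-in noise predictor
    x = torch.randn(1, 4, 64, 64)
    src = torch.randn(1, 77, 768)
    eps = guided_eps(unet, x, torch.tensor(500), cond_emb=src, neg_emb=src)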
2305.16804 Report Towards Open-World Segmentation of Parts Tai-Yu Pan, Qing Liu, Wei-Lun Chao, Brian Price Segmenting object parts such as cup handles and animal bodies is important in many real-world applications but requires more annotation effort. The largest dataset nowadays contains merely two hundred object categories, implying the difficulty to scale up part segmentation to an unconstrained setting. To address this, we propose to explore a seemingly simplified but empirically useful and scalable task, class-agnostic part segmentation. In this problem, we disregard the part class labels in training and instead treat all of them as a single part class. We argue and demonstrate that models trained without part classes can better localize parts and segment them on objects unseen in training. We then present two further improvements. First, we propose to make the model object-aware, leveraging the fact that parts are "compositions", whose extents are bounded by the corresponding objects and whose appearances are by nature not independent but bundled. Second, we introduce a novel approach to improve part segmentation on unseen objects, inspired by an interesting finding -- for unseen objects, the pixel-wise features extracted by the model often reveal high-quality part segments. To this end, we propose a novel self-supervised procedure that iterates between pixel clustering and supervised contrastive learning that pulls pixels closer or pushes them away. Via extensive experiments on PartImageNet and Pascal-Part, we show notable and consistent gains by our approach, essentially a critical step towards open-world part segmentation. This paper presents Open Part Segmenter (OPS), a novel approach for open-world part instance segmentation, enabling the segmentation of parts for objects unseen during training. Existing part segmentation methods struggle in open-world settings due to the limited coverage of part classes in training data. OPS aims to address this limitation by enabling part segmentation for unseen objects. OPS leverages class-agnostic training, object-aware learning (using object masks), and self-supervised fine-tuning with unlabeled data. This approach removes the reliance on specific part class labels and allows the model to learn more general part representations. Class-agnostic training proves effective for open-world part segmentation, outperforming class-aware training. Object-aware learning significantly improves part segmentation, especially for unseen objects, by leveraging the object-part relationship. Self-supervised fine-tuning with unlabeled data further enhances the model's generalizability to unseen parts and objects. The evaluation metric for unlabeled part discovery requires further exploration. While multiple rounds of self-training show promise, further investigation is needed to optimize pseudo-label generation. part segmentation, open-world learning, class-agnostic training, object-aware learning, self-supervised learning
2305.16759 Report StyleHumanCLIP: Text-guided Garment Manipulation for StyleGAN-Human Takato Yoshikawa, Yuki Endo, Yoshihiro Kanamori This paper tackles text-guided control of StyleGAN for editing garments in full-body human images. Existing StyleGAN-based methods suffer from handling the rich diversity of garments and body shapes and poses. We propose a framework for text-guided full-body human image synthesis via an attention-based latent code mapper, which enables more disentangled control of StyleGAN than existing mappers. Our latent code mapper adopts an attention mechanism that adaptively manipulates individual latent codes on different StyleGAN layers under text guidance. In addition, we introduce feature-space masking at inference time to avoid unwanted changes caused by text inputs. Our quantitative and qualitative evaluations reveal that our method can control generated images more faithfully to given texts than existing methods. This paper introduces a novel framework that leverages text guidance for manipulating garments in full-body human images generated by StyleGAN-Human. Existing methods for text-guided StyleGAN image editing struggle with the diversity of garments and body shapes/poses present in full-body human images, often neglecting garment details or altering the person's identity. The proposed framework utilizes an attention-based latent code mapper to effectively capture the correspondence between text descriptions and individual latent codes controlling different StyleGAN layers. It also employs feature-space masking at inference to prevent unwanted changes in image areas unrelated to the text input. The proposed method demonstrates superior performance in accurately reflecting text semantics in edited images compared to existing StyleGAN-based methods (StyleCLIP and HairCLIP) and diffusion model-based approaches (SD Inpainting and DiffEdit). Quantitative evaluations reveal that the proposed method achieves higher CLIP accuracy and better preserves background regions (lower BG LPIPS) compared to existing methods. Subjective user studies confirm the effectiveness of the proposed method, indicating higher scores for text alignment and competitive scores for image realism. Currently, separate mapper networks are trained for the upper and lower body, requiring users to manually select the appropriate network based on the target text. The method faces limitations in handling full-body garments like dresses and is sensitive to the accuracy of the human parsing model used for mask generation. image editing, stylegan, text-guided image manipulation, virtual try-on, full-body human image synthesis
2305.16681 Report CAILA: Concept-Aware Intra-Layer Adapters for Compositional Zero-Shot Learning Zhaoheng Zheng, Haidong Zhu, Ram Nevatia In this paper, we study the problem of Compositional Zero-Shot Learning (CZSL), which is to recognize novel attribute-object combinations with pre-existing concepts. Recent researchers focus on applying large-scale Vision-Language Pre-trained (VLP) models like CLIP with strong generalization ability. However, these methods treat the pre-trained model as a black box and focus on pre- and post-CLIP operations, which do not inherently mine the semantic concept between the layers inside CLIP. We propose to dive deep into the architecture and insert adapters, a parameter-efficient technique proven to be effective among large language models, into each CLIP encoder layer. We further equip adapters with concept awareness so that concept-specific features of "object", "attribute", and "composition" can be extracted. We assess our method on four popular CZSL datasets, MIT-States, C-GQA, UT-Zappos, and VAW-CZSL, which shows state-of-the-art performance compared to existing methods on all of them. This paper proposes CAILA (Concept-Aware Intra-Layer Adapters), a novel method that enhances Compositional Zero-Shot Learning (CZSL) by integrating concept-aware adapters into each layer of pre-trained vision-language models like CLIP. Existing CZSL methods often treat large-scale VLP models as black boxes, failing to fully exploit the semantic knowledge embedded within their layers. This work aims to overcome this limitation by directly modifying the model architecture for better knowledge transfer and generalization. The proposed CAILA method inserts concept-specific adapters into each CLIP encoder layer to extract features related to attributes, objects, and compositions. It then employs a Mixture-of-Adapters (MoA) mechanism to fuse these features and enhance knowledge aggregation. Additionally, a Primitive Concept Shift strategy is introduced to generate augmented training data by combining primitive features. CAILA achieves state-of-the-art performance on four popular CZSL benchmarks: MIT-States, C-GQA, UT-Zappos, and VAW-CZSL, under both closed and open world settings. The method exhibits significant improvements, especially on C-GQA, where it surpasses baselines by a large margin in the challenging open world scenario. Ablation studies validate the effectiveness of individual components, including adapters, MoA, concept shift, and choice of mixture functions. CAILA's performance can degrade in open world settings when the number of possible compositions significantly increases, highlighting the need for more robust methods to handle large search spaces. Future work could explore alternative adapter architectures and MoA strategies to further enhance knowledge transfer and generalization in CZSL. compositional zero-shot learning, vision-language pre-training, clip, adapters, concept-aware learning
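A compact sketch of a concept-aware adapter block. The bottleneck width, the uniform averaging used in place of the paper's Mixture-of-Adapters, and where exactly the block sits inside a CLIP encoder layer are all simplifying assumptions.

    import torch, torch.nn as nn

    class Adapter(nn.Module):
        # Standard bottleneck adapter: down-project, non-linearity, up-project, residual.
        def __init__(self, dim=768, bottleneck=64):
            super().__init__()
            self.down, self.act, self.up = nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim)
        def forward(self, x):
            return x + self.up(self.act(self.down(x)))

    class ConceptAdapters(nn.Module):
        # Separate adapters for 'attribute', 'object', and 'composition'; the composition
        # branch also mixes the primitive branches (simple averaging as a stand-in for MoA).
        def __init__(self, dim=768):
            super().__init__()
            self.attr, self.obj, self.comp = Adapter(dim), Adapter(dim), Adapter(dim)
        def forward(self, x):
            a, o = self.attr(x), self.obj(x)
            c = self.comp((a + o + x) / 3.0)
            return a, o, c

    tokens = torch.randn(2, 197, 768)      # CLIP ViT tokens for a batch of 2 images
    attr_feat, obj_feat, comp_feat = ConceptAdapters()(tokens)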
2305.16411 Report ZeroAvatar: Zero-shot 3D Avatar Generation from a Single Image Zhenzhen Weng, Zeyu Wang, Serena Yeung Recent advancements in text-to-image generation have enabled significant progress in zero-shot 3D shape generation. This is achieved by score distillation, a methodology that uses pre-trained text-to-image diffusion models to optimize the parameters of a 3D neural representation, e.g., a Neural Radiance Field (NeRF). While showing promising results, existing methods are often not able to preserve the geometry of complex shapes, such as human bodies. To address this challenge, we present ZeroAvatar, a method that introduces an explicit 3D human body prior into the optimization process. Specifically, we first estimate and refine the parameters of a parametric human body from a single image. Then during optimization, we use the posed parametric body as an additional geometry constraint to regularize the diffusion model as well as the underlying density field. Lastly, we propose a UV-guided texture regularization term to further guide the completion of texture on invisible body parts. We show that ZeroAvatar significantly enhances the robustness and 3D consistency of optimization-based image-to-3D avatar generation, outperforming existing zero-shot image-to-3D methods. Proposes ZeroAvatar, a zero-shot 3D human avatar generation method from a single image using a pre-trained text-to-image diffusion model as a prior. It leverages a parametric human body model (SMPL) for initialization and depth-guided optimization, enhancing geometry preservation, and incorporates UV-guided texture completion for improved appearance, surpassing existing zero-shot methods. Extracting accurate 3D information from single images is crucial for content creation, AR/VR, robotics, and scene understanding, but existing methods struggle with preserving complex human geometry. 1. Initialization: Estimate body pose and shape from the input image using SMPL, refining it against image features for accurate alignment. 2. Depth-guided Optimization: Optimize NeRF parameters using a depth-conditioned score distillation loss derived from a pre-trained text-to-image diffusion model, guided by SMPL depth. 3. UV-guided Texture Completion: Regularize the appearance of invisible body parts using a UV-guided texture prior, leveraging texture symmetry. ZeroAvatar significantly improves geometry and appearance fidelity of generated avatars, outperforming existing zero-shot 3D generation methods. It effectively preserves human structure, achieving higher detection scores on novel views compared to baselines. The method demonstrates strong generalization ability, handling both real-world humans and virtual avatars. Limitation: Relies on SMPL, limiting accuracy for body proportions deviating significantly from the average human shape. Future work: Enhance the generalizability of the human body prior. Limitation: Extracted meshes from the density field can be coarse. Future work: Integrate techniques for geometry and texture refinement. 3d avatar generation, zero-shot learning, diffusion models, score distillation sampling, human body prior
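Since ZeroAvatar builds on score distillation, a bare-bones distillation step is sketched below: the rendered image is noised, a (depth-conditioned) diffusion model predicts the noise, and the residual weighted by w(t) is pushed back through the renderer. The noise schedule, weighting, and the stubbed diffusion_eps callable are placeholders, not the paper's exact setup.

    import torch

    def sds_grad(rendered, diffusion_eps, t, alphas_cumprod, cond):
        # Noise the render, query the frozen diffusion model, and return w(t)*(eps_pred - eps),
        # which is then backpropagated into the 3D representation's parameters.
        a_t = alphas_cumprod[t]
        eps = torch.randn_like(rendered)
        noisy = a_t.sqrt() * rendered + (1 - a_t).sqrt() * eps
        with torch.no_grad():
            eps_pred = diffusion_eps(noisy, t, cond)    # cond = text + SMPL depth in ZeroAvatar
        w = 1.0 - a_t
        return w * (eps_pred - eps)

    alphas = torch.linspace(0.9999, 0.01, 1000)          # stand-in noise schedule
    img = torch.randn(1, 3, 64, 64, requires_grad=True)  # stands in for the NeRF render
    grad = sds_grad(img, lambda x, t, c: torch.randn_like(x), torch.tensor(500), alphas, cond=None)
    img.backward(gradient=grad)                          # img.grad now holds the SDS gradient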
2305.16310 Report Securing Deep Generative Models with Universal Adversarial Signature Yu Zeng, Mo Zhou, Yuan Xue, Vishal M. Patel Recent advances in deep generative models have led to the development of methods capable of synthesizing high-quality, realistic images. These models pose threats to society due to their potential misuse. Prior research attempted to mitigate these threats by detecting generated images, but the varying traces left by different generative models make it challenging to create a universal detector capable of generalizing to new, unseen generative models. In this paper, we propose to inject a universal adversarial signature into an arbitrary pre-trained generative model, in order to make its generated contents more detectable and traceable. First, the imperceptible optimal signature for each image can be found by a signature injector through adversarial training. Subsequently, the signature can be incorporated into an arbitrary generator by fine-tuning it with the images processed by the signature injector. In this way, the detector corresponding to the signature can be reused for any fine-tuned generator for tracking the generator identity. The proposed method is validated on the FFHQ and ImageNet datasets with various state-of-the-art generative models, consistently showing a promising detection rate. Code will be made publicly available at \url{https://github.com/zengxianyu/genwm}. This work proposes a method for securing deep generative models by embedding imperceptible signatures into generated images. These signatures are designed to be robust to various image manipulations, enabling the source of generated images to be tracked. The proliferation of high-quality image generation models raises concerns about potential misuse, including the spread of misinformation. This method aims to address this by providing a way to verify the origin of generated images. The method involves fine-tuning a pre-trained generative model with an additional signature injector network. The injector embeds the signature while minimizing the perceptual difference between the original and signed images. The presence of the signature is then verified by a separate classifier network. The embedded signatures are nearly imperceptible, with minimal impact on visual quality. The signatures are robust to various image manipulations, including compression, resizing, and noise addition. The method achieves high classification accuracy in distinguishing between signed and unsigned images. The current method requires fine-tuning a pre-trained generative model to embed the signature, limiting its practicality. Future work could explore training-free frameworks for securing deep generative models. deep generative models, image security, watermarking, source tracking, misinformation mitigation
2305.16295 Report HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning Chia-Wen Kuo, Zsolt Kira A great deal of progress has been made in image captioning, driven by research into how to encode the image using pre-trained models. This includes visual encodings (e.g. image grid features or detected objects) and more recently textual encodings (e.g. image tags or text descriptions of image regions). As more advanced encodings are available and incorporated, it is natural to ask: how to efficiently and effectively leverage the heterogeneous set of encodings? In this paper, we propose to regard the encodings as augmented views of the input image. The image captioning model encodes each view independently with a shared encoder efficiently, and a contrastive loss is incorporated across the encoded views in a novel way to improve their representation quality and the model's data efficiency. Our proposed hierarchical decoder then adaptively weighs the encoded views according to their effectiveness for caption generation by first aggregating within each view at the token level, and then across views at the view level. We demonstrate significant performance improvements of +5.6% CIDEr on MS-COCO and +12.9% CIDEr on Flickr30k compared to state of the arts, and conduct rigorous analyses to demonstrate the importance of each part of our design. This paper proposes HAAV, a hierarchical aggregation method for augmented views in image captioning, enabling efficient and effective utilization of heterogeneous image encodings. Existing methods for leveraging heterogeneous image encodings in captioning are either computationally expensive (concatenation) or parameter inefficient (separate models per view). HAAV offers a solution that is both efficient and effective. HAAV treats heterogeneous views as image augmentations, encoding them independently with a shared transformer encoder. A contrastive loss improves representation learning. A hierarchical decoder then combines information within and across views, adaptively weighting their contributions for each generated word. HAAV achieves state-of-the-art performance on MS-COCO (+5.6% CIDEr) and Flickr30K (+12.9% CIDEr) without relying on large-scale pre-training. The method demonstrates superior computation, parameter, and label efficiency compared to alternative approaches. Analysis of attention weights confirms the hierarchical decoder's ability to adaptively leverage different views based on their relevance to the generated caption. The study primarily focuses on the trained-from-scratch setting, with potential benefits from large-scale pre-training yet to be explored. Future work could investigate the impact of incorporating more diverse augmented views beyond the ones considered. image captioning, multi-view learning, data augmentation, contrastive learning, hierarchical attention
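The two-level aggregation can be condensed into a few lines: attend over tokens within each encoded view, then attend over the resulting view summaries across views, both conditioned on the decoder state. Single-head dot-product attention and the dimensions below are simplifications of the paper's hierarchical decoder, used only for illustration.

    import torch, torch.nn as nn

    class HierarchicalAggregator(nn.Module):
        def __init__(self, dim=512):
            super().__init__()
            self.tok_q, self.view_q = nn.Linear(dim, dim), nn.Linear(dim, dim)
        def forward(self, views, state):
            # views: list of (T_i, dim) encoded views; state: (dim,) decoder hidden state
            summaries = []
            for v in views:
                w = torch.softmax(v @ self.tok_q(state), dim=0)        # token-level weights
                summaries.append((w.unsqueeze(-1) * v).sum(dim=0))
            S = torch.stack(summaries)                                  # (V, dim) one summary per view
            w = torch.softmax(S @ self.view_q(state), dim=0)            # view-level weights
            return (w.unsqueeze(-1) * S).sum(dim=0)

    views = [torch.randn(n, 512) for n in (49, 36, 20)]                 # e.g. grid, regions, text tokens
    ctx = HierarchicalAggregator()(views, torch.randn(512))             # (512,) context for the next word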
2305.16289 Report Diversify Your Vision Datasets with Automatic Diffusion-Based Augmentation Lisa Dunlap, Alyssa Umino, Han Zhang, Jiezhi Yang, Joseph E. Gonzalez, Trevor Darrell Many fine-grained classification tasks, like rare animal identification, have limited training data and consequently classifiers trained on these datasets often fail to generalize to variations in the domain like changes in weather or location. As such, we explore how natural language descriptions of the domains seen in training data can be used with large vision models trained on diverse pretraining datasets to generate useful variations of the training data. We introduce ALIA (Automated Language-guided Image Augmentation), a method which utilizes large vision and language models to automatically generate natural language descriptions of a dataset's domains and augment the training data via language-guided image editing. To maintain data integrity, a model trained on the original dataset filters out minimal image edits and those which corrupt class-relevant information. The resulting dataset is visually consistent with the original training data and offers significantly enhanced diversity. We show that ALIA is able to surpasses traditional data augmentation and text-to-image generated data on fine-grained classification tasks, including cases of domain generalization and contextual bias. Code is available at https://github.com/lisadunlap/ALIA. This paper introduces ALIA (Automated Language-guided Image Augmentation), a method using vision and language models to automatically generate natural language descriptions of domains within image datasets and leverage them for language-guided image editing to augment the data. ALIA addresses the challenge of limited training data in fine-grained classification tasks, particularly domain shifts, by creating visually consistent, diverse augmentations grounded in the original data. ALIA generates domain descriptions from image captions summarized by an LLM. Then, it uses these descriptions for text-guided image editing, applying filtering techniques to ensure data quality and preserve task-relevant information. ALIA outperforms traditional data augmentation and text-to-image generation methods, even exceeding real data performance on the iWildCam dataset. The study shows that ALIA-generated domain descriptions are more effective than user-provided prompts, highlighting the method's ability to capture key domain-specific features. The choice of image editing method significantly impacts ALIA's performance, with Img2Img being more suitable for certain datasets like iWildCam and InstructPix2Pix for others. ALIA's performance depends on the quality of the captioning model, LLM, and image editing method, which can limit its effectiveness. Determining the optimal amount of augmented data to include in training is an open question for future research. data augmentation, language-guided image editing, fine-grained classification, domain generalization, contextual bias
2305.16233 Report Interactive Segment Anything NeRF with Feature Imitation Xiaokang Chen, Jiaxiang Tang, Diwen Wan, Jingbo Wang, Gang Zeng This paper investigates the potential of enhancing Neural Radiance Fields (NeRF) with semantics to expand their applications. Although NeRF has been proven useful in real-world applications like VR and digital creation, the lack of semantics hinders interaction with objects in complex scenes. We propose to imitate the backbone feature of off-the-shelf perception models to achieve zero-shot semantic segmentation with NeRF. Our framework reformulates the segmentation process by directly rendering semantic features and only applying the decoder from perception models. This eliminates the need for expensive backbones and benefits 3D consistency. Furthermore, we can project the learned semantics onto extracted mesh surfaces for real-time interaction. With the state-of-the-art Segment Anything Model (SAM), our framework accelerates segmentation by 16 times with comparable mask quality. The experimental results demonstrate the efficacy and computational advantages of our approach. Project page: https://me.kiui.moe/san/. Presents a novel feature imitation method to enable real-time interactive 3D segmentation in Neural Radiance Fields (NeRF) by leveraging pre-trained 2D perception models. NeRF lacks explicit semantic information, limiting its interactive applications. This work aims to bridge this gap and enhance NeRF with semantic understanding for real-time user interaction in 3D scenes. Imitates the backbone features of off-the-shelf 2D perception models (e.g., SAM, X-Decoder) to directly render semantic features within the NeRF framework. Employs camera augmentation and caching mechanisms to improve training efficiency and feature imitation quality. Achieves real-time 3D click-based segmentation (24.39 FPS) with SAM, a 16x speedup compared to directly applying SAM on rendered images. Demonstrates comparable segmentation quality to the original 2D models on various challenging scenes. Enables mesh segmentation by projecting 2D masks onto 3D surfaces, facilitating downstream applications like texture editing and model composition. Performance relies on the capabilities of the underlying perception models, which can sometimes lead to imperfect segmentation masks. Future work includes exploring more powerful perception models and extending the method to support more complex 3D interactions beyond segmentation. nerf, interactive segmentation, 3d semantic understanding, feature imitation, real-time
2305.16133 Report OVO: Open-Vocabulary Occupancy Zhiyu Tan, Zichao Dong, Cheng Zhang, Weikun Zhang, Hang Ji, Hao Li Semantic occupancy prediction aims to infer dense geometry and semantics of surroundings for an autonomous agent to operate safely in the 3D environment. Existing occupancy prediction methods are almost entirely trained on human-annotated volumetric data. Although of high quality, the generation of such 3D annotations is laborious and costly, restricting them to a few specific object categories in the training dataset. To address this limitation, this paper proposes Open Vocabulary Occupancy (OVO), a novel approach that allows semantic occupancy prediction of arbitrary classes but without the need for 3D annotations during training. Keys to our approach are (1) knowledge distillation from a pre-trained 2D open-vocabulary segmentation model to the 3D occupancy network, and (2) pixel-voxel filtering for high-quality training data generation. The resulting framework is simple, compact, and compatible with most state-of-the-art semantic occupancy prediction models. On NYUv2 and SemanticKITTI datasets, OVO achieves competitive performance compared to supervised semantic occupancy prediction approaches. Furthermore, we conduct extensive analyses and ablation studies to offer insights into the design of the proposed framework. Our code is publicly available at https://github.com/dzcgaara/OVO. This paper proposes Open Vocabulary Occupancy (OVO), a novel approach for semantic occupancy prediction that allows inference of arbitrary classes without requiring 3D annotations during training. Existing methods for semantic occupancy prediction rely heavily on laborious and costly 3D annotations, limiting their scalability and applicability to a restricted set of object categories. OVO leverages knowledge distillation from a pre-trained 2D open-vocabulary segmentation model to a 3D occupancy network and employs pixel-voxel filtering for high-quality training data generation. OVO achieves competitive performance compared to supervised semantic occupancy prediction approaches on NYUv2 and SemanticKITTI datasets. The effectiveness of the proposed feature alignment and voxel filtering techniques is demonstrated through ablation studies. OVO introduces a minor computational overhead compared to the baseline occupancy network. OVO's reliance on voxel-wise prediction without instance-level optimization can lead to inconsistencies within a single object. Future work will explore voxel grouping techniques to enhance prediction consistency at the instance level. semantic occupancy prediction, open vocabulary learning, knowledge distillation, 3d scene understanding, zero-shot learning
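A minimal sketch of the distillation idea described above: voxel features from the 3D occupancy network are pulled toward the 2D open-vocabulary features of the pixels they project to, over pairs that survive the filtering step. The tensor shapes, the cosine objective, and the boolean mask are illustrative assumptions, not OVO's exact loss.

```python
import torch
import torch.nn.functional as F

def voxel_pixel_distill_loss(voxel_feats, pixel_feats, vox_idx, pix_idx, valid):
    """Illustrative 2D-to-3D distillation loss.

    voxel_feats: (N_vox, D) features from the 3D occupancy network
    pixel_feats: (N_pix, D) features from a frozen 2D open-vocabulary segmenter
    vox_idx, pix_idx: (M,) index tensors forming candidate voxel-pixel pairs
    valid: (M,) boolean mask produced by a pixel-voxel filtering step
    """
    v = F.normalize(voxel_feats[vox_idx[valid]], dim=-1)
    p = F.normalize(pixel_feats[pix_idx[valid]], dim=-1)
    return (1.0 - (v * p).sum(dim=-1)).mean()  # 1 - cosine similarity

# Toy usage with random tensors
loss = voxel_pixel_distill_loss(
    torch.randn(100, 64), torch.randn(200, 64),
    torch.randint(0, 100, (50,)), torch.randint(0, 200, (50,)),
    torch.rand(50) > 0.3)
```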
2305.15779 Report Custom-Edit: Text-Guided Image Editing with Customized Diffusion Models Jooyoung Choi, Yunjey Choi, Yunji Kim, Junho Kim, Sungroh Yoon Text-to-image diffusion models can generate diverse, high-fidelity images based on user-provided text prompts. Recent research has extended these models to support text-guided image editing. While text guidance is an intuitive editing interface for users, it often fails to ensure the precise concept conveyed by users. To address this issue, we propose Custom-Edit, in which we (i) customize a diffusion model with a few reference images and then (ii) perform text-guided editing. Our key discovery is that customizing only language-relevant parameters with augmented prompts improves reference similarity significantly while maintaining source similarity. Moreover, we provide our recipe for each customization and editing process. We compare popular customization methods and validate our findings on two editing methods using various datasets. The paper introduces *Custom-Edit*, a two-step approach for precise text-guided image editing using customized diffusion models. Existing text-to-image models struggle to capture unique user concepts or appearances not encountered during training, making precise editing with textual prompts challenging. 1. **Customization:** Fine-tune language-relevant parameters (cross-attention keys/values, rare token) of a pre-trained diffusion model on a few reference images with augmented prompts. 2. **Editing:** Utilize text-guided editing methods like Prompt-to-Prompt (P2P) or SDEdit on the customized model to edit images based on user prompts. Customizing language-relevant parameters with augmented prompts significantly improves reference similarity while maintaining source similarity. Custom-Edit effectively transfers fine-grained appearance details from references to source images while preserving the overall structure. The paper provides insights into the source-reference trade-off in diffusion-based editing, showing how adjusting strengths in P2P and SDEdit can control this balance. Custom-Edit sometimes struggles with editing complex backgrounds or may modify undesired regions due to limitations in attention map accuracy and text input controllability. Future work could explore using larger text encoders, incorporating grounding inputs, or leveraging models with enhanced controllability to address these limitations. image editing, diffusion models, text-to-image, customization, prompt-to-prompt
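A small sketch of the customization step: freeze the diffusion backbone and unfreeze only the language-relevant weights, i.e. the cross-attention key/value projections. The "attn2"/"to_k"/"to_v" name filter follows the common Stable Diffusion UNet naming convention and is an assumption here; the rare-token embedding tuning is omitted.

```python
import torch

def customization_params(unet: torch.nn.Module):
    """Freeze everything, then unfreeze only the cross-attention key/value
    projections (the language-relevant parameters). The name filter assumes the
    usual Stable Diffusion UNet layout: 'attn2' = cross-attention,
    'to_k'/'to_v' = key/value projections."""
    trainable = []
    for name, param in unet.named_parameters():
        is_cross_kv = "attn2" in name and ("to_k" in name or "to_v" in name)
        param.requires_grad_(is_cross_kv)
        if is_cross_kv:
            trainable.append(param)
    return trainable

# Usage sketch (assumes `unet` is an already-loaded diffusion UNet):
# optimizer = torch.optim.AdamW(customization_params(unet), lr=1e-5)
```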
2305.15712 Report Knowledge Diffusion for Distillation Tao Huang, Yuan Zhang, Mingkai Zheng, Shan You, Fei Wang, Chen Qian, Chang Xu The representation gap between teacher and student is an emerging topic in knowledge distillation (KD). To reduce the gap and improve the performance, current methods often resort to complicated training schemes, loss functions, and feature alignments, which are task-specific and feature-specific. In this paper, we state that the essence of these methods is to discard the noisy information and distill the valuable information in the feature, and propose a novel KD method dubbed DiffKD, to explicitly denoise and match features using diffusion models. Our approach is based on the observation that student features typically contain more noises than teacher features due to the smaller capacity of student model. To address this, we propose to denoise student features using a diffusion model trained by teacher features. This allows us to perform better distillation between the refined clean feature and teacher feature. Additionally, we introduce a light-weight diffusion model with a linear autoencoder to reduce the computation cost and an adaptive noise matching module to improve the denoising performance. Extensive experiments demonstrate that DiffKD is effective across various types of features and achieves state-of-the-art performance consistently on image classification, object detection, and semantic segmentation tasks. Code is available at https://github.com/hunto/DiffKD. Presents DiffKD, a novel knowledge distillation (KD) method that utilizes diffusion models to explicitly denoise student features, thereby reducing the representation gap between teacher and student models. Addresses the challenge of representation gap in KD, particularly when distilling knowledge from stronger, more complex teacher models to smaller student models. Trains a diffusion model on teacher features to learn a denoising process. Employs this model to denoise student features, subsequently used for distillation. Introduces a lightweight diffusion model with a linear autoencoder for efficiency and an adaptive noise matching module for optimal denoising. DiffKD consistently outperforms state-of-the-art KD methods across various benchmarks including image classification, object detection, and semantic segmentation. Demonstrates significant performance gains, particularly when distilling from stronger teacher models, highlighting its effectiveness in bridging the representation gap. Shows the generic applicability of DiffKD across various tasks and feature types, including intermediate features and classification outputs. Current implementation relies on simple convolutional diffusion models and traditional loss functions, exploring more advanced diffusion techniques and loss functions could yield further improvements. Computational cost, although comparable to other feature-based KD methods, is higher than simple logits distillation methods, presenting an area for future optimization. knowledge-distillation, diffusion-models, representation-learning, model-compression, computer-vision
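A compressed sketch of the idea summarized above: a small denoiser is trained to clean noisy teacher features, and the student feature is then passed through the same denoiser before being matched to the teacher. The paper's multi-step diffusion, linear autoencoder, and adaptive noise matching are collapsed here into a single denoising pass with made-up layer sizes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDenoiser(nn.Module):
    """Lightweight stand-in for the feature denoiser (1x1 convolutions only)."""
    def __init__(self, c):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(c, c, 1), nn.ReLU(), nn.Conv2d(c, c, 1))

    def forward(self, x):
        return self.net(x)

def diffkd_step(denoiser, teacher_feat, student_feat, sigma=0.5):
    """One illustrative training step: (1) teach the denoiser to remove Gaussian
    noise added to teacher features; (2) treat the student feature as a noisy
    teacher feature, denoise it, and match it to the (detached) teacher."""
    noisy_teacher = teacher_feat + sigma * torch.randn_like(teacher_feat)
    denoise_loss = F.mse_loss(denoiser(noisy_teacher), teacher_feat)
    refined_student = denoiser(student_feat)
    kd_loss = F.mse_loss(refined_student, teacher_feat.detach())
    return denoise_loss + kd_loss

den = TinyDenoiser(64)
loss = diffkd_step(den, torch.randn(2, 64, 8, 8), torch.randn(2, 64, 8, 8))
```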
2305.15542 Report TOAST: Transfer Learning via Attention Steering Baifeng Shi, Siyu Gai, Trevor Darrell, Xin Wang Transfer learning involves adapting a pre-trained model to novel downstream tasks. However, we observe that current transfer learning methods often fail to focus on task-relevant features. In this work, we explore refocusing model attention for transfer learning. We introduce Top-Down Attention Steering (TOAST), a novel transfer learning algorithm that keeps the pre-trained backbone frozen, selects task-relevant features in the output, and feeds those features back to the model to steer the attention to the task-specific features. By refocusing the attention only, TOAST achieves state-of-the-art results on a number of transfer learning benchmarks, while having a small number of tunable parameters. Compared to fully fine-tuning, LoRA, and prompt tuning, TOAST substantially improves performance across a range of fine-grained visual classification datasets (e.g., 81.1% -> 86.2% on FGVC). TOAST also outperforms the fully fine-tuned Alpaca and Vicuna models on instruction-following language generation. Code is available at https://github.com/bfshi/TOAST. This paper introduces TOAST (Top-Down Attention Steering), a transfer learning algorithm that enhances downstream task performance by refocusing the pre-trained model's attention onto task-relevant features. Existing transfer learning techniques often struggle to concentrate on task-specific features, limiting their effectiveness. TOAST freezes the pre-trained backbone and incorporates a top-down attention module. This module identifies task-relevant features in the output, feeds them back to guide attention during a second feedforward pass, effectively highlighting essential features. TOAST achieves state-of-the-art results on various benchmarks, including FGVC for fine-grained classification and VTAB-1k for broader image understanding. It outperforms methods like fine-tuning, LoRA, and VPT, demonstrating the significance of attention refocusing. TOAST also excels in instruction-following language generation, surpassing fine-tuned Alpaca and Vicuna models by providing more detailed and relevant responses. TOAST incurs higher computational cost due to the second feedforward pass. While adaptable to diverse architectures and tasks, its performance on dense prediction tasks like semantic segmentation lags behind full fine-tuning. transfer learning, top-down attention, attention refocusing, parameter-efficient fine-tuning, computer vision, natural language processing
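A very rough sketch of the two-pass steering pattern described above: the frozen backbone runs once, a small trainable head turns its output into a task-relevant feedback signal, and the backbone runs again on input tokens perturbed by that signal. The additive feedback and mean-pooled selector are simplifications assumed here; TOAST's actual top-down attention path is more elaborate.

```python
import torch
import torch.nn as nn

class ToyTopDownSteering(nn.Module):
    """Two-pass sketch: only the small feedback head is trained; the
    pre-trained backbone stays frozen."""
    def __init__(self, backbone: nn.Module, dim: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad_(False)          # keep the pre-trained weights frozen
        self.feedback = nn.Linear(dim, dim)  # task-relevant feature selector

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        first_pass = self.backbone(tokens)                          # (B, N, D)
        steer = self.feedback(first_pass.mean(dim=1, keepdim=True))  # feedback signal
        return self.backbone(tokens + steer)                         # refocused second pass

backbone = nn.TransformerEncoder(nn.TransformerEncoderLayer(64, 4, batch_first=True), 2)
out = ToyTopDownSteering(backbone, 64)(torch.randn(2, 16, 64))
```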
2305.15399 Report Sin3DM: Learning a Diffusion Model from a Single 3D Textured Shape Rundi Wu, Ruoshi Liu, Carl Vondrick, Changxi Zheng Synthesizing novel 3D models that resemble the input example has long been pursued by graphics artists and machine learning researchers. In this paper, we present Sin3DM, a diffusion model that learns the internal patch distribution from a single 3D textured shape and generates high-quality variations with fine geometry and texture details. Training a diffusion model directly in 3D would induce large memory and computational cost. Therefore, we first compress the input into a lower-dimensional latent space and then train a diffusion model on it. Specifically, we encode the input 3D textured shape into triplane feature maps that represent the signed distance and texture fields of the input. The denoising network of our diffusion model has a limited receptive field to avoid overfitting, and uses triplane-aware 2D convolution blocks to improve the result quality. Aside from randomly generating new samples, our model also facilitates applications such as retargeting, outpainting and local editing. Through extensive qualitative and quantitative evaluation, we show that our method outperforms prior methods in generation quality of 3D shapes. Sin3DM: a diffusion model that learns the internal patch distribution from a single 3D textured shape and generates high-quality variations with fine geometry and texture details. Collecting large, diverse 3D datasets is challenging, limiting the applicability of data-driven 3D generation methods. Sin3DM addresses this by enabling high-quality 3D shape generation from a single example. The input 3D shape is compressed into a triplane feature representation using an autoencoder. A diffusion model with a limited receptive field and triplane-aware convolutions is trained on this latent space to learn local patch distributions. Sin3DM generates high-fidelity 3D shapes with diverse local variations while preserving global structure. Quantitative evaluation shows Sin3DM outperforms prior single-instance 3D generation methods in terms of geometry and texture quality. The method supports controlled generation, including retargeting, outpainting, and patch duplication. The generated variations primarily occur along the three axis directions due to the triplane representation. Exploring the trade-off between generation quality and diversity is limited to adjusting the receptive field size. 3d shape generation, diffusion models, single instance learning, triplane representation, controlled generation
2305.15328 Report Visual Programming for Text-to-Image Generation and Evaluation Jaemin Cho, Abhay Zala, Mohit Bansal As large language models have demonstrated impressive performance in many domains, recent works have adopted language models (LMs) as controllers of visual modules for vision-and-language tasks. While existing work focuses on equipping LMs with visual understanding, we propose two novel interpretable/explainable visual programming frameworks for text-to-image (T2I) generation and evaluation. First, we introduce VPGen, an interpretable step-by-step T2I generation framework that decomposes T2I generation into three steps: object/count generation, layout generation, and image generation. We employ an LM to handle the first two steps (object/count generation and layout generation), by finetuning it on text-layout pairs. Our step-by-step T2I generation framework provides stronger spatial control than end-to-end models, the dominant approach for this task. Furthermore, we leverage the world knowledge of pretrained LMs, overcoming the limitation of previous layout-guided T2I works that can only handle predefined object classes. We demonstrate that our VPGen has improved control in counts/spatial relations/scales of objects than state-of-the-art T2I generation models. Second, we introduce VPEval, an interpretable and explainable evaluation framework for T2I generation based on visual programming. Unlike previous T2I evaluations with a single scoring model that is accurate in some skills but unreliable in others, VPEval produces evaluation programs that invoke a set of visual modules that are experts in different skills, and also provides visual+textual explanations of the evaluation results. Our analysis shows that VPEval provides a more human-correlated evaluation for skill-specific and open-ended prompts than widely used single model-based evaluation. We hope that our work encourages future progress on interpretable/explainable generation and evaluation for T2I models. This paper introduces two novel visual programming frameworks for text-to-image (T2I) generation and evaluation: VPGen and VPEval. Existing T2I generation lacks interpretable spatial control, and current evaluation methods rely on single models, lacking interpretability and struggling to accurately assess all skills. VPGen decomposes T2I generation into interpretable steps (object/count generation, layout generation, image generation) leveraging a fine-tuned large language model (LLM) and layout-to-image models. VPEval employs evaluation programs invoking diverse visual modules specialized for different skills, providing visual and textual explanations. VPGen demonstrates improved adherence to text prompts regarding object counts, spatial relationships, and object scales compared to baseline T2I models. VPEval exhibits stronger alignment with human evaluation than existing single model-based T2I evaluation methods for both skill-specific and open-ended prompts. Analysis reveals that while count, spatial, scale, and text rendering skills pose challenges for T2I models, VPGen excels in the first three due to its strong layout control. The reliance on English-heavy datasets and natural image training data may limit the generalizability of the LLMs and generation/evaluation modules to other languages or image domains. Generating evaluation programs with LLMs can be expensive; however, the authors plan to release pre-generated programs and a locally runnable LM for this purpose. text-to-image generation, visual programming, interpretable ai, explainable ai, image generation evaluation
2305.15316 Report Training on Thin Air: Improve Image Classification with Generated Data Yongchao Zhou, Hshmat Sahak, Jimmy Ba Acquiring high-quality data for training discriminative models is a crucial yet challenging aspect of building effective predictive systems. In this paper, we present Diffusion Inversion, a simple yet effective method that leverages the pre-trained generative model, Stable Diffusion, to generate diverse, high-quality training data for image classification. Our approach captures the original data distribution and ensures data coverage by inverting images to the latent space of Stable Diffusion, and generates diverse novel training images by conditioning the generative model on noisy versions of these vectors. We identify three key components that allow our generated images to successfully supplant the original dataset, leading to a 2-3x enhancement in sample complexity and a 6.5x decrease in sampling time. Moreover, our approach consistently outperforms generic prompt-based steering methods and KNN retrieval baseline across a wide range of datasets. Additionally, we demonstrate the compatibility of our approach with widely-used data augmentation techniques, as well as the reliability of the generated data in supporting various neural architectures and enhancing few-shot learning. This paper presents Diffusion Inversion, a novel method leveraging pre-trained generative models (specifically Stable Diffusion) to produce diverse, high-quality training data for image classification, thereby enhancing sample complexity and reducing sampling time. Acquiring high-quality training data is crucial for effective predictive systems but can be complex, costly, and time-consuming. This method addresses the limitations of traditional data collection and existing synthetic data generation approaches. The two-stage method first maps each training image to the latent space of Stable Diffusion, creating embedding vectors. Subsequently, it generates novel training images by conditioning the model on perturbed versions of these vectors. Diffusion Inversion achieves 2-3x improvement in sample complexity and a 6.5x reduction in sampling time compared to training on original data. The method surpasses generic prompt-based steering methods and KNN retrieval baselines by effectively addressing data distribution shifts and ensuring data coverage. The generated data is compatible with various neural architectures, improves few-shot learning performance, and complements traditional data augmentation techniques. Scaling the method to large datasets like ImageNet is challenging due to storage requirements and sampling efficiency of current diffusion models. Potential for bias in generated data inherited from the generative model necessitates further research on bias mitigation strategies. data augmentation, synthetic data generation, diffusion models, image classification, stable diffusion
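A toy sketch of the two-stage recipe summarized above: first optimize a conditioning vector so a frozen generator reproduces a training image (inversion), then sample new training images by conditioning on noisy copies of that vector. The linear "generator", embedding size, and noise scale are stand-ins; the real method optimizes embeddings in Stable Diffusion's latent space.

```python
import torch

def invert_image(generator, image, emb_dim=768, steps=200, lr=0.1):
    """Stage 1 (inversion): optimize a conditioning vector so that a frozen,
    differentiable generator (emb -> image) reproduces the training image."""
    emb = torch.zeros(1, emb_dim, requires_grad=True)
    opt = torch.optim.Adam([emb], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(generator(emb), image)
        loss.backward()
        opt.step()
    return emb.detach()

def sample_variants(generator, emb, n=4, noise_scale=0.1):
    """Stage 2 (generation): condition on noisy copies of the inverted vector
    to obtain diverse yet on-distribution training images."""
    noisy = emb + noise_scale * torch.randn(n, emb.size(1))
    return generator(noisy)

# Toy usage: a linear 'generator' standing in for the frozen diffusion model
g = torch.nn.Linear(768, 3 * 32 * 32)
generator = lambda e: g(e).view(-1, 3, 32, 32)
target = torch.rand(1, 3, 32, 32)
variants = sample_variants(generator, invert_image(generator, target), n=4)
```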
2305.15194 Report DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models Sungnyun Kim, Junsoo Lee, Kibeom Hong, Daesik Kim, Namhyuk Ahn In this study, we aim to extend the capabilities of diffusion-based text-to-image (T2I) generation models by incorporating diverse modalities beyond textual description, such as sketch, box, color palette, and style embedding, within a single model. We thus design a multimodal T2I diffusion model, coined as DiffBlender, by separating the channels of conditions into three types, i.e., image forms, spatial tokens, and non-spatial tokens. The unique architecture of DiffBlender facilitates adding new input modalities, pioneering a scalable framework for conditional image generation. Notably, we achieve this without altering the parameters of the existing generative model, Stable Diffusion, only with updating partial components. Our study establishes new benchmarks in multimodal generation through quantitative and qualitative comparisons with existing conditional generation methods. We demonstrate that DiffBlender faithfully blends all the provided information and showcase its various applications in the detailed image synthesis. DiffBlender is a novel multimodal text-to-image diffusion model that effectively incorporates diverse conditioning modalities, such as sketch, box, color palette, and style embedding, within a single model. Existing text-to-image generation models struggle to incorporate diverse modalities beyond textual descriptions. This limits the user's ability to provide fine-grained details and control over the generated image. DiffBlender categorizes input modalities into three types: image forms, spatial tokens, and non-spatial tokens. Each type is handled by a specific conditioning module attached to the Stable Diffusion backbone. This modular design allows for the seamless integration and extension of new modalities. DiffBlender achieves state-of-the-art performance in multi-conditional image generation, as evidenced by high scores in quantitative metrics (YOLO, SSIM, Depth) and qualitative comparisons. The model allows for mode-specific guidance, providing fine-grained control over the influence of each modality on the generated image. The modular design of DiffBlender enables easy extension to new modalities with minimal computational cost. The model may struggle to generate coherent images when provided with conflicting conditions. As DiffBlender is built upon Stable Diffusion, it inherits its limitations, such as difficulty in representing intricate details like human hands. text-to-image generation, diffusion models, multimodal conditioning, stable diffusion, mode-specific guidance
2305.15094 Report InNeRF360: Text-Guided 3D-Consistent Object Inpainting on 360-degree Neural Radiance Fields Dongqing Wang, Tong Zhang, Alaa Abboud, Sabine Süsstrunk We propose InNeRF360, an automatic system that accurately removes text-specified objects from 360-degree Neural Radiance Fields (NeRF). The challenge is to effectively remove objects while inpainting perceptually consistent content for the missing regions, which is particularly demanding for existing NeRF models due to their implicit volumetric representation. Moreover, unbounded scenes are more prone to floater artifacts in the inpainted region than frontal-facing scenes, as the change of object appearance and background across views is more sensitive to inaccurate segmentations and inconsistent inpainting. With a trained NeRF and a text description, our method efficiently removes specified objects and inpaints visually consistent content without artifacts. We apply depth-space warping to enforce consistency across multiview text-encoded segmentations, and then refine the inpainted NeRF model using perceptual priors and 3D diffusion-based geometric priors to ensure visual plausibility. Through extensive experiments in segmentation and inpainting on 360-degree and frontal-facing NeRFs, we show that our approach is effective and enhances NeRF's editability. Project page: https://ivrl.github.io/InNeRF360. InNeRF360 is the first system for text-guided object removal and inpainting in 360-degree Neural Radiance Fields (NeRF), enabling object-level editing with perceptually consistent results. Existing methods struggle with 360-degree scenes due to limitations in multi-view consistency for both segmentation and inpainting, especially under object occlusion and geometry deformation across viewpoints. InNeRF360 addresses this by combining accurate segmentation with 3D-aware inpainting. 1. **Multiview Consistent Segmentation:** Leverages Segment Anything Model (SAM) with depth-warped prompts for accurate object masks across views. 2. **Inpainting 360-degree NeRF:** Uses 2D inpainted images to initialize a new NeRF, then refines it with a 3D diffusion-based geometric prior to eliminate floaters and a perceptual prior for consistent texture. Achieves accurate and consistent object segmentation in 360-degree scenes, even for challenging cases like transparent objects. Generates high-quality inpainted NeRFs without floaters, seamlessly blending the modifications into the original scene. Quantitative evaluation shows superior performance over per-frame inpainting and baseline methods in terms of visual consistency and inpainting quality. Performance depends on the accuracy of the initial 2D object detection, which can be limited for ambiguous text instructions. Future work includes exploring more sophisticated text-guided 3D editing and addressing limitations of current vision-language models. neural radiance fields, nerf inpainting, 3d scene editing, text-guided image editing, multiview segmentation
2305.14849 Report DuDGAN: Improving Class-Conditional GANs via Dual-Diffusion Taesun Yeom, Minhyeok Lee Class-conditional image generation using generative adversarial networks (GANs) has been investigated through various techniques; however, it continues to face challenges such as mode collapse, training instability, and low-quality output in cases of datasets with high intra-class variation. Furthermore, most GANs often converge in larger iterations, resulting in poor iteration efficacy in training procedures. While Diffusion-GAN has shown potential in generating realistic samples, it has a critical limitation in generating class-conditional samples. To overcome these limitations, we propose a novel approach for class-conditional image generation using GANs called DuDGAN, which incorporates a dual diffusion-based noise injection process. Our method consists of three unique networks: a discriminator, a generator, and a classifier. During the training process, Gaussian-mixture noises are injected into the two noise-aware networks, the discriminator and the classifier, in distinct ways. This noisy data helps to prevent overfitting by gradually introducing more challenging tasks, leading to improved model performance. As a result, our method outperforms state-of-the-art conditional GAN models for image generation in terms of performance. We evaluated our method using the AFHQ, Food-101, and CIFAR-10 datasets and observed superior results across metrics such as FID, KID, Precision, and Recall score compared with comparison models, highlighting the effectiveness of our approach. DuDGAN, a novel approach for class-conditional image generation using GANs, incorporates a dual diffusion-based noise injection process to improve quality and iteration efficiency. Conditional image generation with GANs often suffers from issues like mode collapse, training instability, and low-quality output, particularly with limited data and high intra-class variation. Existing methods often require extensive training iterations, making them inefficient. DuDGAN utilizes three networks: a generator, a discriminator, and a classifier. Gaussian-mixture noises are injected into the discriminator and classifier during training. The classifier, trained only on real images, provides high-dimensional class information and class logits, aiding the generator in producing diverse and high-fidelity images. DuDGAN outperforms state-of-the-art conditional GAN models in terms of FID and KID on AFHQ and CIFAR-10 datasets, indicating superior generation quality. It achieves faster convergence within a smaller number of iterations compared to other models. The generated images demonstrate high visual quality with fine details, accurate colors, and clear textures. The model's performance on the Food-101 dataset, while improved, suggests a need for further exploration in handling highly diverse datasets. Future work could involve investigating the impact of varying noise schedules and exploring alternative augmentation techniques. generative adversarial networks, image generation, conditional image synthesis, diffusion models, noise injection
2305.14840 Report Predicting Token Impact Towards Efficient Vision Transformer Hong Wang, Su Yang, Xiaoke Huang, Weishan Zhang Token filtering to reduce irrelevant tokens prior to self-attention is a straightforward way to enable efficient vision Transformer. This is the first work to view token filtering from a feature selection perspective, where we weigh the importance of a token according to how much it can change the loss once masked. If the loss changes greatly after masking a token of interest, it means that such a token has a significant impact on the final decision and is thus relevant. Otherwise, the token is less important for the final decision, so it can be filtered out. After applying the token filtering module generalized from the whole training data, the token number fed to the self-attention module can be obviously reduced in the inference phase, leading to much fewer computations in all the subsequent self-attention layers. The token filter can be realized using a very simple network, where we utilize multi-layer perceptron. Except for the uniqueness of performing token filtering only once from the very beginning prior to self-attention, the other core feature making our method different from the other token filters lies in the predictability of token impact from a feature selection point of view. The experiments show that the proposed method provides an efficient way to approach a light weighted model after optimized with a backbone by means of fine tune, which is easy to be deployed in comparison with the existing methods based on training from scratch. This paper proposes DL-ViT, an efficient vision Transformer that predicts token impact from a feature selection perspective to filter irrelevant tokens before self-attention, resulting in a lighter model without significant accuracy loss. Vision Transformers, while powerful, suffer from heavy computational loads, hindering their application in edge computing. Existing token filtering methods are often heuristic-based, lack explainability, and require gradual token reduction throughout the model, making them less efficient. The method involves two phases: (1) It uses a novel metric called 'delta loss' (DL) to measure a token's impact on the classification loss when masked. Tokens with large DL values are labeled as important. This data is used to train an MLP-based binary classifier for token filtering. (2) The trained token filter is applied before the Transformer backbone, and the entire pipeline is fine-tuned end-to-end. DL-ViT achieves state-of-the-art performance in terms of both efficiency and accuracy compared to existing lightweight ViT models. The method leads to a significant reduction (up to 46%) in FLOPs compared to the DeiT backbone while maintaining comparable accuracy. The study demonstrates that incorporating global image features into the token selection module enhances performance. The method relies on a single hyperparameter (ρ) to control the significance of token importance during the labeling process, requiring careful tuning. Future work will explore token relevance at middle layers to further enhance efficiency. vision transformer, token filtering, efficient deep learning, feature selection, delta loss
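A minimal sketch of the delta-loss labeling described above: each token is masked in turn, the change in classification loss is recorded, and the top fraction of tokens (controlled by ρ) is labeled important; these labels would then supervise the MLP-based token filter. Masking-by-zeroing, the shapes, and the toy classifier head are assumptions here.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def delta_loss_labels(vit, tokens, label, rho=0.5):
    """Illustrative 'delta loss' labeling: mask each token (here by zeroing it),
    measure how much the classification loss changes, and mark the top-rho
    fraction of tokens as important. `vit` is any callable mapping a token
    sequence (1, N, D) to class logits."""
    base = F.cross_entropy(vit(tokens), label)
    n = tokens.size(1)
    dl = torch.empty(n)
    for i in range(n):
        masked = tokens.clone()
        masked[:, i] = 0.0
        dl[i] = (F.cross_entropy(vit(masked), label) - base).abs()
    thresh = dl.quantile(1.0 - rho)
    return (dl >= thresh).long()   # 1 = relevant token, 0 = filter it out

# Toy usage with a mean-pooled linear head standing in for the ViT
head = torch.nn.Linear(64, 10)
vit = lambda t: head(t.mean(dim=1))
labels = delta_loss_labels(vit, torch.randn(1, 32, 64), torch.tensor([3]))
```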
2305.14831 Report OD-NeRF: Efficient Training of On-the-Fly Dynamic Neural Radiance Fields Zhiwen Yan, Chen Li, Gim Hee Lee Dynamic neural radiance fields (dynamic NeRFs) have demonstrated impressive results in novel view synthesis on 3D dynamic scenes. However, they often require complete video sequences for training followed by novel view synthesis, which is similar to playing back the recording of a dynamic 3D scene. In contrast, we propose OD-NeRF to efficiently train and render dynamic NeRFs on-the-fly which instead is capable of streaming the dynamic scene. When training on-the-fly, the training frames become available sequentially and the model is trained and rendered frame-by-frame. The key challenge of efficient on-the-fly training is how to utilize the radiance field estimated from the previous frames effectively. To tackle this challenge, we propose: 1) a NeRF model conditioned on the multi-view projected colors to implicitly track correspondence between the current and previous frames, and 2) a transition and update algorithm that leverages the occupancy grid from the last frame to sample efficiently at the current frame. Our algorithm can achieve an interactive speed of 6FPS training and rendering on synthetic dynamic scenes on-the-fly, and a significant speed-up compared to the state-of-the-art on real-world dynamic scenes. This paper introduces OD-NeRF, a new method for efficiently training and rendering dynamic neural radiance fields (NeRFs) on-the-fly, enabling real-time streaming of dynamic 3D scenes. Existing dynamic NeRFs typically require complete video sequences for training, limiting their use in real-time applications like streaming. On-the-fly training allows for the reconstruction and rendering of dynamic scenes as they happen. The authors propose two key techniques: 1) a NeRF model conditioned on multi-view projected colors to implicitly track point correspondence across frames and 2) a transition and update algorithm for the occupancy grid, leveraging information from previous frames for efficient sampling. OD-NeRF achieves an interactive speed of 6 FPS for on-the-fly training and rendering on synthetic dynamic scenes. The method demonstrates significant speed-up compared to state-of-the-art techniques on real-world dynamic scenes. OD-NeRF maintains comparable rendering quality to existing methods while achieving faster on-the-fly training. The implicit correspondence of the projected color-guided NeRF relies on the relative invariance of projected colors, which can be affected by specular surfaces and occlusions. Future work could explore techniques to filter outlier projected colors or explicitly detect occlusions to improve the robustness of the method. neural radiance fields, dynamic scene reconstruction, on-the-fly training, novel view synthesis, 3d streaming
2305.14777 Report Generative Modeling through the Semi-dual Formulation of Unbalanced Optimal Transport Jaemoo Choi, Jaewoong Choi, Myungjoo Kang Optimal Transport (OT) problem investigates a transport map that bridges two distributions while minimizing a given cost function. In this regard, OT between tractable prior distribution and data has been utilized for generative modeling tasks. However, OT-based methods are susceptible to outliers and face optimization challenges during training. In this paper, we propose a novel generative model based on the semi-dual formulation of Unbalanced Optimal Transport (UOT). Unlike OT, UOT relaxes the hard constraint on distribution matching. This approach provides better robustness against outliers, stability during training, and faster convergence. We validate these properties empirically through experiments. Moreover, we study the theoretical upper-bound of divergence between distributions in UOT. Our model outperforms existing OT-based generative models, achieving FID scores of 2.97 on CIFAR-10 and 6.36 on CelebA-HQ-256. The code is available at https://github.com/Jae-Moo/UOTM. This paper proposes UOTM, a novel generative model based on the semi-dual formulation of Unbalanced Optimal Transport (UOT) that relaxes the hard constraint on distribution matching in OT. OT-based generative models, while effective, suffer from sensitivity to outliers and optimization challenges. UOT offers a solution by enabling outlier robustness and stable training. The authors leverage the semi-dual formulation of UOT to derive a new objective function. They then parameterize the potential and transport map using neural networks and optimize them through an adversarial training procedure similar to GANs. UOTM exhibits strong robustness against outliers, outperforming OT-based methods on datasets with injected outliers. Despite the relaxed constraints, UOTM achieves superior target distribution matching compared to OT-based counterparts. UOTM demonstrates faster and more stable convergence, requiring significantly fewer training epochs to reach comparable performance. The hyperparameter tau, controlling the trade-off between cost and marginal matching, requires careful tuning for optimal performance. Further investigation into the theoretical properties of UOTM, particularly regarding the role of the auxiliary variable and regularization, is necessary. generative models, optimal transport, unbalanced optimal transport, outlier robustness, stable training
2305.14742 Report ChatFace: Chat-Guided Real Face Editing via Diffusion Latent Space Manipulation Dongxu Yue, Qin Guo, Munan Ning, Jiaxi Cui, Yuesheng Zhu, Li Yuan Editing real facial images is a crucial task in computer vision with significant demand in various real-world applications. While GAN-based methods have shown potential in manipulating images especially when combined with CLIP, these methods are limited in their ability to reconstruct real images due to challenging GAN inversion capability. Despite the successful image reconstruction achieved by diffusion-based methods, there are still challenges in effectively manipulating fine-grained facial attributes with textual instructions. To address these issues and facilitate convenient manipulation of real facial images, we propose a novel approach that conducts text-driven image editing in the semantic latent space of diffusion model. By aligning the temporal feature of the diffusion model with the semantic condition at generative process, we introduce a stable manipulation strategy, which performs precise zero-shot manipulation effectively. Furthermore, we develop an interactive system named ChatFace, which combines the zero-shot reasoning ability of large language models to perform efficient manipulations in diffusion semantic latent space. This system enables users to perform complex multi-attribute manipulations through dialogue, opening up new possibilities for interactive image editing. Extensive experiments confirmed that our approach outperforms previous methods and enables precise editing of real facial images, making it a promising candidate for real-world applications. Project page: https://dongxuyue.github.io/chatface/ ChatFace, an interactive system for high-quality real facial image editing using text instructions in the semantic latent space of a diffusion model. Existing GAN-based methods struggle with real image reconstruction, while diffusion models face challenges in fine-grained facial attribute manipulation with text. An LLM parses user requests and controls editing attributes in the diffusion model's semantic latent space. A mapping network infers manipulation directions, and a Stable Manipulation Strategy (SMS) ensures precise zero-shot editing. Outperforms SOTA methods in quantitative metrics (directional CLIP similarity, segmentation-consistency, face identity similarity). Human evaluation confirms superior performance in semantic relevance, visual realism, and identity consistency. Enables fine-grained control over various facial attributes, including multi-attribute editing. Limited to the domain of the pre-trained diffusion autoencoder. Generalization to visually diverse datasets requires further investigation. image editing, diffusion models, large language models, semantic manipulation, interactive system
2305.14720 Report BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing Dongxu Li, Junnan Li, Steven C. H. Hoi Subject-driven text-to-image generation models create novel renditions of an input subject based on text prompts. Existing models suffer from lengthy fine-tuning and difficulties preserving the subject fidelity. To overcome these limitations, we introduce BLIP-Diffusion, a new subject-driven image generation model that supports multimodal control which consumes inputs of subject images and text prompts. Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation. We first pre-train the multimodal encoder following BLIP-2 to produce visual representation aligned with the text. Then we design a subject representation learning task which enables a diffusion model to leverage such visual representation and generates new subject renditions. Compared with previous methods such as DreamBooth, our model enables zero-shot subject-driven generation, and efficient fine-tuning for customized subject with up to 20x speedup. We also demonstrate that BLIP-Diffusion can be flexibly combined with existing techniques such as ControlNet and prompt-to-prompt to enable novel subject-driven generation and editing applications. Code and models will be released at https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion. Project page at https://dxli94.github.io/BLIP-Diffusion-website/. This paper introduces BLIP-Diffusion, a novel subject-driven text-to-image generation model that leverages pre-trained generic subject representation for efficient and high-fidelity image synthesis. Existing subject-driven generation models suffer from lengthy fine-tuning processes and difficulties in preserving subject fidelity. BLIP-Diffusion addresses these limitations by introducing pre-trained subject representation, enabling zero-shot generation or efficient fine-tuning with significant speedups. BLIP-Diffusion employs a two-stage pre-training strategy: (1) Multimodal representation learning with BLIP-2 to produce text-aligned visual features. (2) Subject representation learning using a novel prompted context generation task, where the model learns to generate subject renditions based on synthesized images with random backgrounds. BLIP-Diffusion achieves promising zero-shot subject-driven generation results. It enables efficient fine-tuning for customized subjects with up to 20x speedup compared to previous methods like DreamBooth. The model can be seamlessly integrated with existing techniques like ControlNet and prompt-to-prompt for enhanced control and editing capabilities. BLIP-Diffusion can still exhibit failures common to subject-driven generation models, such as inaccurate context synthesis and overfitting to the training set. The model may inherit limitations from the underlying diffusion model, impacting its ability to fully comprehend complex text prompts and compositional relationships. text-to-image generation, subject-driven generation, diffusion models, multimodal learning, blip-2
2305.14677 Report Optimal Linear Subspace Search: Learning to Construct Fast and High-Quality Schedulers for Diffusion Models Zhongjie Duan, Chengyu Wang, Cen Chen, Jun Huang, Weining Qian In recent years, diffusion models have become the most popular and powerful methods in the field of image synthesis, even rivaling human artists in artistic creativity. However, the key issue currently limiting the application of diffusion models is its extremely slow generation process. Although several methods were proposed to speed up the generation process, there still exists a trade-off between efficiency and quality. In this paper, we first provide a detailed theoretical and empirical analysis of the generation process of the diffusion models based on schedulers. We transform the designing problem of schedulers into the determination of several parameters, and further transform the accelerated generation process into an expansion process of the linear subspace. Based on these analyses, we consequently propose a novel method called Optimal Linear Subspace Search (OLSS), which accelerates the generation process by searching for the optimal approximation process of the complete generation process in the linear subspaces spanned by latent variables. OLSS is able to generate high-quality images with a very small number of steps. To demonstrate the effectiveness of our method, we conduct extensive comparative experiments on open-source diffusion models. Experimental results show that with a given number of steps, OLSS can significantly improve the quality of generated images. Using an NVIDIA A100 GPU, we make it possible to generate a high-quality image by Stable Diffusion within only one second without other optimization techniques. This paper proposes OLSS (Optimal Linear Subspace Search), a novel diffusion scheduler that accelerates image generation by searching for the optimal approximation of the complete generation process within linear subspaces spanned by latent variables. Diffusion models, despite their prowess in image synthesis, suffer from slow generation speed. OLSS addresses this limitation by significantly reducing the number of inference steps while maintaining high image quality. The paper analyzes the diffusion model generation process, modeling it as a linear subspace expansion. OLSS replaces iterative formula coefficients with trainable parameters, solved using least squares methods, to control subspace expansion. A path optimization algorithm further enhances performance by tuning sampling steps. OLSS achieves superior image quality compared to state-of-the-art schedulers with the same number of steps. The path optimization algorithm in OLSS further improves performance compared to uniform step selection. OLSS demonstrates effectiveness in both open-domain and close-domain image synthesis tasks. The current path optimization algorithm in OLSS could be further improved for even better efficiency. Exploration of improving generative quality based on modifications in the latent space is a potential future direction. diffusion models, image synthesis, computational efficiency, schedulers, path optimization
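A minimal sketch of the core numerical step suggested above: at each step of the short schedule, the next latent is approximated as a linear combination of latents and model outputs already available, with the mixing coefficients solved by ordinary least squares against a full-length reference trajectory. The random tensors stand in for those latents; how OLSS builds the basis and optimizes the step path is not reproduced here.

```python
import torch

def fit_step_coefficients(basis_latents, target_latent):
    """Fit weights w minimizing ||target - sum_k w_k * basis_k||^2.

    basis_latents: (K, C, H, W) latents / predicted noises available at this step
    target_latent: (C, H, W) latent from the full-length reference trajectory
    """
    k = basis_latents.size(0)
    A = basis_latents.reshape(k, -1).T          # (num_elements, K) design matrix
    b = target_latent.reshape(-1, 1)            # (num_elements, 1) target
    w = torch.linalg.lstsq(A, b).solution       # (K, 1) scheduler coefficients
    return w.squeeze(1)

coeffs = fit_step_coefficients(torch.randn(5, 4, 8, 8), torch.randn(4, 8, 8))
print(coeffs)  # five learned mixing weights for this step
```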
2305.14674 Report T1: Scaling Diffusion Probabilistic Fields to High-Resolution on Unified Visual Modalities Kangfu Mei, Mo Zhou, Vishal M. Patel Diffusion Probabilistic Field (DPF) models the distribution of continuous functions defined over metric spaces. While DPF shows great potential for unifying data generation of various modalities including images, videos, and 3D geometry, it does not scale to a higher data resolution. This can be attributed to the "scaling property", where it is difficult for the model to capture local structures through uniform sampling. To this end, we propose a new model comprising of a view-wise sampling algorithm to focus on local structure learning, and incorporating additional guidance, e.g., text description, to complement the global geometry. The model can be scaled to generate high-resolution data while unifying multiple modalities. Experimental results on data generation in various modalities demonstrate the effectiveness of our model, as well as its potential as a foundation framework for scalable modality-unified visual content generation. This paper proposes T1, a new diffusion-based field model for scalable, modality-unified visual content generation. T1 leverages a novel view-wise sampling algorithm and incorporates text descriptions as inductive biases to preserve both local structure and global geometry of the data. Existing diffusion-based field models struggle to scale to high-resolution data due to limitations in capturing local structures through uniform sampling and lack of global geometry guidance. T1 uses a view-wise sampling algorithm that extracts local, high-resolution coordinate-signal pairs. It also incorporates text descriptions as inductive bias to guide the generation process and preserve global geometry. T1 outperforms previous domain-agnostic methods and achieves competitive results against domain-specific approaches on image, video, and 3D viewpoint generation tasks. T1 is able to generate high-resolution videos under affordable computational resources. Ablation studies validate the contribution of the proposed sampling algorithm and text conditioning. The scaling property is only resolved for spatial dimensions, and generating extremely long videos with complex dynamics remains challenging. The method is only applicable to visual modalities interpretable by views. diffusion models, generative models, field models, text-to-video generation, novel view synthesis
2305.14345 Report NCHO: Unsupervised Learning for Neural 3D Composition of Humans and Objects Taeksoo Kim, Shunsuke Saito, Hanbyul Joo Deep generative models have been recently extended to synthesizing 3D digital humans. However, previous approaches treat clothed humans as a single chunk of geometry without considering the compositionality of clothing and accessories. As a result, individual items cannot be naturally composed into novel identities, leading to limited expressiveness and controllability of generative 3D avatars. While several methods attempt to address this by leveraging synthetic data, the interaction between humans and objects is not authentic due to the domain gap, and manual asset creation is difficult to scale for a wide variety of objects. In this work, we present a novel framework for learning a compositional generative model of humans and objects (backpacks, coats, scarves, and more) from real-world 3D scans. Our compositional model is interaction-aware, meaning the spatial relationship between humans and objects, and the mutual shape change by physical contact is fully incorporated. The key challenge is that, since humans and objects are in contact, their 3D scans are merged into a single piece. To decompose them without manual annotations, we propose to leverage two sets of 3D scans of a single person with and without objects. Our approach learns to decompose objects and naturally compose them back into a generative human model in an unsupervised manner. Despite our simple setup requiring only the capture of a single subject with objects, our experiments demonstrate the strong generalization of our model by enabling the natural composition of objects to diverse identities in various poses and the composition of multiple objects, which is unseen in training data. https://taeksuu.github.io/ncho/ This paper presents NCHO, a novel framework for learning a compositional generative model of humans and objects from real-world 3D scans, enabling separate control over human identity and attached objects like backpacks and coats. Existing 3D human generative models often treat clothing and accessories as entangled geometry, limiting controllability and expressiveness for tasks like virtual try-on or avatar creation. The method leverages paired 3D scans of a source person with and without objects to decompose object geometry. It trains separate human and object modules, combining them with a neural composition module for realistic interactions. NCHO demonstrates superior generation quality and disentanglement compared to baselines, as evidenced by FID scores and user studies. The model generalizes to unseen identities and object instances, enabling diverse and controllable avatar generation. It allows object removal from 3D scans and composition of multiple objects, showcasing capabilities beyond training data. Decomposing thin clothing layers remains challenging due to 3D scan limitations. Future work includes extending the approach to handle RGB images as input. 3d human modeling, generative models, compositional modeling, unsupervised learning, 3d object decomposition
2305.14334 Report Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence Grace Luo, Lisa Dunlap, Dong Huk Park, Aleksander Holynski, Trevor Darrell Diffusion models have been shown to be capable of generating high-quality images, suggesting that they could contain meaningful internal representations. Unfortunately, the feature maps that encode a diffusion model's internal information are spread not only over layers of the network, but also over diffusion timesteps, making it challenging to extract useful descriptors. We propose Diffusion Hyperfeatures, a framework for consolidating multi-scale and multi-timestep feature maps into per-pixel feature descriptors that can be used for downstream tasks. These descriptors can be extracted for both synthetic and real images using the generation and inversion processes. We evaluate the utility of our Diffusion Hyperfeatures on the task of semantic keypoint correspondence: our method achieves superior performance on the SPair-71k real image benchmark. We also demonstrate that our method is flexible and transferable: our feature aggregation network trained on the inversion features of real image pairs can be used on the generation features of synthetic image pairs with unseen objects and compositions. Our code is available at https://diffusion-hyperfeatures.github.io. This paper introduces Diffusion Hyperfeatures, a method to extract per-pixel feature descriptors from diffusion models by consolidating multi-scale and multi-timestep feature maps. Diffusion models have shown potential for internal representations but extracting useful features is challenging due to their spread across layers and timesteps. This work offers a way to leverage these representations for downstream tasks. The method uses an aggregation network to combine intermediate feature maps from the diffusion process, learning mixing weights to identify the most meaningful features for a specific task (e.g., semantic correspondence). Diffusion Hyperfeatures outperform DINOv2 and CATS++ on semantic keypoint correspondence for real images (SPair-71k, CUB). The aggregation network successfully transfers to unseen synthetic images, enabling the creation of datasets with pseudo-ground truth semantic correspondences. Analysis of mixing weights shows that different model variants (SDv1-5 vs. SDv2-1) require different feature map prioritization for optimal performance. The current method is limited by memory constraints when aggregating many timesteps. Future work could explore more efficient architectures or incorporate attention mechanisms to reduce memory footprint. diffusion models, feature representation, semantic correspondence, keypoint matching, feature aggregation
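A minimal sketch of the aggregation idea described above: one learnable mixing weight per (layer, timestep) feature map, each map projected to a shared channel width, resized to a common resolution, and summed into a per-pixel descriptor. The channel sizes and the single 1x1 projection per map are assumptions; the paper's aggregation network also uses bottleneck layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperfeatureAggregator(nn.Module):
    """Learned weighted sum of multi-scale, multi-timestep diffusion features."""
    def __init__(self, channels, out_dim=128):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(c, out_dim, 1) for c in channels])
        self.mix = nn.Parameter(torch.zeros(len(channels)))  # one weight per feature map

    def forward(self, feature_maps, out_hw=(64, 64)):
        w = self.mix.softmax(dim=0)
        out = 0.0
        for weight, proj, fmap in zip(w, self.proj, feature_maps):
            fmap = F.interpolate(proj(fmap), size=out_hw, mode="bilinear",
                                 align_corners=False)
            out = out + weight * fmap
        return out  # (B, out_dim, H, W) per-pixel hyperfeature descriptors

# Two fake feature maps from different layers/timesteps
agg = HyperfeatureAggregator([320, 640])
desc = agg([torch.randn(1, 320, 32, 32), torch.randn(1, 640, 16, 16)])
```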
2305.14330 Report DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation Susung Hong, Junyoung Seo, Heeseong Shin, Sunghwan Hong, Seungryong Kim In the paradigm of AI-generated content (AIGC), there has been increasing attention to transferring knowledge from pre-trained text-to-image (T2I) models to text-to-video (T2V) generation. Despite their effectiveness, these frameworks face challenges in maintaining consistent narratives and handling shifts in scene composition or object placement from a single abstract user prompt. Exploring the ability of large language models (LLMs) to generate time-dependent, frame-by-frame prompts, this paper introduces a new framework, dubbed DirecT2V. DirecT2V leverages instruction-tuned LLMs as directors, enabling the inclusion of time-varying content and facilitating consistent video generation. To maintain temporal consistency and prevent mapping the value to a different object, we equip a diffusion model with a novel value mapping method and dual-softmax filtering, which do not require any additional training. The experimental results validate the effectiveness of our framework in producing visually coherent and storyful videos from abstract user prompts, successfully addressing the challenges of zero-shot video generation. Introduces DirecT2V, a novel framework for zero-shot text-to-video generation using large language models (LLMs) as frame-level directors to enhance narrative consistency and handle time-varying content in videos. Existing zero-shot text-to-video generation methods struggle to maintain narrative consistency and handle complex actions or scene changes over time due to relying on a single user prompt for all frames. Leverages instruction-tuned LLMs (e.g., GPT-4) to generate frame-by-frame descriptions from a single abstract user prompt. Employs novel techniques like rotational value mapping and dual softmax filtering within the text-to-image diffusion model for improved temporal coherence and flexibility. DirecT2V successfully generates videos with consistent narratives and time-varying content, outperforming existing zero-shot methods. Rotational value mapping in DirecT2V enables diverse context integration across frames while maintaining temporal consistency. Dual softmax filtering effectively reduces inaccurate matching during value mapping, leading to more coherent video generation. The performance of DirecT2V is dependent on the capabilities and limitations of the chosen LLM for frame-level directing. DirecT2V relies on pre-trained text-to-image diffusion models, which may inherit limitations in accurate object counting and positioning. text-to-video generation, large language models, zero-shot learning, diffusion models, temporal consistency
2305.14312 Report Text-guided 3D Human Generation from 2D Collections Tsu-Jui Fu, Wenhan Xiong, Yixin Nie, Jingyu Liu, Barlas Oğuz, William Yang Wang 3D human modeling has been widely used for engaging interaction in gaming, film, and animation. The customization of these characters is crucial for creativity and scalability, which highlights the importance of controllability. In this work, we introduce Text-guided 3D Human Generation (T3H), where a model is to generate a 3D human, guided by the fashion description. There are two goals: 1) the 3D human should render articulately, and 2) its outfit is controlled by the given text. To address this T3H task, we propose Compositional Cross-modal Human (CCH). CCH adopts cross-modal attention to fuse compositional human rendering with the extracted fashion semantics. Each human body part perceives relevant textual guidance as its visual patterns. We incorporate the human prior and semantic discrimination to enhance 3D geometry transformation and fine-grained consistency, enabling it to learn from 2D collections for data efficiency. We conduct evaluations on DeepFashion and SHHQ with diverse fashion attributes covering the shape, fabric, and color of upper and lower clothing. Extensive experiments demonstrate that CCH achieves superior results for T3H with high efficiency. This paper introduces Text-guided 3D Human Generation (T3H), aiming to generate controllable 3D human models with customized outfits from fashion descriptions. This work addresses the limitation of previous 3D human modeling approaches that rely on multi-view videos or lack language controllability. It enables efficient and customizable generation of 3D humans for various applications like gaming and animation. The authors propose Compositional Cross-modal Human (CCH), which leverages cross-modal attention to fuse compositional human rendering with extracted fashion semantics. It incorporates the human prior (SMPL) for robust geometry transformation and semantic discrimination for fine-grained consistency with descriptions. CCH achieves superior results for T3H with high efficiency compared to baselines like Latent-NeRF, TEXTure, and CLIP-O. CCH exhibits comprehensive superiority across metrics like FID, Depth, PCK, CLIP-S, and FA, indicating its effectiveness in generating realistic and textually aligned 3D humans. The ablation study shows the importance of textual guidance, cross-modal attention, and semantic discrimination for effective T3H. The reliance on SMPL parameters can cause quality degradation if the estimation is inaccurate. Datasets used for training have limited viewing angles, leading to artifacts in 3D consistency. Future work can explore diverse datasets and improve handling of challenging poses. 3d human generation, text-guided synthesis, cross-modal attention, neural rendering, compositional modeling
2305.14207 Report SAD: Segment Any RGBD Jun Cen, Yizheng Wu, Kewei Wang, Xingyi Li, Jingkang Yang, Yixuan Pei, Lingdong Kong, Ziwei Liu, Qifeng Chen The Segment Anything Model (SAM) has demonstrated its effectiveness in segmenting any part of 2D RGB images. However, SAM exhibits a stronger emphasis on texture information while paying less attention to geometry information when segmenting RGB images. To address this limitation, we propose the Segment Any RGBD (SAD) model, which is specifically designed to extract geometry information directly from images. Inspired by the natural ability of humans to identify objects through the visualization of depth maps, SAD utilizes SAM to segment the rendered depth map, thus providing cues with enhanced geometry information and mitigating the issue of over-segmentation. We further include the open-vocabulary semantic segmentation in our framework, so that the 3D panoptic segmentation is fulfilled. The project is available on https://github.com/Jun-CEN/SegmentAnyRGBD. This paper introduces Segment Any RGBD (SAD), a novel model that leverages Segment Anything Model (SAM) and Open-Vocabulary Semantic Segmentation (OVSeg) to perform semantic segmentation by incorporating geometric information from depth maps. This work addresses the limitations of SAM, which primarily relies on texture information and often leads to over-segmentation, by incorporating depth information to improve segmentation accuracy. SAD renders depth maps to RGB space and uses them as input for SAM, generating initial masks. These masks are then refined using coarse semantic masks from OVSeg. Finally, a clustering process groups adjacent segments of the same class. SAD effectively reduces over-segmentation compared to using RGB images directly with SAM. The incorporation of depth information leads to more accurate segmentation results, particularly in distinguishing objects with similar textures. SAD demonstrates its effectiveness in generating geometrically sound semantic segmentation results on both Sailvos3D and ScanNet datasets. The model may struggle to distinguish between objects in close proximity when they lack distinct geometric features in the depth map. Future work could focus on improving the model's ability to handle challenging scenarios, such as scenes with significant occlusions or varying lighting conditions. semantic segmentation, depth maps, segment anything model (sam), open-vocabulary segmentation, 3d vision
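A rough, hedged sketch of the SAD pipeline described above: colorize a depth map so SAM can segment it, then give each class-agnostic mask the majority label of a coarse open-vocabulary semantic map. The SAM masks and semantic map are stand-in arrays here; the real pipeline calls SAM and OVSeg, and the helper names are invented for the example.

```python
# Sketch: segment a rendered depth map and label the resulting masks by
# majority vote against a coarse semantic map (stand-ins for SAM and OVSeg).
import numpy as np
from matplotlib import cm

def render_depth(depth):
    """Map a metric depth map to an RGB image via a perceptual colormap."""
    d = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)
    return (cm.viridis(d)[..., :3] * 255).astype(np.uint8)   # (H, W, 3)

def label_masks(sam_masks, coarse_semantics):
    """Assign each class-agnostic mask the majority semantic class inside it."""
    labeled = np.zeros_like(coarse_semantics)
    for mask in sam_masks:                       # mask: boolean (H, W)
        classes, counts = np.unique(coarse_semantics[mask], return_counts=True)
        labeled[mask] = classes[counts.argmax()]
    return labeled

# Toy example: a two-plane depth image and a noisy coarse semantic map.
depth = np.ones((64, 64)); depth[:, 32:] = 3.0
rgb_depth = render_depth(depth)                  # this image would be fed to SAM
sam_masks = [depth < 2.0, depth >= 2.0]          # pretend SAM returned these masks
coarse = (depth >= 2.0).astype(int)
coarse[0, 0] = 1                                 # a spurious label to be voted away
print(label_masks(sam_masks, coarse)[0, :5])     # -> [0 0 0 0 0]
```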
2305.14022 Report Realistic Noise Synthesis with Diffusion Models Qi Wu, Mingyan Han, Ting Jiang, Haoqiang Fan, Bing Zeng, Shuaicheng Liu Deep image denoising models often rely on a large amount of training data for high-quality performance. However, it is challenging to obtain a sufficient amount of data under real-world scenarios for supervised training. As such, synthesizing realistic noise becomes an important solution. However, existing techniques have limitations in modeling complex noise distributions, resulting in residual noise and edge artifacts in denoising methods relying on synthetic data. To overcome these challenges, we propose a novel method that synthesizes realistic noise using diffusion models, namely Realistic Noise Synthesize Diffusor (RNSD). In particular, the proposed time-aware controlling module can simulate various environmental conditions under given camera settings. RNSD can incorporate guided multiscale content, such that more realistic noise with spatial correlations can be generated at multiple frequencies. In addition, we construct an inversion mechanism to predict the unknown camera setting, which enables the extension of RNSD to datasets without setting information. Extensive experiments demonstrate that our RNSD method significantly outperforms existing methods not only in the realism of the synthesized noise under multiple metrics, but also in single-image denoising performance. This paper introduces RNSD, a novel diffusion model-based approach for synthesizing realistic noise in images, which significantly outperforms existing methods in terms of realism and improves the performance of denoising models. Collecting real-world noisy/clean image pairs for training denoising models is challenging. Existing synthetic noise generation techniques often fail to capture the complexity of real noise, leading to suboptimal denoising results. RNSD leverages a time-aware camera setting module (CamSampler) to simulate diverse noise distributions based on camera parameters. It also employs a multi-scale content guided UNet (MCG-UNet) to generate spatially correlated noise. Additionally, a camera setting prediction module (CamPredictor) enables noise synthesis on datasets without camera setting information. RNSD achieves state-of-the-art results on noise realism benchmarks, surpassing existing methods in metrics like PGAP and AKLD. Denoising models trained with RNSD's synthetic noise demonstrate significant performance improvements, achieving up to 0.6dB PSNR gain. Ablation studies confirm the efficacy of individual RNSD components, including CamSampler, MCG-UNet, and CamPredictor. The computational cost of diffusion models for noise synthesis is higher than some simpler methods. Further exploration of incorporating more complex camera ISP pipelines could further enhance realism. image denoising, noise synthesis, diffusion models, camera settings, data augmentation
2305.13921 Report Compositional Text-to-Image Synthesis with Attention Map Control of Diffusion Models Ruichen Wang, Zekang Chen, Chen Chen, Jian Ma, Haonan Lu, Xiaodong Lin Recent text-to-image (T2I) diffusion models show outstanding performance in generating high-quality images conditioned on textual prompts. However, they fail to semantically align the generated images with the prompts due to their limited compositional capabilities, leading to attribute leakage, entity leakage, and missing entities. In this paper, we propose a novel attention mask control strategy based on predicted object boxes to address these issues. In particular, we first train a BoxNet to predict a box for each entity that possesses the attribute specified in the prompt. Then, depending on the predicted boxes, a unique mask control is applied to the cross- and self-attention maps. Our approach produces a more semantically accurate synthesis by constraining the attention regions of each token in the prompt to the image. In addition, the proposed method is straightforward and effective and can be readily integrated into existing cross-attention-based T2I generators. We compare our approach to competing methods and demonstrate that it can faithfully convey the semantics of the original text to the generated content and achieve high availability as a ready-to-use plugin. Please refer to https://github.com/OPPOMente-Lab/attention-mask-control. This paper introduces a novel attention mask control strategy for text-to-image synthesis using diffusion models. The method leverages a BoxNet, trained to predict object boxes for entities within text prompts, to guide cross- and self-attention maps during image generation. This constraint ensures semantic accuracy by aligning textual elements with corresponding image regions. Existing text-to-image diffusion models struggle with accurately representing complex textual descriptions involving multiple entities and attributes, often leading to issues like attribute leakage, entity leakage, and missing entities. This work addresses these problems to improve the fidelity and faithfulness of generated images. The proposed approach employs a two-stage process. First, a BoxNet is trained on the COCO dataset to predict object boxes for entities at each timestep of the diffusion process. Second, during image generation, unique masks derived from these predicted boxes control the cross- and self-attention maps, ensuring entities and attributes are rendered within their designated image regions. The method significantly improves the semantic alignment between generated images and text prompts, effectively addressing attribute leakage, entity leakage, and missing entities. Quantitative analysis using metrics like DINO similarity scores and subjective fidelity scores demonstrate the effectiveness of the approach in generating more accurate and faithful images. The proposed strategy is flexible and can be easily integrated into existing diffusion-based image generators as a plugin to enhance their compositional generation capabilities. While the method demonstrates promising results, there is potential for a slight decrease in image quality in some cases. Future work could focus on mitigating potential quality degradation and exploring more sophisticated text parsing techniques to further enhance the model’s understanding of complex prompts. text-to-image synthesis, diffusion models, compositional generation, attention mask control, object detection
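The core mechanism in the entry above, controlling attention with predicted boxes, can be sketched in a few lines: cross-attention logits between image positions and an entity token are suppressed outside that entity's box before the softmax. The tensor layout, box format, and function names are assumptions for illustration; the paper additionally applies analogous control to self-attention maps.

```python
# Sketch: constrain each entity token's cross-attention to its predicted box.
import torch

def boxes_to_mask(boxes, h, w, n_tokens):
    """boxes: {token_index: (x0, y0, x1, y1) in [0, 1]} -> (h*w, n_tokens) bool mask."""
    mask = torch.ones(h, w, n_tokens, dtype=torch.bool)   # unboxed tokens stay unrestricted
    ys = torch.linspace(0, 1, h).view(h, 1)
    xs = torch.linspace(0, 1, w).view(1, w)
    for tok, (x0, y0, x1, y1) in boxes.items():
        inside = (xs >= x0) & (xs <= x1) & (ys >= y0) & (ys <= y1)
        mask[..., tok] = inside
    return mask.view(h * w, n_tokens)

def masked_cross_attention(scores, mask):
    """scores: (heads, h*w, n_tokens) pre-softmax logits; mask broadcasts over heads."""
    scores = scores.masked_fill(~mask, float("-inf"))
    return scores.softmax(dim=-1)

h = w = 16
scores = torch.randn(8, h * w, 4)                    # 8 heads, 4 prompt tokens
# Suppose a box predictor placed the entity of token 2 in the lower-right quadrant.
mask = boxes_to_mask({2: (0.5, 0.5, 1.0, 1.0)}, h, w, n_tokens=4)
attn = masked_cross_attention(scores, mask)
print(attn.shape, attn[0, 0, 2].item())              # top-left pixel gives token 2 zero weight
```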
2305.13840 Report Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models Weifeng Chen, Yatai Ji, Jie Wu, Hefeng Wu, Pan Xie, Jiashi Li, Xin Xia, Xuefeng Xiao, Liang Lin Recent advancements in diffusion models have unlocked unprecedented abilities in visual creation. However, current text-to-video generation models struggle with the trade-off among movement range, action coherence and object consistency. To mitigate this issue, we present a controllable text-to-video (T2V) diffusion model, called Control-A-Video, capable of maintaining consistency while enabling customizable video synthesis. Based on a pre-trained conditional text-to-image (T2I) diffusion model, our model aims to generate videos conditioned on a sequence of control signals, such as edge or depth maps. For the purpose of improving object consistency, Control-A-Video integrates motion priors and content priors into video generation. We propose two motion-adaptive noise initialization strategies, which are based on pixel residual and optical flow, to introduce motion priors from input videos, producing more coherent videos. Moreover, a first-frame conditioned controller is proposed to generate videos from content priors of the first frame, which facilitates the semantic alignment with text and allows longer video generation in an auto-regressive manner. With the proposed architecture and strategies, our model achieves resource-efficient convergence and generates consistent and coherent videos with fine-grained control. Extensive experiments demonstrate its success in various video generative tasks such as video editing and video style transfer, outperforming previous methods in terms of consistency and quality. This paper presents Control-A-Video, a controllable text-to-video (T2V) diffusion model that enhances object consistency and coherence in customizable video synthesis using control signals like edge or depth maps. Current T2V models struggle with maintaining consistency (object appearance across frames) and coherence (smooth action transitions) when generating videos with a large range of motion. Control-A-Video addresses this trade-off. The model integrates motion and content priors. It leverages motion-adaptive noise initialization (pixel residual and optical flow based) and a first-frame conditioned controller (generating videos based on the first frame's content). Control-A-Video generates consistent and coherent videos with fine-grained control from text prompts and control maps (depth, edge). Motion-adaptive noise initialization improves consistency by preserving latent space similarity between frames, reducing flickering. First-frame conditioning enhances text alignment and allows auto-regressive generation of longer videos. The model currently relies on a T2I model, inheriting its limitations. Future work includes exploring the stability and controllability of video generation models. text-to-video generation, diffusion models, controllable video synthesis, motion priors, content priors
2305.13777 Report VisorGPT: Learning Visual Prior via Generative Pre-Training Jinheng Xie, Kai Ye, Yudong Li, Yuexiang Li, Kevin Qinghong Lin, Yefeng Zheng, Linlin Shen, Mike Zheng Shou Various stuff and things in visual data possess specific traits, which can be learned by deep neural networks and are implicitly represented as the visual prior, e.g., object location and shape, in the model. Such prior potentially impacts many vision tasks. For example, in conditional image synthesis, spatial conditions failing to adhere to the prior can result in visually inaccurate synthetic results. This work aims to explicitly learn the visual prior and enable the customization of sampling. Inspired by advances in language modeling, we propose to learn Visual prior via Generative Pre-Training, dubbed VisorGPT. By discretizing visual locations of objects, e.g., bounding boxes, human pose, and instance masks, into sequences, VisorGPT can model visual prior through likelihood maximization. Besides, prompt engineering is investigated to unify various visual locations and enable customized sampling of sequential outputs from the learned prior. Experimental results demonstrate that VisorGPT can effectively model the visual prior, which can be employed for many vision tasks, such as customizing accurate human pose for conditional image synthesis models like ControlNet. Code will be released at https://github.com/Sierkinhane/VisorGPT. This paper presents VisorGPT, a novel approach to explicitly learning the probabilistic prior of visual data, such as object location and shape, using generative pre-training. Explicitly learning visual prior is important for various vision tasks as it captures common sense knowledge about the visual world, leading to more realistic and accurate results in applications like image synthesis. VisorGPT discretizes visual annotations (e.g., bounding boxes, human poses) into sequences and leverages a GPT-style transformer to learn the probabilistic prior by maximizing the likelihood of training sequences. VisorGPT effectively models the visual prior, demonstrated by its ability to generate realistic and customized spatial conditions for image synthesis models like ControlNet and GLIGEN. The learned prior aligns well with real-world data, evidenced by the close similarity in location, shape, and relation priors between generated sequences and real-world datasets. VisorGPT enables customizable sampling of visual data by leveraging prompt engineering, allowing for control over factors like object size, number of instances, and categories. VisorGPT is currently limited to closed-set inference due to the limited number of labeled classes in training data. The maximum token length in the model restricts the number of instances that can be included in each sequence, posing challenges for complex scenes. visual prior, generative pre-training, conditional image synthesis, prompt engineering, language modeling
2305.13738 Report i-Code Studio: A Configurable and Composable Framework for Integrative AI Yuwei Fang, Mahmoud Khademi, Chenguang Zhu, Ziyi Yang, Reid Pryzant, Yichong Xu, Yao Qian, Takuya Yoshioka, Lu Yuan, Michael Zeng, Xuedong Huang Artificial General Intelligence (AGI) requires comprehensive understanding and generation capabilities for a variety of tasks spanning different modalities and functionalities. Integrative AI is one important direction to approach AGI, through combining multiple models to tackle complex multimodal tasks. However, there is a lack of a flexible and composable platform to facilitate efficient and effective model composition and coordination. In this paper, we propose the i-Code Studio, a configurable and composable framework for Integrative AI. The i-Code Studio orchestrates multiple pre-trained models in a finetuning-free fashion to conduct complex multimodal tasks. Instead of simple model composition, the i-Code Studio provides an integrative, flexible, and composable setting for developers to quickly and easily compose cutting-edge services and technologies tailored to their specific requirements. The i-Code Studio achieves impressive results on a variety of zero-shot multimodal tasks, such as video-to-text retrieval, speech-to-speech translation, and visual question answering. We also demonstrate how to quickly build a multimodal agent based on the i-Code Studio that can communicate and personalize for users. The paper proposes i-Code Studio, a configurable and composable framework for Integrative AI that orchestrates multiple pre-trained models to conduct complex multimodal tasks without finetuning. Integrative AI is an important direction towards AGI, and the proposed framework addresses the lack of flexible and composable platforms for efficient model composition and coordination. The i-Code Studio uses a directed acyclic graph (DAG) to configure the flow of input data through pre-trained models and services from Azure Cognitive Services and OpenAI, enabling complex multimodal tasks. i-Code Studio achieves state-of-the-art performance on zero-shot video-to-text retrieval. It significantly outperforms baseline methods on visual question answering, even without access to supporting facts. It demonstrates strong performance on speech-to-speech translation, surpassing previous state-of-the-art methods by a large margin. The framework currently relies on a limited number of pre-trained models and services. Further research is needed to apply the framework to more complex multimodal tasks. integrative ai, multimodal learning, artificial general intelligence, composable framework, large pre-trained models
2305.13655 Report LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models Long Lian, Boyi Li, Adam Yala, Trevor Darrell Recent advancements in text-to-image diffusion models have yielded impressive results in generating realistic and diverse images. However, these models still struggle with complex prompts, such as those that involve numeracy and spatial reasoning. This work proposes to enhance prompt understanding capabilities in diffusion models. Our method leverages a pretrained large language model (LLM) for grounded generation in a novel two-stage process. In the first stage, the LLM generates a scene layout that comprises captioned bounding boxes from a given prompt describing the desired image. In the second stage, a novel controller guides an off-the-shelf diffusion model for layout-grounded image generation. Both stages utilize existing pretrained models without additional model parameter optimization. Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images according to prompts that require various capabilities, doubling the generation accuracy across four tasks on average. Furthermore, our method enables instruction-based multi-round scene specification and can handle prompts in languages not supported by the underlying diffusion model. We anticipate that our method will unleash users' creativity by accurately following more complex prompts. Our code, demo, and benchmark are available at: https://llm-grounded-diffusion.github.io This paper proposes LMD, a training-free, two-stage method to enhance the prompt understanding capabilities of text-to-image diffusion models. Existing diffusion models struggle to accurately follow complex prompts requiring numeracy, spatial reasoning, and attribute binding. LMD uses a pre-trained LLM to generate a scene layout with captioned bounding boxes from a text prompt. Then, a novel controller guides an off-the-shelf diffusion model to generate the image based on this layout. LMD significantly outperforms the base diffusion model and other baselines in accurately generating images from complex prompts, doubling generation accuracy across four tasks. LMD enables instruction-based multi-round scene specification, allowing users to refine the generation through dialogue. LMD supports prompts in languages not supported by the base diffusion model by using English layouts generated by the LLM. Ambiguity in LLM-generated layouts can sometimes lead to inaccuracies in image generation. LMD inherits potential biases from the base diffusion model. text-to-image generation, diffusion models, large language models, layout generation, prompt understanding
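A small sketch of LMD's first stage under assumed formats: prompt an LLM for a scene layout expressed as captioned bounding boxes plus a background prompt, then parse it for the layout-grounded second stage. The prompt template and JSON schema below are illustrative guesses; the released code defines its own in-context examples and box convention.

```python
# Sketch: build a layout-planning prompt for an LLM and parse its response
# into captioned boxes that a layout-grounded diffusion stage could consume.
import json

LAYOUT_TEMPLATE = """You are a layout planner for an image generator.
Given the image description, output JSON with:
  "background": a short background prompt,
  "objects": a list of {{"caption": str, "box": [x, y, w, h]}} on a 512x512 canvas.
Description: {prompt}
JSON:"""

def build_layout_prompt(prompt: str) -> str:
    return LAYOUT_TEMPLATE.format(prompt=prompt)

def parse_layout(llm_response: str):
    layout = json.loads(llm_response)
    boxes = [(o["caption"], tuple(o["box"])) for o in layout["objects"]]
    return layout["background"], boxes

# A response an LLM might plausibly return for "two cats sitting on a sofa".
fake_response = json.dumps({
    "background": "a cozy living room, photorealistic",
    "objects": [
        {"caption": "a gray cat sitting", "box": [60, 220, 160, 180]},
        {"caption": "an orange cat sitting", "box": [290, 230, 160, 170]},
    ],
})
background, boxes = parse_layout(fake_response)
print(background)
for caption, box in boxes:
    print(caption, box)   # these boxes then condition the layout-grounded stage
```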
2305.13625 Report DiffProtect: Generate Adversarial Examples with Diffusion Models for Facial Privacy Protection Jiang Liu, Chun Pong Lau, Rama Chellappa The increasingly pervasive facial recognition (FR) systems raise serious concerns about personal privacy, especially for billions of users who have publicly shared their photos on social media. Several attempts have been made to protect individuals from being identified by unauthorized FR systems utilizing adversarial attacks to generate encrypted face images. However, existing methods suffer from poor visual quality or low attack success rates, which limit their utility. Recently, diffusion models have achieved tremendous success in image generation. In this work, we ask: can diffusion models be used to generate adversarial examples to improve both visual quality and attack performance? We propose DiffProtect, which utilizes a diffusion autoencoder to generate semantically meaningful perturbations on FR systems. Extensive experiments demonstrate that DiffProtect produces more natural-looking encrypted images than state-of-the-art methods while achieving significantly higher attack success rates, e.g., 24.5% and 25.1% absolute improvements on the CelebA-HQ and FFHQ datasets. This paper proposes DiffProtect, a novel diffusion model-based adversarial attack method for facial privacy protection. It generates natural and inconspicuous adversarial examples on face recognition systems by perturbing the semantic code of an input image and using a conditional DDIM decoding process to create a protected image. Existing methods for protecting against unauthorized facial recognition often produce low-quality images or have low attack success rates. This work aims to improve both visual quality and attack performance using diffusion models. DiffProtect uses a pre-trained diffusion autoencoder to encode an input face image into semantic and noise codes. It then optimizes the semantic code to create a protected image that fools the face recognition model while preserving visual quality. The method also includes a face semantics regularization module and an attack acceleration strategy. DiffProtect achieves significantly higher attack success rates (ASR) than previous state-of-the-art methods on CelebA-HQ and FFHQ datasets. DiffProtect generates more natural-looking encrypted images with lower FID scores compared to baselines. The accelerated version, DiffProtect-fast, maintains competitive attack performance while significantly reducing computation time. The attack generation process can be further optimized to improve efficiency. Investigating the effectiveness of DiffProtect on other privacy-protection tasks beyond facial recognition. facial privacy protection, adversarial attack, diffusion models, face recognition, generative models
2305.13620 Report A Dive into SAM Prior in Image Restoration Zeyu Xiao, Jiawang Bai, Zhihe Lu, Zhiwei Xiong The goal of image restoration (IR), a fundamental issue in computer vision, is to restore a high-quality (HQ) image from its degraded low-quality (LQ) observation. Multiple HQ solutions may correspond to an LQ input in this ill-posed problem, creating an ambiguous solution space. This motivates the investigation and incorporation of prior knowledge in order to effectively constrain the solution space and enhance the quality of the restored images. In spite of the pervasive use of hand-crafted and learned priors in IR, limited attention has been paid to the incorporation of knowledge from large-scale foundation models. In this paper, we for the first time leverage the prior knowledge of the state-of-the-art segment anything model (SAM) to boost the performance of existing IR networks in a parameter-efficient tuning manner. In particular, the choice of SAM is based on its robustness to image degradations, such that HQ semantic masks can be extracted from it. In order to leverage semantic priors and enhance restoration quality, we propose a lightweight SAM prior tuning (SPT) unit. This plug-and-play component allows us to effectively integrate semantic priors into existing IR networks, resulting in significant improvements in restoration quality. As the only trainable module in our method, the SPT unit has the potential to improve both efficiency and scalability. We demonstrate the effectiveness of the proposed method in enhancing a variety of methods across multiple tasks, such as image super-resolution and color image denoising. This paper proposes leveraging the semantic prior knowledge from Segment Anything Model (SAM) to enhance the performance of existing Image Restoration (IR) networks in a parameter-efficient tuning manner. IR is an ill-posed problem with an ambiguous solution space. This work explores the use of large-scale foundation models to provide richer priors and improve restoration quality. The method extracts semantic masks from SAM and incorporates them into a lightweight SAM Prior Tuning (SPT) unit. The SPT unit is integrated into existing IR networks, and only its parameters are tuned during training. The proposed method significantly improves the performance of various CNN-based and Transformer-based IR methods. Experiments on image super-resolution and color image denoising demonstrate consistent performance gains over baseline methods. Ablation studies validate the effectiveness of the SPT unit, the efficient tuning scheme, and the impact of SAM mask granularity. The use of SAM masks as semantic priors might introduce unrealistic fine-grained structures. Future work could explore more effective methods for incorporating semantic priors to improve fidelity. image restoration, semantic prior, segment anything model, parameter-efficient tuning, foundation models
2305.13501 Report LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Marcella Cornia, Marco Bertini, Rita Cucchiara The rapidly evolving fields of e-commerce and metaverse continue to seek innovative approaches to enhance the consumer experience. At the same time, recent advancements in the development of diffusion models have enabled generative networks to create remarkably realistic images. In this context, image-based virtual try-on, which consists in generating a novel image of a target model wearing a given in-shop garment, has yet to capitalize on the potential of these powerful generative solutions. This work introduces LaDI-VTON, the first Latent Diffusion textual Inversion-enhanced model for the Virtual Try-ON task. The proposed architecture relies on a latent diffusion model extended with a novel additional autoencoder module that exploits learnable skip connections to enhance the generation process preserving the model's characteristics. To effectively maintain the texture and details of the in-shop garment, we propose a textual inversion component that can map the visual features of the garment to the CLIP token embedding space and thus generate a set of pseudo-word token embeddings capable of conditioning the generation process. Experimental results on Dress Code and VITON-HD datasets demonstrate that our approach outperforms the competitors by a consistent margin, achieving a significant milestone for the task. Source code and trained models are publicly available at: https://github.com/miccunifi/ladi-vton. This paper introduces LaDI-VTON, the first virtual try-on model that utilizes Latent Diffusion Models enhanced with textual inversion for improved garment transfer and detail preservation. This approach leverages the superior image generation capabilities of diffusion models to advance the realism and user experience in virtual try-on applications within e-commerce and the metaverse. LaDI-VTON extends Stable Diffusion with a novel autoencoder module (EMASC) for preserving details and a textual inversion component to accurately represent the input garment's visual features in the CLIP embedding space, conditioning the generation process. LaDI-VTON outperforms state-of-the-art methods on Dress Code and VITON-HD benchmarks, achieving significantly better FID and KID scores. The introduced EMASC modules demonstrably reduce autoencoder compression loss, leading to better reconstruction of high-frequency human body details. The textual inversion component effectively preserves the texture and details of the original in-shop garments during the virtual try-on process. LaDI-VTON, while excelling in realism, may not always perfectly synthesize textual details (logos, words) on garments due to its reliance on Stable Diffusion. Future work could explore non-latent diffusion approaches for enhanced textual detail reproduction, acknowledging potential computational trade-offs. virtual try-on, latent diffusion models, textual inversion, generative architectures, e-commerce
2305.13460 Report 'Tax-free' 3DMM Conditional Face Generation Yiwen Huang, Zhiqiu Yu, Xinjie Yi, Yue Wang, James Tompkin 3DMM conditioned face generation has gained traction due to its well-defined controllability; however, the trade-off is lower sample quality: Previous works such as DiscoFaceGAN and 3D-FM GAN show a significant FID gap compared to the unconditional StyleGAN, suggesting that there is a quality tax to pay for controllability. In this paper, we challenge the assumption that quality and controllability cannot coexist. To pinpoint the previous issues, we mathematically formalize the problem of 3DMM conditioned face generation. Then, we devise simple solutions to the problem under our proposed framework. This results in a new model that effectively removes the quality tax between 3DMM conditioned face GANs and the unconditional StyleGAN. This paper introduces a novel 3DMM-conditioned GAN model for face generation that maintains high image quality comparable to unconditional StyleGAN while offering fine-grained control over facial attributes. Existing 3DMM-conditioned GANs suffer from a 'quality tax', exhibiting reduced image quality compared to unconditional models due to constraints imposed by the 3DMM conditioning. This work aims to remove this quality tax by addressing overconstraint issues. The authors propose a mathematical framework for 3DMM-conditioned face generation, optimizing for both consistency (generated image aligns with the input 3DMM parameters) and disentanglement (modifying one attribute doesn't affect others). They achieve this through a novel 3DMM representation, progressive blending for consistent training, and a structurally disentangled conditioning mechanism. The model generates high-quality images with FID scores close to unconditional StyleGAN2, outperforming existing 3DMM-conditioned GANs. It demonstrates superior disentanglement capabilities compared to baselines, as evidenced by higher Disentanglement Scores (DS). The model enables reference-based generation, allowing for transferring facial attributes like expression, illumination, and pose from real images to generated ones. The model inherits limitations from the pretrained face reconstruction (FR) and differentiable renderer (RDR), such as inaccurate skin tone prediction for darker skin tones. It lacks explicit control over attributes not included in the 3DMM parameter space, like hair and eyeglasses. generative adversarial networks, 3d morphable models, face generation, disentanglement, conditional image synthesis
2305.13311 Report VDT: General-purpose Video Diffusion Transformers via Mask Modeling Haoyu Lu, Guoxing Yang, Nanyi Fei, Yuqi Huo, Zhiwu Lu, Ping Luo, Mingyu Ding This work introduces Video Diffusion Transformer (VDT), which pioneers the use of transformers in diffusion-based video generation. It features transformer blocks with modularized temporal and spatial attention modules to leverage the rich spatial-temporal representation inherited in transformers. We also propose a unified spatial-temporal mask modeling mechanism, seamlessly integrated with the model, to cater to diverse video generation scenarios. VDT offers several appealing benefits. 1) It excels at capturing temporal dependencies to produce temporally consistent video frames and even simulate the physics and dynamics of 3D objects over time. 2) It facilitates flexible conditioning information, e.g., simple concatenation in the token space, effectively unifying different token lengths and modalities. 3) Pairing with our proposed spatial-temporal mask modeling mechanism, it becomes a general-purpose video diffuser for harnessing a range of tasks, including unconditional generation, video prediction, interpolation, animation, and completion, etc. Extensive experiments on these tasks spanning various scenarios, including autonomous driving, natural weather, human action, and physics-based simulation, demonstrate the effectiveness of VDT. Additionally, we present comprehensive studies on how VDT handles conditioning information with the mask modeling mechanism, which we believe will benefit future research and advance the field. Project page: https://VDT-2023.github.io The paper introduces Video Diffusion Transformer (VDT), a novel approach for video generation utilizing transformers in a diffusion-based framework. Existing video generation methods struggle with capturing temporal dependencies for consistent videos, handling diverse conditioning information, and unifying different video generation tasks. VDT leverages transformers' strengths to address these challenges. VDT employs transformer blocks with temporal and spatial attention modules to capture spatiotemporal dependencies. It utilizes a pre-trained VAE tokenizer for efficient processing and incorporates a unified spatial-temporal mask modeling mechanism for versatility. VDT excels in capturing temporal dependencies, generating high-quality, consistent videos, and simulating object dynamics. It flexibly handles conditioning information via token concatenation, unifying tasks like unconditional generation, prediction, interpolation, and animation. VDT demonstrates state-of-the-art performance on various datasets, including UCF101, Cityscapes, and Physion, outperforming previous GAN-based and diffusion-based methods. Limited pretraining due to computational constraints restricts potential. Future work includes pretraining on larger datasets and incorporating other modalities like text. video generation, diffusion models, transformers, spatiotemporal attention, mask modeling
2305.13310 Report Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching Yang Liu, Muzhi Zhu, Hengtao Li, Hao Chen, Xinlong Wang, Chunhua Shen Powered by large-scale pre-training, vision foundation models exhibit significant potential in open-world image understanding. However, unlike large language models that excel at directly tackling various language tasks, vision foundation models require a task-specific model structure followed by fine-tuning on specific tasks. In this work, we present Matcher, a novel perception paradigm that utilizes off-the-shelf vision foundation models to address various perception tasks. Matcher can segment anything by using an in-context example without training. Additionally, we design three effective components within the Matcher framework to collaborate with these foundation models and unleash their full potential in diverse perception tasks. Matcher demonstrates impressive generalization performance across various segmentation tasks, all without training. For example, it achieves 52.7% mIoU on COCO-20i with one example, surpassing the state-of-the-art specialist model by 1.6%. In addition, Matcher achieves 33.0% mIoU on the proposed LVIS-92i for one-shot semantic segmentation, outperforming the state-of-the-art generalist model by 14.4%. Our visualization results further showcase the open-world generality and flexibility of Matcher when applied to images in the wild. Our code can be found at https://github.com/aim-uofa/Matcher. This paper introduces Matcher, a novel training-free perception framework that leverages pre-trained vision foundation models (VFMs) to solve a variety of perception tasks using in-context learning with a single example. Existing VFMs often require task-specific fine-tuning and struggle to generalize across diverse perception tasks. Matcher addresses this limitation by enabling VFMs to perform well on a wide range of tasks without training. Matcher leverages an all-purpose feature extractor (DINOv2) and a class-agnostic segmentation model (SAM). It employs bidirectional matching for accurate semantic correspondence, a robust prompt sampler for generating diverse mask proposals, and instance-level matching for selecting high-quality masks. Matcher achieves state-of-the-art performance on one-shot semantic segmentation, outperforming specialized methods on COCO-20i and demonstrating strong generalization on FSS-1000 and LVIS-92i. Matcher excels in one-shot object part segmentation, significantly surpassing previous methods on PASCAL-Part and PACO-Part benchmarks. Matcher demonstrates competitive results in video object segmentation on DAVIS datasets, showcasing its ability to handle temporal information without training. Matcher's performance on instance segmentation is currently limited by the instance-level matching capabilities of the image encoder. Future work will focus on improving instance-level segmentation and further exploring Matcher's potential for evaluating and advancing VFMs. vision foundation models, few-shot learning, semantic segmentation, object part segmentation, video object segmentation
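The bidirectional matching component mentioned above amounts to mutual nearest-neighbour matching between frozen patch features of the reference and target images. The sketch below stubs out feature extraction with random tensors and shows only the matching rule; the function name and feature shapes are assumptions.

```python
# Sketch: mutual nearest-neighbour (bidirectional) matching between patch
# features of a reference and a target image, as used to derive SAM prompts.
import torch
import torch.nn.functional as F

def mutual_nn_matches(ref_feats, tgt_feats):
    """ref_feats: (N, D), tgt_feats: (M, D) -> list of (ref_idx, tgt_idx) pairs."""
    ref = F.normalize(ref_feats, dim=-1)
    tgt = F.normalize(tgt_feats, dim=-1)
    sim = ref @ tgt.t()                      # (N, M) cosine similarities
    fwd = sim.argmax(dim=1)                  # best target patch for each reference patch
    bwd = sim.argmax(dim=0)                  # best reference patch for each target patch
    return [(i, j.item()) for i, j in enumerate(fwd) if bwd[j].item() == i]

# Toy features standing in for DINOv2 patch embeddings of two images.
torch.manual_seed(0)
ref = torch.randn(196, 768)                  # 14x14 patches
tgt = torch.randn(196, 768)
matches = mutual_nn_matches(ref, tgt)
print(f"{len(matches)} mutual matches, first few: {matches[:3]}")
# Matched target patches would then be turned into point/box prompts for SAM.
```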
2305.13308 Report If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, Zeynep Akata Despite their impressive capabilities, diffusion-based text-to-image (T2I) models can lack faithfulness to the text prompt, where generated images may not contain all the mentioned objects, attributes or relations. To alleviate these issues, recent works proposed post-hoc methods to improve model faithfulness without costly retraining, by modifying how the model utilizes the input prompt. In this work, we take a step back and show that large T2I diffusion models are more faithful than usually assumed, and can generate images faithful to even complex prompts without the need to manipulate the generative process. Based on that, we show how faithfulness can be simply treated as a candidate selection problem instead, and introduce a straightforward pipeline that generates candidate images for a text prompt and picks the best one according to an automatic scoring system that can leverage already existing T2I evaluation metrics. Quantitative comparisons alongside user studies on diverse benchmarks show consistently improved faithfulness over post-hoc enhancement methods, with comparable or lower computational cost. Code is available at https://github.com/ExplainableML/ImageSelect. This paper proposes ImageSelect, a simple but effective pipeline that improves the faithfulness of text-to-image diffusion models by generating multiple candidate images and automatically selecting the most faithful one. Existing text-to-image (T2I) models, while impressive, often struggle to faithfully represent all details of a text prompt in the generated image. Recent methods trying to address this are computationally expensive and often tailored to specific prompt types. ImageSelect generates several candidate images for a given prompt using different random seeds. Then, it leverages existing T2I evaluation metrics like TIFA and ImageReward to automatically select the most faithful image from the candidates. ImageSelect consistently outperforms baseline methods, including model version upgrades (e.g., SD1.4 to SD2.1), in terms of faithfulness on diverse benchmarks. Quantitative analysis shows substantial improvements across various faithfulness categories, such as counting and spatial relations, addressing known T2I model limitations. Extensive human evaluation strongly favors ImageSelect outputs, demonstrating better alignment with human perception of faithfulness. Despite significant improvements, ImageSelect is still limited by the capabilities of the underlying T2I model, particularly for complex challenges like rendering text or handling very long prompts. The reliance on pre-trained scoring models (TIFA, ImageReward) may introduce biases or limitations based on their training data. text-to-image generation, faithfulness, diffusion models, candidate selection, evaluation metrics
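Because the selection pipeline above is model-agnostic, it can be sketched as a small loop: sample several images for the same prompt with different seeds, score each with a faithfulness metric, and keep the best. The generator and scorer below are dummy callables standing in for a T2I model and a metric such as TIFA or ImageReward; their names and signatures are assumptions for the example.

```python
# Sketch: generate-then-select. Plug in any T2I generator and any scorer.
import random

def generate_and_select(prompt, generate, score, n_candidates=8, seed0=0):
    candidates = [generate(prompt, seed=seed0 + i) for i in range(n_candidates)]
    scores = [score(prompt, img) for img in candidates]
    best = max(range(n_candidates), key=scores.__getitem__)
    return candidates[best], scores[best]

# Dummy stand-ins so the snippet runs without model weights.
def dummy_generate(prompt, seed):
    return f"<image of '{prompt}' from seed {seed}>"

def dummy_score(prompt, image):
    random.seed(hash(image) % (2**32))
    return random.random()            # a real scorer would return a TIFA/ImageReward value

img, s = generate_and_select("four red apples on a wooden table",
                             dummy_generate, dummy_score)
print(img, round(s, 3))
```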
2305.13307 Report NeRFuser: Large-Scale Scene Representation by NeRF Fusion Jiading Fang, Shengjie Lin, Igor Vasiljevic, Vitor Guizilini, Rares Ambrus, Adrien Gaidon, Gregory Shakhnarovich, Matthew R. Walter A practical benefit of implicit visual representations like Neural Radiance Fields (NeRFs) is their memory efficiency: large scenes can be efficiently stored and shared as small neural nets instead of collections of images. However, operating on these implicit visual data structures requires extending classical image-based vision techniques (e.g., registration, blending) from image sets to neural fields. Towards this goal, we propose NeRFuser, a novel architecture for NeRF registration and blending that assumes only access to pre-generated NeRFs, and not the potentially large sets of images used to generate them. We propose registration from re-rendering, a technique to infer the transformation between NeRFs based on images synthesized from individual NeRFs. For blending, we propose sample-based inverse distance weighting to blend visual information at the ray-sample level. We evaluate NeRFuser on public benchmarks and a self-collected object-centric indoor dataset, showing the robustness of our method, including to views that are challenging to render from the individual source NeRFs. This paper presents NeRFuser, a novel architecture for registering and blending pre-trained neural radiance fields (NeRFs) without access to the original training images. NeRFs offer a memory-efficient way to represent 3D scenes. This work introduces tools to operate directly on NeRFs as data, expanding their utility in 3D vision applications. The method involves two steps: 1) **Registration from re-rendering**: inferring the relative transformation between NeRFs by applying structure-from-motion to images rendered from novel viewpoints. 2) **Sample-based inverse distance weighting**: blending visual information at the ray-sample level during volumetric rendering. NeRFuser accurately registers NeRFs, achieving low rotation, translation, and scale errors. The proposed sample-based blending method produces higher quality novel view synthesis than existing image- or pixel-based blending techniques. The method is robust to errors in pose estimation during registration and exhibits superior performance compared to point-cloud registration baselines. The method inherits limitations of the input NeRFs, such as potential artifacts or inaccuracies in scene representation. Future work includes exploring the integration of structured priors for improved robustness and handling dynamic scenes. neural radiance fields, nerf, 3d scene representation, registration, blending
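The sample-based inverse distance weighting used for blending can be sketched directly: at each sample point along a ray, every registered NeRF's prediction is weighted by the inverse of the point's distance to that NeRF's centre, raised to a power gamma, and the weights are normalized across NeRFs. The NeRF queries below are faked with constant fields; the gamma value and the choice of centre are assumptions, not the paper's exact settings.

```python
# Sketch: inverse-distance weighting of per-sample predictions from several NeRFs.
import numpy as np

def idw_blend(sample_xyz, nerf_outputs, nerf_centers, gamma=4.0, eps=1e-6):
    """sample_xyz: (S, 3); nerf_outputs: list of (S, C); nerf_centers: list of (3,)."""
    dists = np.stack([np.linalg.norm(sample_xyz - c, axis=-1) for c in nerf_centers])
    weights = 1.0 / (dists + eps) ** gamma                  # (num_nerfs, S)
    weights = weights / weights.sum(axis=0, keepdims=True)  # normalize per sample
    blended = sum(w[:, None] * out for w, out in zip(weights, nerf_outputs))
    return blended                                          # (S, C)

# Toy example: 5 samples along a ray, two "NeRFs" returning RGB + density.
samples = np.linspace([-1, 0, 0], [1, 0, 0], 5)             # (5, 3)
nerf_a = np.tile([1.0, 0.0, 0.0, 5.0], (5, 1))              # constant red-ish field
nerf_b = np.tile([0.0, 0.0, 1.0, 5.0], (5, 1))              # constant blue-ish field
out = idw_blend(samples, [nerf_a, nerf_b],
                nerf_centers=[np.array([-1, 0, 0]), np.array([1, 0, 0])])
print(out[:, 0].round(2))   # red channel fades as samples move toward NeRF B
```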
2305.13292 Report VideoLLM: Modeling Video Sequence with Large Language Models Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei Huang, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, Limin Wang With the exponential growth of video data, there is an urgent need for automated technology to analyze and comprehend video content. However, existing video understanding models are often task-specific and lack a comprehensive capability of handling diverse tasks. The success of large language models (LLMs) like GPT has demonstrated their impressive abilities in sequence causal reasoning. Building upon this insight, we propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs from natural language processing (NLP) for video sequence understanding. VideoLLM incorporates a carefully designed Modality Encoder and Semantic Translator, which convert inputs from various modalities into a unified token sequence. This token sequence is then fed into a decoder-only LLM. Subsequently, with the aid of a simple task head, our VideoLLM yields an effective unified framework for different kinds of video understanding tasks. To evaluate the efficacy of VideoLLM, we conduct extensive experiments using multiple LLMs and fine-tuning methods. We evaluate our VideoLLM on eight tasks sourced from four different datasets. The experimental results demonstrate that the understanding and reasoning capabilities of LLMs can be effectively transferred to video understanding tasks. We release the code at https://github.com/cg1177/VideoLLM. This paper proposes VideoLLM, a novel framework leveraging pre-trained LLMs for diverse video understanding tasks by converting video data into token sequences. Existing video understanding models are often task-specific and struggle with the increasing volume and complexity of video data. LLMs offer strong sequence reasoning abilities learned from large-scale text data. VideoLLM utilizes a Modality Encoder to process visual and textual data, a Semantic Translator to align visual and textual semantics, and a decoder-only LLM as a generalist video sequence reasoner. Different LLMs exhibit varying strengths on different video understanding tasks, with GPT-2 generally performing well. The framework demonstrates strong performance on 8 diverse video understanding tasks across 4 datasets, achieving state-of-the-art or comparable results with fewer trainable parameters. Scaling up LLM size generally improves performance up to a certain point, suggesting potential for further improvement with larger models and improved semantic translation. Performance of very large LLMs starts to decline, potentially due to overfitting in the semantic translator. Future work will explore incorporating spatiotemporal information about video frames for more comprehensive understanding. video understanding, large language models, multimodal learning, sequence reasoning, parameter-efficient fine-tuning
2305.13173 Report Semantic-Promoted Debiasing and Background Disambiguation for Zero-Shot Instance Segmentation Shuting He, Henghui Ding, Wei Jiang Zero-shot instance segmentation aims to detect and precisely segment objects of unseen categories without any training samples. Since the model is trained on seen categories, there is a strong bias that the model tends to classify all the objects into seen categories. Besides, there is a natural confusion between background and novel objects that have never shown up in training. These two challenges make it hard for novel objects to be recovered in the final instance segmentation results. It is desirable to rescue novel objects from the background and the dominant seen categories. To this end, we propose D2Zero with Semantic-Promoted Debiasing and Background Disambiguation to enhance the performance of zero-shot instance segmentation. Semantic-promoted debiasing utilizes inter-class semantic relationships to involve unseen categories in visual feature training and learns an input-conditional classifier to conduct dynamical classification based on the input image. Background disambiguation produces image-adaptive background representation to avoid mistaking novel objects for background. Extensive experiments show that we significantly outperform previous state-of-the-art methods by a large margin, e.g., 16.86% improvement on COCO. Project page: https://henghuiding.github.io/D2Zero/ This paper proposes D2Zero, a novel zero-shot instance segmentation approach, addressing the bias and background ambiguity challenges by employing semantic-promoted debiasing and image-adaptive background disambiguation. Existing instance segmentation models struggle to generalize to novel object categories unseen during training, hindering their applicability to real-world scenarios. D2Zero addresses this limitation by enabling the model to segment objects of unseen categories. D2Zero leverages semantic information (e.g., CLIP embeddings) to guide visual feature learning with an unseen-constrained training objective and employs an input-conditional classifier for dynamic classification. It further utilizes an image-adaptive background representation to enhance the distinction between novel objects and background. D2Zero achieves state-of-the-art performance on zero-shot instance segmentation benchmarks, significantly outperforming prior arts like ZSI. The proposed input-conditional classifier effectively mitigates bias towards seen categories and reduces the domain gap between visual and semantic features. The image-adaptive background prototype significantly improves the model's ability to distinguish novel objects from the background. The current approach focuses on single-modal visual features and could benefit from incorporating multi-modal features for richer representation. Future work could explore joint optimization of both the instance segmentation and background disambiguation tasks within a unified framework. zero-shot learning, instance segmentation, debiasing, background disambiguation, computer vision
2305.13093 Report Restore Anything Pipeline: Segment Anything Meets Image Restoration Jiaxi Jiang, Christian Holz Recent image restoration methods have produced significant advancements using deep learning. However, existing methods tend to treat the whole image as a single entity, failing to account for the distinct objects in the image that exhibit individual texture properties. Existing methods also typically generate a single result, which may not suit the preferences of different users. In this paper, we introduce the Restore Anything Pipeline (RAP), a novel interactive and per-object level image restoration approach that incorporates a controllable model to generate different results that users may choose from. RAP incorporates image segmentation through the recent Segment Anything Model (SAM) into a controllable image restoration model to create a user-friendly pipeline for several image restoration tasks. We demonstrate the versatility of RAP by applying it to three common image restoration tasks: image deblurring, image denoising, and JPEG artifact removal. Our experiments show that RAP produces superior visual results compared to state-of-the-art methods. RAP represents a promising direction for image restoration, providing users with greater control, and enabling image restoration at an object level. Introduces Restore Anything Pipeline (RAP), an interactive and object-level image restoration approach using Segment Anything Model (SAM) for segmentation and a controllable model for user-specific results. Addresses limitations of existing methods that treat images as single entities and produce single, potentially suboptimal results, failing to account for diverse object textures and user preferences. Integrates SAM for object segmentation, a flexible blind restoration framework (adapted FBCNN) for automatic/controlled restoration based on predicted/user-adjusted degradation parameters, and optional object-level enhancement. RAP produces superior visual results compared to state-of-the-art methods in deblurring, denoising, and JPEG artifact removal. Object-level processing allows different restoration levels for objects with varying textures and degradation. Users can control the restoration level by adjusting predicted degradation parameters (e.g., noise level, blur kernel, JPEG quality factor). Limited to specific degradation types considered during training. Relies on the accuracy of SAM for segmentation, which can be affected by image quality. image restoration, interactive image editing, segment anything model, object-level processing, controllable image restoration
2305.13077 Report ControlVideo: Training-free Controllable Text-to-Video Generation Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, Qi Tian Text-driven diffusion models have unlocked unprecedented abilities in image generation, whereas their video counterpart still lags behind due to the excessive training cost of temporal modeling. Besides the training burden, the generated videos also suffer from appearance inconsistency and structural flickers, especially in long video synthesis. To address these challenges, we design a training-free framework called ControlVideo to enable natural and efficient text-to-video generation. ControlVideo, adapted from ControlNet, leverages coarse structural consistency from input motion sequences, and introduces three modules to improve video generation. Firstly, to ensure appearance coherence between frames, ControlVideo adds fully cross-frame interaction in self-attention modules. Secondly, to mitigate the flicker effect, it introduces an interleaved-frame smoother that employs frame interpolation on alternated frames. Finally, to produce long videos efficiently, it utilizes a hierarchical sampler that separately synthesizes each short clip with holistic coherency. Empowered with these modules, ControlVideo outperforms the state of the art on extensive motion-prompt pairs quantitatively and qualitatively. Notably, thanks to the efficient designs, it generates both short and long videos within several minutes using one NVIDIA 2080Ti. Code is available at https://github.com/YBYBZhang/ControlVideo. This paper proposes ControlVideo, a training-free framework for controllable text-to-video generation, enabling high-quality video synthesis with temporal consistency. Training text-to-video diffusion models is computationally expensive and resource-intensive. ControlVideo leverages pre-trained text-to-image models and motion sequences, offering a more efficient alternative. ControlVideo adapts ControlNet with three key modules: 1) Fully cross-frame interaction for appearance coherence, 2) Interleaved-frame smoother for reducing structural flickers, 3) Hierarchical sampler for efficient long-video generation. Outperforms state-of-the-art methods in qualitative and quantitative comparisons on motion-prompt pairs. Demonstrates superior appearance consistency and video quality, effectively mitigating flickers and artifacts. Enables efficient long-video generation (100+ frames) within minutes on a single NVIDIA 2080Ti. ControlVideo struggles to generate videos beyond the provided motion sequences. Future work will focus on adapting motion sequences based on text prompts for more diverse video generation. text-to-video generation, diffusion models, controlnet, temporal consistency, hierarchical sampling
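The first of the three modules, fully cross-frame interaction, essentially lets every frame's tokens attend to the tokens of all frames in the clip. Below is a minimal, single-head sketch of that reshaping idea in PyTorch; the projection matrices and shapes are placeholders, not the ControlNet/Stable Diffusion attention layers themselves.

```python
# Minimal sketch of "fully cross-frame" self-attention: tokens of every frame
# attend to the tokens of all frames, instead of only their own frame.
import torch
import torch.nn.functional as F

def cross_frame_attention(x, w_q, w_k, w_v):
    """x: (B, Fr, N, C) latent tokens for B videos of Fr frames with N tokens each."""
    B, Fr, N, C = x.shape
    q = (x @ w_q).reshape(B, Fr * N, C)          # queries from every frame
    kv_src = x.reshape(B, Fr * N, C)             # keys/values pooled across all frames
    k = kv_src @ w_k
    v = kv_src @ w_v
    out = F.scaled_dot_product_attention(q.unsqueeze(1), k.unsqueeze(1), v.unsqueeze(1))
    return out.squeeze(1).reshape(B, Fr, N, C)

# toy usage
B, Fr, N, C = 1, 4, 16, 64
x = torch.randn(B, Fr, N, C)
w = [torch.randn(C, C) / C ** 0.5 for _ in range(3)]
y = cross_frame_attention(x, *w)
print(y.shape)  # torch.Size([1, 4, 16, 64])
```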
2305.13035 Report Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design Ibrahim Alabdulmohsin, Xiaohua Zhai, Alexander Kolesnikov, Lucas Beyer Scaling laws have been recently employed to derive compute-optimal model size (number of parameters) for a given compute duration. We advance and refine such methods to infer compute-optimal model shapes, such as width and depth, and successfully implement this in vision transformers. Our shape-optimized vision transformer, SoViT, achieves results competitive with models that exceed twice its size, despite being pre-trained with an equivalent amount of compute. For example, SoViT-400m/14 achieves 90.3% fine-tuning accuracy on ILSVRC2012, surpassing the much larger ViT-g/14 and approaching ViT-G/14 under identical settings, while also requiring less than half the inference cost. We conduct a thorough evaluation across multiple tasks, such as image classification, captioning, VQA and zero-shot transfer, demonstrating the effectiveness of our model across a broad range of domains and identifying limitations. Overall, our findings challenge the prevailing approach of blindly scaling up vision models and pave a path for a more informed scaling. This paper presents SoViT, a shape-optimized vision transformer that achieves comparable performance to much larger models when pre-trained with the same amount of compute. Scaling model size alone is not the most compute-efficient approach. Optimizing model shape (e.g., width, depth) for a given compute budget can lead to smaller, faster, and equally performant models. The authors introduce a novel scaling strategy that leverages a joint functional form for model size and compute to estimate optimal scaling exponents for different shape dimensions. They use a "star sweep" and a "grid sweep" to efficiently explore the design space and identify compute-optimal model shapes. MLP dimension should be scaled faster than depth, and depth faster than width in vision transformers. Compute-optimal ViTs are smaller than previously used, with parameter count scaling more slowly than the allocated compute. SoViT-400m/14, a model optimized for the compute-equivalent of ViT-g/14, achieves 90.3% fine-tuning accuracy on ILSVRC2012, matching ViT-g/14's performance while being significantly smaller. The proposed optimal model shape might not be ideal for all vision tasks, as indicated by the panoptic segmentation results. The study primarily focuses on optimizing three shape dimensions (width, depth, MLP size) and fixing the patch size. Further investigation is needed to explore the impact of including patch size in the optimization process. vision transformers, scaling laws, model shape optimization, compute efficiency, image classification
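The headline observation that compute-optimal parameter count grows more slowly than compute can be pictured with a toy power-law fit. The sketch below fits N_opt ~ a * C^b by log-log least squares on made-up numbers; it illustrates the kind of relationship the paper estimates, not its joint functional form over width, depth, and MLP size.

```python
# Illustrative only: fit a simple power law N_opt ~ a * C^b between compute C
# and compute-optimal model size N via log-log least squares. Numbers are
# synthetic; the paper's joint functional form over shape dimensions is richer.
import numpy as np

compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])     # FLOPs (made-up)
opt_size = np.array([5e7, 9e7, 1.8e8, 3.1e8, 6.0e8])   # params (made-up)

X = np.stack([np.ones_like(compute), np.log(compute)], axis=1)
coef, *_ = np.linalg.lstsq(X, np.log(opt_size), rcond=None)
log_a, b = coef
print(f"N_opt ~ {np.exp(log_a):.3g} * C^{b:.3f}")       # exponent b < 1: size grows
                                                         # more slowly than compute
```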
2305.12998 Report MFT: Long-Term Tracking of Every Pixel Michal Neoral, Jonáš Šerých, Jiří Matas We propose MFT -- Multi-Flow dense Tracker -- a novel method for dense, pixel-level, long-term tracking. The approach exploits optical flows estimated not only between consecutive frames, but also for pairs of frames at logarithmically spaced intervals. It selects the most reliable sequence of flows on the basis of estimates of its geometric accuracy and the probability of occlusion, both provided by a pre-trained CNN. We show that MFT achieves competitive performance on the TAP-Vid benchmark, outperforming baselines by a significant margin, and tracking densely orders of magnitude faster than the state-of-the-art point-tracking methods. The method is insensitive to medium-length occlusions and it is robustified by estimating flow with respect to the reference frame, which reduces drift. Presents MFT, a novel method for dense, pixel-level, long-term tracking in videos, by combining optical flows from multiple time intervals and leveraging occlusion and uncertainty estimations. Addresses the limitations of existing dense tracking approaches like error accumulation and occlusion sensitivity, aiming for robust and accurate long-term tracking essential for applications like video editing and augmented reality. Utilizes pre-computed optical flows at logarithmically spaced time intervals, and employs two small CNNs to estimate occlusion and uncertainty maps from optical flow cost volumes. A selection mechanism then identifies the most reliable flow chain for each pixel based on these maps. Achieves competitive performance on the TAP-Vid benchmark, outperforming most baselines and demonstrating a good balance between speed and accuracy for dense tracking. Significantly outperforms other methods in terms of speed when tracking densely, achieving 2.4 FPS compared to 0.04 FPS for state-of-the-art point trackers, and reaching over 100 FPS with pre-computed flows. Shows robustness against moderate occlusions and adapts to appearance changes by dynamically switching between different flow intervals. Exhibits occasional spurious re-detections when out-of-view template regions are incorrectly matched to visually similar areas in the current frame. Future work could explore techniques to mitigate these spurious re-detections, potentially by incorporating temporal consistency constraints or object-level reasoning. dense tracking, long-term tracking, optical flow, occlusion handling, uncertainty estimation
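The core selection step of MFT picks, for every pixel, the most reliable of several candidate flow chains (each chaining flows over a different logarithmically spaced time gap) using the predicted occlusion and uncertainty maps. A minimal NumPy sketch of that per-pixel selection, on synthetic inputs, is given below; the scoring rule is a simplification of the paper's criterion.

```python
# Per-pixel selection among K candidate flows to the reference frame, scored by
# occlusion probability and uncertainty (lower is better). Inputs are synthetic.
import numpy as np

def select_best_flow(flows, occlusion, uncertainty, occl_thresh=0.5):
    """flows: (K, H, W, 2) candidate flows to the reference frame
    occlusion, uncertainty: (K, H, W) per-candidate score maps."""
    score = uncertainty + 1e6 * (occlusion > occl_thresh)   # disqualify occluded chains
    best = np.argmin(score, axis=0)                          # (H, W) chosen chain index
    h, w = np.indices(best.shape)
    return flows[best, h, w], best

K, H, W = 4, 8, 8
flows = np.random.randn(K, H, W, 2)
occl = np.random.rand(K, H, W)
unc = np.random.rand(K, H, W)
chosen_flow, chosen_idx = select_best_flow(flows, occl, unc)
print(chosen_flow.shape, chosen_idx.shape)  # (8, 8, 2) (8, 8)
```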
2305.12972 Report VanillaNet: the Power of Minimalism in Deep Learning Hanting Chen, Yunhe Wang, Jianyuan Guo, Dacheng Tao At the heart of foundation models is the philosophy of "more is different", exemplified by the astonishing success in computer vision and natural language processing. However, the challenges of optimization and inherent complexity of transformer models call for a paradigm shift towards simplicity. In this study, we introduce VanillaNet, a neural network architecture that embraces elegance in design. By avoiding high depth, shortcuts, and intricate operations like self-attention, VanillaNet is refreshingly concise yet remarkably powerful. Each layer is carefully crafted to be compact and straightforward, with nonlinear activation functions pruned after training to restore the original architecture. VanillaNet overcomes the challenges of inherent complexity, making it ideal for resource-constrained environments. Its easy-to-understand and highly simplified architecture opens new possibilities for efficient deployment. Extensive experimentation demonstrates that VanillaNet delivers performance on par with renowned deep neural networks and vision transformers, showcasing the power of minimalism in deep learning. This visionary journey of VanillaNet has significant potential to redefine the landscape and challenge the status quo of foundation model, setting a new path for elegant and effective model design. Pre-trained models and codes are available at https://github.com/huawei-noah/VanillaNet and https://gitee.com/mindspore/models/tree/master/research/cv/vanillanet. This paper introduces VanillaNet, a simple neural network architecture for computer vision that avoids complex components like shortcuts, excessive depth, and self-attention, while still achieving competitive performance. Existing deep learning models, while powerful, are becoming increasingly complex, posing challenges for deployment, especially in resource-constrained environments. VanillaNet addresses this by offering a simpler alternative without sacrificing performance. VanillaNet employs a streamlined architecture built upon convolutional layers and introduces a "deep training" strategy for improved performance. This involves starting with additional non-linear activation functions and progressively pruning them during training to maintain inference speed. It also incorporates a novel series-based activation function for enhanced non-linearity. VanillaNet achieves image classification accuracy comparable to well-known deep networks and vision transformers, even surpassing them in inference speed on GPUs. Ablation studies demonstrate the effectiveness of the proposed deep training strategy and series activation function in boosting the performance of simple architectures. Visualization of attention maps provides insights into VanillaNet's learning process, suggesting its strength in thoroughly extracting information from images. Future work includes exploring better parameter allocation strategies for VanillaNet to further improve its efficiency. Further investigation into the trade-off between non-linearity and depth in extremely simple architectures is also warranted. neural network architecture, deep learning, computer vision, model efficiency, convolutional neural networks
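VanillaNet's series-informed activation enriches a plain non-linearity by aggregating it over a small spatial neighbourhood with learnable weights. The module below is a rough PyTorch approximation of that idea (activation followed by a learnable depth-wise convolution and batch norm); it is a sketch under that assumption, not the paper's exact operator or its deep-training schedule.

```python
# Rough approximation of a "series-informed" activation: a standard
# non-linearity whose outputs are mixed with their spatial neighbours through a
# learnable depth-wise convolution.
import torch
import torch.nn as nn

class SeriesActivation(nn.Module):
    def __init__(self, channels: int, neighborhood: int = 3):
        super().__init__()
        self.act = nn.ReLU()
        # depth-wise conv aggregates each channel's activation over a small window
        self.mix = nn.Conv2d(channels, channels, kernel_size=neighborhood,
                             padding=neighborhood // 2, groups=channels)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        return self.bn(self.mix(self.act(x)))

x = torch.randn(2, 64, 32, 32)
print(SeriesActivation(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```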
2305.12966 Report Hierarchical Integration Diffusion Model for Realistic Image Deblurring Zheng Chen, Yulun Zhang, Ding Liu, Bin Xia, Jinjin Gu, Linghe Kong, Xin Yuan Diffusion models (DMs) have recently been introduced in image deblurring and exhibited promising performance, particularly in terms of details reconstruction. However, the diffusion model requires a large number of inference iterations to recover the clean image from pure Gaussian noise, which consumes massive computational resources. Moreover, the distribution synthesized by the diffusion model is often misaligned with the target results, leading to restrictions in distortion-based metrics. To address the above issues, we propose the Hierarchical Integration Diffusion Model (HI-Diff), for realistic image deblurring. Specifically, we perform the DM in a highly compacted latent space to generate the prior feature for the deblurring process. The deblurring process is implemented by a regression-based method to obtain better distortion accuracy. Meanwhile, the highly compact latent space ensures the efficiency of the DM. Furthermore, we design the hierarchical integration module to fuse the prior into the regression-based model from multiple scales, enabling better generalization in complex blurry scenarios. Comprehensive experiments on synthetic and real-world blur datasets demonstrate that our HI-Diff outperforms state-of-the-art methods. Code and trained models are available at https://github.com/zhengchen1999/HI-Diff. This paper proposes HI-Diff, a novel Hierarchical Integration Diffusion Model, for realistic image deblurring. Diffusion models (DMs) show promise in image deblurring for detailed reconstruction but suffer from high computational cost and potential distortion due to distribution misalignment. HI-Diff uses a two-stage training approach: 1) compressing ground truth images into a compact latent representation as prior feature and integrating it into a Transformer through a hierarchical integration module (HIM), and 2) training a latent diffusion model to generate the prior feature, which further guides the Transformer during deblurring. HI-Diff outperforms state-of-the-art methods on benchmark datasets, including GoPro, HIDE, RealBlur, and RWBI. The hierarchical integration with multi-scale prior features shows superior performance than using single-scale features. Jointly training the diffusion model and Transformer in stage two significantly improves the deblurring performance. The study mainly focuses on image deblurring, and its applicability to other image restoration tasks needs further investigation. Exploring more advanced diffusion model architectures and training strategies could potentially further enhance the deblurring performance. image deblurring, diffusion models, hierarchical integration, latent space, prior feature
2305.12799 Report Interactive Data Synthesis for Systematic Vision Adaptation via LLMs-AIGCs Collaboration Qifan Yu, Juncheng Li, Wentao Ye, Siliang Tang, Yueting Zhuang Recent text-to-image generation models have shown promising results in generating high-fidelity photo-realistic images. In parallel, the problem of data scarcity has brought a growing interest in employing AIGC technology for high-quality data expansion. However, this paradigm requires well-designed prompt engineering, and low-cost data expansion and labeling remain under-explored. Inspired by LLMs' powerful capability in task guidance, we propose a new paradigm of annotated data expansion named ChatGenImage. The core idea behind it is to leverage the complementary strengths of diverse models to establish a highly effective and user-friendly pipeline for interactive data augmentation. In this work, we extensively study how LLMs communicate with AIGC models to achieve more controllable image generation and make the first attempt to combine them for automatic data augmentation for a variety of downstream tasks. Finally, we present fascinating results obtained from our ChatGenImage framework and demonstrate the powerful potential of our synthetic data for systematic vision adaptation. Our codes are available at https://github.com/Yuqifan1117/Labal-Anything-Pipeline. Presents ChatGenImage, a novel framework for interactive data augmentation that leverages the collaborative capabilities of LLMs, AIGC models, and label foundation toolkits to generate high-quality synthetic images with fine-grained annotations. Addresses the limitations of existing data augmentation methods by enabling more controllable and diverse image generation with detailed annotations, which is crucial for improving the generalization and robustness of vision models, especially in data-scarce scenarios. Utilizes a two-stage process: 1) Language Enhancement Image Initialization: LLMs generate descriptive prompts to guide AIGC models in creating initial images. 2) Iteratively Local Refinement and Labeling: LLMs analyze annotations from label foundation toolkits and provide local editing prompts to AIGC models for iterative refinement, resulting in images that align with complex annotations. ChatGenImage effectively generates controllable and diverse images, even for unfamiliar or rare concepts, by leveraging the knowledge and reasoning capabilities of LLMs. The framework excels in creating images depicting intricate scenes with multiple objects and backgrounds through iterative local refinement guided by LLMs. Image filtering rules based on pixel and semantic checking ensure the generation of high-quality synthetic data suitable for downstream tasks. Current experiments primarily focus on qualitative analysis, with quantitative evaluations for downstream task performance left for future work. The framework's computational cost, particularly for iterative refinement, presents a challenge for large-scale data generation. data augmentation, synthetic data generation, large language models (llms), text-to-image synthesis, vision adaptation
2305.12716 Report The CLIP Model is Secretly an Image-to-Prompt Converter Yuxuan Ding, Chunna Tian, Haoxuan Ding, Lingqiao Liu The Stable Diffusion model is a prominent text-to-image generation model that relies on a text prompt as its input, which is encoded using the Contrastive Language-Image Pre-Training (CLIP). However, text prompts have limitations when it comes to incorporating implicit information from reference images. Existing methods have attempted to address this limitation by employing expensive training procedures involving millions of training samples for image-to-image generation. In contrast, this paper demonstrates that the CLIP model, as utilized in Stable Diffusion, inherently possesses the ability to instantaneously convert images into text prompts. Such an image-to-prompt conversion can be achieved by utilizing a linear projection matrix that is calculated in a closed form. Moreover, the paper showcases that this capability can be further enhanced by either utilizing a small amount of similar-domain training data (approximately 100 images) or incorporating several online training steps (around 30 iterations) on the reference images. By leveraging these approaches, the proposed method offers a simple and flexible solution to bridge the gap between images and text prompts. This methodology can be applied to various tasks such as image variation and image editing, facilitating more effective and seamless interaction between images and textual prompts. This paper introduces Stable Diffusion Image-to-Prompt Conversion (SD-IPC), a method for converting images into text prompts for use with Stable Diffusion, eliminating the need for expensive retraining procedures. Text prompts in text-to-image generation models struggle to capture implicit information from reference images, limiting their effectiveness for tasks like image variation. SD-IPC leverages the inherent relationship between CLIP's visual and textual embeddings by deriving a closed-form projection matrix. This matrix converts visual embeddings into textual prompts, enabling image-guided generation with Stable Diffusion. Additionally, the paper proposes fine-tuning approaches to improve content preservation and customize generation. SD-IPC effectively captures semantic information from reference images, enabling image variation without extensive training. Fine-tuning SD-IPC on specific datasets enhances content preservation and editing capabilities, outperforming existing methods like SD-R. SD-IPC facilitates fast adaptation for customized generation, requiring significantly fewer updates than methods like Custom Diffusion. Editing text needs to be contextually appropriate to avoid generating nonsensical images. The current method only supports single image input. image variation, text-to-image generation, stable diffusion, clip, image-to-prompt conversion
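The central claim is that a linear map can carry CLIP image embeddings into the prompt-embedding side used by Stable Diffusion. The NumPy sketch below illustrates only the flavour of this: it fits such a linear map by ordinary least squares on paired (image, caption) embeddings, whereas the paper computes its projection in closed form from CLIP's own weights. All arrays here are synthetic.

```python
# Sketch: fit a linear map W so that image_embedding @ W approximates the
# corresponding caption embedding, then use it as an "image-derived prompt".
import numpy as np

def fit_image_to_text_projection(img_emb: np.ndarray, txt_emb: np.ndarray) -> np.ndarray:
    """img_emb: (N, D_img) CLIP image embeddings; txt_emb: (N, D_txt) embeddings
    of their captions. Returns W with img_emb @ W ~= txt_emb."""
    W, *_ = np.linalg.lstsq(img_emb, txt_emb, rcond=None)
    return W

# toy example with random stand-in "embeddings"
rng = np.random.default_rng(0)
I = rng.normal(size=(1000, 512))
T = I @ rng.normal(size=(512, 768)) * 0.1 + 0.01 * rng.normal(size=(1000, 768))
W = fit_image_to_text_projection(I, T)
pseudo_prompt = I[:1] @ W           # prompt-side embedding derived from one image
print(W.shape, pseudo_prompt.shape)  # (512, 768) (1, 768)
```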
2305.12659 Report UVOSAM: A Mask-free Paradigm for Unsupervised Video Object Segmentation via Segment Anything Model Zhenghao Zhang, Zhichao Wei, Shengfan Zhang, Zuozhuo Dai, Siyu Zhu Unsupervised video object segmentation has made significant progress in recent years, but the manual annotation of video mask datasets is expensive and limits the diversity of available datasets. The Segment Anything Model (SAM) has introduced a new prompt-driven paradigm for image segmentation, unlocking a range of previously unexplored capabilities. In this paper, we propose a novel paradigm called UVOSAM, which leverages SAM for unsupervised video object segmentation without requiring video mask labels. To address SAM's limitations in instance discovery and identity association, we introduce a video salient object tracking network that automatically generates trajectories for prominent foreground objects. These trajectories then serve as prompts for SAM to produce video masks on a frame-by-frame basis. Our experimental results demonstrate that UVOSAM significantly outperforms current mask-supervised methods. These findings suggest that UVOSAM has the potential to improve unsupervised video object segmentation and reduce the cost of manual annotation. This paper introduces UVOSAM, a novel paradigm for unsupervised video object segmentation using the Segment Anything Model (SAM) without requiring video mask labels. Manually annotating video mask datasets is expensive and limits diversity. UVOSAM aims to address this by leveraging SAM for mask-free unsupervised video object segmentation. UVOSAM consists of two stages: 1) a video salient object tracking (VSOT) network detects prominent objects and generates trajectories, and 2) SAM utilizes these trajectories as prompts to produce video masks frame-by-frame. UVOSAM significantly outperforms current mask-supervised methods on DAVIS2017-unsupervised and Youtube-VIS 2019 datasets. Providing accurate bounding box prompts to SAM leads to near-perfect segmentation results, highlighting its potential. Ablation studies demonstrate the importance of the tracking framework, prompt types, and combining box and point prompts for optimal performance. UVOSAM struggles with detecting slender objects and experiences trajectory drift in cases of occlusion or significant scale changes. Future work will focus on improving VSOT's robustness to address these limitations. unsupervised video object segmentation, segment anything model (sam), mask-free training, video salient object tracking, prompt-driven segmentation
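The second stage of UVOSAM reduces to prompting SAM frame by frame with the boxes produced by the tracking network. The sketch below uses the public segment_anything API for that step; the checkpoint path and the track_boxes mapping (frame index to XYXY box) are placeholders, and the VSOT tracker itself is not shown.

```python
# Sketch of the SAM stage: per-frame box prompts from a tracker -> video masks.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # placeholder path
predictor = SamPredictor(sam)

def segment_trajectory(frames, track_boxes):
    """frames: list of HxWx3 uint8 RGB arrays; track_boxes: {t: np.array([x0, y0, x1, y1])}."""
    video_masks = {}
    for t, frame in enumerate(frames):
        if t not in track_boxes:
            continue                      # object not visible in this frame
        predictor.set_image(frame)
        masks, scores, _ = predictor.predict(box=track_boxes[t], multimask_output=False)
        video_masks[t] = masks[0]         # (H, W) boolean mask for the tracked object
    return video_masks
```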
2305.12529 Report DreamWaltz: Make a Scene with Complex 3D Animatable Avatars Yukun Huang, Jianan Wang, Ailing Zeng, He Cao, Xianbiao Qi, Yukai Shi, Zheng-Jun Zha, Lei Zhang We present DreamWaltz, a novel framework for generating and animating complex 3D avatars given text guidance and parametric human body prior. While recent methods have shown encouraging results for text-to-3D generation of common objects, creating high-quality and animatable 3D avatars remains challenging. To create high-quality 3D avatars, DreamWaltz proposes 3D-consistent occlusion-aware Score Distillation Sampling (SDS) to optimize implicit neural representations with canonical poses. It provides view-aligned supervision via 3D-aware skeleton conditioning which enables complex avatar generation without artifacts and multiple faces. For animation, our method learns an animatable 3D avatar representation from abundant image priors of diffusion model conditioned on various poses, which could animate complex non-rigged avatars given arbitrary poses without retraining. Extensive evaluations demonstrate that DreamWaltz is an effective and robust approach for creating 3D avatars that can take on complex shapes and appearances as well as novel poses for animation. The proposed framework further enables the creation of complex scenes with diverse compositions, including avatar-avatar, avatar-object and avatar-scene interactions. See https://dreamwaltz3d.github.io/ for more vivid 3D avatar and animation results. DreamWaltz is a novel framework for generating and animating complex 3D avatars from text descriptions, leveraging human body priors. Creating high-quality, animatable 3D avatars from text is challenging due to the complexity of avatar appearances, articulated structures, and pose-dependent shape changes. DreamWaltz utilizes a trainable NeRF for 3D representation, a pre-trained text-and-skeleton-conditional diffusion model for supervision, and SMPL models for 3D-aware skeletons. It introduces 3D-consistent occlusion-aware Score Distillation Sampling for high-quality avatar creation and learns an animatable NeRF representation from diffusion and pose priors. DreamWaltz generates high-quality 3D avatars with complex shapes and appearances from text prompts. The learned animatable NeRF enables realistic avatar animation with arbitrary motion sequences without retraining. The framework allows for scene composition with diverse avatar-avatar, avatar-object, and avatar-scene interactions. The visual quality can be further improved with higher resolution training and dedicated optimization for details like face and hand. The model may inherit societal biases present in the training data of the underlying diffusion model. text-to-3d, avatar generation, 3d animation, nerf, diffusion models
2305.12476 Report Zero-shot Visual Relation Detection via Composite Visual Cues from Large Language Models Lin Li, Jun Xiao, Guikun Chen, Jian Shao, Yueting Zhuang, Long Chen Pretrained vision-language models, such as CLIP, have demonstrated strong generalization capabilities, making them promising tools in the realm of zero-shot visual recognition. Visual relation detection (VRD) is a typical task that identifies relationship (or interaction) types between object pairs within an image. However, naively utilizing CLIP with prevalent class-based prompts for zero-shot VRD has several weaknesses, e.g., it struggles to distinguish between different fine-grained relation types and it neglects essential spatial information of two objects. To this end, we propose a novel method for zero-shot VRD: RECODE, which solves RElation detection via COmposite DEscription prompts. Specifically, RECODE first decomposes each predicate category into subject, object, and spatial components. Then, it leverages large language models (LLMs) to generate description-based prompts (or visual cues) for each component. Different visual cues enhance the discriminability of similar relation categories from different perspectives, which significantly boosts performance in VRD. To dynamically fuse different cues, we further introduce a chain-of-thought method that prompts LLMs to generate reasonable weights for different visual cues. Extensive experiments on four VRD benchmarks have demonstrated the effectiveness and interpretability of RECODE. This paper introduces RECODE, the first training-free framework for zero-shot visual relation detection using large language models (LLMs). Existing VRD methods require extensive training data and struggle with unseen relations. RECODE addresses these limitations by leveraging the knowledge and reasoning capabilities of LLMs. RECODE utilizes LLMs to generate descriptions of visual cues for various relation categories. These descriptions are then used to compute similarities between image regions and relation categories, enabling zero-shot relation detection. RECODE achieves competitive results on the Visual Genome (VG) dataset without any training, demonstrating its potential for zero-shot VRD. Ablation studies on class-based prompts highlight RECODE's effectiveness compared to a baseline using only class-based prompts. Qualitative analysis showcases the interpretability of RECODE's predictions, revealing its ability to identify relevant visual cues for accurate relation classification. RECODE's current evaluation doesn't explicitly cover spatial or ownership relation categories. The framework assumes access to perfect bounding boxes and object categories, which might not be realistic in real-world applications. visual relation detection, zero-shot learning, large language models, computer vision, artificial intelligence
2305.12452 Report Advancing Referring Expression Segmentation Beyond Single Image Yixuan Wu, Zhao Zhang, Xie Chi, Feng Zhu, Rui Zhao Referring Expression Segmentation (RES) is a widely explored multi-modal task, which endeavors to segment the pre-existing object within a single image with a given linguistic expression. However, in broader real-world scenarios, it is not always possible to determine if the described object exists in a specific image. Typically, we have a collection of images, some of which may contain the described objects. The current RES setting curbs its practicality in such situations. To overcome this limitation, we propose a more realistic and general setting, named Group-wise Referring Expression Segmentation (GRES), which expands RES to a collection of related images, allowing the described objects to be present in a subset of input images. To support this new setting, we introduce an elaborately compiled dataset named Grouped Referring Dataset (GRD), containing complete group-wise annotations of target objects described by given expressions. We also present a baseline method named Grouped Referring Segmenter (GRSer), which explicitly captures the language-vision and intra-group vision-vision interactions to achieve state-of-the-art results on the proposed GRES and related tasks, such as Co-Salient Object Detection and RES. Our dataset and codes will be publicly released in https://github.com/yixuan730/group-res. This paper proposes Group-wise Referring Expression Segmentation (GRES), a new setting that extends Referring Expression Segmentation (RES) to a collection of related images, allowing for more realistic scenarios where the target object might not be present in all images. Current RES methods are limited to single images with confirmed target presence, hindering their practicality in real-world applications such as image retrieval or multi-monitor event discovery. GRES addresses this limitation by considering groups of related images, some of which may not contain the described object. The authors introduce GRSer, a baseline method for GRES, which leverages language and intra-group visual cues. GRSer utilizes a Triphasic Query Module (TQM) to generate heatmaps based on both linguistic and visual features, and a Heatmap Hierarchizer to rank these heatmaps for improved object localization. Additionally, a new dataset called GRD is presented, featuring complete group-wise annotations of target objects, including negative samples. GRSer significantly outperforms existing RES methods on both GRES and conventional RES benchmarks. The proposed TQM and Heatmap Hierarchizer are shown to effectively capture language-vision and vision-vision interactions, contributing to improved object localization. The GRD dataset, with its complete group-wise annotations and negative samples, provides a more realistic and challenging benchmark for evaluating GRES methods. The study primarily focuses on a fixed group size, leaving exploration of dynamic group sizes for future work. The impact of different language and vision encoders on GRES performance could be investigated further. referring expression segmentation, group-wise segmentation, multi-modal learning, computer vision, natural language processing
2305.12328 Report InstructVid2Vid: Controllable Video Editing with Natural Language Instructions Bosheng Qin, Juncheng Li, Siliang Tang, Tat-Seng Chua, Yueting Zhuang We present an end-to-end diffusion-based method for editing videos with human language instructions, namely InstructVid2Vid. Our approach enables the editing of input videos based on natural language instructions without any per-example fine-tuning or inversion. The proposed InstructVid2Vid model combines a pretrained image generation model, Stable Diffusion, with a conditional 3D U-Net architecture to generate a time-dependent sequence of video frames. To obtain the training data, we incorporate the knowledge and expertise of different models, including ChatGPT, BLIP, and Tune-a-Video, to synthesize video-instruction triplets, which is a more cost-efficient alternative to collecting data in real-world scenarios. To improve the consistency between adjacent frames of generated videos, we propose the Frame Difference Loss, which is incorporated during the training process. During inference, we extend the classifier-free guidance to text-video input to guide the generated results, making them more related to both the input video and instruction. Experiments demonstrate that InstructVid2Vid is able to generate high-quality, temporally coherent videos and perform diverse edits, including attribute editing, change of background, and style transfer. These results highlight the versatility and effectiveness of our proposed method. Code is released at https://github.com/BrightQin/InstructVid2Vid. Introduces InstructVid2Vid, an end-to-end diffusion-based video editing method using human language instructions without per-example fine-tuning. Addresses limitations of existing video editing methods that require computationally expensive fine-tuning for each input video. Combines a pretrained Stable Diffusion model with a 3D U-Net, trained on a synthetic dataset generated using ChatGPT, BLIP, and Tune-a-Video. Introduces Frame Difference Loss to enhance temporal consistency in generated videos. Achieves attribute modification, background change, and style transfer in videos while maintaining temporal consistency. Demonstrates superior performance in quantitative metrics like frame differencing, optical flow, and FID compared to models without Frame Difference Loss. Showcases the potential of model composition for generating training data and enabling advanced video editing capabilities. Current model primarily excels at Level 1 video editing tasks and faces limitations in comprehending complex instructions or achieving high-level semantic editing. Future work focuses on enhancing InstructVid2Vid's capabilities for higher-level video editing tasks like motion manipulation and story-driven editing. video editing, diffusion models, generative ai, text-to-video, multimodal learning
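A plausible minimal form of the Frame Difference Loss is to require that the difference between adjacent predicted frames matches the difference between the corresponding target frames. The sketch below implements that reading; the paper's exact formulation (for example, which space the difference is computed in) may differ.

```python
# Minimal, plausible frame-difference loss: temporal differences of the
# prediction should match temporal differences of the target video.
import torch

def frame_difference_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """pred, target: (B, F, C, H, W) video tensors with F frames."""
    pred_diff = pred[:, 1:] - pred[:, :-1]        # (B, F-1, C, H, W)
    target_diff = target[:, 1:] - target[:, :-1]
    return torch.nn.functional.l1_loss(pred_diff, target_diff)

loss = frame_difference_loss(torch.rand(2, 8, 3, 64, 64), torch.rand(2, 8, 3, 64, 64))
print(loss.item())
```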
2305.12252 Report Boosting Human-Object Interaction Detection with Text-to-Image Diffusion Model Jie Yang, Bingliang Li, Fengyu Yang, Ailing Zeng, Lei Zhang, Ruimao Zhang This paper investigates the problem of the current HOI detection methods and introduces DiffHOI, a novel HOI detection scheme grounded on a pre-trained text-image diffusion model, which enhances the detector's performance via improved data diversity and HOI representation. We demonstrate that the internal representation space of a frozen text-to-image diffusion model is highly relevant to verb concepts and their corresponding context. Accordingly, we propose an adapter-style tuning method to extract the various semantic associated representation from a frozen diffusion model and CLIP model to enhance the human and object representations from the pre-trained detector, further reducing the ambiguity in interaction prediction. Moreover, to fill in the gaps of HOI datasets, we propose SynHOI, a class-balance, large-scale, and high-diversity synthetic dataset containing over 140K HOI images with fully triplet annotations. It is built using an automatic and scalable pipeline designed to scale up the generation of diverse and high-precision HOI-annotated data. SynHOI could effectively relieve the long-tail issue in existing datasets and facilitate learning interaction representations. Extensive experiments demonstrate that DiffHOI significantly outperforms the state-of-the-art in regular detection (i.e., 41.50 mAP) and zero-shot detection. Furthermore, SynHOI can improve the performance of model-agnostic and backbone-agnostic HOI detection, particularly exhibiting an outstanding 11.55% mAP improvement in rare classes. This paper introduces DiffHOI, a novel HOI detection scheme that leverages the generative and representative capabilities of pre-trained text-to-image diffusion models to enhance HOI detection performance. Current HOI detection methods suffer from limitations such as class imbalance, small data size, limited diversity in existing datasets, and difficulties in extracting nuanced verb-associated contextual information for effective interaction prediction. DiffHOI utilizes an adapter-style tuning approach to extract global and local semantic representations from a frozen diffusion model and CLIP model. It also introduces SynHOI, a class-balanced, large-scale synthetic HOI dataset generated using an automatic pipeline. DiffHOI significantly outperforms state-of-the-art methods in regular HOI detection, achieving 41.50 mAP on HICO-DET. SynHOI effectively addresses the long-tail issue in existing datasets and improves the performance of HOI detection, particularly in rare classes with an 11.55% mAP improvement. DiffHOI demonstrates superior performance in zero-shot HOI detection. The computational cost of incorporating large-scale diffusion models in HOI detection pipelines. Further exploration of more effective prompt design strategies for generating higher-quality synthetic HOI data. human-object interaction detection, diffusion models, synthetic data generation, zero-shot learning, computer vision
2305.11588 Report Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields Jingbo Zhang, Xiaoyu Li, Ziyu Wan, Can Wang, Jing Liao Text-driven 3D scene generation is widely applicable to video gaming, film industry, and metaverse applications that have a large demand for 3D scenes. However, existing text-to-3D generation methods are limited to producing 3D objects with simple geometries and dreamlike styles that lack realism. In this work, we present Text2NeRF, which is able to generate a wide range of 3D scenes with complicated geometric structures and high-fidelity textures purely from a text prompt. To this end, we adopt NeRF as the 3D representation and leverage a pre-trained text-to-image diffusion model to constrain the 3D reconstruction of the NeRF to reflect the scene description. Specifically, we employ the diffusion model to infer the text-related image as the content prior and use a monocular depth estimation method to offer the geometric prior. Both content and geometric priors are utilized to update the NeRF model. To guarantee textured and geometric consistency between different views, we introduce a progressive scene inpainting and updating strategy for novel view synthesis of the scene. Our method requires no additional training data but only a natural language description of the scene as the input. Extensive experiments demonstrate that our Text2NeRF outperforms existing methods in producing photo-realistic, multi-view consistent, and diverse 3D scenes from a variety of natural language prompts. Our code is available at https://github.com/eckertzhang/Text2NeRF. Text2NeRF generates realistic 3D scenes from text prompts by combining a pre-trained text-to-image diffusion model with NeRF. Existing text-to-3D methods struggle to generate high-fidelity and diverse scenes with complex geometry, often resulting in simplistic or dreamlike outputs. Text2NeRF addresses this by leveraging the strengths of NeRF and diffusion models for realistic scene generation. The method infers an initial image and depth map from the text prompt using a diffusion model and a monocular depth estimation method. This information initializes a NeRF model. A progressive inpainting and updating (PIU) strategy expands the scene view-by-view, using the diffusion model to fill missing regions while maintaining consistency. Support sets with multi-view constraints and a two-stage depth alignment strategy are employed to enhance realism and accuracy. Generates photorealistic 3D scenes with complex geometry and textures from text prompts. Outperforms existing text-to-3D methods both qualitatively and quantitatively in terms of scene quality, realism, and semantic relevance. Supports generation of diverse scenes, including indoor, outdoor, and artistic styles, and allows for 360-degree scene generation. Struggles with scenes containing large occlusions due to limitations in depth estimation accuracy. Requires longer optimization time compared to mesh or point cloud-based generation methods. text-to-3d, nerf, 3d scene generation, diffusion models, novel view synthesis
2305.11577 Report LeftRefill: Filling Right Canvas based on Left Reference through Generalized Text-to-Image Diffusion Model Chenjie Cao, Yunuo Cai, Qiaole Dong, Yikai Wang, Yanwei Fu This paper introduces LeftRefill, an innovative approach to efficiently harness large Text-to-Image (T2I) diffusion models for reference-guided image synthesis. As the name implies, LeftRefill horizontally stitches reference and target views together as a whole input. The reference image occupies the left side, while the target canvas is positioned on the right. Then, LeftRefill paints the right-side target canvas based on the left-side reference and specific task instructions. Such a task formulation shares some similarities with contextual inpainting, akin to the actions of a human painter. This novel formulation efficiently learns both structural and textured correspondence between reference and target without other image encoders or adapters. We inject task and view information through cross-attention modules in T2I models, and further exhibit multi-view reference ability via the re-arranged self-attention modules. These enable LeftRefill to perform consistent generation as a generalized model without requiring test-time fine-tuning or model modifications. Thus, LeftRefill can be seen as a simple yet unified framework to address reference-guided synthesis. As an exemplar, we leverage LeftRefill to address two different challenges: reference-guided inpainting and novel view synthesis, based on the pre-trained StableDiffusion. Codes and models are released at https://github.com/ewrfcas/LeftRefill. This paper introduces LeftRefill, a novel method for reference-guided image synthesis using large text-to-image diffusion models by stitching reference and target images into a single canvas and leveraging contextual inpainting. Existing methods for reference-guided synthesis rely on computationally expensive fine-tuning of large models or visual encoders that prioritize semantics over spatial details, hindering performance in tasks like novel view synthesis and reference-guided inpainting. LeftRefill stitches reference and target images horizontally and trains using the inpainting capability of Stable Diffusion, guided by task and view-specific prompt tuning and a novel block causal masking technique for consistent autoregressive generation. LeftRefill achieves state-of-the-art performance in both reference-guided inpainting and novel view synthesis with fewer trainable parameters. The method effectively leverages multi-view references to enhance inpainting quality and generate consistent novel views. The proposed block causal masking technique enables autoregressive generation in diffusion-based models, leading to improved geometric consistency in novel view synthesis. LeftRefill's autoregressive generation suffers from error accumulation, requiring additional reference views for correction. Future work includes extending LeftRefill to higher resolutions and improving efficiency for larger, more powerful text-to-image models. reference-guided image synthesis, text-to-image (t2i) diffusion models, novel view synthesis, image inpainting, prompt tuning
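The task formulation itself, stitch the reference on the left and inpaint the right half, can be mimicked with any off-the-shelf inpainting diffusion pipeline. The sketch below does exactly that with diffusers' Stable Diffusion inpainting pipeline as a stand-in; LeftRefill additionally tunes task/view prompt tokens and rearranges attention, which is not reproduced here, and the model id is simply the public inpainting checkpoint.

```python
# Sketch of the stitch-then-inpaint formulation: reference on the left,
# masked target canvas on the right, filled by an off-the-shelf inpainting model.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

def left_refill(reference: Image.Image, prompt: str, size: int = 512) -> Image.Image:
    ref = reference.resize((size, size))
    canvas = Image.new("RGB", (2 * size, size))
    canvas.paste(ref, (0, 0))                                   # left: reference view
    mask = Image.new("L", (2 * size, size), 0)
    mask.paste(Image.new("L", (size, size), 255), (size, 0))    # right half gets inpainted
    out = pipe(prompt=prompt, image=canvas, mask_image=mask,
               width=2 * size, height=size).images[0]
    return out.crop((size, 0, 2 * size, size))                  # return the filled target
```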
2305.11520 Report Late-Constraint Diffusion Guidance for Controllable Image Synthesis Chang Liu, Dong Liu Diffusion models, with or without a text condition, have demonstrated impressive capability in synthesizing photorealistic images given a few or even no words. These models may not fully satisfy user needs, as ordinary users or artists want to control the synthesized images with specific guidance, like overall layout, color, structure, object shape, and so on. To adapt diffusion models for controllable image synthesis, several methods have been proposed to incorporate the required conditions as regularization upon the intermediate features of the diffusion denoising network. These methods, known as early-constraint ones in this paper, have difficulty handling multiple conditions with a single solution. They tend to train separate models for each specific condition, which incurs substantial training cost and results in non-generalizable solutions. To address these difficulties, we propose a new approach, namely late-constraint: we leave the diffusion networks unchanged, but constrain their output to be aligned with the required conditions. Specifically, we train a lightweight condition adapter to establish the correlation between external conditions and internal representations of diffusion models. During the iterative denoising process, the conditional guidance is sent into the corresponding condition adapter to manipulate the sampling process with the established correlation. We further equip the introduced late-constraint strategy with a timestep resampling method and an early stopping technique, which boost the quality of the synthesized images while complying with the guidance. Our method outperforms the existing early-constraint methods and generalizes better to unseen conditions. Our code will be made available. This paper proposes Late-Constraint Diffusion Guidance (LCDG), a novel approach for controllable image synthesis that aligns the output of diffusion models with external guidance without altering the original network. Existing methods for controlling diffusion models, known as early-constraint methods, have limitations in handling multiple conditions and generalizing to unseen ones. LCDG addresses these limitations by externally guiding the sampling process. LCDG trains a lightweight Condition Adapter (CA) to learn the correlation between internal representations of diffusion models and external conditions. During sampling, it uses the CA to adjust the estimated score based on the difference between the desired and reconstructed conditions. LCDG achieves superior FID scores compared to existing early-constraint methods on the COCO dataset, demonstrating better sample quality. The method exhibits strong generalization ability, effectively handling various conditions like edge, sketch, color stroke, palette, and mask with a single model. LCDG offers flexible controllability through adjustable parameters like controlling scale and truncation threshold. LCDG, similar to other gradient-based methods, can increase sampling time due to additional forward passes. Further research is needed to explore more efficient acceleration strategies to mitigate the increased sampling time. image synthesis, diffusion models, controllable generation, condition adapter, structure-aware sampling
2305.11487 Report PointGPT: Auto-regressively Generative Pre-training from Point Clouds Guangyan Chen, Meiling Wang, Yi Yang, Kai Yu, Li Yuan, Yufeng Yue Large language models (LLMs) based on the generative pre-training transformer (GPT) have demonstrated remarkable effectiveness across a diverse range of downstream tasks. Inspired by the advancements of the GPT, we present PointGPT, a novel approach that extends the concept of GPT to point clouds, addressing the challenges associated with disorder properties, low information density, and task gaps. Specifically, a point cloud auto-regressive generation task is proposed to pre-train transformer models. Our method partitions the input point cloud into multiple point patches and arranges them in an ordered sequence based on their spatial proximity. Then, an extractor-generator based transformer decoder, with a dual masking strategy, learns latent representations conditioned on the preceding point patches, aiming to predict the next one in an auto-regressive manner. Our scalable approach allows for learning high-capacity models that generalize well, achieving state-of-the-art performance on various downstream tasks. In particular, our approach achieves classification accuracies of 94.9% on the ModelNet40 dataset and 93.4% on the ScanObjectNN dataset, outperforming all other transformer models. Furthermore, our method also attains new state-of-the-art accuracies on all four few-shot learning benchmarks. This paper introduces PointGPT, a novel self-supervised learning framework for point clouds inspired by the GPT architecture in NLP. It addresses challenges like point cloud disorder, low information density, and task-specific gaps to learn effective representations. Existing point cloud learning methods often depend on fully-supervised training with costly annotations. Self-supervised learning, particularly GPT-like architectures, have shown great promise in NLP and image analysis for learning without explicit labels. This paper explores adapting this success to the point cloud domain. PointGPT partitions point clouds into ordered sequences of point patches using Morton-order curve. It then utilizes an extractor-generator transformer decoder with a dual masking strategy. The extractor learns latent representations by predicting masked patches, while the generator reconstructs the point patches. Post-pre-training with a labeled hybrid dataset further enhances representation learning. PointGPT outperforms other single-modal SSL methods on object classification (ScanObjectNN, ModelNet40) and part segmentation (ShapeNetPart) tasks. Scaled PointGPT models achieve state-of-the-art results on these tasks, exceeding even methods using cross-modal information or teacher models. The method demonstrates strong generalization ability, achieving superior performance in few-shot learning scenarios. The data and model scales used are still smaller compared to NLP and image processing domains, limiting further exploration of PointGPT's potential. Future work can investigate scaling PointGPT to even larger datasets and model sizes to further bridge the gap with NLP and image processing. point cloud, self-supervised learning, generative pre-training, transformer, representation learning
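The "ordered sequence based on spatial proximity" is obtained by sorting point-patch centers along a space-filling curve. The snippet below sketches a 3D Morton (Z-order) sort of patch centers; the quantization bit depth and the random centers are illustrative choices.

```python
# Sketch: order point-patch centers along a 3D Morton (Z-order) curve.
import numpy as np

def morton_code_3d(coords: np.ndarray, bits: int = 10) -> np.ndarray:
    """coords: (N, 3) float patch centers; returns (N,) Morton codes."""
    mins, maxs = coords.min(axis=0), coords.max(axis=0)
    q = ((coords - mins) / (maxs - mins + 1e-9) * (2 ** bits - 1)).astype(np.uint64)
    codes = np.zeros(len(coords), dtype=np.uint64)
    for b in range(bits):                       # interleave the bits of x, y, z
        for axis in range(3):
            bit = (q[:, axis] >> np.uint64(b)) & np.uint64(1)
            codes |= bit << np.uint64(3 * b + axis)
    return codes

centers = np.random.rand(64, 3)                 # e.g. sampled patch centers
order = np.argsort(morton_code_3d(centers))     # sequence order fed to the decoder
print(order[:8])
```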
2305.11321 Report JoIN: Joint GANs Inversion for Intrinsic Image Decomposition Viraj Shah, Svetlana Lazebnik, Julien Philip In this work, we propose to solve ill-posed inverse imaging problems using a bank of Generative Adversarial Networks (GANs) as a prior and apply our method to the case of Intrinsic Image Decomposition for faces and materials. Our method builds on the demonstrated ability of GANs to capture complex image distributions. At the core of our approach is the idea that the latent space of a GAN is a well-suited optimization domain to solve inverse problems. Given an input image, we propose to jointly invert the latent codes of a set of GANs and combine their outputs to reproduce the input. Contrary to most GAN inversion methods, which are limited to inverting only a single GAN, we demonstrate that it is possible to maintain distribution priors while inverting several GANs jointly. We show that our approach is modular, allowing various forward imaging models, and that it can successfully decompose both synthetic and real images. This paper introduces JoIN, a novel method that leverages a bank of Generative Adversarial Networks (GANs) as priors to solve ill-posed inverse imaging problems, particularly focusing on Intrinsic Image Decomposition (IID) for faces and materials. IID is crucial for realistic image editing by decomposing images into independent components like albedo, shading, and specular reflections. Existing methods are often limited by restrictive priors or cross-contamination between components. This work addresses these limitations using the powerful image distribution modeling capabilities of GANs. The proposed method involves training separate GANs for each image component (albedo, shading, specular). Given an input image, the latent codes of these GANs are jointly optimized to minimize the reconstruction loss between the generated output and the target image. A novel kNN-based loss regularization is introduced to constrain the optimization and prevent cross-contamination between components. The approach is further enhanced by incorporating encoder-based initialization and generator fine-tuning techniques. JoIN successfully decomposes both synthetic and real images into their intrinsic components, outperforming existing methods quantitatively and qualitatively. Independently trained GANs prove advantageous for modularity, flexibility, and preventing cross-contamination between components. The novel kNN-based loss regularization is shown to be effective in maintaining GAN priors during optimization, leading to improved decomposition results. The method is limited by the training quality of the GANs and is currently more suited to distributions easily modeled by GANs, such as faces. The optimization-based inversion process, while producing high-quality results, is computationally more expensive than feed-forward methods. generative adversarial networks, intrinsic image decomposition, inverse problems, gan inversion, image editing
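At its core, the method runs a single optimization over the latent codes of several pretrained generators so that an image-formation model reproduces the input. The loop below is a bare-bones sketch of that joint inversion; the three generators, the albedo*shading+specular forward model, and the hyperparameters are placeholders, and the paper's kNN latent regularizer, encoder initialization, and generator fine-tuning are omitted.

```python
# Sketch: jointly optimize latent codes of several pretrained generators so that
# a simple intrinsic forward model reconstructs the target image.
import torch

def joint_invert(target, G_albedo, G_shading, G_specular, latent_dim=512, steps=500):
    """target: (1, 3, H, W) image in [0, 1]; each G maps (1, latent_dim) -> image."""
    z = [torch.randn(1, latent_dim, requires_grad=True) for _ in range(3)]
    opt = torch.optim.Adam(z, lr=0.05)
    for _ in range(steps):
        albedo, shading, specular = G_albedo(z[0]), G_shading(z[1]), G_specular(z[2])
        recon = albedo * shading + specular          # placeholder forward imaging model
        loss = torch.nn.functional.mse_loss(recon, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return [zi.detach() for zi in z]
```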
2305.11173 Report Going Denser with Open-Vocabulary Part Segmentation Peize Sun, Shoufa Chen, Chenchen Zhu, Fanyi Xiao, Ping Luo, Saining Xie, Zhicheng Yan Object detection has been expanded from a limited number of categories to open vocabulary. Moving forward, a complete intelligent vision system requires understanding more fine-grained object descriptions, object parts. In this paper, we propose a detector with the ability to predict both open-vocabulary objects and their part segmentation. This ability comes from two designs. First, we train the detector on the joint of part-level, object-level and image-level data to build the multi-granularity alignment between language and image. Second, we parse the novel object into its parts by its dense semantic correspondence with the base object. These two designs enable the detector to largely benefit from various data sources and foundation models. In open-vocabulary part segmentation experiments, our method outperforms the baseline by 3.3$\sim$7.3 mAP in cross-dataset generalization on PartImageNet, and improves the baseline by 7.3 novel AP$_{50}$ in cross-category generalization on Pascal Part. Finally, we train a detector that generalizes to a wide range of part segmentation datasets while achieving better performance than dataset-specific training. This paper presents a novel detector capable of performing both open-vocabulary object detection and part segmentation, enabling fine-grained understanding of objects and their components. Expanding object detection to include part segmentation in an open-vocabulary setting is crucial for intelligent vision systems, enabling deeper object understanding and supporting applications like robotics and image editing. The proposed method leverages a vision-language model trained on part, object, and image-level data to achieve multi-granularity alignment between images and text. It further parses novel objects into their parts by establishing dense semantic correspondence with base objects using a pre-trained DINO model. The method outperforms baselines by 3.3-7.3 mAP in cross-dataset generalization on PartImageNet. It achieves a 7.3 AP improvement in cross-category generalization on Pascal Part. The model trained on joint data outperforms dataset-specific training on various part segmentation datasets. Joint training for object detection and part segmentation may not always benefit both tasks equally. Further research is needed to explore better text prompt engineering for part segmentation. open-vocabulary learning, part segmentation, object detection, vision-language model, semantic correspondence
2305.11147 Report UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild Can Qin, Shu Zhang, Ning Yu, Yihao Feng, Xinyi Yang, Yingbo Zhou, Huan Wang, Juan Carlos Niebles, Caiming Xiong, Silvio Savarese, Stefano Ermon, Yun Fu, Ran Xu Achieving machine autonomy and human control often represent divergent objectives in the design of interactive AI systems. Visual generative foundation models such as Stable Diffusion show promise in navigating these goals, especially when prompted with arbitrary languages. However, they often fall short in generating images with spatial, structural, or geometric controls. The integration of such controls, which can accommodate various visual conditions in a single unified model, remains an unaddressed challenge. In response, we introduce UniControl, a new generative foundation model that consolidates a wide array of controllable condition-to-image (C2I) tasks within a singular framework, while still allowing for arbitrary language prompts. UniControl enables pixel-level-precise image generation, where visual conditions primarily influence the generated structures and language prompts guide the style and context. To equip UniControl with the capacity to handle diverse visual conditions, we augment pretrained text-to-image diffusion models and introduce a task-aware HyperNet to modulate the diffusion models, enabling the adaptation to different C2I tasks simultaneously. Trained on nine unique C2I tasks, UniControl demonstrates impressive zero-shot generation abilities with unseen visual conditions. Experimental results show that UniControl often surpasses the performance of single-task-controlled methods of comparable model sizes. This control versatility positions UniControl as a significant advancement in the realm of controllable visual generation. This paper introduces UniControl, a novel unified diffusion model for controllable visual generation. UniControl consolidates various condition-to-image tasks within a single framework, allowing for pixel-level image generation by leveraging both visual conditions and language prompts. Existing text-to-image generative models lack pixel-level precision for spatial control, while models like ControlNet, capable of incorporating visual conditions, require separate training for each condition. UniControl addresses this by handling diverse visual conditions in a single unified model, making it more efficient and versatile. UniControl leverages a mixture of experts (MOE)-style adapter and a task-aware HyperNet. The MOE adapter captures unique information from different visual conditions, while the task-aware HyperNet enables the model to adapt to different C2I tasks using task instructions. UniControl demonstrates impressive zero-shot generation abilities on unseen visual conditions and hybrid tasks. Experimental results show that UniControl often surpasses the performance of single-task controlled methods of comparable model sizes. User studies confirm UniControl's superiority over single-task models and official ControlNet implementations on various tasks. The model's performance is limited by the potential biases present in the training dataset (Laion-Aesthetics). While UniControl excels in various tasks, it may face challenges when high-quality human output is desired. generative models, diffusion models, controllable image generation, multi-task learning, zero-shot learning
2305.11116 Report LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation Yujie Lu, Xianjun Yang, Xiujun Li, Xin Eric Wang, William Yang Wang Existing automatic evaluation on text-to-image synthesis can only provide an image-text matching score, without considering the object-level compositionality, which results in poor correlation with human judgments. In this work, we propose LLMScore, a new framework that offers evaluation scores with multi-granularity compositionality. LLMScore leverages the large language models (LLMs) to evaluate text-to-image models. Initially, it transforms the image into image-level and object-level visual descriptions. Then an evaluation instruction is fed into the LLMs to measure the alignment between the synthesized image and the text, ultimately generating a score accompanied by a rationale. Our substantial analysis reveals the highest correlation of LLMScore with human judgments on a wide range of datasets (Attribute Binding Contrast, Concept Conjunction, MSCOCO, DrawBench, PaintSkills). Notably, our LLMScore achieves Kendall's tau correlation with human evaluations that is 58.8% and 31.2% higher than the commonly-used text-image matching metrics CLIP and BLIP, respectively. This paper presents LLMScore, a novel framework leveraging Large Language Models (LLMs) for evaluating text-to-image synthesis with a focus on multi-granularity compositionality. Existing automatic evaluation metrics for text-to-image synthesis often fail to capture object-level alignment between text prompts and generated images, resulting in poor correlation with human judgments. LLMScore first decomposes the image into image-level and object-level descriptions using vision and language models. These descriptions, along with the text prompt, are fed into an LLM (e.g., GPT-4) with specific evaluation instructions to generate a score and a rationale. LLMScore achieves significantly higher correlation with human judgments compared to existing metrics like CLIP and BLIP across various datasets. The framework demonstrates accurate capture of object-level alignment through detailed rationales that highlight similarities and discrepancies between images and text prompts. LLMScore is adaptable to different evaluation objectives (e.g., overall quality, error counting) by simply modifying the evaluation instructions provided to the LLM. The reliance on GPT, a non-free LLM, may limit accessibility and scalability. Potential biases inherited from the pre-trained LLM could affect evaluation fairness and require careful consideration for specific domains. text-to-image synthesis, image evaluation, compositionality, large language models, multi-granularity understanding
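To make the pipeline above concrete, here is a minimal sketch of how the image-level and object-level descriptions could be composed with an evaluation instruction. The prompt wording, the SCORE/RATIONALE answer format, and the `call_llm` stub are illustrative assumptions, not the paper's exact template or API.

```python
# Minimal sketch of an LLMScore-style evaluation pipeline.
# The prompt wording and `call_llm` are hypothetical placeholders.

def build_llmscore_prompt(text_prompt, image_caption, object_descriptions):
    """Compose image-level and object-level descriptions with an
    evaluation instruction for the LLM."""
    objects = "\n".join(f"- {d}" for d in object_descriptions)
    return (
        f"Text prompt: {text_prompt}\n"
        f"Image-level description: {image_caption}\n"
        f"Object-level descriptions:\n{objects}\n\n"
        "Rate how well the image matches the text prompt on a 0-100 scale, "
        "considering object presence, attributes, and relations. "
        "Answer as: SCORE: <number>. RATIONALE: <one paragraph>."
    )

def call_llm(prompt):
    # placeholder for an actual LLM call (e.g. an API client)
    raise NotImplementedError

def llmscore(text_prompt, image_caption, object_descriptions):
    reply = call_llm(build_llmscore_prompt(text_prompt, image_caption, object_descriptions))
    score = int(reply.split("SCORE:")[1].split(".")[0].strip())
    rationale = reply.split("RATIONALE:")[1].strip()
    return score, rationale
```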
2305.11031 Report ConsistentNeRF: Enhancing Neural Radiance Fields with 3D Consistency for Sparse View Synthesis Shoukang Hu, Kaichen Zhou, Kaiyu Li, Longhui Yu, Lanqing Hong, Tianyang Hu, Zhenguo Li, Gim Hee Lee, Ziwei Liu Neural Radiance Fields (NeRF) has demonstrated remarkable 3D reconstruction capabilities with dense view images. However, its performance significantly deteriorates under sparse view settings. We observe that learning the 3D consistency of pixels among different views is crucial for improving reconstruction quality in such cases. In this paper, we propose ConsistentNeRF, a method that leverages depth information to regularize both multi-view and single-view 3D consistency among pixels. Specifically, ConsistentNeRF employs depth-derived geometry information and a depth-invariant loss to concentrate on pixels that exhibit 3D correspondence and maintain consistent depth relationships. Extensive experiments on recent representative works reveal that our approach can considerably enhance model performance in sparse view conditions, achieving improvements of up to 94% in PSNR, 76% in SSIM, and 31% in LPIPS compared to the vanilla baselines across various benchmarks, including DTU, NeRF Synthetic, and LLFF. ConsistentNeRF enhances Neural Radiance Fields by integrating multi-view and single-view 3D consistency to improve performance in sparse view scenarios. NeRF struggles in sparse view settings due to the lack of 3D consistency. This work addresses this limitation by enforcing 3D consistency, leading to improved performance. ConsistentNeRF utilizes: (1) a depth-derived mask to focus on pixels with multi-view 3D correspondence, and (2) a depth-invariant loss to enforce single-view 3D consistency using monocular depth priors. Achieves state-of-the-art results on DTU, NeRF Synthetic, and LLFF datasets under sparse view conditions. Significantly outperforms vanilla NeRF and other depth-based methods. Demonstrates the importance of 3D consistency for high-quality novel view synthesis. Reliance on pre-trained MVSNeRF for mask derivation limits real-world applicability. Performance degrades when the target view is far from source views due to limitations in exploiting 3D correspondence. neural radiance fields, nerf, sparse view synthesis, 3d consistency, depth estimation
2305.10973 Report Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold Xingang Pan, Ayush Tewari, Thomas Leimkühler, Lingjie Liu, Abhimitra Meka, Christian Theobalt Synthesizing visual content that meets users' needs often requires flexible and precise controllability of the pose, shape, expression, and layout of the generated objects. Existing approaches gain controllability of generative adversarial networks (GANs) via manually annotated training data or a prior 3D model, which often lack flexibility, precision, and generality. In this work, we study a powerful yet much less explored way of controlling GANs, that is, to "drag" any points of the image to precisely reach target points in a user-interactive manner, as shown in Fig.1. To achieve this, we propose DragGAN, which consists of two main components: 1) a feature-based motion supervision that drives the handle point to move towards the target position, and 2) a new point tracking approach that leverages the discriminative generator features to keep localizing the position of the handle points. Through DragGAN, anyone can deform an image with precise control over where pixels go, thus manipulating the pose, shape, expression, and layout of diverse categories such as animals, cars, humans, landscapes, etc. As these manipulations are performed on the learned generative image manifold of a GAN, they tend to produce realistic outputs even for challenging scenarios such as hallucinating occluded content and deforming shapes that consistently follow the object's rigidity. Both qualitative and quantitative comparisons demonstrate the advantage of DragGAN over prior approaches in the tasks of image manipulation and point tracking. We also showcase the manipulation of real images through GAN inversion. This paper proposes DragGAN, an interactive image editing method that allows users to manipulate the pose, shape, expression, and layout of objects in GAN-generated images by dragging handle points to target points. Controllable image synthesis is crucial for real-world applications, and existing methods often lack flexibility, precision, or generality. DragGAN offers an intuitive and versatile solution by enabling precise control over pixel movement within the learned image manifold of a GAN. DragGAN employs a two-step optimization process: 1) motion supervision, where a shifted feature patch loss guides the handle points towards targets, and 2) point tracking, using nearest neighbor search in the GAN's feature space to accurately track handle point locations during manipulation. DragGAN achieves precise control over handle point movement, enabling manipulation of various spatial attributes across diverse object categories. The method effectively hallucinates occluded content and preserves object rigidity during deformation, indicating manipulation within the learned image manifold. DragGAN outperforms existing GAN manipulation and point tracking methods qualitatively and quantitatively, while maintaining interactive performance. Editing quality is limited by the diversity of the training data and the presence of texture-less regions. Potential misuse for creating fake images raises ethical concerns regarding personality rights and privacy. generative adversarial networks (gans), interactive image manipulation, point tracking, image editing, generative models
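The point-tracking component lends itself to a short sketch: after each motion-supervision update, the handle point is relocated to the pixel whose generator feature best matches the feature recorded at the initial handle location. The search radius and the choice of feature layer below are assumptions.

```python
import torch

def track_handle_point(feat_map, init_feat, cur_pos, radius=3):
    """Nearest-neighbour point tracking in the generator feature space, as
    described for DragGAN: search a small patch around the current handle
    position for the pixel whose feature is closest to the feature captured
    at the initial handle location.

    feat_map: (C, H, W) generator features; init_feat: (C,); cur_pos: (row, col).
    """
    C, H, W = feat_map.shape
    r0, c0 = cur_pos
    rs = slice(max(r0 - radius, 0), min(r0 + radius + 1, H))
    cs = slice(max(c0 - radius, 0), min(c0 + radius + 1, W))
    patch = feat_map[:, rs, cs]                            # (C, h, w)
    d = (patch - init_feat[:, None, None]).pow(2).sum(0)   # (h, w) squared L2
    idx = torch.argmin(d)                                  # flattened argmin
    dr, dc = divmod(idx.item(), patch.shape[2])
    return rs.start + dr, cs.start + dc                    # new handle position
```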
2305.10924 Report Structural Pruning for Diffusion Models Gongfan Fang, Xinyin Ma, Xinchao Wang Generative modeling has recently undergone remarkable advancements, primarily propelled by the transformative implications of Diffusion Probabilistic Models (DPMs). The impressive capability of these models, however, often entails significant computational overhead during both training and inference. To tackle this challenge, we present Diff-Pruning, an efficient compression method tailored for learning lightweight diffusion models from pre-existing ones, without the need for extensive re-training. The essence of Diff-Pruning is encapsulated in a Taylor expansion over pruned timesteps, a process that disregards non-contributory diffusion steps and ensembles informative gradients to identify important weights. Our empirical assessment, undertaken across several datasets highlights two primary benefits of our proposed method: 1) Efficiency: it enables approximately a 50\% reduction in FLOPs at a mere 10\% to 20\% of the original training expenditure; 2) Consistency: the pruned diffusion models inherently preserve generative behavior congruent with their pre-trained models. Code is available at \url{https://github.com/VainF/Diff-Pruning}. Presents Diff-Pruning, an efficient compression method for learning lightweight diffusion models from pre-existing ones, without extensive retraining. Addresses the challenge of significant computational overhead during training and inference in diffusion probabilistic models (DPMs). Employs Taylor expansion over pruned timesteps, discarding non-contributory diffusion steps and ensembling informative gradients to identify and remove unimportant weights. Achieves approximately 50% reduction in FLOPs with only 10% to 20% of the original training expenditure. Pruned models maintain generative behavior consistent with their pre-trained counterparts. Outperforms baseline pruning methods and scratch training in terms of FID and SSIM scores. Diffusion models exhibit sensitivity to model size reduction, requiring careful consideration of pruning ratios. Further research can explore enhancing generation quality and consistency of pruned models. diffusion models, model compression, network pruning, generative models, taylor expansion
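A hedged sketch of the Taylor-style importance scoring described above: gradients of the denoising loss are accumulated only over a chosen subset of timesteps and aggregated per convolutional output channel. The timestep-selection rule, the loss interface, and the channel-wise aggregation are assumptions of this sketch, not the paper's exact criterion.

```python
import torch

def taylor_channel_importance(model, diffusion_loss, batches, timesteps):
    """First-order Taylor importance in the spirit of Diff-Pruning:
    accumulate |w * dL/dw| over a chosen subset of diffusion timesteps,
    then aggregate per output channel of each conv layer."""
    scores = {}
    for x0, t in zip(batches, timesteps):
        model.zero_grad()
        diffusion_loss(model, x0, t).backward()   # denoising loss at step t (assumed interface)
        for name, p in model.named_parameters():
            if p.grad is None or p.dim() != 4:    # keep conv weights only
                continue
            s = (p.detach() * p.grad).abs().sum(dim=(1, 2, 3))  # per output channel
            scores[name] = scores.get(name, 0) + s
    return scores  # channels with the lowest scores are pruning candidates
```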
2305.10884 Report Meta-Auxiliary Network for 3D GAN Inversion Bangrui Jiang, Zhenhua Guo, Yujiu Yang Real-world image manipulation has achieved fantastic progress in recent years. GAN inversion, which aims to map the real image to the latent code faithfully, is the first step in this pipeline. However, existing GAN inversion methods fail to achieve high reconstruction quality and fast inference at the same time. In addition, existing methods are built on 2D GANs and lack explicit mechanisms to enforce multi-view consistency. In this work, we present a novel meta-auxiliary framework, while leveraging the newly developed 3D GANs as generator. The proposed method adopts a two-stage strategy. In the first stage, we invert the input image to an editable latent code using off-the-shelf inversion techniques. The auxiliary network is proposed to refine the generator parameters with the given image as input, which both predicts offsets for weights of convolutional layers and sampling positions of volume rendering. In the second stage, we perform meta-learning to fast adapt the auxiliary network to the input image, then the final reconstructed image is synthesized via the meta-learned auxiliary network. Extensive experiments show that our method achieves better performances on both inversion and editing tasks. This paper presents a novel meta-auxiliary framework for 3D GAN inversion, enabling high-quality reconstruction and fast inference for multi-view consistent image editing. Existing GAN inversion methods struggle to balance high reconstruction quality and fast inference. Additionally, they often lack explicit mechanisms for multi-view consistency, which is crucial for realistic 3D image editing. The method employs a two-stage strategy: 1) inverting an input image into an editable latent code using existing techniques and an auxiliary network to refine generator parameters; 2) using meta-learning to adapt the auxiliary network to the input image for fast, high-quality reconstruction. Achieves state-of-the-art reconstruction quality comparable to optimization-based methods but with significantly faster inference speed. Enables multi-view consistent image editing by leveraging a 3D-aware GAN generator. Demonstrates the effectiveness of meta-learning for fast adaptation of the auxiliary network to unseen images. The method primarily focuses on facial image editing due to the use of a face-specific 3D GAN generator. The performance on profile images can be further improved. gan inversion, 3d gan, meta-learning, image editing, multi-view consistency
2305.10855 Report TextDiffuser: Diffusion Models as Text Painters Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, Furu Wei Diffusion models have gained increasing attention for their impressive generation abilities but currently struggle with rendering accurate and coherent text. To address this issue, we introduce TextDiffuser, focusing on generating images with visually appealing text that is coherent with backgrounds. TextDiffuser consists of two stages: first, a Transformer model generates the layout of keywords extracted from text prompts, and then diffusion models generate images conditioned on the text prompt and the generated layout. Additionally, we contribute the first large-scale text images dataset with OCR annotations, MARIO-10M, containing 10 million image-text pairs with text recognition, detection, and character-level segmentation annotations. We further collect the MARIO-Eval benchmark to serve as a comprehensive tool for evaluating text rendering quality. Through experiments and user studies, we show that TextDiffuser is flexible and controllable to create high-quality text images using text prompts alone or together with text template images, and conduct text inpainting to reconstruct incomplete images with text. The code, model, and dataset will be available at \url{https://aka.ms/textdiffuser}. This paper introduces TextDiffuser, a novel two-stage diffusion model designed for generating images with visually appealing and coherent text. Existing diffusion models struggle to render accurate and coherent text, which is essential given the widespread use of text images in various applications. TextDiffuser employs a Layout Transformer to generate the layout of keywords from text prompts, then uses diffusion models to generate images conditioned on the text prompt and generated layout. This approach provides flexibility and control over the generation process, allowing for text inpainting and template-based generation. TextDiffuser outperforms existing methods in terms of text rendering quality, as demonstrated by quantitative metrics (FID, CLIPScore, OCR evaluation) and user studies on the MARIO-Eval benchmark. The authors introduce MARIO-10M, a large-scale text image dataset with 10 million image-text pairs and OCR annotations, to address the lack of specialized datasets for text rendering. TextDiffuser demonstrates controllability in text color through language descriptions, allowing for personalized text image generation. TextDiffuser exhibits limitations in generating images with small characters due to the use of VAE for image encoding. Generating images from long text with many keywords can lead to disordered and overlapped text layouts, potentially due to noisy training data with numerous keywords. text rendering, diffusion models, image generation, text inpainting, ocr
2305.10853 Report LDM3D: Latent Diffusion Model for 3D Gabriela Ben Melech Stan, Diana Wofk, Scottie Fox, Alex Redden, Will Saxton, Jean Yu, Estelle Aflalo, Shao-Yen Tseng, Fabio Nonato, Matthias Muller, Vasudev Lal This research paper proposes a Latent Diffusion Model for 3D (LDM3D) that generates both image and depth map data from a given text prompt, allowing users to generate RGBD images from text prompts. The LDM3D model is fine-tuned on a dataset of tuples containing an RGB image, depth map and caption, and validated through extensive experiments. We also develop an application called DepthFusion, which uses the generated RGB images and depth maps to create immersive and interactive 360-degree-view experiences using TouchDesigner. This technology has the potential to transform a wide range of industries, from entertainment and gaming to architecture and design. Overall, this paper presents a significant contribution to the field of generative AI and computer vision, and showcases the potential of LDM3D and DepthFusion to revolutionize content creation and digital experiences. A short video summarizing the approach can be found at https://t.ly/tdi2. This paper introduces LDM3D, a novel Latent Diffusion Model that generates RGB images and their corresponding depth maps from text prompts, enabling the creation of immersive 360° experiences. LDM3D advances generative AI by enabling the creation of more immersive and interactive content, pushing the boundaries of digital experience beyond traditional 2D representations. The authors fine-tuned Stable Diffusion v1.4 on a dataset of RGB images, depth maps (generated by DPT-Large), and captions, modifying the model architecture to process and generate both data types. They also developed DepthFusion, an application using TouchDesigner to create 360° views from LDM3D outputs. LDM3D achieves comparable image quality to Stable Diffusion v1.4 based on FID and CLIP similarity metrics. LDM3D generates depth maps with accuracy comparable to DPT-Large. DepthFusion successfully leverages LDM3D outputs to create immersive 360° experiences, demonstrating the potential for novel content creation. Fine-tuning the KL-autoencoder for RGBD data slightly decreased reconstruction quality compared to the pre-trained model on RGB data only. Future work includes exploring alternative autoencoder architectures to further improve reconstruction quality for better content generation. generative ai, diffusion models, depth estimation, 360° view synthesis, immersive experiences
2305.10764 Report OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding Minghua Liu, Ruoxi Shi, Kaiming Kuang, Yinhao Zhu, Xuanlin Li, Shizhong Han, Hong Cai, Fatih Porikli, Hao Su We introduce OpenShape, a method for learning multi-modal joint representations of text, image, and point clouds. We adopt the commonly used multi-modal contrastive learning framework for representation alignment, but with a specific focus on scaling up 3D representations to enable open-world 3D shape understanding. To achieve this, we scale up training data by ensembling multiple 3D datasets and propose several strategies to automatically filter and enrich noisy text descriptions. We also explore and compare strategies for scaling 3D backbone networks and introduce a novel hard negative mining module for more efficient training. We evaluate OpenShape on zero-shot 3D classification benchmarks and demonstrate its superior capabilities for open-world recognition. Specifically, OpenShape achieves a zero-shot accuracy of 46.8% on the 1,156-category Objaverse-LVIS benchmark, compared to less than 10% for existing methods. OpenShape also achieves an accuracy of 85.3% on ModelNet40, outperforming previous zero-shot baseline methods by 20% and performing on par with some fully-supervised methods. Furthermore, we show that our learned embeddings encode a wide range of visual and semantic concepts (e.g., subcategories, color, shape, style) and facilitate fine-grained text-3D and image-3D interactions. Due to their alignment with CLIP embeddings, our learned shape representations can also be integrated with off-the-shelf CLIP-based models for various applications, such as point cloud captioning and point cloud-conditioned image generation. This paper introduces OpenShape, a method for learning multi-modal joint representations of text, image, and point clouds for open-world 3D shape understanding. Existing 3D shape understanding methods are limited by the scale of training data and struggle with unseen categories, hindering real-world applications. OpenShape scales up training data by ensembling multiple 3D datasets, employs strategies for filtering and enriching noisy text descriptions, explores scaling 3D backbone networks, and introduces a hard negative mining module. OpenShape achieves superior zero-shot 3D classification accuracy on ModelNet40 (85.3%) and Objaverse-LVIS (46.8%). Learned embeddings encode rich visual and semantic concepts, enabling fine-grained text-3D and image-3D retrieval. OpenShape embeddings can be integrated with off-the-shelf CLIP-based models for point cloud captioning and image generation. Training data size is still limited compared to 2D counterparts. Current shape representations mainly focus on global features, lacking part-level information. 3d shape understanding, multi-modal representation learning, zero-shot learning, contrastive learning, open-world recognition
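The representation-alignment objective can be sketched as a symmetric InfoNCE loss that pulls a point-cloud embedding toward the frozen CLIP text and image embeddings of the same object. The temperature and the equal weighting of the two terms are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def clip_style_alignment_loss(point_emb, text_emb, image_emb, tau=0.07):
    """Align point-cloud embeddings with frozen CLIP text and image embeddings
    via symmetric InfoNCE, in the spirit of the contrastive framework above.
    All inputs: (B, D); matched rows are positives."""
    p = F.normalize(point_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    i = F.normalize(image_emb, dim=-1)
    labels = torch.arange(p.shape[0], device=p.device)

    def nce(a, b):
        logits = a @ b.t() / tau
        return 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.t(), labels))

    return 0.5 * (nce(p, t) + nce(p, i))
```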
2305.10683 Report Paxion: Patching Action Knowledge in Video-Language Foundation Models Zhenhailong Wang, Ansel Blume, Sha Li, Genglin Liu, Jaemin Cho, Zineng Tang, Mohit Bansal, Heng Ji Action knowledge involves the understanding of textual, visual, and temporal aspects of actions. We introduce the Action Dynamics Benchmark (ActionBench) containing two carefully designed probing tasks: Action Antonym and Video Reversal, which targets multimodal alignment capabilities and temporal understanding skills of the model, respectively. Despite recent video-language models' (VidLM) impressive performance on various benchmark tasks, our diagnostic tasks reveal their surprising deficiency (near-random performance) in action knowledge, suggesting that current models rely on object recognition abilities as a shortcut for action understanding. To remedy this, we propose a novel framework, Paxion, along with a new Discriminative Video Dynamics Modeling (DVDM) objective. The Paxion framework utilizes a Knowledge Patcher network to encode new action knowledge and a Knowledge Fuser component to integrate the Patcher into frozen VidLMs without compromising their existing capabilities. Due to limitations of the widely-used Video-Text Contrastive (VTC) loss for learning action knowledge, we introduce the DVDM objective to train the Knowledge Patcher. DVDM forces the model to encode the correlation between the action text and the correct ordering of video frames. Our extensive analyses show that Paxion and DVDM together effectively fill the gap in action knowledge understanding (~50% to 80%), while maintaining or improving performance on a wide spectrum of both object- and action-centric downstream tasks. The code and data will be made publicly available for research purposes at https://github.com/MikeWangWZHL/Paxion.git. This paper introduces ActionBench, a benchmark to evaluate action knowledge in video-language models, and proposes Paxion, a novel framework to enhance action knowledge in pretrained video-language models without hurting their general capabilities. Existing video-language models show a surprising lack of action knowledge, relying on object recognition as a shortcut, which limits their understanding of dynamic events. Paxion uses a Knowledge Patcher (KP) based on Perceivers to encode action knowledge and a Knowledge Fuser (KF) to integrate KP into frozen video-language model backbones. It introduces Discriminative Video Dynamics Modeling (DVDM) objectives, including Video-Action Contrastive (VAC) and Action-Temporal Matching (ATM) losses, to train KP. Video-language models perform near-randomly on ActionBench, highlighting their deficiency in action knowledge. Paxion with DVDM significantly improves performance on ActionBench, effectively patching the action knowledge gap. Paxion maintains or surpasses the original model's performance on object-centric and action-centric downstream tasks, demonstrating its ability to improve both object and action understanding. The paper focuses on patching only one type of knowledge (action knowledge). Future work includes exploring other types of physical knowledge (e.g., object affordances, mental simulation) and fusion with multiple learned Knowledge Patchers. action knowledge, video-language models, benchmarking, parameter-efficient fine-tuning, dynamics modeling
2305.10675 Report Tuned Contrastive Learning Chaitanya Animesh, Manmohan Chandraker In recent times, contrastive learning based loss functions have become increasingly popular for visual self-supervised representation learning owing to their state-of-the-art (SOTA) performance. Most of the modern contrastive learning methods generalize only to one positive and multiple negatives per anchor. A recent state-of-the-art, supervised contrastive (SupCon) loss, extends self-supervised contrastive learning to supervised setting by generalizing to multiple positives and negatives in a batch and improves upon the cross-entropy loss. In this paper, we propose a novel contrastive loss function -- Tuned Contrastive Learning (TCL) loss, that generalizes to multiple positives and negatives in a batch and offers parameters to tune and improve the gradient responses from hard positives and hard negatives. We provide theoretical analysis of our loss function's gradient response and show mathematically how it is better than that of SupCon loss. We empirically compare our loss function with SupCon loss and cross-entropy loss in supervised setting on multiple classification-task datasets to show its effectiveness. We also show the stability of our loss function to a range of hyper-parameter settings. Unlike SupCon loss which is only applied to supervised setting, we show how to extend TCL to self-supervised setting and empirically compare it with various SOTA self-supervised learning methods. Hence, we show that TCL loss achieves performance on par with SOTA methods in both supervised and self-supervised settings. This paper proposes Tuned Contrastive Learning (TCL) loss, a novel contrastive loss function generalizing to multiple positives and negatives in a batch for both supervised and self-supervised settings. TCL loss aims to overcome limitations of the SupCon loss by improving gradient responses from hard positives and hard negatives, leading to performance gains in representation learning. TCL loss introduces tunable parameters (k1, k2) to regulate gradient contributions from positives and negatives. It's evaluated against SupCon and cross-entropy losses in supervised settings and compared to SOTA self-supervised learning methods. TCL loss consistently outperforms SupCon loss and cross-entropy loss in supervised image classification tasks. TCL loss demonstrates stability across various hyperparameter settings, including encoder architectures, batch sizes, and augmentation strategies. In self-supervised settings, TCL loss, using positive triplets, outperforms SimCLR and achieves comparable performance to SOTA methods. Choosing TCL's parameters k1 and k2 relies on heuristics. Future work could explore loss objectives that inherently provide TCL's benefits without introducing additional parameters. contrastive learning, supervised learning, self-supervised learning, representation learning, loss function
2305.10579 Report MultiPlaneNeRF: Neural Radiance Field with Non-Trainable Representation Dominik Zimny, Artur Kasymov, Adam Kania, Jacek Tabor, Maciej Zięba, Przemysław Spurek NeRF is a popular model that efficiently represents 3D objects from 2D images. However, vanilla NeRF has some important limitations. NeRF must be trained on each object separately. The training time is long since we encode the object's shape and color in neural network weights. Moreover, NeRF does not generalize well to unseen data. In this paper, we present MultiPlaneNeRF -- a model that simultaneously solves the above problems. Our model works directly on 2D images. We project 3D points on 2D images to produce non-trainable representations. The projection step is not parametrized and a very shallow decoder can efficiently process the representation. Furthermore, we can train MultiPlaneNeRF on a large data set and force our implicit decoder to generalize across many objects. Consequently, we can only replace the 2D images (without additional training) to produce a NeRF representation of the new object. In the experimental section, we demonstrate that MultiPlaneNeRF achieves results comparable to state-of-the-art models for synthesizing new views and has generalization properties. Additionally, MultiPlane decoder can be used as a component in large generative models like GANs. This paper proposes MultiPlaneNeRF, a novel NeRF model that utilizes non-trainable representations of 3D objects using pre-existing 2D images, enabling efficient training of a small implicit decoder for view synthesis. Existing NeRF models suffer from long training times, lack of generalization to unseen data, and limitations in scalability. MultiPlaneNeRF aims to address these issues by leveraging fixed 2D images as planar representations. MultiPlaneNeRF projects 3D points onto a fixed set of 2D images to create non-trainable representations. A shallow decoder then aggregates color and position information from projected points to predict RGB colors and volume density, trained using a vanilla NeRF loss function. MultiPlaneNeRF achieves rendering results comparable to state-of-the-art models like NeRF and NSFV while using fewer trainable parameters. The model demonstrates generalization capabilities by synthesizing novel views of unseen objects from different classes by simply replacing the input image representation. The MultiPlane decoder can be integrated into larger generative architectures like GANs, yielding comparable results to models like EG3D with the benefit of interpretable representations. The trade-off between rendering quality and generalization properties requires further investigation. Future work can explore extending MultiPlaneNeRF to handle dynamic scenes. neural radiance fields, view synthesis, 3d object representation, generalization, multiplane representation
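A minimal sketch of the non-trainable representation: query points are projected onto a fixed set of posed images, colours are gathered by bilinear sampling, and a shallow MLP maps the concatenated colours and coordinates to RGB and density. The 3x4 projection-matrix convention, the number of views (8), and the decoder width are placeholder assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def multiplane_features(points, images, projections):
    """Project 3D query points onto K fixed images and gather colours, i.e. the
    non-trainable representation described above. `projections` are assumed to
    be 3x4 camera matrices mapping world points to pixel coordinates;
    `images` is (K, 3, H, W) in [0, 1]; `points` is (N, 3)."""
    K, _, H, W = images.shape
    pts_h = torch.cat([points, torch.ones_like(points[:, :1])], dim=-1)  # (N, 4)
    feats = []
    for k in range(K):
        uvw = pts_h @ projections[k].t()                  # (N, 3)
        uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)     # pixel coordinates
        grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,   # normalise to [-1, 1]
                            2 * uv[:, 1] / (H - 1) - 1], dim=-1)
        sampled = F.grid_sample(images[k:k + 1], grid.view(1, -1, 1, 2),
                                align_corners=True)       # (1, 3, N, 1)
        feats.append(sampled[0, :, :, 0].t())             # (N, 3)
    return torch.cat([points] + feats, dim=-1)            # (N, 3 + 3K)

# A shallow decoder then maps these features to colour and density (assuming K = 8 views):
decoder = torch.nn.Sequential(
    torch.nn.Linear(3 + 3 * 8, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, 4),  # r, g, b, sigma
)
```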
2305.10513 Report Learning Pose Image Manifolds Using Geometry-Preserving GANs and Elasticae Shenyuan Liang, Pavan Turaga, Anuj Srivastava This paper investigates the challenge of learning image manifolds, specifically pose manifolds, of 3D objects using limited training data. It proposes a DNN approach to manifold learning and for predicting images of objects for novel, continuous 3D rotations. The approach uses two distinct concepts: (1) Geometric Style-GAN (Geom-SGAN), which maps images to low-dimensional latent representations and maintains the (first-order) manifold geometry. That is, it seeks to preserve the pairwise distances between base points and their tangent spaces, and (2) uses Euler's elastica to smoothly interpolate between directed points (points + tangent directions) in the low-dimensional latent space. When mapped back to the larger image space, the resulting interpolations resemble videos of rotating objects. Extensive experiments establish the superiority of this framework in learning paths on rotation manifolds, both visually and quantitatively, relative to state-of-the-art GANs and VAEs. This paper proposes a novel deep neural network (DNN) framework for learning image manifolds of 3D objects, specifically focusing on pose manifolds, with limited training data. Learning pose manifolds is crucial for predicting object appearance under novel viewpoints and facilitates various applications in computer vision. The framework utilizes a two-step approach: (1) Geometric Style-GAN (Geom-SGAN) maps images to a low-dimensional latent space while preserving pairwise distances and tangent space geometry. (2) Euler's elastica interpolates between directed points in the latent space, generating smooth rotation paths when mapped back to the image space. The proposed method outperforms state-of-the-art GANs and VAEs in generating realistic and accurate rotation paths for various 3D objects. Quantitative evaluations using squared errors demonstrate the superior performance of the approach in preserving both image and tangent space geometry. The learnt pose manifolds can be utilized for applications such as image denoising by finding the nearest neighbor on the manifold. The current implementation primarily focuses on rotation manifolds and assumes other imaging conditions are fixed. Future work includes extending the framework to handle more complex transformations and exploring its application in other computer vision tasks. image manifolds, pose manifolds, generative models, deep learning, euler's elastica
2305.10474 Report Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming-Yu Liu, Yogesh Balaji Despite tremendous progress in generating high-quality images using diffusion models, synthesizing a sequence of animated frames that are both photorealistic and temporally coherent is still in its infancy. While off-the-shelf billion-scale datasets for image generation are available, collecting similar video data of the same scale is still challenging. Also, training a video diffusion model is computationally much more expensive than its image counterpart. In this work, we explore finetuning a pretrained image diffusion model with video data as a practical solution for the video synthesis task. We find that naively extending the image noise prior to video noise prior in video diffusion leads to sub-optimal performance. Our carefully designed video noise prior leads to substantially better performance. Extensive experimental validation shows that our model, Preserve Your Own Correlation (PYoCo), attains SOTA zero-shot text-to-video results on the UCF-101 and MSR-VTT benchmarks. It also achieves SOTA video generation quality on the small-scale UCF-101 benchmark with a $10\times$ smaller model using significantly less computation than the prior art. This paper proposes a new video diffusion noise prior, called Preserve Your Own COrrelation (PYoCo), tailored for fine-tuning text-to-image diffusion models for text-to-video generation. Training large-scale text-to-video diffusion models from scratch is computationally expensive and data-intensive. Leveraging pre-trained image diffusion models through fine-tuning is a practical alternative, but naively extending the image noise prior to video diffusion leads to sub-optimal performance. The authors analyze the correlation of noise maps in video frames and design a progressive noise model that captures temporal correlations. They fine-tune a pre-trained text-to-image diffusion model (eDiff-I) with the proposed noise prior and incorporate techniques like temporal attention, joint image-video fine-tuning, cascaded generation, and ensemble denoisers. PYoCo achieves state-of-the-art zero-shot text-to-video results on UCF-101 and MSR-VTT benchmarks. The method outperforms previous approaches on unconditional video generation on UCF-101 while using a significantly smaller model size and less computation. Ablation studies confirm that the proposed correlated noise model is superior to training from scratch or fine-tuning with an independent noise model. The paper primarily focuses on generating videos with short durations. The impact of different video datasets on model performance requires further investigation. video generation, diffusion models, text-to-video, noise prior, fine-tuning
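The correlated noise prior can be illustrated with a small generator in which each frame's noise carries over part of the previous frame's noise plus fresh Gaussian noise, normalised to unit variance. The mixing coefficient below is an assumption; the paper's exact mixed/progressive parameterisation is not reproduced here.

```python
import torch

def progressive_video_noise(n_frames, shape, alpha=1.0, device="cpu"):
    """Sample temporally correlated noise for a video clip: each frame reuses
    part of the previous frame's noise plus fresh Gaussian noise, with the
    mixture normalised so every frame stays ~N(0, I). `alpha` controls the
    correlation strength and is a free parameter of this sketch."""
    shared_w = alpha ** 2 / (1.0 + alpha ** 2)   # weight on the carried-over noise
    eps = [torch.randn(shape, device=device)]
    for _ in range(n_frames - 1):
        fresh = torch.randn(shape, device=device)
        eps.append(shared_w ** 0.5 * eps[-1] + (1.0 - shared_w) ** 0.5 * fresh)
    return torch.stack(eps)  # (n_frames, *shape)
```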
2305.10456 Report LPMM: Intuitive Pose Control for Neural Talking-Head Model via Landmark-Parameter Morphable Model Kwangho Lee, Patrick Kwon, Myung Ki Lee, Namhyuk Ahn, Junsoo Lee While current talking head models are capable of generating photorealistic talking head videos, they provide limited pose controllability. Most methods require specific video sequences that should exactly contain the head pose desired, being far from user-friendly pose control. Three-dimensional morphable models (3DMM) offer semantic pose control, but they fail to capture certain expressions. We present a novel method that utilizes parametric control of head orientation and facial expression over a pre-trained neural-talking head model. To enable this, we introduce a landmark-parameter morphable model (LPMM), which offers control over the facial landmark domain through a set of semantic parameters. Using LPMM, it is possible to adjust specific head pose factors, without distorting other facial attributes. The results show our approach provides intuitive rig-like control over neural talking head models, allowing both parameter and image-based inputs. This paper introduces LPMM (Landmark-Parameter Morphable Model), a method for intuitive pose control of neural talking-head models using semantic parameters. Current talking-head models lack user-friendly pose control, often requiring specific video sequences or offering limited semantic control. The method involves training an LP-regressor to extract LPMM parameters from facial images and an LP-adaptor to convert these parameters into latent codes for a pre-trained talking-head generator. LPMM enables independent control of facial expressions and head orientation using semantic parameters. The method allows both parameter-based and image-based pose control, providing flexibility to users. Evaluation shows superior performance compared to existing methods like StyleRig, especially for in-plane rotations and complex expressions. Discovering parameter combinations for intuitive control of complex expressions is an open challenge. The method's reliance on a pre-trained talking-head model limits its generalizability to unseen identities. talking-head synthesis, pose control, facial reenactment, landmark-parameter morphable model, semantic control
2305.10438 Report IMAGINATOR: Pre-Trained Image+Text Joint Embeddings using Word-Level Grounding of Images Varuna Krishna, S Suryavardan, Shreyash Mishra, Sathyanarayanan Ramamoorthy, Parth Patwa, Megha Chakraborty, Aman Chadha, Amitava Das, Amit Sheth Word embeddings, i.e., semantically meaningful vector representation of words, are largely influenced by the distributional hypothesis "You shall know a word by the company it keeps" (Harris, 1954), whereas modern prediction-based neural network embeddings rely on design choices and hyperparameter optimization. Word embeddings like Word2Vec, GloVe etc. well capture the contextuality and real-world analogies but contemporary convolution-based image embeddings such as VGGNet, AlexNet, etc. do not capture contextual knowledge. The popular king-queen analogy does not hold true for most commonly used vision embeddings. In this paper, we introduce a pre-trained joint embedding (JE), named IMAGINATOR, trained on 21K distinct image objects level from 1M image+text pairs. JE is a way to encode multimodal data into a vector space where the text modality serves as the grounding key, which the complementary modality (in this case, the image) is anchored with. IMAGINATOR encapsulates three individual representations: (i) object-object co-location, (ii) word-object co-location, and (iii) word-object correlation. These three ways capture complementary aspects of the two modalities which are further combined to obtain the final JEs. Generated JEs are intrinsically evaluated to assess how well they capture the contextuality and real-world analogies. We also evaluate pre-trained IMAGINATOR JEs on three downstream tasks: (i) image captioning, (ii) Image2Tweet, and (iii) text-based image retrieval. IMAGINATOR establishes a new standard on the aforementioned downstream tasks by outperforming the current SoTA on all the selected tasks. IMAGINATOR will be made publicly available. The codes are available at https://github.com/varunakk/IMAGINATOR This paper introduces IMAGINATOR, a pre-trained joint embedding model for vision-language tasks that captures contextual relationships between words and objects. Current image embeddings struggle to capture contextuality and real-world analogies, hindering performance in tasks like image captioning and retrieval. IMAGINATOR aims to address this by incorporating distributional semantics from NLP. IMAGINATOR utilizes an object detection model (Detic) and leverages three co-location matrices (object-object, word-object, word-word) to learn joint embeddings. It uses PPMI, context distribution smoothing, SVD, and eigenvalue weighting to enhance representation quality. IMAGINATOR outperforms state-of-the-art models on intrinsic evaluations of word contextuality and image similarity. It achieves superior performance on downstream tasks including image captioning, Image2Tweet, and text-based image retrieval. The proposed $BERT_{IMAGINATOR}$ architecture effectively leverages joint embeddings for compositional understanding of image-text pairs. The performance of IMAGINATOR is limited by the capabilities of existing object detection techniques, which only identify a limited set of objects. Future work includes exploring contrastive learning to improve object representations and investigating vision transformers with positional encoding for finer-grained cross-modal connections. joint embeddings, multimodal learning, image captioning, image retrieval, distributional semantics
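The counting-based embedding machinery mentioned above (PPMI, context-distribution smoothing, SVD, eigenvalue weighting) is standard and can be sketched directly; the hyperparameters below are common defaults, and the exact co-location matrix construction used in the paper may differ.

```python
import numpy as np

def ppmi_svd_embeddings(cooc, dim=300, cds=0.75, eig_weight=0.5):
    """Turn a (rows x cols) co-location count matrix into dense row embeddings
    via PPMI with context-distribution smoothing, truncated SVD, and
    eigenvalue weighting — the standard counting-based pipeline."""
    cooc = np.asarray(cooc, dtype=float)
    total = cooc.sum()
    p_row = cooc.sum(axis=1, keepdims=True) / total
    smoothed = cooc.sum(axis=0, keepdims=True) ** cds      # context smoothing
    p_col = smoothed / smoothed.sum()
    pmi = np.log(((cooc / total) + 1e-12) / ((p_row @ p_col) + 1e-12))
    ppmi = np.maximum(pmi, 0.0)                             # positive PMI
    U, S, _ = np.linalg.svd(ppmi, full_matrices=False)
    k = min(dim, len(S))
    return U[:, :k] * (S[:k] ** eig_weight)                 # weighted row embeddings
```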
2305.10431 Report FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention Guangxuan Xiao, Tianwei Yin, William T. Freeman, Frédo Durand, Song Han Diffusion models excel at text-to-image generation, especially in subject-driven generation for personalized images. However, existing methods are inefficient due to the subject-specific fine-tuning, which is computationally intensive and hampers efficient deployment. Moreover, existing methods struggle with multi-subject generation as they often blend features among subjects. We present FastComposer which enables efficient, personalized, multi-subject text-to-image generation without fine-tuning. FastComposer uses subject embeddings extracted by an image encoder to augment the generic text conditioning in diffusion models, enabling personalized image generation based on subject images and textual instructions with only forward passes. To address the identity blending problem in the multi-subject generation, FastComposer proposes cross-attention localization supervision during training, enforcing the attention of reference subjects localized to the correct regions in the target images. Naively conditioning on subject embeddings results in subject overfitting. FastComposer proposes delayed subject conditioning in the denoising step to maintain both identity and editability in subject-driven image generation. FastComposer generates images of multiple unseen individuals with different styles, actions, and contexts. It achieves 300$\times$-2500$\times$ speedup compared to fine-tuning-based methods and requires zero extra storage for new subjects. FastComposer paves the way for efficient, personalized, and high-quality multi-subject image creation. Code, model, and dataset are available at https://github.com/mit-han-lab/fastcomposer. FastComposer is a tuning-free method for personalized, multi-subject text-to-image generation that uses a pre-trained vision encoder to achieve efficiency and accessibility. Existing methods for subject-driven text-to-image generation are inefficient due to subject-specific fine-tuning and struggle with multi-subject generation due to identity blending. The paper proposes to augment text prompts with visual features from reference subject images using a pre-trained vision encoder and a multi-layer perceptron (MLP). They introduce cross-attention localization during training to prevent identity blending and delayed subject conditioning to balance identity preservation with text guidance. FastComposer achieves state-of-the-art performance on both single-subject and multi-subject image generation benchmarks. It significantly outperforms optimization-based methods (DreamBooth, Textual Inversion, Custom Diffusion) in identity preservation while maintaining competitive prompt consistency. FastComposer is significantly faster and more memory-efficient than fine-tuning-based approaches, achieving 300x-2500x speedup and 2.8x-6.7x memory saving. The current training dataset (FFHQ) is limited in size and diversity, mainly containing headshots of human faces. The model is primarily human-centric due to the scarcity of large-scale, multi-subject datasets featuring other subjects like animals. text-to-image generation, diffusion models, personalization, multi-subject generation, tuning-free
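Two of the ideas above, subject-embedding augmentation and delayed subject conditioning, can be sketched as follows. The fusion-by-replacement at the subject token positions, the MLP shape, and the timestep threshold are assumptions of this sketch, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SubjectAugmenter(nn.Module):
    """Fuse image-encoder subject features into the text token embeddings.
    Replacing the embedding at the subject token position is an assumption
    of this sketch."""
    def __init__(self, img_dim=1024, txt_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(img_dim + txt_dim, txt_dim),
                                 nn.GELU(), nn.Linear(txt_dim, txt_dim))

    def forward(self, text_emb, subject_feats, subject_token_idx):
        # text_emb: (B, L, txt_dim); subject_feats: (B, img_dim);
        # subject_token_idx: (B,) position of the subject token per prompt
        b = torch.arange(text_emb.size(0), device=text_emb.device)
        out = text_emb.clone()
        tok = text_emb[b, subject_token_idx]
        out[b, subject_token_idx] = self.mlp(torch.cat([subject_feats, tok], dim=-1))
        return out

def choose_conditioning(t, t_start, plain_emb, augmented_emb):
    """Delayed subject conditioning: early (high-noise) steps use the plain text
    embedding to preserve layout/editability, later steps switch to the
    subject-augmented embedding to lock in identity. The switch point is tunable."""
    return plain_emb if t > t_start else augmented_emb
```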
2305.10293 Report Infinite Class Mixup Thomas Mensink, Pascal Mettes Mixup is a widely adopted strategy for training deep networks, where additional samples are augmented by interpolating inputs and labels of training pairs. Mixup has shown to improve classification performance, network calibration, and out-of-distribution generalisation. While effective, a cornerstone of Mixup, namely that networks learn linear behaviour patterns between classes, is only indirectly enforced since the output interpolation is performed at the probability level. This paper seeks to address this limitation by mixing the classifiers directly instead of mixing the labels for each mixed pair. We propose to define the target of each augmented sample as a uniquely new classifier, whose parameters are a linear interpolation of the classifier vectors of the input pair. The space of all possible classifiers is continuous and spans all interpolations between classifier pairs. To make optimisation tractable, we propose a dual-contrastive Infinite Class Mixup loss, where we contrast the classifier of a mixed pair to both the classifiers and the predicted outputs of other mixed pairs in a batch. Infinite Class Mixup is generic in nature and applies to many variants of Mixup. Empirically, we show that it outperforms standard Mixup and variants such as RegMixup and Remix on balanced, long-tailed, and data-constrained benchmarks, highlighting its broad applicability. This paper proposes Infinite Class Mixup, a novel training strategy that improves upon traditional Mixup by directly interpolating image classifiers instead of just label probabilities. Traditional Mixup indirectly enforces linear behavior between classes by interpolating labels at the probability level. This paper argues that directly interpolating classifiers provides a stronger and more direct enforcement, leading to better generalization. The method defines a unique classifier for each mixed image pair by linearly interpolating classifier vectors of original classes. To handle the infinite possibilities of interpolated classifiers, a dual-contrastive loss function is introduced, contrasting each mixed pair against other classifiers and mixed images within the same batch. Infinite Class Mixup consistently outperforms standard Mixup and its variants like RegMixup and Remix across various benchmarks, especially in data-constrained and imbalanced settings. The dual-contrastive loss function, employing both class-axis and pair-axis contrastive learning, is shown to be crucial for the effectiveness of Infinite Class Mixup. Analyses reveal that Infinite Class Mixup leads to lower confidence for ambiguous interpolations and better differentiation between interpolated classes compared to standard Mixup. The paper primarily focuses on image classification tasks. Exploring its generalization to other data modalities like point clouds or graphs remains a potential future direction. Further investigation into the impact of different contrastive learning strategies and their potential benefits for specific tasks could be valuable. mixup, deep learning, image classification, contrastive learning, data augmentation
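A hedged sketch of the core idea: each mixed sample is paired with its own interpolated classifier vector, and the batch of mixed samples is contrasted against the batch of mixed classifiers along both axes. Scaling and normalisation choices here are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def infinite_class_mixup_loss(features, classifier, y_a, y_b, lam):
    """Dual-contrastive sketch for Infinite Class Mixup: classifiers are mixed
    instead of labels. `classifier` is the (num_classes, D) weight matrix,
    `features` the encodings of the mixed inputs, `lam` the mixing coefficients
    per pair; temperature/scaling are omitted here."""
    w_mix = lam[:, None] * classifier[y_a] + (1 - lam[:, None]) * classifier[y_b]  # (B, D)
    logits = features @ w_mix.t()                      # (B mixed samples, B mixed classifiers)
    target = torch.arange(features.size(0), device=features.device)
    loss_pair = F.cross_entropy(logits, target)        # each sample vs. all mixed classifiers
    loss_class = F.cross_entropy(logits.t(), target)   # each classifier vs. all mixed samples
    return 0.5 * (loss_pair + loss_class)
```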
2305.10223 Report NAI$_2$: Learning Noise-Aware Illumination-Interpolator for Unsupervised Low-Light Image Enhancement Xiaofeng Liu, Jiaxin Gao, Xin Fan, Risheng Liu Contemporary Low-Light Image Enhancement (LLIE) techniques have made notable advancements in preserving image details and enhancing contrast, achieving commendable results on specific datasets. Nevertheless, these approaches encounter persistent challenges in efficiently mitigating dynamic noise and accommodating diverse low-light scenarios. Insufficient constraints on complex pixel-wise mapping learning lead to overfitting to specific types of noise and artifacts associated with low-light conditions, reducing effectiveness in variable lighting scenarios. To this end, we first propose a method for estimating the noise level in low light images in a quick and accurate way. This facilitates precise denoising, prevents over-smoothing, and adapts to dynamic noise patterns. Subsequently, we devise a Learnable Illumination Interpolator (LII), which employs learnable interpolation operations between the input and unit vector to satisfy general constraints between illumination and input. Finally, we introduce a self-regularization loss that incorporates intrinsic image properties and essential visual attributes to guide the output towards meeting human visual expectations. Comprehensive experiments validate the competitiveness of our proposed algorithm in both qualitative and quantitative assessments. Notably, our noise estimation method, with linear time complexity and suitable for various denoisers, significantly improves both denoising and enhancement performance. Benefiting from this, our approach achieves a 0.675dB PSNR improvement on the LOL dataset and 0.818dB on the MIT dataset on LLIE task, even compared to supervised methods. This paper proposes NAI$_2$, a novel unsupervised Low-Light Image Enhancement (LLIE) method employing a denoising-first and enhancing-later pipeline. Existing LLIE techniques struggle to effectively mitigate dynamic noise and adapt to diverse low-light scenarios, often overfitting to specific datasets or noise types. NAI$_2$ leverages a novel noise estimation method based on high-order image gradients for precise denoising. It then uses a Learnable Illumination Interpolator (LII) with a self-regularization loss based on natural image statistics to ensure natural color and illumination. NAI$_2$ achieves state-of-the-art performance on benchmark datasets like MIT and LOL, surpassing supervised methods in some cases. The proposed noise estimation method significantly improves denoising efficacy and efficiency compared to traditional methods. LII, with its inherent structure constraint, ensures smooth yet structure-aware illumination maps, leading to visually pleasing enhancements. The noise estimation method currently focuses on Gaussian noise and requires further exploration for other noise types. Future work will investigate incorporating data inline distribution for enhanced performance. low-light image enhancement, noise estimation, illumination interpolation, unsupervised learning, image restoration
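The paper's gradient-based noise estimator is not specified in this summary, but a classical linear-time estimator built from second-order image differences (Immerkaer-style) conveys the flavour of fast, gradient-based noise-level estimation; it is an illustrative stand-in, not the proposed method.

```python
import numpy as np
from scipy.signal import convolve2d

def fast_noise_sigma(gray):
    """Classical linear-time Gaussian noise-level estimate from second-order
    image differences (Immerkaer-style). `gray` is a 2D float array."""
    H, W = gray.shape
    kernel = np.array([[1, -2, 1],
                       [-2, 4, -2],
                       [1, -2, 1]], dtype=float)
    resp = convolve2d(gray, kernel, mode="valid")
    return np.sqrt(np.pi / 2.0) * np.abs(resp).sum() / (6.0 * (H - 2) * (W - 2))
```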
2305.10210 Report Object Re-Identification from Point Clouds Benjamin Thérien, Chengjie Huang, Adrian Chow, Krzysztof Czarnecki Object re-identification (ReID) from images plays a critical role in application domains of image retrieval (surveillance, retail analytics, etc.) and multi-object tracking (autonomous driving, robotics, etc.). However, systems that additionally or exclusively perceive the world from depth sensors are becoming more commonplace without any corresponding methods for object ReID. In this work, we fill the gap by providing the first large-scale study of object ReID from point clouds and establishing its performance relative to image ReID. To enable such a study, we create two large-scale ReID datasets with paired image and LiDAR observations and propose a lightweight matching head that can be concatenated to any set or sequence processing backbone (e.g., PointNet or ViT), creating a family of comparable object ReID networks for both modalities. Run in Siamese style, our proposed point cloud ReID networks can make thousands of pairwise comparisons in real-time ($10$ Hz). Our findings demonstrate that their performance increases with higher sensor resolution and approaches that of image ReID when observations are sufficiently dense. Our strongest network trained at the largest scale achieves ReID accuracy exceeding $90\%$ for rigid objects and $85\%$ for deformable objects (without any explicit skeleton normalization). To our knowledge, we are the first to study object re-identification from real point cloud observations. This paper presents the first large-scale study of object re-identification (ReID) from point clouds, comparing its performance to image-based ReID. Object ReID is crucial for applications like multi-object tracking in autonomous driving and robotics, and using point clouds from LiDAR sensors can offer advantages over traditional image-based methods, especially as depth sensor resolution increases. The authors create two large-scale ReID datasets with paired image and LiDAR data from nuScenes and Waymo Open Dataset. They propose a lightweight, real-time matching head (RTMM) that can be used with various point cloud processing backbones (PointNet, DGCNN, Point-Transformer) for pairwise object comparisons. Point cloud ReID performance approaches image ReID with sufficiently dense point clouds. Performance improves significantly with higher LiDAR sensor resolution, suggesting a promising future for point cloud ReID. ReID accuracy exceeding 90% for rigid objects and 85% for deformable objects is achieved with their best model. The study is limited by computational resources, preventing training on all possible data samples. Future work can explore fusing LiDAR and camera data, and incorporating geometric priors for improved performance. object re-identification, point cloud, lidar, autonomous driving, multi-object tracking
2305.10028 Report Pyramid Diffusion Models For Low-light Image Enhancement Dewei Zhou, Zongxin Yang, Yi Yang Recovering noise-covered details from low-light images is challenging, and the results given by previous methods leave room for improvement. Recent diffusion models show realistic and detailed image generation through a sequence of denoising refinements and motivate us to introduce them to low-light image enhancement for recovering realistic details. However, we found two problems when doing this, i.e., 1) diffusion models keep constant resolution in one reverse process, which limits the speed; 2) diffusion models sometimes result in global degradation (e.g., RGB shift). To address the above problems, this paper proposes a Pyramid Diffusion model (PyDiff) for low-light image enhancement. PyDiff uses a novel pyramid diffusion method to perform sampling in a pyramid resolution style (i.e., progressively increasing resolution in one reverse process). Pyramid diffusion makes PyDiff much faster than vanilla diffusion models and introduces no performance degradation. Furthermore, PyDiff uses a global corrector to alleviate the global degradation that may occur in the reverse process, significantly improving the performance and making the training of diffusion models easier with little additional computational consumption. Extensive experiments on popular benchmarks show that PyDiff achieves superior performance and efficiency. Moreover, PyDiff can generalize well to unseen noise and illumination distributions. This paper proposes PyDiff, a novel pyramid diffusion model for low-light image enhancement that achieves state-of-the-art performance and efficiency. Existing methods for low-light image enhancement struggle to recover fine details often resulting in blurred outputs. Diffusion models excel at generating realistic details through iterative refinement, making them suitable for this task. PyDiff utilizes a pyramid diffusion method that performs sampling at progressively increasing resolutions within a single reverse process, leading to significant speed improvements. It also introduces a global corrector to alleviate global degradations like RGB shifts often occurring in diffusion models. PyDiff achieves state-of-the-art performance on popular benchmarks like LOL and LOLv2, outperforming previous methods in both quantitative metrics and visual quality. The pyramid diffusion method significantly accelerates inference, making PyDiff nearly twice as fast as the previous state-of-the-art method LLFlow. PyDiff exhibits strong generalization capabilities, effectively handling unseen noise and illumination distributions. The global corrector, while effective, introduces an additional hyperparameter (correction threshold) that requires tuning. The current implementation of PyDiff focuses on single-image enhancement. Exploring its potential for video enhancement could be a promising direction. low-light image enhancement, diffusion models, pyramid diffusion, global corrector, image restoration
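The pyramid-resolution idea can be sketched as a single reverse pass in which the running sample is upsampled at selected timesteps. The upsampling schedule, the handling of noise statistics after upsampling, and the `denoise_step` interface are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pyramid_reverse(denoise_step, x_T, timesteps, upsample_at, scale=2):
    """One reverse diffusion pass that starts at low resolution and upsamples
    the running sample at selected timesteps, in the pyramid-resolution spirit
    described above. `denoise_step(x, t)` is assumed to return x_{t-1}; how
    the noise statistics are corrected after upsampling is not modelled here."""
    x = x_T
    for t in timesteps:                      # e.g. reversed(range(T))
        if t in upsample_at:                 # grow the resolution mid-trajectory
            x = F.interpolate(x, scale_factor=scale, mode="bilinear",
                              align_corners=False)
        x = denoise_step(x, t)
    return x
```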
2305.09967 Report Variable Length Embeddings Johnathan Chiu, Andi Gu, Matt Zhou In this work, we introduce a novel deep learning architecture, Variable Length Embeddings (VLEs), an autoregressive model that can produce a latent representation composed of an arbitrary number of tokens. As a proof of concept, we demonstrate the capabilities of VLEs on tasks that involve reconstruction and image decomposition. We evaluate our experiments on a mix of the iNaturalist and ImageNet datasets and find that VLEs achieve comparable reconstruction results to a state of the art VAE, using less than a tenth of the parameters. This paper introduces Variable Length Embeddings (VLEs), an autoregressive model that generates a latent representation with a variable number of tokens, allowing for flexible and efficient image representation. VLEs offer a more efficient and interpretable way to represent images compared to traditional fixed-dimensional autoencoders, potentially benefiting downstream tasks like classification, captioning, and generative modeling. The authors develop two VLE variants: vanilla VLE, which focuses on pixel-level reconstruction, and masked VLE, which introduces a masking mechanism to encourage semantically meaningful token representation. Both variants are trained in a self-supervised manner. VLEs achieve comparable reconstruction performance to state-of-the-art VAEs with significantly fewer parameters (less than one-tenth). Vanilla VLE demonstrates a strong dependence on pixel distribution complexity, while masked VLE shows potential for capturing semantically distinct objects. Masked VLE exhibits promising results in decomposing images into human-interpretable masks, highlighting its potential for downstream tasks. The current masking mechanism in masked VLE, while promising, could be further improved by incorporating image segmentation or saliency priors. Future work includes exploring the integration of other modalities, such as image captioning, to enhance the model's understanding of contextual relationships. autoencoders, variable length embeddings, image representation learning, unsupervised learning, generative modeling
2305.09900 Report Efficient Equivariant Transfer Learning from Pretrained Models Sourya Basu, Pulkit Katdare, Prasanna Sattigeri, Vijil Chenthamarakshan, Katherine Driggs-Campbell, Payel Das, Lav R. Varshney Efficient transfer learning algorithms are key to the success of foundation models on diverse downstream tasks even with limited data. Recent works of Basu et al. (2023) and Kaba et al. (2022) propose group averaging (equitune) and optimization-based methods, respectively, over features from group-transformed inputs to obtain equivariant outputs from non-equivariant neural networks. While Kaba et al. (2022) are only concerned with training from scratch, we find that equitune performs poorly on equivariant zero-shot tasks despite good finetuning results. We hypothesize that this is because pretrained models provide better quality features for certain transformations than others and simply averaging them is deleterious. Hence, we propose {\lambda}-equitune that averages the features using importance weights, {\lambda}s. These weights are learned directly from the data using a small neural network, leading to excellent zero-shot and finetuned results that outperform equitune. Further, we prove that {\lambda}-equitune is equivariant and a universal approximator of equivariant functions. Additionally, we show that the method of Kaba et al. (2022) used with appropriate loss functions, which we call equizero, also gives excellent zero-shot and finetuned performance. Both equitune and equizero are special cases of {\lambda}-equitune. To show the simplicity and generality of our method, we validate on a wide range of diverse applications and models such as 1) image classification using CLIP, 2) deep Q-learning, 3) fairness in natural language generation (NLG), 4) compositional generalization in languages, and 5) image classification using pretrained CNNs such as Resnet and Alexnet. This paper introduces lambda-equitune, a method for improving the zero-shot and fine-tuning performance of pretrained models on equivariant tasks by learning importance weights for features extracted from group-transformed inputs. Efficient transfer learning algorithms are crucial for leveraging pretrained models in diverse downstream tasks with limited data, especially those exhibiting group equivariance. Lambda-equitune extends the concept of equitune by incorporating learnable importance weights, lambda, assigned to features obtained from group-transformed inputs. These weights are learned directly from the data using a small neural network and are used for weighted group averaging. Lambda-equitune outperforms equitune and is competitive with equizero, another proposed method based on optimizing a proxy loss function over group transformations, in zero-shot learning. For fine-tuning, lambda-equitune often surpasses both equitune and equizero. Both lambda-equitune and equizero demonstrate superior performance compared to non-equivariant pretrained models and equitune across various tasks, including image classification, deep Q-learning, fairness in natural language generation, compositional generalization in languages, and image classification using pretrained CNNs. The current work focuses on finite groups; extending it to continuous groups requires further research. Future work can explore optimizing the design of equality and neutral sets used in fairness tasks for more general demographic groups. equivariant deep learning, transfer learning, zero-shot learning, fine-tuning, group equivariance
2305.09847 Report Selective Guidance: Are All the Denoising Steps of Guided Diffusion Important? Pareesa Ameneh Golnari, Zhewei Yao, Yuxiong He This study examines the impact of optimizing the Stable Diffusion (SD) guided inference pipeline. We propose optimizing certain denoising steps by limiting the noise computation to conditional noise and eliminating unconditional noise computation, thereby reducing the complexity of the target iterations by 50%. Additionally, we demonstrate that later iterations of the SD are less sensitive to optimization, making them ideal candidates for applying the suggested optimization. Our experiments show that optimizing the last 20% of the denoising loop iterations results in an 8.2% reduction in inference time with almost no perceivable changes to the human eye. Furthermore, we found that by extending the optimization to 50% of the last iterations, we can reduce inference time by approximately 20.3%, while still generating visually pleasing images. This paper proposes an optimization method for the Stable Diffusion (SD) guided inference pipeline by selectively eliminating the computation of unconditional noise in later denoising steps. This optimization aims to reduce the inference time of SD without significantly impacting the perceived quality of the generated images. The authors analyze the sensitivity of different denoising iterations and find that later iterations are less sensitive to optimization. They then propose limiting the noise computation to only the conditional noise in a certain percentage of the last iterations, effectively reducing the computational complexity. Optimizing the last 20% of denoising iterations results in an 8.2% reduction in inference time with almost no noticeable difference in image quality. Extending the optimization to 50% of the last iterations achieves a 20.3% speedup while still maintaining visually pleasing images. Further tuning of the guidance scale can compensate for detail loss when applying aggressive optimization. The paper mainly focuses on a single diffusion model (SD) and its performance on a limited set of prompts. More extensive user studies are needed to thoroughly evaluate the impact of the optimization on perceived image quality. stable diffusion, guided diffusion, optimization, inference time, image generation
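The optimization reduces to a small change in the standard classifier-free-guidance loop: skip the unconditional forward pass for the final fraction of denoising steps. A minimal sketch, with a toy `eps_model` and a stand-in sampler update (both illustrative, not the Stable Diffusion implementation):

```python
import numpy as np

def eps_model(x, t, cond):
    # Placeholder noise predictor; cond=None means the unconditional (null) branch.
    rng = np.random.default_rng(t if cond is None else t + 10_000)
    return rng.standard_normal(x.shape).astype(np.float32)

def guided_sampling(steps=50, guidance_scale=7.5, skip_uncond_frac=0.2):
    x = np.random.randn(4, 64, 64).astype(np.float32)
    cutoff = int(steps * (1.0 - skip_uncond_frac))
    for i, t in enumerate(range(steps, 0, -1)):
        eps_cond = eps_model(x, t, cond="prompt")
        if i < cutoff:
            eps_uncond = eps_model(x, t, cond=None)
            eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        else:
            # Late iterations: drop the unconditional pass (halves the work here).
            eps = eps_cond
        x = x - 0.01 * eps  # stand-in for the actual sampler update
    return x

latents = guided_sampling()
```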
2305.09828 Report Mimetic Initialization of Self-Attention Layers Asher Trockman, J. Zico Kolter It is notoriously difficult to train Transformers on small datasets; typically, large pre-trained models are instead used as the starting point. We explore the weights of such pre-trained Transformers (particularly for vision) to attempt to find reasons for this discrepancy. Surprisingly, we find that simply initializing the weights of self-attention layers so that they "look" more like their pre-trained counterparts allows us to train vanilla Transformers faster and to higher final accuracies, particularly on vision tasks such as CIFAR-10 and ImageNet classification, where we see gains in accuracy of over 5% and 4%, respectively. Our initialization scheme is closed form, learning-free, and very simple: we set the product of the query and key weights to be approximately the identity, and the product of the value and projection weights to approximately the negative identity. As this mimics the patterns we saw in pre-trained Transformers, we call the technique "mimetic initialization". This paper introduces "mimetic initialization," a learning-free initialization technique for Transformers that mimics patterns observed in pretrained models, leading to improved training and performance, especially on vision tasks. Transformers often struggle to train on small datasets compared to CNNs. This work aims to improve the training of vanilla Transformers on such datasets without relying on extensive pretraining or architectural modifications. The authors observed that pretrained Vision Transformers exhibit specific weight patterns: the product of query and key weights approximates the identity, while the product of value and projection weights approximates the negative identity. They propose a closed-form initialization scheme that replicates these patterns using the singular value decomposition. Mimetic initialization significantly improves CIFAR-10 classification accuracy for various ViT architectures, with gains up to 7.77%. It also benefits ImageNet training, particularly in a ResNet-style pipeline, showing up to 4.1% accuracy improvement for ViT-Tiny. The method shows modest gains on language modeling tasks like WikiText-103, suggesting potential for broader application. The study primarily focuses on vision tasks, with limited exploration on language models. Further investigation is needed to understand its full potential for language applications. The hyperparameter tuning of alpha and beta, which control the diagonal prominence in the initialization, is not extensively discussed. A more detailed analysis of their impact could be beneficial. transformer, initialization, vision transformer, image classification, language modeling
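Because the initialization is closed form, it can be sketched directly: pick a target product (approximately the identity for query/key, approximately the negative identity for value/projection) and split it into two factors with a truncated SVD. The `factor_target` helper, the alpha/beta values, and the noise term below are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def factor_target(A, k):
    # Split a target product A into W1 @ W2.T using its top-k SVD components
    # (the best rank-k approximation of A).
    U, S, Vt = np.linalg.svd(A)
    W1 = U[:, :k] * np.sqrt(S[:k])
    W2 = Vt[:k].T * np.sqrt(S[:k])
    return W1, W2

def mimetic_init(d=256, d_head=64, alpha=0.7, beta=0.3, seed=0):
    rng = np.random.default_rng(seed)
    Z1 = rng.standard_normal((d, d)) / np.sqrt(d)
    Z2 = rng.standard_normal((d, d)) / np.sqrt(d)
    # Query/key product targets +identity; value/projection targets -identity.
    Wq, Wk = factor_target(alpha * np.eye(d) + beta * Z1, d_head)
    Wv, Wp = factor_target(-alpha * np.eye(d) + beta * Z2, d_head)
    return Wq, Wk, Wv, Wp

Wq, Wk, Wv, Wp = mimetic_init()
print(Wq.shape, Wk.shape)  # (256, 64) (256, 64); Wq @ Wk.T is rank-64 by construction
```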
2305.08995 Report Denoising Diffusion Models for Plug-and-Play Image Restoration Yuanzhi Zhu, Kai Zhang, Jingyun Liang, Jiezhang Cao, Bihan Wen, Radu Timofte, Luc Van Gool Plug-and-play Image Restoration (IR) has been widely recognized as a flexible and interpretable method for solving various inverse problems by utilizing any off-the-shelf denoiser as the implicit image prior. However, most existing methods focus on discriminative Gaussian denoisers. Although diffusion models have shown impressive performance for high-quality image synthesis, their potential to serve as a generative denoiser prior to the plug-and-play IR methods remains to be further explored. While several other attempts have been made to adopt diffusion models for image restoration, they either fail to achieve satisfactory results or typically require an unacceptable number of Neural Function Evaluations (NFEs) during inference. This paper proposes DiffPIR, which integrates the traditional plug-and-play method into the diffusion sampling framework. Compared to plug-and-play IR methods that rely on discriminative Gaussian denoisers, DiffPIR is expected to inherit the generative ability of diffusion models. Experimental results on three representative IR tasks, including super-resolution, image deblurring, and inpainting, demonstrate that DiffPIR achieves state-of-the-art performance on both the FFHQ and ImageNet datasets in terms of reconstruction faithfulness and perceptual quality with no more than 100 NFEs. The source code is available at {\url{https://github.com/yuanzhi-zhu/DiffPIR}} This paper proposes DiffPIR, a plug-and-play image restoration method that leverages the generative capabilities of pre-trained diffusion models within a diffusion sampling framework. Existing plug-and-play methods primarily rely on discriminative Gaussian denoisers, limiting their performance. Diffusion models, as generative denoisers, offer improved potential for modeling complex data distributions and handling ill-posed inverse problems. The method decouples the data and prior terms of the image restoration optimization problem using the Half-Quadratic-Splitting (HQS) algorithm. An off-the-shelf diffusion model acts as a plug-and-play denoiser prior, while the data term is solved analytically or approximated for various degradation models. DiffPIR achieves state-of-the-art performance on super-resolution, image deblurring, and inpainting tasks. The method demonstrates superior reconstruction faithfulness and perceptual quality compared to existing techniques. DiffPIR maintains efficiency, requiring no more than 100 Neural Function Evaluations (NFEs) for inference. The paper primarily focuses on bicubic super-resolution, potentially limiting the generalizability to other degradation models. Future work could explore adapting the sampling process to further reduce NFEs and enhance efficiency. image restoration, diffusion models, plug-and-play priors, generative denoising, half-quadratic-splitting
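The HQS-style alternation is easiest to see for inpainting, where the data subproblem has a per-pixel closed form. Below is a hedged sketch with a placeholder `predict_x0` denoiser and a toy re-noising schedule; DiffPIR's actual parameterization and schedule are not reproduced.

```python
import numpy as np

def predict_x0(x_t, t):
    # Placeholder for the diffusion prior: estimate the clean image from x_t.
    return np.clip(x_t, -1.0, 1.0)

def diffpir_like_inpainting(y, mask, T=100, rho=0.5, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(y.shape)
    for t in range(T, 0, -1):
        x0_hat = predict_x0(x, t)                          # prior (denoising) step
        # Data step: closed form of argmin_z ||y - M z||^2 + rho * ||z - x0_hat||^2,
        # i.e. a weighted average on observed pixels, the prior estimate elsewhere.
        z = np.where(mask > 0, (y + rho * x0_hat) / (1.0 + rho), x0_hat)
        sigma_t = t / T                                    # toy noise level
        x = z + sigma_t * rng.standard_normal(y.shape)     # re-noise and continue
    return x

mask = np.zeros((32, 32)); mask[:, :16] = 1.0              # observe the left half
y = 0.5 * mask                                             # measurements on observed pixels
restored = diffpir_like_inpainting(y, mask)
```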
2305.08891 Report Common Diffusion Noise Schedules and Sample Steps are Flawed Shanchuan Lin, Bingchen Liu, Jiashi Li, Xiao Yang We discover that common diffusion noise schedules do not enforce the last timestep to have zero signal-to-noise ratio (SNR), and some implementations of diffusion samplers do not start from the last timestep. Such designs are flawed and do not reflect the fact that the model is given pure Gaussian noise at inference, creating a discrepancy between training and inference. We show that the flawed design causes real problems in existing implementations. In Stable Diffusion, it severely limits the model to only generate images with medium brightness and prevents it from generating very bright and dark samples. We propose a few simple fixes: (1) rescale the noise schedule to enforce zero terminal SNR; (2) train the model with v prediction; (3) change the sampler to always start from the last timestep; (4) rescale classifier-free guidance to prevent over-exposure. These simple changes ensure the diffusion process is congruent between training and inference and allow the model to generate samples more faithful to the original data distribution. This paper identifies and corrects flaws in common diffusion noise schedules and sampling implementations that cause discrepancies between training and inference. These flaws limit the generated images' brightness range and hinder the model's ability to accurately respond to prompts related to brightness. The authors propose: (1) rescaling noise schedules to ensure zero terminal SNR, (2) training with v prediction and loss, (3) enforcing samplers to start from the last timestep, and (4) rescaling classifier-free guidance to prevent over-exposure. Rescaling the noise schedule and enforcing sampling from the last timestep allows the model to generate images with a wider range of brightness. Training with v prediction and loss maintains visual quality comparable to using epsilon loss. The proposed classifier-free guidance rescaling technique effectively mitigates over-exposure issues encountered when terminal SNR approaches zero. The paper primarily focuses on Stable Diffusion, and further investigation is needed to assess the impact on other diffusion models. The proposed rescaling method for classifier-free guidance relies on an empirically determined hyperparameter, and further exploration of optimal values is warranted. diffusion models, noise schedules, sampling techniques, classifier-free guidance, stable diffusion
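One way to implement the schedule fix is to shift and rescale sqrt(alpha_bar) so that the final timestep reaches exactly zero SNR while the first timestep is left unchanged; the sketch below follows that recipe (the starting schedule and function name are illustrative, not taken from any particular codebase).

```python
import numpy as np

def enforce_zero_terminal_snr(betas):
    alphas = 1.0 - betas
    sqrt_ab = np.sqrt(np.cumprod(alphas))
    sqrt_ab_0, sqrt_ab_T = sqrt_ab[0], sqrt_ab[-1]
    sqrt_ab = sqrt_ab - sqrt_ab_T                              # terminal SNR -> 0
    sqrt_ab = sqrt_ab * sqrt_ab_0 / (sqrt_ab_0 - sqrt_ab_T)    # first step unchanged
    alphas_bar = sqrt_ab ** 2
    alphas = np.concatenate([alphas_bar[:1], alphas_bar[1:] / alphas_bar[:-1]])
    return 1.0 - alphas

betas = np.linspace(0.00085 ** 0.5, 0.012 ** 0.5, 1000) ** 2   # scaled-linear schedule
new_betas = enforce_zero_terminal_snr(betas)
print(np.cumprod(1.0 - new_betas)[-1])                         # 0.0: pure noise at t = T
```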
2305.08810 Report AutoRecon: Automated 3D Object Discovery and Reconstruction Yuang Wang, Xingyi He, Sida Peng, Haotong Lin, Hujun Bao, Xiaowei Zhou A fully automated object reconstruction pipeline is crucial for digital content creation. While the area of 3D reconstruction has witnessed profound developments, the removal of background to obtain a clean object model still relies on different forms of manual labor, such as bounding box labeling, mask annotations, and mesh manipulations. In this paper, we propose a novel framework named AutoRecon for the automated discovery and reconstruction of an object from multi-view images. We demonstrate that foreground objects can be robustly located and segmented from SfM point clouds by leveraging self-supervised 2D vision transformer features. Then, we reconstruct decomposed neural scene representations with dense supervision provided by the decomposed point clouds, resulting in accurate object reconstruction and segmentation. Experiments on the DTU, BlendedMVS and CO3D-V2 datasets demonstrate the effectiveness and robustness of AutoRecon. Proposes AutoRecon, a fully automated framework for discovering and reconstructing 3D objects from multi-view images without annotations. Enables scalable 3D content creation and the potential for large-scale generation of free 2D and 3D object annotations for supervised learning. A two-stage coarse-to-fine pipeline: 1) Coarse decomposition segments the foreground object from SfM point clouds using self-supervised 2D vision transformer features and a 3D segmentation Transformer. 2) Fine decomposition reconstructs a decomposed neural scene representation within the estimated object bounding box, guided by the coarse decomposition. Achieves superior 3D salient object detection compared to baselines, especially on challenging datasets like CO3D. Reconstructs background-free object models with quality comparable to or exceeding NeuS, without manual annotation or post-processing. Produces high-quality and multi-view consistent 2D segmentation masks, outperforming existing single-view and multi-view baselines. Remains sensitive to issues like shadows and thin structures. Storing multi-view ViT features is memory-intensive. 3d object reconstruction, unsupervised object discovery, scene decomposition, neural scene representation, point cloud segmentation
2305.08776 Report Bridging the Domain Gap: Self-Supervised 3D Scene Understanding with Foundation Models Zhimin Chen, Longlong Jing, Yingwei Li, Bing Li Foundation models have achieved remarkable results in 2D and language tasks like image segmentation, object detection, and visual-language understanding. However, their potential to enrich 3D scene representation learning is largely untapped due to the existence of the domain gap. In this work, we propose an innovative methodology called Bridge3D to address this gap by pre-training 3D models using features, semantic masks, and captions sourced from foundation models. Specifically, our method employs semantic masks from foundation models to guide the masking and reconstruction process for the masked autoencoder, enabling more focused attention on foreground representations. Moreover, we bridge the 3D-text gap at the scene level using image captioning foundation models, thereby facilitating scene-level knowledge distillation. We further extend this bridging effort by introducing an innovative object-level knowledge distillation method that harnesses highly accurate object-level masks and semantic text data from foundation models. Our methodology significantly surpasses the performance of existing state-of-the-art methods in 3D object detection and semantic segmentation tasks. For instance, on the ScanNet dataset, Bridge3D improves the baseline by a notable margin of 6.3%. Code will be available at: https://github.com/Zhimin-C/Bridge3D This paper presents Bridge3D, a novel method that leverages multiple foundation models for self-supervised 3D scene understanding. It uses semantic masks from foundation models to guide the masking and reconstruction process for the masked autoencoder, focusing on foreground representations and bridges the 3D-text gap at both scene and object levels for knowledge distillation. Bridging the gap between the success of foundation models in 2D and language tasks with the need for enriched 3D scene representation learning is crucial for advancing 3D scene understanding. The method uses a three-pronged approach: 1) Semantic-guided masked autoencoder with foreground-aware masking and patch dropping. 2) Multi-modal scene-level knowledge distillation using image captioning and 3D features. 3) Multi-modal object-level knowledge distillation leveraging accurate object-level masks and semantic text from foundation models. Bridge3D outperforms state-of-the-art self-supervised learning methods in 3D object detection and semantic segmentation tasks. It significantly improves performance on ScanNet and SUN RGB-D datasets for object detection and S3DIS dataset for semantic segmentation. Ablation studies confirm the effectiveness of each component and modality used in Bridge3D. The current work focuses primarily on indoor 3D scenes, limiting its generalizability. Future work will focus on extending Bridge3D to outdoor scenes and open-vocabulary 3D tasks. 3d scene understanding, self-supervised learning, foundation models, knowledge distillation, masked autoencoder
2305.08694 Report A Reproducible Extraction of Training Images from Diffusion Models Ryan Webster Recently, Carlini et al. demonstrated the widely used model Stable Diffusion can regurgitate real training samples, which is troublesome from a copyright perspective. In this work, we provide an efficient extraction attack on par with the recent attack, with several order of magnitudes less network evaluations. In the process, we expose a new phenomena, which we dub template verbatims, wherein a diffusion model will regurgitate a training sample largely in tact. Template verbatims are harder to detect as they require retrieval and masking to correctly label. Furthermore, they are still generated by newer systems, even those which de-duplicate their training set, and we give insight into why they still appear during generation. We extract training images from several state of the art systems, including Stable Diffusion 2.0, Deep Image Floyd, and finally Midjourney v4. We release code to verify our extraction attack, perform the attack, as well as all extracted prompts at \url{https://github.com/ryanwebster90/onestep-extraction}. This paper presents an efficient extraction attack on diffusion models, revealing a new phenomenon called "template verbatims" where models regurgitate training samples with non-semantic variations. The research highlights copyright concerns and potential misuse of generative models, especially for artists whose work might be exploited without attribution. The authors propose whitebox and blackbox attacks leveraging one-step synthesis properties and edge consistency, evaluating them against various diffusion models. The attack achieves comparable performance to existing methods but with significantly fewer network evaluations. Template verbatims, harder to detect due to variations, are found even in models trained on deduplicated datasets. The attack successfully extracts training images from various models, including Stable Diffusion 2.0, Deep Image Floyd, and Midjourney v4. The current ground truth construction struggles with images containing rearranged patches. Future work could explore more robust copy detection methods invariant to patch permutations. diffusion models, extraction attack, copyright infringement, template verbatims, generative models
2305.08685 Report CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding Linhui Xiao, Xiaoshan Yang, Fang Peng, Ming Yan, Yaowei Wang, Changsheng Xu Visual Grounding (VG) is a crucial topic in the field of vision and language, which involves locating a specific region described by expressions within an image. To reduce the reliance on manually labeled data, unsupervised visual grounding have been developed to locate regions using pseudo-labels. However, the performance of existing unsupervised methods is highly dependent on the quality of pseudo-labels and these methods always encounter issues with limited diversity. In order to utilize vision and language pre-trained models to address the grounding problem, and reasonably take advantage of pseudo-labels, we propose CLIP-VG, a novel method that can conduct self-paced curriculum adapting of CLIP with pseudo-language labels. We propose a simple yet efficient end-to-end network architecture to realize the transfer of CLIP to the visual grounding. Based on the CLIP-based architecture, we further propose single-source and multi-source curriculum adapting algorithms, which can progressively find more reliable pseudo-labels to learn an optimal model, thereby achieving a balance between reliability and diversity for the pseudo-language labels. Our method outperforms the current state-of-the-art unsupervised method by a significant margin on RefCOCO/+/g datasets in both single-source and multi-source scenarios, with improvements ranging from 6.78$\%$ to 10.67$\%$ and 11.39$\%$ to 14.87$\%$, respectively. The results even outperform existing weakly supervised visual grounding methods. Furthermore, our method is also competitive in fully supervised setting. The code and models are available at https://github.com/linhuixiao/CLIP-VG. This paper proposes CLIP-VG, a novel method that conducts self-paced curriculum adapting of CLIP with pseudo-language labels for visual grounding, aiming to address the limitations of existing unsupervised methods that heavily rely on the quality of pseudo-labels and often encounter issues with limited diversity. Visual grounding is a crucial task in vision and language, and this work reduces the reliance on manually labeled data by efficiently utilizing VLP models like CLIP and pseudo-labels in a self-paced curriculum learning paradigm. The methodology involves a simple yet efficient end-to-end pure-Transformer encoder-only network architecture based on CLIP. It introduces a reliability measurement scheme to evaluate instance-level quality and proposes single-source and multi-source self-paced curriculum adapting algorithms (SSA and MSA) to progressively find more reliable pseudo-labels for training. CLIP-VG significantly outperforms the current state-of-the-art unsupervised method (Pseudo-Q) on RefCOCO/+/g datasets in both single-source and multi-source scenarios, with improvements ranging from 6.78% to 10.67% and 11.39% to 14.87%, respectively. The proposed method also surpasses existing weakly supervised methods and achieves competitive results compared to fully supervised models. CLIP-VG demonstrates significant speedups in both training and inference compared to other SOTA models while maintaining high performance. The quality of the currently utilized three types of pseudo-labels remains low, which could be further improved. The greedy sample selection strategy in SSA and MSA, while balancing efficiency and performance, represents a trade-off that could be further explored. visual grounding, curriculum learning, pseudo-language label, vision-language models, clip
2305.08408 Report SB-VQA: A Stack-Based Video Quality Assessment Framework for Video Enhancement Ding-Jiun Huang, Yu-Ting Kao, Tieh-Hung Chuang, Ya-Chun Tsai, Jing-Kai Lou, Shuen-Huei Guan In recent years, several video quality assessment (VQA) methods have been developed, achieving high performance. However, these methods were not specifically trained for enhanced videos, which limits their ability to predict video quality accurately based on human subjective perception. To address this issue, we propose a stack-based framework for VQA that outperforms existing state-of-the-art methods on VDPVE, a dataset consisting of enhanced videos. In addition to proposing the VQA framework for enhanced videos, we also investigate its application on professionally generated content (PGC). To address copyright issues with premium content, we create the PGCVQ dataset, which consists of videos from YouTube. We evaluate our proposed approach and state-of-the-art methods on PGCVQ, and provide new insights on the results. Our experiments demonstrate that existing VQA algorithms can be applied to PGC videos, and we find that VQA performance for PGC videos can be improved by considering the plot of a play, which highlights the importance of video semantic understanding. The paper proposes SB-VQA, a stack-based video quality assessment framework for enhanced videos, and investigates its application on professionally generated content (PGC). Accurate VQA for enhanced videos is crucial as traditional metrics like PSNR and SSIM fail to reflect human perception. Moreover, VQA for PGC content, while important for applications like old film restoration, remains underexplored. SB-VQA utilizes a stack-based approach with multiple feature extractors (FANet) and patch-weighted convolution blocks to mitigate bias from diverse video enhancements. The authors create PGCVQ, a PGC dataset, by transcoding movie trailers at various bitrates and analyze the relationship between predicted quality, encoding bitrate, and video content appeal (using YouTube heatmaps). SB-VQA outperforms state-of-the-art VQA methods on the VDPVE dataset (enhanced videos). SB-VQA's predicted quality scores on PGCVQ align with the expectation that higher bitrates yield better perceived quality. A correlation is observed between predicted quality scores and content appeal derived from YouTube heatmaps, suggesting that VQA can reflect video content richness. The paper acknowledges potential overfitting of SB-VQA's regression block on the training dataset. Future work could explore multi-modal models incorporating semantic understanding to enhance VQA accuracy. video quality assessment, video enhancement, professionally generated content, deep learning, content appeal
2305.07895 Report On the Hidden Mystery of OCR in Large Multimodal Models Yuliang Liu, Zhang Li, Biao Yang, Chunyuan Li, Xucheng Yin, Cheng-lin Liu, Lianwen Jin, Xiang Bai Large models have recently played a dominant role in natural language processing and multimodal vision-language learning. However, their effectiveness in text-related visual tasks remains relatively unexplored. In this paper, we conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks including Text Recognition, Scene Text-Centric Visual Question Answering (VQA), Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten Mathematical Expression Recognition (HMER). To facilitate the assessment of Optical Character Recognition (OCR) capabilities in Large Multimodal Models, we propose OCRBench, a comprehensive evaluation benchmark. Our study encompasses 29 datasets, making it the most comprehensive OCR evaluation benchmark available. Furthermore, our study reveals both the strengths and weaknesses of these models, particularly in handling multilingual text, handwritten text, non-semantic text, and mathematical expression recognition. Most importantly, the baseline results showcased in this study could provide a foundational framework for the conception and assessment of innovative strategies targeted at enhancing zero-shot multimodal techniques. The evaluation pipeline and benchmark are available at https://github.com/Yuliang-Liu/MultimodalOCR. This paper presents a comprehensive evaluation of 14 Large Multimodal Models (LMMs) on various text-related visual tasks and proposes OCRBench, a new benchmark for assessing OCR capabilities in LMMs. Understanding LMMs' effectiveness in text-related visual tasks is crucial due to their potential to revolutionize how we interact with and analyze information from both text and images. The authors evaluate LMMs on 29 datasets across five key tasks: text recognition, scene text-centric VQA, document-oriented VQA, key information extraction, and handwritten mathematical expression recognition. They introduce OCRBench, a refined benchmark with 1000 manually verified question-answer pairs. LMMs demonstrate promising results in text recognition, sometimes matching state-of-the-art supervised methods, but still lag behind in complex tasks like handwritten mathematical expression recognition. LMMs exhibit weaknesses in handling blurry images, handwritten text, multilingual text, and non-semantic text. OCRBench provides a standardized and accurate tool for evaluating LMM performance on OCR tasks, revealing strengths and areas for improvement. The study primarily focuses on accuracy and doesn't delve into computational efficiency or resource consumption of LMMs compared to specialized models. Future work should explore the impact of training data size and fine-tuning strategies on LMM performance in specific OCR tasks. large multimodal models, optical character recognition, benchmarking, text recognition, visual question answering
2305.07710 Report Zero-shot racially balanced dataset generation using an existing biased StyleGAN2 Anubhav Jain, Nasir Memon, Julian Togelius Facial recognition systems have made significant strides thanks to data-heavy deep learning models, but these models rely on large privacy-sensitive datasets. Further, many of these datasets lack diversity in terms of ethnicity and demographics, which can lead to biased models that can have serious societal and security implications. To address these issues, we propose a methodology that leverages the biased generative model StyleGAN2 to create demographically diverse images of synthetic individuals. The synthetic dataset is created using a novel evolutionary search algorithm that targets specific demographic groups. By training face recognition models with the resulting balanced dataset containing 50,000 identities per race (13.5 million images in total), we can improve their performance and minimize biases that might have been present in a model trained on a real dataset. This paper introduces a novel search-based algorithm to generate balanced synthetic facial image datasets with diverse demographics from a pre-trained, biased StyleGAN2 model, aiming to improve fairness and accuracy in facial recognition. Existing facial recognition models trained on real-world datasets often inherit biases due to imbalanced representation of ethnicities, leading to unfair performance across different demographic groups. This work addresses the need for balanced and privacy-aware facial image datasets to mitigate these biases. The authors propose an evolutionary search algorithm that operates on the latent space of a pre-trained StyleGAN2 model. By leveraging an auxiliary demographic classifier, the algorithm explores the latent space to find and generate a large number of synthetic identities belonging to specific racial groups. The proposed approach generates over 50,000 unique synthetic identities per racial group, totaling 13.5 million images, showcasing its ability to create large-scale, balanced datasets. Pre-training facial recognition models (ArcFace, AdaFace, ElasticFace) on the generated dataset leads to improved accuracy on standard benchmarks like RFW, LFW, CFP-FP, and AgeDB compared to models trained solely on real data. The generated balanced dataset aids in mitigating bias, demonstrated by a reduction in accuracy disparity across different racial groups for the trained facial recognition models. The reliance on an external ethnicity classifier for supervision during synthetic data generation can introduce noise due to potential misclassifications by the classifier. The study primarily focuses on mitigating racial bias and could be extended to address other demographic attributes like gender and age. facial recognition, bias mitigation, synthetic data, stylegan2, evolutionary search
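The search itself can be sketched as a plain evolutionary loop over latent vectors scored by a demographic classifier; the `generator` and `ethnicity_classifier` stubs below stand in for StyleGAN2 and the auxiliary classifier, and the selection/mutation scheme is a generic assumption rather than the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(z):
    return z  # stub: real code would map z to an image via StyleGAN2

def ethnicity_classifier(image, target):
    # Stub fitness: a probability-like score for the target demographic group.
    return float(1.0 / (1.0 + np.linalg.norm(image - target)))

def evolve_latents(target, dim=512, pop=32, elite=8, iters=50, sigma=0.2):
    population = rng.standard_normal((pop, dim))
    for _ in range(iters):
        scores = np.array([ethnicity_classifier(generator(z), target)
                           for z in population])
        parents = population[np.argsort(scores)[-elite:]]        # keep the fittest
        children = (parents[rng.integers(0, elite, pop - elite)]
                    + sigma * rng.standard_normal((pop - elite, dim)))
        population = np.vstack([parents, children])
    return population

latents = evolve_latents(target=0.0)
```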
2305.07625 Report Meta Omnium: A Benchmark for General-Purpose Learning-to-Learn Ondrej Bohdal, Yinbing Tian, Yongshuo Zong, Ruchika Chavhan, Da Li, Henry Gouk, Li Guo, Timothy Hospedales Meta-learning and other approaches to few-shot learning are widely studied for image recognition, and are increasingly applied to other vision tasks such as pose estimation and dense prediction. This naturally raises the question of whether there is any few-shot meta-learning algorithm capable of generalizing across these diverse task types? To support the community in answering this question, we introduce Meta Omnium, a dataset-of-datasets spanning multiple vision tasks including recognition, keypoint localization, semantic segmentation and regression. We experiment with popular few-shot meta-learning baselines and analyze their ability to generalize across tasks and to transfer knowledge between them. Meta Omnium enables meta-learning researchers to evaluate model generalization to a much wider array of tasks than previously possible, and provides a single framework for evaluating meta-learners across a wide suite of vision applications in a consistent manner. This paper introduces Meta-Omnium, a benchmark dataset for evaluating few-shot meta-learning algorithms across multiple vision tasks. Existing few-shot learning benchmarks focus on single tasks, limiting the development of general-purpose algorithms capable of knowledge transfer across tasks. The benchmark comprises datasets from four vision tasks: image classification, semantic segmentation, keypoint localization, and regression. It includes seen/unseen dataset splits for in-distribution and out-of-distribution generalization evaluation. The authors adapt several popular few-shot learning algorithms (ProtoNet, MAML, DDRR, etc.) to the multi-task setting and provide baseline results. Prototypical Networks show the best overall performance and out-of-distribution generalization ability. Single-task meta-learning generally outperforms multi-task meta-learning, indicating the challenge of learning from heterogeneous task distributions. Meta-learning significantly outperforms simple transfer learning and training-from-scratch approaches. Limited number of datasets within each task family. Exploration of more sophisticated meta-learning algorithms is needed. meta-learning, few-shot learning, multi-task learning, benchmarking, computer vision
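For reference, the strongest baseline reported here, Prototypical Networks, reduces to a few lines per episode: embed the support set, average per class to get prototypes, and classify queries by nearest prototype. A self-contained toy episode with a stub `embed` (not the benchmark's actual backbone):

```python
import numpy as np

def embed(x):
    return x  # stub feature extractor shared across tasks

def protonet_episode(support_x, support_y, query_x, n_way):
    s, q = embed(support_x), embed(query_x)
    prototypes = np.stack([s[support_y == c].mean(axis=0) for c in range(n_way)])
    # Negative squared Euclidean distance to each prototype acts as the logit.
    d2 = ((q[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return (-d2).argmax(axis=1)

rng = np.random.default_rng(0)
sx = rng.standard_normal((6, 16)) + 5.0 * np.repeat(np.arange(3), 2)[:, None]
sy = np.array([0, 0, 1, 1, 2, 2])                  # 3-way, 2-shot support set
qx = sx[[0, 2, 4]] + 0.01 * rng.standard_normal((3, 16))
print(protonet_episode(sx, sy, qx, n_way=3))       # expected: [0 1 2]
```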
2305.07304 Report CLIP-Count: Towards Text-Guided Zero-Shot Object Counting Ruixiang Jiang, Lingbo Liu, Changwen Chen Recent advances in visual-language models have shown remarkable zero-shot text-image matching ability that is transferable to downstream tasks such as object detection and segmentation. Adapting these models for object counting, however, remains a formidable challenge. In this study, we first investigate transferring vision-language models (VLMs) for class-agnostic object counting. Specifically, we propose CLIP-Count, the first end-to-end pipeline that estimates density maps for open-vocabulary objects with text guidance in a zero-shot manner. To align the text embedding with dense visual features, we introduce a patch-text contrastive loss that guides the model to learn informative patch-level visual representations for dense prediction. Moreover, we design a hierarchical patch-text interaction module to propagate semantic information across different resolution levels of visual features. Benefiting from the full exploitation of the rich image-text alignment knowledge of pretrained VLMs, our method effectively generates high-quality density maps for objects-of-interest. Extensive experiments on FSC-147, CARPK, and ShanghaiTech crowd counting datasets demonstrate state-of-the-art accuracy and generalizability of the proposed method. Code is available: https://github.com/songrise/CLIP-Count. CLIP-Count, the first end-to-end text-guided zero-shot object counting model, estimates density maps for open-vocabulary objects using text prompts. Existing class-agnostic counting methods are limited by reliance on manual patch exemplars or lack of object specificity. CLIP-Count addresses these limitations with a more user-friendly and flexible text-guided approach. The method adapts CLIP by introducing: 1) a patch-text contrastive loss to align text and patch embeddings, 2) a hierarchical patch-text interaction module to propagate semantic information across resolutions, and 3) a CNN decoder to generate density maps. Outperforms state-of-the-art zero-shot counting methods on FSC-147. Demonstrates superior cross-dataset generalizability on CARPK and ShanghaiTech. Effectively localizes and counts diverse objects with high fidelity, as shown qualitatively. Performance can be limited by ambiguity in text guidance for object counting. Future work includes collecting datasets with more fine-grained text annotations. class-agnostic object counting, zero-shot learning, text-guided vision, density estimation, clip
2305.07223 Report Transavs: End-To-End Audio-Visual Segmentation With Transformer Yuhang Ling, Yuxi Li, Zhenye Gan, Jiangning Zhang, Mingmin Chi, Yabiao Wang Audio-Visual Segmentation (AVS) is a challenging task, which aims to segment sounding objects in video frames by exploring audio signals. Generally AVS faces two key challenges: (1) Audio signals inherently exhibit a high degree of information density, as sounds produced by multiple objects are entangled within the same audio stream; (2) Objects of the same category tend to produce similar audio signals, making it difficult to distinguish between them and thus leading to unclear segmentation results. Toward this end, we propose TransAVS, the first Transformer-based end-to-end framework for AVS task. Specifically, TransAVS disentangles the audio stream as audio queries, which will interact with images and decode into segmentation masks with full transformer architectures. This scheme not only promotes comprehensive audio-image communication but also explicitly excavates instance cues encapsulated in the scene. Meanwhile, to encourage these audio queries to capture distinctive sounding objects instead of degrading to be homogeneous, we devise two self-supervised loss functions at both query and mask levels, allowing the model to capture distinctive features within similar audio data and achieve more precise segmentation. Our experiments demonstrate that TransAVS achieves state-of-the-art results on the AVSBench dataset, highlighting its effectiveness in bridging the gap between audio and visual modalities. This paper proposes TransAVS, the first Transformer-based end-to-end framework for Audio-Visual Segmentation (AVS) that leverages audio cues to segment sounding objects in video frames. AVS is challenging because audio signals are information-dense (mixing sounds from multiple sources) and objects of the same category often produce similar sounds making segmentation difficult. TransAVS disentangles the audio stream into queries that interact with image features in a Transformer architecture to generate segmentation masks. Two self-supervised losses, Audio Query Distance Loss (AQDL) and Audio Query Mask Loss (AQML), encourage the model to learn distinctive features for more precise segmentation. TransAVS achieves state-of-the-art results on the AVSBench dataset, outperforming previous methods in both single-source and multi-source sound segmentation. The use of audio queries for instance-level awareness and discrimination significantly improves segmentation accuracy. Self-supervised losses, AQDL and AQML, effectively address the challenge of sound homogeneity among objects of the same category. The number of audio queries needs to be carefully tuned for optimal performance. Future work could explore incorporating temporal information for better handling of object movements and occlusions. audio-visual segmentation, multi-modal learning, transformer, self-supervised learning, sound source separation
2305.07024 Report SparseGNV: Generating Novel Views of Indoor Scenes with Sparse Input Views Weihao Cheng, Yan-Pei Cao, Ying Shan We study to generate novel views of indoor scenes given sparse input views. The challenge is to achieve both photorealism and view consistency. We present SparseGNV: a learning framework that incorporates 3D structures and image generative models to generate novel views with three modules. The first module builds a neural point cloud as underlying geometry, providing contextual information and guidance for the target novel view. The second module utilizes a transformer-based network to map the scene context and the guidance into a shared latent space and autoregressively decodes the target view in the form of discrete image tokens. The third module reconstructs the tokens into the image of the target view. SparseGNV is trained across a large indoor scene dataset to learn generalizable priors. Once trained, it can efficiently generate novel views of an unseen indoor scene in a feed-forward manner. We evaluate SparseGNV on both real-world and synthetic indoor scenes and demonstrate that it outperforms state-of-the-art methods based on either neural radiance fields or conditional image generation. Proposes SparseGNV, a learning framework combining 3D structures and image generative models to synthesize novel views of indoor scenes from sparse input views. Addresses the challenge of generating photorealistic and consistent novel views of indoor scenes, which are often spatially complex and require dense, expensive scans. SparseGNV uses three modules: 1) Neural geometry module to build a 3D point cloud from sparse input views; 2) View generator module to encode scene context and target viewpoint into latent space, and autoregressively decode a novel view as discrete tokens; 3) Image converter module to reconstruct the tokens into a final image. Outperforms state-of-the-art methods like NeRFs and conditional image generation on real-world and synthetic datasets. Generates high-fidelity novel views with consistent structure faithful to the observations. Demonstrates strong generalization ability by effectively leveraging sparse input information. Output can be less stable compared to volume rendering methods, with potential alterations in object details and lighting. Requires camera poses and depths, which can be unavailable in extremely sparse settings. Future work could explore incorporating depth estimation into the framework. novel view synthesis, indoor scenes, sparse input, 3d structure, image generation
2305.07021 Report Simple Token-Level Confidence Improves Caption Correctness Suzanne Petryk, Spencer Whitehead, Joseph E. Gonzalez, Trevor Darrell, Anna Rohrbach, Marcus Rohrbach The ability to judge whether a caption correctly describes an image is a critical part of vision-language understanding. However, state-of-the-art models often misinterpret the correctness of fine-grained details, leading to errors in outputs such as hallucinating objects in generated captions or poor compositional reasoning. In this work, we explore Token-Level Confidence, or TLC, as a simple yet surprisingly effective method to assess caption correctness. Specifically, we fine-tune a vision-language model on image captioning, input an image and proposed caption to the model, and aggregate either algebraic or learned token confidences over words or sequences to estimate image-caption consistency. Compared to sequence-level scores from pretrained models, TLC with algebraic confidence measures achieves a relative improvement in accuracy by 10% on verb understanding in SVO-Probes and outperforms prior state-of-the-art in image and group scores for compositional reasoning in Winoground by a relative 37% and 9%, respectively. When training data are available, a learned confidence estimator provides further improved performance, reducing object hallucination rates in MS COCO Captions by a relative 30% over the original model and setting a new state-of-the-art. This paper proposes Token-Level Confidence (TLC), a simple yet effective method to assess image-caption correctness by leveraging token-level confidences from a fine-tuned image captioning model. Current vision-language models often struggle with fine-grained details in image-caption correctness, impacting tasks like hallucination detection and compositional reasoning. TLC uses a fine-tuned captioning model. TLC-A, uses algebraic confidence measures (e.g., softmax) on token predictions. TLC-L, learns a confidence estimator trained on predicting token correctness based on references. TLC-A surpasses prior state-of-the-art on Winoground, showing substantial improvement in image and group scores for compositional reasoning. TLC-A outperforms sequence-level image-text matching scores on verb understanding evaluated with SVO-Probes. TLC-L significantly reduces object hallucination rates in generated captions on MS COCO Captions, setting a new state-of-the-art. TLC-L requires in-domain training data for the confidence estimator. The study uses uncalibrated output distributions for confidence estimation, potentially limiting reliability. image captioning, caption correctness, hallucination reduction, compositional reasoning, vision-language models
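The algebraic variant (TLC-A) amounts to teacher-forcing the candidate caption through the captioner and aggregating the per-token probabilities assigned to the caption's own tokens. A small sketch with a stub `caption_logits`; the aggregation choices here are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def caption_logits(image, token_ids, vocab=1000):
    # Stub: real code would teacher-force the caption through the fine-tuned captioner.
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(token_ids), vocab))

def token_level_confidence(image, token_ids, agg="mean"):
    probs = softmax(caption_logits(image, token_ids))           # (T, vocab)
    token_conf = probs[np.arange(len(token_ids)), token_ids]    # p of each caption token
    return token_conf.min() if agg == "min" else token_conf.mean()

score = token_level_confidence(image=None, token_ids=[5, 17, 256, 3])
```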
2305.07017 Report An Inverse Scaling Law for CLIP Training Xianhang Li, Zeyu Wang, Cihang Xie CLIP, one of the pioneering foundation models that connect images and text, has enabled many recent breakthroughs in computer vision. However, its associated training cost is prohibitively high, imposing a significant barrier to its widespread exploration. In this paper, we present a surprising finding that there exists an inverse scaling law for CLIP training, whereby the larger the image/text encoders used, the shorter the sequence length of image/text tokens that can be applied in training. Moreover, we showcase that the strategy for reducing image/text token length plays a crucial role in determining the quality of this scaling law. As a result of this finding, we are able to successfully train CLIP even with limited computational resources. For example, using 8 A100 GPUs, our CLIP models achieve zero-shot top-1 ImageNet-1k accuracies of 63.2% in ~2 days, 67.8% in ~3 days, and 69.3% in ~4 days. Our method also works well when scaling up -- with G/14, we register a new record of 83.0% ImageNet-1k zero-shot accuracy, and meanwhile accelerate the training by ~33x compared to its OpenCLIP counterpart. By reducing the computation barrier associated with CLIP, we hope to inspire more research in this field, particularly from academics. Our code is available at https://github.com/UCSC-VLAA/CLIPA. This paper discovers an inverse scaling law for CLIP training, showing that larger image/text encoders can be trained with shorter image/text token sequences while maintaining performance. This is important because it lowers the computational barrier of CLIP training, making it more accessible to researchers with limited resources. The authors experimented with various token reduction strategies (resizing, masking, etc.) and model sizes on ImageNet-1k, COCO, and robustness benchmarks. Larger CLIP models exhibit smaller performance drops when trained with reduced token lengths. Image resizing and syntax masking are the most effective strategies for reducing image and text tokens, respectively. The proposed CLIPA framework, utilizing this inverse scaling law, achieves competitive results with significantly reduced training cost compared to OpenCLIP. Current CLIP models, including CLIPA, struggle with capturing complex relationships, attributes, and order information. Future work includes investigating the inverse scaling law with even larger models and datasets, as well as addressing the limitations in relational understanding. clip, inverse scaling law, efficient training, foundation models, computer vision
2305.07015 Report Exploiting Diffusion Prior for Real-World Image Super-Resolution Jianyi Wang, Zongsheng Yue, Shangchen Zhou, Kelvin C. K. Chan, Chen Change Loy We present a novel approach to leverage prior knowledge encapsulated in pre-trained text-to-image diffusion models for blind super-resolution (SR). Specifically, by employing our time-aware encoder, we can achieve promising restoration results without altering the pre-trained synthesis model, thereby preserving the generative prior and minimizing training cost. To remedy the loss of fidelity caused by the inherent stochasticity of diffusion models, we employ a controllable feature wrapping module that allows users to balance quality and fidelity by simply adjusting a scalar value during the inference process. Moreover, we develop a progressive aggregation sampling strategy to overcome the fixed-size constraints of pre-trained diffusion models, enabling adaptation to resolutions of any size. A comprehensive evaluation of our method using both synthetic and real-world benchmarks demonstrates its superiority over current state-of-the-art approaches. Code and models are available at https://github.com/IceClear/StableSR. This paper introduces StableSR, a novel blind super-resolution method that leverages the generative prior of pre-trained text-to-image diffusion models for high-quality image restoration. The approach addresses the limitations of existing super-resolution techniques, which often require extensive training from scratch or rely on explicit degradation assumptions, limiting their generalizability and computational efficiency. StableSR employs a time-aware encoder to condition a frozen pre-trained diffusion model on the input low-resolution image, preserving the generative prior and enabling efficient training. It further incorporates a controllable feature wrapping module for balancing realism and fidelity, and a progressive aggregation sampling strategy for handling arbitrary image resolutions. StableSR achieves state-of-the-art performance on both synthetic and real-world benchmarks, surpassing existing methods in perceptual quality metrics. The method demonstrates superior artifact removal and detail generation capabilities, producing visually compelling results. StableSR offers flexible control over the trade-off between fidelity and realism, catering to user preferences and diverse image degradation scenarios. The inference speed of StableSR, being a diffusion-based approach, is slower compared to GAN-based methods, demanding further exploration into fast sampling strategies. The pre-cleaning stage, while effective for severely degraded images, introduces an additional dependency on external models, necessitating further research into enhancing the robustness of StableSR. super-resolution, image restoration, diffusion models, generative prior, blind super-resolution
2305.07011 Report Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers Dahun Kim, Anelia Angelova, Weicheng Kuo We present Region-aware Open-vocabulary Vision Transformers (RO-ViT) - a contrastive image-text pretraining recipe to bridge the gap between image-level pretraining and open-vocabulary object detection. At the pretraining phase, we propose to randomly crop and resize regions of positional embeddings instead of using the whole image positional embeddings. This better matches the use of positional embeddings at region-level in the detection finetuning phase. In addition, we replace the common softmax cross entropy loss in contrastive learning with focal loss to better learn the informative yet difficult examples. Finally, we leverage recent advances in novel object proposals to improve open-vocabulary detection finetuning. We evaluate our full model on the LVIS and COCO open-vocabulary detection benchmarks and zero-shot transfer. RO-ViT achieves a state-of-the-art 34.1 $AP_r$ on LVIS, surpassing the best existing approach by +7.8 points in addition to competitive zero-shot transfer detection. Surprisingly, RO-ViT improves the image-level representation as well and achieves the state of the art on 9 out of 12 metrics on COCO and Flickr image-text retrieval benchmarks, outperforming competitive approaches with larger models. This paper presents Region-aware Open-vocabulary Vision Transformers (RO-ViT), a contrastive image-text pretraining approach for open-vocabulary object detection, utilizing cropped positional embeddings (CPE) and focal loss during pretraining. Existing image-text pretraining methods are designed for image-level tasks and lack region-level understanding, hindering their performance in open-vocabulary object detection. RO-ViT introduces two key innovations: (1) Cropped Positional Embeddings (CPE) that randomly crop and resize regions of positional embeddings during pretraining to better match region-level use in detection, and (2) replacement of softmax cross-entropy loss with focal loss in contrastive learning to emphasize informative examples. RO-ViT achieves state-of-the-art performance (34.1 AP_r) on the LVIS open-vocabulary detection benchmark, surpassing the best existing approach by +7.8 AP_r. Despite not being optimized for retrieval, RO-ViT achieves state-of-the-art performance on 9 out of 12 metrics on COCO and Flickr image-text retrieval benchmarks. Ablation studies confirm the benefits of both CPE and focal loss, demonstrating improved region-level representation for open-vocabulary detection. The model's performance relies on the quality and potential biases present in the pretrained VLMs. Future work can explore the application of RO-ViT to video-based open-vocabulary detection, leveraging its strong performance on the ego-centric dataset. open-vocabulary object detection, vision transformers, contrastive image-text pretraining, cropped positional embeddings, focal loss
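Cropped positional embeddings can be sketched as a crop-and-resize on the positional-embedding grid, so that whole-image embeddings used at detection time resemble the region-level embeddings seen during pretraining. The shapes and sampling choices below are illustrative assumptions, not the paper's exact settings:

```python
import torch
import torch.nn.functional as F

def cropped_positional_embedding(pos_embed, grid=14, min_scale=0.1):
    # pos_embed: (grid*grid, dim) image-token positional embeddings (no CLS token).
    dim = pos_embed.shape[-1]
    pe = pos_embed.reshape(1, grid, grid, dim).permute(0, 3, 1, 2)   # (1, dim, H, W)
    scale = torch.empty(1).uniform_(min_scale, 1.0).item()           # random crop size
    h = w = max(1, int(grid * scale))
    top = torch.randint(0, grid - h + 1, (1,)).item()
    left = torch.randint(0, grid - w + 1, (1,)).item()
    crop = pe[:, :, top:top + h, left:left + w]
    # Resize the cropped region back to the full token grid.
    out = F.interpolate(crop, size=(grid, grid), mode="bilinear", align_corners=False)
    return out.permute(0, 2, 3, 1).reshape(grid * grid, dim)

pe = torch.randn(196, 768)
print(cropped_positional_embedding(pe).shape)  # torch.Size([196, 768])
```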
2305.06973 Report FreePoint: Unsupervised Point Cloud Instance Segmentation Zhikai Zhang, Jian Ding, Li Jiang, Dengxin Dai, Gui-Song Xia Instance segmentation of point clouds is a crucial task in 3D field with numerous applications that involve localizing and segmenting objects in a scene. However, achieving satisfactory results requires a large number of manual annotations, which is a time-consuming and expensive process. To alleviate dependency on annotations, we propose a method, called FreePoint, for underexplored unsupervised class-agnostic instance segmentation on point clouds. In detail, we represent the point features by combining coordinates, colors, normals, and self-supervised deep features. Based on the point features, we perform a multicut algorithm to segment point clouds into coarse instance masks as pseudo labels, which are used to train a point cloud instance segmentation model. To alleviate the inaccuracy of coarse masks during training, we propose a weakly-supervised training strategy and corresponding loss. Our work can also serve as an unsupervised pre-training pretext for supervised semantic instance segmentation with limited annotations. For class-agnostic instance segmentation on point clouds, FreePoint largely fills the gap with its fully-supervised counterpart based on the state-of-the-art instance segmentation model Mask3D and even surpasses some previous fully-supervised methods. When serving as a pretext task and fine-tuning on S3DIS, FreePoint outperforms training from scratch by 5.8% AP with only 10% mask annotations. Proposes FreePoint, an unsupervised approach for class-agnostic instance segmentation on point clouds, using a combination of traditional features and self-supervised deep-learning embeddings. Addresses the labor-intensive and expensive nature of obtaining manual annotations for point cloud instance segmentation. 1. Filters out background points using plane segmentation. 2. Extracts features by combining coordinates, colors, normals, and self-supervised deep features. 3. Generates pseudo masks via multicut algorithm on a graph constructed using point feature affinities. 4. Trains an instance segmentation model using a two-step training strategy and a weakly-supervised loss based on the pseudo masks. Achieves over 50% of the accuracy of its fully-supervised counterpart (Mask3D) on class-agnostic instance segmentation on ScanNet. Surpasses some previous fully-supervised methods on class-agnostic instance segmentation. Demonstrates strong performance as a pre-training task, improving semantic instance segmentation results with limited annotations on S3DIS. Performance gap exists compared to fully-supervised methods. Reliance on a user-defined parameter (σ) for generating coarse masks, though robust and easily set with visualization. unsupervised learning, point cloud segmentation, instance segmentation, 3d vision, weakly-supervised learning
2305.06710 Report Null-text Guidance in Diffusion Models is Secretly a Cartoon-style Creator Jing Zhao, Heliang Zheng, Chaoyue Wang, Long Lan, Wanrong Huang, Wenjing Yang Classifier-free guidance is an effective sampling technique in diffusion models that has been widely adopted. The main idea is to extrapolate the model in the direction of text guidance and away from null-text guidance. In this paper, we demonstrate that null-text guidance in diffusion models is secretly a cartoon-style creator, i.e., the generated images can be efficiently transformed into cartoons by simply perturbing the null-text guidance. Specifically, we proposed two disturbance methods, i.e., Rollback disturbance (Back-D) and Image disturbance (Image-D), to construct misalignment between the noisy images used for predicting null-text guidance and text guidance (subsequently referred to as \textbf{null-text noisy image} and \textbf{text noisy image} respectively) in the sampling process. Back-D achieves cartoonization by altering the noise level of null-text noisy image via replacing $x_t$ with $x_{t+\Delta t}$. Image-D, alternatively, produces high-fidelity, diverse cartoons by defining $x_t$ as a clean input image, which further improves the incorporation of finer image details. Through comprehensive experiments, we delved into the principle of noise disturbing for null-text and uncovered that the efficacy of disturbance depends on the correlation between the null-text noisy image and the source image. Moreover, our proposed techniques, which can generate cartoon images and cartoonize specific ones, are training-free and easily integrated as a plug-and-play component in any classifier-free guided diffusion model. Project page is available at \url{https://nulltextforcartoon.github.io/}. This paper discovers that null-text guidance in diffusion models can be manipulated to create cartoon-style images. This is the first work to achieve cartoonization using diffusion models without requiring additional training, offering a novel and efficient approach. The authors introduce two noise disturbance methods: Rollback disturbance (Back-D), which replaces the null-text noisy image with a noisier version, and Image disturbance (Image-D), which uses a clean input image. Both methods create a misalignment between null-text and text guidance, driving the generation towards a cartoon style. Both Back-D and Image-D can generate free-form cartoon images from text prompts and cartoonize specific input images. Image-D generally produces higher-fidelity cartoons with richer details compared to Back-D. The degree of cartoonization is influenced by the correlation between the null-text noisy image and the input image, with higher correlation leading to better results. The effectiveness of the cartoonization relies heavily on the performance of the underlying text-to-image diffusion model. The paper mainly explores visual cartoonization, leaving exploration of other artistic styles for future work. diffusion models, cartoonization, classifier-free guidance, null-text guidance, image generation
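Back-D can be sketched on top of ordinary classifier-free guidance, which extrapolates the prediction away from the null-text branch. The only change is that the null-text branch is evaluated on a noisier latent x_{t+Δt} at timestep t+Δt while the text branch still sees x_t; everything else follows standard CFG. The `unet(latent, timestep, cond)` interface below is a placeholder, not a specific library API.

```python
import torch

def cfg_with_backd(unet, x_t, x_t_plus, t, t_plus, text_emb, null_emb, guidance_scale=7.5):
    """One guided denoising step with Rollback disturbance (Back-D), sketched.

    Standard classifier-free guidance:  eps = eps_null + w * (eps_text - eps_null).
    Back-D only changes the null-text branch: it is evaluated on the noisier
    latent x_{t+dt} at timestep t+dt, creating the misalignment that the paper
    reports drives generations toward a cartoon style.
    """
    eps_text = unet(x_t, t, text_emb)            # text guidance, ordinary input
    eps_null = unet(x_t_plus, t_plus, null_emb)  # null-text guidance, rolled-back input
    return eps_null + guidance_scale * (eps_text - eps_null)

# Toy check with a dummy noise predictor standing in for a real diffusion UNet.
dummy_unet = lambda x, t, c: 0.1 * x + 0.01 * t
x_t, x_t_plus = torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64)
print(cfg_with_backd(dummy_unet, x_t, x_t_plus, 50, 80, None, None).shape)
```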
2305.06525 Report Pyramid Texture Filtering Qing Zhang, Hao Jiang, Yongwei Nie, Wei-Shi Zheng We present a simple but effective technique to smooth out textures while preserving the prominent structures. Our method is built upon a key observation -- the coarsest level in a Gaussian pyramid often naturally eliminates textures and summarizes the main image structures. This inspires our central idea for texture filtering, which is to progressively upsample the very low-resolution coarsest Gaussian pyramid level to a full-resolution texture smoothing result with well-preserved structures, under the guidance of each fine-scale Gaussian pyramid level and its associated Laplacian pyramid level. We show that our approach is effective to separate structure from texture of different scales, local contrasts, and forms, without degrading structures or introducing visual artifacts. We also demonstrate the applicability of our method on various applications including detail enhancement, image abstraction, HDR tone mapping, inverse halftoning, and LDR image enhancement. This paper introduces a novel texture smoothing technique that leverages image pyramids, effectively removing textures while preserving prominent structures. Texture smoothing is crucial in computational photography and image analysis for tasks like image abstraction, detail enhancement, and HDR tone mapping. This method offers a simple yet effective way to achieve this, addressing limitations of previous approaches. The method iteratively upsamples the coarsest level of a Gaussian pyramid to the original image resolution. This upsampling is guided by finer-scale levels of both Gaussian and Laplacian pyramids, ensuring structure preservation while eliminating textures. The coarsest level of a Gaussian pyramid naturally eliminates textures while preserving main image structures. Pyramid-guided structure-aware upsampling effectively removes textures of varying scales, contrasts, and forms without degrading structures or introducing artifacts. The method proves applicable and beneficial in various applications such as detail enhancement, image abstraction, HDR tone mapping, inverse halftoning, and LDR image enhancement. The method may struggle to preserve small-scale structures not present in the coarsest Gaussian pyramid level. Similar to bilateral filtering, over-smoothing can lead to gradient reversal artifacts. image smoothing, structure extraction, image decomposition, image pyramid, upsampling
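The core observation — the coarsest Gaussian-pyramid level drops textures but keeps structure — lends itself to a compact sketch: build the pyramid, then progressively upsample the coarsest level while re-injecting structure from the finer levels. The guided step below (edge-gated Laplacian detail transfer) is a deliberate simplification of the paper's structure-aware upsampling, so treat it as an illustration rather than a reimplementation.

```python
import cv2
import numpy as np

def pyramid_texture_filter_sketch(img: np.ndarray, levels: int = 5,
                                  edge_thresh: float = 0.08) -> np.ndarray:
    """Illustrative sketch of pyramid-based texture filtering.

    The coarsest Gaussian-pyramid level naturally discards textures but keeps
    major structures.  We progressively upsample it back to full resolution; at
    each level, Laplacian detail from the input is re-injected only where the
    guide has strong edges, so structures sharpen while textures stay smoothed.
    The edge-gated detail transfer is our stand-in for the paper's operator.
    """
    img = img.astype(np.float32) / 255.0
    gauss = [img]
    for _ in range(levels):
        gauss.append(cv2.pyrDown(gauss[-1]))

    out = gauss[-1]  # coarsest level: textures largely gone, structure kept
    for lvl in range(levels - 1, -1, -1):
        guide = gauss[lvl]
        h, w = guide.shape[:2]
        out = cv2.resize(out, (w, h), interpolation=cv2.INTER_LINEAR)
        # Laplacian detail of this pyramid level (structures plus textures).
        detail = guide - cv2.resize(gauss[lvl + 1], (w, h), interpolation=cv2.INTER_LINEAR)
        # Edge mask from the guide: keep detail only near prominent structures.
        gray = (cv2.cvtColor((guide * 255).astype(np.uint8), cv2.COLOR_BGR2GRAY) / 255.0).astype(np.float32)
        mag = cv2.magnitude(cv2.Sobel(gray, cv2.CV_32F, 1, 0), cv2.Sobel(gray, cv2.CV_32F, 0, 1))
        mask = np.clip(mag / edge_thresh, 0.0, 1.0)[..., None]
        out = out + mask * detail
    return np.clip(out * 255, 0, 255).astype(np.uint8)

demo = (np.random.rand(240, 320, 3) * 255).astype(np.uint8)  # stand-in for a real photo
print(pyramid_texture_filter_sketch(demo).shape)
```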
2305.06422 Report An Empirical Study on the Robustness of the Segment Anything Model (SAM) Yuqing Wang, Yun Zhao, Linda Petzold The Segment Anything Model (SAM) is a foundation model for general image segmentation. Although it exhibits impressive performance predominantly on natural images, understanding its robustness against various image perturbations and domains is critical for real-world applications where such challenges frequently arise. In this study we conduct a comprehensive robustness investigation of SAM under diverse real-world conditions. Our experiments encompass a wide range of image perturbations. Our experimental results demonstrate that SAM's performance generally declines under perturbed images, with varying degrees of vulnerability across different perturbations. By customizing prompting techniques and leveraging domain knowledge based on the unique characteristics of each dataset, the model's resilience to these perturbations can be enhanced, addressing dataset-specific challenges. This work sheds light on the limitations and strengths of SAM in real-world applications, promoting the development of more robust and versatile image segmentation solutions. This paper presents the first comprehensive robustness analysis of the Segment Anything Model (SAM) under various image perturbations and across different domains. Evaluating SAM's robustness is crucial for real-world applications where image perturbations are common, ensuring its reliability in challenging conditions. The study evaluates SAM on nine diverse datasets spanning various domains, using three prompting methods (point, box, combination) and fifteen image perturbation types with different severity levels. SAM's performance generally declines under perturbed images, with varying degrees of vulnerability across different perturbations. SAM exhibits particular vulnerability to chromatic aberration, motion blur, and Gaussian noise, while showing robustness against brightness and saturation changes. The combination of point and box prompting consistently yields superior results and improved robustness compared to single prompting methods. The study primarily focuses on zero-shot learning and doesn't explore fine-tuning SAM for specific domains or perturbations. Future work includes exploring more adaptive prompting strategies, incorporating human-in-the-loop interactions, and developing dataset-specific data augmentation techniques. image segmentation, segment anything model, robustness evaluation, prompting methods, domain-specific analysis
2305.06402 Report Analyzing Bias in Diffusion-based Face Generation Models Malsha V. Perera, Vishal M. Patel Diffusion models are becoming increasingly popular in synthetic data generation and image editing applications. However, these models can amplify existing biases and propagate them to downstream applications. Therefore, it is crucial to understand the sources of bias in their outputs. In this paper, we investigate the presence of bias in diffusion-based face generation models with respect to attributes such as gender, race, and age. Moreover, we examine how dataset size affects the attribute composition and perceptual quality of both diffusion and Generative Adversarial Network (GAN) based face generation models across various attribute classes. Our findings suggest that diffusion models tend to worsen distribution bias in the training data for various attributes, which is heavily influenced by the size of the dataset. Conversely, GAN models trained on balanced datasets with a larger number of samples show less bias across different attributes. This paper investigates bias in diffusion-based face generation models with respect to gender, race, and age, focusing on the impact of training dataset size on attribute composition and perceptual quality in comparison to GAN-based models. Understanding bias in diffusion models is crucial for promoting fairness and mitigating negative societal consequences when these models are used in real-world applications. The study uses the FFHQ and FairFace datasets to train diffusion and GAN models, analyzing the attribute distribution of generated images with varying training subset sizes and employing attribute classifiers to assess the results. Diffusion models tend to amplify existing biases in training data, particularly for gender, race, and age, and the bias is influenced by dataset size. GAN models, when trained on balanced datasets with larger sample sizes, demonstrate better preservation of attribute composition compared to diffusion models. Data replication is more common in diffusion models trained with smaller datasets, which can contribute to bias in attribute distribution. The study primarily focuses on unconditional face generation using specific diffusion and GAN architectures; exploring other architectures could provide further insights. While automated classifiers were used, potential bias in these classifiers might require further investigation using alternative methods for determining attribute classes. bias in ai, diffusion models, generative adversarial networks, face generation, dataset bias
2305.06386 Report Text-To-Concept (and Back) via Cross-Model Alignment Mazda Moayeri, Keivan Rezaei, Maziar Sanjabi, Soheil Feizi We observe that the mapping between an image's representation in one model to its representation in another can be learned surprisingly well with just a linear layer, even across diverse models. Building on this observation, we propose $\textit{text-to-concept}$, where features from a fixed pretrained model are aligned linearly to the CLIP space, so that text embeddings from CLIP's text encoder become directly comparable to the aligned features. With text-to-concept, we convert fixed off-the-shelf vision encoders to surprisingly strong zero-shot classifiers for free, with accuracy at times even surpassing that of CLIP, despite being much smaller models and trained on a small fraction of the data compared to CLIP. We show other immediate use-cases of text-to-concept, like building concept bottleneck models with no concept supervision, diagnosing distribution shifts in terms of human concepts, and retrieving images satisfying a set of text-based constraints. Lastly, we demonstrate the feasibility of $\textit{concept-to-text}$, where vectors in a model's feature space are decoded by first aligning to the CLIP before being fed to a GPT-based generative model. Our work suggests existing deep models, with presumably diverse architectures and training, represent input samples relatively similarly, and a two-way communication across model representation spaces and to humans (through language) is viable. This paper introduces "text-to-concept", a technique that leverages linear alignment of pretrained vision models to CLIP space to enable direct comparison of text embeddings (representing human concepts) with image features from these models. This method makes existing vision models significantly more interpretable and functional by allowing us to understand and utilize the semantic knowledge encoded in their feature spaces through the lens of human language. The core methodology involves training a linear layer to map image representations from a given vision model to the representation space of a CLIP model. This allows text embeddings from CLIP's text encoder, which inherently represent concepts, to be directly compared to aligned features from the other model, enabling text-to-concept mapping. Linear alignment effectively maps representations across diverse models, indicating they encode information similarly despite different architectures and training. Text-to-concept allows off-the-shelf models to perform zero-shot classification competitively with CLIP, even surpassing it in some cases (e.g., color recognition). The method enables novel applications like building Concept Bottleneck Models without concept supervision, analyzing dataset distributions in terms of human concepts, and performing concept-based image retrieval. Concept vector quality depends on prompt engineering and data used, requiring refinement for optimal performance. The method's success relies on the quality of the underlying models (CLIP, vision encoders, and language models), implying limitations inherited from these components. interpretability, text-to-concept, cross-model alignment, zero-shot learning, concept bottleneck models
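Because the whole method rests on the observation that a single linear layer aligns one model's features to CLIP space, the core loop is short. The sketch below assumes precomputed features from a frozen vision encoder and from CLIP's image encoder; the dimensions, cosine-style regression loss, and training details are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

# Minimal sketch of text-to-concept: learn a linear map from a frozen vision
# encoder's feature space into CLIP image-embedding space, then compare the
# aligned features against CLIP text embeddings for zero-shot classification.

d_model, d_clip = 2048, 512              # e.g. ResNet-50 features -> CLIP ViT-B/32 space (assumed)
aligner = nn.Linear(d_model, d_clip, bias=True)
opt = torch.optim.Adam(aligner.parameters(), lr=1e-3)

def train_step(feats_model: torch.Tensor, feats_clip: torch.Tensor) -> float:
    """One alignment step: regress the frozen model's features onto CLIP's
    image embeddings of the same images (both encoders stay frozen)."""
    pred = nn.functional.normalize(aligner(feats_model), dim=-1)
    target = nn.functional.normalize(feats_clip, dim=-1)
    loss = (1 - (pred * target).sum(-1)).mean()   # cosine-distance objective (assumed)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

@torch.no_grad()
def zero_shot_logits(feats_model: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Aligned features become directly comparable to CLIP text ('concept') embeddings."""
    img = nn.functional.normalize(aligner(feats_model), dim=-1)
    txt = nn.functional.normalize(text_emb, dim=-1)
    return img @ txt.t()                          # one score per class prompt

# Toy usage with random tensors standing in for precomputed features.
print(train_step(torch.randn(32, d_model), torch.randn(32, d_clip)))
print(zero_shot_logits(torch.randn(4, d_model), torch.randn(10, d_clip)).shape)
```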
2305.06356 Report HumanRF: High-Fidelity Neural Radiance Fields for Humans in Motion Mustafa Işık, Martin Rünz, Markos Georgopoulos, Taras Khakhulin, Jonathan Starck, Lourdes Agapito, Matthias Nießner Representing human performance at high-fidelity is an essential building block in diverse applications, such as film production, computer games or videoconferencing. To close the gap to production-level quality, we introduce HumanRF, a 4D dynamic neural scene representation that captures full-body appearance in motion from multi-view video input, and enables playback from novel, unseen viewpoints. Our novel representation acts as a dynamic video encoding that captures fine details at high compression rates by factorizing space-time into a temporal matrix-vector decomposition. This allows us to obtain temporally coherent reconstructions of human actors for long sequences, while representing high-resolution details even in the context of challenging motion. While most research focuses on synthesizing at resolutions of 4MP or lower, we address the challenge of operating at 12MP. To this end, we introduce ActorsHQ, a novel multi-view dataset that provides 12MP footage from 160 cameras for 16 sequences with high-fidelity, per-frame mesh reconstructions. We demonstrate challenges that emerge from using such high-resolution data and show that our newly introduced HumanRF effectively leverages this data, making a significant step towards production-level quality novel view synthesis. Introduces HumanRF, a 4D dynamic neural scene representation that captures full-body human appearance in motion from multi-view video input for high-fidelity novel view synthesis. To close the gap to production-level quality human performance capture, addressing limitations of existing dynamic NeRF methods in handling long sequences with complex motion and high-resolution data. Leverages a novel 4D scene representation with adaptive temporal partitioning, using a low-rank space-time tensor decomposition of feature grids and shared MLPs for efficient and temporally consistent reconstruction. HumanRF consistently outperforms state-of-the-art methods in novel view synthesis quality for long sequences with complex motion. The method effectively utilizes high-resolution data (12MP), capturing fine details beyond the capabilities of previous datasets and techniques. HumanRF demonstrates strong performance on a dynamic furry animal dataset, indicating its potential beyond human subjects. Current implementation requires separate optimization for each sequence, limiting generalization ability. Lacks explicit control over articulation outside training poses. neural rendering, novel view synthesis, human performance capture, 4d dynamic nerf, high-resolution data
2305.06351 Report Reconstructing Animatable Categories from Videos Gengshan Yang, Chaoyang Wang, N Dinesh Reddy, Deva Ramanan Building animatable 3D models is challenging due to the need for 3D scans, laborious registration, and manual rigging, which are difficult to scale to arbitrary categories. Recently, differentiable rendering provides a pathway to obtain high-quality 3D models from monocular videos, but these are limited to rigid categories or single instances. We present RAC that builds category 3D models from monocular videos while disentangling variations over instances and motion over time. Three key ideas are introduced to solve this problem: (1) specializing a skeleton to instances via optimization, (2) a method for latent space regularization that encourages shared structure across a category while maintaining instance details, and (3) using 3D background models to disentangle objects from the background. We show that 3D models of humans, cats, and dogs can be learned from 50-100 internet videos. This supplementary material provides additional details, results, and comparisons for the paper 'Reconstructing Animatable 3D Categories from Videos'. It focuses on shape regularization techniques, handling categories outside the DensePose framework, evaluating performance on the 'Pablo' sequence, and highlighting differences from prior work. This material supplements the main paper with: 1) technical details on shape regularization and ensuring smoothness in pose, deformation, and appearance; 2) extending the method to handle categories without pre-defined DensePose features (e.g., vehicles); 3) quantitative evaluation on the 'Pablo' sequence, comparing with other methods; and 4) a clear comparison table highlighting how the approach differs from related work. The authors elaborate on the use of eikonal regularization for shape and time-dependent positional embeddings for smooth temporal variations. For categories outside DensePose, they explain a two-stage camera pose initialization process using manual annotation and a viewpoint network. Performance evaluation on the 'Pablo' sequence involves calculating the average point-to-surface distances in the clothing region and comparing it to baselines. Finally, a table summarizes key differences from previous works regarding shape, motion, background modeling, and reliance on 3D data. Eikonal regularization improves surface reconstruction quality. The method successfully reconstructs a car category model from videos, demonstrating generalization beyond human shapes. The proposed method outperforms some single-view human shape predictors on the 'Pablo' sequence but shows limitations compared to methods using parametric models or personalized templates. The method depends on an initial viewpoint estimation step for categories outside DensePose, which might require manual annotation. While the approach performs well without 3D supervision, incorporating shape priors could further improve accuracy, especially in challenging cases. 3d reconstruction, category-level modeling, video-based animation, nerf, differentiable rendering
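The eikonal regularization mentioned above is the standard SDF regularizer that penalizes the gradient norm of the signed distance field for deviating from 1; a compact PyTorch version is below (the sphere SDF is only a toy stand-in for the learned shape network).

```python
import torch

def eikonal_loss(sdf_fn, points: torch.Tensor) -> torch.Tensor:
    """Standard eikonal regularizer for implicit surfaces: penalize deviations of
    the SDF gradient norm from 1 at sampled points.
    `sdf_fn` maps (N, 3) points to (N,) signed distances.
    """
    points = points.requires_grad_(True)
    sdf = sdf_fn(points)
    grad, = torch.autograd.grad(sdf.sum(), points, create_graph=True)
    return ((grad.norm(dim=-1) - 1.0) ** 2).mean()

sphere_sdf = lambda p: p.norm(dim=-1) - 1.0       # exact SDF, so the loss is ~0
print(eikonal_loss(sphere_sdf, torch.randn(1024, 3)))
```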
2305.05947 Report iEdit: Localised Text-guided Image Editing with Weak Supervision Rumeysa Bodur, Erhan Gundogdu, Binod Bhattarai, Tae-Kyun Kim, Michael Donoser, Loris Bazzani Diffusion models (DMs) can generate realistic images with text guidance using large-scale datasets. However, they demonstrate limited controllability in the output space of the generated images. We propose a novel learning method for text-guided image editing, namely \texttt{iEdit}, that generates images conditioned on a source image and a textual edit prompt. As a fully-annotated dataset with target images does not exist, previous approaches perform subject-specific fine-tuning at test time or adopt contrastive learning without a target image, leading to issues on preserving the fidelity of the source image. We propose to automatically construct a dataset derived from LAION-5B, containing pseudo-target images with their descriptive edit prompts given input image-caption pairs. This dataset gives us the flexibility of introducing a weakly-supervised loss function to generate the pseudo-target image from the latent noise of the source image conditioned on the edit prompt. To encourage localised editing and preserve or modify spatial structures in the image, we propose a loss function that uses segmentation masks to guide the editing during training and optionally at inference. Our model is trained on the constructed dataset with 200K samples and constrained GPU resources. It shows favourable results against its counterparts in terms of image fidelity, CLIP alignment score and qualitatively for editing both generated and real images. \texttt{iEdit} is a novel learning method for localized text-guided image editing based on Latent Diffusion Models, which takes a source image and a textual edit prompt as input to generate an edited image. Existing text-to-image generation models show limited controllability and struggle to balance preserving fidelity of unmodified regions while implementing localized edits according to the prompt. The method constructs a dataset of image pairs with automatically generated edit prompts by leveraging CLIP embeddings and manipulating image captions. This dataset is used to fine-tune a pre-trained LDM with a novel loss function incorporating segmentation masks for localized editing. \texttt{iEdit} outperforms state-of-the-art methods in terms of CLIP alignment score, demonstrating improved fidelity to the edit prompt. It achieves a good balance between editing and preserving image fidelity, as evidenced by SSIM scores on edited and unmodified regions. The method is computationally efficient, requiring only fine-tuning of a pre-trained LDM on a relatively small dataset with limited GPU resources. The automatic dataset generation may produce suboptimal image pairs, impacting training effectiveness. Evaluation of image editing methods remains challenging due to the lack of standardized metrics and datasets. image editing, text-guided synthesis, diffusion models, weakly supervised learning, semantic segmentation
2305.05901 Report Text-guided High-definition Consistency Texture Model Zhibin Tang, Tiantong He With the advent of depth-to-image diffusion models, text-guided generation, editing, and transfer of realistic textures are no longer difficult. However, due to the limitations of pre-trained diffusion models, they can only create low-resolution, inconsistent textures. To address this issue, we present the High-definition Consistency Texture Model (HCTM), a novel method that can generate high-definition and consistent textures for 3D meshes according to the text prompts. We achieve this by leveraging a pre-trained depth-to-image diffusion model to generate single viewpoint results based on the text prompt and a depth map. We fine-tune the diffusion model with Parameter-Efficient Fine-Tuning to quickly learn the style of the generated result, and apply the multi-diffusion strategy to produce high-resolution and consistent results from different viewpoints. Furthermore, we propose a strategy that prevents the appearance of noise on the textures caused by backpropagation. Our proposed approach has demonstrated promising results in generating high-definition and consistent textures for 3D meshes, as demonstrated through a series of experiments. Presents HCTM, a novel method generating high-definition, consistent textures for 3D meshes from text prompts. Existing text-guided 3D texture generation methods produce low-resolution, inconsistent results. Leverages a pre-trained depth-to-image diffusion model fine-tuned with Parameter-Efficient Fine-Tuning and a multi-diffusion strategy to generate high-resolution, consistent textures from different viewpoints. Also employs textual inversion for better prompt-image alignment and a noise reduction strategy during texture projection. Generates textures with higher consistency than Latent-NeRF and TEXTure. Produces clearer, more detailed textures, as demonstrated with the 'oak wood dining table' prompt. Exhibits greater stability than existing methods, even with challenging prompts like 'gold dining table'. Outperforms baselines in a user study for overall quality, prompt relevance, and texture consistency, especially on complex meshes. Discontinuity, severe flare, and shadows still impact the visual quality. Multi-diffusion strategy doesn't work well in 3D due to UV mapping altering white noise distribution. 3d texture generation, diffusion models, text-guided synthesis, parameter-efficient fine-tuning, multi-diffusion strategy
2305.05803 Report Segment Anything Model (SAM) Enhanced Pseudo Labels for Weakly Supervised Semantic Segmentation Tianle Chen, Zheda Mai, Ruiwen Li, Wei-lun Chao Weakly supervised semantic segmentation (WSSS) aims to bypass the need for laborious pixel-level annotation by using only image-level annotation. Most existing methods rely on Class Activation Maps (CAM) to derive pixel-level pseudo-labels and use them to train a fully supervised semantic segmentation model. Although these pseudo-labels are class-aware, indicating the coarse regions for particular classes, they are not object-aware and fail to delineate accurate object boundaries. To address this, we introduce a simple yet effective method harnessing the Segment Anything Model (SAM), a class-agnostic foundation model capable of producing fine-grained instance masks of objects, parts, and subparts. We use CAM pseudo-labels as cues to select and combine SAM masks, resulting in high-quality pseudo-labels that are both class-aware and object-aware. Our approach is highly versatile and can be easily integrated into existing WSSS methods without any modification. Despite its simplicity, our approach shows consistent gain over the state-of-the-art WSSS methods on both PASCAL VOC and MS-COCO datasets. This paper proposes SEPL, a novel method that leverages the Segment Anything Model (SAM) to enhance pseudo-labels generated by Class Activation Maps (CAM) in weakly supervised semantic segmentation (WSSS). CAM-derived pseudo-labels, while class-aware, often lack object awareness, leading to inaccurate object boundaries. This work addresses this limitation by integrating SAM's ability to produce fine-grained instance masks. SEPL uses CAM pseudo-labels as cues to select and combine relevant SAM masks. It assigns each SAM mask to the class with the largest intersection area and then selects masks based on their overlap with pseudo-labels to mitigate false and partial activations. SEPL consistently improves the quality of pseudo-labels generated by various WSSS methods on PASCAL VOC and MS COCO datasets. Using SEPL-enhanced pseudo-labels for training supervised segmentation models leads to significant performance improvements. SEPL can be directly applied to initial CAMs, potentially replacing time-consuming post-processing steps and accelerating WSSS pipelines. SEPL's effectiveness is contingent on the quality of initial pseudo-labels and SAM masks. Future work includes exploring SAM's hierarchical mask structure for more sophisticated mask selection. weakly supervised semantic segmentation, class activation maps, segment anything model (sam), pseudo-label enhancement, object boundary detection
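The mask-selection step of SEPL is essentially set arithmetic between SAM masks and CAM pseudo-labels, which the sketch below illustrates: each SAM mask is assigned to the class it overlaps most and is kept only if enough of the mask is covered by that class. The 0.5 overlap threshold and function names are assumptions for illustration, not the paper's exact rule.

```python
import numpy as np

def enhance_pseudo_label(cam_label: np.ndarray, sam_masks: list,
                         min_overlap: float = 0.5) -> np.ndarray:
    """Sketch of SAM-enhanced pseudo labels in the spirit of SEPL.

    cam_label:  (H, W) int array of CAM-derived class ids, 0 = background.
    sam_masks:  list of (H, W) boolean instance masks from SAM.
    Each SAM mask is assigned to the class it overlaps most, and kept only if a
    sufficient fraction of the mask is covered by that class's pseudo label.
    """
    out = np.zeros_like(cam_label)
    classes = [c for c in np.unique(cam_label) if c != 0]
    for mask in sam_masks:
        area = mask.sum()
        if area == 0 or not classes:
            continue
        # Intersection of this SAM mask with every CAM class region.
        inter = {c: np.logical_and(mask, cam_label == c).sum() for c in classes}
        best = max(inter, key=inter.get)
        if inter[best] / area >= min_overlap:
            out[mask] = best       # SAM contributes the object-aware boundary
    return out

cam = np.zeros((64, 64), dtype=np.int64); cam[10:40, 10:40] = 3    # coarse CAM blob
sam = [np.zeros((64, 64), dtype=bool)]; sam[0][8:45, 8:45] = True  # tighter SAM mask
print(np.unique(enhance_pseudo_label(cam, sam)))                   # -> [0 3]
```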
2305.05768 Report DifFIQA: Face Image Quality Assessment Using Denoising Diffusion Probabilistic Models Žiga Babnik, Peter Peer, Vitomir Štruc Modern face recognition (FR) models excel in constrained scenarios, but often suffer from decreased performance when deployed in unconstrained (real-world) environments due to uncertainties surrounding the quality of the captured facial data. Face image quality assessment (FIQA) techniques aim to mitigate these performance degradations by providing FR models with sample-quality predictions that can be used to reject low-quality samples and reduce false match errors. However, despite steady improvements, ensuring reliable quality estimates across facial images with diverse characteristics remains challenging. In this paper, we present a powerful new FIQA approach, named DifFIQA, which relies on denoising diffusion probabilistic models (DDPM) and ensures highly competitive results. The main idea behind the approach is to utilize the forward and backward processes of DDPMs to perturb facial images and quantify the impact of these perturbations on the corresponding image embeddings for quality prediction. Because the diffusion-based perturbations are computationally expensive, we also distill the knowledge encoded in DifFIQA into a regression-based quality predictor, called DifFIQA(R), that balances performance and execution time. We evaluate both models in comprehensive experiments on 7 datasets, with 4 target FR models and against 10 state-of-the-art FIQA techniques with highly encouraging results. The source code will be made publicly available. DifFIQA, a novel Face Image Quality Assessment (FIQA) technique, leverages Denoising Diffusion Probabilistic Models (DDPMs) to assess the quality of face images by analyzing their embedding stability under perturbations introduced by the forward and backward diffusion processes. FIQA is crucial for improving the reliability and performance of Face Recognition (FR) models in real-world scenarios, where input image quality can vary significantly. DifFIQA addresses the need for accurate and robust quality assessment across diverse facial characteristics and FR models. DifFIQA utilizes a custom DDPM, trained with time-dependent degradations, to generate noisy and reconstructed versions of input face images. By analyzing the disparities between the embeddings of the original, noisy, and reconstructed images in the target FR model's embedding space, DifFIQA infers the quality of the input image. To enhance efficiency, a distilled regression-based model, DifFIQA(R), is also introduced. DifFIQA and DifFIQA(R) demonstrate highly competitive performance, consistently outperforming state-of-the-art FIQA methods on challenging datasets like IJB-C and XQLFW. The distillation process significantly reduces runtime complexity by three orders of magnitude, making DifFIQA(R) comparable to faster FIQA models without substantial performance degradation. Ablation studies highlight the importance of incorporating image flipping, forward diffusion pass, and appropriate noise levels for optimal DifFIQA performance. The computational complexity of the original DifFIQA model poses a challenge for real-time applications, despite being addressed through distillation. The reliance on CNN-based UNet for denoising in DifFIQA may limit its ability to capture global image properties, suggesting potential improvements with transformer-based models. face image quality assessment, face recognition, denoising diffusion probabilistic models, deep learning, computer vision
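The scoring idea in DifFIQA can be sketched as embedding-stability measurement: run the face through the forward (noising) and backward (denoising) diffusion passes and check how far its identity embedding drifts. The aggregation below (mean of two cosine similarities) is our simplification of the paper's scoring, and the toy encoder stands in for a real face-recognition model.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def diffusion_fiqa_score(fr_embed, image, noisy, restored) -> torch.Tensor:
    """Sketch of a DifFIQA-style quality score: small embedding drift under the
    diffusion perturbations indicates a robust, high-quality face image.

    fr_embed: face-recognition encoder mapping (B, 3, H, W) -> (B, D) embeddings.
    Returns one score per image, roughly in [-1, 1] (higher = better quality).
    """
    e0 = F.normalize(fr_embed(image), dim=-1)
    en = F.normalize(fr_embed(noisy), dim=-1)
    er = F.normalize(fr_embed(restored), dim=-1)
    sim_noisy = (e0 * en).sum(-1)      # stability under the forward (noising) pass
    sim_recon = (e0 * er).sum(-1)      # stability under the backward (denoising) pass
    return 0.5 * (sim_noisy + sim_recon)

# Toy usage: a random "encoder" and random images stand in for a real FR model.
toy_encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 112 * 112, 512))
imgs = torch.randn(2, 3, 112, 112)
print(diffusion_fiqa_score(toy_encoder, imgs, imgs + 0.1 * torch.randn_like(imgs), imgs))
```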
2305.05594 Report PET-NeuS: Positional Encoding Tri-Planes for Neural Surfaces Yiqun Wang, Ivan Skorokhodov, Peter Wonka A signed distance function (SDF) parametrized by an MLP is a common ingredient of neural surface reconstruction. We build on the successful recent method NeuS to extend it by three new components. The first component is to borrow the tri-plane representation from EG3D and represent signed distance fields as a mixture of tri-planes and MLPs instead of representing it with MLPs only. Using tri-planes leads to a more expressive data structure but will also introduce noise in the reconstructed surface. The second component is to use a new type of positional encoding with learnable weights to combat noise in the reconstruction process. We divide the features in the tri-plane into multiple frequency scales and modulate them with sin and cos functions of different frequencies. The third component is to use learnable convolution operations on the tri-plane features using self-attention convolution to produce features with different frequency bands. The experiments show that PET-NeuS achieves high-fidelity surface reconstruction on standard datasets. Following previous work and using the Chamfer metric as the most important way to measure surface reconstruction quality, we are able to improve upon the NeuS baseline by 57% on Nerf-synthetic (0.84 compared to 1.97) and by 15.5% on DTU (0.71 compared to 0.84). The qualitative evaluation reveals how our method can better control the interference of high-frequency noise. Code available at \url{https://github.com/yiqun-wang/PET-NeuS}. Presents PET-NeuS, a novel neural surface reconstruction method utilizing a tri-plane representation modulated by positional encoding and enhanced by self-attention convolution. Aims to enhance the expressiveness of neural surface reconstruction methods for preserving fine-grained local features while mitigating noise interference. Integrates tri-planes into NeuS framework, introduces a novel positional encoding strategy for tri-plane features, and employs self-attention convolution to generate multi-frequency tri-plane features. Achieves state-of-the-art surface reconstruction quality on DTU and NeRF-synthetic datasets, outperforming baselines like NeuS, VolSDF, and HF-NeuS. Demonstrates superior ability to reconstruct fine-grained details, such as bumps, holes, and complex structures, as evidenced by qualitative results. Exhibits faster training time compared to competing methods while maintaining high fidelity in surface reconstruction. Computation time, although faster than some baselines, remains a limitation. Balancing fine detail reconstruction with potential overfitting and noise in flat surface areas requires further investigation. neural surface reconstruction, tri-plane representation, positional encoding, self-attention convolution, multi-view reconstruction
2305.05464 Report Style-A-Video: Agile Diffusion for Arbitrary Text-based Video Style Transfer Nisha Huang, Yuxin Zhang, Weiming Dong Large-scale text-to-video diffusion models have demonstrated an exceptional ability to synthesize diverse videos. However, due to the lack of extensive text-to-video datasets and the necessary computational resources for training, directly applying these models for video stylization remains difficult. Also, given that the noise addition process on the input content is random and destructive, fulfilling the style transfer task's content preservation criteria is challenging. This paper proposes a zero-shot video stylization method named Style-A-Video, which utilizes a generative pre-trained transformer with an image latent diffusion model to achieve a concise text-controlled video stylization. We improve the guidance condition in the denoising process, establishing a balance between artistic expression and structure preservation. Furthermore, to decrease inter-frame flicker and avoid the formation of additional artifacts, we employ a sampling optimization and a temporal consistency module. Extensive experiments show that we can attain superior content preservation and stylistic performance while incurring less consumption than previous solutions. Code will be available at https://github.com/haha-lisa/Style-A-Video. This paper introduces Style-A-Video, a novel zero-shot video stylization method leveraging a generative pre-trained transformer and an image latent diffusion model for text-driven video stylization. Existing text-to-video diffusion models are limited by data scarcity and computational resources, making direct application to video stylization challenging. Also, existing methods struggle to balance stylistic changes with preserving the input video's content. Style-A-Video utilizes a combination of text prompts for style, video frames for content, and attention maps for detailed guidance in the denoising process. It uses a custom guidance method with classifier-free guidance and employs sampling optimization and a temporal consistency module to reduce flicker and artifacts. Style-A-Video achieves superior content preservation compared to existing text-driven video editing approaches. The method demonstrates strong stylistic representation capabilities, effectively transferring styles from text prompts to videos. Evaluations show that Style-A-Video excels in temporal consistency, ensuring smooth transitions between stylized frames. The reliance on pre-trained models may limit the method's flexibility in handling complex or unseen styles. Future work could explore incorporating additional conditioning elements, such as depth maps or motion cues, to enhance stylization control. Investigating the effect of other parameters on video stability is another promising avenue for future research. video stylization, diffusion models, text-driven editing, content preservation, temporal consistency
2305.05445 Report StyleSync: High-Fidelity Generalized and Personalized Lip Sync in Style-based Generator Jiazhi Guan, Zhanwang Zhang, Hang Zhou, Tianshu Hu, Kaisiyuan Wang, Dongliang He, Haocheng Feng, Jingtuo Liu, Errui Ding, Ziwei Liu, Jingdong Wang Despite recent advances in syncing lip movements with any audio waves, current methods still struggle to balance generation quality and the model's generalization ability. Previous studies either require long-term data for training or produce a similar movement pattern on all subjects with low quality. In this paper, we propose StyleSync, an effective framework that enables high-fidelity lip synchronization. We identify that a style-based generator would sufficiently enable such a charming property on both one-shot and few-shot scenarios. Specifically, we design a mask-guided spatial information encoding module that preserves the details of the given face. The mouth shapes are accurately modified by audio through modulated convolutions. Moreover, our design also enables personalized lip-sync by introducing style space and generator refinement on only limited frames. Thus the identity and talking style of a target person could be accurately preserved. Extensive experiments demonstrate the effectiveness of our method in producing high-fidelity results on a variety of scenes. Resources can be found at https://hangz-nju-cuhk.github.io/projects/StyleSync. This paper introduces StyleSync, a novel framework that leverages a modified style-based generator for high-fidelity lip synchronization in both generalized and personalized scenarios. Current lip-syncing methods struggle to achieve high fidelity while maintaining generalization ability and often require extensive training data or produce repetitive lip movements. StyleSync utilizes a mask-guided spatial information encoding module to preserve facial details while modulating mouth shapes according to the input audio. For personalization, it employs style space and generator refinement with limited target data. StyleSync outperforms state-of-the-art methods in one-shot lip-syncing, producing high-fidelity results. The personalized optimization procedure enhances fidelity by capturing individual speaking styles. Extensive evaluations on LRW and VoxCeleb2 datasets demonstrate the effectiveness of StyleSync in terms of both quantitative metrics and subjective user experience. The current method relies on a fixed mask, limiting its ability to handle dynamic head poses or expressions. Extreme jaw positions in target videos might exceed the masked area, posing challenges for seamless blending. Future work will focus on addressing these limitations. lip sync, stylegan, personalized lip-sync, audio-driven facial animation, generative adversarial networks
2305.05208 Report Boosting Visual-Language Models by Exploiting Hard Samples Haonan Wang, Minbin Huang, Runhui Huang, Lanqing Hong, Hang Xu, Tianyang Hu, Xiaodan Liang, Zhenguo Li, Hong Cheng, Kenji Kawaguchi Contrastive Language-Image Pre-training (CLIP) has become the standard for learning cross-modal representations between images and text. Efforts to improve its capabilities typically demand the collection of additional data and retraining with new loss functions. While effective, the added requirements limit their practical use due to the increased resource and time investments needed. In this work, we present HELIP, a cost-effective strategy tailored to enhance the performance of existing CLIP models without the need for training a model from scratch or collecting additional data. Our method allows for effortless integration with existing models' training pipelines, providing an instant boost by training them with selected challenging text-image pairs from their original training datasets. HELIP treats each text-image pair as a single point in the joint vision-language space, identifying those in close proximity as hard pairs. By incorporating the challenging data, pre-trained CLIP models are refined using both the traditional contrastive loss and the newly introduced hard negative margin loss, ensuring the challenging data is fully utilized. On comprehensive benchmarks, HELIP consistently boosts existing models to achieve leading performance. In particular, it improves the zero-shot classification accuracy on ImageNet for SLIP models pre-trained on CC3M, CC12M and YFCC15M datasets. The improvements are 3.05%, 4.47%, and 10.1% respectively, achieved within two epochs of training. In addition, across fine-grained classification datasets, HELIP improves the zero-shot performance of pre-trained CLIP and SLIP by an average of 8.4% and 18.6%, and their linear probe performance by an average of 9.5% and 3.0%. This paper introduces HELIP, a cost-effective method for improving pre-trained CLIP models by fine-tuning them with challenging data selected from their original training datasets. Existing methods to improve CLIP models often require retraining from scratch or collecting additional data, limiting their practical use. HELIP identifies 'hard pairs' – pairs of images and text in close proximity in the joint vision-language space – using a novel Hard Pair Mining (HPM) strategy. It then fine-tunes the model using both the traditional contrastive loss and a new Hard Negative Margin Loss (HNML) that leverages the hard pairs. HELIP consistently boosts the zero-shot classification accuracy of existing CLIP models across various datasets, including ImageNet, CIFAR-10, and CIFAR-100. It significantly improves zero-shot and linear probe performance on fine-grained image classification datasets. HELIP enhances zero-shot image-text retrieval performance on MS-COCO and Flickr30K datasets. The paper acknowledges that the combination of HELIP with image self-supervision and larger training batch sizes could further improve linear probe performance. Future work will explore composition-aware fine-tuning, parameter-efficient tuning, and extending the approach to other contrastive learning domains. contrastive language-image pretraining (clip), hard negative mining, zero-shot learning, image classification, image-text retrieval
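Hard pair mining in HELIP treats every (image, text) training pair as one point in the joint vision-language space and takes its nearest neighbours as hard pairs to mix back into training. The sketch below represents the joint point by concatenating the two unit-normalized embeddings, which is our assumption; the paper defines its own pair similarity and adds a hard negative margin loss on top.

```python
import torch
import torch.nn.functional as F

def mine_hard_pairs(img_emb: torch.Tensor, txt_emb: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Sketch of hard-pair mining in the spirit of HELIP.

    Each (image, text) pair is one point in the joint space; its k nearest
    neighbours are treated as "hard pairs" to include when fine-tuning with the
    contrastive and hard-negative-margin losses.
    Returns an (N, k) tensor of hard-pair indices for each of the N pairs.
    """
    joint = torch.cat([F.normalize(img_emb, dim=-1), F.normalize(txt_emb, dim=-1)], dim=-1)
    joint = F.normalize(joint, dim=-1)
    sim = joint @ joint.t()
    sim.fill_diagonal_(-float("inf"))          # a pair is not its own hard pair
    return sim.topk(k, dim=-1).indices

img_emb, txt_emb = torch.randn(100, 512), torch.randn(100, 512)
hard_idx = mine_hard_pairs(img_emb, txt_emb, k=5)
print(hard_idx.shape)   # torch.Size([100, 5]) -- pairs to mix into each training batch
```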
2305.05189 Report SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models Shanshan Zhong, Zhongzhan Huang, Wushao Wen, Jinghui Qin, Liang Lin Diffusion models, which have emerged to become popular text-to-image generation models, can produce high-quality and content-rich images guided by textual prompts. However, there are limitations to semantic understanding and commonsense reasoning in existing models when the input prompts are concise narrative, resulting in low-quality image generation. To improve the capacities for narrative prompts, we propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models. To reach this goal, we first collect and annotate a new dataset SURD which consists of more than 57,000 semantically corrected multi-modal samples. Each sample contains a simple narrative prompt, a complex keyword-based prompt, and a high-quality image. Then, we align the semantic representation of narrative prompts to the complex prompts and transfer knowledge of large language models (LLMs) to our SUR-adapter via knowledge distillation so that it can acquire the powerful semantic understanding and reasoning capabilities to build a high-quality textual semantic representation for text-to-image generation. We conduct experiments by integrating multiple LLMs and popular pre-trained diffusion models to show the effectiveness of our approach in enabling diffusion models to understand and reason concise natural language without image quality degradation. Our approach can make text-to-image diffusion models easier to use with better user experience, which demonstrates our approach has the potential for further advancing the development of user-friendly text-to-image generation models by bridging the semantic gap between simple narrative prompts and complex keyword-based prompts. The code is released at https://github.com/Qrange-group/SUR-adapter. This paper introduces SUR-adapter, a novel fine-tuning approach for enhancing pre-trained text-to-image diffusion models with improved semantic understanding and reasoning (SUR) capabilities from LLMs. Existing diffusion models often struggle to generate high-quality images from concise narrative prompts due to limitations in the semantic understanding and commonsense reasoning of their text encoders. The authors collect a new dataset (SURD) with simple narrative prompts, complex keyword-based prompts, and corresponding images. They then use SURD to fine-tune diffusion models with SUR-adapter, which leverages knowledge distillation from LLMs and aligns representations of simple and complex prompts. SUR-adapter significantly improves the semantic accuracy of generated images from simple prompts across different diffusion models and control methods. The fine-tuned models maintain image generation quality comparable to the original pre-trained models. Deeper layers of LLMs contribute more effectively to semantic distillation. While improved, SUR-adapter doesn't completely solve the semantic understanding issue, suggesting a need for larger multi-modal datasets and more advanced distillation techniques. The performance difference between LLMs of varying sizes is insignificant, indicating potential limitations in the adapter's capacity to transfer knowledge. diffusion model, large language model, multimodal image generation, adapter, knowledge distillation
2305.04966 Report NerfAcc: Efficient Sampling Accelerates NeRFs Ruilong Li, Hang Gao, Matthew Tancik, Angjoo Kanazawa Optimizing and rendering Neural Radiance Fields is computationally expensive due to the vast number of samples required by volume rendering. Recent works have included alternative sampling approaches to help accelerate their methods, however, they are often not the focus of the work. In this paper, we investigate and compare multiple sampling approaches and demonstrate that improved sampling is generally applicable across NeRF variants under an unified concept of transmittance estimator. To facilitate future experiments, we develop NerfAcc, a Python toolbox that provides flexible APIs for incorporating advanced sampling methods into NeRF related methods. We demonstrate its flexibility by showing that it can reduce the training time of several recent NeRF methods by 1.5x to 20x with minimal modifications to the existing codebase. Additionally, highly customized NeRFs, such as Instant-NGP, can be implemented in native PyTorch using NerfAcc. This paper introduces NerfAcc, a Python toolbox designed to accelerate the training of Neural Radiance Fields (NeRFs) through efficient sampling techniques. Optimizing and rendering NeRFs is computationally expensive due to the numerous samples required for volume rendering. Existing efficient sampling methods are often tightly coupled with specific NeRF implementations, hindering wider adoption. The paper presents a unified view of various sampling methods as constructing a 'transmittance estimator' for importance sampling. NerfAcc decouples the sampling procedure, offering a plug-and-play solution compatible with different NeRF variants. NerfAcc significantly reduces training time (1.5x to 20x) for various NeRF methods across different datasets, often with slightly improved performance. The toolbox allows for the training of an Instant-NGP model with pure Python code, achieving comparable speed and slightly better performance than the original CUDA implementation. The unified framework enables combining different sampling approaches, leading to improved results in certain scenarios (e.g., combining occupancy grid and proposal network). The current implementation primarily focuses on density-based NeRFs for per-scene optimization, with limited support for SDF-based methods. Exploring alternative update functions for the transmittance estimator, beyond EMA and SGD, could be a potential direction for future work. neural radiance fields, nerf, volumetric rendering, importance sampling, transmittance estimation
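The "transmittance estimator" view can be made concrete with plain volume-rendering math: a coarse density estimate (from an occupancy grid or a proposal network) yields per-bin weights T_i * alpha_i, and new samples are drawn from that distribution by inverse-transform sampling so the expensive radiance field is queried only where it matters. The code below is a conceptual sketch of that idea, not the NerfAcc API.

```python
import torch

def importance_sample_from_density(t_bins: torch.Tensor, density: torch.Tensor,
                                   n_samples: int = 32) -> torch.Tensor:
    """Generic transmittance-based importance sampling along one ray.

    t_bins:  (M+1,) bin edges along the ray.
    density: (M,) coarse density estimate per bin (the "transmittance estimator").
    """
    delta = t_bins[1:] - t_bins[:-1]
    alpha = 1.0 - torch.exp(-density * delta)                      # per-bin opacity
    trans = torch.cumprod(torch.cat([alpha.new_ones(1), 1 - alpha]), dim=0)[:-1]
    weights = trans * alpha                                        # volume-rendering weights
    pdf = weights / weights.sum().clamp_min(1e-8)
    cdf = torch.cat([pdf.new_zeros(1), torch.cumsum(pdf, dim=0)])
    # Inverse-transform sampling: concentrate new samples where the estimator
    # expects the surface to be.
    u = torch.rand(n_samples)
    idx = torch.searchsorted(cdf, u, right=True).clamp(1, len(pdf)) - 1
    lo, hi = t_bins[idx], t_bins[idx + 1]
    frac = (u - cdf[idx]) / (cdf[idx + 1] - cdf[idx]).clamp_min(1e-8)
    return lo + frac * (hi - lo)

t = torch.linspace(2.0, 6.0, 65)                    # 64 coarse bins on one ray
sigma = torch.exp(-((t[:-1] - 4.0) ** 2) / 0.05)    # density peaked near t = 4
print(importance_sample_from_density(t, sigma).sort().values[:5])
```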
2305.04790 Report MultiModal-GPT: A Vision and Language Model for Dialogue with Humans Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, Kai Chen We present a vision and language model named MultiModal-GPT to conduct multi-round dialogue with humans. MultiModal-GPT can follow various instructions from humans, such as generating a detailed caption, counting the number of interested objects, and answering general questions from users. MultiModal-GPT is parameter-efficiently fine-tuned from OpenFlamingo, with Low-rank Adapter (LoRA) added both in the cross-attention part and the self-attention part of the language model. We first construct instruction templates with vision and language data for multi-modality instruction tuning to make the model understand and follow human instructions. We find the quality of training data is vital for the dialogue performance, where few data containing short answers can lead the model to respond shortly to any instructions. To further enhance the ability to chat with humans of the MultiModal-GPT, we utilize language-only instruction-following data to train the MultiModal-GPT jointly. The joint training of language-only and visual-language instructions with the \emph{same} instruction template effectively improves dialogue performance. Various demos show the ability of continuous dialogue of MultiModal-GPT with humans. Code, dataset, and demo are at https://github.com/open-mmlab/Multimodal-GPT Introduces MultiModal-GPT, a vision and language model fine-tuned from OpenFlamingo for multi-round dialogues with humans, capable of tasks like detailed captioning, object counting, and general query answering. Aims to bridge the gap in existing models' ability to engage in accurate, human-like multimodal dialogues. Fine-tunes OpenFlamingo with Low-rank Adapter (LoRA) using a unified instruction template for both language and visual-language instruction data, enabling synergistic training and improved performance. Joint training with language and visual-language data significantly improves dialogue performance. Highlights the importance of high-quality training data, showing that datasets with limited response types (e.g., yes/no) can degrade the model's conversational abilities. Demonstrates through various demos MultiModal-GPT's proficiency in maintaining continuous dialogues, handling tasks like recipe generation, object counting, OCR, and general knowledge questions. Current work does not include certain potentially beneficial vision and language instruction datasets like MultiInstruct. Further exploration of advanced techniques to improve the model's ability to handle more complex dialogue scenarios. multimodal dialogue, vision and language model, instruction tuning, openflamingo, multimodal-gpt
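The parameter-efficient part of MultiModal-GPT is standard LoRA: a frozen linear layer gets a trainable low-rank update added to its output, inserted here into the self- and cross-attention blocks of the language model. A minimal module of that form is sketched below; the rank and scaling values are illustrative defaults rather than the repository's configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Low-rank adapter wrapped around a frozen linear layer."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # the pretrained weight stays frozen
            p.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)      # adapter starts as a zero update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

proj = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in proj.parameters() if p.requires_grad)
print(trainable)   # only the two small low-rank matrices are updated
```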
2305.04517 Report DiffBFR: Bootstrapping Diffusion Model Towards Blind Face Restoration Xinmin Qiu, Congying Han, Zicheng Zhang, Bonan Li, Tiande Guo, Xuecheng Nie Blind face restoration (BFR) is important yet challenging. Prior works prefer to exploit GAN-based frameworks to tackle this task due to the balance of quality and efficiency. However, these methods suffer from poor stability and adaptability to long-tail distributions, failing to simultaneously retain source identity and restore detail. We propose DiffBFR, which introduces the Diffusion Probabilistic Model (DPM) to BFR to tackle the above problem, given its superiority over GANs in avoiding training collapse and generating long-tail distributions. DiffBFR utilizes a two-step design that first restores identity information from low-quality images and then enhances texture details according to the distribution of real faces. This design is implemented with two key components: 1) Identity Restoration Module (IRM) for preserving the face details in results. Instead of denoising from a pure Gaussian distribution with LQ images as the condition during the reverse process, we propose a novel truncated sampling method that starts from LQ images with partial noise added. We theoretically prove that this change shrinks the evidence lower bound of the DPM and thus restores more original details. Two cascaded conditional DPMs with different input sizes are introduced to strengthen this sampling effect and reduce the training difficulty of directly generating high-resolution images. 2) Texture Enhancement Module (TEM) for polishing the texture of the image. Here an unconditional DPM, an LQ-free model, is introduced to further force the restorations to appear realistic. We theoretically prove that this unconditional DPM, trained on pure HQ images, helps correct the distribution of the images output by the IRM in pixel space. Truncated sampling with a fractional time step is utilized to polish pixel-level textures while preserving identity information. This paper introduces DiffBFR, the first approach applying pure diffusion models to Blind Face Restoration (BFR). It leverages the advantages of diffusion models over GANs in handling long-tail distributions and avoiding training collapse. BFR, crucial for various applications, remains challenging due to the complex degradation in real-world images. Existing GAN-based methods struggle to restore fine-grained details and realistic textures, especially in cases of long-tail distribution. DiffBFR utilizes a two-step design: 1) **IRM (Identity Restoration Module):** Restores identity information from low-quality images using a novel truncated sampling method with cascaded conditional DPMs. 2) **TEM (Texture Enhancement Module):** Enhances texture details based on the distribution of real faces learned by an unconditional DPM trained on HQ images. DiffBFR achieves superior quantitative results compared to state-of-the-art methods, as demonstrated by FID, NIQE, and LPIPS metrics. Qualitative results highlight DiffBFR's ability to restore high-fidelity facial details, maintain person identities, and produce realistic textures. Theoretical analysis and ablation studies validate the effectiveness of the proposed IRM and TEM modules. Inference time, though reduced by truncated sampling, remains longer than GAN-based methods, requiring further optimization. The cascaded multi-stage structure results in a larger parameter scale compared to single-stage diffusion models. blind face restoration, diffusion probabilistic models, long-tail distribution, identity restoration, texture enhancement
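The truncated sampling idea in DiffBFR's IRM — start the reverse diffusion from the low-quality image with only partial noise added, rather than from pure Gaussian noise — can be sketched as follows. This is a minimal illustration using the diffusers DDPM scheduler; the truncation step t0 and the placeholder denoiser are assumptions, not the paper's models.

```python
# Truncated diffusion sampling sketch: jump to an intermediate noise level
# of the LQ image, then run only the remaining reverse steps.
import torch
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)
t0 = 400                                        # truncation step (assumed), t0 < 1000

lq_image = torch.rand(1, 3, 64, 64) * 2 - 1     # stand-in low-quality face in [-1, 1]
noise = torch.randn_like(lq_image)
# q(x_{t0} | x_LQ): add noise up to step t0 instead of starting from pure noise
x_t = scheduler.add_noise(lq_image, noise, torch.tensor([t0]))

def denoiser(x, t):
    # placeholder for the conditional DPM of the IRM; returns predicted noise
    return torch.zeros_like(x)

scheduler.set_timesteps(1000)
for t in scheduler.timesteps:
    if t > t0:                                  # skip steps above the truncation point
        continue
    eps = denoiser(x_t, t)
    x_t = scheduler.step(eps, t, x_t).prev_sample
restored = x_t
```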
2305.04470 Report Video Object Segmentation in Panoptic Wild Scenes Yuanyou Xu, Zongxin Yang, Yi Yang In this paper, we introduce semi-supervised video object segmentation (VOS) to panoptic wild scenes and present a large-scale benchmark as well as a baseline method for it. Previous benchmarks for VOS with sparse annotations are not sufficient to train or evaluate a model that needs to process all possible objects in real-world scenarios. Our new benchmark (VIPOSeg) contains exhaustive object annotations and covers various real-world object categories which are carefully divided into subsets of thing/stuff and seen/unseen classes for comprehensive evaluation. Considering the challenges in panoptic VOS, we propose a strong baseline method named panoptic object association with transformers (PAOT), which uses panoptic identification to associate objects with a pyramid architecture on multiple scales. Experimental results show that VIPOSeg can not only boost the performance of VOS models by panoptic training but also evaluate them comprehensively in panoptic scenes. Previous methods for classic VOS still need to improve in performance and efficiency when dealing with panoptic scenes, while our PAOT achieves SOTA performance with good efficiency on VIPOSeg and previous VOS benchmarks. PAOT also ranks 1st in the VOT2022 challenge. Our dataset is available at https://github.com/yoxu515/VIPOSeg-Benchmark. This paper introduces the concept of panoptic video object segmentation (VOS) and presents VIPOSeg, a new large-scale benchmark dataset with exhaustive object annotations encompassing seen/unseen and thing/stuff classes, along with a baseline method PAOT for this task. Existing VOS benchmarks are limited by sparse annotations and lack of diverse object categories, hindering the development and evaluation of models equipped for real-world scenarios with numerous objects and stuff classes. The authors leverage the VIPSeg dataset for video panoptic segmentation to create VIPOSeg by re-splitting the data, converting annotations to VOS format, and meticulously cleaning the annotations. They also introduce PAOT, which employs decoupled identity banks for thing/stuff objects, a pyramid architecture for multi-scale matching, and efficient long-short term transformers to address panoptic scene challenges. VIPOSeg proves to be significantly more challenging than previous VOS benchmarks due to its dense object annotations and diverse object scales. Training on VIPOSeg significantly improves the performance of VOS methods, including on classic VOS benchmarks. PAOT, with its pyramid architecture and panoptic ID strategy, achieves state-of-the-art performance on VIPOSeg and other benchmarks, demonstrating its effectiveness for panoptic VOS. The efficiency of current VOS models on VIPOSeg requires further improvement, as most models demand over 11 GB of memory. Future work includes exploring better memory strategies and larger ID capacities to enhance model efficiency for panoptic VOS. video object segmentation, panoptic segmentation, benchmark dataset, deep learning, computer vision
2305.04461 Report Locally Attentional SDF Diffusion for Controllable 3D Shape Generation Xin-Yang Zheng, Hao Pan, Peng-Shuai Wang, Xin Tong, Yang Liu, Heung-Yeung Shum Although the recent rapid evolution of 3D generative neural networks greatly improves 3D shape generation, it is still not convenient for ordinary users to create 3D shapes and control the local geometry of generated shapes. To address these challenges, we propose a diffusion-based 3D generation framework -- locally attentional SDF diffusion, to model plausible 3D shapes, via 2D sketch image input. Our method is built on a two-stage diffusion model. The first stage, named occupancy-diffusion, aims to generate a low-resolution occupancy field to approximate the shape shell. The second stage, named SDF-diffusion, synthesizes a high-resolution signed distance field within the occupied voxels determined by the first stage to extract fine geometry. Our model is empowered by a novel view-aware local attention mechanism for image-conditioned shape generation, which takes advantage of 2D image patch features to guide 3D voxel feature learning, greatly improving local controllability and model generalizability. Through extensive experiments in sketch-conditioned and category-conditioned 3D shape generation tasks, we validate and demonstrate the ability of our method to provide plausible and diverse 3D shapes, as well as its superior controllability and generalizability over existing work. Our code and trained models are available at https://zhengxinyang.github.io/projects/LAS-Diffusion.html This paper proposes LAS-Diffusion, a novel two-stage diffusion-based 3D shape generation framework that takes 2D sketch images as input, aiming at achieving plausible 3D shape generation with local controllability. Existing 3D shape generation methods struggle with quality and lack intuitive control, especially for ordinary users who wish to embed creative ideas into the generation process. The framework employs two stages: occupancy-diffusion, generating a low-resolution occupancy field to approximate the shape shell; and SDF-diffusion, synthesizing a high-resolution signed distance field within the occupied voxels. It utilizes a view-aware local attention mechanism to leverage 2D image patch features for guiding 3D voxel feature learning and local control. LAS-Diffusion outperforms existing methods in terms of local controllability and generalizability for sketch-conditioned generation tasks. The method exhibits superior shape quality and diversity for category-conditioned generation tasks. It demonstrates robustness to different sketch styles, including synthetic sketches, freehand sketches, and professional sketches. The model's sketch style is currently limited to the rendering pipeline used during training, making it less adaptable to highly distorted or inconsistent sketches. The current work focuses on shape geometry, with future work aiming to incorporate shape appearance generation using sketches and language descriptions. 3d shape generation, diffusion model, sketch-conditioned, local attention, sdf
2305.04451 Report FashionTex: Controllable Virtual Try-on with Text and Texture Anran Lin, Nanxuan Zhao, Shuliang Ning, Yuda Qiu, Baoyuan Wang, Xiaoguang Han Virtual try-on attracts increasing research attention as a promising way for enhancing the user experience for online cloth shopping. Though existing methods can generate impressive results, users need to provide a well-designed reference image containing the target fashion clothes that often do not exist. To support user-friendly fashion customization in full-body portraits, we propose a multi-modal interactive setting by combining the advantages of both text and texture for multi-level fashion manipulation. With the carefully designed fashion editing module and loss functions, FashionTex framework can semantically control cloth types and local texture patterns without annotated pairwise training data. We further introduce an ID recovery module to maintain the identity of input portrait. Extensive experiments have demonstrated the effectiveness of our proposed pipeline. Presents FashionTex, a novel pipeline for interactive and controllable full-body virtual try-on using text prompts to modify clothing types and texture patches to adjust local patterns. Addresses limitations of existing virtual try-on methods that require reference images with specific clothing items, enabling user-friendly fashion customization. Leverages StyleGAN's latent space for editing, designing a fashion editing module with separate mappers for text and texture inputs. Introduces a novel CLIP-based type loss for accurate cloth type manipulation and an ID recovery module to maintain portrait identity. Achieves precise control over cloth types and textures, enabling customization based on textual descriptions and reference patches. Outperforms existing text-driven image manipulation methods (TediGAN, StyleCLIP) in fashion type editing based on FID and accuracy metrics. Exhibits superior performance in texture transfer compared to TextureGAN, DiOr, and Texture Reformer, as evidenced by FID and LPIPS scores. Reliance on human parsing for region-specific editing may introduce errors. Limited diversity in generated clothing styles due to the training dataset. virtual try-on, fashion editing, multi-modal learning, text-to-image synthesis, stylegan
2305.04441 Report Prompt Tuning Inversion for Text-Driven Image Editing Using Diffusion Models Wenkai Dong, Song Xue, Xiaoyue Duan, Shumin Han Recently large-scale language-image models (e.g., text-guided diffusion models) have considerably improved the image generation capabilities to generate photorealistic images in various domains. Based on this success, current image editing methods use texts to achieve intuitive and versatile modification of images. To edit a real image using diffusion models, one must first invert the image to a noisy latent from which an edited image is sampled with a target text prompt. However, most methods lack one of the following: user-friendliness (e.g., additional masks or precise descriptions of the input image are required), generalization to larger domains, or high fidelity to the input image. In this paper, we design an accurate and quick inversion technique, Prompt Tuning Inversion, for text-driven image editing. Specifically, our proposed editing method consists of a reconstruction stage and an editing stage. In the first stage, we encode the information of the input image into a learnable conditional embedding via Prompt Tuning Inversion. In the second stage, we apply classifier-free guidance to sample the edited image, where the conditional embedding is calculated by linearly interpolating between the target embedding and the optimized one obtained in the first stage. This technique ensures a superior trade-off between editability and high fidelity to the input image of our method. For example, we can change the color of a specific object while preserving its original shape and background under the guidance of only a target text prompt. Extensive experiments on ImageNet demonstrate the superior editing performance of our method compared to the state-of-the-art baselines. This paper proposes Prompt Tuning Inversion, a novel text-driven image editing method based on diffusion models that allows accurate and efficient editing of real images using only target text prompts. Existing text-driven image editing methods lack in user-friendliness (requiring masks or image descriptions), generalization to larger domains, or fidelity to the input image. This method aims to address these shortcomings. The method has two stages. First, Prompt Tuning Inversion encodes the input image information into a learnable conditional embedding. Second, it linearly interpolates this embedding with the target text embedding to guide the diffusion model in generating an edited image. The method outperforms state-of-the-art baselines like DiffEdit in terms of the trade-off between editability and fidelity to the input image. Prompt Tuning Inversion demonstrates faster convergence and superior reconstruction quality compared to Null-Text Inversion. Ablation studies highlight the influence of interpolation ratio and learning rate on the balance between editability and fidelity. The method may not successfully edit all instances of an object in an image with multiple objects. Future work could explore techniques like precise attention map manipulation or multi-modal conditional control to overcome this limitation. image editing, diffusion models, text-guided synthesis, prompt tuning, image inversion
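The editing stage described for Prompt Tuning Inversion — interpolate between the optimized (inverted) embedding and the target text embedding, then apply classifier-free guidance — reduces to a few lines. The tensor shapes, the interpolation ratio eta, and the dummy epsilon model below are illustrative assumptions.

```python
# Sketch of embedding interpolation plus classifier-free guidance.
import torch

def edited_noise_pred(eps_model, x_t, t, emb_opt, emb_target, emb_uncond,
                      eta=0.7, guidance_scale=7.5):
    # interpolate between the inverted embedding and the target prompt embedding
    emb_cond = eta * emb_target + (1.0 - eta) * emb_opt
    eps_cond = eps_model(x_t, t, emb_cond)
    eps_uncond = eps_model(x_t, t, emb_uncond)
    # classifier-free guidance
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# toy usage with a dummy epsilon model and random embeddings
eps_model = lambda x, t, c: torch.zeros_like(x)
x_t = torch.randn(1, 4, 64, 64)
emb = lambda: torch.randn(1, 77, 768)
eps = edited_noise_pred(eps_model, x_t, 50, emb(), emb(), emb())
```

A larger eta pushes the result toward the target text (more editable), while a smaller eta stays closer to the inverted image (higher fidelity), which is the trade-off the ablations above examine.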
2305.04440 Report Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting Zhicheng Wang, Liwen Xiao, Zhiguo Cao, Hao Lu Class-agnostic counting (CAC) aims to count objects of interest from a query image given few exemplars. This task is typically addressed by extracting the features of query image and exemplars respectively and then matching their feature similarity, leading to an extract-then-match paradigm. In this work, we show that CAC can be simplified in an extract-and-match manner, particularly using a vision transformer (ViT) where feature extraction and similarity matching are executed simultaneously within the self-attention. We reveal the rationale of such simplification from a decoupled view of the self-attention. The resulting model, termed CACViT, simplifies the CAC pipeline into a single pretrained plain ViT. Further, to compensate the loss of the scale and the order-of-magnitude information due to resizing and normalization in plain ViT, we present two effective strategies for scale and magnitude embedding. Extensive experiments on the FSC147 and the CARPK datasets show that CACViT significantly outperforms state-of-the art CAC approaches in both effectiveness (23.60% error reduction) and generalization, which suggests CACViT provides a concise and strong baseline for CAC. Code will be available. This paper presents CACViT, a simple yet effective ViT-based model for class-agnostic counting (CAC) that simplifies CAC into an extract-and-match paradigm within a single pretrained plain ViT, outperforming state-of-the-art approaches. CAC is important due to its potential to generalize to unseen scenes and reduced reliance on class-specific training data. Existing methods suffer from redundancy and task-specific designs by following an extract-then-match paradigm. The authors leverage the self-attention mechanism of ViT to simultaneously extract features and perform matching for query images and exemplars. Two effective strategies, aspect-ratio-aware scale embedding and order-of-magnitude embedding, are introduced to compensate for scale information loss due to resizing and normalization in ViT. CACViT significantly outperforms state-of-the-art CAC approaches on FSC147, achieving relative error reductions of 19.04% and 23.60% on the validation and test sets, respectively. The method demonstrates strong cross-dataset generalization ability on the CARPK car counting dataset. Ablation studies validate the effectiveness of the proposed scale and magnitude embedding strategies. The performance on the FSC147 test set for the 1-shot setting is unexpectedly better than the 3-shot setting, potentially due to annotation quality issues for dense environments. The improvement on MAE by magnitude embedding alone is marginal, possibly due to inaccurate object size priors from resized exemplars. class-agnostic counting, vision transformer, self-attention, scale embedding, magnitude embedding
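The extract-and-match simplification in CACViT amounts to running query-image patch tokens and exemplar tokens through the same ViT self-attention, so feature extraction and query-exemplar similarity matching happen in one operation. A minimal sketch follows; the token counts and dimensions are assumptions, and the scale and magnitude embeddings are omitted.

```python
# Joint extraction-and-matching via plain self-attention over concatenated tokens.
import torch
import torch.nn as nn

dim, heads = 384, 6
query_tokens = torch.randn(1, 196, dim)          # 14x14 patches of the query image
exemplar_tokens = torch.randn(1, 3 * 16, dim)    # 3 exemplars, 4x4 patches each

block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
tokens = torch.cat([query_tokens, exemplar_tokens], dim=1)
out = block(tokens)                              # extraction + matching in one pass

# the attention between query tokens and exemplar tokens inside this block plays
# the role of the similarity map that extract-then-match pipelines compute separately
query_out = out[:, :196]                         # would feed a density-map decoder
```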
2305.04296 Report HashCC: Lightweight Method to Improve the Quality of the Camera-less NeRF Scene Generation Jan Olszewski Neural Radiance Fields have become a prominent method of scene generation via view synthesis. A critical requirement for the original algorithm to learn a meaningful scene representation is camera pose information for each image in a dataset. Current approaches try to circumvent this assumption with moderate success by learning approximate camera positions alongside neural representations of a scene. This either requires complicated camera models, leading to a long and involved training process, or results in a lack of texture and sharp detail in rendered scenes. In this work, we introduce Hash Color Correction (HashCC) -- a lightweight method for improving the rendered image quality of Neural Radiance Fields, applicable also in situations where camera positions for a given set of images are unknown. Introduces HashCC, a lightweight color correction method for NeRF to enhance rendered image quality in camera-less scenarios. Addresses limitations of existing camera-less NeRF methods that produce blurry results or require complex models and training, hindering wider application. Extends NeRF-/- by incorporating a Color Correction Network with Hash Encoding and a shallow MLP, adding a color correction term to the main network output. Uses Spherical Harmonics encoding for viewing direction. Improves rendered image quality in 6 out of 8 scenes from the LLFF dataset based on PSNR, SSIM, and LPIPS metrics. Enhances camera pose estimation compared to NeRF-/- in most scenes. Shows sharper details and textures compared to baseline, as demonstrated in qualitative comparisons. Limited improvement in camera pose estimation due to reliance on a simple pinhole camera model. Future work can explore sophisticated camera models and exploit camera trajectory information for improved pose estimation, especially in non-forward-facing scenarios. neural radiance fields, camera-less nerf, hash encoding, color correction, view synthesis
2305.04268 Report Multi-Space Neural Radiance Fields Ze-Xin Yin, Jiaxiong Qiu, Ming-Ming Cheng, Bo Ren Existing Neural Radiance Fields (NeRF) methods suffer from the existence of reflective objects, often resulting in blurry or distorted rendering. Instead of calculating a single radiance field, we propose a multi-space neural radiance field (MS-NeRF) that represents the scene using a group of feature fields in parallel sub-spaces, which leads to a better understanding of the neural network toward the existence of reflective and refractive objects. Our multi-space scheme works as an enhancement to existing NeRF methods, with only small computational overheads needed for training and inferring the extra-space outputs. We demonstrate the superiority and compatibility of our approach using three representative NeRF-based models, i.e., NeRF, Mip-NeRF, and Mip-NeRF 360. Comparisons are performed on a novelly constructed dataset consisting of 25 synthetic scenes and 7 real captured scenes with complex reflection and refraction, all having 360-degree viewpoints. Extensive experiments show that our approach significantly outperforms the existing single-space NeRF methods for rendering high-quality scenes concerned with complex light paths through mirror-like objects. Our code and dataset will be publicly available at https://zx-yin.github.io/msnerf. This paper introduces Multi-Space Neural Radiance Fields (MS-NeRF), a novel method addressing challenges in rendering reflective objects in 360-degree scenes. Existing NeRF methods struggle with reflections, often producing blurry results due to violated multi-view consistency. The MS-NeRF represents scenes as multiple virtual sub-spaces, each adhering to multi-view consistency. This is achieved using a lightweight multi-space module replacing the original output layer of NeRF, generating densities and features for each sub-space, subsequently decoded and weighted for final rendering. MS-NeRF significantly outperforms single-space NeRF methods in rendering scenes with complex reflections. The approach demonstrates compatibility with various NeRF architectures like NeRF, Mip-NeRF, and Mip-NeRF 360. A new dataset featuring complex reflections and refractions is introduced, including synthetic and real-world captured scenes. The number of sub-spaces, while not needing to precisely match the virtual images, requires careful selection. Future work could explore automatically determining the optimal number of sub-spaces. neural radiance fields, nerf, reflections, 360-degree rendering, novel view synthesis
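The multi-space scheme of MS-NeRF can be pictured as rendering K parallel feature sub-spaces along each ray and blending the decoded colors with learned per-pixel weights. The sketch below volume-renders random stand-in fields; the number of sub-spaces, feature size, and the tiny decoder/gate heads are illustrative assumptions, not the paper's architecture.

```python
# Per-sub-space volume rendering followed by learned weighted blending.
import torch
import torch.nn as nn
import torch.nn.functional as F

K, n_samples, feat_dim = 6, 64, 16               # sub-spaces, samples per ray, feature size

def volume_render(sigma, feats, deltas):
    # standard alpha compositing along the ray, done independently per sub-space
    alpha = 1.0 - torch.exp(-sigma * deltas)                 # (rays, K, n_samples)
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[..., :1]), trans[..., :-1]], dim=-1)
    weights = alpha * trans
    return (weights.unsqueeze(-1) * feats).sum(dim=-2)       # (rays, K, feat_dim)

decoder = nn.Linear(feat_dim, 3)    # feature -> RGB, shared across sub-spaces
gate = nn.Linear(feat_dim, 1)       # feature -> blending logit per sub-space

rays = 1024
sigma = torch.rand(rays, K, n_samples)
feats = torch.rand(rays, K, n_samples, feat_dim)
deltas = torch.full((rays, K, n_samples), 0.02)

per_space_feat = volume_render(sigma, feats, deltas)         # (rays, K, feat_dim)
rgb_per_space = torch.sigmoid(decoder(per_space_feat))       # (rays, K, 3)
blend = F.softmax(gate(per_space_feat), dim=1)               # weights over sub-spaces
rgb = (blend * rgb_per_space).sum(dim=1)                     # final pixel colors (rays, 3)
```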
2305.04232 Report CatFLW: Cat Facial Landmarks in the Wild Dataset George Martvel, Nareed Farhat, Ilan Shimshoni, Anna Zamansky Animal affective computing is a quickly growing field of research, where only recently first efforts to go beyond animal tracking into recognizing their internal states, such as pain and emotions, have emerged. In most mammals, facial expressions are an important channel for communicating information about these states. However, unlike the human domain, there is an acute lack of datasets that make automation of facial analysis of animals feasible. This paper aims to fill this gap by presenting a dataset called Cat Facial Landmarks in the Wild (CatFLW) which contains 2016 images of cat faces in different environments and conditions, annotated with 48 facial landmarks specifically chosen for their relationship with underlying musculature, and relevance to cat-specific facial Action Units (CatFACS). To the best of our knowledge, this dataset has the largest amount of cat facial landmarks available. In addition, we describe a semi-supervised (human-in-the-loop) method of annotating images with landmarks, used for creating this dataset, which significantly reduces the annotation time and could be used for creating similar datasets for other animals. The dataset is available on request. This paper introduces CatFLW, a dataset of cat faces annotated with 48 facial landmarks, designed for advancing animal affective computing, especially for cats. Publicly available animal facial landmark datasets are scarce, hindering research in animal affective computing. This is particularly crucial for cats, where facial analysis can help with pain assessment. The authors selected images of single, fully visible cat faces from an existing dataset. They then used a semi-supervised 'human-in-the-loop' method to annotate facial landmarks, leveraging a gradually trained EfficientNet model to expedite the process. CatFLW contains 2016 images annotated with 48 landmarks, focusing on features relevant to cat facial expressions and musculature. The 'human-in-the-loop' annotation significantly reduced annotation time per image compared to purely manual methods. The dataset exhibits diverse cat breeds, environments, and head poses, similar to the AnimalWeb dataset, making it suitable for training robust computer vision models. The dataset size, while the largest of its kind, is still limited compared to human facial landmark datasets. Future work can focus on expanding the dataset, developing automated landmark detection models, and exploring applications in pain and emotion recognition. facial landmark detection, animal affective computing, cat facial expressions, pain assessment, dataset
2305.04075 Report PointCMP: Contrastive Mask Prediction for Self-supervised Learning on Point Cloud Videos Zhiqiang Shen, Xiaoxiao Sheng, Longguang Wang, Yulan Guo, Qiong Liu, Xi Zhou Self-supervised learning can extract representations of good quality from solely unlabeled data, which is appealing for point cloud videos due to their high labelling cost. In this paper, we propose a contrastive mask prediction (PointCMP) framework for self-supervised learning on point cloud videos. Specifically, our PointCMP employs a two-branch structure to achieve simultaneous learning of both local and global spatio-temporal information. On top of this two-branch structure, a mutual similarity based augmentation module is developed to synthesize hard samples at the feature level. By masking dominant tokens and erasing principal channels, we generate hard samples to facilitate learning representations with better discrimination and generalization performance. Extensive experiments show that our PointCMP achieves the state-of-the-art performance on benchmark datasets and outperforms existing full-supervised counterparts. Transfer learning results demonstrate the superiority of the learned representations across different datasets and tasks. This paper proposes PointCMP, a novel contrastive mask prediction framework for self-supervised learning on point cloud videos. Self-supervised learning on point cloud videos is crucial for reducing the high cost of annotation, and current paradigms struggle to effectively capture both local and global spatio-temporal information essential for this task. PointCMP leverages a two-branch structure to learn local and global information, coupled with a mutual similarity based augmentation module to generate hard samples at the feature level by masking dominant tokens and erasing principal channels. PointCMP achieves state-of-the-art performance on benchmark datasets for 3D action and gesture recognition, outperforming fully supervised counterparts. The method demonstrates superior performance in linear probing and semi-supervised settings, highlighting the quality of the learned representations. Transfer learning experiments show strong generalization capabilities across datasets and tasks, exceeding previous self-supervised methods. The current design of PointCMP primarily focuses on single-view point cloud videos, limiting its applicability to multi-view scenarios. Exploring more advanced masking and augmentation strategies at the feature level could further enhance the performance of PointCMP. self-supervised learning, point cloud videos, contrastive learning, mask prediction, action recognition
2305.03989 Report LEO: Generative Latent Image Animator for Human Video Synthesis Yaohui Wang, Xin Ma, Xinyuan Chen, Antitza Dantcheva, Bo Dai, Yu Qiao Spatio-temporal coherency is a major challenge in synthesizing high quality videos, particularly in synthesizing human videos that contain rich global and local deformations. To resolve this challenge, previous approaches have resorted to different features in the generation process aimed at representing appearance and motion. However, in the absence of strict mechanisms to guarantee such disentanglement, a separation of motion from appearance has remained challenging, resulting in spatial distortions and temporal jittering that break the spatio-temporal coherency. Motivated by this, we here propose LEO, a novel framework for human video synthesis, placing emphasis on spatio-temporal coherency. Our key idea is to represent motion as a sequence of flow maps in the generation process, which inherently isolate motion from appearance. We implement this idea via a flow-based image animator and a Latent Motion Diffusion Model (LMDM). The former bridges a space of motion codes with the space of flow maps, and synthesizes video frames in a warp-and-inpaint manner. LMDM learns to capture motion prior in the training data by synthesizing sequences of motion codes. Extensive quantitative and qualitative analysis suggests that LEO significantly improves coherent synthesis of human videos over previous methods on the datasets TaichiHD, FaceForensics and CelebV-HQ. In addition, the effective disentanglement of appearance and motion in LEO allows for two additional tasks, namely infinite-length human video synthesis, as well as content-preserving video editing. This paper proposes LEO, a novel framework for human video synthesis that prioritizes spatio-temporal coherency by representing motion as a sequence of flow maps, effectively disentangling it from appearance. Synthesizing high-quality human videos with strong spatio-temporal coherency is challenging due to the difficulty in disentangling motion from appearance, leading to spatial distortions and temporal jittering. LEO utilizes a two-phase training approach. It first trains a flow-based image animator to learn latent motion codes and their mapping to flow maps. Then, a Latent Motion Diffusion Model (LMDM) learns motion prior from these codes, enabling the synthesis of coherent videos via a warp-and-inpaint mechanism. LEO demonstrates superior spatio-temporal coherency compared to existing methods, even in long videos (512 frames). The disentanglement of appearance and motion enables infinite-length video synthesis and content-preserving video editing. Quantitative evaluations on TaichiHD, FaceForensics, and CelebV-HQ datasets, using metrics like FVD, KVD, and ACD, confirm LEO’s superiority. The quality of unconditionally generated videos depends on the starting frame generated by a separate model, suggesting a need for improvement in that area. The diversity of motion patterns in infinite-length generation is limited by the training data, requiring further research. video synthesis, human video generation, motion disentanglement, diffusion models, spatio-temporal coherency
2305.03713 Report Avatar Fingerprinting for Authorized Use of Synthetic Talking-Head Videos Ekta Prashnani, Koki Nagano, Shalini De Mello, David Luebke, Orazio Gallo Modern generators render talking-head videos with impressive photorealism, ushering in new user experiences such as videoconferencing under constrained bandwidth budgets. Their safe adoption, however, requires a mechanism to verify if the rendered video is trustworthy. For instance, for videoconferencing we must identify cases in which a synthetic video portrait uses the appearance of an individual without their consent. We term this task avatar fingerprinting. Specifically, we learn an embedding in which the motion signatures of one identity are grouped together, and pushed away from those of the other identities. This allows us to link the synthetic video to the identity driving the expressions in the video, regardless of the facial appearance shown. Avatar fingerprinting algorithms will be critical as talking head generators become more ubiquitous, and yet no large scale datasets exist for this new task. Therefore, we contribute a large dataset of people delivering scripted and improvised short monologues, accompanied by synthetic videos in which we render videos of one person using the facial appearance of another. Project page: https://research.nvidia.com/labs/nxp/avatar-fingerprinting/. The paper introduces "avatar fingerprinting", a method to verify the identity of the person driving the expressions in a synthetic talking-head video, regardless of the facial appearance. This is crucial for safe adoption of talking-head generators, which are becoming increasingly realistic, to prevent unauthorized use of someone's likeness. The method extracts temporal facial landmark distances from videos and uses a novel contrastive loss to learn a "dynamic identity embedding". In this embedding space, videos driven by the same identity cluster together, regardless of the target appearance. The method achieves an AUC of 0.886, outperforming baselines adapted from deepfake detection. The learned embeddings capture motion dynamics rather than appearance, evidenced by similar distances between different target identities driven by the same person. The method demonstrates robustness to unseen talking-head generators. The algorithm struggles to differentiate subjects with consistently neutral expressions. Accuracy is affected if the generator fails to capture expressions crucial for distinguishing identities. avatar fingerprinting, talking-head video, synthetic media, identity verification, deepfake detection
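The dynamic identity embedding described for avatar fingerprinting — clips driven by the same person cluster together regardless of rendered appearance — can be sketched with a sequence encoder over facial-landmark distances and a supervised-contrastive-style loss. The encoder architecture and loss form below are assumptions standing in for the paper's exact design.

```python
# Embed landmark-distance sequences and pull together clips with the same driver.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionEncoder(nn.Module):
    def __init__(self, n_dists=68, hidden=128, emb=64):
        super().__init__()
        self.gru = nn.GRU(n_dists, hidden, batch_first=True)
        self.head = nn.Linear(hidden, emb)

    def forward(self, x):                        # x: (clips, frames, n_dists)
        _, h = self.gru(x)
        return F.normalize(self.head(h[-1]), dim=-1)

def identity_contrastive_loss(z, driver_ids, temperature=0.1):
    sim = z @ z.t() / temperature                              # pairwise similarities
    eye = torch.eye(len(z), dtype=torch.bool)
    pos = (driver_ids.unsqueeze(0) == driver_ids.unsqueeze(1)) & ~eye
    logits = sim.masked_fill(eye, -1e9)                        # drop self-pairs
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()

enc = MotionEncoder()
clips = torch.randn(8, 120, 68)                  # 8 clips, 120 frames, 68 landmark distances
drivers = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3]) # identity driving each clip
loss = identity_contrastive_loss(enc(clips), drivers)
```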
2305.03689 Report COLA: A Benchmark for Compositional Text-to-image Retrieval Arijit Ray, Filip Radenovic, Abhimanyu Dubey, Bryan A. Plummer, Ranjay Krishna, Kate Saenko Compositional reasoning is a hallmark of human visual intelligence. Yet, despite the size of large vision-language models, they struggle to represent simple compositions by combining objects with their attributes. To measure this lack of compositional capability, we design Cola, a text-to-image retrieval benchmark to Compose Objects Localized with Attributes. To solve Cola, a model must retrieve images with the correct configuration of attributes and objects and avoid choosing a distractor image with the same objects and attributes but in the wrong configuration. Cola contains about 1.2k composed queries of 168 objects and 197 attributes on around 30K images. Our human evaluation finds that Cola is 83.33% accurate, similar to contemporary compositionality benchmarks. Using Cola as a testbed, we explore empirical modeling designs to adapt pre-trained vision-language models to reason compositionally. We explore 6 adaptation strategies on 2 seminal vision-language models, using compositionality-centric test benchmarks - Cola and CREPE. We find the optimal adaptation strategy is to train a multi-modal attention layer that jointly attends over the frozen pre-trained image and language features. Surprisingly, training multimodal layers on CLIP performs better than tuning a larger FLAVA model with already pre-trained multimodal layers. Furthermore, our adaptation strategy improves CLIP and FLAVA to comparable levels, suggesting that training multimodal layers using contrastive attribute-object data is key, as opposed to using them pre-trained. Lastly, we show that Cola is harder than a closely related contemporary benchmark, CREPE, since simpler fine-tuning strategies without multimodal layers suffice on CREPE but not on Cola. However, we still see a significant gap between our best adaptation and human accuracy, suggesting considerable room for further research. The paper introduces COLA, a text-to-image retrieval benchmark designed to test the ability of vision-language models to compose objects localized with attributes. Compositional reasoning, particularly binding attributes to the correct objects, is crucial for vision-language models to understand complex scenes and execute instructions accurately. The authors create COLA with single and multi-object queries containing multiple attributes. They then explore and compare various fine-tuning strategies on pre-trained models (CLIP and FLAVA) using the benchmark. A lightweight multimodal adaptation strategy using a transformer encoder-decoder to jointly attend over image and language features outperforms common tuning methods like prompt-tuning and fine-tuning. Training multimodal layers on attribute-object data during adaptation is crucial for performance, even surpassing the use of pre-trained multimodal layers in larger models. COLA proves to be a more challenging benchmark than existing ones, highlighting the difficulty of text-to-image retrieval with fine-grained compositional differences. While focusing on attribute-object compositionality, other compositional structures like relationships and scene graphs require further exploration. The fine-tuning strategy, while effective for compositionality, might impact performance on other generic vision-language tasks. compositional reasoning, vision-language models, text-to-image retrieval, attribute-object binding, benchmarking
2305.03382 Report Guided Image Synthesis via Initial Image Editing in Diffusion Model Jiafeng Mao, Xueting Wang, Kiyoharu Aizawa Diffusion models have the ability to generate high quality images by denoising pure Gaussian noise images. While previous research has primarily focused on improving the control of image generation through adjusting the denoising process, we propose a novel direction of manipulating the initial noise to control the generated image. Through experiments on stable diffusion, we show that blocks of pixels in the initial latent images have a preference for generating specific content, and that modifying these blocks can significantly influence the generated image. In particular, we show that modifying a part of the initial image affects the corresponding region of the generated image while leaving other regions unaffected, which is useful for repainting tasks. Furthermore, we find that the generation preferences of pixel blocks are primarily determined by their values, rather than their position. By moving pixel blocks with a tendency to generate user-desired content to user-specified regions, our approach achieves state-of-the-art performance in layout-to-image generation. Our results highlight the flexibility and power of initial image manipulation in controlling the generated image. This paper investigates the impact of the initial noise image in diffusion models, revealing its inherent preference for generating specific content and leveraging this insight to control image generation. This is important because it offers a novel approach to fine-grained control in image generation, addressing limitations of prompt-based methods and enabling tasks like repainting and layout-to-image synthesis. The authors conduct experiments on Stable Diffusion, manipulating the initial noise image by either partially re-randomizing regions or swapping pixel blocks based on attention maps to guide content generation. The initial noise image exhibits distinct preferences for generating specific content, impacting the final output. Modifying regions in the initial image leads to corresponding changes in generated images, enabling repainting tasks. Moving pixel blocks based on their generation tendency achieves state-of-the-art performance in layout-to-image synthesis. The method's effectiveness is limited when guidance bounding boxes are small, particularly affecting small object generation. Future work can explore optimizing the initial image or accelerating denoising based on optimized starting points. text-to-image, diffusion model, fine-grained control, layout-to-image, initial image editing
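The two manipulations studied in this paper — re-randomizing a region of the initial noise for repainting, and relocating a pixel block whose noise tends to produce the desired content for layout control — both operate purely on the starting latent, before any denoising. A minimal sketch follows; the latent size, the regions, and the "preferred" block are assumptions, and the denoising loop itself is unchanged.

```python
# Manipulating the initial latent of a Stable-Diffusion-style model.
import torch

torch.manual_seed(0)
latent = torch.randn(1, 4, 64, 64)               # initial noise latent

# (a) repaint-style edit: resample only the pixels inside a region of interest
y0, y1, x0, x1 = 16, 40, 20, 48
repaint_latent = latent.clone()
repaint_latent[:, :, y0:y1, x0:x1] = torch.randn(1, 4, y1 - y0, x1 - x0)

# (b) layout-style edit: move a noise block (assumed to favour the target object,
# e.g. identified from cross-attention maps) into the user's bounding box
src = latent[:, :, 0:24, 0:28]                   # source block with the desired tendency
layout_latent = latent.clone()
layout_latent[:, :, 32:56, 30:58] = src          # paste it at the requested location
# repaint_latent / layout_latent would then be passed to the usual denoising loop
```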
2305.03374 Report DisenBooth: Identity-Preserving Disentangled Tuning for Subject-Driven Text-to-Image Generation Hong Chen, Yipeng Zhang, Simin Wu, Xin Wang, Xuguang Duan, Yuwei Zhou, Wenwu Zhu Subject-driven text-to-image generation aims to generate customized images of the given subject based on the text descriptions, which has drawn increasing attention. Existing methods mainly resort to finetuning a pretrained generative model, where the identity-relevant information (e.g., the boy) and the identity-irrelevant information (e.g., the background or the pose of the boy) are entangled in the latent embedding space. However, the highly entangled latent embedding may lead to the failure of subject-driven text-to-image generation as follows: (i) the identity-irrelevant information hidden in the entangled embedding may dominate the generation process, resulting in the generated images heavily dependent on the irrelevant information while ignoring the given text descriptions; (ii) the identity-relevant information carried in the entangled embedding can not be appropriately preserved, resulting in identity change of the subject in the generated images. To tackle the problems, we propose DisenBooth, an identity-preserving disentangled tuning framework for subject-driven text-to-image generation. Specifically, DisenBooth finetunes the pretrained diffusion model in the denoising process. Different from previous works that utilize an entangled embedding to denoise each image, DisenBooth instead utilizes disentangled embeddings to respectively preserve the subject identity and capture the identity-irrelevant information. We further design the novel weak denoising and contrastive embedding auxiliary tuning objectives to achieve the disentanglement. Extensive experiments show that our proposed DisenBooth framework outperforms baseline models for subject-driven text-to-image generation with the identity-preserved embedding. Additionally, by combining the identity-preserved embedding and identity-irrelevant embedding, DisenBooth demonstrates more generation flexibility and controllability This paper proposes DisenBooth, an identity-preserving disentangled tuning framework for subject-driven text-to-image generation. Existing methods suffer from entanglement of subject identity and irrelevant information in the latent space, leading to inaccurate subject generation and overfitting to background or pose. DisenBooth disentangles identity-relevant and irrelevant information by using separate textual and visual embeddings during finetuning of a pretrained diffusion model. It employs a weak denoising objective and a contrastive embedding objective to enforce disentanglement. DisenBooth outperforms baseline models in subject-driven text-to-image generation, demonstrating superior identity preservation and text prompt fidelity. The disentangled embeddings enable more flexible and controllable generation by allowing users to combine identity-preserved embeddings with identity-irrelevant embeddings from reference images. DisenBooth effectively disentangles identity-relevant and irrelevant information, as evidenced by ablation studies and visualizations. DisenBooth inherits limitations of the pretrained Stable Diffusion model. Future work could explore more fine-grained disentanglement within identity-irrelevant information. text-to-image generation, subject-driven generation, disentangled representation learning, diffusion models, fine-tuning
2305.03302 Report High-Fidelity 3D Face Generation from Natural Language Descriptions Menghua Wu, Hao Zhu, Linjia Huang, Yiyu Zhuang, Yuanxun Lu, Xun Cao Synthesizing high-quality 3D face models from natural language descriptions is very valuable for many applications, including avatar creation, virtual reality, and telepresence. However, little research ever tapped into this task. We argue the major obstacle lies in 1) the lack of high-quality 3D face data with descriptive text annotation, and 2) the complex mapping relationship between descriptive language space and shape/appearance space. To solve these problems, we build Describe3D dataset, the first large-scale dataset with fine-grained text descriptions for text-to-3D face generation task. Then we propose a two-stage framework to first generate a 3D face that matches the concrete descriptions, then optimize the parameters in the 3D shape and texture space with abstract description to refine the 3D face model. Extensive experimental results show that our method can produce a faithful 3D face that conforms to the input descriptions with higher accuracy and quality than previous methods. The code and Describe3D dataset are released at https://github.com/zhuhao-nju/describe3d . This paper presents a novel method for generating high-fidelity 3D faces from natural language descriptions. Creating 3D faces is crucial for applications like avatars and VR, but current methods struggle to translate textual descriptions into accurate 3D models. The authors created Describe3D, a dataset of 3D faces paired with fine-grained text descriptions. They propose a two-stage pipeline: 1) concrete synthesis maps text to 3D shape and texture, and 2) abstract synthesis refines the model based on abstract descriptions using CLIP. The method generates 3D faces that accurately reflect both concrete and abstract descriptions. Quantitative evaluation shows superior performance over cascading text-to-image and image-to-shape methods, as well as Latent3D. Ablation studies demonstrate the effectiveness of the descriptive code, 3DMM representation, region-specific losses, and abstract synthesis. The method relies on distinguishing concrete and abstract descriptions, and struggles with complex sentences. The dataset has limited racial diversity, impacting performance for certain ethnicities. 3d face generation, text-to-3d, natural language processing, computer vision, deep learning
2305.03051 Report Controllable Visual-Tactile Synthesis Ruihan Gao, Wenzhen Yuan, Jun-Yan Zhu Deep generative models have various content creation applications such as graphic design, e-commerce, and virtual Try-on. However, current works mainly focus on synthesizing realistic visual outputs, often ignoring other sensory modalities, such as touch, which limits physical interaction with users. In this work, we leverage deep generative models to create a multi-sensory experience where users can touch and see the synthesized object when sliding their fingers on a haptic surface. The main challenges lie in the significant scale discrepancy between vision and touch sensing and the lack of explicit mapping from touch sensing data to a haptic rendering device. To bridge this gap, we collect high-resolution tactile data with a GelSight sensor and create a new visuotactile clothing dataset. We then develop a conditional generative model that synthesizes both visual and tactile outputs from a single sketch. We evaluate our method regarding image quality and tactile rendering accuracy. Finally, we introduce a pipeline to render high-quality visual and tactile outputs on an electroadhesion-based haptic device for an immersive experience, allowing for challenging materials and editable sketch inputs. This paper presents a novel method for synthesizing both visual and tactile outputs of garments from user sketches, aiming to create a multi-sensory experience. Existing generative models primarily focus on visual outputs, neglecting other sensory modalities like touch. This work addresses the gap by enabling users to both see and feel synthesized objects, enhancing user experience in various applications like online shopping and virtual reality. The authors collect a new dataset of garments with spatially aligned visual and high-resolution tactile data. They propose a conditional GAN model that learns from dense visual supervision and sparse local tactile supervision to synthesize both outputs from a single sketch. The synthesized outputs are then rendered on a haptic device for an immersive experience. The proposed method outperforms baseline conditional GANs in terms of image quality and perceptual realism, as evidenced by quantitative metrics (LPIPS, SIFID) and human preference studies. The model generalizes to unseen sketches, allowing for user-driven design edits and customization. The system allows for text-conditioned synthesis, enabling users to modify garment designs using text prompts. The model struggles to generalize to user sketches with intricate patterns. The current haptic rendering is limited to surface textures and primarily excels with relatively flat objects like garments, posing challenges for rendering 3D objects with significant surface normal changes. generative models, multi-sensory synthesis, visual-tactile generation, haptic rendering, conditional gans
2305.03049 Report NeuralEditor: Editing Neural Radiance Fields via Manipulating Point Clouds Jun-Kun Chen, Jipeng Lyu, Yu-Xiong Wang This paper proposes NeuralEditor, which makes neural radiance fields (NeRFs) natively editable for general shape editing tasks. Despite their impressive results on novel-view synthesis, it remains a fundamental challenge for NeRFs to edit the shape of the scene. Our key insight is to exploit the explicit point cloud representation as the underlying structure to construct NeRFs, inspired by the intuitive interpretation of NeRF rendering as a process that projects or "plots" the associated 3D point cloud to a 2D image plane. To this end, NeuralEditor introduces a novel rendering scheme based on deterministic integration within K-D tree-guided density-adaptive voxels, which produces both high-quality rendering results and precise point clouds through optimization. NeuralEditor then performs shape editing via mapping associated points between point clouds. Extensive evaluation shows that NeuralEditor achieves state-of-the-art performance in both shape deformation and scene morphing tasks. Notably, NeuralEditor supports both zero-shot inference and further fine-tuning over the edited scene. Our code, benchmark, and demo video are available at https://immortalco.github.io/NeuralEditor. NeuralEditor, a point cloud-guided NeRF model enabling general shape editing by manipulating underlying point clouds. NeRF excels in novel-view synthesis but struggles with shape editing. NeuralEditor leverages point clouds for their ease of manipulation, combining the strengths of both representations. Uses K-D trees for density-adaptive voxels and deterministic spline integration for rendering. Employs Phong reflection for color modeling with normal vectors from the point cloud. Optimizes the point cloud via pruning, growing, and normal vector guidance. NeuralEditor generates more precise point clouds than PointNeRF. Significantly outperforms baselines in shape deformation, both in zero-shot and fine-tuned settings. Achieves smooth scene morphing between multiple scenes, a challenging task for prior work. Point cloud-guided NeRF models, including NeuralEditor, struggle with surfaces having complex visual effects (e.g., translucent mirrors). Shape deformation doesn't consider the surrounding environment, limiting its ability to realistically adjust scene colors based on lighting changes. neural radiance fields, shape editing, point clouds, scene morphing, 3d vision
2305.03048 Report Personalize Segment Anything Model with One Shot Renrui Zhang, Zhengkai Jiang, Ziyu Guo, Shilin Yan, Junting Pan, Xianzheng Ma, Hao Dong, Peng Gao, Hongsheng Li Driven by large-data pre-training, Segment Anything Model (SAM) has been demonstrated as a powerful and promptable framework, revolutionizing the segmentation models. Despite the generality, customizing SAM for specific visual concepts without man-powered prompting is under explored, e.g., automatically segmenting your pet dog in different images. In this paper, we propose a training-free Personalization approach for SAM, termed as PerSAM. Given only a single image with a reference mask, PerSAM first localizes the target concept by a location prior, and segments it within other images or videos via three techniques: target-guided attention, target-semantic prompting, and cascaded post-refinement. In this way, we effectively adapt SAM for private use without any training. To further alleviate the mask ambiguity, we present an efficient one-shot fine-tuning variant, PerSAM-F. Freezing the entire SAM, we introduce two learnable weights for multi-scale masks, only training 2 parameters within 10 seconds for improved performance. To demonstrate our efficacy, we construct a new segmentation dataset, PerSeg, for personalized evaluation, and test our methods on video object segmentation with competitive performance. Besides, our approach can also enhance DreamBooth to personalize Stable Diffusion for text-to-image generation, which discards the background disturbance for better target appearance learning. Code is released at https://github.com/ZrrSkywalker/Personalize-SAM This paper presents PerSAM, a training-free personalization approach for the Segment Anything Model (SAM) that enables it to segment user-designated visual concepts using only one-shot data (a reference image and a mask). While SAM demonstrates impressive general-purpose segmentation abilities, it lacks the capacity to automatically segment specific user-defined objects in new images or videos. PerSAM addresses this limitation, making SAM more practical for personalized use cases. PerSAM leverages a location confidence map derived from feature similarities between the reference and test images to guide SAM's attention. It introduces target-guided attention and target-semantic prompting techniques to enhance SAM's focus on the target object. A fine-tuning variant, PerSAM-F, further improves performance by addressing the ambiguity of segmentation scales through a lightweight, scale-aware fine-tuning process. PerSAM achieves superior performance compared to other in-context learning methods on personalized object segmentation benchmarks, including the authors' newly introduced PerSeg dataset. PerSAM-F, with minimal fine-tuning, further enhances accuracy, surpassing even fully trained video object segmentation models on DAVIS 2017. The authors demonstrate PerSAM's applicability for improving DreamBooth's personalized text-to-image synthesis by mitigating background disturbance. The reliance on an accurate one-shot mask as input can be a limitation, although the authors provide a relaxation by allowing a bounding box as input with a slight performance trade-off. Further exploration of personalization techniques for broader applications of SAM is suggested as future work. personalized object segmentation, segment anything model, one-shot learning, parameter-efficient fine-tuning, dreambooth
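PerSAM's location prior boils down to pooling the reference image's features inside the one-shot mask into a target embedding and taking its cosine similarity with every location of the test image's feature map; the peak then becomes a positive point prompt for SAM. The sketch below uses random stand-in features; the shapes and the feature extractor are assumptions.

```python
# Location confidence map from one-shot reference features (stand-in tensors).
import torch
import torch.nn.functional as F

C, H, W = 256, 64, 64
ref_feats = torch.randn(C, H, W)                 # features of the reference image
ref_mask = torch.zeros(H, W)
ref_mask[20:40, 25:45] = 1.0                     # one-shot mask of the target concept
test_feats = torch.randn(C, H, W)                # features of the new test image

# target embedding: average of reference features inside the mask
target = (ref_feats * ref_mask).sum(dim=(1, 2)) / ref_mask.sum()

# cosine similarity between the target embedding and every test location
confidence = F.cosine_similarity(
    test_feats.reshape(C, -1), target.unsqueeze(1), dim=0
).reshape(H, W)

# the most confident location becomes a positive point prompt for SAM
pos_idx = confidence.argmax()
pos_point = (int(pos_idx % W), int(pos_idx // W))    # (x, y) coordinates
```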
2305.03045 Report OctFormer: Octree-based Transformers for 3D Point Clouds Peng-Shuai Wang We propose octree-based transformers, named OctFormer, for 3D point cloud learning. OctFormer can not only serve as a general and effective backbone for 3D point cloud segmentation and object detection but also have linear complexity and is scalable for large-scale point clouds. The key challenge in applying transformers to point clouds is reducing the quadratic, thus overwhelming, computation complexity of attentions. To combat this issue, several works divide point clouds into non-overlapping windows and constrain attentions in each local window. However, the point number in each window varies greatly, impeding the efficient execution on GPU. Observing that attentions are robust to the shapes of local windows, we propose a novel octree attention, which leverages sorted shuffled keys of octrees to partition point clouds into local windows containing a fixed number of points while permitting shapes of windows to change freely. And we also introduce dilated octree attention to expand the receptive field further. Our octree attention can be implemented in 10 lines of code with open-sourced libraries and runs 17 times faster than other point cloud attentions when the point number exceeds 200k. Built upon the octree attention, OctFormer can be easily scaled up and achieves state-of-the-art performances on a series of 3D segmentation and detection benchmarks, surpassing previous sparse-voxel-based CNNs and point cloud transformers in terms of both efficiency and effectiveness. Notably, on the challenging ScanNet200 dataset, OctFormer outperforms sparse-voxel-based CNNs by 7.3 in mIoU. Our code and trained models are available at https://wang-ps.github.io/octformer. This paper introduces OctFormer, an efficient and scalable transformer architecture for 3D point cloud understanding, based on a novel octree attention mechanism. Existing point cloud transformers suffer from low efficiency, hindering their applicability to large-scale point clouds. OctFormer leverages octree structures to divide point clouds into groups with equal point numbers for efficient window attention, enabling easy parallelization and scalability. OctFormer achieves state-of-the-art performance on ScanNet segmentation, outperforming previous methods including point cloud transformers and sparse-voxel-based CNNs. OctFormer demonstrates superior efficiency, running over 17 times faster than other point cloud transformers on large-scale inputs. OctFormer with only 18M parameters surpasses previous sparse-voxel-based CNNs with 38M parameters on ScanNet segmentation. OctFormer might overfit on small-scale datasets, requiring exploration of unsupervised pretraining techniques. The current positional encoding limits the flexibility of OctFormer, demanding investigation into alternative positional encoding methods. 3d deep learning, point cloud processing, transformers, octree, attention mechanism
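The octree attention at the core of OctFormer orders points along the octree's shuffled-key (z-order) curve, splits the ordered sequence into windows holding a fixed number of points, and applies ordinary multi-head attention within each window. The sketch below uses a simplified Morton key and creates the attention layer inline purely for illustration; dilation and the full backbone are omitted, and none of this is the authors' code.

```python
# Fixed-size window attention over points sorted by a z-order (Morton) key.
import torch
import torch.nn as nn

def morton_keys(xyz, bits=10):
    # quantize coordinates to a grid and interleave their bits (z-order key)
    q = ((xyz - xyz.min(0).values) / (xyz.max(0).values - xyz.min(0).values + 1e-9)
         * (2 ** bits - 1)).long()
    key = torch.zeros(len(xyz), dtype=torch.long)
    for b in range(bits):
        for d in range(3):
            key |= ((q[:, d] >> b) & 1) << (3 * b + d)
    return key

def octree_window_attention(feats, xyz, window=32, heads=4):
    n, c = feats.shape
    order = torch.argsort(morton_keys(xyz))        # sort points along the curve
    pad = (-n) % window                            # pad so every window is full
    f = torch.cat([feats[order], feats.new_zeros(pad, c)]).view(-1, window, c)
    attn = nn.MultiheadAttention(c, heads, batch_first=True)  # inline for the sketch
    out, _ = attn(f, f, f)                         # attention within each window
    out = out.reshape(-1, c)[:n]
    unsorted = torch.empty_like(out)
    unsorted[order] = out                          # restore the original point order
    return unsorted

pts = torch.rand(1000, 3)
feats = torch.randn(1000, 64)
y = octree_window_attention(feats, pts)            # (1000, 64)
```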
2305.03043 Report Single-Shot Implicit Morphable Faces with Consistent Texture Parameterization Connor Z. Lin, Koki Nagano, Jan Kautz, Eric R. Chan, Umar Iqbal, Leonidas Guibas, Gordon Wetzstein, Sameh Khamis There is a growing demand for the accessible creation of high-quality 3D avatars that are animatable and customizable. Although 3D morphable models provide intuitive control for editing and animation, and robustness for single-view face reconstruction, they cannot easily capture geometric and appearance details. Methods based on neural implicit representations, such as signed distance functions (SDF) or neural radiance fields, approach photo-realism, but are difficult to animate and do not generalize well to unseen data. To tackle this problem, we propose a novel method for constructing implicit 3D morphable face models that are both generalizable and intuitive for editing. Trained from a collection of high-quality 3D scans, our face model is parameterized by geometry, expression, and texture latent codes with a learned SDF and explicit UV texture parameterization. Once trained, we can reconstruct an avatar from a single in-the-wild image by leveraging the learned prior to project the image into the latent space of our model. Our implicit morphable face models can be used to render an avatar from novel views, animate facial expressions by modifying expression codes, and edit textures by directly painting on the learned UV-texture maps. We demonstrate quantitatively and qualitatively that our method improves upon photo-realism, geometry, and expression accuracy compared to state-of-the-art methods. This paper proposes a novel method for constructing implicit 3D morphable face models that are both generalizable and intuitive for editing by combining the advantages of template-based 3DMMs with the quality and topological flexibility of implicit 3D representations. There is a growing demand for the accessible creation of high-quality 3D avatars that are animatable and customizable. The proposed method disentangles each facial avatar into identity and expression. It leverages an implicit geometry branch with a signed distance function (SDF) and a UV texture parameterization branch to represent the face. The model is trained on a large dataset of 3D face scans with various expressions. It also utilizes a single-shot inversion framework to map a single in-the-wild RGB image to the implicit 3D morphable model representation. The method achieves state-of-the-art reconstruction accuracy for photo-realistic rendering, geometry, and expression accuracy in the single-view reconstruction setting. The learned texture map is intuitive to edit and propagates naturally during animation. The proposed model demonstrates superior performance in expression and pose transfer between in-the-wild source and target images. The optimization process during inversion is relatively slow, limiting its use in real-time applications. The reliance on a de-lighting module may result in subjects appearing paler than expected and the model does not capture hair or accessories due to the limitations of the training dataset. neural avatars, implicit representations, texture maps, animation, inversion
2305.03040 Report TUVF: Learning Generalizable Texture UV Radiance Fields An-Chieh Cheng, Xueting Li, Sifei Liu, Xiaolong Wang Textures are a vital aspect of creating visually appealing and realistic 3D models. In this paper, we study the problem of generating high-fidelity texture given shapes of 3D assets, which has been relatively less explored compared with generic 3D shape modeling. Our goal is to facilitate a controllable texture generation process, such that one texture code can correspond to a particular appearance style independent of any input shapes from a category. We introduce Texture UV Radiance Fields (TUVF) that generate textures in a learnable UV sphere space rather than directly on the 3D shape. This allows the texture to be disentangled from the underlying shape and transferable to other shapes that share the same UV space, i.e., from the same category. We integrate the UV sphere space with the radiance field, which provides a more efficient and accurate representation of textures than traditional texture maps. We perform our experiments on synthetic and real-world object datasets where we achieve not only realistic synthesis but also substantial improvements over state-of-the-arts on texture controlling and editing. Project Page: https://www.anjiecheng.me/TUVF This paper proposes Texture UV Radiance Fields (TUVF), a novel method for generating high-quality and disentangled textures on 3D objects, enabling controllable texture synthesis and editing. Texture plays a vital role in creating realistic 3D models, but generating high-fidelity, controllable textures remains a challenge. Existing methods often entangle texture with geometry, limiting controllability and transferability. TUVF generates textures in a learnable UV sphere space, disentangling texture from the underlying 3D shape. It utilizes a Canonical Surface Auto-encoder to learn dense correspondence between a canonical UV sphere and object instances, enabling texture transfer across different shapes. A texture generator creates textures on the UV sphere, and a radiance field renders the final textured object. Adversarial learning is employed for training. TUVF achieves state-of-the-art results on CompCars, Photoshape, and DiffusionCats datasets, demonstrating superior texture quality and disentanglement. It enables texture transfer across different shapes with consistent style and local details. TUVF supports texture editing by modifying rendered images and fine-tuning the corresponding texture features. The current one-to-one dense mapping assumption in correspondence learning might not hold in all real-world scenarios with shape variations. Future work could explore incorporating data-driven priors (e.g., diffusion models) and advanced neural rendering architectures (e.g., ray transformers) for further improvement. texture synthesis, 3d deep learning, neural rendering, generative adversarial networks, disentanglement
2305.02981 Report Adversarially-Guided Portrait Matting Sergej Chicherin, Karen Efremyan We present a method for generating alpha mattes using a limited data source. We pretrain a novel transformer-based model (StyleMatte) on portrait datasets. We utilize this model to provide image-mask pairs for the StyleGAN3-based network (StyleMatteGAN). This network is trained in an unsupervised manner and generates previously unseen image-mask training pairs that are fed back to StyleMatte. We demonstrate that the performance of the matting network improves during this cycle, obtaining top results on human portraits and state-of-the-art metrics on the animals dataset. Furthermore, StyleMatteGAN provides high-resolution, privacy-preserving portraits with alpha mattes, making it suitable for various image composition tasks. Our code is available at https://github.com/chroneus/stylematte Presents StyleMatteGAN, a novel approach for generating synthetic portraits with high-quality alpha mattes using a StyleGAN3-based architecture trained in an unsupervised manner. Addresses the scarcity of large, high-quality datasets for portrait matting, a critical challenge in computer vision. Leverages a pretrained StyleGAN3 network modified to generate RGBA images and employs a cyclical training process where a transformer-based matting network (StyleMatte) is iteratively refined using synthetic data from StyleMatteGAN. StyleMatte achieves state-of-the-art results on benchmark datasets like P3M-10k and AM-2k. StyleMatteGAN generates high-resolution, realistic portraits with consistent alpha mattes, as evidenced by FID scores. Cyclical training with synthetic data improves the performance of the StyleMatte matting network. Generated portraits primarily exhibit a frontal head pose due to limitations in the training data. Future work could explore 3D-aware GANs and diffusion models to enhance pose variety and image quality. image matting, generative adversarial networks, stylegan3, synthetic data generation, portrait matting
2305.02677 Report Caption Anything: Interactive Image Description with Diverse Multimodal Controls Teng Wang, Jinrui Zhang, Junjie Fei, Hao Zheng, Yunlong Tang, Zhe Li, Mingqi Gao, Shanshan Zhao Controllable image captioning is an emerging multimodal topic that aims to describe the image with natural language following human purpose, e.g., looking at the specified regions or telling in a particular text style. State-of-the-art methods are trained on annotated pairs of input controls and output captions. However, the scarcity of such well-annotated multimodal data largely limits their usability and scalability for interactive AI systems. Leveraging unimodal instruction-following foundation models is a promising alternative that benefits from broader sources of data. In this paper, we present Caption AnyThing (CAT), a foundation model augmented image captioning framework supporting a wide range of multimodal controls: 1) visual controls, including points, boxes, and trajectories; 2) language controls, such as sentiment, length, language, and factuality. Powered by Segment Anything Model (SAM) and ChatGPT, we unify the visual and language prompts into a modularized framework, enabling the flexible combination between different controls. Extensive case studies demonstrate the user intention alignment capabilities of our framework, shedding light on effective user interaction modeling in vision-language applications. Our code is publicly available at https://github.com/ttengwang/Caption-Anything. Presents Caption Anything (CAT), a training-free controllable image captioning framework augmented by pre-trained foundation models like Segment Anything Model (SAM) and ChatGPT, supporting diverse visual and language controls. Addresses the limitations of existing controllable image captioning methods that rely on limited annotated data and support only pre-defined control signals, aiming for enhanced user interactivity and controllability. Integrates pre-trained image captioners with SAM and an instruction-tuned LLM: SAM processes visual controls (points, boxes, trajectory) into masks; the captioner generates descriptions based on the masked image; and the LLM refines the caption based on language controls (sentiment, length, language, factuality). CAT accurately identifies and describes objects based on various visual controls (points, boxes, trajectory). CAT generates captions with diverse language styles based on user-defined language controls. CAT can be extended to object-centric chatting and image paragraph captioning by incorporating additional tools like VQA and OCR. The reliance on multiple foundation models might lead to increased computational cost. Further quantitative analysis is needed to evaluate the performance of CAT compared to existing methods. controllable image captioning, foundation models, segment anything model, chatgpt, user interaction
2305.02541 Report Catch Missing Details: Image Reconstruction with Frequency Augmented Variational Autoencoder Xinmiao Lin, Yikang Li, Jenhao Hsiao, Chiuman Ho, Yu Kong The popular VQ-VAE models reconstruct images through learning a discrete codebook but suffer from a significant issue in the rapid quality degradation of image reconstruction as the compression rate rises. One major reason is that a higher compression rate induces more loss of visual signals on the higher frequency spectrum which reflect the details on pixel space. In this paper, a Frequency Complement Module (FCM) architecture is proposed to capture the missing frequency information for enhancing reconstruction quality. The FCM can be easily incorporated into the VQ-VAE structure, and we refer to the new model as Frequency Augmented VAE (FA-VAE). In addition, a Dynamic Spectrum Loss (DSL) is introduced to guide the FCMs to balance between various frequencies dynamically for optimal reconstruction. FA-VAE is further extended to the text-to-image synthesis task, and a Cross-attention Autoregressive Transformer (CAT) is proposed to obtain more precise semantic attributes in texts. Extensive reconstruction experiments with different compression rates are conducted on several benchmark datasets, and the results demonstrate that the proposed FA-VAE is able to restore more faithfully the details compared to SOTA methods. CAT also shows improved generation quality with better image-text semantic alignment. This paper proposes Frequency Augmented VAE (FA-VAE), a novel architecture that enhances image reconstruction quality in VQ-VAE models by addressing the loss of high-frequency details during compression. VQ-VAE models suffer from reduced image reconstruction quality at high compression rates due to the loss of high-frequency information. Existing methods often overlook the importance of frequency alignment for accurate reconstruction. FA-VAE incorporates Frequency Complement Modules (FCM) into the decoder to restore missing high-frequency information guided by a Dynamic Spectrum Loss (DSL). DSL leverages encoder activations to guide FCMs in learning a dynamic balance of frequencies for optimal reconstruction. FA-VAE demonstrates superior image reconstruction quality compared to state-of-the-art VQ-VAE models on FFHQ and ImageNet datasets across various compression rates. Ablation studies confirm the effectiveness of FCMs and the DSL in enhancing reconstruction by capturing and restoring high-frequency details. The proposed Cross-attention Autoregressive Transformer (CAT), an extension of FA-VAE for text-to-image generation, exhibits strong performance and generates high-quality images with accurate semantic alignment. The impact of kernel size in DSL on reconstruction quality requires further investigation. Exploring alternative FCM architectures and merging techniques could lead to further improvements. image reconstruction, vq-vae, frequency analysis, image generation, text-to-image synthesis
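For intuition, the frequency-alignment idea behind FA-VAE can be illustrated with a generic frequency-domain reconstruction loss that compares log-magnitude spectra of the reconstruction and the target. The sketch below is only a simplified stand-in; the paper's Dynamic Spectrum Loss additionally uses encoder activations to rebalance frequencies, which is not modeled here.

```python
# Generic frequency-domain reconstruction loss (simplified stand-in for FA-VAE's
# spectrum losses): compare log-magnitude FFT spectra of reconstruction and target.
import torch
import torch.nn.functional as F


def spectrum_loss(recon: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """recon, target: (B, C, H, W) images."""
    fr = torch.fft.fft2(recon, norm="ortho")
    ft = torch.fft.fft2(target, norm="ortho")
    # log1p keeps the loss from being dominated by the DC / low-frequency terms
    return F.l1_loss(torch.log1p(fr.abs()), torch.log1p(ft.abs()))


if __name__ == "__main__":
    x, y = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
    print(spectrum_loss(x, y).item())
```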
2305.02463 Report Shap-E: Generating Conditional 3D Implicit Functions Heewoo Jun, Alex Nichol We present Shap-E, a conditional generative model for 3D assets. Unlike recent work on 3D generative models which produce a single output representation, Shap-E directly generates the parameters of implicit functions that can be rendered as both textured meshes and neural radiance fields. We train Shap-E in two stages: first, we train an encoder that deterministically maps 3D assets into the parameters of an implicit function; second, we train a conditional diffusion model on outputs of the encoder. When trained on a large dataset of paired 3D and text data, our resulting models are capable of generating complex and diverse 3D assets in a matter of seconds. When compared to Point-E, an explicit generative model over point clouds, Shap-E converges faster and reaches comparable or better sample quality despite modeling a higher-dimensional, multi-representation output space. We release model weights, inference code, and samples at https://github.com/openai/shap-e. This paper presents Shap-E, a generative model that produces 3D assets as both textured meshes and neural radiance fields (NeRFs) conditioned on text prompts. This work addresses limitations of existing 3D generative models that struggle to represent complex assets efficiently. Shap-E provides a faster and more flexible approach for text-to-3D generation. The authors train a two-stage model. First, a 3D encoder learns to map 3D assets into implicit function parameters, trained using NeRF and differentiable rendering objectives. Second, a conditional diffusion model is trained on the encoded latents, learning from paired text-3D data. Shap-E generates diverse and recognizable 3D assets from text prompts in seconds. Compared to Point-E, an explicit 3D generative model, Shap-E shows faster convergence and comparable or better sample quality. The authors find shared success and failure cases between Shap-E and Point-E when conditioned on images, suggesting data and model architecture outweigh representation choice in influencing output. Shap-E struggles with multi-object composition and attribute binding, likely due to limitations in paired training data. Generated 3D assets, while recognizable, often lack fine details. Improved encoders and incorporating optimization-based methods could enhance details and quality. generative models, 3d generation, text-to-3d, neural radiance fields (nerfs), diffusion models
2305.02385 Report SimSC: A Simple Framework for Semantic Correspondence with Temperature Learning Xinghui Li, Kai Han, Xingchen Wan, Victor Adrian Prisacariu We propose SimSC, a remarkably simple framework, to address the problem of semantic matching only based on the feature backbone. We discover that when fine-tuning ImageNet pre-trained backbone on the semantic matching task, L2 normalization of the feature map, a standard procedure in feature matching, produces an overly smooth matching distribution and significantly hinders the fine-tuning process. By setting an appropriate temperature to the softmax, this over-smoothness can be alleviated and the quality of features can be substantially improved. We employ a learning module to predict the optimal temperature for fine-tuning feature backbones. This module is trained together with the backbone and the temperature is updated online. We evaluate our method on three public datasets and demonstrate that we can achieve accuracy on par with state-of-the-art methods under the same backbone without using a learned matching head. Our method is versatile and works on various types of backbones. We show that the accuracy of our framework can be easily improved by coupling it with more powerful backbones. This paper presents SimSC, a simple yet effective framework for semantic correspondence matching. It highlights the detrimental impact of L2 normalization on feature map smoothness during backbone fine-tuning and proposes using a learned temperature in the softmax to mitigate this issue. Existing semantic matching methods often rely on complex matching heads and training strategies, overlooking the importance of properly fine-tuning the feature backbone. This work emphasizes the backbone's significance and offers a simple solution to enhance its performance. The method uses a temperature learning module, implemented as a two-layer MLP, to predict the optimal temperature for the softmax operation based on the input image pair's feature maps. This module is jointly trained with the backbone, eliminating the need for manual temperature tuning. SimSC achieves state-of-the-art accuracy on PF-Pascal and SPair-71K datasets using ResNet101 as the backbone, despite having no learned matching head. The framework is versatile and effectively fine-tunes both CNN-based (ResNet) and ViT-based (DINO, iBOT) backbones. Fine-tuning the entire backbone with SimSC consistently outperforms fine-tuning only the last block, showcasing the benefit of propagating the learned temperature's effect throughout the network. The method's performance on transfer learning to PF-Willow, while decent, is not as significant as its results on SPair-71K, suggesting potential limitations in handling different data distributions. The paper primarily focuses on single-scale matching. Exploring multi-scale strategies within the SimSC framework could further improve its performance, especially for challenging cases with significant scale variations between images. semantic correspondence, temperature learning, feature backbone fine-tuning, l2 normalization, deep learning
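Since the summary describes a small MLP that predicts a softmax temperature from the pair of feature maps, a minimal sketch of that matching pipeline is given below: L2-normalize both feature maps, compute cosine correlations, and apply a softmax scaled by the predicted temperature. The pooling, MLP width, and softplus parameterization are our illustrative choices, not SimSC's exact module.

```python
# Minimal sketch of temperature-scaled softmax matching between two feature maps.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemperatureHead(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, fa, fb):
        # pool both feature maps and predict a strictly positive temperature
        stats = torch.cat([fa.mean(dim=(2, 3)), fb.mean(dim=(2, 3))], dim=1)
        return F.softplus(self.mlp(stats)) + 1e-4          # (B, 1)


def match(fa, fb, temp):
    """fa, fb: (B, C, H, W); returns softmax matching scores (B, HW, HW)."""
    fa = F.normalize(fa.flatten(2), dim=1)                 # (B, C, HW)
    fb = F.normalize(fb.flatten(2), dim=1)
    corr = torch.einsum("bcm,bcn->bmn", fa, fb)            # cosine correlations
    return torch.softmax(corr / temp.view(-1, 1, 1), dim=-1)


if __name__ == "__main__":
    fa, fb = torch.randn(1, 128, 16, 16), torch.randn(1, 128, 16, 16)
    head = TemperatureHead(128)
    print(match(fa, fb, head(fa, fb)).shape)  # torch.Size([1, 256, 256])
```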
2305.02312 Report AG3D: Learning to Generate 3D Avatars from 2D Image Collections Zijian Dong, Xu Chen, Jinlong Yang, Michael J. Black, Otmar Hilliges, Andreas Geiger While progress in 2D generative models of human appearance has been rapid, many applications require 3D avatars that can be animated and rendered. Unfortunately, most existing methods for learning generative models of 3D humans with diverse shape and appearance require 3D training data, which is limited and expensive to acquire. The key to progress is hence to learn generative models of 3D avatars from abundant unstructured 2D image collections. However, learning realistic and complete 3D appearance and geometry in this under-constrained setting remains challenging, especially in the presence of loose clothing such as dresses. In this paper, we propose a new adversarial generative model of realistic 3D people from 2D images. Our method captures shape and deformation of the body and loose clothing by adopting a holistic 3D generator and integrating an efficient and flexible articulation module. To improve realism, we train our model using multiple discriminators while also integrating geometric cues in the form of predicted 2D normal maps. We experimentally find that our method outperforms previous 3D- and articulation-aware methods in terms of geometry and appearance. We validate the effectiveness of our model and the importance of each component via systematic ablation studies. This paper proposes AG3D, a novel adversarial generative model that learns to generate realistic and animatable 3D human avatars from unstructured 2D image collections, effectively capturing the shape and deformation of the body and loose clothing. Generating diverse and high-quality 3D avatars typically requires expensive and limited 3D training data. This work leverages widely available 2D images to learn a generative model, overcoming the limitations of 3D data acquisition. The method utilizes a holistic 3D generator with an efficient articulation module (Fast-SNARF) for pose control and loose clothing deformation. It employs multiple discriminators specializing in full images, faces, and normal maps, enhancing visual and geometric fidelity. AG3D outperforms state-of-the-art methods in terms of image quality, particularly in side views, as evidenced by FID scores and user preference studies. The model effectively captures subtle geometric details, producing realistic 3D shapes, unlike previous methods that suffer from noise and artifacts. Unlike part-based models, AG3D effectively handles loose clothing like dresses and skirts, avoiding discontinuity artifacts. The model may generate incorrect clothing patterns in occluded areas due to the ambiguity of pixel-to-body-part association in single-view training data. The training datasets used, primarily focused on fashion, lack diversity in body shapes, skin tones, and age, potentially leading to biases in generated avatars. 3d human generation, generative adversarial networks, articulated deformation, loose clothing modeling, normal map discriminator
2305.02310 Report Real-Time Radiance Fields for Single-Image Portrait View Synthesis Alex Trevithick, Matthew Chan, Michael Stengel, Eric R. Chan, Chao Liu, Zhiding Yu, Sameh Khamis, Manmohan Chandraker, Ravi Ramamoorthi, Koki Nagano We present a one-shot method to infer and render a photorealistic 3D representation from a single unposed image (e.g., face portrait) in real-time. Given a single RGB input, our image encoder directly predicts a canonical triplane representation of a neural radiance field for 3D-aware novel view synthesis via volume rendering. Our method is fast (24 fps) on consumer hardware, and produces higher quality results than strong GAN-inversion baselines that require test-time optimization. To train our triplane encoder pipeline, we use only synthetic data, showing how to distill the knowledge from a pretrained 3D GAN into a feedforward encoder. Technical contributions include a Vision Transformer-based triplane encoder, a camera data augmentation strategy, and a well-designed loss function for synthetic data training. We benchmark against the state-of-the-art methods, demonstrating significant improvements in robustness and image quality in challenging real-world settings. We showcase our results on portraits of faces (FFHQ) and cats (AFHQ), but our algorithm can also be applied in the future to other categories with a 3D-aware image generator. Presents a one-shot method to infer and render a photorealistic 3D representation from a single unposed image in real-time using a triplane representation of a neural radiance field. Enables real-time 3D-aware novel view synthesis from a single image, significantly faster than optimization-based methods, opening possibilities for applications like AR/VR and 3D telepresence. Trains a Vision Transformer-based encoder to predict canonical triplane features from a single image, supervised using synthetic data generated from a pre-trained 3D GAN (EG3D) with on-the-fly camera augmentation. Achieves real-time performance (24 fps) on consumer hardware. Outperforms state-of-the-art GAN-inversion baselines in terms of robustness and image quality on challenging real-world portraits. Demonstrates generalization ability by lifting stylized images (drawings and paintings) to 3D. May struggle with strong profile views due to limitations in the training data distribution. Temporal inconsistencies may arise when applied to videos frame-by-frame due to independent frame processing. novel view synthesis, 3d reconstruction, generative adversarial networks, neural radiance fields, synthetic data
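The predicted canonical triplane is queried per 3D point during volume rendering; the sketch below shows one common way such a lookup works — project each point onto three axis-aligned planes, sample the corresponding feature planes bilinearly, and sum the results. The plane ordering and sum aggregation are standard triplane conventions, not necessarily this paper's exact choices.

```python
# Minimal triplane feature lookup via bilinear sampling of three feature planes.
import torch
import torch.nn.functional as F


def sample_triplane(planes: torch.Tensor, pts: torch.Tensor) -> torch.Tensor:
    """planes: (3, C, H, W); pts: (N, 3) with coordinates in [-1, 1] -> (N, C)."""
    coords = torch.stack([pts[:, [0, 1]], pts[:, [0, 2]], pts[:, [1, 2]]])   # (3, N, 2)
    grid = coords.unsqueeze(2)                                               # (3, N, 1, 2)
    feats = F.grid_sample(planes, grid, mode="bilinear", align_corners=False)
    return feats.squeeze(-1).sum(dim=0).t()                                  # (N, C)


if __name__ == "__main__":
    print(sample_triplane(torch.randn(3, 32, 64, 64), torch.rand(1000, 3) * 2 - 1).shape)
```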
2305.02187 Report CLUSTSEG: Clustering for Universal Segmentation James Liang, Tianfei Zhou, Dongfang Liu, Wenguan Wang We present CLUSTSEG, a general, transformer-based framework that tackles different image segmentation tasks (i.e., superpixel, semantic, instance, and panoptic) through a unified neural clustering scheme. Regarding queries as cluster centers, CLUSTSEG is innovative in two aspects: 1) cluster centers are initialized in heterogeneous ways so as to pointedly address task-specific demands (e.g., instance- or category-level distinctiveness), yet without modifying the architecture; and 2) pixel-cluster assignment, formalized in a cross-attention fashion, is alternated with cluster center update, yet without learning additional parameters. These innovations closely link CLUSTSEG to EM clustering and make it a transparent and powerful framework that yields superior results across the above segmentation tasks. CLUSTSEG, a universal, transformer-based segmentation framework that tackles superpixel, semantic, instance, and panoptic segmentation through a unified, neural clustering scheme. To shift the image segmentation field from task-specialized architectures towards a universal framework and address the limitations of existing universal segmenters. 1. Dreamy-Start: Task-specific initialization of cluster centers (queries) respecting the nature of each segmentation task. 2. Recurrent Cross-Attention: A non-parametric, recursive module for effective and efficient neural clustering by alternating between pixel-cluster assignment and cluster center update. CLUSTSEG sets new records across all metrics on COCO Panoptic val (59.0 PQ). It establishes a new state-of-the-art on COCO instance segmentation (49.1 AP). It ranks top in ADE20K semantic segmentation benchmarking (57.4 mIoU). Extra clustering loops in training may reduce computational efficiency. Future work includes developing more robust clustering algorithms to handle complex scenarios. image segmentation, universal framework, clustering, transformers, deep learning
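The "alternate pixel-cluster assignment with cluster center update" loop that the paper links to EM clustering can be written very compactly. Below is a toy version of that recurrence; the fixed iteration count, temperature, and tensor shapes are illustrative choices rather than the paper's configuration.

```python
# Toy EM-style loop: soft-assign pixels to centers, then recompute centers as
# assignment-weighted means, repeated for a few non-parametric iterations.
import torch


def recurrent_clustering(pixels: torch.Tensor, centers: torch.Tensor,
                         iters: int = 3, tau: float = 1.0):
    """pixels: (N, C) pixel embeddings; centers: (K, C) initial queries."""
    for _ in range(iters):
        # E-step: pixel-to-cluster assignment via scaled dot-product (cross-attention style)
        logits = pixels @ centers.t() / tau                # (N, K)
        assign = torch.softmax(logits, dim=1)
        # M-step: centers become the assignment-weighted mean of pixel embeddings
        centers = (assign.t() @ pixels) / (assign.sum(0).unsqueeze(1) + 1e-6)
    return assign, centers                                  # masks follow from argmax over K


if __name__ == "__main__":
    a, c = recurrent_clustering(torch.randn(4096, 256), torch.randn(20, 256))
    print(a.shape, c.shape)  # torch.Size([4096, 20]) torch.Size([20, 256])
```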
2305.01644 Report Key-Locked Rank One Editing for Text-to-Image Personalization Yoad Tewel, Rinon Gal, Gal Chechik, Yuval Atzmon Text-to-image models (T2I) offer a new level of flexibility by allowing users to guide the creative process through natural language. However, personalizing these models to align with user-provided visual concepts remains a challenging problem. The task of T2I personalization poses multiple hard challenges, such as maintaining high visual fidelity while allowing creative control, combining multiple personalized concepts in a single image, and keeping a small model size. We present Perfusion, a T2I personalization method that addresses these challenges using dynamic rank-1 updates to the underlying T2I model. Perfusion avoids overfitting by introducing a new mechanism that "locks" new concepts' cross-attention Keys to their superordinate category. Additionally, we develop a gated rank-1 approach that enables us to control the influence of a learned concept during inference time and to combine multiple concepts. This allows runtime-efficient balancing of visual-fidelity and textual-alignment with a single 100KB trained model, which is five orders of magnitude smaller than the current state of the art. Moreover, it can span different operating points across the Pareto front without additional training. Finally, we show that Perfusion outperforms strong baselines in both qualitative and quantitative terms. Importantly, key-locking leads to novel results compared to traditional approaches, allowing to portray personalized object interactions in unprecedented ways, even in one-shot settings. This paper introduces Key-Locked Rank One Editing (Perfusion), a novel method for personalizing text-to-image (T2I) diffusion models that achieves high visual fidelity and improved textual alignment with a small model size. Existing T2I personalization methods often overfit to training images, limiting their ability to generate diverse and creative compositions. They also struggle to combine multiple learned concepts in a single image. Perfusion leverages a gated rank-one editing approach applied to the cross-attention layers of diffusion models. It introduces a 'Key-Locking' mechanism that restricts a concept's attention to its super-category, preventing overfitting and promoting generalization. It also employs a gated rank-one update to control the influence of learned concepts during inference, enabling multi-concept compositions. Perfusion outperforms state-of-the-art methods in qualitative and quantitative comparisons, showing improved text-alignment and visual fidelity. The method allows for runtime control over the trade-off between visual fidelity and text alignment by adjusting sigmoid parameters. Key-Locking enables the generation of novel compositions and interactions between individually learned concepts. The choice of super-category for Key-Locking can sometimes lead to 'over-generalization' effects, impacting visual fidelity. Combining multiple concepts effectively often requires significant prompt engineering. text-to-image synthesis, personalization, diffusion models, rank-one editing, key-locking
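To make the "gated rank-1 update" concrete, here is an illustrative sketch of editing a frozen projection weight with an outer-product update whose strength is gated by how strongly the current text encoding aligns with the learned concept key. The gating function, bias, temperature, and variable names are assumptions for illustration, not the released Perfusion implementation.

```python
# Illustrative gated rank-one edit of a frozen weight matrix: W' = W + g * v k^T,
# where the gate g depends on the encoding's similarity to the concept key.
import torch


def rank_one_edit(W: torch.Tensor, key: torch.Tensor, value: torch.Tensor,
                  enc: torch.Tensor, bias: float = 4.0, temp: float = 0.1):
    """W: (out, in) frozen weight; key/enc: (in,); value: (out,)."""
    key = key / key.norm()
    sim = enc @ key                                    # alignment with the concept key
    gate = torch.sigmoid((sim - bias) / temp)          # soft on/off switch for the concept
    return W + gate * torch.outer(value, key)


if __name__ == "__main__":
    W = torch.randn(320, 768)
    out = rank_one_edit(W, torch.randn(768), torch.randn(320), torch.randn(768))
    print(out.shape)  # torch.Size([320, 768])
```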
2305.01569 Report Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, Omer Levy The ability to collect a large dataset of human preferences from text-to-image users is usually limited to companies, making such datasets inaccessible to the public. To address this issue, we create a web app that enables text-to-image users to generate images and specify their preferences. Using this web app we build Pick-a-Pic, a large, open dataset of text-to-image prompts and real users' preferences over generated images. We leverage this dataset to train a CLIP-based scoring function, PickScore, which exhibits superhuman performance on the task of predicting human preferences. Then, we test PickScore's ability to perform model evaluation and observe that it correlates better with human rankings than other automatic evaluation metrics. Therefore, we recommend using PickScore for evaluating future text-to-image generation models, and using Pick-a-Pic prompts as a more relevant dataset than MS-COCO. Finally, we demonstrate how PickScore can enhance existing text-to-image models via ranking. This work introduces Pick-a-Pic, a large, open dataset of user preferences over text-to-image generations, and PickScore, a CLIP-based scoring function trained on this dataset for predicting human preferences. Existing text-to-image generation models lack large, open datasets of human preferences, hindering the development of models that align with user expectations. The authors created a web app to collect user preferences on generated images, resulting in Pick-a-Pic. They then trained PickScore, a CLIP-based model, on this dataset using a reward model objective similar to InstructGPT. PickScore achieves superhuman performance (70.5% accuracy) in predicting human preferences, outperforming baselines like zero-shot CLIP-H (60.8%) and human experts (68.0%). PickScore shows a stronger correlation (0.917) with human preferences than FID (-0.900) for evaluating text-to-image models, even when tested on MS-COCO captions. PickScore effectively improves the quality of text-to-image generations via ranking, with human raters preferring its selections over those made by other scoring functions and baselines. Despite efforts to ensure data quality, Pick-a-Pic may contain NSFW content and inherent biases from user preferences. Future work includes exploring the use of PickScore and Pick-a-Pic for RLHF and other alignment techniques to further improve text-to-image models. text-to-image generation, human preferences, dataset, evaluation metric, clip
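Since PickScore is described as a CLIP-based scorer trained with an InstructGPT-style reward objective, the sketch below shows the general shape of such a setup: a scaled cosine-similarity score and a pairwise loss that matches the predicted preference distribution over two candidate images to the user's (possibly tied) label. The scale constant and loss form are generic assumptions, not the released PickScore code or weights.

```python
# Sketch of a CLIP-style preference score plus a pairwise reward-model objective.
import torch
import torch.nn.functional as F


def pick_score(img_emb: torch.Tensor, txt_emb: torch.Tensor, scale: float = 100.0):
    """img_emb: (B, D), txt_emb: (B, D) -> (B,) scores."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    return scale * (img_emb * txt_emb).sum(-1)


def preference_loss(score_a, score_b, label):
    """label: (B, 2) soft preference, e.g. [1, 0], [0, 1], or [0.5, 0.5] for ties."""
    logp = F.log_softmax(torch.stack([score_a, score_b], dim=1), dim=1)  # (B, 2)
    return -(label * logp).sum(dim=1).mean()


if __name__ == "__main__":
    ia, ib, t = torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512)
    label = torch.tensor([[1., 0.], [0., 1.], [0.5, 0.5], [1., 0.]])
    print(preference_loss(pick_score(ia, t), pick_score(ib, t), label).item())
```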
2305.01275 Report Segment Anything is A Good Pseudo-label Generator for Weakly Supervised Semantic Segmentation Peng-Tao Jiang, Yuqi Yang Weakly supervised semantic segmentation with weak labels is a long-lived ill-posed problem. Mainstream methods mainly focus on improving the quality of pseudo labels. In this report, we attempt to explore the potential of 'prompt to masks' from the powerful class-agnostic large segmentation model, segment-anything. Specifically, different weak labels are used as prompts to the segment-anything model, generating precise class masks. The class masks are utilized to generate pseudo labels to train the segmentation networks. We have conducted extensive experiments on PASCAL VOC 2012 dataset. Experiments demonstrate that segment-anything can serve as a good pseudo-label generator. The code will be made publicly available. This paper proposes using the Segment-Anything Model (SAM) to generate pseudo labels for weakly supervised semantic segmentation. Constructing large-scale finely-annotated datasets for semantic segmentation is time-consuming and expensive. This paper explores the potential of using a powerful, pre-trained model (SAM) to improve weakly supervised methods, which rely on cheaper annotations. The paper investigates using different weak annotations (image-level labels, points, scribbles, bounding boxes) as prompts for SAM to generate object masks. These masks are then used as pseudo labels to train segmentation networks. SAM with scribble prompts achieves 89.7% mIoU on PASCAL VOC 2012 train set for pseudo label generation. Using these pseudo labels, DeepLab-v2 achieves 76.6% mIoU on the test set. SAM outperforms other weakly supervised methods across different annotation types. The study is limited to the PASCAL VOC 2012 dataset. The text prompt functionality of SAM, which is currently unavailable, could be explored in the future. weakly supervised semantic segmentation, segment-anything model, pseudo labels, deep learning, computer vision
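The core recipe — feed a weak annotation (here a box) to SAM as a prompt and keep the best-scoring mask as a class pseudo label — can be sketched with the standard segment-anything Python package. The checkpoint path, single-class handling, and mask-selection rule below are placeholders and assumptions, not the report's full pipeline.

```python
# Minimal sketch: turn a weak box annotation into a pseudo-label map using the
# segment-anything package (SamPredictor.set_image / predict).
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Placeholder checkpoint path; download the official SAM ViT-H weights separately.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)


def pseudo_label_from_box(image: np.ndarray, box_xyxy: np.ndarray, class_id: int) -> np.ndarray:
    """image: HxWx3 uint8 RGB; box_xyxy: (4,) weak box label -> HxW int pseudo-label map."""
    predictor.set_image(image)
    masks, scores, _ = predictor.predict(box=box_xyxy, multimask_output=True)
    best = masks[int(np.argmax(scores))]           # keep the highest-scoring mask proposal
    label = np.zeros(image.shape[:2], dtype=np.int64)
    label[best] = class_id                         # background stays 0
    return label
```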
2305.01257 Report DreamPaint: Few-Shot Inpainting of E-Commerce Items for Virtual Try-On without 3D Modeling Mehmet Saygin Seyfioglu, Karim Bouyarmane, Suren Kumar, Amir Tavanaei, Ismail B. Tutar We introduce DreamPaint, a framework to intelligently inpaint any e-commerce product on any user-provided context image. The context image can be, for example, the user's own image for virtual try-on of clothes from the e-commerce catalog on themselves, the user's room image for virtual try-on of a piece of furniture from the e-commerce catalog in their room, etc. As opposed to previous augmented-reality (AR)-based virtual try-on methods, DreamPaint does not use, nor does it require, 3D modeling of neither the e-commerce product nor the user context. Instead, it directly uses 2D images of the product as available in product catalog database, and a 2D picture of the context, for example taken from the user's phone camera. The method relies on few-shot fine tuning a pre-trained diffusion model with the masked latents (e.g., Masked DreamBooth) of the catalog images per item, whose weights are then loaded on a pre-trained inpainting module that is capable of preserving the characteristics of the context image. DreamPaint allows to preserve both the product image and the context (environment/user) image without requiring text guidance to describe the missing part (product/context). DreamPaint also allows to intelligently infer the best 3D angle of the product to place at the desired location on the user context, even if that angle was previously unseen in the product's reference 2D images. We compare our results against both text-guided and image-guided inpainting modules and show that DreamPaint yields superior performance in both subjective human study and quantitative metrics. DreamPaint, a framework for intelligently inpainting e-commerce products onto user-provided context images (e.g., virtual try-on) without 3D modeling, using a combination of Masked DreamBooth and Stable Diffusion Inpainting. Addresses limitations of current AR-based virtual try-on methods by using readily available 2D product images and user context images, improving the e-commerce customer experience. Fine-tunes a pre-trained diffusion model with masked product images, enabling inpainting that preserves both product and context image characteristics. Leverages Masked DreamBooth and Stable Diffusion Inpainting modules. Outperforms text-guided and image-guided inpainting methods in preserving product fidelity. Demonstrates superior performance in both subjective human evaluation and quantitative metrics (CLIP score). Allows for flexible user control with the option of additional text prompts for refinement. Scalability challenges arise from the need for few-shot fine-tuning per e-commerce item. Context-appearance entanglement can lead to alterations in product appearance (e.g., color) based on the context image. virtual try-on, e-commerce, image inpainting, diffusion models, dreambooth
2305.01239 Report DRPT: Disentangled and Recurrent Prompt Tuning for Compositional Zero-Shot Learning Xiaocheng Lu, Ziming Liu, Song Guo, Jingcai Guo, Fushuo Huo, Sikai Bai, Tao Han Compositional Zero-shot Learning (CZSL) aims to recognize novel concepts composed of known knowledge without training samples. Standard CZSL either identifies visual primitives or enhances unseen composed entities, and as a result, entanglement between state and object primitives cannot be fully utilized. Admittedly, vision-language models (VLMs) could naturally cope with CZSL through tuning prompts, while uneven entanglement leads prompts to be dragged into local optimum. In this paper, we take a further step to introduce a novel Disentangled and Recurrent Prompt Tuning framework termed DRPT to better tap the potential of VLMs in CZSL. Specifically, the state and object primitives are deemed as learnable tokens of vocabulary embedded in prompts and tuned on seen compositions. Instead of jointly tuning state and object, we devise a disentangled and recurrent tuning strategy to suppress the traction force caused by entanglement and gradually optimize the token parameters, leading to a better prompt space. Notably, we develop a progressive fine-tuning procedure that allows for incremental updates to the prompts, optimizing the object first, then the state, and vice versa. Meanwhile, the optimization of state and object is independent, thus clearer features can be learned to further alleviate the issue of entangling misleading optimization. Moreover, we quantify and analyze the entanglement in CZSL and supplement entanglement rebalancing optimization schemes. DRPT surpasses representative state-of-the-art methods on extensive benchmark datasets, demonstrating superiority in both accuracy and efficiency. This paper proposes DRPT, a novel Disentangled and Recurrent Prompt Tuning framework for Compositional Zero-Shot Learning (CZSL) that leverages the power of Vision-Language Models (VLMs). Existing CZSL methods struggle to effectively utilize the entanglement between state and object primitives, often leading VLMs to converge to local optima due to uneven entanglement distribution. DRPT treats state and object primitives as learnable tokens within prompts. It implements a disentangled and recurrent tuning strategy to decouple parameter updates, progressively optimizing object and state tokens independently before joint optimization. DRPT surpasses state-of-the-art methods on three benchmark datasets (UT-Zappos, AO-Clevr, C-GQA) demonstrating superior accuracy and efficiency. The study quantifies entanglement in CZSL and demonstrates DRPT's effectiveness in mitigating entanglement issues. Ablation studies confirm the positive impact of disentangled recurrent tuning and entanglement re-balancing techniques. The paper acknowledges the potential for exploring dynamic status transition sequences with varying K values and automatic status switching in future work. Further investigation into other re-balancing schemes for entanglement is also suggested. zero-shot learning, compositional zero-shot learning, prompt learning, vision-language models, entanglement
2305.00942 Report StyleAvatar: Real-time Photo-realistic Portrait Avatar from a Single Video Lizhen Wang, Xiaochen Zhao, Jingxiang Sun, Yuxiang Zhang, Hongwen Zhang, Tao Yu, Yebin Liu Face reenactment methods attempt to restore and re-animate portrait videos as realistically as possible. Existing methods face a dilemma in quality versus controllability: 2D GAN-based methods achieve higher image quality but suffer in fine-grained control of facial attributes compared with 3D counterparts. In this work, we propose StyleAvatar, a real-time photo-realistic portrait avatar reconstruction method using StyleGAN-based networks, which can generate high-fidelity portrait avatars with faithful expression control. We expand the capabilities of StyleGAN by introducing a compositional representation and a sliding window augmentation method, which enable faster convergence and improve translation generalization. Specifically, we divide the portrait scenes into three parts for adaptive adjustments: facial region, non-facial foreground region, and the background. Besides, our network leverages the best of UNet, StyleGAN and time coding for video learning, which enables high-quality video generation. Furthermore, a sliding window augmentation method together with a pre-training strategy are proposed to improve translation generalization and training performance, respectively. The proposed network can converge within two hours while ensuring high image quality and a forward rendering time of only 20 milliseconds. Furthermore, we propose a real-time live system, which further pushes research into applications. Results and experiments demonstrate the superiority of our method in terms of image quality, full portrait video generation, and real-time re-animation compared to existing facial reenactment methods. Training and inference code for this paper are at https://github.com/LizhenWangT/StyleAvatar. This paper presents StyleAvatar, a real-time photo-realistic portrait avatar reconstruction method using a StyleGAN-based network trained on a single video. The method addresses limitations of existing 2D and 3D portrait avatar approaches, aiming for high-fidelity, fast training, fine-grained control, and real-time efficiency. StyleAvatar utilizes a compositional representation, dividing the scene into facial, non-facial foreground, and background regions for adaptive adjustments. It leverages StyleGAN generators, a StyleUNet, Neural Textures, and a sliding window augmentation method for high-quality and efficient portrait avatar generation. StyleAvatar outperforms state-of-the-art one-shot and video-based facial reenactment methods in terms of image quality and control of facial attributes. The proposed method achieves significantly faster training and rendering times compared to existing methods. The system allows for real-time re-animation of the learned facial avatar with other subjects. The method is limited by the training dataset and struggles with poses and expressions significantly different from those seen during training. Inaccuracies in 3DMM tracking can lead to imprecise expression control and unrealistic mouth interiors during reenactment. facial reenactment, stylegan, video portraits, deep learning, rendering-to-video translation
2305.00936 Report Generating Texture for 3D Human Avatar from a Single Image using Sampling and Refinement Networks Sihun Cha, Kwanggyoon Seo, Amirsaman Ashtari, Junyong Noh There has been significant progress in generating an animatable 3D human avatar from a single image. However, recovering texture for the 3D human avatar from a single image has been relatively less addressed. Because the generated 3D human avatar reveals the occluded texture of the given image as it moves, it is critical to synthesize the occluded texture pattern that is unseen from the source image. To generate a plausible texture map for 3D human avatars, the occluded texture pattern needs to be synthesized with respect to the visible texture from the given image. Moreover, the generated texture should align with the surface of the target 3D mesh. In this paper, we propose a texture synthesis method for a 3D human avatar that incorporates geometry information. The proposed method consists of two convolutional networks for the sampling and refining process. The sampler network fills in the occluded regions of the source image and aligns the texture with the surface of the target 3D mesh using the geometry information. The sampled texture is further refined and adjusted by the refiner network. To maintain the clear details in the given image, both sampled and refined texture is blended to produce the final texture map. To effectively guide the sampler network to achieve its goal, we designed a curriculum learning scheme that starts from a simple sampling task and gradually progresses to the task where the alignment needs to be considered. We conducted experiments to show that our method outperforms previous methods qualitatively and quantitatively. This paper presents a novel method for generating high-quality texture maps for 3D human avatars from single images, addressing the challenge of synthesizing occluded texture details and ensuring proper alignment with the 3D mesh. Recovering texture for 3D human avatars from a single image is crucial for various applications like VR/AR, but it's challenging due to limited visible texture information and the need for alignment with the 3D mesh. The method utilizes two convolutional networks: a Sampler Network (SamplerNet) to complete the texture map by sampling from visible regions guided by geometry information and a Refiner Network (RefinerNet) to enhance details and refine the sampled texture. A curriculum learning scheme is employed to train SamplerNet effectively. The proposed method outperforms previous state-of-the-art techniques in both visual quality and quantitative metrics. It effectively synthesizes occluded texture details while preserving the appearance of visible regions in the source image. The method demonstrates robustness to different viewpoints and successfully generates plausible textures from non-frontal images. The method's performance is limited by the training dataset, particularly in handling diverse clothing styles and human identities. Current implementation relies on a supervised setting, requiring ground truth data for training. texture synthesis, 3d human avatar, single image, curriculum learning, deep learning
2305.00866 Report Attack-SAM: Towards Attacking Segment Anything Model With Adversarial Examples Chenshuang Zhang, Chaoning Zhang, Taegoo Kang, Donghun Kim, Sung-Ho Bae, In So Kweon Segment Anything Model (SAM) has attracted significant attention recently, due to its impressive performance on various downstream tasks in a zero-shot manner. The computer vision (CV) area might follow the natural language processing (NLP) area to embark on a path from task-specific vision models toward foundation models. However, deep vision models are widely recognized as vulnerable to adversarial examples, which fool the model into making wrong predictions with imperceptible perturbations. Such vulnerability to adversarial attacks causes serious concerns when applying deep models to security-sensitive applications. Therefore, it is critical to know whether the vision foundation model SAM can also be fooled by adversarial attacks. To the best of our knowledge, our work is the first of its kind to conduct a comprehensive investigation on how to attack SAM with adversarial examples. With the basic attack goal set to mask removal, we investigate the adversarial robustness of SAM in the full white-box setting and transfer-based black-box settings. Beyond the basic goal of mask removal, we further investigate and find that it is possible to generate any desired mask by the adversarial attack. This paper presents the first comprehensive study on the vulnerability of the Segment Anything Model (SAM) to adversarial attacks. SAM, as a foundation model for image segmentation, has significant implications for various applications. Understanding its robustness against adversarial attacks is crucial, especially for security-sensitive applications. The authors propose a framework called Attack-SAM, which focuses on mask removal as the primary attack goal. They employ FGSM and PGD attacks with a tailored loss function (ClipMSE) to generate adversarial examples. They further investigate cross-prompt and cross-task transferability of the attacks. SAM is highly vulnerable to adversarial attacks in the white-box setting, exhibiting successful mask removal. The attack demonstrates cross-prompt transferability, meaning the adversary doesn't need prior knowledge of the prompt to launch a successful attack. Adversarial examples generated for semantic label prediction tasks can partially attack SAM, indicating cross-task transferability. The study primarily focuses on the mask removal attack, leaving room for exploring other attack goals and their impact on SAM. The attack performance in challenging scenarios like cross-task attacks is partial, suggesting further research to enhance attack effectiveness. adversarial attacks, segment anything model (sam), image segmentation, model robustness, computer vision
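A generic PGD loop for the mask-removal objective looks like the sketch below: push predicted mask logits below the binarization threshold while projecting the perturbation back into an L-infinity ball. The `mask_logits_fn` placeholder stands in for a differentiable image-to-mask-logits model, and the clamp-then-square loss is our simplification of the paper's ClipMSE idea, not its exact formulation.

```python
# Generic PGD attack toward "mask removal": shrink positive mask logits under an
# L-infinity budget. `mask_logits_fn` is a placeholder differentiable model.
import torch


def pgd_mask_removal(image, mask_logits_fn, eps=8 / 255, alpha=2 / 255, steps=10):
    adv = image.clone()
    for _ in range(steps):
        adv.requires_grad_(True)
        logits = mask_logits_fn(adv)
        # penalise only logits still above the binarisation threshold (0 for SAM-style masks)
        loss = torch.clamp(logits, min=0.0).pow(2).mean()
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv - alpha * grad.sign()                  # descend: shrink positive logits
            adv = image + (adv - image).clamp(-eps, eps)     # project back into the eps-ball
            adv = adv.clamp(0.0, 1.0)
    return adv.detach()


if __name__ == "__main__":
    net = torch.nn.Conv2d(3, 1, 3, padding=1)                # stand-in segmentation model
    x = torch.rand(1, 3, 64, 64)
    print(pgd_mask_removal(x, net).shape)
```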
2305.00599 Report StyleGenes: Discrete and Efficient Latent Distributions for GANs Evangelos Ntavelis, Mohamad Shahbazi, Iason Kastanis, Radu Timofte, Martin Danelljan, Luc Van Gool We propose a discrete latent distribution for Generative Adversarial Networks (GANs). Instead of drawing latent vectors from a continuous prior, we sample from a finite set of learnable latents. However, a direct parametrization of such a distribution leads to an intractable linear increase in memory in order to ensure sufficient sample diversity. We address this key issue by taking inspiration from the encoding of information in biological organisms. Instead of learning a separate latent vector for each sample, we split the latent space into a set of genes. For each gene, we train a small bank of gene variants. Thus, by independently sampling a variant for each gene and combining them into the final latent vector, our approach can represent a vast number of unique latent samples from a compact set of learnable parameters. Interestingly, our gene-inspired latent encoding allows for new and intuitive approaches to latent-space exploration, enabling conditional sampling from our unconditionally trained model. Moreover, our approach preserves state-of-the-art photo-realism while achieving better disentanglement than the widely-used StyleMapping network. This paper introduces StyleGenes, a novel approach to GAN latent spaces that utilizes a discrete distribution inspired by biological DNA encoding for more interpretable and efficient image generation. This approach addresses the limitations of continuous latent spaces in GANs, particularly in terms of disentanglement and interpretability, paving the way for more controllable and diverse image synthesis. The method divides the latent code into smaller, independent units called "genes," each with a set of learnable "variants." By combining these variants, the model can generate a vast number of distinct images from a compact set of parameters. The relationship between genes and image attributes is analyzed to enable conditional sampling from the trained model. StyleGenes achieves comparable image quality to GANs with continuous latent spaces, evidenced by similar FID scores across multiple datasets. The discrete nature of StyleGenes allows for a more straightforward analysis of the relationship between latent codes and image attributes, leading to improved disentanglement compared to StyleGAN's W space. This approach enables conditional image generation from an unconditionally trained model by leveraging the learned associations between genes and attributes, without needing additional training or modules. The reliance on pre-trained classifiers to analyze attribute relationships introduces limitations due to potential biases and dataset discrepancies. Further exploration of techniques to improve attribute-based control and incorporate real images into the codebook is warranted. generative adversarial networks (gans), discrete latent space, image generation, conditional image synthesis, disentanglement
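The gene/variant parameterization is easy to picture in code: each gene owns a small bank of learnable variants, and a latent code is assembled by independently picking one variant per gene, so a compact parameter set spans a combinatorially large number of codes. The bank sizes below are arbitrary illustrative values, not the paper's configuration.

```python
# Toy gene/variant latent space: sample one variant per gene and concatenate.
import torch
import torch.nn as nn


class GeneLatent(nn.Module):
    def __init__(self, num_genes: int = 32, variants_per_gene: int = 64, gene_dim: int = 16):
        super().__init__()
        # (genes, variants, dim) bank -> variants_per_gene ** num_genes distinct codes
        self.bank = nn.Parameter(torch.randn(num_genes, variants_per_gene, gene_dim))

    def sample(self, batch: int) -> torch.Tensor:
        g, v, d = self.bank.shape
        idx = torch.randint(v, (batch, g))                   # one variant index per gene
        genes = self.bank[torch.arange(g), idx]              # (batch, genes, dim)
        return genes.reshape(batch, g * d)                   # final latent vector


if __name__ == "__main__":
    z = GeneLatent().sample(4)
    print(z.shape)  # torch.Size([4, 512])
```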
2305.00521 Report StyleLipSync: Style-based Personalized Lip-sync Video Generation Taekyung Ki, Dongchan Min In this paper, we present StyleLipSync, a style-based personalized lip-sync video generative model that can generate identity-agnostic lip-synchronizing video from arbitrary audio. To generate a video of arbitrary identities, we leverage expressive lip prior from the semantically rich latent space of a pre-trained StyleGAN, where we can also design a video consistency with a linear transformation. In contrast to the previous lip-sync methods, we introduce pose-aware masking that dynamically locates the mask to improve the naturalness over frames by utilizing a 3D parametric mesh predictor frame by frame. Moreover, we propose a few-shot lip-sync adaptation method for an arbitrary person by introducing a sync regularizer that preserves lip-sync generalization while enhancing the person-specific visual information. Extensive experiments demonstrate that our model can generate accurate lip-sync videos even with the zero-shot setting and enhance characteristics of an unseen face using a few seconds of target video through the proposed adaptation method. Presents StyleLipSync, a style-based lip-sync video generative model that generates identity-agnostic lip-syncing videos from arbitrary audio using a pre-trained StyleGAN and pose-aware masking. Addresses limitations of previous lip-sync methods that struggle with inaccurate lip-syncing, blurry results, and lack of temporal consistency, aiming to generate high-fidelity, temporally consistent videos of arbitrary identities. Leverages a pre-trained StyleGAN for lip prior and linear manipulation of style codes for lip-syncing, introduces pose-aware masking using a 3D face mesh predictor for improved naturalness, and employs a Moving-average based Latent Smoothing module for temporal consistency. Additionally proposes a few-shot adaptation method for unseen faces using a sync regularizer. Outperforms state-of-the-art methods in lip-sync and visual quality on Voxceleb2 reconstruction. Achieves state-of-the-art lip-sync accuracy and comparable face similarity in cross-id experiments on HDTF. User study confirms superior lip-sync accuracy, face similarity, and visual quality compared to other methods. Extending to higher resolutions is challenging due to the need for a large number of identities during training. Improving lip identity preservation in a zero-shot setting with a more effective reference encoder is a potential area for improvement. lip-sync, video generation, stylegan, pose-aware masking, few-shot adaptation
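The temporal-consistency ingredient — smoothing per-frame style codes over time — can be illustrated with a simple exponential moving average, shown below. This is only a rough stand-in for the paper's moving-average-based latent smoothing module; the momentum value and EMA form are assumptions.

```python
# Exponential moving average over per-frame latent codes for temporal smoothing.
import torch


def smooth_latents(codes: torch.Tensor, momentum: float = 0.8) -> torch.Tensor:
    """codes: (T, D) per-frame latent codes -> (T, D) smoothed codes."""
    out = torch.empty_like(codes)
    running = codes[0]
    for t in range(codes.shape[0]):
        running = momentum * running + (1.0 - momentum) * codes[t]
        out[t] = running
    return out


if __name__ == "__main__":
    print(smooth_latents(torch.randn(25, 512)).shape)  # torch.Size([25, 512])
```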
2305.00278 Report Segment Anything Model (SAM) Meets Glass: Mirror and Transparent Objects Cannot Be Easily Detected Dongsheng Han, Chaoning Zhang, Yu Qiao, Maryam Qamar, Yuna Jung, SeungKyu Lee, Sung-Ho Bae, Choong Seon Hong Meta AI Research has recently released SAM (Segment Anything Model) which is trained on a large segmentation dataset of over 1 billion masks. As a foundation model in the field of computer vision, SAM (Segment Anything Model) has gained attention for its impressive performance in generic object segmentation. Despite its strong capability in a wide range of zero-shot transfer tasks, it remains unknown whether SAM can detect things in challenging setups like transparent objects. In this work, we perform an empirical evaluation of two glass-related challenging scenarios: mirror and transparent objects. We found that SAM often fails to detect the glass in both scenarios, which raises concern for deploying the SAM in safety-critical situations that have various forms of glass. This paper presents the first empirical study evaluating the performance of the Segment Anything Model (SAM) in detecting and segmenting transparent and mirror objects. This evaluation is crucial because the failure of SAM to recognize glass in safety-critical applications, where glass is ubiquitous, could lead to serious consequences. The study uses four established benchmark datasets, two for glass (GDD and GSD) and two for mirrors (MSD and PMD), and employs five standard evaluation metrics (IoU, ACC, Fβ, MAE, and BER) to compare SAM with state-of-the-art methods in semantic and glass/mirror segmentation. SAM often fails to detect glass in both mirror and transparent object scenarios, significantly underperforming compared to specialized models. The model frequently recognizes objects behind transparent surfaces but not the glass itself, highlighting its difficulty in distinguishing between transmitted and reflected light. While SAM shows comparable performance to some methods on the PMD mirror dataset (where boundaries are clearer), its performance on the MSD dataset is unsatisfactory, often segmenting reflected objects instead of the mirror. The study primarily focuses on single-object scenes and might not fully represent the complexity of real-world scenarios with multiple transparent or reflective surfaces. Further research is needed to develop strategies, such as incorporating specific data augmentation techniques or fine-tuning SAM on glass-related datasets, to improve its performance in detecting glass. segment anything model (sam), glass detection, transparent object segmentation, mirror segmentation, computer vision
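Two of the metrics used in this evaluation, IoU and balanced error rate (BER), are straightforward to compute for binary masks; reference implementations are given below so the reported numbers are easy to reproduce. The epsilon guards are ours.

```python
# Reference IoU and BER (balanced error rate) for binary glass/mirror masks.
import numpy as np


def iou(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    pred, gt = pred.astype(bool), gt.astype(bool)
    return float((pred & gt).sum() / ((pred | gt).sum() + eps))


def ber(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    pred, gt = pred.astype(bool), gt.astype(bool)
    tpr = (pred & gt).sum() / (gt.sum() + eps)             # recall on the glass/mirror pixels
    tnr = (~pred & ~gt).sum() / ((~gt).sum() + eps)        # recall on the background pixels
    return float(100.0 * (1.0 - 0.5 * (tpr + tnr)))        # lower is better


if __name__ == "__main__":
    p = np.zeros((64, 64), bool); p[:32] = True
    g = np.zeros((64, 64), bool); g[16:48] = True
    print(iou(p, g), ber(p, g))
```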
2305.00121 Report Learning Locally Editable Virtual Humans Hsuan-I Ho, Lixin Xue, Jie Song, Otmar Hilliges In this paper, we propose a novel hybrid representation and end-to-end trainable network architecture to model fully editable and customizable neural avatars. At the core of our work lies a representation that combines the modeling power of neural fields with the ease of use and inherent 3D consistency of skinned meshes. To this end, we construct a trainable feature codebook to store local geometry and texture features on the vertices of a deformable body model, thus exploiting its consistent topology under articulation. This representation is then employed in a generative auto-decoder architecture that admits fitting to unseen scans and sampling of realistic avatars with varied appearances and geometries. Furthermore, our representation allows local editing by swapping local features between 3D assets. To verify our method for avatar creation and editing, we contribute a new high-quality dataset, dubbed CustomHumans, for training and evaluation. Our experiments quantitatively and qualitatively show that our method generates diverse detailed avatars and achieves better model fitting performance compared to state-of-the-art methods. Our code and dataset are available at https://custom-humans.github.io/. This paper introduces a novel hybrid representation and generative framework for creating fully editable and customizable 3D human avatars. Creating personalized and easily editable avatars is crucial for enhancing user engagement in various applications like gaming and the Metaverse. The proposed method combines a trainable feature codebook storing local geometry and texture features on a deformable body model with a generative auto-decoder architecture. This architecture is trained on 3D scans using both 3D reconstruction and 2D adversarial losses. The hybrid representation allows local editing by swapping features between avatars. The model can be fitted to unseen 3D scans, enabling personalization. The generative framework allows for the creation of diverse and detailed avatars by sampling from the learned feature space. The quality of generated avatars relies heavily on the diversity and quality of training data. The editing workflow currently requires manual intervention for feature swapping and could benefit from automation. 3d avatars, neural fields, generative models, avatar customization, local editing
2304.14610 Report ALL-E: Aesthetics-guided Low-light Image Enhancement Ling Li, Dong Liang, Yuanhang Gao, Sheng-Jun Huang, Songcan Chen Evaluating the performance of low-light image enhancement (LLE) is highly subjective, thus making integrating human preferences into image enhancement a necessity. Existing methods fail to consider this and present a series of potentially valid heuristic criteria for training enhancement models. In this paper, we propose a new paradigm, i.e., aesthetics-guided low-light image enhancement (ALL-E), which introduces aesthetic preferences to LLE and motivates training in a reinforcement learning framework with an aesthetic reward. Each pixel, functioning as an agent, refines itself by recursive actions, i.e., its corresponding adjustment curve is estimated sequentially. Extensive experiments show that integrating aesthetic assessment improves both subjective experience and objective evaluation. Our results on various benchmarks demonstrate the superiority of ALL-E over state-of-the-art methods. This paper introduces ALL-E, a novel aesthetics-guided low-light image enhancement (LLE) paradigm incorporating aesthetic assessment to improve the subjective and objective quality of enhanced images. Existing LLE methods rely on heuristic criteria and overlook the crucial role of human subjective evaluation, particularly the impact of aesthetic preferences on perceived image quality. ALL-E employs a reinforcement learning framework where each pixel acts as an agent, refining itself through iterative actions guided by an aesthetic reward. It leverages a pre-trained 'aesthetic oracle network' to provide general aesthetic preferences and incorporates rewards for aesthetics, feature preservation, and exposure control. ALL-E generates visually more appealing enhancements compared to state-of-the-art methods, as demonstrated on LOL and LIME datasets. Quantitative evaluations using NIQE, UNIQUE, PSNR, and SSIM metrics on various datasets confirm ALL-E's superior performance in terms of image quality. Human subjective surveys consistently rank ALL-E as the top performer, highlighting its ability to enhance images while preserving naturalness and visual appeal. The paper acknowledges the potential limitation of ALL-E in handling specific scenarios like nightscapes where preserving the low-light aesthetic might be preferable to brightening the entire scene. Future work will explore incorporating high-level semantic theme guidance to address the issue of excessive exposure adjustment in such cases. low-light image enhancement, aesthetic assessment, reinforcement learning, image quality assessment, computer vision
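The recursive per-pixel curve adjustment described in the ALL-E entry can be sketched as follows. The snippet assumes a Zero-DCE-style quadratic adjustment curve LE(x) = x + a·x·(1−x), which the entry does not confirm as ALL-E's exact parameterization; in the full method a network predicts the per-pixel curve parameters and the aesthetic reward shapes their training, whereas here the parameters are constants for illustration.

```python
import torch

def apply_curve(img: torch.Tensor, alphas: torch.Tensor) -> torch.Tensor:
    """Iteratively apply per-pixel quadratic adjustment curves.

    img:    (B, 3, H, W) tensor in [0, 1]
    alphas: (B, 3*T, H, W) tensor in [-1, 1], one curve-parameter map per step
    """
    steps = torch.chunk(alphas, alphas.shape[1] // 3, dim=1)
    x = img
    for a in steps:
        # LE(x) = x + a * x * (1 - x) keeps values in [0, 1] when a is in [-1, 1]
        x = x + a * x * (1.0 - x)
    return x

# toy usage: brighten a random "low-light" image with a constant curve parameter
low_light = torch.rand(1, 3, 64, 64) * 0.3
alphas = torch.full((1, 3 * 8, 64, 64), 0.6)   # 8 recursive steps
enhanced = apply_curve(low_light, alphas)
print(low_light.mean().item(), enhanced.mean().item())
```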
2304.14573 Report SceneGenie: Scene Graph Guided Diffusion Models for Image Synthesis Azade Farshad, Yousef Yeganeh, Yu Chi, Chengzhi Shen, Björn Ommer, Nassir Navab Text-conditioned image generation has made significant progress in recent years with generative adversarial networks and more recently, diffusion models. While diffusion models conditioned on text prompts have produced impressive and high-quality images, accurately representing complex text prompts such as the number of instances of a specific object remains challenging. To address this limitation, we propose a novel guidance approach for the sampling process in the diffusion model that leverages bounding box and segmentation map information at inference time without additional training data. Through a novel loss in the sampling process, our approach guides the model with semantic features from CLIP embeddings and enforces geometric constraints, leading to high-resolution images that accurately represent the scene. To obtain bounding box and segmentation map information, we structure the text prompt as a scene graph and enrich the nodes with CLIP embeddings. Our proposed model achieves state-of-the-art performance on two public benchmarks for image generation from scene graphs, surpassing both scene graph to image and text-based diffusion models in various metrics. Our results demonstrate the effectiveness of incorporating bounding box and segmentation map guidance in the diffusion model sampling process for more accurate text-to-image generation. This paper introduces a novel guidance approach for diffusion models in image synthesis, enhancing accuracy in depicting complex scenes from text prompts, particularly in representing the correct number of object instances. Existing text-to-image generation models, while producing impressive results, struggle with accurately representing the number of instances of objects in an image, especially from complex textual descriptions. This work addresses this limitation to achieve more precise image generation. The proposed method leverages scene graphs derived from text prompts to predict bounding boxes and segmentation maps. These maps, along with CLIP embeddings, guide the diffusion model's sampling process, ensuring both object realism and correct scene layout. The approach incorporates CLIP text guidance, CLIP bounding box guidance (with an augmented version), and segmentation map guidance. The method outperforms state-of-the-art text-to-image diffusion models and scene graph-to-image approaches on COCO stuff and Visual Genome benchmarks, without requiring additional training. Using predicted bounding box and segmentation map information from text prompts leads to superior results compared to existing models. Incorporating CLIP embeddings in the scene graph nodes enhances the accuracy of bounding box and segmentation predictions, further improving image generation quality. The model faces challenges in generating high-quality images of complex structures like faces, suggesting potential improvement through fine-tuning on specific datasets. Like many diffusion models, the image generation process can be time-consuming during the reverse sampling stage. image synthesis, diffusion models, scene graphs, text-to-image generation, clip embeddings
2304.14530 Report Generating images of rare concepts using pre-trained diffusion models Dvir Samuel, Rami Ben-Ari, Simon Raviv, Nir Darshan, Gal Chechik Text-to-image diffusion models can synthesize high-quality images, but they have various limitations. Here we highlight a common failure mode of these models, namely, generating uncommon concepts and structured concepts like hand palms. We show that their limitation is partly due to the long-tail nature of their training data: web-crawled data sets are strongly unbalanced, causing models to under-represent concepts from the tail of the distribution. We characterize the effect of unbalanced training data on text-to-image models and offer a remedy. We show that rare concepts can be correctly generated by carefully selecting suitable generation seeds in the noise space, using a small reference set of images, a technique that we call SeedSelect. SeedSelect does not require retraining or finetuning the diffusion model. We assess the faithfulness, quality and diversity of SeedSelect in creating rare objects and generating complex formations like hand images, and find it consistently achieves superior performance. We further show the advantage of SeedSelect in semantic data augmentation. Generating semantically appropriate images can successfully improve performance in few-shot recognition benchmarks, for classes from the head and from the tail of the training data of diffusion models This paper introduces SeedSelect, a method for improving text-to-image diffusion models' ability to generate uncommon or structurally complex concepts by carefully selecting the generation seed in the noise space. Current text-to-image diffusion models struggle to generate concepts under-represented in their training data, limiting their ability to synthesize diverse and accurate images. SeedSelect uses a small set of reference images to optimize the initial noise tensor (seed) via gradient descent, minimizing a combined semantic and appearance loss based on CLIP embeddings and the diffusion model's VAE. SeedSelect significantly improves the faithfulness of generated images for rare concepts, as evaluated by pre-trained classifiers and human raters. SeedSelect maintains high image quality (measured by FID) comparable to the original diffusion model. SeedSelect boosts performance in few-shot image recognition tasks, outperforming previous methods in generating valuable and diverse semantic augmentations. SeedSelect struggles to imitate the style of reference images. The optimized seed is prompt-specific and doesn't generalize to other prompts. text-to-image synthesis, diffusion models, long-tail learning, semantic data augmentation, few-shot learning
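The core of SeedSelect, optimizing the generation seed rather than the model weights, can be illustrated with a toy loop. The generator and CLIP encoder below are stand-in linear maps (assumptions, not the paper's components, which also include an appearance loss through the diffusion model's VAE); the point is simply that only the seed carries gradients.

```python
import torch
import torch.nn.functional as F

# Stand-ins for the real pipeline (assumptions, not the paper's code):
# `generate` would be a differentiable text-to-image sampler driven by `seed`,
# and `embed` a CLIP image encoder. Toy linear maps keep this runnable.
torch.manual_seed(0)
W_gen = torch.randn(64, 16)      # toy "generator": 16-d seed -> 64-d image features
W_clip = torch.randn(32, 64)     # toy "CLIP encoder": image features -> 32-d embedding

def generate(seed):
    return torch.tanh(W_gen @ seed)

def embed(img_feat):
    return F.normalize(W_clip @ img_feat, dim=0)

# Embeddings of a few reference images of the rare concept (here: random targets).
ref_embeds = F.normalize(torch.randn(3, 32), dim=1)

# SeedSelect-style loop: optimize the seed, not the model weights.
seed = torch.randn(16, requires_grad=True)
opt = torch.optim.Adam([seed], lr=0.05)
for step in range(200):
    opt.zero_grad()
    e = embed(generate(seed))
    # pull the generated embedding toward the reference set (semantic loss)
    loss = (1.0 - ref_embeds @ e).mean()
    loss.backward()
    opt.step()
print("final loss:", loss.item())
```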
2304.14403 Report Make It So: Steering StyleGAN for Any Image Inversion and Editing Anand Bhattad, Viraj Shah, Derek Hoiem, D. A. Forsyth StyleGAN's disentangled style representation enables powerful image editing by manipulating the latent variables, but accurately mapping real-world images to their latent variables (GAN inversion) remains a challenge. Existing GAN inversion methods struggle to maintain editing directions and produce realistic results. To address these limitations, we propose Make It So, a novel GAN inversion method that operates in the Z (noise) space rather than the typical W (latent style) space. Make It So preserves editing capabilities, even for out-of-domain images. This is a crucial property that was overlooked in prior methods. Our quantitative evaluations demonstrate that Make It So outperforms the state-of-the-art method PTI by a factor of five in inversion accuracy and achieves ten times better edit quality for complex indoor scenes. Presents "Make It So," a novel GAN inversion method that achieves superior accuracy, edit consistency, and generalization for complex scenes compared to existing methods. Accurate GAN inversion is crucial for applying pre-trained GAN models for image manipulation and editing, especially in challenging domains like complex indoor scenes. Make It So inverts images in the noise space (Z space) using joint optimization of the noise vector and the StyleGAN generator. It introduces anchor and support losses for edit consistency and generalization and employs an exponential moving average strategy for faster, cleaner inversion. Significantly outperforms state-of-the-art methods in inversion accuracy by a factor of five. Preserves editing capabilities by a factor of ten, demonstrating superior edit consistency. Generalizes well to out-of-domain images, enabling inversion and editing of images from domains different from the StyleGAN's training data. The optimization-based nature of Make It So limits its real-time applicability. Extremely challenging scenes might require more iterations for near-perfect inversion. gan inversion, image editing, stylegan, deep learning, computer vision
2304.14396 Report Learning Articulated Shape with Keypoint Pseudo-labels from Web Images Anastasis Stathopoulos, Georgios Pavlakos, Ligong Han, Dimitris Metaxas This paper shows that it is possible to learn models for monocular 3D reconstruction of articulated objects (e.g., horses, cows, sheep), using as few as 50-150 images labeled with 2D keypoints. Our proposed approach involves training category-specific keypoint estimators, generating 2D keypoint pseudo-labels on unlabeled web images, and using both the labeled and self-labeled sets to train 3D reconstruction models. It is based on two key insights: (1) 2D keypoint estimation networks trained on as few as 50-150 images of a given object category generalize well and generate reliable pseudo-labels; (2) a data selection mechanism can automatically create a "curated" subset of the unlabeled web images that can be used for training -- we evaluate four data selection methods. Coupling these two insights enables us to train models that effectively utilize web images, resulting in improved 3D reconstruction performance for several articulated object categories beyond the fully-supervised baseline. Our approach can quickly bootstrap a model and requires only a few images labeled with 2D keypoints. This requirement can be easily satisfied for any new object category. To showcase the practicality of our approach for predicting the 3D shape of arbitrary object categories, we annotate 2D keypoints on giraffe and bear images from COCO -- the annotation process takes less than 1 minute per image. This paper presents a method for learning 3D reconstruction models of articulated objects from a limited set of 2D keypoint labeled images (50-150) by leveraging unlabeled web images and pseudo-labeling. Building 3D reconstruction models for articulated objects typically requires a large amount of labeled data, which is often unavailable. This method addresses this challenge by enabling the use of readily available web images. The method involves training a 2D keypoint estimator on a small labeled dataset, generating pseudo-labels for web images, and then using a data selection criterion to curate a subset of web images with high-quality pseudo-labels for training the 3D shape predictor. Training with keypoint pseudo-labels significantly improves 3D reconstruction performance compared to using only limited labeled data. Data selection from web images is crucial, as using all pseudo-labels can degrade performance. Consistency-based data selection methods outperform confidence-based methods in selecting high-quality pseudo-labels. The 3D shape prediction model is limited by the expressiveness of the template mesh and the assumed articulation model. The data selection relies on the quality of the 2D keypoint estimator, which might be limited by the initial small labeled dataset. 3d reconstruction, articulated objects, semi-supervised learning, pseudo-labeling, data selection
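A minimal sketch of one plausible consistency-based selection criterion of the kind the entry refers to: keep a web image only if its keypoint pseudo-labels predicted under two different augmentations agree. The threshold, normalization, and function names are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

def select_by_consistency(preds_a, preds_b, img_sizes, tau=0.05):
    """Keep images whose keypoint pseudo-labels agree across two augmented views.

    preds_a, preds_b: (N, K, 2) keypoints predicted under two augmentations,
                      already mapped back to original image coordinates.
    img_sizes:        (N, 2) array of (height, width) used to normalize distances.
    tau:              max allowed mean keypoint distance, as a fraction of the diagonal.
    """
    diag = np.linalg.norm(img_sizes, axis=1, keepdims=True)   # (N, 1)
    dists = np.linalg.norm(preds_a - preds_b, axis=-1)        # (N, K)
    mean_rel_dist = dists.mean(axis=1) / diag[:, 0]           # (N,)
    return np.where(mean_rel_dist < tau)[0]

# toy usage with random predictions
rng = np.random.default_rng(0)
a = rng.uniform(0, 256, size=(100, 16, 2))
b = a + rng.normal(0, 5, size=a.shape)          # mostly consistent predictions
sizes = np.full((100, 2), 256.0)
kept = select_by_consistency(a, b, sizes, tau=0.05)
print(f"kept {len(kept)} / 100 images")
```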
2304.14376 Report Zero-shot Unsupervised Transfer Instance Segmentation Gyungin Shin, Samuel Albanie, Weidi Xie Segmentation is a core computer vision competency, with applications spanning a broad range of scientifically and economically valuable domains. To date, however, the prohibitive cost of annotation has limited the deployment of flexible segmentation models. In this work, we propose Zero-shot Unsupervised Transfer Instance Segmentation (ZUTIS), a framework that aims to meet this challenge. The key strengths of ZUTIS are: (i) no requirement for instance-level or pixel-level annotations; (ii) an ability of zero-shot transfer, i.e., no assumption on access to a target data distribution; (iii) a unified framework for semantic and instance segmentations with solid performance on both tasks compared to state-of-the-art unsupervised methods. While comparing to previous work, we show ZUTIS achieves a gain of 2.2 mask AP on COCO-20K and 14.5 mIoU on ImageNet-S with 919 categories for instance and semantic segmentations, respectively. The code is made publicly available. This paper introduces ZUTIS, a novel framework for zero-shot unsupervised transfer instance segmentation, enabling simultaneous segmentation of objects and prediction of their semantic categories without human supervision or access to a target dataset. This addresses the high cost of obtaining large, accurate collections of pixel-level annotations for training segmentation models, particularly for diverse and novel object categories. ZUTIS leverages a pretrained vision-language model (e.g., CLIP) to retrieve images and generate pseudo-masks for training. It employs a query-based transformer decoder for instance segmentation and a projection matrix to align image features with text embeddings for semantic segmentation. ZUTIS achieves comparable or better performance than state-of-the-art unsupervised instance segmentation methods on COCO-20K. It demonstrates strong performance in zero-shot semantic segmentation, outperforming previous methods on COCO and CoCA benchmarks. ZUTIS exhibits good generalization to novel, unseen categories, evidenced by its performance on CUB-200-2011 and unseen COCO categories. The reliance on a vision-language model like CLIP limits ZUTIS's ability to segment extremely rare concepts not present in the pretraining data. The pseudo-mask generation pipeline using retrieved images and a saliency detector can lead to errors when images contain distracting objects of similar categories. instance segmentation, semantic segmentation, zero-shot learning, unsupervised learning, vision-language model
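The projection-then-match step used for zero-shot category assignment can be sketched as below: mask-pooled visual features are projected into the text-embedding space and matched to CLIP category embeddings by cosine similarity. Shapes and names are assumptions; the query-based decoder and pseudo-mask training pipeline are not shown.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(mask_feats, text_embeds, proj):
    """Assign a semantic category to each predicted instance mask.

    mask_feats:  (M, Dv) visual features pooled over each predicted mask.
    text_embeds: (K, Dt) CLIP text embeddings of the category names.
    proj:        (Dv, Dt) learned projection aligning visual and text spaces.
    Returns the index of the best-matching category per mask.
    """
    v = F.normalize(mask_feats @ proj, dim=1)     # (M, Dt)
    t = F.normalize(text_embeds, dim=1)           # (K, Dt)
    return (v @ t.t()).argmax(dim=1)              # cosine-similarity argmax

# toy usage with random features and three "categories"
mask_feats = torch.randn(5, 256)
text_embeds = torch.randn(3, 512)
proj = torch.randn(256, 512)
print(zero_shot_classify(mask_feats, text_embeds, proj))
```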
2304.14291 Report EDAPS: Enhanced Domain-Adaptive Panoptic Segmentation Suman Saha, Lukas Hoyer, Anton Obukhov, Dengxin Dai, Luc Van Gool With autonomous industries on the rise, domain adaptation of the visual perception stack is an important research direction due to the cost savings promise. Much prior art was dedicated to domain-adaptive semantic segmentation in the synthetic-to-real context. Despite being a crucial output of the perception stack, panoptic segmentation has been largely overlooked by the domain adaptation community. Therefore, we revisit well-performing domain adaptation strategies from other fields, adapt them to panoptic segmentation, and show that they can effectively enhance panoptic domain adaptation. Further, we study the panoptic network design and propose a novel architecture (EDAPS) designed explicitly for domain-adaptive panoptic segmentation. It uses a shared, domain-robust transformer encoder to facilitate the joint adaptation of semantic and instance features, but task-specific decoders tailored for the specific requirements of both domain-adaptive semantic and instance segmentation. As a result, the performance gap seen in challenging panoptic benchmarks is substantially narrowed. EDAPS significantly improves the state-of-the-art performance for panoptic segmentation UDA by a large margin of 20% on SYNTHIA-to-Cityscapes and even 72% on the more challenging SYNTHIA-to-Mapillary Vistas. The implementation is available at https://github.com/susaha/edaps. Proposes EDAPS, a novel architecture for domain-adaptive panoptic segmentation using a shared transformer encoder and task-specific decoders, enhanced with UDA strategies like self-training, mean teacher, rare class sampling, and ImageNet feature distance. Panoptic segmentation UDA has been overlooked and existing methods achieve subpar performance compared to semantic segmentation UDA, highlighting the need for specialized architectures and UDA techniques. Conducts a systematic study of different panoptic architectures for UDA, identifying the strengths of shared encoder and task-specific decoders. Employs an enhanced UDA strategy incorporating recent techniques from semantic segmentation UDA. Achieves a 20% improvement on SYNTHIA-to-Cityscapes and 72% on SYNTHIA-to-Mapillary Vistas over prior state-of-the-art mPQ. Significantly improves both recognition and segmentation quality compared to previous methods, particularly excelling in challenging classes. Provides an efficient architecture with faster inference speed compared to previous methods like CVRN. Instance pseudo-labels are not explored, potentially leading to further performance gains. The approach is validated on synthetic-to-real benchmarks and its effectiveness on other domain shifts requires further investigation. panoptic segmentation, domain adaptation, unsupervised domain adaptation, transformer, self-training
2304.14070 Report Compositional 3D Human-Object Neural Animation Zhi Hou, Baosheng Yu, Dacheng Tao Human-object interactions (HOIs) are crucial for human-centric scene understanding applications such as human-centric visual generation, AR/VR, and robotics. Since existing methods mainly explore capturing HOIs, rendering HOI remains less investigated. In this paper, we address this challenge in HOI animation from a compositional perspective, i.e., animating novel HOIs including novel interaction, novel human and/or novel object driven by a novel pose sequence. Specifically, we adopt neural human-object deformation to model and render HOI dynamics based on implicit neural representations. To enable the interaction pose transferring among different persons and objects, we then devise a new compositional conditional neural radiance field (or CC-NeRF), which decomposes the interdependence between human and object using latent codes to enable compositionally animation control of novel HOIs. Experiments show that the proposed method can generalize well to various novel HOI animation settings. Our project page is https://zhihou7.github.io/CHONA/ This paper introduces CHONA, a novel approach for compositional 3D human-object neural animation. CHONA reconstructs and renders human-object interactions (HOIs) from sparse multi-view videos using neural implicit representations. Rendering 3D human-object animation is crucial for various applications like AR/VR and robotics, but existing methods struggle to handle novel interactions, human subjects, and object instances. CHONA employs neural human-object deformation, utilizing a pseudo bone for objects and skinning-based techniques for pose-dependent deformation. For compositional control, it utilizes compositional conditional neural radiance fields (CC-NeRF) with disentangled latent codes for human and object identity. CHONA outperforms baseline methods in novel pose animation tasks, especially for larger objects. The compositional invariant learning strategy in CC-NeRF effectively disentangles human and object representations, enabling animation with novel combinations and even non-interactive persons or static objects. Quantitative and qualitative evaluations on BEHAVE, ZJU-mocap, and CO3D datasets demonstrate the effectiveness of CHONA for compositional HOI animation. Accurately understanding the interaction region (object affordance) remains a challenge. Generating object poses from human motion poses based on interaction categories is a potential area for future exploration. human-object interaction, 3d animation, neural radiance fields, compositional representation learning, computer vision
2304.14006 Report Edit Everything: A Text-Guided Generative System for Images Editing Defeng Xie, Ruichen Wang, Jian Ma, Chen Chen, Haonan Lu, Dong Yang, Fobo Shi, Xiaodong Lin We introduce a new generative system called Edit Everything, which can take image and text inputs and produce image outputs. Edit Everything allows users to edit images using simple text instructions. Our system designs prompts to guide the visual module in generating requested images. Experiments demonstrate that Edit Everything facilitates the implementation of the visual aspects of Stable Diffusion with the use of Segment Anything model and CLIP. Our system is publicly available at https://github.com/DefengXie/Edit_Everything. Introduces 'Edit Everything,' a text-guided image editing system that uses SAM for segmentation, CLIP for object ranking, and Stable Diffusion for realistic object replacement. Enables efficient and precise image editing based on natural language instructions, addressing the limitations of traditional image editing tools. Leverages SAM to segment images, trains CLIP on a large Chinese image-text dataset for object ranking, and utilizes Stable Diffusion for generating replacement objects guided by target prompts. Edit Everything effectively edits images based on simple text prompts, seamlessly blending different styles. The system supports iterative editing for complex prompts, allowing for precise control over the generated output. Trained on a large Chinese dataset, Edit Everything outperforms open-source models in Chinese text-guided image editing. The system relies on pre-trained models (SAM, CLIP, SD) without architectural modifications, potentially limiting performance. Iterative editing for complex prompts, while accurate, may not be the most efficient approach. image editing, text-guided generation, stable diffusion, clip, segment anything
2304.14005 Report ContraNeRF: 3D-Aware Generative Model via Contrastive Learning with Unsupervised Implicit Pose Embedding Mijeong Kim, Hyunjoon Lee, Bohyung Han Although 3D-aware GANs based on neural radiance fields have achieved competitive performance, their applicability is still limited to objects or scenes with the ground-truths or prediction models for clearly defined canonical camera poses. To extend the scope of applicable datasets, we propose a novel 3D-aware GAN optimization technique through contrastive learning with implicit pose embeddings. To this end, we first revise the discriminator design and remove dependency on ground-truth camera poses. Then, to capture complex and challenging 3D scene structures more effectively, we make the discriminator estimate a high-dimensional implicit pose embedding from a given image and perform contrastive learning on the pose embedding. The proposed approach can be employed for the dataset, where the canonical camera pose is ill-defined because it does not look up or estimate camera poses. Experimental results show that our algorithm outperforms existing methods by large margins on the datasets with multiple object categories and inconsistent canonical camera poses. This paper introduces ContraNeRF, a novel 3D-aware GAN that leverages contrastive learning with implicit pose embeddings to generate images and their underlying 3D structures without relying on ground-truth camera poses. Existing 3D-aware GANs often depend on ground-truth camera poses, limiting their applicability to datasets with clearly defined canonical views. ContraNeRF overcomes this limitation, enabling the generation of complex scenes with heterogeneous geometric configurations. The authors modify the discriminator of EG3D to predict implicit pose embeddings instead of explicit camera poses. They then utilize contrastive learning to maximize the similarity between embeddings of images rendered from the same viewpoint while minimizing it for different viewpoints. ContraNeRF outperforms state-of-the-art methods in terms of image quality and 3D reconstruction accuracy on datasets like LSUN Bedroom, LSUN Church, AFHQ, and CUB. The implicit pose embedding effectively captures complex 3D scene structures, even when canonical camera poses are ill-defined. Increasing the dimensionality of the pose embedding leads to improved 3D reconstruction quality. While generally successful, ContraNeRF occasionally produces unrealistic geometries, likely due to outlier training samples or limitations in handling extreme viewpoints. Future work could explore techniques to mitigate the impact of outliers and further enhance the robustness of the model to diverse camera poses. generative adversarial networks, 3d reconstruction, contrastive learning, implicit pose embedding, neural radiance fields
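The contrastive objective on implicit pose embeddings can be written as a standard InfoNCE loss, as in the hedged sketch below: embeddings of two generated images rendered from the same camera pose form a positive pair, and all other pairs serve as negatives. This is a generic formulation consistent with the entry, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def pose_contrastive_loss(emb_a, emb_b, temperature=0.1):
    """InfoNCE-style loss over implicit pose embeddings.

    emb_a[i] and emb_b[i] come from two generated images rendered from the SAME
    camera pose (different latent identities); all other pairs act as negatives.
    emb_a, emb_b: (N, D) pose embeddings produced by the discriminator.
    """
    a = F.normalize(emb_a, dim=1)
    b = F.normalize(emb_b, dim=1)
    logits = a @ b.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(a.shape[0], device=a.device)
    # symmetric cross-entropy: match row i of `a` to row i of `b` and vice versa
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# toy usage
emb_a, emb_b = torch.randn(8, 128), torch.randn(8, 128)
print(pose_contrastive_loss(emb_a, emb_b).item())
```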
2304.13850 Report Do SSL Models Have Déjà Vu? A Case of Unintended Memorization in Self-supervised Learning Casey Meehan, Florian Bordes, Pascal Vincent, Kamalika Chaudhuri, Chuan Guo Self-supervised learning (SSL) algorithms can produce useful image representations by learning to associate different parts of natural images with one another. However, when taken to the extreme, SSL models can unintentionally memorize specific parts of individual training samples rather than learning semantically meaningful associations. In this work, we perform a systematic study of the unintended memorization of image-specific information in SSL models -- which we refer to as déjà vu memorization. Concretely, we show that given the trained model and a crop of a training image containing only the background (e.g., water, sky, grass), it is possible to infer the foreground object with high accuracy or even visually reconstruct it. Furthermore, we show that déjà vu memorization is common to different SSL algorithms, is exacerbated by certain design choices, and cannot be detected by conventional techniques for evaluating representation quality. Our study of déjà vu memorization reveals previously unknown privacy risks in SSL models and suggests potential practical mitigation strategies. Code is available at https://github.com/facebookresearch/DejaVu. This paper investigates and characterizes *déjà vu memorization*: the phenomenon where self-supervised learning (SSL) models unintentionally memorize specific details from individual training images, enabling the recovery of masked information beyond what can be inferred from correlations within the data distribution. This work highlights previously unknown privacy risks in SSL models, particularly concerning the potential extraction of sensitive information from trained models. As SSL models gain popularity as foundation models in image processing, understanding and mitigating these risks is crucial for responsible AI development and deployment. The authors develop a novel testing methodology that leverages a target model trained on a specific dataset and a reference model trained on a similar but disjoint dataset. By comparing their ability to infer masked information from training images, they can distinguish between memorization (unique to the target model) and correlation (present in both models). They further visualize this memorization through image reconstructions using a public dataset and a conditional generative model (RCDM). SSL models exhibit a significant degree of déjà vu memorization, even surpassing supervised models in some cases. Memorization is exacerbated by factors like increasing training epochs and larger model capacity, while training set size has minimal effect. Certain SSL training criteria (e.g., VICReg) are more susceptible to memorization than others (e.g., SimCLR, BYOL). The paper primarily focuses on image classification as a measure of memorization and relies on bounding box annotations for a subset of experiments. Further research is needed to understand the underlying mechanisms of déjà vu memorization and develop more robust mitigation strategies beyond hyperparameter tuning and architectural modifications. self-supervised learning, memorization, privacy risks, image reconstruction, data protection
2304.13844 Report GazeSAM: What You See is What You Segment Bin Wang, Armstrong Aboah, Zheyuan Zhang, Ulas Bagci This study investigates the potential of eye-tracking technology and the Segment Anything Model (SAM) to design a collaborative human-computer interaction system that automates medical image segmentation. We present the GazeSAM system to enable radiologists to collect segmentation masks by simply looking at the region of interest during image diagnosis. The proposed system tracks radiologists' eye movement and utilizes the eye-gaze data as the input prompt for SAM, which automatically generates the segmentation mask in real time. This study is the first work to leverage the power of eye-tracking technology and SAM to enhance the efficiency of daily clinical practice. Moreover, eye-gaze data coupled with image and corresponding segmentation labels can be easily recorded for further advanced eye-tracking research. The code is available at https://github.com/ukaukaaaa/GazeSAM. GazeSAM, a collaborative human-computer interaction system that combines eye-tracking technology with the Segment Anything Model (SAM) for real-time medical image segmentation. To address the time-consuming and costly manual segmentation process in medical image analysis, GazeSAM aims to automate segmentation, thereby improving efficiency in clinical practice. GazeSAM uses a screen-based eye tracker to capture eye gaze data and transforms it into point prompts for SAM. The model then generates segmentation masks in real-time based on the user's eye movements, allowing for both coarse and refined segmentation. GazeSAM enables real-time segmentation of both 2D and 3D medical images. The system provides an intuitive interface for users to interact with the model and refine segmentation results by simply looking at desired areas. GazeSAM facilitates the collection of eye-tracking data synchronized with images and segmentations, which can be valuable for further research in eye-tracking and medical image analysis. The performance of GazeSAM may be limited by the inherent limitations of SAM in accurately segmenting medical images, especially those not well-represented in SAM's training data. Future work includes fine-tuning SAM on large-scale medical image datasets to improve its segmentation accuracy in the medical domain. eye-tracking, segment anything model, medical image segmentation, human-computer interaction, real-time segmentation
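One simple way to turn a raw gaze trace into point prompts, of the kind GazeSAM feeds to SAM, is dispersion-based fixation grouping, sketched below with arbitrary thresholds; the paper's actual fixation handling and prompting code may differ. The resulting fixation centers would then be passed to SAM as positive point prompts.

```python
import numpy as np

def gaze_to_point_prompts(gaze_xy, radius=25.0, min_samples=10):
    """Collapse a raw gaze trace into fixation points usable as point prompts.

    gaze_xy:     (T, 2) gaze positions in image pixels, in time order.
    radius:      samples within this distance of the running fixation center are merged.
    min_samples: fixations shorter than this many samples are treated as saccades.
    Returns an (N, 2) array of fixation centers (the point prompts).
    """
    prompts, cluster = [], [gaze_xy[0]]
    for p in gaze_xy[1:]:
        center = np.mean(cluster, axis=0)
        if np.linalg.norm(p - center) <= radius:
            cluster.append(p)
        else:
            if len(cluster) >= min_samples:
                prompts.append(np.mean(cluster, axis=0))
            cluster = [p]
    if len(cluster) >= min_samples:
        prompts.append(np.mean(cluster, axis=0))
    return np.array(prompts)

# toy usage: two fixations separated by a quick saccade
rng = np.random.default_rng(0)
trace = np.concatenate([rng.normal([100, 120], 5, (60, 2)),
                        rng.normal([300, 200], 5, (60, 2))])
print(gaze_to_point_prompts(trace))
```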
2304.13518 Report Super-NeRF: View-consistent Detail Generation for NeRF super-resolution Yuqi Han, Tao Yu, Xiaohang Yu, Yuwang Wang, Qionghai Dai The neural radiance field (NeRF) has achieved remarkable success in modeling 3D scenes and synthesizing high-fidelity novel views. However, existing NeRF-based methods focus on making full use of the input image resolution to generate novel views, but pay less attention to generating details when the input resolution is limited. By analogy with the widespread use of image super-resolution, NeRF super-resolution is an effective way to generate a high-resolution implicit representation of 3D scenes and holds great potential for applications. To date, this important topic remains under-explored. In this paper, we propose a NeRF super-resolution method, named Super-NeRF, to generate high-resolution NeRF from only low-resolution inputs. Given multi-view low-resolution images, Super-NeRF constructs a consistency-controlling super-resolution module to generate view-consistent high-resolution details for NeRF. Specifically, an optimizable latent code is introduced for each low-resolution input image to control the 2D super-resolution images to converge to the view-consistent output. The latent codes of each low-resolution image are optimized synergistically with the target Super-NeRF representation to fully utilize the view consistency constraint inherent in NeRF construction. We verify the effectiveness of Super-NeRF on synthetic, real-world, and AI-generated NeRF datasets. Super-NeRF achieves state-of-the-art NeRF super-resolution performance on high-resolution detail generation and cross-view consistency. This paper proposes Super-NeRF, a novel method for achieving view-consistent super-resolution of neural radiance fields (NeRFs) using only low-resolution (LR) input images. High-quality NeRF reconstruction typically requires high-resolution (HR) images, which are costly to capture, store, and transmit. Super-NeRF addresses this by generating plausible HR details while preserving 3D consistency, enabling high-quality novel view synthesis from readily available LR inputs. Super-NeRF utilizes a consistency-controlling super-resolution (CCSR) module and a mutual learning strategy between the CCSR and an HR NeRF. The CCSR explores diverse HR image solutions guided by a pre-trained LR NeRF, and a consistency enforcing module ensures adherence to LR inputs. Mutual learning between the CCSR and HR NeRF progressively refines HR details while maintaining view consistency. Super-NeRF generates sharper edges and finer texture details compared to baselines on various datasets, including LLFF, Synthetic 360, Blender, and FaceScape. Quantitative evaluation using LPIPS and NIQE metrics demonstrates improved perceptual quality and superior view consistency achieved by Super-NeRF. Super-NeRF exhibits strong generalization capability, successfully extending to AI-generated NeRFs from Dreamfusion and gracefully handling hybrid-resolution input settings. Current implementation only utilizes a 4x SR model; exploring higher upsampling ratios (8x, 16x) with lightweight, powerful SR backbones is a potential future direction. Training speed can be enhanced by integrating faster NeRF architectures like TensorRF or InstantNGP without altering the core framework or training strategy. neural radiance field, nerf super-resolution, view consistency, generative super-resolution, 3d scene reconstruction
2304.13509 Report EasyPortrait -- Face Parsing and Portrait Segmentation Dataset Karina Kvanchiani, Elizaveta Petrova, Karen Efremyan, Alexander Sautin, Alexander Kapitanov Video conferencing apps now ship computer vision features such as real-time background removal and face beautification. The limited variability of existing portrait segmentation and face parsing datasets in head poses, ethnicity, scenes, and the occlusions typical of video conferencing motivated us to create a new dataset, EasyPortrait, for both tasks simultaneously. It contains 40,000 primarily indoor photos replicating video meeting scenarios, with 13,705 unique users and fine-grained segmentation masks separated into 9 classes. Shortcomings in the annotation masks of other datasets prompted a revision of our annotation guidelines, so EasyPortrait supports use cases such as teeth whitening and skin smoothing. We also describe our pipeline for data mining and high-quality mask annotation via crowdsourcing. Ablation studies demonstrate the importance of data quantity and head-pose diversity in our dataset for effective model training. Cross-dataset evaluation confirms the best domain generalization ability among portrait segmentation datasets. Moreover, segmentation models train well on EasyPortrait without extra training tricks. The proposed dataset and trained models are publicly available. Introduces EasyPortrait, a novel dataset for face parsing and portrait segmentation, specifically designed for video conferencing applications. Existing datasets lack variability in head poses, ethnicity, scenes, and occlusions typical of video conferencing, hindering the development of robust and accurate models for real-time applications. Collected 40,000 images from 13,705 unique users, focusing on indoor settings and video meeting scenarios. Employed a crowdsourcing pipeline with rigorous quality control for accurate annotation of 9 classes, including teeth. Data quantity and head pose diversity significantly impact model performance. EasyPortrait-trained models achieve state-of-the-art results on cross-dataset evaluations for portrait segmentation. EasyPortrait enables the training of high-performing models without requiring specialized training tricks or occlusion simulations. Current domain is limited to single-person scenarios, limiting generalizability to more complex scenes. Future work includes expanding annotation to cover additional facial features and accessories like hair, glasses, and headphones. face parsing, portrait segmentation, dataset, video conferencing, deep learning
2304.13445 Report Neural-PBIR Reconstruction of Shape, Material, and Illumination Cheng Sun, Guangyan Cai, Zhengqin Li, Kai Yan, Cheng Zhang, Carl Marshall, Jia-Bin Huang, Shuang Zhao, Zhao Dong Reconstructing the shape and spatially varying surface appearances of a physical-world object as well as its surrounding illumination based on 2D images (e.g., photographs) of the object has been a long-standing problem in computer vision and graphics. In this paper, we introduce an accurate and highly efficient object reconstruction pipeline combining neural based object reconstruction and physics-based inverse rendering (PBIR). Our pipeline firstly leverages a neural SDF based shape reconstruction to produce high-quality but potentially imperfect object shape. Then, we introduce a neural material and lighting distillation stage to achieve high-quality predictions for material and illumination. In the last stage, initialized by the neural predictions, we perform PBIR to refine the initial results and obtain the final high-quality reconstruction of object shape, material, and illumination. Experimental results demonstrate our pipeline significantly outperforms existing methods quality-wise and performance-wise. A novel, efficient, and accurate inverse rendering pipeline named Neural-PBIR that combines neural reconstruction and physics-based inverse rendering (PBIR) to jointly estimate geometry, spatially varying material reflectance, and HDR environment map from multi-view images of an object. Existing neural rendering methods are computationally expensive, often neglecting complex light transport effects like interreflection. Conversely, PBIR methods can handle complex lighting but are prone to local minima. Neural-PBIR aims to address these limitations by leveraging the strengths of both approaches. The pipeline consists of three stages: 1) fast neural SDF and radiance field reconstruction using a hybrid volume representation, 2) neural material and lighting distillation from the reconstructed radiance field, and 3) joint refinement of geometry, materials, and lighting using a PBIR framework with a differentiable renderer handling global illumination effects. Neural-PBIR outperforms state-of-the-art methods on both synthetic and real datasets in terms of reconstruction accuracy and computational efficiency. The neural material distilling stage provides high-quality initialization for the PBIR stage, significantly reducing optimization time. The PBIR stage, by modeling global illumination effects, improves material and lighting reconstruction accuracy compared to optimization without considering interreflection. The method may still exhibit 'baking' artifacts when initial predictions are far from the ground truth. The current implementation assumes opaque materials and does not support transparent or translucent objects. Future work will focus on addressing the limitations and extending the method to handle more complex material types. inverse rendering, neural rendering, physics-based inverse rendering, material reconstruction, lighting estimation
2304.13386 Report VGOS: Voxel Grid Optimization for View Synthesis from Sparse Inputs Jiakai Sun, Zhanjie Zhang, Jiafu Chen, Guangyuan Li, Boyan Ji, Lei Zhao, Wei Xing, Huaizhong Lin Neural Radiance Fields (NeRF) has shown great success in novel view synthesis due to its state-of-the-art quality and flexibility. However, NeRF requires dense input views (tens to hundreds) and a long training time (hours to days) for a single scene to generate high-fidelity images. Although using the voxel grids to represent the radiance field can significantly accelerate the optimization process, we observe that for sparse inputs, the voxel grids are more prone to overfitting to the training views and will have holes and floaters, which leads to artifacts. In this paper, we propose VGOS, an approach for fast (3-5 minutes) radiance field reconstruction from sparse inputs (3-10 views) to address these issues. To improve the performance of voxel-based radiance field in sparse input scenarios, we propose two methods: (a) We introduce an incremental voxel training strategy, which prevents overfitting by suppressing the optimization of peripheral voxels in the early stage of reconstruction. (b) We use several regularization techniques to smooth the voxels, which avoids degenerate solutions. Experiments demonstrate that VGOS achieves state-of-the-art performance for sparse inputs with super-fast convergence. Code will be available at https://github.com/SJoJoK/VGOS. This paper proposes VGOS, an approach for fast novel view synthesis from sparse inputs (3-10 views) based on voxel grid optimization, which addresses overfitting and artifacts issues. NeRF achieved great success but requires dense inputs and long training times. Existing fast methods still need dense inputs for quality, and sparse input methods are limited by pre-training needs or extra data requirements like depth maps. The paper introduces: (a) an incremental voxel training strategy that prevents overfitting by gradually incorporating peripheral voxels during optimization and (b) a voxel smoothing method with color-aware total variation loss and depth smoothness loss for artifact reduction. VGOS achieves state-of-the-art performance for sparse inputs without pre-trained models. It outperforms previous methods on Realistic Synthetic 360° and LLFF datasets in terms of PSNR and SSIM. The method demonstrates a significant speedup of 10-100 times compared to previous approaches, achieving high-quality results within minutes. The model shows slightly lower performance on LPIPS, a perceptual metric, compared to some methods using pre-trained models. Future work could explore the integration of high-level information from pre-trained models for improved perceptual quality. novel view synthesis, neural radiance fields, sparse input, voxel grids, fast training
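The voxel-smoothing idea can be illustrated with a plain total-variation regularizer over a dense grid, as below. VGOS's actual losses are a color-aware TV term plus depth smoothness; this sketch shows only the basic TV component and is not the paper's implementation.

```python
import torch

def total_variation_3d(grid: torch.Tensor) -> torch.Tensor:
    """Total-variation regularizer over a dense voxel grid.

    grid: (C, X, Y, Z) tensor of per-voxel values (e.g. density or color features).
    Penalizes differences between neighboring voxels along each spatial axis,
    which discourages the floaters/holes that appear with sparse input views.
    """
    dx = (grid[:, 1:, :, :] - grid[:, :-1, :, :]).pow(2).mean()
    dy = (grid[:, :, 1:, :] - grid[:, :, :-1, :]).pow(2).mean()
    dz = (grid[:, :, :, 1:] - grid[:, :, :, :-1]).pow(2).mean()
    return dx + dy + dz

# toy usage: a noisy grid incurs a much larger TV penalty than a smooth one
noisy = torch.rand(1, 32, 32, 32)
smooth = torch.linspace(0, 1, 32).view(1, 32, 1, 1).expand(1, 32, 32, 32)
print(total_variation_3d(noisy).item(), total_variation_3d(smooth).item())
```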
2304.13348 Report TextDeformer: Geometry Manipulation using Text Guidance William Gao, Noam Aigerman, Thibault Groueix, Vladimir G. Kim, Rana Hanocka We present a technique for automatically producing a deformation of an input triangle mesh, guided solely by a text prompt. Our framework is capable of deformations that produce both large, low-frequency shape changes, and small high-frequency details. Our framework relies on differentiable rendering to connect geometry to powerful pre-trained image encoders, such as CLIP and DINO. Notably, updating mesh geometry by taking gradient steps through differentiable rendering is notoriously challenging, commonly resulting in deformed meshes with significant artifacts. These difficulties are amplified by noisy and inconsistent gradients from CLIP. To overcome this limitation, we opt to represent our mesh deformation through Jacobians, which updates deformations in a global, smooth manner (rather than locally-sub-optimal steps). Our key observation is that Jacobians are a representation that favors smoother, large deformations, leading to a global relation between vertices and pixels, and avoiding localized noisy gradients. Additionally, to ensure the resulting shape is coherent from all 3D viewpoints, we encourage the deep features computed on the 2D encoding of the rendering to be consistent for a given vertex from all viewpoints. We demonstrate that our method is capable of smoothly-deforming a wide variety of source mesh and target text prompts, achieving both large modifications to, e.g., body proportions of animals, as well as adding fine semantic details, such as shoe laces on an army boot and fine details of a face. This paper introduces TextDeformer, a method for deforming existing 3D meshes into new shapes guided by text prompts, utilizing differentiable rendering and pre-trained image encoders like CLIP. This approach allows for automated, semantically-aware mesh deformation, enabling both large-scale shape changes and the addition of fine details, which are challenging to achieve with traditional deformation techniques. TextDeformer represents deformations using per-triangle Jacobians, optimized through a combination of CLIP-based semantic loss, view consistency loss (to ensure coherence across viewpoints), and Jacobian regularization (to preserve source shape characteristics). TextDeformer can deform diverse source meshes to match various target texts, demonstrating generalization across different shapes and prompts. The method can produce both high-frequency details (e.g., giraffe spots, shoe laces) and low-frequency shape modifications (e.g., body proportions of animals, guitar body shapes). Utilizing Jacobians leads to smoother and more globally coherent deformations compared to directly optimizing vertex displacements, resulting in higher-quality output meshes. The optimization process can be computationally expensive, taking several hours for each deformation. Exploring the possibility of learning a space of prompt-driven deformations for faster inference and potential improvements through neural regularization. 3d mesh deformation, text-guided synthesis, differentiable rendering, clip, jacobians
2304.13153 Report LumiGAN: Unconditional Generation of Relightable 3D Human Faces Boyang Deng, Yifan Wang, Gordon Wetzstein Unsupervised learning of 3D human faces from unstructured 2D image data is an active research area. While recent works have achieved an impressive level of photorealism, they commonly lack control of lighting, which prevents the generated assets from being deployed in novel environments. To this end, we introduce LumiGAN, an unconditional Generative Adversarial Network (GAN) for 3D human faces with a physically based lighting module that enables relighting under novel illumination at inference time. Unlike prior work, LumiGAN can create realistic shadow effects using an efficient visibility formulation that is learned in a self-supervised manner. LumiGAN generates plausible physical properties for relightable faces, including surface normals, diffuse albedo, and specular tint without any ground truth data. In addition to relightability, we demonstrate significantly improved geometry generation compared to state-of-the-art non-relightable 3D GANs and notably better photorealism than existing relightable GANs. LumiGAN is an unconditional 3D GAN for generating photorealistic and relightable 3D human faces, trained solely on unstructured single-view images under unknown and varying lighting conditions. Existing 3D GANs for generating human faces lack control of lighting, which prevents the generated assets from being deployed in novel environments. The framework uses an expressive, yet efficient physically based lighting model to learn to generate geometry, albedo, specular tint, and visibility components of a person's face. It employs a novel Neural Radiance Transfer approach to efficiently model visibility, producing plausible shadows without expensive ray casting. Significantly improved photorealism and view consistency compared to existing relightable 3D GANs. Improved geometry generation due to the physically-based lighting model and self-supervised training. Generates plausible physical properties like surface normals, diffuse albedo, and specular tint without ground truth data. Extending NRT (Neural Radiance Transfer) to dynamic scenes is non-trivial. Potential lack of diversity in generated faces due to dataset biases. generative adversarial networks (gans), 3d face generation, relighting, neural rendering, computer vision
2304.13141 Report CN-DHF: Compact Neural Double Height-Field Representations of 3D Shapes Eric Hedlin, Jinfan Yang, Nicholas Vining, Kwang Moo Yi, Alla Sheffer We introduce CN-DHF (Compact Neural Double-Height-Field), a novel hybrid neural implicit 3D shape representation that is dramatically more compact than the current state of the art. Our representation leverages Double-Height-Field (DHF) geometries, defined as closed shapes bounded by a pair of oppositely oriented height-fields that share a common axis, and leverages the following key observations: DHFs can be compactly encoded as 2D neural implicits that capture the maximal and minimal heights along the DHF axis; and typical closed 3D shapes are well represented as intersections of a very small number (three or fewer) of DHFs. We represent input geometries as CNDHFs by first computing the set of DHFs whose intersection well approximates each input shape, and then encoding these DHFs via neural fields. Our approach delivers high-quality reconstructions, and reduces the reconstruction error by a factor of 2.5 on average compared to the state-of-the-art, given the same parameter count or storage capacity. Compared to the best-performing alternative, our method produced higher accuracy models on 94% of the 400 input shape and parameter count combinations tested. The paper introduces CN-DHF, a new hybrid neural implicit representation for 3D shapes that is significantly more compact than current state-of-the-art methods. Compact 3D shape representations are crucial for applications like video games, streaming, and VR/AR, where storage, transmission, and processing times are critical. CN-DHF leverages Double-Height-Field (DHF) surfaces, representing shapes as intersections of a few DHFs. Each DHF is encoded as a 2D neural field capturing maximal and minimal heights along a chosen axis. A geometric algorithm finds optimal DHF axes, and a Multi-Layer Perceptron (MLP) models individual DHFs using a loss function combining positional and Laplacian terms. CN-DHF models achieve higher accuracy than state-of-the-art alternatives with the same storage capacity, reducing reconstruction error by an average factor of 2.5. 94% of CN-DHF models outperform the best alternative in terms of reconstruction accuracy for given parameter counts. Analysis reveals that most virtual environment shapes can be accurately represented using an intersection of three or fewer DHFs. CN-DHF may not accurately capture portions of geometry invisible from the outside. Representations are limited by the number of DHFs used, potentially impacting accuracy for highly complex shapes requiring more than three DHFs. 3d shape representation, neural implicit representation, double-height-field (dhf), compactness, reconstruction accuracy
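The DHF representation itself is easy to sketch: a point is inside one DHF if its height along the chosen axis lies between a minimum and a maximum height field, and a CN-DHF shape is the logical AND of a few such tests. In the paper the height fields are small 2D neural implicits; the analytic sphere functions below are stand-ins so the example runs.

```python
import numpy as np

def inside_dhf(points, h_min, h_max, axis):
    """Membership test for one double height-field (DHF).

    points: (N, 3) query points in the DHF's local frame, in [0, 1]^3.
    h_min, h_max: callables mapping (N, 2) in-plane coords -> (N,) heights;
                  in CN-DHF these would be small 2D neural fields.
    axis: the height axis (0, 1, or 2).
    """
    plane = np.delete(points, axis, axis=1)   # (N, 2) coords in the base plane
    h = points[:, axis]
    return (h >= h_min(plane)) & (h <= h_max(plane))

def inside_shape(points, dhfs):
    """A CN-DHF shape is the intersection of a few DHFs (logical AND)."""
    mask = np.ones(len(points), dtype=bool)
    for h_min, h_max, axis in dhfs:
        mask &= inside_dhf(points, h_min, h_max, axis)
    return mask

# toy usage: a sphere of radius 0.5 centered at (0.5, 0.5, 0.5) is a single DHF along z
def sphere_h_min(xy):
    return 0.5 - np.sqrt(np.clip(0.25 - ((xy - 0.5) ** 2).sum(1), 0, None))

def sphere_h_max(xy):
    return 0.5 + np.sqrt(np.clip(0.25 - ((xy - 0.5) ** 2).sum(1), 0, None))

pts = np.random.default_rng(0).uniform(0, 1, size=(10000, 3))
occ = inside_shape(pts, [(sphere_h_min, sphere_h_max, 2)])
print("estimated sphere volume:", occ.mean())   # ~ 4/3 * pi * 0.5^3 ≈ 0.52
```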
2304.13027 Report A Strong and Reproducible Object Detector with Only Public Datasets Tianhe Ren, Jianwei Yang, Shilong Liu, Ailing Zeng, Feng Li, Hao Zhang, Hongyang Li, Zhaoyang Zeng, Lei Zhang This work presents Focal-Stable-DINO, a strong and reproducible object detection model which achieves 64.6 AP on COCO val2017 and 64.8 AP on COCO test-dev using only 700M parameters without any test time augmentation. It explores the combination of the powerful FocalNet-Huge backbone with the effective Stable-DINO detector. Different from existing SOTA models that utilize an extensive number of parameters and complex training techniques on large-scale private data or merged data, our model is exclusively trained on the publicly available dataset Objects365, which ensures the reproducibility of our approach. This work presents Focal-Stable-DINO, a strong and reproducible object detection model achieving 64.6 AP on COCO val2017 and 64.8 AP on COCO test-dev using only 700M parameters without test time augmentation. This model addresses the limited reproducibility of recent object detection advancements due to reliance on large-scale private data and complex training methods. The model combines the FocalNet-Huge backbone pre-trained on ImageNet-22K with the Stable-DINO detector, trained solely on the publicly available Objects365 dataset. Focal-Stable-DINO achieves 64.6 AP on COCO val2017 and 64.8 AP on COCO test-dev without test time augmentation. Analysis reveals performance disparity across object classes and significant room for improvement in detecting small objects. Study highlights inconsistencies and inaccuracies in COCO annotations impacting model evaluation. Model performance still limited by object class and small object detection. Future work should focus on improving dataset annotation quality alongside model performance. object detection, focalnet, stable-dino, reproducibility, coco dataset
2304.12944 Report Latent Traversals in Generative Models as Potential Flows Yue Song, T. Anderson Keller, Nicu Sebe, Max Welling Despite the significant recent progress in deep generative models, the underlying structure of their latent spaces is still poorly understood, thereby making the task of performing semantically meaningful latent traversals an open research challenge. Most prior work has aimed to solve this challenge by modeling latent structures linearly, and finding corresponding linear directions which result in `disentangled' generations. In this work, we instead propose to model latent structures with a learned dynamic potential landscape, thereby performing latent traversals as the flow of samples down the landscape's gradient. Inspired by physics, optimal transport, and neuroscience, these potential landscapes are learned as physically realistic partial differential equations, thereby allowing them to flexibly vary over both space and time. To achieve disentanglement, multiple potentials are learned simultaneously, and are constrained by a classifier to be distinct and semantically self-consistent. Experimentally, we demonstrate that our method achieves both more qualitatively and quantitatively disentangled trajectories than state-of-the-art baselines. Further, we demonstrate that our method can be integrated as a regularization term during training, thereby acting as an inductive bias towards the learning of structured representations, ultimately improving model likelihood on similarly structured data. This paper proposes a novel method for performing disentangled latent traversals in pre-trained generative models by modeling them as the flow of particles down learned dynamic potential landscapes defined by physically realistic partial differential equations. Existing methods for latent traversal often struggle to disentangle semantic attributes due to limitations in modeling the complexity of the latent space. This work leverages intuitions from physics, optimal transport, and neuroscience to achieve more realistic and disentangled latent traversals. The method learns potential functions as partial differential equations (PDEs), specifically the wave equation, to guide the flow of samples in the latent space. An auxiliary classifier is used to encourage the learned potentials to correspond to distinct and semantically consistent transformations. For VAEs, the method can be integrated during training as a regularization term to structure the latent space and improve likelihood on similarly structured data. The method achieves qualitatively and quantitatively more disentangled trajectories compared to state-of-the-art baselines on pre-trained GANs (SNGAN, BigGAN, StyleGAN2) and VAEs. Integrating the method as a regularization term during VAE training improves likelihood on MNIST and dSprites datasets, indicating a beneficial inductive bias for learning structured representations. Empirical analysis demonstrates that the method can model unambiguous traversal paths with diverse shapes, capturing a wide range of semantic transformations. The current formulation mainly explores the second-order wave equation; investigating alternative PDEs could be beneficial. The potential flow model used has inherent limitations in representing all types of physical flows (e.g., those with vorticity), potentially limiting its applicability to certain transformations. latent traversal, disentanglement, generative models, partial differential equations, potential flow
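The traversal mechanism reduces to gradient flow on a learned potential: repeatedly step the latent code down the gradient of U(z) and decode each intermediate code with the frozen generator. The sketch below uses a toy quadratic potential and a fixed step size; the paper's potentials are time-varying PDE (wave-equation) solutions constrained by a classifier, which is not reproduced here.

```python
import torch

def traverse(z0, potential, steps=50, step_size=0.1):
    """Move a latent code down the gradient of a learned potential U(z).

    z0:        (D,) starting latent code.
    potential: differentiable callable mapping (D,) -> scalar; in the paper this
               is parameterized via a wave-equation PDE, here it is any torch fn.
    Returns the list of latent codes along the trajectory, which would be fed to
    the frozen generator to render the traversal.
    """
    z = z0.clone().requires_grad_(True)
    path = [z.detach().clone()]
    for _ in range(steps):
        (grad,) = torch.autograd.grad(potential(z), z)
        z = (z - step_size * grad).detach().requires_grad_(True)
        path.append(z.detach().clone())
    return path

# toy usage: a quadratic bowl centered at a target code pulls z toward it
target = torch.randn(512)
U = lambda z: 0.5 * ((z - target) ** 2).sum()
path = traverse(torch.randn(512), U)
print("distance to target, start vs end:",
      (path[0] - target).norm().item(), (path[-1] - target).norm().item())
```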
2304.12748 Report Inverting the Imaging Process by Learning an Implicit Camera Model Xin Huang, Qi Zhang, Ying Feng, Hongdong Li, Qing Wang Representing visual signals with implicit coordinate-based neural networks, as an effective replacement of the traditional discrete signal representation, has gained considerable popularity in computer vision and graphics. In contrast to existing implicit neural representations which focus on modelling the scene only, this paper proposes a novel implicit camera model which represents the physical imaging process of a camera as a deep neural network. We demonstrate the power of this new implicit camera model on two inverse imaging tasks: i) generating all-in-focus photos, and ii) HDR imaging. Specifically, we devise an implicit blur generator and an implicit tone mapper to model the aperture and exposure of the camera's imaging process, respectively. Our implicit camera model is jointly learned together with implicit scene models under multi-focus stack and multi-exposure bracket supervision. We have demonstrated the effectiveness of our new model on a large number of test images and videos, producing accurate and visually appealing all-in-focus and high dynamic range images. In principle, our new implicit neural camera model has the potential to benefit a wide array of other inverse imaging tasks. This paper introduces an implicit neural camera model to represent the physical imaging process of a camera, enabling tasks like generating all-in-focus photos and HDR imaging. Existing implicit neural representations primarily focus on scene modeling, neglecting the crucial role of the camera imaging process in image formation. The model consists of a blur generator (simulating aperture effects) and a tone mapper (simulating exposure effects). It's jointly trained with implicit scene models using multi-focus and multi-exposure image stacks. Outperforms state-of-the-art methods in all-in-focus and HDR imaging. Generates accurate and visually appealing all-in-focus and HDR images from fewer input images compared to traditional methods. Enables controllable rendering with adjustable focus and exposure. Requires significant training time per scene. Noise modeling and handling scenes with complex geometry or extreme deformations need further improvement. implicit neural representation, camera model, inverse imaging, hdr imaging, all-in-focus imaging
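As a rough illustration of the tone-mapper half of this implicit camera model, the sketch below maps HDR radiance from an implicit scene model plus an exposure code to an observed LDR color. The network size and the input encoding (scaling radiance by the exponentiated exposure) are assumptions for illustration; the paper's blur generator for aperture effects is omitted entirely.

```python
import torch
import torch.nn as nn

class ImplicitToneMapper(nn.Module):
    """Maps scene radiance + exposure setting to an observed LDR color (simplified sketch)."""
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),     # LDR output in [0, 1]
        )

    def forward(self, radiance: torch.Tensor, log_exposure: torch.Tensor) -> torch.Tensor:
        # radiance: (N, 3) HDR values from the implicit scene model; log_exposure: (N, 1)
        # Exposure scales the incoming radiance before the learned camera response is applied.
        return self.net(torch.cat([radiance * log_exposure.exp(), log_exposure], dim=-1))

if __name__ == "__main__":
    tm = ImplicitToneMapper()
    print(tm(torch.rand(1024, 3) * 10, torch.randn(1024, 1)).shape)  # torch.Size([1024, 3])
```

Training would pair renderings of the same scene under different exposures with the corresponding bracketed photos, so the scene model learns HDR radiance while the tone mapper absorbs the camera's response curve.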
2304.12526 Report Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models Zhendong Wang, Yifan Jiang, Huangjie Zheng, Peihao Wang, Pengcheng He, Zhangyang Wang, Weizhu Chen, Mingyuan Zhou Diffusion models are powerful, but they require a lot of time and data to train. We propose Patch Diffusion, a generic patch-wise training framework, to significantly reduce the training time costs while improving data efficiency, which thus helps democratize diffusion model training to broader users. At the core of our innovations is a new conditional score function at the patch level, where the patch location in the original image is included as additional coordinate channels, while the patch size is randomized and diversified throughout training to encode the cross-region dependency at multiple scales. Sampling with our method is as easy as in the original diffusion model. Through Patch Diffusion, we could achieve $\mathbf{\ge 2\times}$ faster training, while maintaining comparable or better generation quality. Patch Diffusion meanwhile improves the performance of diffusion models trained on relatively small datasets, $e.g.$, as few as 5,000 images to train from scratch. We achieve outstanding FID scores in line with state-of-the-art benchmarks: 1.77 on CelebA-64$\times$64, 1.93 on AFHQv2-Wild-64$\times$64, and 2.72 on ImageNet-256$\times$256. We share our code and pre-trained models at https://github.com/Zhendong-Wang/Patch-Diffusion. This paper introduces Patch Diffusion, a novel training framework for diffusion models that leverages patch-level training with location and size conditioning to significantly reduce training time and data requirements. Training diffusion models is computationally expensive and data-intensive, limiting accessibility for researchers with limited resources. This work aims to democratize diffusion model training by significantly reducing these costs. The proposed method learns a conditional score function on image patches. It incorporates patch location as additional coordinate channels and employs a stochastic patch size scheduling strategy to capture cross-region dependencies at multiple scales. Patch Diffusion achieves comparable or better image generation quality while reducing training time by at least 50% compared to state-of-the-art diffusion models. The method demonstrates improved data efficiency, achieving superior generation quality on small datasets with limited training images. Patch Diffusion effectively finetunes large-scale pre-trained diffusion models without compromising performance, as shown with ControlNet for image generation. The current coordinate system could be further improved by utilizing more advanced positional embeddings. Theoretical proof of convergence for patch-wise score matching in general cases remains an open question. diffusion models, generative ai, patch-based learning, efficient training, data efficiency
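The patch-level conditioning is the part most easily shown in code: sample a random patch size, crop a patch, and append normalized coordinate channels so the score network knows where the patch came from. The sketch below assumes a particular set of patch sizes and a [-1, 1] coordinate normalization, which are illustrative rather than the paper's exact settings.

```python
import random
import torch

def sample_patch_with_coords(images: torch.Tensor, patch_sizes=(16, 32, 64)):
    """Crop a random patch from each image and append its pixel-coordinate channels.

    images: (B, C, H, W). Returns (B, C + 2, p, p) where the two extra channels
    hold the patch's x / y locations in the original image, normalized to [-1, 1].
    """
    b, c, h, w = images.shape
    p = random.choice(patch_sizes)          # patch size is randomized across training steps
    top = random.randint(0, h - p)
    left = random.randint(0, w - p)
    patch = images[:, :, top:top + p, left:left + p]

    ys = torch.linspace(top, top + p - 1, p, device=images.device) / (h - 1) * 2 - 1
    xs = torch.linspace(left, left + p - 1, p, device=images.device) / (w - 1) * 2 - 1
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    coords = torch.stack([grid_x, grid_y], dim=0).expand(b, -1, -1, -1)
    return torch.cat([patch, coords], dim=1)   # fed to the score network in place of the full image

if __name__ == "__main__":
    x = torch.randn(8, 3, 256, 256)
    print(sample_patch_with_coords(x).shape)   # e.g. torch.Size([8, 5, 32, 32])
```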
2304.12439 Report TextMesh: Generation of Realistic 3D Meshes From Text Prompts Christina Tsalicoglou, Fabian Manhardt, Alessio Tonioni, Michael Niemeyer, Federico Tombari The ability to generate highly realistic 2D images from mere text prompts has recently made huge progress in terms of speed and quality, thanks to the advent of image diffusion models. Naturally, the question arises if this can be also achieved in the generation of 3D content from such text prompts. To this end, a new line of methods recently emerged trying to harness diffusion models, trained on 2D images, for supervision of 3D model generation using view dependent prompts. While achieving impressive results, these methods, however, have two major drawbacks. First, rather than commonly used 3D meshes, they instead generate neural radiance fields (NeRFs), making them impractical for most real applications. Second, these approaches tend to produce over-saturated models, giving the output a cartoonish looking effect. Therefore, in this work we propose a novel method for generation of highly realistic-looking 3D meshes. To this end, we extend NeRF to employ an SDF backbone, leading to improved 3D mesh extraction. In addition, we propose a novel way to finetune the mesh texture, removing the effect of high saturation and improving the details of the output 3D mesh. Proposes TextMesh, a novel method for generating realistic 3D meshes from text prompts, addressing limitations of existing NeRF-based methods that produce oversaturated, non-mesh outputs. Enables creation of photorealistic 3D content directly usable in standard computer graphics pipelines for applications like AR/VR, overcoming drawbacks of existing methods. Modifies DreamFusion to use an SDF backbone for easier mesh extraction and employs a novel multi-view consistent texture refinement using a depth-conditioned diffusion model. Generates 3D meshes with more natural textures than state-of-the-art methods. Demonstrates through user study that the proposed texture refinement significantly improves realism. Shows SDF-based approach leads to smoother meshes compared to NeRF-based methods. Relies on metrics like CLIP R-Precision and FID_CLIP, which may not fully capture 3D consistency and realism. Exploiting temporal consistency for further enhancing texture realism is left for future work. 3d mesh generation, text-to-3d, diffusion models, photorealistic rendering, neural radiance fields
2304.12406 Report AutoFocusFormer: Image Segmentation off the Grid Chen Ziwen, Kaushik Patnaik, Shuangfei Zhai, Alvin Wan, Zhile Ren, Alex Schwing, Alex Colburn, Li Fuxin Real world images often have highly imbalanced content density. Some areas are very uniform, e.g., large patches of blue sky, while other areas are scattered with many small objects. Yet, the commonly used successive grid downsampling strategy in convolutional deep networks treats all areas equally. Hence, small objects are represented in very few spatial locations, leading to worse results in tasks such as segmentation. Intuitively, retaining more pixels representing small objects during downsampling helps to preserve important information. To achieve this, we propose AutoFocusFormer (AFF), a local-attention transformer image recognition backbone, which performs adaptive downsampling by learning to retain the most important pixels for the task. Since adaptive downsampling generates a set of pixels irregularly distributed on the image plane, we abandon the classic grid structure. Instead, we develop a novel point-based local attention block, facilitated by a balanced clustering module and a learnable neighborhood merging module, which yields representations for our point-based versions of state-of-the-art segmentation heads. Experiments show that our AutoFocusFormer (AFF) improves significantly over baseline models of similar sizes. Proposes AutoFocusFormer (AFF), the first end-to-end segmentation network with successive adaptive downsampling stages for more effective segmentation, especially of small objects. Standard convolutional neural networks and existing vision transformer architectures uniformly downsample images, leading to the loss of information crucial for pixel-level tasks like segmentation, particularly for small objects. AFF employs local attention blocks and introduces a novel balanced clustering algorithm for neighborhood definition on irregularly downsampled images. It also includes a novel adaptive downsampling module that learns to retain important pixels based on task relevance. Finally, it adapts state-of-the-art segmentation heads to operate on irregularly spaced tokens generated by the backbone. AFF outperforms Swin Transformers on ImageNet classification across various model sizes, demonstrating the effectiveness of adaptive downsampling. Significant improvement over Swin Transformer baselines in semantic segmentation on ADE20K and instance segmentation on Cityscapes, highlighting AFF's strength in dense prediction tasks. AFF-Tiny achieves comparable performance to Swin-Base, a model 3.3 times larger, on Cityscapes panoptic segmentation, showcasing efficiency and superior performance with limited resources. Regression observed in large object segmentation performance, suggesting room for improvement in decoder aggregation with uneven sampling rates. Limited exploration of class-specific performance correlations, warranting further investigation to understand how AFF's improvements vary across categories. adaptive downsampling, image segmentation, vision transformer, local attention, balanced clustering
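A minimal way to picture the adaptive downsampling is a module that learns a per-token importance score and keeps only the top-scoring tokens together with their image-plane positions, instead of pooling on a fixed grid. The sketch below makes several simplifying assumptions (a plain linear scorer, a fixed keep ratio, weighting retained tokens by their softmax scores so the scorer receives gradients) and omits the balanced clustering and learnable neighborhood merging that the full method relies on.

```python
import torch
import torch.nn as nn

class AdaptiveDownsample(nn.Module):
    """Keep the top-k most important tokens instead of grid pooling (simplified sketch)."""
    def __init__(self, dim: int, keep_ratio: float = 0.25):
        super().__init__()
        self.score = nn.Linear(dim, 1)     # learned per-token importance
        self.keep_ratio = keep_ratio

    def forward(self, tokens: torch.Tensor, positions: torch.Tensor):
        # tokens: (B, N, D); positions: (B, N, 2) pixel coordinates of each token
        b, n, d = tokens.shape
        k = max(1, int(n * self.keep_ratio))
        weights = self.score(tokens).squeeze(-1).softmax(dim=-1)         # (B, N)
        idx = weights.topk(k, dim=-1).indices                            # indices of retained tokens
        kept = torch.gather(tokens * weights.unsqueeze(-1), 1,
                            idx.unsqueeze(-1).expand(-1, -1, d))         # weighting keeps the scorer trainable
        kept_pos = torch.gather(positions, 1, idx.unsqueeze(-1).expand(-1, -1, 2))
        return kept, kept_pos

if __name__ == "__main__":
    ds = AdaptiveDownsample(dim=96)
    t, p = ds(torch.randn(2, 1024, 96), torch.rand(2, 1024, 2) * 224)
    print(t.shape, p.shape)   # (2, 256, 96) (2, 256, 2)
```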
2304.12317 Report Total-Recon: Deformable Scene Reconstruction for Embodied View Synthesis Chonghyuk Song, Gengshan Yang, Kangle Deng, Jun-Yan Zhu, Deva Ramanan We explore the task of embodied view synthesis from monocular videos of deformable scenes. Given a minute-long RGBD video of people interacting with their pets, we render the scene from novel camera trajectories derived from the in-scene motion of actors: (1) egocentric cameras that simulate the point of view of a target actor and (2) 3rd-person cameras that follow the actor. Building such a system requires reconstructing the root-body and articulated motion of every actor, as well as a scene representation that supports free-viewpoint synthesis. Longer videos are more likely to capture the scene from diverse viewpoints (which helps reconstruction) but are also more likely to contain larger motions (which complicates reconstruction). To address these challenges, we present Total-Recon, the first method to photorealistically reconstruct deformable scenes from long monocular RGBD videos. Crucially, to scale to long videos, our method hierarchically decomposes the scene into the background and objects, whose motion is decomposed into carefully initialized root-body motion and local articulations. To quantify such "in-the-wild" reconstruction and view synthesis, we collect ground-truth data from a specialized stereo RGBD capture rig for 11 challenging videos, significantly outperforming prior methods. Our code, model, and data can be found at https://andrewsonga.github.io/totalrecon . This paper presents Total-Recon, a new method for embodied view synthesis from monocular videos of deformable scenes, enabling rendering from novel camera trajectories derived from the motion of actors in the scene. Embodied view synthesis provides highly immersive experiences in gaming and virtual reality by simulating egocentric and 3rd-person-follow camera trajectories, and it has theoretical implications in spatial cognition theory. Total-Recon reconstructs a deformable 3D scene representation by hierarchically decomposing the scene into object-centric neural fields, each encoding appearance, geometry, and motion. It separates object motion into global root-body movement and local articulations, enabling scalability to minute-long videos. Embodied views and 3D video filters are generated by leveraging these reconstructed motions. Total-Recon successfully reconstructs the geometry and appearance of dynamic scenes, including humans and pets interacting, from minute-long monocular RGBD videos. The method outperforms state-of-the-art monocular deformable NeRF methods, even with depth supervision added to the baselines, on a new dataset of 11 challenging stereo RGBD videos. Ablation studies demonstrate the importance of the hierarchical motion decomposition, depth supervision, and object-centric representations for achieving high-quality reconstruction and view synthesis in dynamic scenes. Current reliance on off-the-shelf segmentation models for scene decomposition may lead to inaccurate reconstructions in scenarios with partial occlusions. The model is computationally expensive, requiring around 15 hours of training per sequence on multiple GPUs, limiting its real-time applicability. novel view synthesis, deformable nerfs, embodied view synthesis, rgbd reconstruction, object-centric representations
2304.12308 Report Segment Anything in 3D with Radiance Fields Jiazhong Cen, Jiemin Fang, Zanwei Zhou, Chen Yang, Lingxi Xie, Xiaopeng Zhang, Wei Shen, Qi Tian The Segment Anything Model (SAM) emerges as a powerful vision foundation model to generate high-quality 2D segmentation results. This paper aims to generalize SAM to segment 3D objects. Rather than replicating the data acquisition and annotation procedure which is costly in 3D, we design an efficient solution, leveraging the radiance field as a cheap and off-the-shelf prior that connects multi-view 2D images to the 3D space. We refer to the proposed solution as SA3D, short for Segment Anything in 3D. With SA3D, the user is only required to provide a 2D segmentation prompt (e.g., rough points) for the target object in a single view, which is used to generate its corresponding 2D mask with SAM. Next, SA3D alternately performs mask inverse rendering and cross-view self-prompting across various views to iteratively refine the 3D mask of the target object. For one view, mask inverse rendering projects the 2D mask obtained by SAM into the 3D space with guidance of the density distribution learned by the radiance field for 3D mask refinement; Then, cross-view self-prompting extracts reliable prompts automatically as the input to SAM from the rendered 2D mask of the inaccurate 3D mask for a new view. We show in experiments that SA3D adapts to various scenes and achieves 3D segmentation within seconds. Our research reveals a potential methodology to lift the ability of a 2D segmentation model to 3D. Our code is available at https://github.com/Jumpat/SegmentAnythingin3D. This paper introduces SA3D, a method that leverages radiance fields to extend the 2D Segment Anything Model (SAM) to 3D segmentation. Building a 3D foundation model like SAM from scratch is impractical due to the high cost of acquiring and annotating 3D data. SA3D provides an efficient alternative by combining the power of SAM with off-the-shelf radiance fields. SA3D takes 2D prompts in a single view as input and iteratively refines a 3D mask through two steps: 1) **Mask inverse rendering**, projecting the 2D mask generated by SAM into 3D space guided by the radiance field's density distribution; 2) **Cross-view self-prompting**, rendering the 3D mask in new views and automatically extracting prompts for SAM to generate more complete 2D masks. SA3D achieves state-of-the-art 3D segmentation performance on multiple benchmarks, including NVOS and SPIn-NeRF. The method is efficient, capable of segmenting a 3D object within seconds, especially when combined with 3D Gaussian Splatting (3D-GS). SA3D is compatible with various radiance fields and generalizes to different segmentation tasks like instance and part segmentation. The segmentation performance of SA3D is affected by the quality of the pre-trained radiance field, as demonstrated by the sub-optimal results on the Replica dataset. The current ambiguous Gaussian removal strategy for SA3D-GS has limitations in handling Gaussians that contribute significantly to rendering but also model occluded parts. 3d segmentation, radiance fields, 3d gaussian splatting, segment anything model, foundation models
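The alternation described above can be summarized as a short loop. Everything below that touches SAM, the radiance field, or the voxel mask is passed in as a placeholder callable; these names and signatures are assumptions for illustration, not the released SA3D API.

```python
from typing import Callable, Sequence
import torch

def sa3d_loop(
    views: Sequence[dict],                       # each: {"pose": ..., "image": ...}
    init_prompt: torch.Tensor,                   # user clicks in the first view
    sam_predict: Callable[[torch.Tensor, torch.Tensor], torch.Tensor],            # (image, prompts) -> 2D mask
    render_mask: Callable[[torch.Tensor, dict], torch.Tensor],                    # (3D mask grid, view) -> rendered 2D mask
    inverse_render: Callable[[torch.Tensor, torch.Tensor, dict], torch.Tensor],   # lift a 2D mask into the 3D grid
    extract_prompts: Callable[[torch.Tensor], torch.Tensor],                      # pick confident points from a rendered mask
    mask3d: torch.Tensor,                        # voxel-grid confidence for the target object
) -> torch.Tensor:
    """Alternate mask inverse rendering and cross-view self-prompting (high-level sketch)."""
    # Bootstrap: SAM segments the target in the first view from the user prompt.
    mask2d = sam_predict(views[0]["image"], init_prompt)
    mask3d = inverse_render(mask3d, mask2d, views[0])

    for view in views[1:]:
        rendered = render_mask(mask3d, view)          # current (possibly incomplete) 3D mask seen from this view
        prompts = extract_prompts(rendered)           # self-prompting: reliable points become SAM inputs
        mask2d = sam_predict(view["image"], prompts)  # SAM completes the 2D mask in the new view
        mask3d = inverse_render(mask3d, mask2d, view) # project it back into 3D, guided by the radiance field density
    return mask3d
```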
2304.12160 Report End-to-End Spatio-Temporal Action Localisation with Video Transformers Alexey Gritsenko, Xuehan Xiong, Josip Djolonga, Mostafa Dehghani, Chen Sun, Mario Lučić, Cordelia Schmid, Anurag Arnab The most performant spatio-temporal action localisation models use external person proposals and complex external memory banks. We propose a fully end-to-end, purely-transformer based model that directly ingests an input video, and outputs tubelets -- a sequence of bounding boxes and the action classes at each frame. Our flexible model can be trained with either sparse bounding-box supervision on individual frames, or full tubelet annotations. And in both cases, it predicts coherent tubelets as the output. Moreover, our end-to-end model requires no additional pre-processing in the form of proposals, or post-processing in terms of non-maximal suppression. We perform extensive ablation experiments, and significantly advance the state-of-the-art results on four different spatio-temporal action localisation benchmarks with both sparse keyframes and full tubelet annotations. This paper proposes STAR, a fully end-to-end transformer-based model for spatio-temporal action localisation that directly ingests a video and outputs tubelets (sequences of bounding boxes and action classes). Existing methods rely on external person proposals or complex memory banks, limiting their efficiency and practicality. This paper explores a purely end-to-end approach to address these limitations. The model utilizes a transformer-based vision encoder and a decoder with temporal inductive biases. It can be trained with sparse bounding-box supervision or full tubelet annotations, predicting coherent tubelets in both cases. STAR achieves state-of-the-art results on AVA and AVA-Kinetics (keyframe-based datasets), and UCF101-24 and JHMDB (tubelet-based datasets). The model surpasses previous methods while being end-to-end, not requiring external person detectors or memory banks. Experiments demonstrate that STAR effectively predicts tubelets even with sparse keyframe supervision. The model's performance might be further improved by exploring alternative transformer architectures or training schemes. Evaluation on more diverse and complex datasets is needed to further validate the generalizability of the proposed approach. action localization, spatio-temporal, transformer, end-to-end, tubelet detection
2304.11968 Report Track Anything: Segment Anything Meets Videos Jinyu Yang, Mingqi Gao, Zhe Li, Shang Gao, Fangjing Wang, Feng Zheng Recently, the Segment Anything Model (SAM) gains lots of attention rapidly due to its impressive segmentation performance on images. Regarding its strong ability on image segmentation and high interactivity with different prompts, we found that it performs poorly on consistent segmentation in videos. Therefore, in this report, we propose Track Anything Model (TAM), which achieves high-performance interactive tracking and segmentation in videos. To be detailed, given a video sequence, only with very little human participation, i.e., several clicks, people can track anything they are interested in, and get satisfactory results in one-pass inference. Without additional training, such an interactive design performs impressively on video object tracking and segmentation. All resources are available on https://github.com/gaomingqi/Track-Anything. We hope this work can facilitate related research. This paper proposes Track Anything Model (TAM), achieving high-performance interactive tracking and segmentation in videos through minimal human interaction (few clicks) and one-pass inference. This work addresses limitations of existing video tracking/segmentation methods, including labor-intensive annotation, specific initialization requirements, and poor performance in complex scenarios. TAM integrates Segment Anything Model (SAM) for interactive object initialization and refinement, and XMem for temporal object tracking. Users initialize object selection with clicks, XMem predicts subsequent frames, SAM refines uncertain masks, and users can manually correct errors during inference. TAM achieves competitive J&F scores on DAVIS-2016-val and DAVIS-2017-test-dev datasets with only click initialization and one-pass inference. TAM handles challenging scenarios like multi-object separation, target deformation, scale change, and camera motion effectively. TAM demonstrates potential in various applications, including efficient video annotation, long-term object tracking, user-friendly video editing, and a visualized development toolkit. TAM's performance on long videos still needs improvement, as masks can shrink or lose detail when refinement is not applied. Handling complex object structures with fine-grained details during initialization remains a challenge. interactive tracking, video object segmentation, segment anything model (sam), one-pass inference, human-in-the-loop
2304.11829 Report Hierarchical Diffusion Autoencoders and Disentangled Image Manipulation Zeyu Lu, Chengyue Wu, Xinyuan Chen, Yaohui Wang, Lei Bai, Yu Qiao, Xihui Liu Diffusion models have attained impressive visual quality for image synthesis. However, how to interpret and manipulate the latent space of diffusion models has not been extensively explored. Prior work diffusion autoencoders encode the semantic representations into a semantic latent code, which fails to reflect the rich information of details and the intrinsic feature hierarchy. To mitigate those limitations, we propose Hierarchical Diffusion Autoencoders (HDAE) that exploit the fine-grained-to-abstract and lowlevel-to-high-level feature hierarchy for the latent space of diffusion models. The hierarchical latent space of HDAE inherently encodes different abstract levels of semantics and provides more comprehensive semantic representations. In addition, we propose a truncated-feature-based approach for disentangled image manipulation. We demonstrate the effectiveness of our proposed approach with extensive experiments and applications on image reconstruction, style mixing, controllable interpolation, detail-preserving and disentangled image manipulation, and multi-modal semantic image synthesis. This paper proposes Hierarchical Diffusion Autoencoders (HDAE), which exploit the inherent feature hierarchy in images to achieve a richer and more comprehensive latent space representation for diffusion models. This hierarchical representation enables finer-grained control and disentanglement in image manipulation tasks. A semantically meaningful, editable, and decodable latent space is crucial for interpreting generative models and enabling applications like image editing. Existing diffusion autoencoders lack this fine-grained control and often suffer from information loss, especially in lower-level features. The authors introduce a hierarchical latent space design that encodes features at different scales, capturing both low-level details and high-level semantics. They further propose a truncated-feature-based method for disentangled image manipulation, addressing the entanglement issues common in latent space editing. HDAE achieves near-perfect image reconstruction, surpassing previous GAN inversion, VAE-based, and diffusion-based methods. The hierarchical latent space allows for style mixing, controllable interpolation, and multi-modal semantic image synthesis. The proposed truncated-feature method significantly improves the disentanglement of attributes during image manipulation, enabling editing of specific features without unwanted alterations. The higher dimensionality of hierarchical semantic vectors in HDAE poses a challenge for predicting them using a latent DDIM, requiring techniques like dimensionality reduction. Further investigation into optimizing the trade-off between disentanglement and reconstruction quality is needed. diffusion models, hierarchical latent space, image manipulation, disentanglement, image synthesis
2304.11603 Report LaMD: Latent Motion Diffusion for Video Generation Yaosi Hu, Zhenzhong Chen, Chong Luo Generating coherent and natural movement is the key challenge in video generation. This research proposes to condense video generation into a problem of motion generation, to improve the expressiveness of motion and make video generation more manageable. This can be achieved by breaking down the video generation process into latent motion generation and video reconstruction. We present a latent motion diffusion (LaMD) framework, which consists of a motion-decomposed video autoencoder and a diffusion-based motion generator, to implement this idea. Through careful design, the motion-decomposed video autoencoder can compress patterns in movement into a concise latent motion representation. Meanwhile, the diffusion-based motion generator is able to efficiently generate realistic motion on a continuous latent space under multi-modal conditions, at a cost that is similar to that of image diffusion models. Results show that LaMD generates high-quality videos with a wide range of motions, from stochastic dynamics to highly controllable movements. It achieves new state-of-the-art performance on benchmark datasets, including BAIR, Landscape and CATER-GENs, for Image-to-Video (I2V) and Text-Image-to-Video (TI2V) generation. The source code of LaMD will be made available soon. This paper presents LaMD (Latent Motion Diffusion), a novel framework for video generation that focuses on generating realistic and diverse motion. Generating coherent and natural movement in videos remains a key challenge. LaMD addresses this by decomposing video generation into motion generation and video reconstruction, simplifying the process and improving motion expressiveness. LaMD consists of: (1) MCD-VAE (Motion-Content Decomposed Video Autoencoder): Extracts compressed latent motion representations and reconstructs videos from motion and content features. (2) DMG (Diffusion-based Motion Generator): Generates motion latents conditioned on content features and optional text descriptions using a diffusion model. LaMD generates high-quality videos with realistic and diverse motion, outperforming previous methods on benchmark datasets. MCD-VAE effectively decomposes and compresses motion while preserving reconstruction quality. DMG generates controllable motion guided by both image content and text descriptions. Scaling MCD-VAE to larger and more diverse video datasets could further enhance its performance. Incorporating pre-trained image autoencoders into MCD-VAE could lead to even more compressed representations and reduced computational costs. video generation, motion diffusion, latent space, image-to-video, text-image-to-video
2304.11523 Report TransFlow: Transformer as Flow Learner Yawen Lu, Qifan Wang, Siqi Ma, Tong Geng, Yingjie Victor Chen, Huaijin Chen, Dongfang Liu Optical flow is an indispensable building block for various important computer vision tasks, including motion estimation, object tracking, and disparity measurement. In this work, we propose TransFlow, a pure transformer architecture for optical flow estimation. Compared to dominant CNN-based methods, TransFlow demonstrates three advantages. First, it provides more accurate correlation and trustworthy matching in flow estimation by utilizing spatial self-attention and cross-attention mechanisms between adjacent frames to effectively capture global dependencies; Second, it recovers more compromised information (e.g., occlusion and motion blur) in flow estimation through long-range temporal association in dynamic scenes; Third, it enables a concise self-learning paradigm and effectively eliminate the complex and laborious multi-stage pre-training procedures. We achieve the state-of-the-art results on the Sintel, KITTI-15, as well as several downstream tasks, including video object detection, interpolation and stabilization. For its efficacy, we hope TransFlow could serve as a flexible baseline for optical flow estimation. This paper introduces TransFlow, a novel end-to-end optical flow estimation architecture based entirely on Transformers. The authors aim to address limitations of CNN-based optical flow methods, such as their struggle to model global spatial dependencies and temporal associations, and their reliance on complex pretraining pipelines. TransFlow leverages spatial self-attention and cross-attention mechanisms to capture global dependencies for accurate flow estimation. It also models temporal association across multiple frames using a Transformer encoder. For efficient training, a self-supervised pretraining module is introduced, inspired by MAE, which strategically masks and reconstructs image patches to learn strong pixel representations. TransFlow achieves state-of-the-art results on Sintel and KITTI-15 benchmarks, outperforming previous methods even without the common multi-stage pretraining on synthetic datasets. The proposed self-supervised pretraining strategy proves effective, leading to competitive results with a simplified training pipeline. TransFlow demonstrates strong generalizability and improves performance in downstream tasks like video object detection, interpolation, and stabilization. The impact of varying the number of Transformer blocks and their design choices on the trade-off between accuracy and computational cost needs further exploration. Investigating the effectiveness of TransFlow on more complex real-world scenarios with significant occlusions and challenging lighting conditions is crucial. optical flow estimation, vision transformer, self-supervised learning, spatial-temporal attention, global matching
2304.11463 Report OmniLabel: A Challenging Benchmark for Language-Based Object Detection Samuel Schulter, Vijay Kumar B G, Yumin Suh, Konstantinos M. Dafnis, Zhixing Zhang, Shiyu Zhao, Dimitris Metaxas Language-based object detection is a promising direction towards building a natural interface to describe objects in images that goes far beyond plain category names. While recent methods show great progress in that direction, proper evaluation is lacking. With OmniLabel, we propose a novel task definition, dataset, and evaluation metric. The task subsumes standard- and open-vocabulary detection as well as referring expressions. With more than 28K unique object descriptions on over 25K images, OmniLabel provides a challenging benchmark with diverse and complex object descriptions in a naturally open-vocabulary setting. Moreover, a key differentiation to existing benchmarks is that our object descriptions can refer to one, multiple or even no object, hence, providing negative examples in free-form text. The proposed evaluation handles the large label space and judges performance via a modified average precision metric, which we validate by evaluating strong language-based baselines. OmniLabel indeed provides a challenging test bed for future research on language-based detection. OmniLabel, a novel benchmark for language-based object detection, unifying standard-, open-vocabulary detection and referring expressions. Existing benchmarks lack proper evaluation for language-based object detection with complex descriptions and negative examples. Leveraging existing datasets, the authors define an annotation process for diverse free-form object descriptions, including those referring to multiple or no objects, and propose a modified average precision metric handling a large label space. OmniLabel contains more diverse and complex object descriptions than prior benchmarks (RefCOCO, Flickr30k, PhraseCut). Negative descriptions in OmniLabel pose a significant challenge to current language-based detectors. GLIP and FIBER models achieve the best results on OmniLabel, highlighting its difficulty. The current version of OmniLabel has a different distribution of negative descriptions across different source datasets. Further investigation is needed to understand the impact of the number of categories on negative description verification rates. object detection, language-based vision, benchmark, referring expressions, open-vocabulary
2304.11446 Report Fast Diffusion Probabilistic Model Sampling through the lens of Backward Error Analysis Yansong Gao, Zhihong Pan, Xin Zhou, Le Kang, Pratik Chaudhari Denoising diffusion probabilistic models (DDPMs) are a class of powerful generative models. The past few years have witnessed the great success of DDPMs in generating high-fidelity samples. A significant limitation of the DDPMs is the slow sampling procedure. DDPMs generally need hundreds or thousands of sequential function evaluations (steps) of neural networks to generate a sample. This paper aims to develop a fast sampling method for DDPMs requiring much fewer steps while retaining high sample quality. The inference process of DDPMs approximates solving the corresponding diffusion ordinary differential equations (diffusion ODEs) in the continuous limit. This work analyzes how the backward error affects the diffusion ODEs and the sample quality in DDPMs. We propose fast sampling through the \textbf{Restricting Backward Error schedule (RBE schedule)} based on dynamically moderating the long-time backward error. Our method accelerates DDPMs without any further training. Our experiments show that sampling with an RBE schedule generates high-quality samples within only 8 to 20 function evaluations on various benchmark datasets. We achieved 12.01 FID in 8 function evaluations on the ImageNet $128\times128$, and a $20\times$ speedup compared with previous baseline samplers. This paper introduces a novel fast sampling method for Denoising Diffusion Probabilistic Models (DDPMs) based on dynamically moderating the long-time backward error of the diffusion ODEs. DDPMs, despite their prowess in generating high-fidelity samples, suffer from slow sampling procedures, needing numerous sequential function evaluations. This work aims to alleviate this bottleneck by enabling fast sampling while preserving sample quality. The authors analyze the impact of backward error on diffusion ODEs and sample quality in DDPMs. They propose two methods: 1) Dynamically Restricting the Backward Error (DRBE) schedule, which crafts step size to restrict backward error, and 2) Restricting Backward Error (RBE) schedule, which learns an effective inference schedule by averaging schedules generated from DRBE. Sampling with RBE schedule generates high-quality samples within 8 to 20 function evaluations on benchmarks like ImageNet and LSUN. The method achieved 12.01 FID in 8 function evaluations on ImageNet 128x128, demonstrating significant speedup over baseline samplers. Empirical analysis reveals that RBE schedule lies between linear and cosine noise schedules, approaching the latter as the number of function evaluations increases. The assumption of the original flow being an analytic function in backward error analysis might not always hold true. Future work includes investigating the design of optimal noise schedules inspired by the behavior of RBE schedule. denoising diffusion probabilistic models, fast sampling, backward error analysis, diffusion odes, generative models
2304.11342 Report NaviNeRF: NeRF-based 3D Representation Disentanglement by Latent Semantic Navigation Baao Xie, Bohan Li, Zequn Zhang, Junting Dong, Xin Jin, Jingyu Yang, Wenjun Zeng 3D representation disentanglement aims to identify, decompose, and manipulate the underlying explanatory factors of 3D data, which helps AI fundamentally understand our 3D world. This task is currently under-explored and poses great challenges: (i) the 3D representations are complex and in general contains much more information than 2D image; (ii) many 3D representations are not well suited for gradient-based optimization, let alone disentanglement. To address these challenges, we use NeRF as a differentiable 3D representation, and introduce a self-supervised Navigation to identify interpretable semantic directions in the latent space. To our best knowledge, this novel method, dubbed NaviNeRF, is the first work to achieve fine-grained 3D disentanglement without any priors or supervisions. Specifically, NaviNeRF is built upon the generative NeRF pipeline, and equipped with an Outer Navigation Branch and an Inner Refinement Branch. They are complementary -- the outer navigation is to identify global-view semantic directions, and the inner refinement dedicates to fine-grained attributes. A synergistic loss is further devised to coordinate two branches. Extensive experiments demonstrate that NaviNeRF has a superior fine-grained 3D disentanglement ability than the previous 3D-aware models. Its performance is also comparable to editing-oriented models relying on semantic or geometry priors. NaviNeRF is a novel NeRF-based 3D representation learning method that disentangles fine-grained 3D features without relying on priors or supervision. 3D representation disentanglement is crucial for AI to understand the 3D world, but existing methods often lack interpretability, controllability, and struggle with the complexity of 3D data. The method uses a two-branch approach: an outer navigation branch identifies semantic directions in the latent space by predicting shifts in latent codes, and an inner refinement branch focuses on fine-grained attributes and 3D consistency by applying shifts to specific dimensions of intermediate latent codes. A synergistic loss combines these branches. NaviNeRF achieves fine-grained 3D disentanglement, enabling continuous manipulation of specific attributes like mouth, whiskers, and hair. It outperforms typical 3D-aware GANs (pi-GAN, GIRAFFE, StyleNeRF) in attribute manipulation quality. NaviNeRF shows comparable performance to editing-oriented NeRF models that rely on semantic or geometric priors (FENeRF, CGOF++). The quality of 3D reconstruction depends heavily on the pre-trained generator. Future work could explore unsupervised disentanglement in more complex scenes and incorporate temporal consistency. 3d disentanglement, nerf, generative models, latent semantic navigation, 3d representation learning
2304.11330 Report Self-supervised Learning by View Synthesis Shaoteng Liu, Xiangyu Zhang, Tao Hu, Jiaya Jia We present view-synthesis autoencoders (VSA) in this paper, which is a self-supervised learning framework designed for vision transformers. Different from traditional 2D pretraining methods, VSA can be pre-trained with multi-view data. In each iteration, the input to VSA is one view (or multiple views) of a 3D object and the output is a synthesized image in another target pose. The decoder of VSA has several cross-attention blocks, which use the source view as value, source pose as key, and target pose as query. They achieve cross-attention to synthesize the target view. This simple approach realizes large-angle view synthesis and learns spatial invariant representation, where the latter is decent initialization for transformers on downstream tasks, such as 3D classification on ModelNet40, ShapeNet Core55, and ScanObjectNN. VSA outperforms existing methods significantly for linear probing and is competitive for fine-tuning. The code will be made publicly available. This paper introduces View-Synthesis Autoencoders (VSA), a self-supervised learning framework for vision transformers using multi-view data. Existing self-supervised learning methods for vision transformers do not leverage the inherent 3D geometric relationships present in multi-view data. This paper aims to address this gap and learn spatial-invariant representations by synthesizing novel views from different angles. VSA utilizes an encoder-decoder architecture. The encoder (e.g., ViT) processes a source view, and the decoder uses cross-attention blocks with source view features, source pose, and target pose as input to synthesize a target view. Training is done by minimizing the MSE loss between synthesized and actual target views. VSA successfully synthesizes novel views from single or multiple source views, demonstrating its ability to learn spatial-invariant representations. VSA achieves competitive performance on 3D classification benchmarks (ModelNet40, ShapeNet Core55, ScanObjectNN), outperforming existing methods in linear probing evaluation. Ablation studies reveal the impact of decoder design, view sampling strategy, data augmentation, and masking ratio on VSA performance. The paper primarily focuses on fixed-view scenarios, and exploring dynamic view selection could further enhance performance. While VSA demonstrates strong results, combining it with other self-supervised methods for further improvement requires investigation. self-supervised learning, vision transformer, view synthesis, 3d classification, multi-view data
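The decoder's cross-attention, with the source view as value, source pose as key, and target pose as query, maps directly onto a standard attention call. The pose embedding (a flattened 3x4 camera matrix per token) and the token shapes below are assumptions; the actual decoder stacks several such blocks.

```python
import torch
import torch.nn as nn

class ViewSynthesisCrossAttention(nn.Module):
    """Cross-attention where target-pose tokens query source-view features (sketch)."""
    def __init__(self, dim: int = 384, pose_dim: int = 12, heads: int = 6):
        super().__init__()
        self.pose_embed = nn.Linear(pose_dim, dim)        # flattened 3x4 camera pose -> token
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def forward(self, src_tokens, src_pose, tgt_pose):
        # src_tokens: (B, N, D) encoder features of the source view
        # src_pose, tgt_pose: (B, N, pose_dim) per-token pose encodings (broadcast if shared)
        q = self.pose_embed(tgt_pose)                     # target pose as query
        k = self.pose_embed(src_pose)                     # source pose as key
        v = src_tokens                                    # source view as value
        fused, _ = self.attn(q, k, v, need_weights=False)
        return self.out(fused)                            # decoded tokens for the target view

if __name__ == "__main__":
    m = ViewSynthesisCrossAttention()
    x = m(torch.randn(2, 196, 384), torch.randn(2, 196, 12), torch.randn(2, 196, 12))
    print(x.shape)   # torch.Size([2, 196, 384])
```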
2304.11312 Report Lookahead Diffusion Probabilistic Models for Refining Mean Estimation Guoqiang Zhang, Niwa Kenta, W. Bastiaan Kleijn We propose lookahead diffusion probabilistic models (LA-DPMs) to exploit the correlation in the outputs of the deep neural networks (DNNs) over subsequent timesteps in diffusion probabilistic models (DPMs) to refine the mean estimation of the conditional Gaussian distributions in the backward process. A typical DPM first obtains an estimate of the original data sample $\boldsymbol{x}$ by feeding the most recent state $\boldsymbol{z}_i$ and index $i$ into the DNN model and then computes the mean vector of the conditional Gaussian distribution for $\boldsymbol{z}_{i-1}$. We propose to calculate a more accurate estimate for $\boldsymbol{x}$ by performing extrapolation on the two estimates of $\boldsymbol{x}$ that are obtained by feeding $(\boldsymbol{z}_{i+1},i+1)$ and $(\boldsymbol{z}_{i},i)$ into the DNN model. The extrapolation can be easily integrated into the backward process of existing DPMs by introducing an additional connection over two consecutive timesteps, and fine-tuning is not required. Extensive experiments showed that plugging in the additional connection into DDPM, DDIM, DEIS, S-PNDM, and high-order DPM-Solvers leads to a significant performance gain in terms of FID score. The paper proposes Lookahead Diffusion Probabilistic Models (LA-DPMs) that refine mean estimation of conditional Gaussian distributions during the backward process of DPMs by exploiting correlations in DNN outputs over consecutive timesteps. LA-DPMs aim to improve sampling quality, especially with a limited computational budget (fewer timesteps), which is crucial for practical applications. The method introduces an extrapolation operation on two recent estimates of the data sample obtained at timesteps *i* and *i+1*. This refines the data sample estimation at timestep *i* and consequently the mean estimation for the latent variable at timestep *i-1*. This is achieved by adding connections between two consecutive timesteps in the backward process. LA-DPMs, requiring no fine-tuning, significantly improve FID scores compared to original DPMs, particularly for a small number of timesteps. The performance gain is observed across various DPM models like DDPM, DDIM, DEIS, S-PNDM and DPM-Solver. The computational overhead introduced by the extrapolation operation is negligible. The optimal extrapolation strength (parameter *λ*) might vary across different timesteps and datasets. Future work could involve training a separate DNN to learn optimal *λ* values. diffusion probabilistic models, generative models, deep learning, image generation, sampling efficiency
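The refinement itself is a one-line extrapolation of the two most recent estimates of the clean sample, as in the sketch below. The extrapolation strength `lam` and exactly where the refined estimate is plugged into the sampler follow the paper and are not reproduced here.

```python
import torch

def lookahead_x0(x0_curr: torch.Tensor, x0_prev: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    """Extrapolate the two latest estimates of the clean sample x0 (LA-DPM sketch).

    x0_curr: estimate obtained by feeding (z_i, i) to the denoiser;
    x0_prev: estimate obtained one step earlier from (z_{i+1}, i+1).
    The refined estimate is then used in place of x0_curr when computing the
    posterior mean for z_{i-1}; no re-training or extra network evaluation is needed.
    """
    return x0_curr + lam * (x0_curr - x0_prev)
```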
2304.11267 Report Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations Yu-Hui Chen, Raman Sarokin, Juhyun Lee, Jiuqiang Tang, Chuo-Ling Chang, Andrei Kulik, Matthias Grundmann The rapid development and application of foundation models have revolutionized the field of artificial intelligence. Large diffusion models have gained significant attention for their ability to generate photorealistic images and support various tasks. On-device deployment of these models provides benefits such as lower server costs, offline functionality, and improved user privacy. However, common large diffusion models have over 1 billion parameters and pose challenges due to restricted computational and memory resources on devices. We present a series of implementation optimizations for large diffusion models that achieve the fastest reported inference latency to-date (under 12 seconds for Stable Diffusion 1.4 without int8 quantization on Samsung S23 Ultra for a 512x512 image with 20 iterations) on GPU-equipped mobile devices. These enhancements broaden the applicability of generative AI and improve the overall user experience across a wide range of devices. This paper introduces a set of GPU-aware optimizations for large diffusion models, specifically targeting on-device deployment. On-device deployment of large diffusion models (e.g., Stable Diffusion) is crucial for reducing server costs, enhancing privacy, and enabling offline functionality, but is challenged by limited computational and memory resources on mobile devices. The authors implement several optimizations: 1) specialized kernels for Group Normalization and GELU activation, 2) enhanced attention module efficiency via partially fused softmax and FlashAttention, and 3) strategic use of Winograd convolution. Achieved state-of-the-art inference latency for Stable Diffusion 1.4 on mobile GPUs (under 12 seconds for a 512x512 image with 20 iterations on Samsung S23 Ultra). Significantly reduced latency compared to baseline implementation on both Samsung S23 Ultra (-52.2%) and iPhone 14 Pro Max (-32.9%). Optimized memory usage for intermediate tensors and model weights. The paper focuses on Stable Diffusion; applicability of these optimizations to other diffusion models needs further investigation. The trade-off between Winograd convolution's computational efficiency and increased memory consumption requires careful consideration. diffusion models, on-device ai, gpu optimization, stable diffusion, latency reduction
2304.11113 Report Implicit Neural Head Synthesis via Controllable Local Deformation Fields Chuhan Chen, Matthew O'Toole, Gaurav Bharaj, Pablo Garrido High-quality reconstruction of controllable 3D head avatars from 2D videos is highly desirable for virtual human applications in movies, games, and telepresence. Neural implicit fields provide a powerful representation to model 3D head avatars with personalized shape, expressions, and facial parts, e.g., hair and mouth interior, that go beyond the linear 3D morphable model (3DMM). However, existing methods do not model faces with fine-scale facial features, or local control of facial parts that extrapolate asymmetric expressions from monocular videos. Further, most condition only on 3DMM parameters with poor(er) locality, and resolve local features with a global neural field. We build on part-based implicit shape models that decompose a global deformation field into local ones. Our novel formulation models multiple implicit deformation fields with local semantic rig-like control via 3DMM-based parameters, and representative facial landmarks. Further, we propose a local control loss and attention mask mechanism that promote sparsity of each learned deformation field. Our formulation renders sharper locally controllable nonlinear deformations than previous implicit monocular approaches, especially mouth interior, asymmetric expressions, and facial details. This paper presents a novel approach to modeling fine-grained facial details and non-linear local deformations in human face rigs using neural radiance fields (NeRFs), surpassing the limitations of linear 3DMMs and global deformation models. Existing methods for reconstructing controllable head models from 2D videos often lack the ability to represent fine-scale facial details and local control, especially for asymmetric expressions. This paper addresses these limitations. The method decomposes the global deformation field into multiple local fields, each centered around a pre-defined facial landmark. An attention mask filters redundant expression parameters for each local field, and a novel local control loss enforces locality and consistency. The sum of local deformations is weakly supervised by a 3DMM mesh prior. The approach reconstructs facial details, like wrinkles and mouth interiors, more accurately than previous methods. It enables fine-scale control of facial expressions, including asymmetric expressions, exceeding the capabilities of linear 3DMMs. The method achieves state-of-the-art performance on perceptual metrics for radiance image quality. The reconstruction quality degrades for extreme pose and expression variations, indicating limitations in generalization. Non-facial parts, such as shoulders, are not explicitly modeled, leading to potential artifacts in those regions. neural radiance fields, 3d face reconstruction, local deformation fields, facial expression control, monocular video
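A simplified picture of the decomposition is a distance-weighted sum of landmark-centered local deformation MLPs, as sketched below. The Gaussian weighting, the number of landmarks, and the shared expression code are assumptions; the paper additionally filters expression parameters per field with attention masks and weakly supervises the summed deformation with a 3DMM prior.

```python
import torch
import torch.nn as nn

class LocalDeformationFields(nn.Module):
    """Sum of landmark-centered local deformation fields (simplified sketch)."""
    def __init__(self, num_landmarks: int = 16, cond_dim: int = 32, sigma: float = 0.1):
        super().__init__()
        self.landmarks = nn.Parameter(torch.rand(num_landmarks, 3) * 2 - 1)  # canonical 3D anchors
        self.sigma = sigma
        self.fields = nn.ModuleList([
            nn.Sequential(nn.Linear(3 + cond_dim, 128), nn.ReLU(), nn.Linear(128, 3))
            for _ in range(num_landmarks)
        ])

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (N, 3) query points; cond: (N, cond_dim) expression code (shared across fields here for brevity)
        d2 = ((x.unsqueeze(1) - self.landmarks.unsqueeze(0)) ** 2).sum(-1)    # (N, K) squared distances
        w = torch.exp(-d2 / (2 * self.sigma ** 2))                            # soft locality around each landmark
        offsets = torch.stack([f(torch.cat([x, cond], dim=-1)) for f in self.fields], dim=1)  # (N, K, 3)
        return x + (w.unsqueeze(-1) * offsets).sum(dim=1)                     # deformed query points

if __name__ == "__main__":
    deform = LocalDeformationFields()
    pts = torch.rand(1024, 3) * 2 - 1
    print(deform(pts, torch.randn(1024, 32)).shape)   # torch.Size([1024, 3])
```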
2304.10535 Report Farm3D: Learning Articulated 3D Animals by Distilling 2D Diffusion Tomas Jakab, Ruining Li, Shangzhe Wu, Christian Rupprecht, Andrea Vedaldi We present Farm3D, a method for learning category-specific 3D reconstructors for articulated objects, relying solely on "free" virtual supervision from a pre-trained 2D diffusion-based image generator. Recent approaches can learn a monocular network that predicts the 3D shape, albedo, illumination, and viewpoint of any object occurrence, given a collection of single-view images of an object category. However, these approaches heavily rely on manually curated clean training data, which are expensive to obtain. We propose a framework that uses an image generator, such as Stable Diffusion, to generate synthetic training data that are sufficiently clean and do not require further manual curation, enabling the learning of such a reconstruction network from scratch. Additionally, we incorporate the diffusion model as a score to enhance the learning process. The idea involves randomizing certain aspects of the reconstruction, such as viewpoint and illumination, generating virtual views of the reconstructed 3D object, and allowing the 2D network to assess the quality of the resulting image, thus providing feedback to the reconstructor. Unlike work based on distillation, which produces a single 3D asset for each textual prompt, our approach yields a monocular reconstruction network capable of outputting a controllable 3D asset from any given image, whether real or generated, in a single forward pass in a matter of seconds. Our network can be used for analysis, including monocular reconstruction, or for synthesis, generating articulated assets for real-time applications such as video games. This paper presents FARM3D, a method to learn articulated 3D models of object categories (e.g., cows, horses) solely from synthetic data generated by a pre-trained 2D image generator (Stable Diffusion) and without any manual data curation. Existing methods for learning 3D models from real images require extensive manual curation of training data, which is time-consuming and limits scalability. FARM3D replaces real training images with synthetic ones generated by prompting Stable Diffusion with category-specific text. It further leverages Stable Diffusion as a critic during training, providing virtual multi-view supervision via a modified Score Distillation Sampling (SDS) loss. FARM3D achieves comparable 3D reconstruction quality to state-of-the-art methods trained on curated real datasets, despite using only synthetic data. The method generalizes to real images and enables controllable 3D synthesis by manipulating shape, appearance, and articulation. A new synthetic 3D animal dataset (Animodel) is introduced for benchmarking single-view articulated 3D reconstruction. The current method is limited to a single object category. Assumptions about object topology (e.g., 4 legs) are made. 3d reconstruction, diffusion models, synthetic data, articulated objects, stable diffusion
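The "diffusion model as critic" feedback is commonly implemented as a Score Distillation Sampling surrogate loss whose gradient with respect to the rendered pixels equals w(t)(predicted noise minus injected noise). The sketch below assumes a generic epsilon-prediction callable and a simple weighting choice; it is not Farm3D's released code, which adapts the objective for its articulated monocular reconstructor.

```python
import torch

def sds_loss(rendered: torch.Tensor,
             denoiser,                       # callable: (noisy_image, t, prompt) -> predicted noise (assumed interface)
             prompt,                         # text embedding / prompt handle expected by the denoiser
             alphas_cumprod: torch.Tensor    # (T,) diffusion schedule of the frozen 2D model
             ) -> torch.Tensor:
    """Score Distillation Sampling surrogate loss on a differentiably rendered view (sketch)."""
    b = rendered.shape[0]
    t = torch.randint(20, len(alphas_cumprod) - 20, (b,), device=rendered.device)
    a = alphas_cumprod[t].view(b, 1, 1, 1)
    noise = torch.randn_like(rendered)
    noisy = a.sqrt() * rendered + (1 - a).sqrt() * noise      # forward-diffuse the rendering

    with torch.no_grad():                                     # the 2D critic stays frozen
        pred_noise = denoiser(noisy, t, prompt)

    w = 1 - a                                                 # a common weighting choice
    grad = w * (pred_noise - noise)                           # SDS gradient w.r.t. the rendered pixels
    # Surrogate whose autograd gradient w.r.t. `rendered` equals `grad`:
    return (grad.detach() * rendered).sum() / b
```

Backpropagating this loss through the differentiable renderer then updates the reconstructor's shape, albedo, and articulation parameters.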
2304.10530 Report Collaborative Diffusion for Multi-Modal Face Generation and Editing Ziqi Huang, Kelvin C. K. Chan, Yuming Jiang, Ziwei Liu Diffusion models arise as a powerful generative tool recently. Despite the great progress, existing diffusion models mainly focus on uni-modal control, i.e., the diffusion process is driven by only one modality of condition. To further unleash the users' creativity, it is desirable for the model to be controllable by multiple modalities simultaneously, e.g., generating and editing faces by describing the age (text-driven) while drawing the face shape (mask-driven). In this work, we present Collaborative Diffusion, where pre-trained uni-modal diffusion models collaborate to achieve multi-modal face generation and editing without re-training. Our key insight is that diffusion models driven by different modalities are inherently complementary regarding the latent denoising steps, where bilateral connections can be established upon. Specifically, we propose dynamic diffuser, a meta-network that adaptively hallucinates multi-modal denoising steps by predicting the spatial-temporal influence functions for each pre-trained uni-modal model. Collaborative Diffusion not only collaborates generation capabilities from uni-modal diffusion models, but also integrates multiple uni-modal manipulations to perform multi-modal editing. Extensive qualitative and quantitative experiments demonstrate the superiority of our framework in both image quality and condition consistency. This paper presents Collaborative Diffusion, a novel framework that leverages pre-trained uni-modal diffusion models for multi-modal face generation and editing without retraining. Existing diffusion models for face generation and editing primarily focus on uni-modal control, limiting the ability to manipulate multiple aspects simultaneously. Collaborative Diffusion addresses this limitation by enabling multi-modal control, thereby unlocking greater creative possibilities for users. The core of Collaborative Diffusion is the dynamic diffuser, a meta-network that predicts spatial-temporal influence functions for each pre-trained uni-modal model. These functions determine the extent of each model's contribution at every denoising step, allowing for seamless integration of multiple modalities. Collaborative Diffusion achieves superior image quality and condition consistency compared to existing multi-modal face generation methods like TediGAN and Composable Diffusion. The dynamic diffuser's ability to adapt both spatially and temporally is crucial for effective collaboration between uni-modal models. The framework's flexibility is demonstrated by extending it to multi-modal face editing with minimal modifications, showcasing its potential for various applications. The performance of Collaborative Diffusion is limited by the capabilities of the pre-trained uni-modal models, suggesting that training on larger datasets could further enhance results. The potential for malicious use of the technology, such as manipulating real human faces, raises ethical concerns that need to be addressed. diffusion models, multi-modal generation, face generation, face editing, dynamic diffuser
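One way to picture the collaboration step is a spatially varying blend of the noise predictions from the frozen uni-modal denoisers, weighted by influence maps that small "dynamic diffuser" networks predict from the current latent and timestep. The architecture, the softmax normalization across modalities, and the callable signatures below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DynamicDiffuser(nn.Module):
    """Predicts a per-pixel influence map for one uni-modal denoiser (sketch)."""
    def __init__(self, in_ch: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch + 1, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Broadcast the (normalized) timestep as an extra channel so influence can vary over time.
        t_map = t.float().view(-1, 1, 1, 1).expand(-1, 1, *x_t.shape[-2:]) / 1000.0
        return self.net(torch.cat([x_t, t_map], dim=1))          # (B, 1, H, W) unnormalized influence

def collaborative_step(x_t, t, denoisers, conditions, diffusers):
    """Blend uni-modal noise predictions with softmax-normalized influence maps (sketch)."""
    with torch.no_grad():   # the pre-trained uni-modal diffusion models are not re-trained
        eps = torch.stack([f(x_t, t, c) for f, c in zip(denoisers, conditions)], dim=0)  # (M, B, C, H, W)
    logits = torch.stack([d(x_t, t) for d in diffusers], dim=0)                          # (M, B, 1, H, W)
    influence = logits.softmax(dim=0)                       # competition across modalities, per pixel and step
    return (influence * eps).sum(dim=0)                     # combined noise prediction for this denoising step
```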
2304.10520 Report Contrastive Tuning: A Little Help to Make Masked Autoencoders Forget Johannes Lehner, Benedikt Alkin, Andreas Fürst, Elisabeth Rumetshofer, Lukas Miklautz, Sepp Hochreiter Masked Image Modeling (MIM) methods, like Masked Autoencoders (MAE), efficiently learn a rich representation of the input. However, for adapting to downstream tasks, they require a sufficient amount of labeled data since their rich features code not only objects but also less relevant image background. In contrast, Instance Discrimination (ID) methods focus on objects. In this work, we study how to combine the efficiency and scalability of MIM with the ability of ID to perform downstream classification in the absence of large amounts of labeled data. To this end, we introduce Masked Autoencoder Contrastive Tuning (MAE-CT), a sequential approach that utilizes the implicit clustering of the Nearest Neighbor Contrastive Learning (NNCLR) objective to induce abstraction in the topmost layers of a pre-trained MAE. MAE-CT tunes the rich features such that they form semantic clusters of objects without using any labels. Notably, MAE-CT does not rely on hand-crafted augmentations and frequently achieves its best performances while using only minimal augmentations (crop & flip). Further, MAE-CT is compute efficient as it requires at most 10% overhead compared to MAE re-training. Applied to large and huge Vision Transformer (ViT) models, MAE-CT excels over previous self-supervised methods trained on ImageNet in linear probing, k-NN and low-shot classification accuracy as well as in unsupervised clustering accuracy. With ViT-H/16 MAE-CT achieves a new state-of-the-art in linear probing of 82.2%. This paper proposes MAE-CT, a sequential approach that combines the strengths of Masked Image Modeling (MIM) and Instance Discrimination (ID) by contrastively tuning a pre-trained Masked Autoencoder (MAE) to induce abstraction and form semantic clusters in its representation. MIM methods like MAE are computationally efficient but lack abstraction, while ID methods excel in low-shot learning but heavily rely on augmentations. Combining both addresses these limitations to achieve label efficiency and computational efficiency. The methodology involves three steps: 1) MAE pre-training, 2) initializing a Nearest Neighbor Contrastive Learning (NNCLR) head on top of the frozen MAE encoder, and 3) contrastive tuning, where the upper layers of the encoder and the NNCLR head are trained with layer-wise learning rate decay. MAE-CT significantly outperforms previous self-supervised methods in linear probing, k-NN, and low-shot classification on ImageNet, achieving state-of-the-art linear probing accuracy of 82.2% with ViT-H/16. The method demonstrates superior label efficiency compared to state-of-the-art ID methods, especially with larger models, and achieves competitive performance even with minimal augmentations. Analysis reveals that MAE-CT effectively forms object-specific clusters, as evidenced by improved cluster accuracy and silhouette scores, confirming its ability to induce abstraction in MAE representations. The reliance on a pre-trained MAE model could limit the applicability to other MIM methods. Future work could explore the use of alternative contrastive learning methods or cluster-based objectives. self-supervised learning, masked image modeling, instance discrimination, contrastive learning, vision transformer
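The NNCLR objective that drives the contrastive-tuning stage replaces each sample's own embedding with its nearest neighbor from a support queue before computing an InfoNCE loss. The sketch below assumes normalized embeddings, an in-memory queue tensor, and a fixed temperature; queue updates and the layer-wise learning-rate decay used during tuning are omitted.

```python
import torch
import torch.nn.functional as F

def nnclr_loss(z1: torch.Tensor, z2: torch.Tensor, queue: torch.Tensor, temperature: float = 0.1):
    """NNCLR-style loss: the queue nearest neighbor of view 1 serves as the positive for view 2.

    z1, z2: (B, D) projected embeddings of two augmented views; queue: (Q, D) past embeddings.
    """
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    queue = F.normalize(queue, dim=-1)

    nn_idx = (z1 @ queue.t()).argmax(dim=-1)        # nearest neighbor of each z1 in the support set
    positives = queue[nn_idx]                       # (B, D); the queue holds stored (non-trainable) embeddings

    logits = positives @ z2.t() / temperature       # (B, B): row i should match column i
    labels = torch.arange(z1.shape[0], device=z1.device)
    return F.cross_entropy(logits, labels)

if __name__ == "__main__":
    loss = nnclr_loss(torch.randn(32, 256), torch.randn(32, 256), torch.randn(4096, 256))
    print(float(loss))
```

During contrastive tuning, only the topmost encoder layers and the projection/prediction heads would receive meaningful updates, per the layer-wise learning-rate decay described above.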
2304.10448 Report ReLight My NeRF: A Dataset for Novel View Synthesis and Relighting of Real World Objects Marco Toschi, Riccardo De Matteo, Riccardo Spezialetti, Daniele De Gregorio, Luigi Di Stefano, Samuele Salti In this paper, we focus on the problem of rendering novel views from a Neural Radiance Field (NeRF) under unobserved light conditions. To this end, we introduce a novel dataset, dubbed ReNe (Relighting NeRF), framing real world objects under one-light-at-time (OLAT) conditions, annotated with accurate ground-truth camera and light poses. Our acquisition pipeline leverages two robotic arms holding, respectively, a camera and an omni-directional point-wise light source. We release a total of 20 scenes depicting a variety of objects with complex geometry and challenging materials. Each scene includes 2000 images, acquired from 50 different points of views under 40 different OLAT conditions. By leveraging the dataset, we perform an ablation study on the relighting capability of variants of the vanilla NeRF architecture and identify a lightweight architecture that can render novel views of an object under novel light conditions, which we use to establish a non-trivial baseline for the dataset. Dataset and benchmark are available at https://eyecan-ai.github.io/rene. Introduces ReNe, a novel dataset for novel view synthesis and relighting of real-world objects under one-light-at-time (OLAT) conditions, and benchmarks lightweight modifications to NeRF for relighting. Existing relighting datasets lack diversity, real-world capture, and ground truth light information, hindering research on relighting with NeRF. Developed a dual-robot capture system to collect images of various objects under OLAT illumination with precise camera and light pose annotations. Conducted an ablation study on NeRF architectures modified to incorporate light information. ReNe dataset comprises 20 scenes, each with 2000 images captured from 50 viewpoints under 40 OLAT conditions. Feeding relative light position and employing a separate visibility network (V5) significantly improves NeRF's relighting capabilities. The proposed V5 architecture establishes a strong baseline for the ReNe benchmark, outperforming standard NeRF. Dataset is limited to frontal views due to the trajectory setup of the robotic arms. OLAT illumination, while offering control, is not representative of complex real-world lighting. novel view synthesis, relighting, neural radiance fields, dataset, benchmark
2304.10406 Report LiDAR-NeRF: Novel LiDAR View Synthesis via Neural Radiance Fields Tang Tao, Longfei Gao, Guangrun Wang, Yixing Lao, Peng Chen, Hengshuang Zhao, Dayang Hao, Xiaodan Liang, Mathieu Salzmann, Kaicheng Yu We introduce a new task, novel view synthesis for LiDAR sensors. While traditional model-based LiDAR simulators with style-transfer neural networks can be applied to render novel views, they fall short of producing accurate and realistic LiDAR patterns because the renderers rely on explicit 3D reconstruction and exploit game engines, that ignore important attributes of LiDAR points. We address this challenge by formulating, to the best of our knowledge, the first differentiable end-to-end LiDAR rendering framework, LiDAR-NeRF, leveraging a neural radiance field (NeRF) to facilitate the joint learning of geometry and the attributes of 3D points. However, simply employing NeRF cannot achieve satisfactory results, as it only focuses on learning individual pixels while ignoring local information, especially at low texture areas, resulting in poor geometry. To this end, we have taken steps to address this issue by introducing a structural regularization method to preserve local structural details. To evaluate the effectiveness of our approach, we establish an object-centric multi-view LiDAR dataset, dubbed NeRF-MVL. It contains observations of objects from 9 categories seen from 360-degree viewpoints captured with multiple LiDAR sensors. Our extensive experiments on the scene-level KITTI-360 dataset, and on our object-level NeRF-MVL show that our LiDAR-NeRF surpasses the model-based algorithms significantly. This paper introduces a novel differentiable rendering framework, LiDAR-NeRF, for novel view synthesis of LiDAR data. Existing methods for generating new LiDAR point clouds rely on explicit 3D reconstruction and game engines, resulting in inaccurate and unrealistic LiDAR patterns. Novel LiDAR view synthesis is crucial for applications like autonomous driving. The method leverages a neural radiance field (NeRF) to facilitate the joint learning of geometry and attributes of 3D points. It addresses the limitation of traditional NeRFs by introducing a structural regularization method to preserve local structural details, improving geometry accuracy. LiDAR-NeRF surpasses model-based algorithms on both scene-level (KITTI-360 dataset) and object-level (newly introduced multi-view LiDAR dataset) benchmarks. The method effectively encodes 3D information and multiple attributes, including ray-drop probability, leading to more realistic LiDAR patterns. LiDAR-NeRF enables scene editing by fusing novel objects into existing scenes with realistic occlusion effects. Limited to static scenes and requires per-scene optimization due to reliance on the NeRF formalism. Current work focuses on synthesizing LiDAR data only; future work will explore joint rendering of LiDAR and images. novel view synthesis, lidar, neural radiance fields, differentiable rendering, 3d point cloud
2304.10263 Report PREIM3D: 3D Consistent Precise Image Attribute Editing from a Single Image Jianhui Li, Jianmin Li, Haoji Zhang, Shilong Liu, Zhengyi Wang, Zihao Xiao, Kaiwen Zheng, Jun Zhu We study the 3D-aware image attribute editing problem in this paper, which has wide applications in practice. Recent methods solved the problem by training a shared encoder to map images into a 3D generator's latent space or by per-image latent code optimization and then edited images in the latent space. Despite their promising results near the input view, they still suffer from the 3D inconsistency of produced images at large camera poses and imprecise image attribute editing, like affecting unspecified attributes during editing. For more efficient image inversion, we train a shared encoder for all images. To alleviate 3D inconsistency at large camera poses, we propose two novel methods, an alternating training scheme and a multi-view identity loss, to maintain 3D consistency and subject identity. As for imprecise image editing, we attribute the problem to the gap between the latent space of real images and that of generated images. We compare the latent space and inversion manifold of GAN models and demonstrate that editing in the inversion manifold can achieve better results in both quantitative and qualitative evaluations. Extensive experiments show that our method produces more 3D consistent images and achieves more precise image editing than previous work. Source code and pretrained models can be found on our project page: https://mybabyyh.github.io/Preim3D/ Presents PREIM3D, a pipeline for efficient and precise 3D-aware image attribute editing from a single image, by training a shared encoder to map real images to the latent space of a 3D GAN and performing manipulations in an "inversion manifold". Addresses limitations of previous 3D GAN inversion techniques, which struggle to maintain 3D consistency at large camera poses and often lack editing precision, causing unintended attribute changes. Trains a 3D consistent encoder using an alternating training scheme with in-domain and out-domain images and a multi-view identity loss. Editing is then performed in a learned "inversion manifold" to improve precision and minimize distortion between desired and generated attributes. Achieves superior 3D consistency, particularly at large camera poses, compared to optimization-based and hybrid methods like IDE-3D and 3D-Inv. Demonstrates more precise attribute editing with less impact on unrelated attributes, outperforming baselines in quantitative metrics like AA and AD. Significantly faster inference time compared to optimization-based methods, making it suitable for interactive applications. Faces difficulty in reconstructing uncommon or highly detailed features (e.g., intricate earrings, unique hairstyles). Reliance on the generator's capacity to capture real-world details limits the reconstruction fidelity in some cases. 3d gan inversion, image attribute editing, neural radiance fields, inversion manifold, 3d consistency
2304.10261 Report Anything-3D: Towards Single-view Anything Reconstruction in the Wild Qiuhong Shen, Xingyi Yang, Xinchao Wang 3D reconstruction from a single-RGB image in unconstrained real-world scenarios presents numerous challenges due to the inherent diversity and complexity of objects and environments. In this paper, we introduce Anything-3D, a methodical framework that ingeniously combines a series of visual-language models and the Segment-Anything object segmentation model to elevate objects to 3D, yielding a reliable and versatile system for single-view conditioned 3D reconstruction task. Our approach employs a BLIP model to generate textural descriptions, utilizes the Segment-Anything model for the effective extraction of objects of interest, and leverages a text-to-image diffusion model to lift object into a neural radiance field. Demonstrating its ability to produce accurate and detailed 3D reconstructions for a wide array of objects, Anything-3D shows promise in addressing the limitations of existing methodologies. Through comprehensive experiments and evaluations on various datasets, we showcase the merits of our approach, underscoring its potential to contribute meaningfully to the field of 3D reconstruction. Demos and code will be available at https://github.com/Anything-of-anything/Anything-3D. Introduces Anything-3D, a framework that leverages visual-language models and object segmentation (SAM) to reconstruct 3D objects from single-view images in uncontrolled environments. Addresses the challenging problem of single-image 3D reconstruction in the wild, which has significant implications for robotics, VR/AR, and 3D printing. Combines BLIP for image description generation, SAM for object segmentation, and a text-to-image diffusion model with score distillation to train a neural radiance field for 3D reconstruction. Demonstrates successful 3D reconstruction of objects from challenging real-world images with varying lighting, occlusion, and viewpoints. Shows proficiency in reconstructing irregularly-shaped objects and small objects in cluttered scenes. Highlights the potential for using foundation models for 3D content creation from limited data. Current reconstruction quality requires further refinement. Lacks quantitative evaluation on 3D datasets; future work will include novel view synthesis and reconstruction error assessments. 3d reconstruction, single-view reconstruction, visual-language models, object segmentation, diffusion models
2304.10250 Report Revisiting Implicit Neural Representations in Low-Level Vision Wentian Xu, Jianbo Jiao Implicit Neural Representation (INR) has been emerging in computer vision in recent years. It has been shown to be effective in parameterising continuous signals such as dense 3D models from discrete image data, e.g. the neural radiance field (NeRF). However, INR is under-explored in 2D image processing tasks. Considering the basic definition and the structure of INR, we are interested in its effectiveness in low-level vision problems such as image restoration. In this work, we revisit INR and investigate its application in low-level image restoration tasks including image denoising, super-resolution, inpainting, and deblurring. Extensive experimental evaluations suggest the superior performance of INR in several low-level vision tasks with limited resources, outperforming its counterparts by over 2dB. Code and models are available at https://github.com/WenTXuL/LINR This paper explores the application of Implicit Neural Representation (INR) in low-level image restoration tasks for single-image restoration without requiring additional training data. INR's effectiveness in 3D deep learning tasks, particularly in representing continuous signals from discrete data, motivates its exploration for 2D image restoration. The study uses a lightweight INR (LINR) model based on SIREN, training it on corrupted images with task-specific loss functions for denoising, super-resolution, inpainting, and deblurring. LINR outperforms competing methods on benchmark datasets with limited resources, achieving superior PSNR and SSIM scores. The study demonstrates that INR-based methods can benefit from joint training with multiple corruptions, further enhancing their performance. LINR showcases promising results on real-world noisy images, suggesting its practical applicability. Denoising with LINR, similar to DIP, might be prone to overfitting, necessitating further research on optimal stopping criteria. Future work could explore architectural modifications and training strategies to enhance LINR's efficiency for higher-resolution images. image restoration, implicit neural representation, single image restoration, zero-shot learning, low-level vision
2304.10224 Report Multi-view Vision-Prompt Fusion Network: Can 2D Pre-trained Model Boost 3D Point Cloud Data-scarce Learning? Haoyang Peng, Baopu Li, Bo Zhang, Xin Chen, Tao Chen, Hongyuan Zhu Point cloud based 3D deep models have wide applications in areas such as autonomous driving, household robots, and so on. Inspired by the recent prompt learning in natural language processing, this work proposes a novel Multi-view Vision-Prompt Fusion Network (MvNet) for few-shot 3D point cloud classification. MvNet investigates the possibility of leveraging the off-the-shelf 2D pre-trained models to achieve the few-shot classification, which can alleviate the over-dependence of existing baseline models on large-scale annotated 3D point cloud data. Specifically, MvNet first encodes a 3D point cloud into multi-view image features for a number of different views. Then, a novel multi-view prompt fusion module is developed to effectively fuse information from different views to bridge the gap between 3D point cloud data and 2D pre-trained models. A set of 2D image prompts can then be derived to better describe the suitable prior knowledge for a large-scale pre-trained image model for few-shot 3D point cloud classification. Extensive experiments on ModelNet, ScanObjectNN, and ShapeNet datasets demonstrate that MvNet achieves new state-of-the-art performance for few-shot 3D point cloud classification. The source code of this work will be available soon. A novel Multi-view Vision-Prompt Fusion Network (MvNet) is proposed for few-shot 3D point cloud classification by leveraging off-the-shelf 2D pre-trained models. Existing deep learning-based 3D point cloud classification methods require large amounts of data, while pre-trained 3D models suffer from domain gaps between datasets. Leveraging pre-trained 2D models can alleviate these issues. MvNet encodes a 3D point cloud into multi-view image features. A multi-view prompt fusion module fuses information from different views to derive 2D image prompts. These prompts are fed to a pre-trained image model for classification. MvNet achieves state-of-the-art performance for 3D few-shot point cloud classification on ModelNet and ScanObjectNN. MvNet significantly improves few-shot classification performance on ScanObjectNN compared to previous methods, especially under low-shot settings. Increasing the number of views and using both attention and convolution fusion modules effectively improve the model's performance. The model's optimization process requires a large memory footprint. Future work includes exploring methods to reduce memory consumption. 3d point cloud classification, few-shot learning, prompt learning, multi-view fusion, transfer learning
2304.10168 Report High-Fidelity and Freely Controllable Talking Head Video Generation Yue Gao, Yuan Zhou, Jinglu Wang, Xiao Li, Xiang Ming, Yan Lu Talking head generation is to generate video based on a given source identity and target motion. However, current methods face several challenges that limit the quality and controllability of the generated videos. First, the generated face often has unexpected deformation and severe distortions. Second, the driving image does not explicitly disentangle movement-relevant information, such as poses and expressions, which restricts the manipulation of different attributes during generation. Third, the generated videos tend to have flickering artifacts due to the inconsistency of the extracted landmarks between adjacent frames. In this paper, we propose a novel model that produces high-fidelity talking head videos with free control over head pose and expression. Our method leverages both self-supervised learned landmarks and 3D face model-based landmarks to model the motion. We also introduce a novel motion-aware multi-scale feature alignment module to effectively transfer the motion without face distortion. Furthermore, we enhance the smoothness of the synthesized talking head videos with a feature context adaptation and propagation module. We evaluate our model on challenging datasets and demonstrate its state-of-the-art performance. This paper presents PECHead, a novel method for generating high-fidelity talking head videos with control over head pose and expression. Current talking head generation methods suffer from limitations such as face distortion, limited controllability, and flickering artifacts. This work aims to address these challenges. PECHead leverages both self-supervised learned landmarks and 3D face model-based landmarks to model motion. It employs a motion-aware multi-scale feature alignment module and a context adaptation and propagation module to enhance video quality and smoothness. PECHead significantly outperforms existing methods in same-identity video reconstruction, exhibiting superior facial shape preservation and expression transfer. The method demonstrates state-of-the-art performance in cross-identity face reenactment, achieving high identity preservation and video quality while minimizing pose and expression errors. PECHead enables precise control over head pose and facial expressions, surpassing baseline methods in frontalization and expression transfer tasks. The current method focuses on head and face regions, and further research is needed to incorporate full-body motions. The model currently relies on high-quality input images, and extending it to handle lower-quality inputs is an area for future work. talking head generation, face reenactment, motion transfer, deep learning, computer vision
2304.10080 Report NeUDF: Learning Neural Unsigned Distance Fields with Volume Rendering Yu-Tao Liu, Li Wang, Jie Yang, Weikai Chen, Xiaoxu Meng, Bo Yang, Lin Gao Multi-view shape reconstruction has achieved impressive progress thanks to the latest advances in neural implicit surface rendering. However, existing methods based on signed distance function (SDF) are limited to closed surfaces, failing to reconstruct a wide range of real-world objects that contain open-surface structures. In this work, we introduce a new neural rendering framework, coded NeUDF, that can reconstruct surfaces with arbitrary topologies solely from multi-view supervision. To gain the flexibility of representing arbitrary surfaces, NeUDF leverages the unsigned distance function (UDF) as surface representation. While a naive extension of an SDF-based neural renderer cannot scale to UDF, we propose two new formulations of weight function specially tailored for UDF-based volume rendering. Furthermore, to cope with open surface rendering, where the in/out test is no longer valid, we present a dedicated normal regularization strategy to resolve the surface orientation ambiguity. We extensively evaluate our method over a number of challenging datasets, including DTU, MGN, and Deep Fashion 3D. Experimental results demonstrate that NeUDF can significantly outperform the state-of-the-art method in the task of multi-view surface reconstruction, especially for complex shapes with open boundaries. NeUDF: the first UDF-based neural volume rendering framework for multi-view reconstruction of shapes with arbitrary topologies. Existing SDF or occupancy-based neural rendering methods are limited to closed surfaces, failing to reconstruct open surfaces which are commonly seen in the real world. NeUDF leverages the unsigned distance function (UDF) as surface representation and introduces two specially-tailored weight functions for UDF-based volume rendering and point sampling. To solve the surface orientation ambiguity, NeUDF employs a dedicated normal regularization strategy. NeUDF significantly outperforms the state-of-the-art methods in open surface reconstruction on DF3D and MGN datasets. NeUDF achieves comparable performance in watertight surface reconstruction on DTU dataset. NeUDF can reconstruct complex open surfaces such as plants, clothes, and hollow structures with high fidelity. NeUDF cannot reconstruct transparent surfaces well. Severely occluded parts are challenging for reconstruction. multi-view reconstruction, neural rendering, open surface, unsigned distance function, implicit representation
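A generic sketch of turning per-sample unsigned distances into volume rendering weights along a ray, to make the NeUDF entry above more concrete. The simple logistic density kernel below is an assumption for illustration only; the paper derives its own UDF-tailored weight functions rather than this naive mapping.

```python
# Naive UDF-to-weight mapping for volume rendering along a single ray.
import torch

def udf_to_weights(udf: torch.Tensor, deltas: torch.Tensor, beta: float = 50.0):
    """udf: (N,) unsigned distances at samples along one ray (near to far);
    deltas: (N,) distances between consecutive samples."""
    density = beta * torch.sigmoid(-beta * udf)           # high density near udf ~ 0
    alpha = 1.0 - torch.exp(-density * deltas)            # per-sample opacity
    trans = torch.cumprod(torch.cat([alpha.new_ones(1), 1.0 - alpha[:-1]]), dim=0)
    return trans * alpha                                   # rendering weights

if __name__ == "__main__":
    udf = torch.linspace(0.5, 0.0, 64).abs()               # ray approaching a surface
    deltas = torch.full((64,), 1.0 / 64)
    w = udf_to_weights(udf, deltas)
    print(w.sum().item(), w.argmax().item())               # mass concentrates near the surface
```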
2304.09987 Report Tetra-NeRF: Representing Neural Radiance Fields Using Tetrahedra Jonas Kulhanek, Torsten Sattler Neural Radiance Fields (NeRFs) are a very recent and very popular approach for the problems of novel view synthesis and 3D reconstruction. A popular scene representation used by NeRFs is to combine a uniform, voxel-based subdivision of the scene with an MLP. Based on the observation that a (sparse) point cloud of the scene is often available, this paper proposes to use an adaptive representation based on tetrahedra obtained by Delaunay triangulation instead of uniform subdivision or point-based representations. We show that such a representation enables efficient training and leads to state-of-the-art results. Our approach elegantly combines concepts from 3D geometry processing, triangle-based rendering, and modern neural radiance fields. Compared to voxel-based representations, ours provides more detail around parts of the scene likely to be close to the surface. Compared to point-based representations, our approach achieves better performance. The source code is publicly available at: https://jkulhanek.com/tetra-nerf. Presents Tetra-NeRF, a novel neural radiance field representation that leverages Delaunay triangulation of a point cloud to create an adaptive tetrahedra-based scene representation. Addresses limitations of uniform voxel grids and point-based representations for neural radiance fields by providing higher resolution near surfaces and enabling efficient training. Constructs a tetrahedra field from a point cloud, uses barycentric interpolation to query features stored at tetrahedra vertices, and employs a shallow MLP to predict density and color for volume rendering. Outperforms Point-NeRF, a state-of-the-art point-based method, in rendering quality. Achieves comparable results to state-of-the-art MLP-based methods like Mip-NeRF. Demonstrates the effectiveness of adaptive tetrahedra representation over dense grid representation with similar parameter count. Rendering quality can be affected by the density of the input point cloud, especially in regions with sparse points. Current implementation has a limit on the number of intersected tetrahedra per ray, which can impact performance in complex scenes. Future work includes exploring adaptive refinement and pruning of the tetrahedralisation and exploiting surface proximity to triangles. neural radiance fields, tetrahedra, delaunay triangulation, volume rendering, novel view synthesis
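A minimal sketch of the barycentric feature interpolation idea in the Tetra-NeRF entry above: features stored at the four vertices of a tetrahedron are blended with the query point's barycentric coordinates before being passed to a shallow MLP. Vertex layout and feature sizes are illustrative assumptions.

```python
# Barycentric interpolation of per-vertex features inside one tetrahedron.
import torch

def barycentric_coords(p: torch.Tensor, verts: torch.Tensor) -> torch.Tensor:
    """p: (3,), verts: (4, 3). Returns the 4 barycentric weights of p."""
    T = (verts[1:] - verts[0]).t()              # 3x3 edge matrix
    w123 = torch.linalg.solve(T, p - verts[0])
    return torch.cat([1.0 - w123.sum(dim=0, keepdim=True), w123])

def interpolate_features(p, verts, vert_feats):
    """vert_feats: (4, F) features stored at the tetrahedron vertices."""
    w = barycentric_coords(p, verts)            # (4,)
    return w @ vert_feats                        # (F,)

if __name__ == "__main__":
    verts = torch.tensor([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
    feats = torch.randn(4, 16)
    query = torch.tensor([0.25, 0.25, 0.25])
    print(interpolate_features(query, verts, feats).shape)  # torch.Size([16])
```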
2304.09748 Report Reference-based Image Composition with Sketch via Structure-aware Diffusion Model Kangyeol Kim, Sunghyun Park, Junsoo Lee, Jaegul Choo Recent remarkable improvements in large-scale text-to-image generative models have shown promising results in generating high-fidelity images. To further enhance editability and enable fine-grained generation, we introduce a multi-input-conditioned image composition model that incorporates a sketch as a novel modal, alongside a reference image. Thanks to the edge-level controllability using sketches, our method enables a user to edit or complete an image sub-part with a desired structure (i.e., sketch) and content (i.e., reference image). Our framework fine-tunes a pre-trained diffusion model to complete missing regions using the reference image while maintaining sketch guidance. Albeit simple, this leads to wide opportunities to fulfill user needs for obtaining the in-demand images. Through extensive experiments, we demonstrate that our proposed method offers unique use cases for image manipulation, enabling user-driven modifications of arbitrary scenes. Introduces a multi-input-conditioned image composition model for cartoons that incorporates a sketch and a reference image. Enhances editability of large-scale text-to-image generative models by allowing edge-level controllability and fine-grained generation. Fine-tunes a pre-trained diffusion model to complete missing regions using a reference image while adhering to sketch guidance. A 'sketch schedule' strategy is introduced to adjust the influence of the sketch during inference. Model successfully generates and manipulates targeted regions based on user-provided sketches and reference images. Demonstrates superior performance compared to baselines using only reference images or text-sketch pairs. Offers practical applications for background scene editing, object shape editing, and object changes in cartoons. Exploration of a more user-centric system for seamless interaction is needed. Developing a highly intuitive tool incorporating the model is planned for future work. image composition, sketch-guided generation, diffusion models, cartoon editing, multi-input conditioning
2304.09677 Report Reference-guided Controllable Inpainting of Neural Radiance Fields Ashkan Mirzaei, Tristan Aumentado-Armstrong, Marcus A. Brubaker, Jonathan Kelly, Alex Levinshtein, Konstantinos G. Derpanis, Igor Gilitschenski The popularity of Neural Radiance Fields (NeRFs) for view synthesis has led to a desire for NeRF editing tools. Here, we focus on inpainting regions in a view-consistent and controllable manner. In addition to the typical NeRF inputs and masks delineating the unwanted region in each view, we require only a single inpainted view of the scene, i.e., a reference view. We use monocular depth estimators to back-project the inpainted view to the correct 3D positions. Then, via a novel rendering technique, a bilateral solver can construct view-dependent effects in non-reference views, making the inpainted region appear consistent from any view. For non-reference disoccluded regions, which cannot be supervised by the single reference view, we devise a method based on image inpainters to guide both the geometry and appearance. Our approach shows superior performance to NeRF inpainting baselines, with the additional advantage that a user can control the generated scene via a single inpainted image. Project page: https://ashmrz.github.io/reference-guided-3d Presents a reference-guided method for inpainting regions of a NeRF in a view-consistent and controllable manner, requiring only a single inpainted reference view in addition to the usual NeRF inputs and per-view masks. NeRF editing tools are increasingly in demand as NeRFs gain popularity for view synthesis, and existing inpainting approaches give users little control over the content filled into the removed region. Back-projects the inpainted reference view to the correct 3D positions using monocular depth estimation, uses a novel rendering technique with a bilateral solver to construct view-dependent effects in non-reference views, and guides the geometry and appearance of non-reference disoccluded regions with image inpainters. Shows superior performance to NeRF inpainting baselines. Enables the user to control the generated scene via a single inpainted reference image. Produces inpainted regions that appear consistent from any view. neural radiance fields, inpainting, novel view synthesis, reference-guided editing, view consistency
2304.09479 Report DiFaReli: Diffusion Face Relighting Puntawat Ponglertnapakorn, Nontawat Tritrong, Supasorn Suwajanakorn We present a novel approach to single-view face relighting in the wild. Handling non-diffuse effects, such as global illumination or cast shadows, has long been a challenge in face relighting. Prior work often assumes Lambertian surfaces, simplified lighting models or involves estimating 3D shape, albedo, or a shadow map. This estimation, however, is error-prone and requires many training examples with lighting ground truth to generalize well. Our work bypasses the need for accurate estimation of intrinsic components and can be trained solely on 2D images without any light stage data, multi-view images, or lighting ground truth. Our key idea is to leverage a conditional diffusion implicit model (DDIM) for decoding a disentangled light encoding along with other encodings related to 3D shape and facial identity inferred from off-the-shelf estimators. We also propose a novel conditioning technique that eases the modeling of the complex interaction between light and geometry by using a rendered shading reference to spatially modulate the DDIM. We achieve state-of-the-art performance on standard benchmark Multi-PIE and can photorealistically relight in-the-wild images. Please visit our page: https://diffusion-face-relighting.github.io Presents DiFaReli, a novel diffusion-based face relighting framework that generates photorealistic shading without needing precise intrinsic decomposition or 3D and lighting ground truth, trained exclusively on 2D images. Face relighting is crucial for various applications like AR and portrait photography. Existing methods struggle with non-diffuse effects and rely heavily on accurate estimations of intrinsic components, often requiring extensive training data. Leverages a conditional DDIM to decode disentangled light encoding, along with encodings for shape and identity, inferred from off-the-shelf estimators. Introduces a novel conditioning technique using a rendered shading reference for spatial modulation of the DDIM, easing the modeling of complex light-geometry interactions. Achieves state-of-the-art performance on the Multi-PIE dataset, outperforming existing methods. Demonstrates the capability to realistically add, remove, or adjust the intensity of cast shadows in images. Produces high-fidelity relighting results on in-the-wild images, effectively handling challenging cases with complex lighting. Cast shadow rendering may not always be physically accurate, with room for improvement in temporal consistency for video applications. Relighting accuracy can be affected by limitations of light estimators and inherent ambiguities in distinguishing skin tone from lighting conditions. face relighting, diffusion models, conditional image synthesis, spatial modulation, deep learning
2304.09463 Report HyperStyle3D: Text-Guided 3D Portrait Stylization via Hypernetworks Zhuo Chen, Xudong Xu, Yichao Yan, Ye Pan, Wenhan Zhu, Wayne Wu, Bo Dai, Xiaokang Yang Portrait stylization is a long-standing task enabling extensive applications. Although 2D-based methods have made great progress in recent years, real-world applications such as metaverse and games often demand 3D content. On the other hand, the requirement of 3D data, which is costly to acquire, significantly impedes the development of 3D portrait stylization methods. In this paper, inspired by the success of 3D-aware GANs that bridge 2D and 3D domains with 3D fields as the intermediate representation for rendering 2D images, we propose a novel method, dubbed HyperStyle3D, based on 3D-aware GANs for 3D portrait stylization. At the core of our method is a hyper-network learned to manipulate the parameters of the generator in a single forward pass. It not only offers a strong capacity to handle multiple styles with a single model, but also enables flexible fine-grained stylization that affects only texture, shape, or local part of the portrait. While the use of 3D-aware GANs bypasses the requirement of 3D data, we further alleviate the necessity of style images with the CLIP model being the stylization guidance. We conduct an extensive set of experiments across the style, attribute, and shape, and meanwhile, measure the 3D consistency. These experiments demonstrate the superior capability of our HyperStyle3D model in rendering 3D-consistent images in diverse styles, deforming the face shape, and editing various attributes. HyperStyle3D, a novel text-driven 3D portrait stylization method based on 3D-aware GANs and hyper-networks, enabling style transfer, attribute editing, and shape deformation. Addresses the limitations of 2D stylization methods (lack of 3D consistency and shape deformation) and 3D methods (reliance on expensive 3D data and single-style limitations). Utilizes a hyper-network to predict parameter offsets for a pre-trained 3D-aware GAN generator, guided by text prompts processed by CLIP. The hyper-network is split into three parts to handle shape, attribute, and style manipulations separately. Achieves high-quality style transfer across diverse styles, surpassing baseline methods in qualitative and user study evaluations. Maintains 3D consistency, exhibiting comparable depth consistency to the original 3D-aware GAN and even superior facial identity consistency after manipulation. Enables disentangled multi-level manipulation (shape, attribute, style) by leveraging different layer groups in the hyper-network, with controllable degrees of manipulation through coefficients. Limited to portrait stylization and may not generalize well to other object categories. Relies on a pre-trained 3D-aware GAN, which can limit the range of achievable styles and shapes. 3d portrait stylization, text-guided image manipulation, hyper-networks, 3d-aware gans, clip
2304.09423 Report ASM: Adaptive Skinning Model for High-Quality 3D Face Modeling Kai Yang, Hong Shang, Tianyang Shi, Xinghan Chen, Jingkai Zhou, Zhongqian Sun, Wei Yang The research fields of parametric face model and 3D face reconstruction have been extensively studied. However, a critical question remains unanswered: how to tailor the face model for specific reconstruction settings. We argue that reconstruction with multi-view uncalibrated images demands a new model with stronger capacity. Our study shifts attention from data-dependent 3D Morphable Models (3DMM) to an understudied human-designed skinning model. We propose Adaptive Skinning Model (ASM), which redefines the skinning model with more compact and fully tunable parameters. With extensive experiments, we demonstrate that ASM achieves significantly improved capacity than 3DMM, with the additional advantage of model size and easy implementation for new topology. We achieve state-of-the-art performance with ASM for multi-view reconstruction on the Florence MICC Coop benchmark. Our quantitative analysis demonstrates the importance of a high-capacity model for fully exploiting abundant information from multi-view input in reconstruction. Furthermore, our model with physical-semantic parameters can be directly utilized for real-world applications, such as in-game avatar creation. As a result, our work opens up new research direction for parametric face model and facilitates future research on multi-view reconstruction. This paper introduces ASM, a novel parametric face model based on an adaptive skinning approach, designed for high-quality 3D face modeling from multi-view uncalibrated images. Existing methods struggle to balance reconstruction accuracy and generalization ability, especially for the middle-end scenario of multi-view uncalibrated images. ASM aims to bridge this gap by offering high capacity and flexibility. ASM redefines skinning weights using Gaussian Mixture Models (GMM) and introduces dynamic bone binding. This allows for joint optimization of skinning weights, bone positions, and transformations, leading to increased model capacity. ASM outperforms state-of-the-art parametric face models, including 3DMMs and static skinning models, in terms of representation capacity. It achieves state-of-the-art performance for multi-view reconstruction on the Florence MICC Coop benchmark. ASM is lightweight, easy to implement with new topologies, and offers semantically meaningful parameters for applications like avatar creation. The paper acknowledges limitations in handling facial hair, which can cause artifacts in high-capacity models like ASM. Future work includes exploring identity and expression decoupling within ASM's skinning parameters. 3d face reconstruction, parametric face model, skinning model, gaussian mixture model, multi-view reconstruction
2304.09148 Report SAM Fails to Segment Anything? -- SAM-Adapter: Adapting SAM in Underperformed Scenes: Camouflage, Shadow, Medical Image Segmentation, and More Tianrun Chen, Lanyun Zhu, Chaotao Ding, Runlong Cao, Yan Wang, Zejian Li, Lingyun Sun, Papa Mao, Ying Zang The emergence of large models, also known as foundation models, has brought significant advancements to AI research. One such model is Segment Anything (SAM), which is designed for image segmentation tasks. However, as with other foundation models, our experimental findings suggest that SAM may fail or perform poorly in certain segmentation tasks, such as shadow detection and camouflaged object detection (concealed object detection). This study first paves the way for applying the large pre-trained image segmentation model SAM to these downstream tasks, even in situations where SAM performs poorly. Rather than fine-tuning the SAM network, we propose SAM-Adapter, which incorporates domain-specific information or visual prompts into the segmentation network by using simple yet effective adapters. By integrating task-specific knowledge with general knowledge learnt by the large model, SAM-Adapter can significantly elevate the performance of SAM in challenging tasks as shown in extensive experiments. We can even outperform task-specific network models and achieve state-of-the-art performance in the tasks we tested: camouflaged object detection and shadow detection. We also tested polyp segmentation (medical image segmentation) and achieved better results. We believe our work opens up opportunities for utilizing SAM in downstream tasks, with potential applications in various fields, including medical image processing, agriculture, remote sensing, and more. This paper presents SAM-Adapter, a novel method for adapting the Segment Anything (SAM) model to downstream tasks by incorporating domain-specific information via adapters. While SAM demonstrates impressive general image segmentation capabilities, it may perform poorly on specific tasks. This work addresses the crucial challenge of leveraging the knowledge acquired by large pre-trained models like SAM for enhanced performance in downstream tasks. SAM-Adapter utilizes SAM as the backbone network and injects task-specific knowledge through lightweight adapters. These adapters, consisting of MLPs, process domain-specific features and generate prompts to guide SAM's segmentation process. SAM-Adapter significantly enhances SAM's performance on challenging tasks like camouflaged object detection and shadow detection, surpassing existing state-of-the-art methods. The method demonstrates flexibility in incorporating various forms of task-specific information, including texture features, frequency information, and hand-crafted rules. Experiments on diverse datasets, including COD10K, CHAMELEON, CAMO, and ISTD, consistently show significant performance improvements with SAM-Adapter. The current work focuses on two specific downstream tasks, further exploration of SAM-Adapter's capabilities on a wider range of tasks is necessary. Future research could investigate more specialized adapter designs tailored for specific tasks to further improve performance. image segmentation, foundation models, transfer learning, prompt engineering, camouflaged object detection, shadow detection
2304.08870 Report UPGPT: Universal Diffusion Model for Person Image Generation, Editing and Pose Transfer Soon Yau Cheong, Armin Mustafa, Andrew Gilbert Text-to-image models (T2I) such as StableDiffusion have been used to generate high quality images of people. However, due to the random nature of the generation process, the person has a different appearance e.g. pose, face, and clothing, despite using the same text prompt. The appearance inconsistency makes T2I unsuitable for pose transfer. We address this by proposing a multimodal diffusion model that accepts text, pose, and visual prompting. Our model is the first unified method to perform all person image tasks - generation, pose transfer, and mask-less edit. We also pioneer using small dimensional 3D body model parameters directly to demonstrate new capability - simultaneous pose and camera view interpolation while maintaining the person's appearance. This paper presents UPGPT, a novel unified diffusion model for person image generation, editing, and pose transfer. It leverages text, pose, and visual prompts to achieve fine-grained control over image synthesis. Existing methods for person image generation and editing are limited in their ability to perform multiple tasks effectively. This paper addresses the need for a single, flexible framework that can generate, edit, and transfer person images with high fidelity. The authors propose a multimodal diffusion model that disentangles person images into content (pose, context text) and style (style text, image features). This allows for independent manipulation of these elements during image sampling. The model uses a combination of SMPL pose parameters, CLIP image embeddings, and LLM text embeddings to condition the diffusion process. UPGPT achieves state-of-the-art results on both text-pose guided image generation and pose transfer tasks. The method demonstrates fine-grained control over image editing, enabling users to modify clothing texture, shape, and appearance using text or reference images. The use of SMPL parameters allows for novel capabilities like simultaneous pose and camera view interpolation. One limitation is the potential for blurry faces in generated images, particularly at lower resolutions. Future work could explore improving the fidelity of fine-grained texture transfer, potentially through enhanced image encoding techniques. diffusion models, image generation, pose transfer, image editing, multimodal learning
2304.08818 Report Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, Karsten Kreis Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands by training a diffusion model in a compressed lower-dimensional latent space. Here, we apply the LDM paradigm to high-resolution video generation, a particularly resource-intensive task. We first pre-train an LDM on images only; then, we turn the image generator into a video generator by introducing a temporal dimension to the latent space diffusion model and fine-tuning on encoded image sequences, i.e., videos. Similarly, we temporally align diffusion model upsamplers, turning them into temporally consistent video super resolution models. We focus on two relevant real-world applications: Simulation of in-the-wild driving data and creative content creation with text-to-video modeling. In particular, we validate our Video LDM on real driving videos of resolution 512 x 1024, achieving state-of-the-art performance. Furthermore, our approach can easily leverage off-the-shelf pre-trained image LDMs, as we only need to train a temporal alignment model in that case. Doing so, we turn the publicly available, state-of-the-art text-to-image LDM Stable Diffusion into an efficient and expressive text-to-video model with resolution up to 1280 x 2048. We show that the temporal layers trained in this way generalize to different fine-tuned text-to-image LDMs. Utilizing this property, we show the first results for personalized text-to-video generation, opening exciting directions for future content creation. Project page: https://research.nvidia.com/labs/toronto-ai/VideoLDM/ The authors propose Video LDM, an efficient approach for training high-resolution, long-term consistent video generation models based on Latent Diffusion Models (LDMs). Video modeling has lagged behind image modeling due to the high computational cost associated with training on video data and the lack of large-scale video datasets. This work aims to address this gap by enabling efficient high-resolution video generation. The authors extend image LDMs to video generation by introducing a temporal dimension to the latent space diffusion model and fine-tuning on encoded image sequences. They also temporally align diffusion model upsamplers, turning them into temporally consistent video super resolution models. They focus on two applications: simulation of driving data and text-to-video modeling. Video LDM achieves state-of-the-art performance on real driving videos of resolution 512x1024 and can generate videos of several minutes length. By temporally fine-tuning Stable Diffusion, the authors create an efficient and expressive text-to-video model with resolution up to 1280x2048. The learned temporal layers can be transferred to different fine-tuned text-to-image LDMs, enabling personalized text-to-video generation. Synthesized videos are not yet indistinguishable from real content. The model, trained on internet data, is not suitable for productization due to ethical concerns. video generation, diffusion models, latent diffusion models, text-to-video, high-resolution
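An illustrative sketch of the temporal-alignment idea in the Video LDM entry above: frames are folded into the batch for the frozen spatial layers and unfolded so a trainable 1D temporal layer can mix information across time. The residual gate `alpha`, the use of a temporal convolution, and the layer sizes are assumptions for this sketch, not the paper's exact architecture.

```python
# Inserting a trainable temporal mixing layer between frozen spatial blocks.
import torch
import torch.nn as nn

class TemporalMixer(nn.Module):
    def __init__(self, channels: int, num_frames: int):
        super().__init__()
        self.num_frames = num_frames
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.alpha = nn.Parameter(torch.zeros(1))    # start as identity

    def forward(self, x):                             # x: (B*T, C, H, W)
        bt, c, h, w = x.shape
        b = bt // self.num_frames
        z = x.view(b, self.num_frames, c, h, w).permute(0, 3, 4, 2, 1)  # (B, H, W, C, T)
        z = z.reshape(b * h * w, c, self.num_frames)
        z = self.conv(z).reshape(b, h, w, c, self.num_frames)
        z = z.permute(0, 4, 3, 1, 2).reshape(bt, c, h, w)
        return x + torch.tanh(self.alpha) * z         # gated residual mix

if __name__ == "__main__":
    frames = torch.randn(2 * 8, 64, 16, 16)           # 2 videos of 8 frames each
    print(TemporalMixer(64, num_frames=8)(frames).shape)
```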
2304.08483 Report Text2Performer: Text-Driven Human Video Generation Yuming Jiang, Shuai Yang, Tong Liang Koh, Wayne Wu, Chen Change Loy, Ziwei Liu Text-driven content creation has evolved to be a transformative technique that revolutionizes creativity. Here we study the task of text-driven human video generation, where a video sequence is synthesized from texts describing the appearance and motions of a target performer. Compared to general text-driven video generation, human-centric video generation requires maintaining the appearance of synthesized human while performing complex motions. In this work, we present Text2Performer to generate vivid human videos with articulated motions from texts. Text2Performer has two novel designs: 1) decomposed human representation and 2) diffusion-based motion sampler. First, we decompose the VQVAE latent space into human appearance and pose representation in an unsupervised manner by utilizing the nature of human videos. In this way, the appearance is well maintained along the generated frames. Then, we propose continuous VQ-diffuser to sample a sequence of pose embeddings. Unlike existing VQ-based methods that operate in the discrete space, continuous VQ-diffuser directly outputs the continuous pose embeddings for better motion modeling. Finally, motion-aware masking strategy is designed to mask the pose embeddings spatial-temporally to enhance the temporal coherence. Moreover, to facilitate the task of text-driven human video generation, we contribute a Fashion-Text2Video dataset with manually annotated action labels and text descriptions. Extensive experiments demonstrate that Text2Performer generates high-quality human videos (up to 512x256 resolution) with diverse appearances and flexible motions. This paper presents Text2Performer, a novel framework for generating human videos from text descriptions of appearance and motions. The task is important because it addresses the limitations of general text-to-video models, which struggle to generate plausible human videos with consistent appearances and complex motions. Text2Performer decomposes the VQVAE latent space into appearance and pose representations, and utilizes a continuous VQ-diffuser to sample pose sequences. A motion-aware masking strategy is also employed to enhance temporal coherence. Text2Performer outperforms baselines on FID, FVD, and KVD metrics, indicating superior video quality and diversity. The decomposed VQ-space and continuous VQ-diffuser enable Text2Performer to maintain consistent human identities across frames. User studies confirm that Text2Performer generates videos that are more consistent with text descriptions and exhibit better overall quality. Text2Performer is trained on videos with clean backgrounds, limiting its applicability to more complex scenes. The generated videos exhibit a bias towards female models in dresses due to limitations in the training dataset. video generation, text-to-video, human video synthesis, vqvae, diffusion models
2304.08480 Report DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training Yihao Chen, Xianbiao Qi, Jianan Wang, Lei Zhang We propose DisCo-CLIP, a distributed memory-efficient CLIP training approach, to reduce the memory consumption of contrastive loss when training contrastive learning models. Our approach decomposes the contrastive loss and its gradient computation into two parts, one to calculate the intra-GPU gradients and the other to compute the inter-GPU gradients. According to our decomposition, only the intra-GPU gradients are computed on the current GPU, while the inter-GPU gradients are collected via all_reduce from other GPUs instead of being repeatedly computed on every GPU. In this way, we can reduce the GPU memory consumption of contrastive loss computation from O(B^2) to O(B^2/N), where B and N are the batch size and the number of GPUs used for training. Such a distributed solution is mathematically equivalent to the original non-distributed contrastive loss computation, without sacrificing any computation accuracy. It is particularly efficient for large-batch CLIP training. For instance, DisCo-CLIP can enable contrastive training of a ViT-B/32 model with a batch size of 32K or 196K using 8 or 64 A100 40GB GPUs, compared with the original CLIP solution which requires 128 A100 40GB GPUs to train a ViT-B/32 model with a batch size of 32K. The code will be released at https://github.com/IDEA-Research/DisCo-CLIP This paper proposes DisCo-CLIP, a distributed memory-efficient approach for training CLIP models, which significantly reduces the memory consumption of contrastive loss computation during training, enabling the use of larger batch sizes without sacrificing accuracy. Large batch sizes are crucial for effective contrastive learning in CLIP, but memory constraints limit the achievable batch size, hindering research, especially in resource-constrained settings. DisCo-CLIP decomposes the contrastive loss and its gradient computation into intra-GPU and inter-GPU components. It calculates intra-GPU gradients locally and collects inter-GPU gradients via all_reduce operations, reducing memory consumption from O(B^2) to O(B^2/N), where B is the batch size and N is the number of GPUs. DisCo-CLIP achieves the same accuracy as the original CLIP with significantly reduced memory consumption and faster training times. Using DisCo-CLIP, researchers can train a ViT-B/32 model with a batch size of 196K on 64 A100 40GB GPUs, compared to the original CLIP's limitation of 32K batch size. Larger batch sizes enabled by DisCo-CLIP further improve the performance of contrastive learning models, leading to higher zero-shot classification accuracy on various datasets. The paper primarily evaluates DisCo-CLIP on ViT-B/32 due to resource constraints, leaving the investigation of larger backbones for future work. The impact of the extra all_reduce operation on training speed could be further analyzed, especially in different network environments. contrastive learning, clip, vision-language representation learning, distributed training, memory efficiency
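A single-process sketch of the memory decomposition described in the DisCo-CLIP entry above: each "rank" only materialises its local (B/N) x B block of the similarity matrix instead of the full B x B matrix, which is where the O(B^2/N) memory figure comes from. The all_reduce gradient bookkeeping is omitted; this only demonstrates the loss decomposition, with the world size simulated in a loop.

```python
# Per-rank contrastive loss over a (B/N) x B logit block, simulated in one process.
import torch
import torch.nn.functional as F

def local_clip_loss(img_all, txt_all, rank, world_size, temperature=0.07):
    B = img_all.shape[0]
    chunk = B // world_size
    sl = slice(rank * chunk, (rank + 1) * chunk)
    labels = torch.arange(rank * chunk, (rank + 1) * chunk)
    # (B/N, B) logit blocks -- the only blocks this rank materialises
    logits_i = img_all[sl] @ txt_all.t() / temperature
    logits_t = txt_all[sl] @ img_all.t() / temperature
    return 0.5 * (F.cross_entropy(logits_i, labels) + F.cross_entropy(logits_t, labels))

if __name__ == "__main__":
    B, D, N = 16, 8, 4
    img = F.normalize(torch.randn(B, D), dim=-1)
    txt = F.normalize(torch.randn(B, D), dim=-1)
    # Averaging the per-rank losses reproduces the global contrastive loss.
    total = sum(local_clip_loss(img, txt, r, N) for r in range(N)) / N
    print(total.item())
```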
2304.08477 Report Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation Jie An, Songyang Zhang, Harry Yang, Sonal Gupta, Jia-Bin Huang, Jiebo Luo, Xi Yin We propose Latent-Shift -- an efficient text-to-video generation method based on a pretrained text-to-image generation model that consists of an autoencoder and a U-Net diffusion model. Learning a video diffusion model in the latent space is much more efficient than in the pixel space. The latter is often limited to first generating a low-resolution video followed by a sequence of frame interpolation and super-resolution models, which makes the entire pipeline very complex and computationally expensive. To extend a U-Net from image generation to video generation, prior work proposes to add additional modules like 1D temporal convolution and/or temporal attention layers. In contrast, we propose a parameter-free temporal shift module that can leverage the spatial U-Net as is for video generation. We achieve this by shifting two portions of the feature map channels forward and backward along the temporal dimension. The shifted features of the current frame thus receive the features from the previous and the subsequent frames, enabling motion learning without additional parameters. We show that Latent-Shift achieves comparable or better results while being significantly more efficient. Moreover, Latent-Shift can generate images despite being finetuned for T2V generation. Proposes Latent-Shift, an efficient text-to-video generation method that extends a pre-trained text-to-image latent diffusion model by incorporating a parameter-free temporal shift module. Existing pixel-based text-to-video diffusion models are computationally expensive, requiring additional super-resolution and frame interpolation models. Latent-Shift offers a simpler, more efficient solution. Integrates a temporal shift module into the U-Net architecture of a pre-trained text-to-image latent diffusion model. This module shifts feature maps along the temporal dimension, enabling the model to learn temporal coherence without additional parameters. Achieves comparable results to existing methods on MSR-VTT and state-of-the-art results on UCF-101. Demonstrates superior performance in video quality and text-video faithfulness compared to CogVideo in a user study. Significantly more efficient than previous approaches due to its smaller model size and faster inference speed. May struggle with generating videos from complex or uncommon text prompts. Current automatic evaluation metrics for zero-shot text-to-video generation are not ideal and need improvement. text-to-video generation, latent diffusion models, temporal shift module, generative ai, computer vision
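A minimal sketch of the parameter-free temporal shift described in the Latent-Shift entry above: two slices of the channel dimension are shifted one step forward and backward along time, so each frame's features mix with those of its neighbours at zero parameter cost. The 1/8 shift fraction and tensor layout are assumptions for illustration, not the paper's exact settings.

```python
# Parameter-free temporal shift over a (batch, time, channels, H, W) feature map.
import torch

def temporal_shift(x: torch.Tensor, shift_frac: int = 8) -> torch.Tensor:
    b, t, c, h, w = x.shape
    fold = c // shift_frac
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                    # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]    # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]               # remaining channels untouched
    return out

if __name__ == "__main__":
    feats = torch.randn(2, 16, 64, 32, 32)   # toy video feature map
    print(temporal_shift(feats).shape)       # torch.Size([2, 16, 64, 32, 32])
```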
2304.08463 Report Learning to Render Novel Views from Wide-Baseline Stereo Pairs Yilun Du, Cameron Smith, Ayush Tewari, Vincent Sitzmann We introduce a method for novel view synthesis given only a single wide-baseline stereo image pair. In this challenging regime, 3D scene points are regularly observed only once, requiring prior-based reconstruction of scene geometry and appearance. We find that existing approaches to novel view synthesis from sparse observations fail due to recovering incorrect 3D geometry and due to the high cost of differentiable rendering that precludes their scaling to large-scale training. We take a step towards resolving these shortcomings by formulating a multi-view transformer encoder, proposing an efficient, image-space epipolar line sampling scheme to assemble image features for a target ray, and a lightweight cross-attention-based renderer. Our contributions enable training of our method on a large-scale real-world dataset of indoor and outdoor scenes. We demonstrate that our method learns powerful multi-view geometry priors while reducing the rendering time. We conduct extensive comparisons on held-out test scenes across two real-world datasets, significantly outperforming prior work on novel view synthesis from sparse image observations and achieving multi-view-consistent novel view synthesis. This paper introduces a novel method for novel view synthesis of complex indoor and outdoor scenes from a single wide-baseline stereo image pair. Existing methods for novel view synthesis either require dense input views or fail to produce high-quality results in this challenging setting due to inaccurate geometry reconstruction and costly rendering pipelines. The method leverages a multi-view transformer encoder for geometry-aware feature extraction, an efficient image-space epipolar line sampling scheme, and a lightweight cross-attention-based renderer to enable large-scale training and high-quality reconstructions. The proposed method significantly outperforms previous state-of-the-art methods for novel view synthesis from sparse inputs on standard benchmarks such as RealEstate10k and ACID. The method effectively learns multi-view geometry priors and achieves multi-view consistent novel view synthesis. The proposed rendering pipeline is significantly faster than volume rendering-based approaches, enabling efficient and high-quality reconstructions. While showing significant improvement, the rendering quality is not yet on par with single-scene optimization methods using hundreds of input images. The generalization ability to scenes with drastically different appearances compared to the training data is limited. novel view synthesis, wide-baseline stereo, differentiable rendering, vision transformer, epipolar geometry
2304.08386 Report Progressive Visual Prompt Learning with Contrastive Feature Re-formation Chen Xu, Haocheng Shen, Fengyuan Shi, Boheng Chen, Yixuan Liao, Xiaoxin Chen, Limin Wang Prompt learning has been designed as an alternative to fine-tuning for adapting Vision-language (V-L) models to the downstream tasks. Previous works mainly focus on text prompt while visual prompt works are limited for V-L models. The existing visual prompt methods endure either mediocre performance or unstable training process, indicating the difficulty of visual prompt learning. In this paper, we propose a new Progressive Visual Prompt (ProVP) structure to strengthen the interactions among prompts of different layers. More importantly, our ProVP could effectively propagate the image embeddings to deep layers and behave partially similar to an instance adaptive prompt method. To alleviate generalization deterioration, we further propose a new contrastive feature re-formation, which prevents the serious deviation of the prompted visual feature from the fixed CLIP visual feature distribution. Combining both, our method (ProVP-Ref) is evaluated on 11 image benchmark datasets and achieves 7/11 state-of-the-art results on both few-shot and base-to-novel settings. To the best of our knowledge, we are the first to demonstrate the superior performance of visual prompts in V-L models to previous prompt-based methods in downstream tasks. Meanwhile, it implies that our ProVP-Ref shows the best capability to adapt and to generalize. This paper proposes ProVP-Ref, a novel progressive visual prompt learning approach for Vision-language models, to enhance their adaptation and generalization capabilities for downstream tasks. Adapting large pre-trained V-L models to downstream tasks like few-shot learning often results in overfitting or catastrophic forgetting. Existing prompt learning methods focus on text prompts with limitations in handling visual domain shifts. ProVP-Ref introduces a progressive visual prompt (ProVP) structure that strengthens prompt interactions across layers, and a contrastive feature re-formation strategy to prevent significant deviation from the pre-trained feature distribution. ProVP-Ref achieves state-of-the-art results on 7 out of 11 image benchmark datasets for few-shot learning. ProVP-Ref exhibits superior performance on base-to-novel generalization, demonstrating its capability to adapt to unseen classes. The method shows significant improvements on datasets with large domain shifts from pre-trained data. The performance of ProVP-Ref is hindered by the intrinsic limitations of CLIP's text features, particularly when dealing with a large number of classes. The best novel performance on datasets like StanfordCars and Flowers102 is achieved by zero-shot CLIP, suggesting potential conflict between pre-trained knowledge and downstream tasks that needs further exploration. visual prompt learning, vision-language models, few-shot learning, generalization, contrastive learning
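A hedged sketch of the "progressive" prompt idea: instead of replacing prompt tokens at every layer, the prompts fed to layer i combine a fresh learnable prompt with the prompt outputs of layer i-1. The learnable mixing rule below is an illustrative assumption, not ProVP's exact formulation.

```python
# Sketch of a progressive visual prompt connection (combination rule assumed).
import torch
import torch.nn as nn

class ProgressivePrompts(nn.Module):
    def __init__(self, num_layers: int, num_prompts: int, dim: int):
        super().__init__()
        self.new_prompts = nn.Parameter(torch.randn(num_layers, num_prompts, dim) * 0.02)
        self.alpha = nn.Parameter(torch.zeros(num_layers))   # per-layer mixing weight

    def prompts_for_layer(self, i: int, prev_prompt_out: torch.Tensor) -> torch.Tensor:
        """prev_prompt_out: (B, num_prompts, dim) prompt outputs from layer i-1."""
        new = self.new_prompts[i].unsqueeze(0).expand_as(prev_prompt_out)
        a = torch.sigmoid(self.alpha[i])
        return a * prev_prompt_out + (1 - a) * new           # progressive residual mix

pp = ProgressivePrompts(num_layers=12, num_prompts=4, dim=768)
prev = torch.randn(2, 4, 768)                                # toy prompt outputs
print(pp.prompts_for_layer(3, prev).shape)                   # torch.Size([2, 4, 768])
```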
2304.08345 Report VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset Sihan Chen, Xingjian He, Longteng Guo, Xinxin Zhu, Weining Wang, Jinhui Tang, Jing Liu In this paper, we propose a Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multi-modal understanding and generation. Different from widely-studied vision-language pretraining models, VALOR jointly models relationships of vision, audio and language in an end-to-end manner. It contains three separate encoders for single modality representations, and a decoder for multimodal conditional text generation. We design two pretext tasks to pretrain VALOR model, including Multimodal Grouping Alignment (MGA) and Multimodal Grouping Captioning (MGC). MGA projects vision, language and audio to the same common space, building vision-language, audio-language and audiovisual-language alignment simultaneously. MGC learns how to generate text tokens in conditions of vision, audio or their both. To promote vision-audio-language pretraining research, we construct a large-scale high-quality tri-modality dataset named VALOR-1M, which contains 1M audiable videos with human annotated audiovisual captions. Extensive experiments show that VALOR can learn strong multimodal correlations and be generalized to various downstream tasks (e.g., retrieval, captioning and question answering), with different input modalities (e.g., vision-language, audio-language and audiovisual-language). VALOR achieves new state-of-the-art performances on series of public cross-modality benchmarks. Code and data are available at project page https://casia-iva-group.github.io/projects/VALOR. The paper introduces VALOR, a novel Vision-Audio-Language Omni-perception pretraining model, for understanding and generating multimodal content. Existing vision-language models struggle to capture the comprehensive semantic understanding offered by incorporating audio, which often provides complementary information. VALOR aims to bridge this gap by jointly modeling vision, audio, and language. VALOR utilizes three encoders for individual modality representations and a multimodal decoder for text generation. Two novel pretext tasks, Multimodal Grouping Alignment (MGA) and Multimodal Grouping Captioning (MGC), facilitate cross-modal alignment and conditional text generation, respectively. A large-scale dataset, VALOR-1M, with 1 million audio-visual videos paired with human-annotated captions, is introduced to enable effective pretraining. VALOR achieves state-of-the-art performance on various cross-modality benchmarks, including significant improvements in text-to-video retrieval, video captioning, and video question answering. The model effectively utilizes audio-visual clues for audio-visual retrieval and captioning tasks, showcasing its capability in handling multimodal inputs. Experiments demonstrate the effectiveness of modality grouping strategy in improving model generalization across different downstream tasks and modalities. The current scale of VALOR-1M, while large, can be further expanded using unsupervised techniques to leverage a larger pool of audiovisual data. Future work aims to incorporate vision and audio generation capabilities into the VALOR framework. vision-audio-language pretraining, multimodal understanding, multimodal pretraining, audiovisual captioning, cross-modality learning
2304.08271 Report Open-World Weakly-Supervised Object Localization Jinheng Xie, Zhaochuan Luo, Yuexiang Li, Haozhe Liu, Linlin Shen, Mike Zheng Shou While remarkable success has been achieved in weakly-supervised object localization (WSOL), current frameworks are not capable of locating objects of novel categories in open-world settings. To address this issue, we are the first to introduce a new weakly-supervised object localization task called OWSOL (Open-World Weakly-Supervised Object Localization). During training, all labeled data comes from known categories, and both known and novel categories exist in the unlabeled data. To handle such data, we propose a novel paradigm of contrastive representation co-learning using both labeled and unlabeled data to generate a complete G-CAM (Generalized Class Activation Map) for object localization, without the requirement of bounding box annotation. As no class label is available for the unlabelled data, we conduct clustering over the full training set and design a novel multiple semantic centroids-driven contrastive loss for representation learning. We re-organize two widely used datasets, i.e., ImageNet-1K and iNatLoc500, and propose OpenImages150 to serve as evaluation benchmarks for OWSOL. Extensive experiments demonstrate that the proposed method can surpass all baselines by a large margin. We believe that this work can shift the close-set localization towards the open-world setting and serve as a foundation for subsequent works. Code will be released at https://github.com/ryylcc/OWSOL. This paper introduces Open-World Weakly-Supervised Object Localization (OWSOL), a new task aiming to localize both known and novel objects using labeled and unlabeled data. Current WSOL methods are limited to a closed-world setting and cannot handle novel categories, limiting their applicability to real-world scenarios. The authors propose a contrastive representation co-learning paradigm using supervised and multiple semantic centroids-driven contrastive losses. They also introduce generalized class activation mapping (G-CAM) for localization in a non-parametric manner. The proposed method outperforms existing WSOL methods and novel category discovery methods on ImageNet-1K, iNatLoc500, and OpenImages150 datasets. Multiple semantic centroids in contrastive learning are shown to be crucial for complete object activation. The method exhibits robustness to the number of clusters and strong zero-shot localization ability for novel categories. The current method doesn't differentiate between Nov-S and Nov-D categories during training. Future work could explore fine-grained learning for Nov-S and Nov-D to improve performance. weakly-supervised object localization, open-world learning, contrastive learning, class activation mapping, novel category discovery
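A hedged sketch of a non-parametric generalized class activation map (G-CAM): the activation for a category is the cosine similarity between each spatial feature and that category's centroid. How the centroids are obtained (e.g., k-means over the training set) is an illustrative assumption.

```python
# Sketch: G-CAM as cosine similarity between dense features and category centroids.
import torch
import torch.nn.functional as F

def g_cam(feature_map: torch.Tensor, centroids: torch.Tensor) -> torch.Tensor:
    """feature_map: (C, H, W) dense features; centroids: (K, C) per-category centroids.
    Returns (K, H, W) activation maps in [-1, 1]."""
    c, h, w = feature_map.shape
    feats = F.normalize(feature_map.flatten(1).t(), dim=-1)   # (H*W, C)
    cents = F.normalize(centroids, dim=-1)                    # (K, C)
    maps = feats @ cents.t()                                  # (H*W, K)
    return maps.t().reshape(-1, h, w)

feature_map = torch.randn(256, 14, 14)
centroids = torch.randn(10, 256)   # e.g., cluster centers over pooled training features
print(g_cam(feature_map, centroids).shape)   # torch.Size([10, 14, 14])
```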
2304.07547 Report TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation Jingyao Li, Pengguang Chen, Shengju Qian, Jiaya Jia Recent success of Contrastive Language-Image Pre-training (CLIP) has shown great promise in pixel-level open-vocabulary learning tasks. A general paradigm utilizes CLIP's text and patch embeddings to generate semantic masks. However, existing models easily misidentify input pixels from unseen classes, thus confusing novel classes with semantically-similar ones. In our work, we disentangle the ill-posed optimization problem into two parallel processes: one performs semantic matching individually, and the other judges reliability for improving discrimination ability. Motivated by special tokens in language modeling that represent sentence-level embeddings, we design a trusty token that decouples the known and novel category prediction tendency. With almost no extra overhead, we upgrade the pixel-level generalization capacity of existing models effectively. Our TagCLIP (CLIP adapting with Trusty-guidance) boosts the IoU of unseen classes by 7.4% and 1.7% on PASCAL VOC 2012 and COCO-Stuff 164K. This paper proposes TagCLIP, a novel framework for open-vocabulary semantic segmentation that improves the recognition of unseen classes by disentangling semantic matching and prediction reliability. Existing open-vocabulary segmentation models often misclassify pixels from unseen classes, confusing them with semantically-similar seen classes. This work addresses this issue to improve the generalization ability of these models. The proposed TagCLIP introduces a trusty token to capture the prediction tendency for known and novel categories. It uses a Trusty Learner module to optimize this token based on the inter-category relationship, and weighs the raw segmentation map with the trusty map during inference to enhance discrimination. TagCLIP significantly boosts the IoU of unseen classes by 7.4% on PASCAL VOC 2012 and 1.7% on COCO-Stuff 164K in the inductive setting. The method demonstrates superior performance on unseen categories compared to state-of-the-art approaches, especially for hard classes. TagCLIP also shows strong cross-dataset generalization capability, improving performance by 1.4% from COCO-Stuff 164K to PASCAL Context. TagCLIP's design is not specifically optimized for the transductive setting, where unseen category names are accessible during training. Future work could explore incorporating mechanisms for more effective self-supervision on unseen names in the transductive setting. open-vocabulary learning, semantic segmentation, zero-shot learning, vision-language models, clip
2304.07527 Report Align-DETR: Improving DETR with Simple IoU-aware BCE loss Zhi Cai, Songtao Liu, Guodong Wang, Zheng Ge, Xiangyu Zhang, Di Huang DETR has set up a simple end-to-end pipeline for object detection by formulating this task as a set prediction problem, showing promising potential. However, despite the significant progress in improving DETR, this paper identifies a problem of misalignment in the output distribution, which prevents the best-regressed samples from being assigned with high confidence, hindering the model's accuracy. We propose a metric, recall of best-regressed samples, to quantitatively evaluate the misalignment problem. Observing its importance, we propose a novel Align-DETR that incorporates a localization precision-aware classification loss in optimization. The proposed loss, IA-BCE, guides the training of DETR to build a strong correlation between classification score and localization precision. We also adopt the mixed-matching strategy, to facilitate DETR-based detectors with faster training convergence while keeping an end-to-end scheme. Moreover, to overcome the dramatic decrease in sample quality induced by the sparsity of queries, we introduce a prime sample weighting mechanism to suppress the interference of unimportant samples. Extensive experiments are conducted with very competitive results reported. In particular, it delivers a 46 (+3.8)% AP on the DAB-DETR baseline with the ResNet-50 backbone and reaches a new SOTA performance of 50.2% AP in the 1x setting on the COCO validation set when employing the strong baseline DINO. Our code is available at https://github.com/FelixCaae/AlignDETR. This paper identifies a misalignment problem in DETR, where classification confidence and localization precision are inconsistent, and proposes Align-DETR to address it. The misalignment problem hinders DETR's accuracy by preventing the best-regressed samples from being assigned high confidence. Align-DETR introduces an IoU-aware BCE loss to align classification and regression scores, adopts a mixed-matching strategy for faster convergence, and implements prime sample weighting to handle low-quality samples during training. Align-DETR achieves a 3.8% AP gain over the DAB-DETR baseline with a ResNet-50 backbone. It sets a new SOTA performance of 50.2% AP on COCO validation with the DINO baseline. Ablation studies validate the effectiveness of the proposed components, particularly the IoU-aware BCE loss and prime sample weighting. The improvement from the IoU branch is limited, potentially due to DETR's reliance on self-attention for duplicate removal. Further investigation is needed to explore other methods for handling the misalignment problem in DETR. object detection, detr, misalignment, iou-aware loss, mixed matching
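A hedged sketch of an IoU-aware BCE classification loss: positive targets are a soft blend of the predicted score s and the box IoU u, t = s**alpha * u**(1 - alpha), so well-localized samples are pushed toward higher confidence. The exact exponent, detaching, and sample weighting used by Align-DETR are not reproduced here.

```python
# Sketch of an IoU-aware BCE loss (target blending rule and alpha are assumptions).
import torch
import torch.nn.functional as F

def iou_aware_bce(scores, ious, pos_mask, alpha: float = 0.25):
    """scores: (N,) predicted class probabilities for matched queries;
    ious: (N,) IoU of each predicted box with its assigned ground truth;
    pos_mask: (N,) bool, True for positive (matched) samples."""
    targets = torch.zeros_like(scores)
    # Detached score keeps the soft target fixed during the backward pass (illustrative choice).
    targets[pos_mask] = scores[pos_mask].detach() ** alpha * ious[pos_mask] ** (1 - alpha)
    return F.binary_cross_entropy(scores, targets, reduction="mean")

scores = torch.sigmoid(torch.randn(8))
ious = torch.rand(8)
pos = torch.tensor([True, True, False, True, False, False, True, False])
print(float(iou_aware_bce(scores, ious, pos)))
```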
2304.07483 Report Video Generation Beyond a Single Clip Hsin-Ping Huang, Yu-Chuan Su, Ming-Hsuan Yang We tackle the long video generation problem, i.e., generating videos beyond the output length of video generation models. Due to the computation resource constraints, video generation models can only generate video clips that are relatively short compared with the length of real videos. Existing works apply a sliding window approach to generate long videos at inference time, which is often limited to generating recurrent events or homogeneous content. To generate long videos covering diverse content and multiple events, we propose to use additional guidance to control the video generation process. We further present a two-stage approach to the problem, which allows us to utilize existing video generation models to generate high-quality videos within a small time window while modeling the video holistically based on the input guidance. The proposed approach is complementary to existing efforts on video generation, which focus on generating realistic video within a fixed time window. Extensive experiments on challenging real-world videos validate the benefit of the proposed method, which improves over state-of-the-art by up to 9.5% in objective metrics and is preferred by users more than 80% of the time. This paper tackles the long video generation problem, aiming to generate videos longer than the output length of typical video generation models. The authors propose a two-stage approach using additional guidance (object labels) to control the generation process. Current video generation models are limited in the length of videos they can produce due to computational constraints. Existing sliding window approaches often result in repetitive content, highlighting the need for methods that can generate long videos with diverse content and multiple events. The proposed method decomposes the problem into two stages: keyframe generation and frame interpolation. First, keyframes representing the start of each short video clip are predicted jointly based on object label guidance and a reference frame. Then, existing video generation models are utilized to interpolate intermediate frames between keyframes, generating the complete video. The proposed method outperforms state-of-the-art video generation models on the challenging EPIC Kitchen dataset, showing significant improvement in metrics like LPIPS and FVD. Jointly predicting keyframes leads to better temporal consistency and quality compared to generating keyframes independently. User studies confirm that the generated videos have better visual quality and reproduce the content of ground truth videos more accurately. The current implementation relies on object labels as guidance; exploring other types of guidance like text descriptions could be beneficial. Further research on improving the quality of intermediate representations, such as layout generation, could lead to even better long video generation results. video generation, long video synthesis, keyframe generation, frame interpolation, object guidance
2304.07429 Report Identity Encoder for Personalized Diffusion Yu-Chuan Su, Kelvin C. K. Chan, Yandong Li, Yang Zhao, Han Zhang, Boqing Gong, Huisheng Wang, Xuhui Jia Many applications can benefit from personalized image generation models, including image enhancement, video conferences, just to name a few. Existing works achieved personalization by fine-tuning one model for each person. While being successful, this approach incurs additional computation and storage overhead for each new identity. Furthermore, it usually expects tens or hundreds of examples per identity to achieve the best performance. To overcome these challenges, we propose an encoder-based approach for personalization. We learn an identity encoder which can extract an identity representation from a set of reference images of a subject, together with a diffusion generator that can generate new images of the subject conditioned on the identity representation. Once being trained, the model can be used to generate images of arbitrary identities given a few examples even if the model hasn't been trained on the identity. Our approach greatly reduces the overhead for personalized image generation and is more applicable in many potential applications. Empirical results show that our approach consistently outperforms existing fine-tuning based approach in both image generation and reconstruction, and the outputs are preferred by users more than 95% of the time compared with the best performing baseline. This paper introduces an encoder-based approach for personalized image generation using diffusion models, enabling the generation of new images of arbitrary subjects given a few reference images. Existing personalized image generation methods rely on fine-tuning for each identity, leading to high computational costs and limited practicality. This paper proposes a more efficient and scalable approach using an identity encoder. The proposed method learns an identity encoder that extracts an identity representation from a set of reference images. A diffusion generator then synthesizes new images conditioned on this representation. The system is trained using a combination of random average embedding, identity loss, and multi-task learning to balance identity preservation, output diversity, and image quality. The proposed method consistently outperforms baselines in terms of image quality and is on par with the best baseline in terms of identity preservation. The method generates diverse outputs from a few reference images, unlike baselines that require hundreds of examples. The approach effectively extends to conditional generation tasks like super-resolution and inpainting, demonstrating superior reconstruction accuracy compared to baselines. The average embedding strategy might be sub-optimal for capturing all potential variations of a subject. Output quality can be subject-dependent, potentially due to biases in the training data. personalized image generation, diffusion models, identity encoder, conditional generation, image inpainting, super-resolution
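A hedged sketch of the encoder-based personalization idea: a few reference photos of one subject are encoded and averaged into a single identity embedding, which then conditions the generator. The backbone below is a placeholder, not the paper's architecture, and how the embedding is injected into the diffusion model is left as an assumption.

```python
# Sketch: average identity embedding from a handful of reference images.
import torch
import torch.nn as nn

class IdentityEncoder(nn.Module):
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.backbone = nn.Sequential(           # stand-in feature extractor
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, embed_dim))

    def forward(self, refs: torch.Tensor) -> torch.Tensor:
        """refs: (num_refs, 3, H, W) -> (embed_dim,) averaged identity embedding."""
        return self.backbone(refs).mean(dim=0)

encoder = IdentityEncoder()
reference_images = torch.randn(4, 3, 128, 128)   # a few photos of one subject
identity = encoder(reference_images)
print(identity.shape)                            # torch.Size([512])
# `identity` would then condition the diffusion generator, e.g. via cross-attention
# or embedding injection (mechanism assumed, not specified here).
```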
2304.07221 Report Instance-aware Dynamic Prompt Tuning for Pre-trained Point Cloud Models Yaohua Zha, Jinpeng Wang, Tao Dai, Bin Chen, Zhi Wang, Shu-Tao Xia Pre-trained point cloud models have found extensive applications in 3D understanding tasks like object classification and part segmentation. However, the prevailing strategy of full fine-tuning in downstream tasks leads to large per-task storage overhead for model parameters, which limits the efficiency when applying large-scale pre-trained models. Inspired by the recent success of visual prompt tuning (VPT), this paper attempts to explore prompt tuning on pre-trained point cloud models, to pursue an elegant balance between performance and parameter efficiency. We find while instance-agnostic static prompting, e.g. VPT, shows some efficacy in downstream transfer, it is vulnerable to the distribution diversity caused by various types of noises in real-world point cloud data. To conquer this limitation, we propose a novel Instance-aware Dynamic Prompt Tuning (IDPT) strategy for pre-trained point cloud models. The essence of IDPT is to develop a dynamic prompt generation module to perceive semantic prior features of each point cloud instance and generate adaptive prompt tokens to enhance the model's robustness. Notably, extensive experiments demonstrate that IDPT outperforms full fine-tuning in most tasks with a mere 7% of the trainable parameters, providing a promising solution to parameter-efficient learning for pre-trained point cloud models. Code is available at https://github.com/zyh16143998882/ICCV23-IDPT. This paper proposes Instance-aware Dynamic Prompt Tuning (IDPT), a novel method for parameter-efficient tuning of pre-trained point cloud models that addresses the limitations of static prompt tuning methods like VPT when applied to real-world point cloud data. Full fine-tuning of pre-trained point cloud models for downstream tasks requires significant storage for model parameters, limiting efficiency. Prompt tuning offers a parameter-efficient alternative but existing static methods struggle with the distribution diversity present in real-world point cloud data. IDPT employs a dynamic prompt generation module that leverages EdgeConv layers to extract multi-scale contextual features from point cloud instances. These features are used to generate adaptive prompt tokens that are concatenated with the input of the last transformer layer, enhancing the model's robustness to noise and missing data. IDPT achieves state-of-the-art performance on the ScanObjectNN dataset for object classification, outperforming full fine-tuning in most cases with only 7% of trainable parameters. IDPT demonstrates superior performance compared to full fine-tuning and static prompt tuning methods on both synthetic and real-world datasets for object classification and few-shot learning. While IDPT shows improvements over static prompting for part segmentation, it still lags behind full fine-tuning, suggesting a need for further research in parameter-efficient methods for fine-grained point cloud understanding. The performance gap between IDPT and full fine-tuning in part segmentation highlights the challenge of parameter-efficient tuning for fine-grained point cloud tasks. Future work could explore incorporating effective structure modeling mechanisms within the parameter-efficient tuning strategy to bridge this gap. point cloud, prompt tuning, pre-trained models, parameter efficiency, domain adaptation
2304.07087 Report Memory Efficient Diffusion Probabilistic Models via Patch-based Generation Shinei Arakawa, Hideki Tsunashima, Daichi Horita, Keitaro Tanaka, Shigeo Morishima Diffusion probabilistic models have been successful in generating high-quality and diverse images. However, traditional models, whose input and output are high-resolution images, suffer from excessive memory requirements, making them less practical for edge devices. Previous approaches for generative adversarial networks proposed a patch-based method that uses positional encoding and global content information. Nevertheless, designing a patch-based approach for diffusion probabilistic models is non-trivial. In this paper, we present a diffusion probabilistic model that generates images on a patch-by-patch basis. We propose two conditioning methods for a patch-based generation. First, we propose position-wise conditioning using one-hot representation to ensure patches are in proper positions. Second, we propose Global Content Conditioning (GCC) to ensure patches have coherent content when concatenated together. We evaluate our model qualitatively and quantitatively on CelebA and LSUN bedroom datasets and demonstrate a moderate trade-off between maximum memory consumption and generated image quality. Specifically, when an entire image is divided into 2 x 2 patches, our proposed approach can reduce the maximum memory consumption by half while maintaining comparable image quality. This paper presents a memory-efficient diffusion probabilistic model for image generation, which operates on a patch-by-patch basis. Traditional diffusion models suffer from high memory requirements, especially for high-resolution images, limiting their practicality on edge devices. The model divides images into patches and utilizes two conditioning methods: 1) Position-wise conditioning using one-hot representation to specify patch location. 2) Global Content Conditioning (GCC) which extracts global content features from the entire image to ensure coherence when patches are combined. The proposed method can reduce the maximum memory consumption by half while maintaining comparable image quality when dividing an entire image into 2x2 patches. The model exhibits good performance on CelebA dataset, particularly with 2x2 and 4x4 patch divisions. On LSUN bedroom dataset, while quality is maintained with 2x2 division, further divisions lead to noticeable patch boundaries and quality degradation. The model struggles with datasets containing diverse image compositions, leading to patch boundary artifacts. Extracting global content information at every diffusion step might lead to error accumulation and boundary discontinuities. diffusion probabilistic models, memory efficient, patch-based generation, global content conditioning, image generation
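A hedged sketch of position-wise conditioning for patch-by-patch generation: each patch receives a one-hot grid-position code broadcast over its spatial extent and appended along the channel dimension of the denoiser input. Whether the paper concatenates the code or injects it as an embedding is an assumption here.

```python
# Sketch: one-hot positional conditioning for a patch in a grid x grid layout.
import torch

def positional_condition(patch: torch.Tensor, row: int, col: int, grid: int = 2):
    """patch: (B, C, h, w) noisy patch; returns the patch with a grid*grid
    one-hot position map appended along the channel dimension."""
    b, _, h, w = patch.shape
    onehot = torch.zeros(b, grid * grid, h, w, device=patch.device)
    onehot[:, row * grid + col] = 1.0
    return torch.cat([patch, onehot], dim=1)

patch = torch.randn(1, 3, 64, 64)               # one of the 2x2 patches
conditioned = positional_condition(patch, row=0, col=1)
print(conditioned.shape)                        # torch.Size([1, 7, 64, 64])
```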
2304.07060 Report DCFace: Synthetic Face Generation with Dual Condition Diffusion Model Minchul Kim, Feng Liu, Anil Jain, Xiaoming Liu Generating synthetic datasets for training face recognition models is challenging because dataset generation entails more than creating high fidelity images. It involves generating multiple images of same subjects under different factors (e.g., variations in pose, illumination, expression, aging and occlusion) which follows the real image conditional distribution. Previous works have studied the generation of synthetic datasets using GAN or 3D models. In this work, we approach the problem from the aspect of combining subject appearance (ID) and external factor (style) conditions. These two conditions provide a direct way to control the inter-class and intra-class variations. To this end, we propose a Dual Condition Face Generator (DCFace) based on a diffusion model. Our novel Patch-wise style extractor and Time-step dependent ID loss enables DCFace to consistently produce face images of the same subject under different styles with precise control. Face recognition models trained on synthetic images from the proposed DCFace provide higher verification accuracies compared to previous works by 6.11% on average in 4 out of 5 test datasets, LFW, CFP-FP, CPLFW, AgeDB and CALFW. Code is available at https://github.com/mk-minchul/dcface Proposes DCFace, a two-stage dual condition diffusion model for generating synthetic face datasets with improved subject uniqueness, diversity, and label consistency. Addresses limitations of existing synthetic face datasets in matching real-world image distributions and label accuracy, crucial for training effective face recognition models while mitigating privacy concerns associated with real datasets. Combines an ID image generator with a style bank and a dual condition generator (G_mix). G_mix leverages a patch-wise style extractor and a time-step dependent ID loss to blend ID and style conditions from same-subject image pairs, enhancing control over image generation. Achieves state-of-the-art face recognition performance with a 0.5M synthetic image dataset, outperforming previous methods by 6.11% on average across four benchmark datasets. Demonstrates the importance of balancing label consistency and diversity in synthetic datasets for optimal face recognition accuracy. Shows that DDPM trained on FFHQ can generate a substantial number of unique subjects, addressing the limitation of limited subject diversity in previous GAN-based methods. Despite improvements, DCFace lacks 3D consistency across pose, a potential area for future work leveraging 3D priors. Current implementation still relies on real images for training, aspiring for fully synthetic datasets to completely eliminate dependence on real data. synthetic data generation, face recognition, diffusion models, dual condition generation, dataset diversity and consistency
2304.07039 Report Learning Semantic-Aware Knowledge Guidance for Low-Light Image Enhancement Yuhui Wu, Chen Pan, Guoqing Wang, Yang Yang, Jiwei Wei, Chongyi Li, Heng Tao Shen Low-light image enhancement (LLIE) investigates how to improve illumination and produce normal-light images. The majority of existing methods improve low-light images via a global and uniform manner, without taking into account the semantic information of different regions. Without semantic priors, a network may easily deviate from a region's original color. To address this issue, we propose a novel semantic-aware knowledge-guided framework (SKF) that can assist a low-light enhancement model in learning rich and diverse priors encapsulated in a semantic segmentation model. We concentrate on incorporating semantic knowledge from three key aspects: a semantic-aware embedding module that wisely integrates semantic priors in feature representation space, a semantic-guided color histogram loss that preserves color consistency of various instances, and a semantic-guided adversarial loss that produces more natural textures by semantic priors. Our SKF is appealing in acting as a general framework in LLIE task. Extensive experiments show that models equipped with the SKF significantly outperform the baselines on multiple datasets and our SKF generalizes to different models and scenes well. The code is available at Semantic-Aware-Low-Light-Image-Enhancement. This paper proposes SKF, a semantic-aware knowledge-guided framework that improves low-light image enhancement using semantic priors. Existing LLIE methods often enhance images globally without considering region-specific semantics, leading to color deviations and unnatural results. The SKF leverages a pre-trained semantic segmentation network (HRNet) as a knowledge bank and introduces: 1) Semantic-aware embedding (SE) module for refining image features using semantic features. 2) Semantic-guided color histogram (SCH) loss for preserving instance-level color consistency. 3) Semantic-guided adversarial (SA) loss for enhancing texture realism by guiding the discriminator. SKF consistently improves the performance of six baseline LLIE methods on LOL and LOL-v2 datasets. LLFlow-L with SKF achieves state-of-the-art results on both LOL and LOL-v2 datasets. The proposed framework effectively suppresses noise, preserves color consistency, and generates realistic textures in enhanced images. The performance improvement is limited when encountering unknown object categories. Future work includes exploring the framework's potential in other low-level vision tasks. low-light image enhancement, semantic segmentation, knowledge guidance, color consistency, adversarial learning
2304.06957 Report MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic Segmentation Jie Guo, Qimeng Wang, Yan Gao, Xiaolong Jiang, Xu Tang, Yao Hu, Baochang Zhang CLIP (Contrastive Language-Image Pretraining) is well-developed for open-vocabulary zero-shot image-level recognition, while its applications in pixel-level tasks are less investigated, where most efforts directly adopt CLIP features without deliberative adaptations. In this work, we first demonstrate the necessity of image-pixel CLIP feature adaption, then provide Multi-View Prompt learning (MVP-SEG) as an effective solution to achieve image-pixel adaptation and to solve open-vocabulary semantic segmentation. Concretely, MVP-SEG deliberately learns multiple prompts trained by our Orthogonal Constraint Loss (OCLoss), by which each prompt is supervised to exploit CLIP feature on different object parts, and collaborative segmentation masks generated by all prompts promote better segmentation. Moreover, MVP-SEG introduces Global Prompt Refining (GPR) to further eliminate class-wise segmentation noise. Experiments show that the multi-view prompts learned from seen categories have strong generalization to unseen categories, and MVP-SEG+ which combines the knowledge transfer stage significantly outperforms previous methods on several benchmarks. Moreover, qualitative results justify that MVP-SEG does lead to better focus on different local parts. Proposes MVP-SEG, a multi-view prompt learning method for open-vocabulary semantic segmentation using pre-trained CLIP CLIP, while powerful for image-level recognition, requires adaptation for pixel-level tasks like segmentation. Existing methods directly adopting CLIP features result in sub-optimal performance. Learns multiple prompts to capture different object parts, supervised by an Orthogonal Constraint Loss (OCLoss) to ensure part-wise attention. Introduces Global Prompt Refining (GPR) to leverage CLIP's classification ability and refine segmentation masks. MVP-SEG significantly outperforms baseline (MaskCLIP) on unseen classes, demonstrating the effectiveness of multi-view learnable prompts. Learnable prompts outperform handcrafted prompts, showing adaptability and superiority of the proposed method. MVP-SEG+, combining MVP-SEG with knowledge transfer, achieves state-of-the-art performance on three major benchmarks, even surpassing fully-supervised methods on some. The number of prompts and their effectiveness might vary across datasets and object categories. Exploring alternative prompt learning strategies and architectures for further performance improvement. open-vocabulary semantic segmentation, zero-shot learning, clip, prompt learning, multi-view learning
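A hedged sketch of an orthogonal constraint over multiple learnable prompts: the Gram matrix of normalized prompt embeddings is pushed toward the identity so that each prompt attends to a different aspect (e.g., a different object part). The exact norm and weighting used by MVP-SEG's OCLoss are not reproduced here.

```python
# Sketch: orthogonality regularizer for K learnable prompt embeddings.
import torch
import torch.nn.functional as F

def orthogonal_constraint_loss(prompts: torch.Tensor) -> torch.Tensor:
    """prompts: (K, D) learnable prompt embeddings (one per view)."""
    p = F.normalize(prompts, dim=-1)
    gram = p @ p.t()                                   # (K, K) cosine similarities
    eye = torch.eye(p.size(0), device=p.device)
    return ((gram - eye) ** 2).sum()                   # off-diagonals pushed to zero

prompts = torch.nn.Parameter(torch.randn(4, 512))      # e.g., 4 part-oriented prompts
loss = orthogonal_constraint_loss(prompts)
loss.backward()
print(float(loss))
```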
2304.06939 Report Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, Yejin Choi In-context vision and language models like Flamingo support arbitrarily interleaved sequences of images and text as input. This format not only enables few-shot learning via interleaving independent supervised (image, text) examples, but also, more complex prompts involving interaction between images, e.g., "What do image A and image B have in common?" To support this interface, pretraining occurs over web corpora that similarly contain interleaved images+text. To date, however, large-scale data of this form have not been publicly available. We release Multimodal C4, an augmentation of the popular text-only C4 corpus with images interleaved. We use a linear assignment algorithm to place images into longer bodies of text using CLIP features, a process that we show outperforms alternatives. Multimodal C4 spans everyday topics like cooking, travel, technology, etc. A manual inspection of a random sample of documents shows that a vast majority (88%) of images are topically relevant, and that linear assignment frequently selects individual sentences specifically well-aligned with each image (80%). After filtering NSFW images, ads, etc., the resulting corpus consists of 101.2M documents with 571M images interleaved in 43B English tokens. This paper introduces Multimodal C4 (MMC4), a large-scale dataset with interleaved image and text sequences for training multimodal language models. Existing multimodal datasets primarily consist of image-caption pairs, limiting the ability of models to learn complex interactions between images and text. MMC4 addresses this gap by providing a rich dataset with interleaved sequences, enabling the development of models capable of few-shot learning and complex multimodal reasoning. The authors augmented the existing text-only C4 dataset with images from the corresponding web pages. They employed a CLIP-based linear assignment algorithm to align images with relevant sentences within each document, ensuring topical relevance and image-text alignment. MMC4 consists of 101.2M documents, 571M images, and 43B English tokens, surpassing previous non-public datasets in scale. Manual verification indicates that 87.7% of images are topically relevant to their associated documents, and 80.4% are well-aligned with their assigned sentences. Preliminary experiments demonstrate that training a multimodal language model on MMC4 improves its performance on few-shot, in-context image captioning tasks compared to training on image-caption pairs alone. The paper lacks detailed empirical evaluation of the model's in-context reasoning abilities beyond few-shot image captioning. Future work could explore the impact of data scaling and instruction tuning on multimodal in-context learning. multimodal language models, dataset, image-text alignment, in-context learning, few-shot learning
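A hedged sketch of the image-placement step the abstract describes: compute CLIP image-sentence similarities, then solve a bipartite linear assignment so each image lands next to its best overall sentence. The similarity matrix below is a toy stand-in; thresholds, filtering, and the exact CLIP variant used for MMC4 are not reproduced.

```python
# Sketch: place images into a document via linear assignment over CLIP similarities.
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_images_to_sentences(similarity: np.ndarray):
    """similarity: (num_images, num_sentences) CLIP cosine similarities.
    Returns (image_idx, sentence_idx) pairs maximizing total similarity."""
    img_idx, sent_idx = linear_sum_assignment(-similarity)   # maximize => negate cost
    return list(zip(img_idx.tolist(), sent_idx.tolist()))

# Toy similarity matrix for 2 images and 4 candidate sentences.
sims = np.array([[0.21, 0.35, 0.10, 0.05],
                 [0.30, 0.12, 0.28, 0.33]])
print(assign_images_to_sentences(sims))   # e.g., [(0, 1), (1, 3)]
```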
2304.06911 Report 3D Feature Prediction for Masked-AutoEncoder-Based Point Cloud Pretraining Siming Yan, Yuqi Yang, Yuxiao Guo, Hao Pan, Peng-shuai Wang, Xin Tong, Yang Liu, Qixing Huang Masked autoencoders (MAE) have recently been introduced to 3D self-supervised pretraining for point clouds due to their great success in NLP and computer vision. Unlike MAEs used in the image domain, where the pretext task is to restore features at the masked pixels, such as colors, the existing 3D MAE works reconstruct the missing geometry only, i.e., the location of the masked points. In contrast to previous studies, we advocate that point location recovery is inessential and restoring intrinsic point features is much superior. To this end, we propose to ignore point position reconstruction and recover high-order features at masked points including surface normals and surface variations, through a novel attention-based decoder which is independent of the encoder design. We validate the effectiveness of our pretext task and decoder design using different encoder structures for 3D training and demonstrate the advantages of our pretrained networks on various point cloud analysis tasks. This paper proposes MaskFeat3D, a novel masked autoencoding method for 3D self-supervised pretraining that focuses on reconstructing intrinsic point features (surface normals and variations) instead of point locations. Existing 3D MAE methods primarily focus on reconstructing masked point locations, deviating from successful 2D approaches that prioritize feature restoration. This paper argues that recovering high-order surface features is crucial for better representation learning in 3D. The method uses an attention-based decoder that takes masked points as queries and leverages cross-attention with encoder features to predict normals and variations. This decoder is agnostic to the encoder architecture, supporting ViT, PointNet++, and sparse CNNs. MaskFeat3D consistently outperforms previous 3D MAE methods on ScanObjectNN classification and ShapeNetPart segmentation. Using both normal and surface variation as target features yields better performance than using either alone. The method achieves state-of-the-art results on these tasks, even surpassing supervised methods in some cases. The computational cost can be high due to the use of attention. Exploration of other potential 3D features for reconstruction is left for future work. self-supervised learning, point cloud, masked autoencoder, feature prediction, attention mechanism
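A hedged sketch of an attention-based decoder that queries encoder features with masked point coordinates and regresses a per-point surface normal plus a scalar surface variation. Dimensions, the positional embedding, and the single-layer design are illustrative assumptions rather than the paper's exact decoder.

```python
# Sketch: cross-attention decoder predicting normals and surface variation at masked points.
import torch
import torch.nn as nn

class MaskedPointFeatureDecoder(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.query_embed = nn.Linear(3, dim)             # embed masked xyz as queries
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 4)                    # 3 normal dims + 1 variation

    def forward(self, masked_xyz: torch.Tensor, enc_tokens: torch.Tensor):
        """masked_xyz: (B, M, 3) masked point coords; enc_tokens: (B, N, dim)."""
        q = self.query_embed(masked_xyz)
        attended, _ = self.cross_attn(q, enc_tokens, enc_tokens)
        out = self.head(attended)                        # (B, M, 4)
        normals = nn.functional.normalize(out[..., :3], dim=-1)
        variation = out[..., 3:]
        return normals, variation

decoder = MaskedPointFeatureDecoder()
normals, variation = decoder(torch.rand(2, 128, 3), torch.randn(2, 64, 256))
print(normals.shape, variation.shape)   # torch.Size([2, 128, 3]) torch.Size([2, 128, 1])
```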
2304.06720 Report Expressive Text-to-Image Generation with Rich Text Songwei Ge, Taesung Park, Jun-Yan Zhu, Jia-Bin Huang Plain text has become a prevalent interface for text-to-image synthesis. However, its limited customization options hinder users from accurately describing desired outputs. For example, plain text makes it hard to specify continuous quantities, such as the precise RGB color value or importance of each word. Furthermore, creating detailed text prompts for complex scenes is tedious for humans to write and challenging for text encoders to interpret. To address these challenges, we propose using a rich-text editor supporting formats such as font style, size, color, and footnote. We extract each word's attributes from rich text to enable local style control, explicit token reweighting, precise color rendering, and detailed region synthesis. We achieve these capabilities through a region-based diffusion process. We first obtain each word's region based on attention maps of a diffusion process using plain text. For each region, we enforce its text attributes by creating region-specific detailed prompts and applying region-specific guidance, and maintain its fidelity against plain-text generation through region-based injections. We present various examples of image generation from rich text and demonstrate that our method outperforms strong baselines with quantitative evaluations. This paper introduces rich-text-to-image generation, enabling precise control over image synthesis using attributes like font style, size, color, and footnotes. Plain text prompts limit users' ability to specify precise details like color or object importance. Rich text offers a more expressive and user-friendly interface for text-to-image synthesis. The method uses a two-step process. First, it computes spatial layouts for token spans using attention maps from a plain-text diffusion process. Second, it employs region-based diffusion and guidance to render each region's attributes, preserving fidelity to the plain-text generation. The method generates more precise colors compared to baselines, accurately reflecting RGB values and subtle color names. It enables local style control, applying distinct artistic styles to different image regions, unlike baselines that produce uniform styles. It facilitates detailed region synthesis, incorporating information from footnotes to generate complex scenes with higher fidelity than competing approaches. The method's reliance on multiple diffusion processes leads to longer inference times compared to plain-text generation. The token map generation process relies on a thresholding parameter that could be replaced with more advanced segmentation methods. text-to-image synthesis, rich text, diffusion models, attention mechanisms, controllable image generation
2304.06718 Report Segment Everything Everywhere All at Once Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, Yong Jae Lee In this work, we present SEEM, a promptable and interactive model for segmenting everything everywhere all at once in an image, as shown in Fig.1. In SEEM, we propose a novel decoding mechanism that enables diverse prompting for all types of segmentation tasks, aiming at a universal segmentation interface that behaves like large language models (LLMs). More specifically, SEEM is designed with four desiderata: i) Versatility. We introduce a new visual prompt to unify different spatial queries including points, boxes, scribbles and masks, which can further generalize to a different referring image; ii) Compositionality. We learn a joint visual-semantic space between text and visual prompts, which facilitates the dynamic composition of two prompt types required for various segmentation tasks; iii) Interactivity. We further incorporate learnable memory prompts into the decoder to retain segmentation history through mask-guided cross-attention from decoder to image features; and iv) Semantic-awareness. We use a text encoder to encode text queries and mask labels into the same semantic space for open-vocabulary segmentation. We conduct a comprehensive empirical study to validate the effectiveness of SEEM across diverse segmentation tasks. Notably, our single SEEM model achieves competitive performance across interactive segmentation, generic segmentation, referring segmentation, and video object segmentation on 9 datasets with minimum 1/100 supervision. Furthermore, SEEM showcases a remarkable capacity for generalization to novel prompts or their combinations, rendering it a readily universal image segmentation interface. This paper proposes SEEM, a promptable and interactive model for universal image segmentation, capable of segmenting everything in an image with semantic labels, covering every pixel, and supporting various prompt compositions. The research aims to address the limitations of existing segmentation models by introducing a universal interface that accommodates diverse human prompts and handles various segmentation tasks within a single model. SEEM utilizes an encoder-decoder architecture with a novel decoding mechanism that enables versatile prompting. It introduces visual prompts for non-textual inputs, facilitates compositionality of prompts, incorporates memory prompts for interactivity, and ensures semantic awareness for open-vocabulary segmentation. SEEM achieves competitive performance across interactive segmentation, generic segmentation, and referring segmentation tasks on nine datasets with minimal supervision. The model demonstrates a remarkable capacity for generalization to novel prompts or their combinations, highlighting its potential as a universal image segmentation interface. SEEM exhibits efficiency in interactive segmentation, requiring only one feature extraction at the start and lightweight decoding per interaction round. The model's performance on referring segmentation is slightly affected when trained from scratch. Increasing the number of interactive training iterations improves accuracy but also elevates computational costs. image segmentation, interactive segmentation, referring segmentation, open-vocabulary segmentation, universal model
2304.06717 Report Representing Volumetric Videos as Dynamic MLP Maps Sida Peng, Yunzhi Yan, Qing Shuai, Hujun Bao, Xiaowei Zhou This paper introduces a novel representation of volumetric videos for real-time view synthesis of dynamic scenes. Recent advances in neural scene representations demonstrate their remarkable capability to model and render complex static scenes, but extending them to represent dynamic scenes is not straightforward due to their slow rendering speed or high storage cost. To solve this problem, our key idea is to represent the radiance field of each frame as a set of shallow MLP networks whose parameters are stored in 2D grids, called MLP maps, and dynamically predicted by a 2D CNN decoder shared by all frames. Representing 3D scenes with shallow MLPs significantly improves the rendering speed, while dynamically predicting MLP parameters with a shared 2D CNN instead of explicitly storing them leads to low storage cost. Experiments show that the proposed approach achieves state-of-the-art rendering quality on the NHR and ZJU-MoCap datasets, while being efficient for real-time rendering with a speed of 41.7 fps for 512x512 images on an RTX 3090 GPU. The code is available at https://zju3dv.github.io/mlp_maps/. This paper proposes a novel representation of volumetric video called "dynamic MLP maps" for efficient view synthesis of dynamic scenes. Designing a volumetric video representation that allows for high-quality, real-time rendering while also being efficiently compressed remains an open problem. The authors represent each video frame as a set of small MLP networks, with their parameters stored in 2D grids called MLP maps. These parameters are dynamically predicted by a 2D CNN decoder shared across all frames. The approach achieves state-of-the-art rendering quality on the NHR and ZJU-MoCap datasets. It enables real-time rendering with speeds of 41.7 fps for 512x512 images on an RTX 3090 GPU. The method achieves compact representation, leading to low storage costs. The current work only handles relatively short videos (100-300 frames), limiting its applicability to longer videos. The representation relies on dense camera views for training, similar to many existing methods. volumetric video, view synthesis, neural scene representation, mlp, real-time rendering
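A hedged sketch of evaluating an "MLP map": a 2D grid stores the flattened parameters of a tiny per-location MLP; a query location is projected to the plane, the parameter vector is sampled bilinearly, reshaped into weights, and applied to the point's feature. Layer sizes, the projection, and the single-plane setup are illustrative assumptions, not the paper's exact design.

```python
# Sketch: sample per-location MLP parameters from a 2D map and run the tiny MLP.
import torch
import torch.nn.functional as F

def eval_mlp_map(param_map, xy, feat, in_dim=16, hid=16, out_dim=4):
    """param_map: (1, P, H, W) grid of flattened MLP params (e.g., from a shared 2D CNN);
    xy: (N, 2) plane coordinates in [-1, 1]; feat: (N, in_dim) per-point input features."""
    grid = xy.view(1, -1, 1, 2)                                   # (1, N, 1, 2)
    params = F.grid_sample(param_map, grid, align_corners=True)   # (1, P, N, 1)
    params = params[0, :, :, 0].t()                               # (N, P)
    w1, rest = params[:, :in_dim * hid], params[:, in_dim * hid:]
    b1, w2 = rest[:, :hid], rest[:, hid:hid + hid * out_dim]
    b2 = rest[:, hid + hid * out_dim:]
    h = torch.relu(torch.einsum('ni,nih->nh', feat, w1.view(-1, in_dim, hid)) + b1)
    return torch.einsum('nh,nho->no', h, w2.view(-1, hid, out_dim)) + b2

P = 16 * 16 + 16 + 16 * 4 + 4                    # total flattened parameter count
param_map = torch.randn(1, P, 32, 32)            # stand-in for the CNN-predicted map
out = eval_mlp_map(param_map, torch.rand(100, 2) * 2 - 1, torch.randn(100, 16))
print(out.shape)                                 # torch.Size([100, 4])
```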
2304.06712 Report What does CLIP know about a red circle? Visual prompt engineering for VLMs Aleksandar Shtedritski, Christian Rupprecht, Andrea Vedaldi Large-scale Vision-Language Models, such as CLIP, learn powerful image-text representations that have found numerous applications, from zero-shot classification to text-to-image generation. Despite that, their capabilities for solving novel discriminative tasks via prompting fall behind those of large language models, such as GPT-3. Here we explore the idea of visual prompt engineering for solving computer vision tasks beyond classification by editing in image space instead of text. In particular, we discover an emergent ability of CLIP, where, by simply drawing a red circle around an object, we can direct the model's attention to that region, while also maintaining global information. We show the power of this simple approach by achieving state-of-the-art in zero-shot referring expressions comprehension and strong performance in keypoint localization tasks. Finally, we draw attention to some potential ethical concerns of large language-vision models. This paper explores visual prompt engineering in Vision-Language Models (VLMs) by introducing a simple yet effective technique: marking image regions with a red circle to guide the model's attention. This approach aims to enhance the ability of VLMs to solve novel discriminative tasks beyond classification, bridging the gap between their capabilities and those of large language models. The researchers experiment with different visual prompt engineering techniques, including cropping and marking. They evaluate their method on three zero-shot tasks: naming keypoints, localizing keypoints, and referring expression comprehension. Marking with red circles significantly outperforms cropping and random baselines in all tasks. The effectiveness of red circles is attributed to their presence, albeit rare, in the VLM training data (e.g., YFCC15M). The authors achieve state-of-the-art zero-shot performance on referring expression comprehension, surpassing methods that use image cropping and manually designed relation rules. The reliance on the presence of specific markers in the training data may limit generalization. The study reveals potential ethical concerns as VLMs can learn and amplify biases present in the training data, such as associating red circles with negative connotations (e.g., missing persons or criminals). visual prompt engineering, vision-language models, zero-shot learning, referring expression comprehension, ethical bias in ai
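A hedged sketch of the red-circle visual prompt: draw a red ellipse around each candidate region and score the marked image against a text query with CLIP; the highest-scoring circle marks the referred region. This assumes the openai/CLIP reference implementation is installed as `clip`; the image path, boxes, and circle thickness are hypothetical.

```python
# Sketch: scoring red-circle-marked candidates with CLIP for referring comprehension.
import torch
from PIL import Image, ImageDraw
import clip

def mark_with_red_circle(image: Image.Image, box, width: int = 4) -> Image.Image:
    """box: (x0, y0, x1, y1) region to highlight with a red ellipse."""
    marked = image.copy()
    ImageDraw.Draw(marked).ellipse(box, outline=(255, 0, 0), width=width)
    return marked

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = Image.open("scene.jpg").convert("RGB")            # hypothetical input image
candidates = [(30, 40, 120, 160), (200, 50, 300, 180)]    # hypothetical candidate boxes
text = clip.tokenize(["the person wearing a red hat"]).to(device)

with torch.no_grad():
    text_feat = model.encode_text(text)
    scores = []
    for box in candidates:
        img = preprocess(mark_with_red_circle(image, box)).unsqueeze(0).to(device)
        img_feat = model.encode_image(img)
        scores.append(torch.cosine_similarity(img_feat, text_feat).item())
print(scores)   # pick the box whose marked image best matches the expression
```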
2304.06711 Report DiffusionRig: Learning Personalized Priors for Facial Appearance Editing Zheng Ding, Xuaner Zhang, Zhihao Xia, Lars Jebe, Zhuowen Tu, Xiuming Zhang We address the problem of learning person-specific facial priors from a small number (e.g., 20) of portrait photos of the same person. This enables us to edit this specific person's facial appearance, such as expression and lighting, while preserving their identity and high-frequency facial details. Key to our approach, which we dub DiffusionRig, is a diffusion model conditioned on, or "rigged by," crude 3D face models estimated from single in-the-wild images by an off-the-shelf estimator. On a high level, DiffusionRig learns to map simplistic renderings of 3D face models to realistic photos of a given person. Specifically, DiffusionRig is trained in two stages: It first learns generic facial priors from a large-scale face dataset and then person-specific priors from a small portrait photo collection of the person of interest. By learning the CGI-to-photo mapping with such personalized priors, DiffusionRig can "rig" the lighting, facial expression, head pose, etc. of a portrait photo, conditioned only on coarse 3D models while preserving this person's identity and other high-frequency characteristics. Qualitative and quantitative experiments show that DiffusionRig outperforms existing approaches in both identity preservation and photorealism. Please see the project website: https://diffusionrig.github.io for the supplemental material, video, code, and data. Proposes DiffusionRig, a diffusion model that learns personalized priors for facial appearance editing from a small set of portrait photos, enabling controllable edits while preserving identity and high-frequency details. Addresses limitations of zero-shot facial appearance editing methods that struggle to preserve individual-specific features. Two-stage training: (1) Learns generic facial priors from a large-scale face dataset using a diffusion model conditioned on physical buffers (normals, albedo, Lambertian rendering) extracted by DECA. (2) Fine-tunes the model on a small set (around 20) of a specific person's photos to capture personalized priors. Achieves convincing appearance edits (relighting, expression, pose) while preserving identity. Outperforms existing methods in both identity preservation and photorealism, as shown quantitatively and via user study. Demonstrates disentanglement of physical properties from global appearance information (hairstyle, accessories) by swapping global latent codes. Scalability: Requires finetuning for each individual, limiting practicality for massive user adoption. Background Inconsistency: May struggle with background preservation during dramatic head pose changes. diffusion models, facial appearance editing, personalized priors, 3d morphable models, image generation
2304.06706 Report Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, Peter Hedman Neural Radiance Field training can be accelerated through the use of grid-based representations in NeRF's learned mapping from spatial coordinates to colors and volumetric density. However, these grid-based approaches lack an explicit understanding of scale and therefore often introduce aliasing, usually in the form of jaggies or missing scene content. Anti-aliasing has previously been addressed by mip-NeRF 360, which reasons about sub-volumes along a cone rather than points along a ray, but this approach is not natively compatible with current grid-based techniques. We show how ideas from rendering and signal processing can be used to construct a technique that combines mip-NeRF 360 and grid-based models such as Instant NGP to yield error rates that are 8% - 77% lower than either prior technique, and that trains 24x faster than mip-NeRF 360. Presents Zip-NeRF, a novel architecture that combines the advantages of grid-based NeRF models (like Instant NGP) and scale-aware anti-aliased NeRFs (like mip-NeRF 360). Grid-based NeRFs, while fast, lack an inherent understanding of scale, leading to aliasing. Mip-NeRF 360 addresses aliasing but is not directly compatible with grid-based techniques. This work bridges this gap, aiming for both fast and high-quality rendering. Employs multisampling and feature downweighting to integrate iNGP's grid pyramid into mip-NeRF 360. Introduces an anti-aliased loss function to address z-aliasing arising from the proposal network. Reduces error rates by 8% - 77% compared to previous techniques on the mip-NeRF 360 and a proposed multiscale benchmark. Achieves a 24x speedup in training time compared to mip-NeRF 360. Demonstrates superior visual quality, particularly in recovering thin structures and fine details, as shown in comparative renderings. The rendering time, while not a focus, is not significantly improved. Further investigation into reducing the number of samples required without sacrificing quality. neural radiance fields, anti-aliasing, multisampling, grid-based nerf, inverse rendering
2304.06700 Report Control3Diff: Learning Controllable 3D Diffusion Models from Single-view Images Jiatao Gu, Qingzhe Gao, Shuangfei Zhai, Baoquan Chen, Lingjie Liu, Josh Susskind Diffusion models have recently become the de-facto approach for generative modeling in the 2D domain. However, extending diffusion models to 3D is challenging due to the difficulties in acquiring 3D ground truth data for training. On the other hand, 3D GANs that integrate implicit 3D representations into GANs have shown remarkable 3D-aware generation when trained only on single-view image datasets. However, 3D GANs do not provide straightforward ways to precisely control image synthesis. To address these challenges, We present Control3Diff, a 3D diffusion model that combines the strengths of diffusion models and 3D GANs for versatile, controllable 3D-aware image synthesis for single-view datasets. Control3Diff explicitly models the underlying latent distribution (optionally conditioned on external inputs), thus enabling direct control during the diffusion process. Moreover, our approach is general and applicable to any type of controlling input, allowing us to train it with the same diffusion objective without any auxiliary supervision. We validate the efficacy of Control3Diff on standard image generation benchmarks, including FFHQ, AFHQ, and ShapeNet, using various conditioning inputs such as images, sketches, and text prompts. Please see the project website (\url{https://jiataogu.me/control3diff}) for video comparisons. Presents Control3Diff, a 3D diffusion model for controllable 3D-aware image synthesis from single-view images by linking diffusion models to 3D GANs. Addresses the limitations of diffusion models in 3D generation due to the lack of 3D ground truth data and the difficulty in defining energy functions for guidance in the latent space. Leverages pre-trained 3D GANs to sample latent representations (tri-planes) and trains diffusion models on these representations for both unconditional and conditional generation, enabling precise control over 3D properties. Significantly outperforms existing 3D GAN inversion baselines on image-to-3D inversion tasks. Achieves comparable or better results than Pix2Pix3D on Seg-to-3D and Edge-to-3D tasks. Demonstrates the versatility of the framework by applying it to Text-to-3D generation and editing. Mode collapse in the learned latent space of 3D GANs can limit the diversity of generated samples. The iterative nature of diffusion models results in a slower generation process compared to encoder-based approaches. diffusion models, 3d gans, controllable image synthesis, single-view reconstruction, 3d-aware generation
2304.06648 Report DiffFit: Unlocking Transferability of Large Diffusion Models via Simple Parameter-Efficient Fine-Tuning Enze Xie, Lewei Yao, Han Shi, Zhili Liu, Daquan Zhou, Zhaoqiang Liu, Jiawei Li, Zhenguo Li Diffusion models have proven to be highly effective in generating high-quality images. However, adapting large pre-trained diffusion models to new domains remains an open challenge, which is critical for real-world applications. This paper proposes DiffFit, a parameter-efficient strategy to fine-tune large pre-trained diffusion models that enable fast adaptation to new domains. DiffFit is embarrassingly simple that only fine-tunes the bias term and newly-added scaling factors in specific layers, yet resulting in significant training speed-up and reduced model storage costs. Compared with full fine-tuning, DiffFit achieves 2$\times$ training speed-up and only needs to store approximately 0.12\% of the total model parameters. Intuitive theoretical analysis has been provided to justify the efficacy of scaling factors on fast adaptation. On 8 downstream datasets, DiffFit achieves superior or competitive performances compared to the full fine-tuning while being more efficient. Remarkably, we show that DiffFit can adapt a pre-trained low-resolution generative model to a high-resolution one by adding minimal cost. Among diffusion-based methods, DiffFit sets a new state-of-the-art FID of 3.02 on ImageNet 512$\times$512 benchmark by fine-tuning only 25 epochs from a public pre-trained ImageNet 256$\times$256 checkpoint while being 30$\times$ more training efficient than the closest competitor. This paper proposes DiffFit, a parameter-efficient strategy for fine-tuning large pre-trained diffusion models based on DiT, enabling fast adaptation to new domains. Adapting large pre-trained diffusion models (like DiT) to new domains is important for real-world applications but remains a challenge due to the computational cost and storage requirements of full fine-tuning. DiffFit freezes most parameters of a pre-trained diffusion model and only fine-tunes the bias term, normalization, class embedding, and newly-added scaling factors in specific layers. DiffFit achieves 2x training speed-up and only needs to store 0.12% of the total model parameters compared to full fine-tuning. Evaluation on 8 downstream datasets shows DiffFit achieves superior or competitive performance compared to full fine-tuning while being more efficient. DiffFit sets a new state-of-the-art FID of 3.02 on ImageNet 512x512 benchmark by fine-tuning only 25 epochs from a pre-trained ImageNet 256x256 checkpoint, being 30x more training efficient than the closest competitor. The experiments mainly focus on class-conditional image generation and it is unclear if DiffFit can generalize to more complex tasks like text-to-image or video generation. Further investigation is needed to determine if the scaling factor's effectiveness extends to deeper layers of the model. diffusion models, parameter-efficient fine-tuning, image generation, diffusion transformer (dit), transfer learning
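The parameter-selection recipe summarized above is easy to reproduce. The PyTorch sketch below is a minimal illustration under stated assumptions, not DiffFit's released code: it freezes a toy backbone, re-enables bias and normalization parameters, and wraps each block with a newly added scale factor initialized to one (the DiT class embedding, which DiffFit also keeps trainable, is omitted here).

```python
import torch
import torch.nn as nn

class ScaledBlock(nn.Module):
    """Wrap an existing block with a newly added learnable scaling factor (init 1)."""
    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block
        self.scale = nn.Parameter(torch.ones(1))
    def forward(self, x):
        return self.scale * self.block(x)

def apply_difffit_style_freezing(model: nn.Module):
    """Freeze everything, then re-enable bias terms, normalization layers and the
    newly added scale factors. (The real DiffFit also keeps DiT's class embedding
    trainable; that part is omitted in this toy sketch.)"""
    for p in model.parameters():
        p.requires_grad = False
    for module in model.modules():
        if isinstance(module, (nn.LayerNorm, nn.GroupNorm, nn.BatchNorm2d)):
            for p in module.parameters():
                p.requires_grad = True
        if isinstance(module, ScaledBlock):
            module.scale.requires_grad = True
    for name, p in model.named_parameters():
        if name.endswith(".bias"):
            p.requires_grad = True

# Toy usage: only biases, norm parameters and scales end up in the optimizer.
backbone_blocks = [nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.LayerNorm(64)) for _ in range(4)]
model = nn.Sequential(*[ScaledBlock(b) for b in backbone_blocks])
apply_difffit_style_freezing(model)
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)
print(sum(p.numel() for p in trainable), "trainable of", sum(p.numel() for p in model.parameters()))
```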
2304.06544 Report DNeRV: Modeling Inherent Dynamics via Difference Neural Representation for Videos Qi Zhao, M. Salman Asif, Zhan Ma Existing implicit neural representation (INR) methods do not fully exploit spatiotemporal redundancies in videos. Index-based INRs ignore the content-specific spatial features and hybrid INRs ignore the contextual dependency on adjacent frames, leading to poor modeling capability for scenes with large motion or dynamics. We analyze this limitation from the perspective of function fitting and reveal the importance of frame difference. To use explicit motion information, we propose Difference Neural Representation for Videos (DNeRV), which consists of two streams for content and frame difference. We also introduce a collaborative content unit for effective feature fusion. We test DNeRV for video compression, inpainting, and interpolation. DNeRV achieves competitive results against the state-of-the-art neural compression approaches and outperforms existing implicit methods on downstream inpainting and interpolation for $960 \times 1920$ videos. The paper proposes Difference Neural Representation for Videos (DNeRV), a novel implicit neural representation method that leverages frame differences to improve video representation, particularly in scenes with large motion or dynamic elements. Existing NeRV methods struggle to effectively model content-specific spatial features and temporal correlations simultaneously, leading to poor performance in videos with significant motion. DNeRV employs a two-stream architecture, processing both the original frame (content stream) and frame differences (diff stream). A novel Collaborative Content Unit (CCU) fuses features from both streams adaptively, enhancing the representation's ability to capture adjacent dynamics. DNeRV outperforms existing NeRV methods on video regression tasks for benchmark datasets like Bunny and UVG. It demonstrates superior performance in downstream tasks, including video compression, interpolation, and inpainting, compared to other implicit methods, particularly for high-resolution videos. The incorporation of the diff stream and CCU contributes to more robust and efficient learning of the implicit mapping in videos with large motion. DNeRV might face challenges in accurately representing detailed textures due to the nature of frame differences. Future work includes exploring higher-order frame differences and extending DNeRV for specific video-related tasks. implicit neural representation, video representation learning, video interpolation, video inpainting, video compression
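A rough sketch of the two-stream idea follows; the gated fusion below is a plain convolutional stand-in for the paper's Collaborative Content Unit, and all channel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    """Toy two-stream encoder: one branch sees the current frame, the other the
    frame differences to its neighbours; a gated unit (a stand-in for DNeRV's CCU)
    adaptively mixes the two feature maps."""
    def __init__(self, channels=32):
        super().__init__()
        self.content = nn.Conv2d(3, channels, 3, padding=1)
        self.diff = nn.Conv2d(6, channels, 3, padding=1)      # forward + backward difference
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())

    def forward(self, prev_frame, frame, next_frame):
        d = torch.cat([frame - prev_frame, next_frame - frame], dim=1)  # explicit motion cue
        fc, fd = self.content(frame), self.diff(d)
        g = self.gate(torch.cat([fc, fd], dim=1))             # per-pixel mixing weights
        return g * fc + (1 - g) * fd                          # fused spatio-temporal feature

frames = torch.rand(3, 1, 3, 64, 64)                          # (t-1, t, t+1) toy clip
fused = TwoStreamFusion()(frames[0], frames[1], frames[2])
print(fused.shape)                                            # torch.Size([1, 32, 64, 64])
```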
2304.06461 Report Multi-Mode Online Knowledge Distillation for Self-Supervised Visual Representation Learning Kaiyou Song, Jin Xie, Shan Zhang, Zimeng Luo Self-supervised learning (SSL) has made remarkable progress in visual representation learning. Some studies combine SSL with knowledge distillation (SSL-KD) to boost the representation learning performance of small models. In this study, we propose a Multi-mode Online Knowledge Distillation method (MOKD) to boost self-supervised visual representation learning. Different from existing SSL-KD methods that transfer knowledge from a static pre-trained teacher to a student, in MOKD, two different models learn collaboratively in a self-supervised manner. Specifically, MOKD consists of two distillation modes: self-distillation and cross-distillation modes. Among them, self-distillation performs self-supervised learning for each model independently, while cross-distillation realizes knowledge interaction between different models. In cross-distillation, a cross-attention feature search strategy is proposed to enhance the semantic feature alignment between different models. As a result, the two models can absorb knowledge from each other to boost their representation learning performance. Extensive experimental results on different backbones and datasets demonstrate that two heterogeneous models can benefit from MOKD and outperform their independently trained baseline. In addition, MOKD also outperforms existing SSL-KD methods for both the student and teacher models. This paper proposes MOKD, a Multi-mode Online Knowledge Distillation method for boosting self-supervised visual representation learning. Existing SSL-KD methods use a static teacher, limiting the student's learning potential. MOKD enables collaborative learning between two models, potentially boosting both. MOKD uses self-distillation (individual contrastive learning) and cross-distillation (knowledge transfer between models) with a cross-attention feature search strategy for semantic alignment. MOKD significantly improves representation learning in both heterogeneous (ResNet-ViT) and homogeneous (two ResNets or two ViTs) model pairs. MOKD outperforms state-of-the-art SSL methods in linear probing and k-NN evaluations on ImageNet. Heterogeneous models trained with MOKD exhibit knowledge transfer, with ViT becoming more locally focused and ResNet becoming more globally focused. Training larger models repeatedly for different smaller models increases computation cost compared to offline distillation. Future work can explore efficient fine-tuning methods to improve MOKD's efficiency. self-supervised learning (ssl), knowledge distillation, contrastive learning, representation learning, computer vision
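The two distillation modes can be sketched as paired contrastive terms over projected features from the two collaborating models. The snippet below keeps only that two-mode structure and substitutes a plain InfoNCE objective; MOKD's cross-attention feature search and momentum encoders are omitted, so treat this purely as a structural sketch.

```python
import torch
import torch.nn.functional as F

def mokd_style_loss(feat_a, feat_b, feat_a_aug, feat_b_aug, temperature=0.1):
    """Toy combination of self- and cross-distillation on L2-normalised features.

    feat_* are (batch, dim) embeddings of the same images from two models
    (e.g. a ResNet and a ViT), each under two augmented views.
    """
    def info_nce(q, k):
        q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
        logits = q @ k.t() / temperature                      # similarities to all keys
        labels = torch.arange(q.size(0), device=q.device)     # positives on the diagonal
        return F.cross_entropy(logits, labels)

    self_distill = info_nce(feat_a, feat_a_aug) + info_nce(feat_b, feat_b_aug)
    cross_distill = info_nce(feat_a, feat_b_aug) + info_nce(feat_b, feat_a_aug)
    return self_distill + cross_distill

loss = mokd_style_loss(*(torch.randn(8, 128) for _ in range(4)))
print(loss.item())
```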
2304.06440 Report Zoom-VQA: Patches, Frames and Clips Integration for Video Quality Assessment Kai Zhao, Kun Yuan, Ming Sun, Xing Wen Video quality assessment (VQA) aims to simulate the human perception of video quality, which is influenced by factors ranging from low-level color and texture details to high-level semantic content. To effectively model these complicated quality-related factors, in this paper, we decompose video into three levels (\ie, patch level, frame level, and clip level), and propose a novel Zoom-VQA architecture to perceive spatio-temporal features at different levels. It integrates three components: patch attention module, frame pyramid alignment, and clip ensemble strategy, respectively for capturing region-of-interest in the spatial dimension, multi-level information at different feature levels, and distortions distributed over the temporal dimension. Owing to the comprehensive design, Zoom-VQA obtains state-of-the-art results on four VQA benchmarks and achieves 2nd place in the NTIRE 2023 VQA challenge. Notably, Zoom-VQA has outperformed the previous best results on two subsets of LSVQ, achieving 0.8860 (+1.0%) and 0.7985 (+1.9%) of SRCC on the respective subsets. Adequate ablation studies further verify the effectiveness of each component. Codes and models are released in https://github.com/k-zha14/Zoom-VQA. This paper proposes Zoom-VQA, a novel video quality assessment (VQA) framework that integrates information from patches, frames, and clips to better model human perception of video quality. Accurately assessing video quality is crucial for optimizing user experience on streaming platforms, especially with the increasing use of AI-based video enhancement techniques that introduce new types of artifacts. Zoom-VQA consists of two branches: an image-based branch (IQA) for global information and a clip-based branch (VQA) for local texture information. The IQA branch utilizes a patch attention module and frame pyramid alignment to capture spatial details at multiple feature levels. The VQA branch leverages a clip ensemble strategy and patch head expansion to model temporal dynamics and low-level texture information effectively. Zoom-VQA achieves state-of-the-art results on four VQA benchmarks (VDPVE, LSVQ, KoNViD-1k, and LIVE-VQC). It outperforms previous best methods on LSVQ subsets, achieving 0.8860 SRCC on LSVQ_test (+1.0%) and 0.7985 SRCC on LSVQ_1080p (+1.9%). Zoom-VQA secured 2nd place in the NTIRE 2023 VQA Challenge, demonstrating its strong generalization ability. The current implementation primarily focuses on No-Reference VQA; incorporating reference information could further enhance performance. Exploring the impact of different fragment sizes and sampling strategies in the VQA branch is an area for future investigation. video quality assessment, deep learning, vision transformer, multi-level feature fusion, spatio-temporal analysis
2304.06419 Report Tracking by 3D Model Estimation of Unknown Objects in Videos Denys Rozumnyi, Jiri Matas, Marc Pollefeys, Vittorio Ferrari, Martin R. Oswald Most model-free visual object tracking methods formulate the tracking task as object location estimation given by a 2D segmentation or a bounding box in each video frame. We argue that this representation is limited and instead propose to guide and improve 2D tracking with an explicit object representation, namely the textured 3D shape and 6DoF pose in each video frame. Our representation tackles a complex long-term dense correspondence problem between all 3D points on the object for all video frames, including frames where some points are invisible. To achieve that, the estimation is driven by re-rendering the input video frames as well as possible through differentiable rendering, which has not been used for tracking before. The proposed optimization minimizes a novel loss function to estimate the best 3D shape, texture, and 6DoF pose. We improve the state-of-the-art in 2D segmentation tracking on three different datasets with mostly rigid objects. This paper introduces a novel model-free object tracking method that goes beyond 2D segmentation by jointly estimating the 3D shape, texture, and 6DoF pose of unknown, rigid objects in videos. This approach provides a richer object representation compared to standard 2D trackers, enabling applications like augmented reality and object manipulation. The method leverages differentiable rendering to optimize the object parameters for accurately reconstructing the input video frames, guided by initial 2D segmentations from a standard tracker. A keyframe selection strategy ensures efficient optimization over long sequences. The method outperforms state-of-the-art 2D trackers in segmentation accuracy on datasets featuring rigid objects. It demonstrates robustness to challenging scenarios, including object rotations and illumination changes, especially when using robust features like S2DNet. Despite not requiring a pre-defined 3D model, the method achieves competitive 6DoF pose estimation results on the TUD-L benchmark. The current implementation relies on the assumption of object rigidity, limiting its applicability to certain scenarios. The method's runtime can be further optimized for real-time performance. object tracking, 3d reconstruction, differentiable rendering, 6dof pose estimation, deep surface texture
2304.06408 Report Intriguing properties of synthetic images: from generative adversarial networks to diffusion models Riccardo Corvi, Davide Cozzolino, Giovanni Poggi, Koki Nagano, Luisa Verdoliva Detecting fake images is becoming a major goal of computer vision. This need is becoming more and more pressing with the continuous improvement of synthesis methods based on Generative Adversarial Networks (GAN), and even more with the appearance of powerful methods based on Diffusion Models (DM). Towards this end, it is important to gain insight into which image features better discriminate fake images from real ones. In this paper we report on our systematic study of a large number of image generators of different families, aimed at discovering the most forensically relevant characteristics of real and generated images. Our experiments provide a number of interesting observations and shed light on some intriguing properties of synthetic images: (1) not only the GAN models but also the DM and VQ-GAN (Vector Quantized Generative Adversarial Networks) models give rise to visible artifacts in the Fourier domain and exhibit anomalous regular patterns in the autocorrelation; (2) when the dataset used to train the model lacks sufficient variety, its biases can be transferred to the generated images; (3) synthetic and real images exhibit significant differences in the mid-high frequency signal content, observable in their radial and angular spectral power distributions. This paper presents a systematic investigation into the traces left by various generative models, including GANs, VQ-GANs, and diffusion models, in synthetic images by examining their second-order statistics in both spatial and frequency domains. With the increasing sophistication of synthetic image generators, it becomes crucial to understand the characteristic features that distinguish them from real images for developing robust forensic detectors. The study analyzes a large dataset of synthetic images generated using various models, alongside real images. The analysis focuses on autocorrelation functions for spatial domain analysis and power spectra, including radial and angular spectra, for frequency domain analysis. All examined image generators, even the most sophisticated ones, introduce specific artifacts detectable in the spatial or frequency domain. The training dataset used for a generative model can significantly bias the generated images, transferring artifacts present in the training data to the synthetic images. Generative models often struggle to accurately reproduce the spectral distribution of real images at mid-high frequencies, leading to discrepancies in radial and angular spectra. The study primarily focuses on analyzing images 'in the lab,' without considering the impact of post-processing operations commonly applied to images in real-world scenarios. The analysis relies on a limited number of datasets for training and evaluation, potentially limiting the generalizability of findings to other datasets. synthetic image detection, generative adversarial networks (gans), diffusion models, image forensics, frequency analysis
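The frequency-domain fingerprints discussed here can be probed with a few lines of NumPy. The snippet below computes an azimuthally averaged (radial) power spectrum of an image; the paper additionally works on denoised residuals and inspects autocorrelations and angular spectra, which this sketch omits.

```python
import numpy as np

def radial_power_spectrum(image: np.ndarray, n_bins: int = 64):
    """Azimuthally averaged power spectrum of a grayscale image.

    Excess energy at mid/high frequencies, or regular grid-like spikes, are the
    kinds of artifacts the paper reports for GAN-, VQ-GAN- and DM-generated images.
    """
    f = np.fft.fftshift(np.fft.fft2(image - image.mean()))
    power = np.abs(f) ** 2
    h, w = image.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - h / 2, xx - w / 2)                      # distance from the DC component
    r_norm = r / r.max()
    bins = np.linspace(0, 1, n_bins + 1)
    idx = np.digitize(r_norm.ravel(), bins) - 1
    spectrum = np.bincount(idx, weights=power.ravel(), minlength=n_bins)
    counts = np.bincount(idx, minlength=n_bins)
    return spectrum[:n_bins] / np.maximum(counts[:n_bins], 1)

rng = np.random.default_rng(0)
print(radial_power_spectrum(rng.standard_normal((256, 256)))[:5])
```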
2304.06345 Report ASR: Attention-alike Structural Re-parameterization Shanshan Zhong, Zhongzhan Huang, Wushao Wen, Jinghui Qin, Liang Lin The structural re-parameterization (SRP) technique is a novel deep learning technique that achieves interconversion between different network architectures through equivalent parameter transformations. This technique enables the mitigation of the extra costs for performance improvement during training, such as parameter size and inference time, through these transformations during inference, and therefore SRP has great potential for industrial and practical applications. The existing SRP methods have successfully considered many commonly used architectures, such as normalizations, pooling methods, and multi-branch convolution. However, the widely used attention modules which drastically slow inference speed cannot be directly implemented by SRP due to these modules usually act on the backbone network in a multiplicative manner and the modules' output is input-dependent during inference, which limits the application scenarios of SRP. In this paper, we conduct extensive experiments from a statistical perspective and discover an interesting phenomenon Stripe Observation, which reveals that channel attention values quickly approach some constant vectors during training. This observation inspires us to propose a simple-yet-effective attention-alike structural re-parameterization (ASR) that allows us to achieve SRP for a given network while enjoying the effectiveness of the attention mechanism. Extensive experiments conducted on several standard benchmarks demonstrate the effectiveness of ASR in generally improving the performance of existing backbone networks, attention modules, and SRP methods without any elaborated model crafting. We also analyze the limitations and provide experimental and theoretical evidence for the strong robustness of the proposed ASR. This paper introduces Attention-alike Structural Re-parameterization (ASR), a novel method enabling the integration of channel attention mechanisms into Structural Re-parameterization (SRP) techniques for deep learning models. Existing SRP methods struggle to incorporate attention modules due to their multiplicative and input-dependent nature, limiting the application of SRP despite its potential for improving model performance. ASR addresses this challenge, allowing the benefits of attention without extra parameters or computational cost during inference. Inspired by the observed "Stripe Observation" where channel attention values converge to constant vectors during training, ASR uses a learnable vector as input for the attention module, enabling its merging into the backbone during inference without impacting performance. ASR consistently improves performance across various backbone models (ResNet, VGG, ShuffleNetV2, MobileNet, ViT) and datasets (ImageNet, STL10, CIFAR10/100), with accuracy improvements up to 2.77%. ASR demonstrates strong compatibility with existing attention modules, further enhancing the performance of models already employing attention. ASR proves compatible with other SRP methods, such as RepVGG and ACNet, showcasing its versatility and potential for integration with existing model optimization techniques. ASR's current formulation primarily focuses on channel attention and doesn't directly transfer to spatial or transformer-based attention mechanisms. While validated in classification tasks, ASR's effectiveness in more complex downstream tasks requires further investigation, potentially through designing specialized attention modules within the ASR paradigm. structural re-parameterization, attention mechanism, deep learning, model compression, computer vision
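The re-parameterization that the Stripe Observation enables is straightforward: once a channel-attention branch reduces to a constant per-channel vector (or, as in ASR, is driven by a learned constant input), the multiplication can be folded into the preceding convolution at inference. A hedged sketch of that folding step, not of the full ASR block:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fold_constant_channel_attention(conv: nn.Conv2d, attn: torch.Tensor) -> nn.Conv2d:
    """Fold y = attn * conv(x), with attn a constant per-output-channel vector, into the conv.

    This mirrors the idea that an attention module driven by a learned constant input
    reduces to a fixed channel-wise scale, which structural re-parameterization can
    absorb; the exact ASR formulation also involves the surrounding block.
    """
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups,
                      bias=conv.bias is not None)
    fused.weight.copy_(conv.weight * attn.view(-1, 1, 1, 1))
    if conv.bias is not None:
        fused.bias.copy_(conv.bias * attn)
    return fused

conv = nn.Conv2d(16, 32, 3, padding=1, bias=True)
attn = torch.sigmoid(torch.randn(32))                 # stand-in for the converged attention vector
x = torch.randn(1, 16, 8, 8)
fused = fold_constant_channel_attention(conv, attn)
print(torch.allclose(attn.view(1, -1, 1, 1) * conv(x), fused(x), atol=1e-6))  # True
```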
2304.06247 Report ShapeClipper: Scalable 3D Shape Learning from Single-View Images via Geometric and CLIP-based Consistency Zixuan Huang, Varun Jampani, Anh Thai, Yuanzhen Li, Stefan Stojanov, James M. Rehg We present ShapeClipper, a novel method that reconstructs 3D object shapes from real-world single-view RGB images. Instead of relying on laborious 3D, multi-view or camera pose annotation, ShapeClipper learns shape reconstruction from a set of single-view segmented images. The key idea is to facilitate shape learning via CLIP-based shape consistency, where we encourage objects with similar CLIP encodings to share similar shapes. We also leverage off-the-shelf normals as an additional geometric constraint so the model can learn better bottom-up reasoning of detailed surface geometry. These two novel consistency constraints, when used to regularize our model, improve its ability to learn both global shape structure and local geometric details. We evaluate our method over three challenging real-world datasets, Pix3D, Pascal3D+, and OpenImages, where we achieve superior performance over state-of-the-art methods. ShapeClipper is a novel method for reconstructing 3D object shapes from single-view RGB images without relying on 3D, multi-view, or viewpoint supervision. Existing methods for 3D shape reconstruction rely on laborious annotations that are not scalable to real-world scenarios. ShapeClipper leverages CLIP-based shape consistency to encourage objects with similar CLIP encodings to share similar shapes. It also uses off-the-shelf surface normals as additional geometric constraints to improve bottom-up reasoning of surface details. ShapeClipper achieves superior performance over state-of-the-art methods on Pix3D, Pascal3D+, and OpenImages datasets. CLIP-based shape consistency effectively improves top-down reasoning of global shape structure. Geometric constraints using off-the-shelf normals enhance the reconstruction of local geometric details. ShapeClipper struggles with heavily occluded or deformable object categories. Future work could explore explicitly handling shape misalignment in the semantic constraint. 3d reconstruction, single-view reconstruction, clip, shape consistency, geometric constraints
2304.06212 Report [CLS] Token is All You Need for Zero-Shot Semantic Segmentation Letian Wu, Wenyao Zhang, Tengping Jiang, Wankou Yang, Xin Jin, Wenjun Zeng In this paper, we propose an embarrassingly simple yet highly effective zero-shot semantic segmentation (ZS3) method, based on the pre-trained vision-language model CLIP. First, our study provides a couple of key discoveries: (i) the global tokens (a.k.a [CLS] tokens in Transformer) of the text branch in CLIP provide a powerful representation of semantic information and (ii) these text-side [CLS] tokens can be regarded as category priors to guide CLIP visual encoder pay more attention on the corresponding region of interest. Based on that, we build upon the CLIP model as a backbone which we extend with a One-Way [CLS] token navigation from text to the visual branch that enables zero-shot dense prediction, dubbed \textbf{ClsCLIP}. Specifically, we use the [CLS] token output from the text branch, as an auxiliary semantic prompt, to replace the [CLS] token in shallow layers of the ViT-based visual encoder. This one-way navigation embeds such global category prior earlier and thus promotes semantic segmentation. Furthermore, to better segment tiny objects in ZS3, we further enhance ClsCLIP with a local zoom-in strategy, which employs a region proposal pre-processing and we get ClsCLIP+. Extensive experiments demonstrate that our proposed ZS3 method achieves a SOTA performance, and it is even comparable with those few-shot semantic segmentation methods. This paper presents ClsCLIP, a simple yet effective zero-shot semantic segmentation method that leverages the pre-trained vision-language model CLIP. Zero-shot semantic segmentation (ZS3) is important because it allows for the segmentation of novel, unseen categories without requiring any annotations, which is crucial for real-world applications where obtaining annotations for every possible category is infeasible. ClsCLIP extends CLIP for ZS3 by replacing the [CLS] token in the shallow layers of the ViT-based visual encoder with the text-side [CLS] token. This one-way navigation embeds global category priors, guiding the visual encoder to focus on relevant regions for segmentation. Furthermore, ClsCLIP+ enhances this by incorporating a region proposal pre-processing step using YOLO to provide object location priors, addressing the issue of missing tiny objects. ClsCLIP significantly outperforms other state-of-the-art ZS3 methods on PASCAL-5^i and COCO-20^i datasets. The one-way [CLS] token navigation effectively guides the visual encoder to focus on regions of interest, resulting in improved segmentation performance. ClsCLIP+, enhanced with region proposals, further improves performance, especially for segmenting tiny objects, and even surpasses some few-shot semantic segmentation methods. The performance of ClsCLIP+ is contingent on the accuracy of the region proposal generator. Future work could explore alternative region proposal methods or develop end-to-end trainable approaches that jointly optimize region proposal and segmentation. zero-shot semantic segmentation, vision-language models, clip, prompt learning, tiny object segmentation
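The one-way token navigation itself is a small change to a ViT forward pass. The toy sketch below overwrites the visual [CLS] slot with a projected text-side [CLS] embedding in the shallow layers only; dimensions, depth, and the projection layer are illustrative assumptions rather than CLIP's actual configuration.

```python
import torch
import torch.nn as nn

class TinyViTWithTextCLS(nn.Module):
    """Toy ViT in which the [CLS] slot of the first `n_navigate` layers is overwritten
    by a projected text-side [CLS] embedding, mimicking ClsCLIP's one-way token
    navigation."""
    def __init__(self, dim=64, depth=4, text_dim=32, n_navigate=2):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=8, stride=8)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.text_proj = nn.Linear(text_dim, dim)             # map text [CLS] to visual width
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True) for _ in range(depth))
        self.n_navigate = n_navigate

    def forward(self, image, text_cls):                       # image: (b,3,H,W), text_cls: (b,text_dim)
        tokens = self.patchify(image).flatten(2).transpose(1, 2)
        x = torch.cat([self.cls.expand(image.size(0), -1, -1), tokens], dim=1)
        prompt = self.text_proj(text_cls).unsqueeze(1)        # category prior from the text branch
        for i, block in enumerate(self.blocks):
            if i < self.n_navigate:                           # shallow layers only
                x = torch.cat([prompt, x[:, 1:]], dim=1)      # replace the visual [CLS] token
            x = block(x)
        return x[:, 1:]                                       # patch tokens feed the segmentation head

out = TinyViTWithTextCLS()(torch.randn(2, 3, 32, 32), torch.randn(2, 32))
print(out.shape)                                              # torch.Size([2, 16, 64])
```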
2304.06211 Report Boosting Video Object Segmentation via Space-time Correspondence Learning Yurong Zhang, Liulei Li, Wenguan Wang, Rong Xie, Li Song, Wenjun Zhang Current top-leading solutions for video object segmentation (VOS) typically follow a matching-based regime: for each query frame, the segmentation mask is inferred according to its correspondence to previously processed and the first annotated frames. They simply exploit the supervisory signals from the groundtruth masks for learning mask prediction only, without posing any constraint on the space-time correspondence matching, which, however, is the fundamental building block of such regime. To alleviate this crucial yet commonly ignored issue, we devise a correspondence-aware training framework, which boosts matching-based VOS solutions by explicitly encouraging robust correspondence matching during network learning. Through comprehensively exploring the intrinsic coherence in videos on pixel and object levels, our algorithm reinforces the standard, fully supervised training of mask segmentation with label-free, contrastive correspondence learning. Without neither requiring extra annotation cost during training, nor causing speed delay during deployment, nor incurring architectural modification, our algorithm provides solid performance gains on four widely used benchmarks, i.e., DAVIS2016&2017, and YouTube-VOS2018&2019, on the top of famous matching-based VOS solutions. This paper presents a novel training framework for Video Object Segmentation (VOS) that boosts the performance of matching-based methods by explicitly encouraging robust space-time correspondence learning during training. Existing matching-based VOS methods rely heavily on accurate correspondence matching between frames but lack explicit supervision for this crucial component, potentially leading to sub-optimal results. The proposed method leverages the intrinsic coherence of videos on both pixel and object levels. It introduces self-supervised contrastive learning objectives, enforcing pixel-level consistency and object-level coherence without requiring additional annotations. The framework significantly improves the performance of state-of-the-art matching-based VOS methods (STCN and XMem) on DAVIS and YouTube-VOS datasets. Ablation studies demonstrate the effectiveness of both pixel-level and object-level correspondence learning components. The method introduces minimal computational overhead during training and doesn't affect inference time. The current method mainly explores local consistency within consecutive frames, leaving long-term consistency as future work. The algorithm relies on an external memory module for storing past frames, which can be potentially improved for better efficiency and scalability. video object segmentation, correspondence learning, self-supervised learning, contrastive learning, computer vision
2304.06140 Report An Edit Friendly DDPM Noise Space: Inversion and Manipulations Inbar Huberman-Spiegelglas, Vladimir Kulikov, Tomer Michaeli Denoising diffusion probabilistic models (DDPMs) employ a sequence of white Gaussian noise samples to generate an image. In analogy with GANs, those noise maps could be considered as the latent code associated with the generated image. However, this native noise space does not possess a convenient structure, and is thus challenging to work with in editing tasks. Here, we propose an alternative latent noise space for DDPM that enables a wide range of editing operations via simple means, and present an inversion method for extracting these edit-friendly noise maps for any given image (real or synthetically generated). As opposed to the native DDPM noise space, the edit-friendly noise maps do not have a standard normal distribution and are not statistically independent across timesteps. However, they allow perfect reconstruction of any desired image, and simple transformations on them translate into meaningful manipulations of the output image (e.g. shifting, color edits). Moreover, in text-conditional models, fixing those noise maps while changing the text prompt, modifies semantics while retaining structure. We illustrate how this property enables text-based editing of real images via the diverse DDPM sampling scheme (in contrast to the popular non-diverse DDIM inversion). We also show how it can be used within existing diffusion-based editing methods to improve their quality and diversity. Webpage: https://inbarhub.github.io/DDPM_inversion This paper introduces a novel method for inverting denoising diffusion probabilistic models (DDPMs) by extracting edit-friendly noise maps that allow for diverse image editing. The native noise space in DDPMs is not conducive to intuitive edits. This new method enables meaningful image manipulations by working with an alternative, edit-friendly latent noise space. The authors extract edit-friendly noise maps for any given image in closed form, without optimization: auxiliary diffusion states are sampled independently from the forward process, and each DDPM reverse step is then solved for the noise map that reproduces the next state exactly. The extracted noise maps allow perfect reconstruction of the input image. Simple transformations on the noise maps translate to semantically meaningful edits in the output image (e.g., shifting, color adjustments). By fixing the noise maps and changing the text prompt in text-conditional models, edits can be applied while preserving image structure. The extracted noise maps do not follow the native DDPM noise statistics (they are not standard normal and are correlated across timesteps), so they cannot be treated as ordinary latent samples. Future work could expand the range of editable image properties. ddpm, diffusion models, image editing, latent space manipulation, text-guided image manipulation
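A minimal sketch of the closed-form extraction, assuming a generalized DDIM/DDPM step with eta = 1; the noise schedule and the dummy denoiser below are placeholders, not the models used in the paper.

```python
import torch

def edit_friendly_noise_maps(x0, eps_model, alpha_bar, eta=1.0):
    """Closed-form extraction of edit-friendly DDPM noise maps for a given image x0.

    alpha_bar: (T,) cumulative products of the noise schedule (steps 1..T).
    eps_model(x_t, t) stands in for the trained denoiser; any callable works here.
    States x_t are sampled independently from q(x_t | x0); each reverse step is then
    solved for the z_t that reproduces x_{t-1} exactly, so re-running the sampler
    with these maps reconstructs x0 perfectly.
    """
    a = torch.cat([torch.ones(1), alpha_bar])                 # a[0] = 1 so that x_0 = x0
    xs = [torch.sqrt(a[t]) * x0 + torch.sqrt(1 - a[t]) * torch.randn_like(x0) if t > 0 else x0
          for t in range(len(a))]
    zs = {}
    for t in range(len(alpha_bar), 1, -1):                    # step 1 -> 0 has zero variance here
        eps = eps_model(xs[t], t)
        x0_hat = (xs[t] - torch.sqrt(1 - a[t]) * eps) / torch.sqrt(a[t])
        sigma = eta * torch.sqrt((1 - a[t - 1]) / (1 - a[t]) * (1 - a[t] / a[t - 1]))
        mu = torch.sqrt(a[t - 1]) * x0_hat + torch.sqrt(1 - a[t - 1] - sigma ** 2) * eps
        zs[t] = (xs[t - 1] - mu) / sigma                      # noise map for reverse step t -> t-1
    return xs[-1], zs                                         # (x_T, {t: z_t})

alpha_bar = torch.linspace(0.9999, 0.01, 50)                  # toy schedule
x_T, zs = edit_friendly_noise_maps(torch.randn(1, 3, 16, 16),
                                   lambda x, t: torch.zeros_like(x), alpha_bar)
print(x_T.shape, len(zs))                                     # torch.Size([1, 3, 16, 16]) 49
```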
2304.06107 Report PATMAT: Person Aware Tuning of Mask-Aware Transformer for Face Inpainting Saman Motamed, Jianjin Xu, Chen Henry Wu, Fernando De la Torre Generative models such as StyleGAN2 and Stable Diffusion have achieved state-of-the-art performance in computer vision tasks such as image synthesis, inpainting, and de-noising. However, current generative models for face inpainting often fail to preserve fine facial details and the identity of the person, despite creating aesthetically convincing image structures and textures. In this work, we propose Person Aware Tuning (PAT) of Mask-Aware Transformer (MAT) for face inpainting, which addresses this issue. Our proposed method, PATMAT, effectively preserves identity by incorporating reference images of a subject and fine-tuning a MAT architecture trained on faces. By using ~40 reference images, PATMAT creates anchor points in MAT's style module, and tunes the model using the fixed anchors to adapt the model to a new face identity. Moreover, PATMAT's use of multiple images per anchor during training allows the model to use fewer reference images than competing methods. We demonstrate that PATMAT outperforms state-of-the-art models in terms of image quality, the preservation of person-specific details, and the identity of the subject. Our results suggest that PATMAT can be a promising approach for improving the quality of personalized face inpainting. PATMAT, a personalized face inpainting method that fine-tunes a pre-trained Mask-Aware Transformer (MAT) using a few reference images to preserve identity. Existing face inpainting methods struggle to preserve fine facial details and identity, which is crucial for applications like security, entertainment, and photo restoration. PATMAT utilizes a pre-trained MAT and conditions its style manipulation module by defining anchors in the noise-style space, inspired by Pivot Tuning. It uses multiple images per anchor and introduces regularization to prevent overfitting. PATMAT outperforms state-of-the-art models in image quality and identity preservation with limited reference images. A user study confirmed PATMAT-C (multiple images per anchor) preserves identity better than PATMAT-S (single image per anchor). Human judges struggled to distinguish PATMAT-C's inpainted images from real images, demonstrating its high perceptual quality. PATMAT's performance depends on the diversity and coverage of the reference images. It may struggle with poses, accessories, and lighting conditions not present in the training data. The method relies on manual data separation for grouping images with distinct features like glasses and lighting. Automating this step could improve the approach. face inpainting, identity preservation, mask-aware transformer, person aware tuning, style manipulation
2304.06061 Report CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes Maria Parelli, Alexandros Delitzas, Nikolas Hars, Georgios Vlassis, Sotirios Anagnostidis, Gregor Bachmann, Thomas Hofmann Training models to apply linguistic knowledge and visual concepts from 2D images to 3D world understanding is a promising direction that researchers have only recently started to explore. In this work, we design a novel 3D pre-training Vision-Language method that helps a model learn semantically meaningful and transferable 3D scene point cloud representations. We inject the representational power of the popular CLIP model into our 3D encoder by aligning the encoded 3D scene features with the corresponding 2D image and text embeddings produced by CLIP. To assess our model's 3D world reasoning capability, we evaluate it on the downstream task of 3D Visual Question Answering. Experimental quantitative and qualitative results show that our pre-training method outperforms state-of-the-art works in this task and leads to an interpretable representation of 3D scene features. This paper introduces a novel vision-language pre-training method for 3D Question Answering that aligns 3D scene features with corresponding 2D image and text embeddings from CLIP. This work addresses the gap in pre-training methods for 3D Question Answering that leverage both visual and linguistic modalities, aiming to improve 3D scene understanding. The authors design a 3D scene encoder based on VoteNet and a transformer and pre-train it by minimizing the cosine distance between the scene embedding and corresponding CLIP text and image embeddings. The pre-trained scene encoder significantly improves the performance on the ScanQA 3D-VQA benchmark compared to training from scratch. The proposed method outperforms the state-of-the-art on ScanQA, even without using multi-view image features. Visualization of the learned scene features shows that semantically similar scenes cluster together, indicating a meaningful representation space. The pre-training currently uses only a single top-down view of the scene. Further exploration of more complex question types and reasoning tasks in 3D scenes is needed. 3d vision-language pre-training, 3d visual question answering, clip, scene understanding, multi-modal learning
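The pre-training objective reduces to cosine-distance alignment against frozen CLIP embeddings; in the sketch below, random tensors stand in for the outputs of the scene encoder and of CLIP's text and image towers.

```python
import torch
import torch.nn.functional as F

def clip_alignment_loss(scene_emb, clip_text_emb, clip_image_emb):
    """Cosine-distance alignment of a 3D scene embedding with frozen CLIP embeddings.

    scene_emb comes from the point-cloud encoder (VoteNet + transformer in the paper);
    clip_text_emb / clip_image_emb are CLIP embeddings of the scene description and of
    a rendered view, pre-computed and detached so that CLIP stays frozen.
    """
    def cos_dist(a, b):
        return (1 - F.cosine_similarity(a, b, dim=-1)).mean()
    return cos_dist(scene_emb, clip_text_emb.detach()) + cos_dist(scene_emb, clip_image_emb.detach())

loss = clip_alignment_loss(torch.randn(4, 512, requires_grad=True),
                           torch.randn(4, 512), torch.randn(4, 512))
loss.backward()
print(loss.item())
```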
2304.06025 Report DreamPose: Fashion Image-to-Video Synthesis via Stable Diffusion Johanna Karras, Aleksander Holynski, Ting-Chun Wang, Ira Kemelmacher-Shlizerman We present DreamPose, a diffusion-based method for generating animated fashion videos from still images. Given an image and a sequence of human body poses, our method synthesizes a video containing both human and fabric motion. To achieve this, we transform a pretrained text-to-image model (Stable Diffusion) into a pose-and-image guided video synthesis model, using a novel fine-tuning strategy, a set of architectural changes to support the added conditioning signals, and techniques to encourage temporal consistency. We fine-tune on a collection of fashion videos from the UBC Fashion dataset. We evaluate our method on a variety of clothing styles and poses, and demonstrate that our method produces state-of-the-art results on fashion video animation.Video results are available on our project page. DreamPose is a novel diffusion-based method for animating still fashion images, generating photorealistic videos of people wearing diverse clothing styles in motion by leveraging a pretrained Stable Diffusion model. Fashion videos are more informative than static images but scarce. DreamPose addresses this by enabling the creation of realistic fashion videos from readily available still images, enhancing online shopping experiences and fashion content creation. DreamPose modifies Stable Diffusion by incorporating a split CLIP-VAE image encoder for detailed appearance conditioning and concatenating multi-pose representations for temporal consistency. A two-stage finetuning scheme, first on a fashion video dataset and then on specific subject images, enhances realism and identity preservation. DreamPose generates high-quality fashion videos with realistic fabric motion and diverse appearances. Quantitative metrics demonstrate DreamPose outperforms state-of-the-art methods in image quality, temporal consistency, and identity preservation. User studies confirm DreamPose generates more realistic and faithful animations compared to existing techniques. DreamPose may exhibit limitations in handling complex patterns and occasional artifacts in challenging poses. Future work includes improving computational efficiency, enhancing complex pattern fidelity, and exploring alternative conditioning signals like segmentation masks. image animation, diffusion models, fashion videos, stable diffusion, pose conditioning
2304.06022 Report SAM Struggles in Concealed Scenes -- Empirical Study on "Segment Anything" Ge-Peng Ji, Deng-Ping Fan, Peng Xu, Ming-Ming Cheng, Bowen Zhou, Luc Van Gool Segmenting anything is a ground-breaking step toward artificial general intelligence, and the Segment Anything Model (SAM) greatly fosters the foundation models for computer vision. We could not be more excited to probe the performance traits of SAM. In particular, exploring situations in which SAM does not perform well is interesting. In this report, we choose three concealed scenes, i.e., camouflaged animals, industrial defects, and medical lesions, to evaluate SAM under unprompted settings. Our main observation is that SAM looks unskilled in concealed scenes. This paper presents an empirical study on the Segment Anything Model (SAM), exploring its limitations in handling concealed scenes. Understanding the limitations of SAM, a groundbreaking model for image segmentation, is crucial to further improve its performance and guide future research in computer vision. The authors quantitatively evaluate SAM on camouflaged object segmentation benchmarks and qualitatively analyze its performance on concealed scenes like camouflaged animals, industrial defects, and medical lesions. SAM, while demonstrating improvement with larger model sizes, still lags behind state-of-the-art models in camouflaged object segmentation. SAM struggles to segment objects concealed within similar backgrounds or those lacking clear boundaries in concealed scenes. The lack of domain-specific knowledge, such as medical imaging, limits SAM's ability to accurately segment concealed lesions. The study primarily focuses on visual analysis and could benefit from more in-depth investigation into the model's internal representations. Exploring techniques like incorporating prior knowledge or domain adaptation to improve SAM's performance in concealed scenes is a potential future direction. segment anything model, sam, concealed scene understanding, camouflaged object segmentation, medical image segmentation
2304.06020 Report VidStyleODE: Disentangled Video Editing via StyleGAN and NeuralODEs Moayed Haji Ali, Andrew Bond, Tolga Birdal, Duygu Ceylan, Levent Karacan, Erkut Erdem, Aykut Erdem We propose $\textbf{VidStyleODE}$, a spatiotemporally continuous disentangled $\textbf{Vid}$eo representation based upon $\textbf{Style}$GAN and Neural-$\textbf{ODE}$s. Effective traversal of the latent space learned by Generative Adversarial Networks (GANs) has been the basis for recent breakthroughs in image editing. However, the applicability of such advancements to the video domain has been hindered by the difficulty of representing and controlling videos in the latent space of GANs. In particular, videos are composed of content (i.e., appearance) and complex motion components that require a special mechanism to disentangle and control. To achieve this, VidStyleODE encodes the video content in a pre-trained StyleGAN $\mathcal{W}_+$ space and benefits from a latent ODE component to summarize the spatiotemporal dynamics of the input video. Our novel continuous video generation process then combines the two to generate high-quality and temporally consistent videos with varying frame rates. We show that our proposed method enables a variety of applications on real videos: text-guided appearance manipulation, motion manipulation, image animation, and video interpolation and extrapolation. Project website: https://cyberiada.github.io/VidStyleODE VidStyleODE, a novel method for disentangled video editing, leverages StyleGAN and Neural ODEs to achieve spatiotemporally consistent manipulation of video content and motion. Existing video editing techniques struggle to balance high-quality generation, content-motion disentanglement, and accurate manipulation. VidStyleODE addresses these challenges by learning a continuous video representation. VidStyleODE encodes video content into StyleGAN's latent space and captures motion dynamics with a latent ODE. It then predicts latent directions conditioned on external style cues (e.g., text) to manipulate the video while preserving temporal consistency. Achieves state-of-the-art results on text-guided video editing, outperforming baselines in terms of visual quality, temporal consistency, and manipulation accuracy. Enables various applications, including image animation, video interpolation/extrapolation, and local motion transfer, by disentangling content and motion representations. Introduces a novel CLIP-based temporal consistency loss that improves video quality and training stability compared to adversarial training. Generation quality is limited by the pre-trained StyleGAN generator, and test-time fine-tuning could improve results. Future work could explore using second-order ODEs for enhanced dynamics representation and enable more sophisticated text-guided manipulation of local dynamics. video editing, stylegan, neural odes, text-guided manipulation, video generation
2304.05868 Report Mesh2Tex: Generating Mesh Textures from Image Queries Alexey Bokhovkin, Shubham Tulsiani, Angela Dai Remarkable advances have been achieved recently in learning neural representations that characterize object geometry, while generating textured objects suitable for downstream applications and 3D rendering remains at an early stage. In particular, reconstructing textured geometry from images of real objects is a significant challenge -- reconstructed geometry is often inexact, making realistic texturing a significant challenge. We present Mesh2Tex, which learns a realistic object texture manifold from uncorrelated collections of 3D object geometry and photorealistic RGB images, by leveraging a hybrid mesh-neural-field texture representation. Our texture representation enables compact encoding of high-resolution textures as a neural field in the barycentric coordinate system of the mesh faces. The learned texture manifold enables effective navigation to generate an object texture for a given 3D object geometry that matches to an input RGB image, which maintains robustness even under challenging real-world scenarios where the mesh geometry approximates an inexact match to the underlying geometry in the RGB image. Mesh2Tex can effectively generate realistic object textures for an object mesh to match real images observations towards digitization of real environments, significantly improving over previous state of the art. Mesh2Tex learns a texture manifold conditioned on mesh geometry, enabling high-resolution texture generation and image-based texture transfer. Generating textured 3D objects is important for various applications, but existing methods struggle with realistic texturing, especially from real images. The method uses a hybrid mesh-neural-field texture representation, trained adversarially with differentiable rendering to match real 2D images. Outperforms state-of-the-art in unconditional texture generation (FID, KID). Successfully transfers textures from single images, even with inexact geometry matches. Demonstrates robustness to unknown image pose through NOC-guided patch-based optimization. Limitations: Lacks explicit semantic modeling, no probabilistic sampling for occlusions. Future work: Incorporate semantic understanding, explore probabilistic texture generation. texture generation, texture transfer, neural fields, differentiable rendering, 3d shape analysis
2304.05866 Report NoisyTwins: Class-Consistent and Diverse Image Generation through StyleGANs Harsh Rangwani, Lavish Bansal, Kartik Sharma, Tejan Karmali, Varun Jampani, R. Venkatesh Babu StyleGANs are at the forefront of controllable image generation as they produce a latent space that is semantically disentangled, making it suitable for image editing and manipulation. However, the performance of StyleGANs severely degrades when trained via class-conditioning on large-scale long-tailed datasets. We find that one reason for degradation is the collapse of latents for each class in the $\mathcal{W}$ latent space. With NoisyTwins, we first introduce an effective and inexpensive augmentation strategy for class embeddings, which then decorrelates the latents based on self-supervision in the $\mathcal{W}$ space. This decorrelation mitigates collapse, ensuring that our method preserves intra-class diversity with class-consistency in image generation. We show the effectiveness of our approach on large-scale real-world long-tailed datasets of ImageNet-LT and iNaturalist 2019, where our method outperforms other methods by $\sim 19\%$ on FID, establishing a new state-of-the-art. This paper proposes NoisyTwins, a novel method to improve class-consistency and diversity in StyleGAN-generated images, particularly for long-tailed datasets. StyleGANs struggle with class-conditional generation on long-tailed datasets, often resulting in mode collapse (limited diversity) or class confusion. NoisyTwins introduces noise augmentation to class embeddings and uses a Barlow Twins-inspired loss to enforce invariance to these augmentations in the StyleGAN's W latent space. NoisyTwins achieves state-of-the-art FID scores on ImageNet-LT and iNaturalist 2019, outperforming previous methods by ~19%. The method effectively mitigates both mode collapse and class confusion, generating diverse and class-consistent images even for tail classes. NoisyTwins demonstrates strong performance in few-shot image generation scenarios, improving FID scores by 22.2% on average. The reliance on CLIP for evaluation might introduce biases from the CLIP model itself. Exploring NoisyTwins for conditioning on more complex attributes beyond class labels is a potential area for improvement. generative adversarial networks, stylegan, long-tailed learning, image generation, class-conditional generation
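Both ingredients, noise augmentation of the class embedding and a Barlow-Twins-style decorrelation loss on the resulting W latents, fit in a short sketch; the mapping network, embedding sizes, and noise scale below are illustrative assumptions.

```python
import torch
import torch.nn as nn

def barlow_twins_loss(w1, w2, lambd=5e-3):
    """Barlow Twins redundancy-reduction loss between two batches of W latents
    obtained from two noisy copies of the same class embeddings."""
    w1 = (w1 - w1.mean(0)) / (w1.std(0) + 1e-6)
    w2 = (w2 - w2.mean(0)) / (w2.std(0) + 1e-6)
    c = w1.t() @ w2 / w1.size(0)                                 # cross-correlation matrix (D x D)
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()               # pull matched dimensions together
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # decorrelate everything else
    return on_diag + lambd * off_diag

# Toy setup: noise-augmented class embeddings -> mapping network -> W latents.
num_classes, emb_dim, w_dim, batch = 10, 64, 128, 32
class_emb = nn.Embedding(num_classes, emb_dim)
mapping = nn.Sequential(nn.Linear(emb_dim, w_dim), nn.LeakyReLU(0.2), nn.Linear(w_dim, w_dim))
labels = torch.randint(0, num_classes, (batch,))
e = class_emb(labels)
w1 = mapping(e + 0.75 * torch.randn_like(e))                     # two independent noise augmentations
w2 = mapping(e + 0.75 * torch.randn_like(e))
print(barlow_twins_loss(w1, w2).item())
```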
2304.05818 Report Gradient-Free Textual Inversion Zhengcong Fei, Mingyuan Fan, Junshi Huang Recent works on personalized text-to-image generation usually learn to bind a special token with specific subjects or styles of a few given images by tuning its embedding through gradient descent. It is natural to question whether we can optimize the textual inversions by only accessing the process of model inference. As only requiring the forward computation to determine the textual inversion retains the benefits of less GPU memory, simple deployment, and secure access for scalable models. In this paper, we introduce a \emph{gradient-free} framework to optimize the continuous textual inversion in an iterative evolutionary strategy. Specifically, we first initialize an appropriate token embedding for textual inversion with the consideration of visual and text vocabulary information. Then, we decompose the optimization of evolutionary strategy into dimension reduction of searching space and non-convex gradient-free optimization in subspace, which significantly accelerates the optimization process with negligible performance loss. Experiments in several applications demonstrate that the performance of text-to-image model equipped with our proposed gradient-free method is comparable to that of gradient-based counterparts with variant GPU/CPU platforms, flexible employment, as well as computational efficiency. This paper presents the first gradient-free framework for personalized text-to-image generation, using an iterative evolutionary strategy to optimize textual inversion without requiring model gradients, making it suitable for limited-resource settings and large models. Existing gradient-based methods for personalizing text-to-image models are computationally expensive and impractical for large models or restricted access scenarios. This work addresses these limitations by enabling personalization using only model inference, making it more accessible and efficient. The proposed framework initializes the textual inversion embedding using cross-attention between given images and text vocabulary. Then, it employs a gradient-free optimization (CMA-ES) in a lower-dimensional subspace, determined by PCA or prior normalization, for efficient exploration and exploitation. Gradient-free textual inversion achieves comparable image generation quality to gradient-based methods, both qualitatively and according to human evaluation. The proposed general condition initialization strategy significantly accelerates optimization convergence. Single pseudo-word inversion outperforms multi-word counterparts in terms of editability and maintains comparable reconstruction quality. Balancing exploration and exploitation in the evolutionary strategy needs further investigation for potential improvement. The potential for bias in generated images, similar to other generative models, requires further investigation and mitigation strategies. text-to-image generation, textual inversion, gradient-free optimization, evolutionary strategy, personalization
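Because only forward evaluations are needed, the search loop can be written around any black-box optimizer. The sketch below uses the `cma` package and a random orthonormal projection in place of the paper's PCA-based subspace; the scoring function is a placeholder for, e.g., negative CLIP similarity between generated and reference images.

```python
import numpy as np
import cma  # pip install cma

def optimize_textual_inversion(score_fn, init_embedding, subspace_dim=32, sigma0=0.5, iters=50):
    """Gradient-free search for a pseudo-word embedding using CMA-ES in a subspace.

    score_fn(embedding) must return a scalar to MINIMIZE; it only requires forward
    passes through the text-to-image model. The random projection below stands in
    for the paper's dimension-reduction step.
    """
    d = init_embedding.shape[0]
    basis = np.linalg.qr(np.random.randn(d, subspace_dim))[0]     # orthonormal subspace basis
    to_full = lambda z: init_embedding + basis @ z                # subspace -> token embedding

    es = cma.CMAEvolutionStrategy(np.zeros(subspace_dim), sigma0)
    for _ in range(iters):
        candidates = es.ask()                                     # sample a population
        es.tell(candidates, [score_fn(to_full(z)) for z in candidates])
    return to_full(es.result.xbest)

# Placeholder objective: pretend the "good" embedding is a fixed random target.
target = np.random.randn(768)
best = optimize_textual_inversion(lambda e: float(np.sum((e - target) ** 2)),
                                  init_embedding=np.zeros(768))
print(float(np.sum((best - target) ** 2)))  # should decrease relative to the initial score of ~768
```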
2304.05772 Report An Image Quality Assessment Dataset for Portraits Nicolas Chahine, Ana-Stefania Calarasanu, Davide Garcia-Civiero, Theo Cayla, Sira Ferradans, Jean Ponce Year after year, the demand for ever-better smartphone photos continues to grow, in particular in the domain of portrait photography. Manufacturers thus use perceptual quality criteria throughout the development of smartphone cameras. This costly procedure can be partially replaced by automated learning-based methods for image quality assessment (IQA). Due to its subjective nature, it is necessary to estimate and guarantee the consistency of the IQA process, a characteristic lacking in the mean opinion scores (MOS) widely used for crowdsourcing IQA. In addition, existing blind IQA (BIQA) datasets pay little attention to the difficulty of cross-content assessment, which may degrade the quality of annotations. This paper introduces PIQ23, a portrait-specific IQA dataset of 5116 images of 50 predefined scenarios acquired by 100 smartphones, covering a high variety of brands, models, and use cases. The dataset includes individuals of various genders and ethnicities who have given explicit and informed consent for their photographs to be used in public research. It is annotated by pairwise comparisons (PWC) collected from over 30 image quality experts for three image attributes: face detail preservation, face target exposure, and overall image quality. An in-depth statistical analysis of these annotations allows us to evaluate their consistency over PIQ23. Finally, we show through an extensive comparison with existing baselines that semantic information (image context) can be used to improve IQA predictions. The dataset along with the proposed statistical analysis and BIQA algorithms are available: https://github.com/DXOMARK-Research/PIQ2023 Introduces PIQ23, the first smartphone portrait quality assessment dataset, featuring 5116 images across 50 scenes, annotated by experts via pairwise comparisons for face detail, exposure, and overall quality. Addresses the growing need for automated portrait quality evaluation in smartphone camera development, moving beyond generic IQA datasets to focus on portrait-specific attributes. Constructed PIQ23 with 100 smartphones capturing diverse portrait scenarios. Expert annotations were gathered using pairwise comparisons and a controlled lab environment. A novel statistical analysis method assessed annotation consistency and clustered images based on quality. PIQ23 is the first IQA dataset with legally obtained explicit consent from all individuals depicted, addressing ethical concerns. The statistical analysis method quantifies uncertainty in pairwise comparison data and identifies significant quality differences between images. The proposed SEM-HyperIQA model, integrating semantic information and multitasking, outperforms existing BIQA methods on PIQ23, demonstrating the importance of content awareness. Color quality annotation, attempted but excluded due to high subjectivity and difficulty in pairwise comparisons, needs further investigation. Future work could explore additional portrait-specific attributes beyond the initial three, expanding the dataset and model capabilities. image quality assessment, portrait photography, smartphone cameras, pairwise comparison, semantic segmentation
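As a point of reference for how pairwise comparisons become a quality scale, the sketch below fits a Bradley-Terry model to a win-count matrix; this is a generic approach, not PIQ23's own statistical analysis, which additionally quantifies annotation consistency and clusters images.

```python
import numpy as np

def bradley_terry_scores(wins: np.ndarray, n_iter: int = 100):
    """Maximum-likelihood Bradley-Terry scores from a pairwise win-count matrix.

    wins[i, j] = number of times image i was preferred over image j.
    """
    n = wins.shape[0]
    p = np.ones(n)
    comparisons = wins + wins.T                               # total i-vs-j comparisons
    for _ in range(n_iter):                                   # standard MM updates
        denom = (comparisons / (p[:, None] + p[None, :] + 1e-12)).sum(axis=1)
        p = wins.sum(axis=1) / np.maximum(denom, 1e-12)
        p /= p.sum()                                          # fix the overall scale
    return np.log(p)                                          # log-scores act as quality values

rng = np.random.default_rng(0)
true_quality = np.array([0.0, 1.0, 2.0])
p_win = 1 / (1 + np.exp(-(true_quality[:, None] - true_quality[None, :])))
wins = rng.binomial(30, p_win)                                # simulate 30 judgments per pair
np.fill_diagonal(wins, 0)
print(bradley_terry_scores(wins))                             # ordering should match true_quality
```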
2304.05750 Report Segment Anything Is Not Always Perfect: An Investigation of SAM on Different Real-world Applications Wei Ji, Jingjing Li, Qi Bi, Tingwei Liu, Wenbo Li, Li Cheng Recently, Meta AI Research approaches a general, promptable Segment Anything Model (SAM) pre-trained on an unprecedentedly large segmentation dataset (SA-1B). Without a doubt, the emergence of SAM will yield significant benefits for a wide array of practical image segmentation applications. In this study, we conduct a series of intriguing investigations into the performance of SAM across various applications, particularly in the fields of natural images, agriculture, manufacturing, remote sensing, and healthcare. We analyze and discuss the benefits and limitations of SAM, while also presenting an outlook on its future development in segmentation tasks. By doing so, we aim to give a comprehensive understanding of SAM's practical applications. This work is expected to provide insights that facilitate future research activities toward generic segmentation. Source code is publicly available. This paper investigates the performance of the Segment Anything Model (SAM) on a variety of real-world applications beyond natural images. While SAM has shown impressive results on general segmentation tasks, it's crucial to understand its capabilities and limitations in diverse, specialized domains like agriculture and healthcare. The authors evaluate SAM on various segmentation subtasks within natural images, agriculture, manufacturing, remote sensing, and healthcare. They analyze both qualitatively (visual results) and quantitatively (comparing to state-of-the-art models on benchmarks) SAM's performance. SAM excels in common scenes with distinct objects, demonstrating strong generalization from its training. SAM requires strong prior knowledge in complex scenes, often needing specific prompts to perform well. SAM struggles with low-contrast, small, and irregular objects, highlighting limitations in handling such cases. The study primarily focuses on visual and limited quantitative analysis, lacking in-depth performance evaluation. Future work includes exploring application-oriented SAMs, new prompt modes, and extending to video and semi-supervised learning. segment anything model (sam), image segmentation, real-world applications, computer vision, foundation models
2304.05659 Report RIFormer: Keep Your Vision Backbone Effective While Removing Token Mixer Jiahao Wang, Songyang Zhang, Yong Liu, Taiqiang Wu, Yujiu Yang, Xihui Liu, Kai Chen, Ping Luo, Dahua Lin This paper studies how to keep a vision backbone effective while removing token mixers in its basic building blocks. Token mixers, such as self-attention for vision transformers (ViTs), are intended to perform information communication between different spatial tokens but suffer from considerable computational cost and latency. However, directly removing them leads to an incomplete model structure prior and thus brings a significant accuracy drop. To this end, we first develop a RepIdentityFormer based on the re-parameterizing idea to study the token-mixer-free model architecture. We then explore an improved learning paradigm to break the limitations of the simple token-mixer-free backbone, and summarize the empirical practice into 5 guidelines. Equipped with the proposed optimization strategy, we are able to build an extremely simple vision backbone with encouraging performance, while enjoying high efficiency during inference. Extensive experiments and ablative analysis also demonstrate that the inductive bias of the network architecture can be incorporated into a simple network structure with an appropriate optimization strategy. We hope this work can serve as a starting point for the exploration of optimization-driven efficient network design. Project page: https://techmonsterwang.github.io/RIFormer/. This paper presents RIFormer, a vision backbone that maintains efficacy while removing computationally expensive token mixers. Token mixers, like self-attention in ViTs, are computationally costly and limit backbone efficiency on resource-constrained devices. RIFormer explores removing these while preserving effectiveness. The study uses structural re-parameterization to train a model with an affine transformation replacing the token mixer, later merging it into the LayerNorm during inference. It further explores knowledge distillation, using a teacher model with a token mixer to guide the token mixer-free student. RIFormer achieves competitive performance on ImageNet-1K while surpassing models with token mixers in inference speed. The paper provides five guidelines for effectively training such token-mixer-free models using knowledge distillation. Analysis suggests that the inductive bias introduced by token mixers can be implicitly learned by simpler structures using the proposed training method. The paper primarily focuses on image classification, leaving its application to other vision tasks unexplored. Future work could involve investigating the impact of this method on tasks like object detection and image deblurring. vision backbones, token mixers, knowledge distillation, efficient networks, structural re-parameterization
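The inference-time merging mentioned in this entry follows from a simple identity: a channel-wise affine transform applied to the output of a LayerNorm can be folded into the LayerNorm's own weight and bias. The sketch below checks that identity numerically; it illustrates only this folding step (the paper's full re-parameterization also handles the residual branch), and the helper name, shapes, and tolerances are assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

def fold_affine_into_layernorm(ln: nn.LayerNorm, scale: torch.Tensor, shift: torch.Tensor) -> nn.LayerNorm:
    """Return a LayerNorm equivalent to `scale * ln(x) + shift` (channel-wise).
    Since ln(x) = gamma * x_hat + beta, the composition equals
    (scale * gamma) * x_hat + (scale * beta + shift)."""
    merged = nn.LayerNorm(ln.normalized_shape, eps=ln.eps)
    with torch.no_grad():
        merged.weight.copy_(scale * ln.weight)
        merged.bias.copy_(scale * ln.bias + shift)
    return merged

# Numerical check on random data.
C = 64
ln = nn.LayerNorm(C)
scale, shift = torch.randn(C), torch.randn(C)
x = torch.randn(8, 196, C)                       # (batch, tokens, channels)
merged = fold_affine_into_layernorm(ln, scale, shift)
assert torch.allclose(scale * ln(x) + shift, merged(x), atol=1e-5)
```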
2304.05568 Report Improving Diffusion Models for Scene Text Editing with Dual Encoders Jiabao Ji, Guanhua Zhang, Zhaowen Wang, Bairu Hou, Zhifei Zhang, Brian Price, Shiyu Chang Scene text editing is a challenging task that involves modifying or inserting specified texts in an image while maintaining its natural and realistic appearance. Most previous approaches to this task rely on style-transfer models that crop out text regions and feed them into image transfer models, such as GANs. However, these methods are limited in their ability to change text style and are unable to insert texts into images. Recent advances in diffusion models have shown promise in overcoming these limitations with text-conditional image editing. However, our empirical analysis reveals that state-of-the-art diffusion models struggle with rendering correct text and controlling text style. To address these problems, we propose DIFFSTE to improve pre-trained diffusion models with a dual encoder design, which includes a character encoder for better text legibility and an instruction encoder for better style control. An instruction tuning framework is introduced to train our model to learn the mapping from the text instruction to the corresponding image with either the specified style or the style of the surrounding texts in the background. Such a training method further brings our method the zero-shot generalization ability to the following three scenarios: generating text with unseen font variation, e.g., italic and bold, mixing different fonts to construct a new font, and using more relaxed forms of natural language as the instructions to guide the generation task. We evaluate our approach on five datasets and demonstrate its superior performance in terms of text correctness, image naturalness, and style controllability. Our code is publicly available. https://github.com/UCSB-NLP-Chang/DiffSTE Proposed DiffSTE, a method leveraging dual encoders (character and instruction) to enhance pre-trained diffusion models for scene text editing. Existing methods for scene text editing either struggle with style control (GAN-based style transfer) or text accuracy (diffusion models). Introduce a dual encoder design to improve Stable Diffusion: character encoder for accurate spelling, instruction encoder for style control. Train using instruction tuning on synthetic and real-world datasets with varying style instructions. Achieves superior performance in text correctness, image naturalness, and style control on five datasets. Shows strong zero-shot generalization to unseen font variations (e.g., bold, italic) and combinations. Can be guided by more natural language instructions. Current model is limited to single-word editing. Evaluation focuses on single-word editing; more complex scene text editing scenarios can be explored. scene text editing, diffusion models, instruction tuning, dual encoder, zero-shot learning
2304.05552 Report DynamicDet: A Unified Dynamic Architecture for Object Detection Zhihao Lin, Yongtao Wang, Jinhe Zhang, Xiaojie Chu Dynamic neural network is an emerging research topic in deep learning. With adaptive inference, dynamic models can achieve remarkable accuracy and computational efficiency. However, it is challenging to design a powerful dynamic detector, because of no suitable dynamic architecture and exiting criterion for object detection. To tackle these difficulties, we propose a dynamic framework for object detection, named DynamicDet. Firstly, we carefully design a dynamic architecture based on the nature of the object detection task. Then, we propose an adaptive router to analyze the multi-scale information and to decide the inference route automatically. We also present a novel optimization strategy with an exiting criterion based on the detection losses for our dynamic detectors. Last, we present a variable-speed inference strategy, which helps to realize a wide range of accuracy-speed trade-offs with only one dynamic detector. Extensive experiments conducted on the COCO benchmark demonstrate that the proposed DynamicDet achieves new state-of-the-art accuracy-speed trade-offs. For instance, with comparable accuracy, the inference speed of our dynamic detector Dy-YOLOv7-W6 surpasses YOLOv7-E6 by 12%, YOLOv7-D6 by 17%, and YOLOv7-E6E by 39%. The code is available at https://github.com/VDIGPKU/DynamicDet. This paper proposes DynamicDet, a dynamic neural network framework for object detection that allows for adaptable inference routes based on image difficulty to achieve a wide range of accuracy-speed trade-offs using a single model. Existing object detectors often require training multiple models for different accuracy-speed trade-offs, leading to high computational costs. Dynamic inference addresses this by adapting computation based on input data, offering improved efficiency. DynamicDet uses two cascaded detectors with an adaptive router. The router analyzes multi-scale features to estimate an image's difficulty score and dynamically chooses the appropriate detector (faster or more accurate). A novel, hyperparameter-free optimization strategy with an adaptive offset is used to train the router, ensuring accurate difficulty assessment and balanced detector usage. DynamicDet achieves state-of-the-art accuracy-speed trade-offs, outperforming other real-time object detectors. The method generalizes well across one-stage and two-stage detectors and is compatible with both CNN- and transformer-based backbones. The proposed adaptive router is lightweight and effectively learns to distinguish between "easy" and "hard" images for dynamic routing. The variable-speed inference strategy relies on a sufficiently large validation set for robust threshold determination. Future work could explore more sophisticated difficulty assessment mechanisms beyond the proposed adaptive router for further performance improvement. object detection, dynamic neural network, accuracy-speed trade-off, adaptive inference, computer vision
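The adaptive routing in this entry boils down to a small decision: estimate a difficulty score from multi-scale features and send the image through the faster detector when the score is low, or through the more accurate cascaded detector when it is high. The sketch below is a simplified stand-in for the paper's learned router and its validation-calibrated variable-speed thresholds; the pooling-based router, module names, and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveRouter(nn.Module):
    """Toy difficulty router: pool each feature map, concatenate, predict a scalar in [0, 1]."""
    def __init__(self, channels=(256, 512, 1024)):
        super().__init__()
        self.head = nn.Linear(sum(channels), 1)

    def forward(self, feats):
        pooled = [f.mean(dim=(2, 3)) for f in feats]      # global average pool each scale
        return torch.sigmoid(self.head(torch.cat(pooled, dim=1))).squeeze(1)

def dynamic_detect(image_feats, fast_detector, accurate_detector, router, threshold=0.5):
    """Route 'easy' images (low difficulty score) to the fast detector,
    'hard' ones to the more accurate cascaded detector."""
    difficulty = router(image_feats)
    if difficulty.item() < threshold:
        return fast_detector(image_feats)
    return accurate_detector(image_feats)

# Usage with dummy multi-scale features and placeholder detectors.
feats = [torch.randn(1, c, s, s) for c, s in [(256, 80), (512, 40), (1024, 20)]]
router = AdaptiveRouter()
out = dynamic_detect(feats, lambda f: "fast boxes", lambda f: "accurate boxes", router)
```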
2304.05523 Report MoMo: A shared encoder Model for text, image and multi-Modal representations Rakesh Chada, Zhaoheng Zheng, Pradeep Natarajan We propose a self-supervised shared encoder model that achieves strong results on several visual, language and multimodal benchmarks while being data, memory and run-time efficient. We make three key contributions. First, in contrast to most existing works, we use a single transformer with all the encoder layers processing both the text and the image modalities. Second, we propose a stage-wise training strategy where the model is first trained on images, then jointly with unimodal text and image datasets and finally jointly with text and text-image datasets. Third, to preserve information across both the modalities, we propose a training pipeline that learns simultaneously from gradient updates of different modalities at each training update step. The results on downstream text-only, image-only and multimodal tasks show that our model is competitive with several strong models while using fewer parameters and lesser pre-training data. For example, MoMo performs competitively with FLAVA on multimodal (+3.1), image-only (+1.1) and text-only (-0.1) tasks despite having 2/5th the number of parameters and using 1/3rd the image-text training pairs. Finally, we ablate various design choices and further show that increasing model size produces significant performance gains indicating potential for substantial improvements with larger models using our approach. This paper introduces MoMo, a self-supervised shared encoder model for text, image, and multimodal representation learning that is efficient in terms of data, memory, and runtime. MoMo addresses the limitations of existing multimodal models that often rely on huge training corpora or models with numerous parameters by using a single transformer encoder for all modalities and a stage-wise training strategy. MoMo employs a three-stage training pipeline: first trained on images (Masked Image Modeling), then jointly on unimodal text and images (Masked Language Modeling), and finally on unimodal text and multimodal image-text data (Cross-Modal Masking, contrastive, and matching losses). MoMo achieves competitive performance on multimodal, image-only, and text-only tasks despite using significantly fewer parameters and less pre-training data compared to models like FLAVA and CLIP. A multi-stage training approach where the model learns simultaneously from different modalities at each training step is crucial for effective multimodal representation learning. Scaling up the model size leads to considerable performance gains, highlighting the potential for further improvements with larger models using this approach. The model's performance on certain tasks, like VQA, could benefit from additional pre-training data. Future work could explore incorporating more modalities and larger models. multimodal learning, vision-language pre-training, shared encoder, transformer, self-supervised learning
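The training-pipeline detail in this entry, learning "simultaneously from gradient updates of different modalities at each training update step", can be read as gradient accumulation: compute a loss on an image batch and a text batch, backpropagate both into the shared encoder, and take a single optimizer step. The toy encoder, heads, and classification losses below are placeholders standing in for the actual masked-modeling objectives; this sketches only the update pattern, not the MoMo implementation.

```python
import torch
import torch.nn as nn

shared_encoder = nn.Sequential(nn.Linear(128, 256), nn.GELU(), nn.Linear(256, 128))
head_img, head_txt = nn.Linear(128, 10), nn.Linear(128, 10)
opt = torch.optim.AdamW(
    list(shared_encoder.parameters()) + list(head_img.parameters()) + list(head_txt.parameters()),
    lr=1e-4,
)
loss_fn = nn.CrossEntropyLoss()

def training_step(img_batch, img_labels, txt_batch, txt_labels):
    """One update whose gradients mix both modalities before the optimizer steps."""
    opt.zero_grad()
    # Image-modality loss (stand-in for masked image modeling).
    loss_img = loss_fn(head_img(shared_encoder(img_batch)), img_labels)
    loss_img.backward()
    # Text-modality loss (stand-in for masked language modeling) on the same shared weights.
    loss_txt = loss_fn(head_txt(shared_encoder(txt_batch)), txt_labels)
    loss_txt.backward()
    opt.step()                       # single step sees gradients from both modalities
    return loss_img.item(), loss_txt.item()

training_step(torch.randn(8, 128), torch.randint(0, 10, (8,)),
              torch.randn(8, 128), torch.randint(0, 10, (8,)))
```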
2304.05395 Report SE-ORNet: Self-Ensembling Orientation-aware Network for Unsupervised Point Cloud Shape Correspondence Jiacheng Deng, Chuxin Wang, Jiahao Lu, Jianfeng He, Tianzhu Zhang, Jiyang Yu, Zhe Zhang Unsupervised point cloud shape correspondence aims to obtain dense point-to-point correspondences between point clouds without manually annotated pairs. However, humans and some animals have bilateral symmetry and various orientations, which lead to severe mispredictions of symmetrical parts. Besides, point cloud noise disrupts consistent representations for point clouds and thus degrades shape correspondence accuracy. To address the above issues, we propose a Self-Ensembling ORientation-aware Network termed SE-ORNet. The key of our approach is to exploit an orientation estimation module with a domain adaptive discriminator to align the orientations of point cloud pairs, which significantly alleviates the mispredictions of symmetrical parts. Additionally, we design a self-ensembling framework for unsupervised point cloud shape correspondence. In this framework, the disturbances of point cloud noise are overcome by perturbing the inputs of the student and teacher networks with different data augmentations and constraining the consistency of predictions. Extensive experiments on both human and animal datasets show that our SE-ORNet can surpass state-of-the-art unsupervised point cloud shape correspondence methods. This paper introduces SE-ORNet, a novel self-ensembling orientation-aware network designed for unsupervised point cloud shape correspondence. Existing methods struggle with the mismatching of symmetrical parts in point clouds with different orientations and are sensitive to noise. This paper aims to address these issues. SE-ORNet utilizes an orientation estimation module with domain adaptation to align point cloud pairs, mitigating mismatches. Additionally, a self-ensembling framework with consistency losses ensures robust feature representations despite noise and orientation variations. SE-ORNet surpasses state-of-the-art methods on human and animal benchmarks, including SHREC, SURREAL, TOSCA, and SMAL. The orientation estimation module effectively aligns point cloud orientations, significantly improving correspondence accuracy for symmetrical parts. The self-ensembling framework enhances robustness to noise, leading to more consistent and reliable feature representations. The performance of orientation estimation relies on the accuracy of the pre-defined angle bins. The computational cost of the self-ensembling framework is relatively high. point cloud shape correspondence, unsupervised learning, self-ensembling, orientation estimation, domain adaptation
2304.05390 Report HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models Eslam Mohamed Bakr, Pengzhan Sun, Xiaoqian Shen, Faizan Farooq Khan, Li Erran Li, Mohamed Elhoseiny In recent years, Text-to-Image (T2I) models have been extensively studied, especially with the emergence of diffusion models that achieve state-of-the-art results on T2I synthesis tasks. However, existing benchmarks heavily rely on subjective human evaluation, limiting their ability to holistically assess the model's capabilities. Furthermore, there is a significant gap between efforts in developing new T2I architectures and those in evaluation. To address this, we introduce HRS-Bench, a concrete evaluation benchmark for T2I models that is Holistic, Reliable, and Scalable. Unlike existing benchmarks that focus on limited aspects, HRS-Bench measures 13 skills that can be categorized into five major categories: accuracy, robustness, generalization, fairness, and bias. In addition, HRS-Bench covers 50 scenarios, including fashion, animals, transportation, food, and clothes. We evaluate nine recent large-scale T2I models using metrics that cover a wide range of skills. A human evaluation aligned with 95% of our evaluations on average was conducted to probe the effectiveness of HRS-Bench. Our experiments demonstrate that existing models often struggle to generate images with the desired count of objects, visual text, or grounded emotions. We hope that our benchmark helps ease future text-to-image generation research. The code and data are available at https://eslambakr.github.io/hrsbench.github.io The paper introduces HRS-Bench, a holistic, reliable, and scalable benchmark for evaluating text-to-image models beyond just image quality. Existing T2I benchmarks are limited in scope, often focusing on few aspects like fidelity or bias, hindering comprehensive model assessment. HRS-Bench uses a large dataset of prompts across 50 scenarios and measures 13 skills grouped into five categories: accuracy, robustness, generalization, fairness, and bias. It utilizes a combination of automatic metrics (e.g., UniDet for counting, CLIPScore for similarity) and human evaluation. Existing models struggle with object counting accuracy, especially with increased prompt complexity. Generating images with visual text or grounded emotions remains a significant challenge for current models. While models show robustness against language perturbations like typos, they struggle with complex compositions, especially spatial, size, and color arrangements. Accessing the full training data for some models is challenging, limiting the evaluation of aspects like creativity. Expanding the benchmark to include more intricate visual reasoning and common-sense understanding skills. text-to-image synthesis, benchmarking, evaluation metrics, multi-modal learning, generative ai
2304.05265 Report Controllable Textual Inversion for Personalized Text-to-Image Generation Jianan Yang, Haobo Wang, Yanming Zhang, Ruixuan Xiao, Sai Wu, Gang Chen, Junbo Zhao The recent large-scale generative modeling has attained unprecedented performance especially in producing high-fidelity images driven by text prompts. Text inversion (TI), alongside the text-to-image model backbones, is proposed as an effective technique in personalizing the generation when the prompts contain user-defined, unseen or long-tail concept tokens. Despite that, we find and show that the deployment of TI remains full of "dark-magics" -- to name a few, the harsh requirement of additional datasets, arduous human efforts in the loop and lack of robustness. In this work, we propose a much-enhanced version of TI, dubbed Controllable Textual Inversion (COTI), in resolving all the aforementioned problems and in turn delivering a robust, data-efficient and easy-to-use framework. The core to COTI is a theoretically-guided loss objective instantiated with a comprehensive and novel weighted scoring mechanism, encapsulated by an active-learning paradigm. The extensive results show that COTI significantly outperforms the prior TI-related approaches with a 26.05 decrease in the FID score and a 23.00% boost in the R-precision. This paper presents Controllable Textual Inversion (COTI), an enhanced text inversion (TI) framework for personalized text-to-image generation that addresses limitations of existing TI methods, such as the need for large datasets and manual data selection. Existing text-to-image generation models struggle to produce high-quality images for prompts containing unseen or long-tail concepts. TI offers a solution but often relies on manual data selection and large datasets, limiting its practicality. COTI utilizes an active learning paradigm with a novel weighted scoring system to automatically select high-quality training data from a web-crawled dataset. The scoring system combines aesthetic and concept-matching scores, dynamically balancing their importance based on the evolving text embedding during training. COTI significantly outperforms baseline TI approaches, achieving a 26.05 decrease in FID score and a 23.00% boost in R-precision. The method demonstrates successful learning of concept attributes, progressively refining image quality across active learning cycles. Ablation studies confirm the effectiveness of both the dual scoring system and the dynamic training schedule. The current implementation primarily focuses on single-concept personalization and may require further exploration for concepts with multiple visual representations. Future work could investigate extending COTI to other text-guided generative tasks. text-to-image generation, textual inversion, active learning, personalized image synthesis, aesthetic image assessment
2304.05139 Report NeAT: Neural Artistic Tracing for Beautiful Style Transfer Dan Ruta, Andrew Gilbert, John Collomosse, Eli Shechtman, Nicholas Kolkin Style transfer is the task of reproducing the semantic contents of a source image in the artistic style of a second target image. In this paper, we present NeAT, a new state-of-the art feed-forward style transfer method. We re-formulate feed-forward style transfer as image editing, rather than image generation, resulting in a model which improves over the state-of-the-art in both preserving the source content and matching the target style. An important component of our model's success is identifying and fixing "style halos", a commonly occurring artefact across many style transfer techniques. In addition to training and testing on standard datasets, we introduce the BBST-4M dataset, a new, large scale, high resolution dataset of 4M images. As a component of curating this data, we present a novel model able to classify if an image is stylistic. We use BBST-4M to improve and measure the generalization of NeAT across a huge variety of styles. Not only does NeAT offer state-of-the-art quality and generalization, it is designed and trained for fast inference at high resolution. This supplementary material provides additional details about NeAT (Neural Artistic Tracing), a novel style transfer method, and introduces BBST-4M, a large-scale dataset for style transfer. This work addresses the limitations of existing style transfer datasets and methods by introducing a new dataset with diverse styles and a model that generalizes well to unseen styles. The authors create BBST-4M using images from Flickr and Behance.net. They train NeAT using a combination of adversarial loss, style loss, content loss, identity loss, contrastive loss, and a novel patch-based discriminator. NeAT trained on BBST-4M demonstrates strong generalization capabilities and produces high-quality stylizations. BBST-4M, with its diverse range of styles, facilitates the development of more robust style transfer models. NeAT shows promise for video stylization, even with a simple frame-by-frame approach. The video stylization approach lacks explicit temporal consistency mechanisms. Quantitative comparisons with other state-of-the-art style transfer methods are limited in the supplementary material. style transfer, deep learning, computer vision, dataset, neural networks
2304.05097 Report One-Shot High-Fidelity Talking-Head Synthesis with Deformable Neural Radiance Field Weichuang Li, Longhao Zhang, Dong Wang, Bin Zhao, Zhigang Wang, Mulin Chen, Bang Zhang, Zhongjian Wang, Liefeng Bo, Xuelong Li Talking head generation aims to generate faces that maintain the identity information of the source image and imitate the motion of the driving image. Most pioneering methods rely primarily on 2D representations and thus will inevitably suffer from face distortion when large head rotations are encountered. Recent works instead employ explicit 3D structural representations or implicit neural rendering to improve performance under large pose changes. Nevertheless, the fidelity of identity and expression is not so desirable, especially for novel-view synthesis. In this paper, we propose HiDe-NeRF, which achieves high-fidelity and free-view talking-head synthesis. Drawing on the recently proposed Deformable Neural Radiance Fields, HiDe-NeRF decomposes the 3D dynamic scene into a canonical appearance field and an implicit deformation field, where the former comprises the canonical source face and the latter models the driving pose and expression. In particular, we improve fidelity from two aspects: (i) to enhance identity expressiveness, we design a generalized appearance module that leverages multi-scale volume features to preserve face shape and details; (ii) to improve expression preciseness, we propose a lightweight deformation module that explicitly decouples the pose and expression to enable precise expression modeling. Extensive experiments demonstrate that our proposed approach can generate better results than previous works. Project page: https://www.waytron.net/hidenerf/ This paper proposes HiDe-NeRF, a novel one-shot and subject-agnostic Deformable Neural Radiance Field for high-fidelity and free-view talking-head synthesis. Existing talking head generation methods struggle to generate high-fidelity results, particularly in preserving source identity and mimicking driving expressions, especially under large head rotations. HiDe-NeRF represents a 3D dynamic scene as a canonical appearance field (multi-scale tri-plane representation of the source face) and an implicit deformation field. The deformation field is learned using a novel Lightweight Expression-aware Deformation (LED) module that decouples pose and expression for precise modeling. A Multi-scale Generalized Appearance (MGA) module ensures identity expressiveness. Finally, the model renders the synthesized image and refines the texture details. HiDe-NeRF outperforms state-of-the-art methods in both self-reenactment and cross-identity reenactment tasks on multiple benchmark datasets, demonstrating superior performance in preserving source identity and mimicking driving expressions. The proposed method exhibits excellent free-view synthesis capability, accurately redirecting the face while maintaining identity and expression consistency across different viewpoints. Ablation studies confirm the effectiveness of the proposed MGA and LED modules in enhancing identity preservation and expression preciseness. HiDe-NeRF struggles with handling occlusions in the source image. The method's performance degrades under extreme pose changes due to pose bias in training datasets. Future work includes addressing occlusions, mitigating pose bias, and exploring other modality-driven talking head synthesis. talking-head synthesis, deformable neural radiance fields, one-shot learning, identity preservation, expression modeling
2304.05051 Report FashionSAP: Symbols and Attributes Prompt for Fine-grained Fashion Vision-Language Pre-training Yunpeng Han, Lisai Zhang, Qingcai Chen, Zhijian Chen, Zhonghua Li, Jianxin Yang, Zhao Cao Fashion vision-language pre-training models have shown efficacy for a wide range of downstream tasks. However, general vision-language pre-training models pay less attention to fine-grained domain features, while these features are important in distinguishing the specific domain tasks from general tasks. We propose a method for fine-grained fashion vision-language pre-training based on fashion Symbols and Attributes Prompt (FashionSAP) to model fine-grained multi-modalities fashion attributes and characteristics. Firstly, we propose the fashion symbols, a novel abstract fashion concept layer, to represent different fashion items and to generalize various kinds of fine-grained fashion features, making modelling fine-grained attributes more effective. Secondly, the attributes prompt method is proposed to make the model learn specific attributes of fashion items explicitly. We design proper prompt templates according to the format of fashion data. Comprehensive experiments are conducted on two public fashion benchmarks, i.e., FashionGen and FashionIQ, and FashionSAP gets SOTA performances for four popular fashion tasks. The ablation study also shows the proposed abstract fashion symbols, and the attribute prompt method enables the model to acquire fine-grained semantics in the fashion domain effectively. The obvious performance gains from FashionSAP provide a new baseline for future fashion task research. This paper proposes FashionSAP, a novel fine-grained fashion vision-language pre-training model that leverages fashion symbols and attribute prompts to learn attribute-level fashion knowledge. General vision-language pre-training models often overlook the fine-grained attributes crucial for understanding fashion items, limiting their effectiveness in fashion-related tasks. FashionSAP utilizes: (1) Nine abstract fashion symbols representing broad categories based on body parts and functionalities, aiding in general feature capture. (2) An attribute prompt method with specifically designed templates to explicitly learn fine-grained fashion characteristics from attribute annotations. FashionSAP achieves state-of-the-art performance on four popular fashion tasks: text-to-image retrieval, image-to-text retrieval, category recognition, and subcategory recognition, using the FashionGen and FashionIQ datasets. Ablation studies confirm the significant contribution of fashion symbols and attribute prompts in improving performance across all tasks. Visualization using Grad-CAM highlights FashionSAP’s ability to focus on precise regions of interest, demonstrating effective fine-grained alignment between text and image modalities. The current work explores a limited set of fashion symbols based solely on category attributes; future research could investigate more diverse symbol representations. Further exploration is needed to investigate the full potential of the attribute prompt framework in learning richer and more nuanced fashion representations. vision-language pre-training, fine-grained visual recognition, fashion analysis, attribute prompt learning, multi-modal representation learning
2304.04968 Report Re-imagine the Negative Prompt Algorithm: Transform 2D Diffusion into 3D, alleviate Janus problem and Beyond Mohammadreza Armandpour, Ali Sadeghian, Huangjie Zheng, Amir Sadeghian, Mingyuan Zhou Although text-to-image diffusion models have made significant strides in generating images from text, they are sometimes more inclined to generate images like the data on which the model was trained rather than the provided text. This limitation has hindered their usage in both 2D and 3D applications. To address this problem, we explored the use of negative prompts but found that the current implementation fails to produce desired results, particularly when there is an overlap between the main and negative prompts. To overcome this issue, we propose Perp-Neg, a new algorithm that leverages the geometrical properties of the score space to address the shortcomings of the current negative prompts algorithm. Perp-Neg does not require any training or fine-tuning of the model. Moreover, we experimentally demonstrate that Perp-Neg provides greater flexibility in generating images by enabling users to edit out unwanted concepts from the initially generated images in 2D cases. Furthermore, to extend the application of Perp-Neg to 3D, we conducted a thorough exploration of how Perp-Neg can be used in 2D to condition the diffusion model to generate desired views, rather than being biased toward the canonical views. Finally, we applied our 2D intuition to integrate Perp-Neg with the state-of-the-art text-to-3D (DreamFusion) method, effectively addressing its Janus (multi-head) problem. Our project page is available at https://Perp-Neg.github.io/ This paper proposes Perp-Neg, a novel sampling algorithm for text-to-image diffusion models to address limitations of current negative prompts when there's overlap with the main prompt. Current text-to-image models struggle to accurately represent complex prompts, often generating images resembling training data instead of the input text. This is particularly problematic with negative prompts, hindering their use in both 2D and 3D applications. Perp-Neg leverages geometrical properties of the score space, ensuring negative prompt guidance remains perpendicular to the main prompt's direction, preventing unintended removal of desired concepts. This is achieved without requiring any model training or fine-tuning. Significantly higher success rate (73.1% vs 42% for side view, 40.4% vs 14.6% for back view) in generating images aligned with specific viewpoint prompts compared to baseline methods. Perp-Neg provides better control over negative attribute elimination, allowing for more nuanced image generation. Integration of Perp-Neg with DreamFusion alleviates the Janus problem in text-to-3D generation by improving view faithfulness of the underlying 2D diffusion model. The paper primarily focuses on single object generation and view control, further exploration is needed for more complex scenes and prompt compositions. Fine-tuning the negative prompt weight functions for optimal performance can be time-consuming. text-to-image generation, diffusion models, negative prompts, view synthesis, 3d generation
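The geometric idea behind Perp-Neg is compact: before subtracting a negative-prompt guidance direction, remove its component parallel to the main-prompt direction, so overlapping semantics in the two prompts are not cancelled. Below is a minimal sketch on flattened noise predictions; the guidance weights and the way the epsilon terms would come from a real diffusion model are assumptions for illustration, not the paper's exact code.

```python
import numpy as np

def perp_neg_guidance(eps_uncond, eps_pos, eps_negs, w_pos=7.5, w_negs=None):
    """Classifier-free guidance where each negative direction is first projected
    onto the subspace perpendicular to the positive direction."""
    if w_negs is None:
        w_negs = [1.0] * len(eps_negs)
    d_pos = eps_pos - eps_uncond                       # main-prompt direction
    guided = eps_uncond + w_pos * d_pos
    for eps_neg, w in zip(eps_negs, w_negs):
        d_neg = eps_neg - eps_uncond
        # Drop the part of the negative direction that overlaps with the positive
        # one, then push away along what remains.
        parallel = (d_neg @ d_pos) / (d_pos @ d_pos + 1e-8) * d_pos
        guided -= w * (d_neg - parallel)
    return guided

# Toy usage on flattened noise predictions.
rng = np.random.default_rng(0)
e_u, e_p, e_n = (rng.standard_normal(16) for _ in range(3))
eps = perp_neg_guidance(e_u, e_p, [e_n])
```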
2304.04962 Report Mask-Based Modeling for Neural Radiance Fields Ganlin Yang, Guoqiang Wei, Zhizheng Zhang, Yan Lu, Dong Liu Most Neural Radiance Fields (NeRFs) exhibit limited generalization capabilities, which restrict their applicability in representing multiple scenes using a single model. To address this problem, existing generalizable NeRF methods simply condition the model on image features. These methods still struggle to learn precise global representations over diverse scenes since they lack an effective mechanism for interacting among different points and views. In this work, we unveil that 3D implicit representation learning can be significantly improved by mask-based modeling. Specifically, we propose masked ray and view modeling for generalizable NeRF (MRVM-NeRF), which is a self-supervised pretraining target to predict complete scene representations from partially masked features along each ray. With this pretraining target, MRVM-NeRF enables better use of correlations across different points and views as the geometry priors, which thereby strengthens the capability of capturing intricate details within the scenes and boosts the generalization capability across different scenes. Extensive experiments demonstrate the effectiveness of our proposed MRVM-NeRF on both synthetic and real-world datasets, qualitatively and quantitatively. Besides, we also conduct experiments to show the compatibility of our proposed method with various backbones and its superiority under few-shot cases. This paper proposes MRVM (Masked Ray and View Modeling), a self-supervised pretraining strategy for generalizable Neural Radiance Fields (NeRF) to improve their ability to represent multiple scenes with a single model. Most NeRFs lack generalization capability due to limited conditioning on image features and struggle to learn global representations across diverse scenes. MRVM introduces a pretraining objective that predicts complete scene representations from partially masked features along rays and across views. This encourages interactions across different points and views, enhancing the learning of global 3D scene priors. MRVM-NeRF significantly improves performance on synthetic datasets (ShapeNet) in both category-agnostic and category-specific settings. It also demonstrates effectiveness on challenging real-world datasets (NeRF Synthetic, DTU, LLFF) using both MLP-based and Transformer-based architectures. The learned priors from MRVM are beneficial for both cross-scene generalization and per-scene finetuning. The paper explores a limited set of masking strategies and ratios. Future work could investigate the impact of different masking patterns and optimize them for specific scene complexities. neural radiance fields, generalizable nerf, self-supervised learning, masked modeling, 3d scene representation
2304.04909 Report SATR: Zero-Shot Semantic Segmentation of 3D Shapes Ahmed Abdelreheem, Ivan Skorokhodov, Maks Ovsjanikov, Peter Wonka We explore the task of zero-shot semantic segmentation of 3D shapes by using large-scale off-the-shelf 2D image recognition models. Surprisingly, we find that modern zero-shot 2D object detectors are better suited for this task than contemporary text/image similarity predictors or even zero-shot 2D segmentation networks. Our key finding is that it is possible to extract accurate 3D segmentation maps from multi-view bounding box predictions by using the topological properties of the underlying surface. For this, we develop the Segmentation Assignment with Topological Reweighting (SATR) algorithm and evaluate it on ShapeNetPart and our proposed FAUST benchmarks. SATR achieves state-of-the-art performance and outperforms a baseline algorithm by 1.3% and 4% average mIoU on the FAUST coarse and fine-grained benchmarks, respectively, and by 5.2% average mIoU on the ShapeNetPart benchmark. Our source code and data will be publicly released. Project webpage: https://samir55.github.io/SATR/. This paper proposes SATR, a novel method for zero-shot 3D shape segmentation using off-the-shelf 2D zero-shot object detectors and leveraging the topological properties of 3D surfaces. Extending the success of vision-language models in 2D zero-shot recognition to 3D is hindered by limited 3D data. This work explores using readily available 2D models for efficient and accurate 3D shape segmentation. SATR leverages 2D object detector (GLIP) predictions from multiple views of a 3D shape. It then refines these predictions by introducing Gaussian geodesic reweighting and visibility smoothing techniques, which utilize the topological information of the mesh. SATR achieves state-of-the-art performance on ShapeNetPart and the proposed FAUST benchmarks. It significantly outperforms baseline methods, especially in fine-grained segmentation tasks. Ablation studies demonstrate the effectiveness of the proposed Gaussian geodesic reweighting and visibility smoothing techniques. The random view sampling algorithm does not guarantee complete triangle coverage. Evaluation with other large language models is limited by their public availability. zero-shot learning, 3d shape segmentation, vision-language models, object detection, topology
2304.04820 Report Binary Latent Diffusion Ze Wang, Jiang Wang, Zicheng Liu, Qiang Qiu In this paper, we show that a binary latent space can be explored for compact yet expressive image representations. We model the bi-directional mappings between an image and the corresponding latent binary representation by training an auto-encoder with a Bernoulli encoding distribution. On the one hand, the binary latent space provides a compact discrete image representation of which the distribution can be modeled more efficiently than pixels or continuous latent representations. On the other hand, we now represent each image patch as a binary vector instead of an index of a learned cookbook as in discrete image representations with vector quantization. In this way, we obtain binary latent representations that allow for better image quality and high-resolution image representations without any multi-stage hierarchy in the latent space. In this binary latent space, images can now be generated effectively using a binary latent diffusion model tailored specifically for modeling the prior over the binary image representations. We present both conditional and unconditional image generation experiments with multiple datasets, and show that the proposed method performs comparably to state-of-the-art methods while dramatically improving the sampling efficiency to as few as 16 steps without using any test-time acceleration. The proposed framework can also be seamlessly scaled to $1024 \times 1024$ high-resolution image generation without resorting to latent hierarchy or multi-stage refinements. This paper introduces a method for representing and generating images in a compact binary latent space using a novel binary latent diffusion model. Representing images in a binary latent space offers a compact and expressive alternative to continuous or vector-quantized representations, enabling efficient high-resolution image generation without complex hierarchical latent structures. The method involves training an auto-encoder with a Bernoulli latent distribution to learn bidirectional mappings between images and binary codes. A binary latent diffusion model, tailored for Bernoulli distributions, is then trained to efficiently model the prior over these binary representations, allowing for novel sample generation. The binary latent diffusion model achieves comparable image generation quality and diversity to state-of-the-art methods with significantly fewer denoising steps and faster sampling speed. The method allows for high-resolution (1024x1024) image generation in a single shot without resorting to hierarchical latent structures. Binary latent representations offer a good balance between compactness and expressiveness, achieving better reconstruction quality with fewer bits compared to vector quantization. The current implementation utilizes a plain transformer architecture for the sampler, which may limit its ability to model images with complex global dependencies. Further exploration of different noise schedulers and their impact on sample quality and efficiency is needed. image generation, diffusion models, binary latent space, bernoulli distribution, representation learning
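The encoding side of the binary latent space described in this entry, where each patch is mapped to a vector of Bernoulli variables rather than a codebook index, can be sketched as below, with a straight-through estimator so sampling stays trainable inside the auto-encoder. The layer sizes and the particular straight-through formulation are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class BernoulliEncoder(nn.Module):
    """Map patch features to binary latent codes via Bernoulli sampling,
    with a straight-through estimator for gradients."""
    def __init__(self, feat_dim=256, code_bits=32):
        super().__init__()
        self.to_logits = nn.Linear(feat_dim, code_bits)

    def forward(self, patch_feats):
        probs = torch.sigmoid(self.to_logits(patch_feats))   # Bernoulli parameters
        hard = torch.bernoulli(probs)                         # sampled 0/1 codes
        # Straight-through: forward pass uses the hard codes, backward pass
        # routes gradients through the probabilities.
        return hard + probs - probs.detach(), probs

enc = BernoulliEncoder()
codes, probs = enc(torch.randn(4, 196, 256))        # (batch, patches, feat_dim)
print(codes.shape, codes.detach().unique())         # binary values 0. and 1.
```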
2304.04742 Report Detection Transformer with Stable Matching Shilong Liu, Tianhe Ren, Jiayu Chen, Zhaoyang Zeng, Hao Zhang, Feng Li, Hongyang Li, Jun Huang, Hang Su, Jun Zhu, Lei Zhang This paper is concerned with the matching stability problem across different decoder layers in DEtection TRansformers (DETR). We point out that the unstable matching in DETR is caused by a multi-optimization path problem, which is highlighted by the one-to-one matching design in DETR. To address this problem, we show that the most important design is to use and only use positional metrics (like IOU) to supervise classification scores of positive examples. Under the principle, we propose two simple yet effective modifications by integrating positional metrics to DETR's classification loss and matching cost, named position-supervised loss and position-modulated cost. We verify our methods on several DETR variants. Our methods show consistent improvements over baselines. By integrating our methods with DINO, we achieve 50.4 and 51.5 AP on the COCO detection benchmark using ResNet-50 backbones under 12 epochs and 24 epochs training settings, achieving a new record under the same setting. We achieve 63.8 AP on COCO detection test-dev with a Swin-Large backbone. Our code will be made available at https://github.com/IDEA-Research/Stable-DINO. This paper identifies and addresses the unstable matching problem in DEtection TRansformers (DETR), proposing a solution based on using positional metrics like Intersection over Union (IoU) to supervise classification scores. Unstable matching across decoder layers, caused by a multi-optimization path problem, hinders the training stability and efficiency of DETR-like models. The paper proposes two modifications: (1) position-supervised loss, using IoU to directly supervise classification scores of positive examples, and (2) position-modulated cost, incorporating IoU into the matching cost to down-weight inaccurate predictions. Additionally, a dense memory fusion technique is introduced to merge encoder and backbone features, enhancing feature utilization. Significantly improved training stability, evidenced by reduced inconsistencies in matching across decoder layers. Faster convergence, especially during early training stages, attributed to both the stable matching strategy and the memory fusion technique. State-of-the-art performance on the COCO object detection benchmark, achieving 50.4 AP and 51.5 AP with ResNet-50 backbones under 1x and 2x training schedules, respectively. The method is only validated on image-based object detection and segmentation tasks, leaving its applicability to other domains like 3D object detection unexplored. The study primarily focuses on classification aspects of loss and matching, leaving the optimization of localization components for future work. object detection, detection transformers, detr, stable matching, position supervision
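The central recipe in this entry, "use and only use positional metrics to supervise classification scores of positive examples", can be sketched as a binary cross-entropy whose positive targets are the IoU between each matched query and its ground-truth box. The focal-style modulation and the matching-cost changes from the paper are omitted; the function name, box format, and matching inputs are assumptions for illustration.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import box_iou

def position_supervised_loss(pred_logits, pred_boxes, gt_boxes, matched_gt_idx):
    """BCE where each matched (positive) query is supervised toward its IoU with
    the assigned ground-truth box; unmatched queries are pushed toward 0.
    Boxes are (x1, y1, x2, y2); matched_gt_idx[i] = -1 marks an unmatched query."""
    targets = torch.zeros_like(pred_logits)
    pos = matched_gt_idx >= 0
    if pos.any():
        ious = box_iou(pred_boxes[pos], gt_boxes)            # (num_pos, num_gt)
        targets[pos] = ious[torch.arange(int(pos.sum())), matched_gt_idx[pos]]
    return F.binary_cross_entropy_with_logits(pred_logits, targets)

# Toy usage: 4 queries, 2 ground-truth boxes; queries 0 and 2 matched to gt 0 and 1.
logits = torch.randn(4)
boxes = torch.tensor([[0., 0., 10., 10.], [5., 5., 15., 15.], [20., 20., 30., 30.], [0., 0., 1., 1.]])
gts = torch.tensor([[0., 0., 9., 9.], [21., 19., 31., 29.]])
loss = position_supervised_loss(logits, boxes, gts, torch.tensor([0, -1, 1, -1]))
```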
2304.04709 Report Can SAM Segment Anything? When SAM Meets Camouflaged Object Detection Lv Tang, Haoke Xiao, Bo Li SAM is a segmentation model recently released by Meta AI Research and has been gaining attention quickly due to its impressive performance in generic object segmentation. However, its ability to generalize to specific scenes such as camouflaged scenes is still unknown. Camouflaged object detection (COD) involves identifying objects that are seamlessly integrated into their surroundings and has numerous practical applications in fields such as medicine, art, and agriculture. In this study, we try to ask if SAM can address the COD task and evaluate the performance of SAM on the COD benchmark by employing maximum segmentation evaluation and camouflage location evaluation. We also compare SAM's performance with 22 state-of-the-art COD methods. Our results indicate that while SAM shows promise in generic object segmentation, its performance on the COD task is limited. This presents an opportunity for further research to explore how to build a stronger SAM that may address the COD task. The results of this paper are provided in \url{https://github.com/luckybird1994/SAMCOD}. This paper evaluates the performance of the Segment Anything Model (SAM) on the task of Camouflaged Object Detection (COD). COD is an important task with applications in various fields, and understanding the generalization capabilities of foundation models like SAM in specific domains is crucial. The authors evaluate SAM on three COD benchmark datasets (CAMO, COD10K, NC4K) using two evaluation schemes: maximum segmentation evaluation (selecting the best prediction among multiple outputs) and camouflage location evaluation (analyzing the proportion of predictions exceeding a given F-measure threshold). The results are compared with 22 state-of-the-art COD methods. SAM's performance on COD is limited compared to state-of-the-art COD methods. SAM's maximum segmentation performance is significantly lower than the best-performing COD methods. SAM's ability to accurately locate camouflaged objects also requires further improvement. The evaluation does not explore fine-tuning SAM on COD datasets. Future work could investigate modifications to SAM's architecture to better address COD challenges. camouflaged object detection, segment anything model (sam), foundation models, computer vision, image segmentation
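The "maximum segmentation evaluation" used in this study reduces to scoring each of SAM's candidate masks against the ground truth and keeping the best one. A small sketch with a standard F-measure follows; the beta^2 = 0.3 value and binary-mask format are common COD conventions assumed here for illustration, not details taken from the paper.

```python
import numpy as np

def f_measure(pred, gt, beta2=0.3, eps=1e-8):
    """F-measure between binary masks (beta^2 = 0.3 is a common COD/saliency choice)."""
    tp = np.logical_and(pred, gt).sum()
    precision = tp / (pred.sum() + eps)
    recall = tp / (gt.sum() + eps)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)

def max_segmentation_score(candidate_masks, gt_mask):
    """Score a set of candidate masks (e.g. SAM's multiple outputs) by the best one."""
    return max(f_measure(m, gt_mask) for m in candidate_masks)

# Toy usage with random binary masks.
rng = np.random.default_rng(0)
gt = rng.random((64, 64)) > 0.7
cands = [rng.random((64, 64)) > 0.7 for _ in range(3)]
print(max_segmentation_score(cands, gt))
```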
2304.04704 Report Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition Shuhuai Ren, Aston Zhang, Yi Zhu, Shuai Zhang, Shuai Zheng, Mu Li, Alex Smola, Xu Sun This work proposes POMP, a prompt pre-training method for vision-language models. Being memory and computation efficient, POMP enables the learned prompt to condense semantic information for a rich set of visual concepts with over twenty-thousand classes. Once pre-trained, the prompt with a strong transferable ability can be directly plugged into a variety of visual recognition tasks including image classification, semantic segmentation, and object detection, to boost recognition performances in a zero-shot manner. Empirical evaluation shows that POMP achieves state-of-the-art performances on 21 datasets, e.g., 67.0% average accuracy on 10 classification datasets (+3.1% compared to CoOp) and 84.4 hIoU on open-vocabulary Pascal VOC segmentation (+6.9 compared to ZSSeg). Our code is available at https://github.com/amazon-science/prompt-pretraining. This paper proposes POMP, a memory and computation-efficient prompt pre-training method for vision-language models using the ImageNet-21K dataset. Existing prompt tuning methods are computationally expensive for large-scale datasets, limiting their ability to learn a universal, task-agnostic prompt for visual recognition. POMP introduces 'local contrast' to reduce memory overhead by sampling a subset of classes during training and 'local correction' to mitigate bias introduced by sampling. POMP achieves state-of-the-art accuracy on ImageNet-21K (25.3%) with CLIP ViT-B/16 backbone. It outperforms previous methods in cross-dataset image classification, achieving 67.0% average accuracy on 10 datasets. POMP excels in open-vocabulary semantic segmentation and object detection, surpassing previous state-of-the-art methods. The theoretical risk of using a subsampled class set for estimating the expected contrastive loss needs investigation. Utilizing the semantic hierarchy within ImageNet-21K could further enhance performance. prompt learning, vision-language models, zero-shot learning, image recognition, semantic segmentation, object detection
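The memory saving behind "local contrast" in this entry comes from computing the prompt-tuning cross-entropy over a sampled subset of classes that always contains the ground-truth class, instead of over all roughly twenty thousand classes at once. The sketch below shows that sampling step only; the paper's "local correction" term that compensates for the sampling bias is omitted, and the sample size, temperature, and feature dimensions are assumptions.

```python
import torch
import torch.nn.functional as F

def local_contrast_loss(image_feat, class_text_feats, gt_class, num_sampled=256, tau=0.01):
    """Cross-entropy over a sampled subset of class embeddings that always contains
    the ground-truth class, rather than over the full class vocabulary."""
    num_classes = class_text_feats.shape[0]
    neg = torch.randperm(num_classes)[:num_sampled]
    neg = neg[neg != gt_class]                                 # avoid duplicating the target
    sampled = torch.cat([torch.tensor([gt_class]), neg])       # ground truth sits at index 0
    logits = image_feat @ class_text_feats[sampled].T / tau
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))

# Toy usage: 20k classes, 512-dim normalized features.
txt = F.normalize(torch.randn(20000, 512), dim=-1)
img = F.normalize(torch.randn(512), dim=-1)
loss = local_contrast_loss(img, txt, gt_class=7)
```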
2304.04694 Report Video-kMaX: A Simple Unified Approach for Online and Near-Online Video Panoptic Segmentation Inkyu Shin, Dahun Kim, Qihang Yu, Jun Xie, Hong-Seok Kim, Bradley Green, In So Kweon, Kuk-Jin Yoon, Liang-Chieh Chen Video Panoptic Segmentation (VPS) aims to achieve comprehensive pixel-level scene understanding by segmenting all pixels and associating objects in a video. Current solutions can be categorized into online and near-online approaches. Evolving over the time, each category has its own specialized designs, making it nontrivial to adapt models between different categories. To alleviate the discrepancy, in this work, we propose a unified approach for online and near-online VPS. The meta architecture of the proposed Video-kMaX consists of two components: within clip segmenter (for clip-level segmentation) and cross-clip associater (for association beyond clips). We propose clip-kMaX (clip k-means mask transformer) and HiLA-MB (Hierarchical Location-Aware Memory Buffer) to instantiate the segmenter and associater, respectively. Our general formulation includes the online scenario as a special case by adopting clip length of one. Without bells and whistles, Video-kMaX sets a new state-of-the-art on KITTI-STEP and VIPSeg for video panoptic segmentation, and VSPW for video semantic segmentation. Code will be made publicly available. This paper presents Video-kMaX, a simple and unified approach for online and near-online video panoptic segmentation. Existing methods for video panoptic segmentation often require specific design choices depending on whether they process the video frame-by-frame (online) or clip-by-clip (near-online). This paper aims to alleviate this discrepancy with a unified approach. The proposed Video-kMaX consists of two components: clip-kMaX (clip k-means mask transformer) for clip-level segmentation and Location-Aware Memory Bank (LAMB) for cross-clip association. Clip-kMaX extends the image-level k-means mask transformer to the clip level by concatenating clip-level pixel features. LAMB leverages appearance and location features for long-term association across clips using a hierarchical matching scheme. Video-kMaX sets a new state-of-the-art on KITTI-STEP and VIPSeg for video panoptic segmentation, and VSPW for video semantic segmentation. The proposed clip-kMaX effectively handles long sequences of video frames with k-means cross-attention. LAMB significantly improves long-term association quality compared to methods relying solely on appearance features. The model struggles to track objects with large random movements and heavy occlusion. Future work could explore incorporating more sophisticated motion models for robust object association. video panoptic segmentation, online segmentation, near-online segmentation, k-means mask transformer, memory module
2304.04515 Report SOOD: Towards Semi-Supervised Oriented Object Detection Wei Hua, Dingkang Liang, Jingyu Li, Xiaolong Liu, Zhikang Zou, Xiaoqing Ye, Xiang Bai Semi-Supervised Object Detection (SSOD), aiming to explore unlabeled data for boosting object detectors, has become an active task in recent years. However, existing SSOD approaches mainly focus on horizontal objects, leaving multi-oriented objects that are common in aerial images unexplored. This paper proposes a novel Semi-supervised Oriented Object Detection model, termed SOOD, built upon the mainstream pseudo-labeling framework. Towards oriented objects in aerial scenes, we design two loss functions to provide better supervision. Focusing on the orientations of objects, the first loss regularizes the consistency between each pseudo-label-prediction pair (includes a prediction and its corresponding pseudo label) with adaptive weights based on their orientation gap. Focusing on the layout of an image, the second loss regularizes the similarity and explicitly builds the many-to-many relation between the sets of pseudo-labels and predictions. Such a global consistency constraint can further boost semi-supervised learning. Our experiments show that when trained with the two proposed losses, SOOD surpasses the state-of-the-art SSOD methods under various settings on the DOTA-v1.5 benchmark. The code will be available at https://github.com/HamPerdredes/SOOD. This paper proposes SOOD, the first semi-supervised oriented object detection method, which introduces two novel losses (RAW and GC) to adapt the dense pseudo-labeling framework for oriented object detection. Oriented object detection in aerial images is crucial but suffers from high annotation costs. Semi-supervised methods can leverage unlabeled data to improve object detectors and reduce annotation effort. SOOD builds upon a dense pseudo-labeling framework with a teacher-student model. It introduces two novel losses: 1) Rotation-aware Adaptive Weighting (RAW) loss considers orientation differences to weigh pseudo-label-prediction pairs. 2) Global Consistency (GC) loss uses optimal transport to enforce layout similarity between teacher and student predictions. SOOD outperforms state-of-the-art SSOD methods on DOTA-v1.5 under various partially labeled data settings (10%, 20%, 30%). SOOD also surpasses existing methods on the fully labeled DOTA-v1.5 benchmark, demonstrating its ability to learn from unlabeled data. Ablation studies confirm the effectiveness of both RAW and GC losses. SOOD's utilization of aerial object characteristics beyond orientation and layout is limited. The RAW and GC losses, currently separate, could be integrated for better synergy. semi-supervised learning, oriented object detection, aerial images, pseudo-labeling, optimal transport
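The rotation-aware adaptive weighting in this entry can be sketched as scaling each pseudo-label/prediction pair's unsupervised loss by a factor that grows with the angular gap between teacher and student orientations. The specific weighting function and angle wrapping below are illustrative choices, not necessarily the paper's exact formulation.

```python
import torch

def rotation_aware_weighted_loss(per_pair_loss, student_angles, teacher_angles):
    """Weight each pseudo-label/prediction pair by its orientation gap so pairs
    whose angles disagree contribute more to the unsupervised loss.
    Angles are in radians; the gap is wrapped into [0, pi/2] for box symmetry."""
    diff = torch.remainder(student_angles - teacher_angles, torch.pi)
    gap = torch.minimum(diff, torch.pi - diff)          # symmetric angular distance
    weights = 1.0 + gap / (torch.pi / 2)                # in [1, 2]; larger gap -> larger weight
    return (weights * per_pair_loss).mean()

# Toy usage.
loss = rotation_aware_weighted_loss(
    per_pair_loss=torch.rand(8),
    student_angles=torch.rand(8) * torch.pi,
    teacher_angles=torch.rand(8) * torch.pi,
)
```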
2304.04514 Report DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment Lewei Yao, Jianhua Han, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, Hang Xu This paper presents DetCLIPv2, an efficient and scalable training framework that incorporates large-scale image-text pairs to achieve open-vocabulary object detection (OVD). Unlike previous OVD frameworks that typically rely on a pre-trained vision-language model (e.g., CLIP) or exploit image-text pairs via a pseudo labeling process, DetCLIPv2 directly learns the fine-grained word-region alignment from massive image-text pairs in an end-to-end manner. To accomplish this, we employ a maximum word-region similarity between region proposals and textual words to guide the contrastive objective. To enable the model to gain localization capability while learning broad concepts, DetCLIPv2 is trained with a hybrid supervision from detection, grounding and image-text pair data under a unified data formulation. By jointly training with an alternating scheme and adopting low-resolution input for image-text pairs, DetCLIPv2 exploits image-text pair data efficiently and effectively: DetCLIPv2 utilizes 13X more image-text pairs than DetCLIP with a similar training time and improves performance. With 13M image-text pairs for pre-training, DetCLIPv2 demonstrates superior open-vocabulary detection performance, e.g., DetCLIPv2 with Swin-T backbone achieves 40.4% zero-shot AP on the LVIS benchmark, which outperforms previous works GLIP/GLIPv2/DetCLIP by 14.4/11.4/4.5% AP, respectively, and even beats its fully-supervised counterpart by a large margin. DetCLIPv2, an open-vocabulary object detection framework that learns word-region alignment directly from image-text pairs via end-to-end training. Addresses limitations of prior OVD methods that rely on pre-trained VL models or pseudo labeling by directly learning from large-scale image-text pairs. Joint training with detection, grounding, and image-text data using a unified formulation. Employs maximum word-region similarity for contrastive learning, aligning visual regions with textual concepts. Achieves 40.4% zero-shot AP on LVIS with Swin-T backbone, outperforming previous state-of-the-art methods. Demonstrates efficient training, utilizing 13x more image-text pairs than DetCLIP with similar training time. Exhibits strong generalization, achieving state-of-the-art fine-tuning performance on LVIS and ODinW13. Localization capability heavily relies on bounding box annotations from detection data. Noisy and incomplete descriptions in web-crawled image-text pairs impact learning efficiency. open-vocabulary object detection, word-region alignment, contrastive learning, image-text pairs, weakly-supervised learning
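A minimal sketch of the maximum word-region similarity objective follows: each caption word is scored by its best-matching region proposal, and the averaged score feeds a standard image-text contrastive loss. The batch construction, temperature, and the random features used as stand-ins for the encoders are simplifying assumptions.

```python
# Maximum word-region similarity feeding an image-text contrastive loss (sketch).
import torch
import torch.nn.functional as F

def word_region_score(region_feats, word_feats):
    """region_feats: (R, D), word_feats: (W, D). Returns a scalar alignment score."""
    r = F.normalize(region_feats, dim=-1)
    w = F.normalize(word_feats, dim=-1)
    sim = w @ r.T                        # (W, R) word-to-region cosine similarities
    return sim.max(dim=1).values.mean()  # best-matching region per word, averaged over words

def contrastive_loss(all_region_feats, all_word_feats, tau=0.07):
    """Image-text contrastive loss over a batch of B (regions, words) pairs."""
    B = len(all_region_feats)
    logits = torch.stack([
        torch.stack([word_region_score(all_region_feats[i], all_word_feats[j]) for j in range(B)])
        for i in range(B)
    ]) / tau
    return F.cross_entropy(logits, torch.arange(B))

regions = [torch.randn(10, 64) for _ in range(4)]  # 4 images, 10 proposals each (stand-ins)
words = [torch.randn(6, 64) for _ in range(4)]     # 4 captions, 6 words each (stand-ins)
print(contrastive_loss(regions, words))
```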
2304.04452 Report Neural Residual Radiance Fields for Streamably Free-Viewpoint Videos Liao Wang, Qiang Hu, Qihan He, Ziyu Wang, Jingyi Yu, Tinne Tuytelaars, Lan Xu, Minye Wu The success of the Neural Radiance Fields (NeRFs) for modeling and free-view rendering static objects has inspired numerous attempts on dynamic scenes. Current techniques that utilize neural rendering for facilitating free-view videos (FVVs) are either restricted to offline rendering or capable of processing only brief sequences with minimal motion. In this paper, we present a novel technique, Residual Radiance Field or ReRF, as a highly compact neural representation to achieve real-time FVV rendering on long-duration dynamic scenes. ReRF explicitly models the residual information between adjacent timestamps in the spatial-temporal feature space, with a global coordinate-based tiny MLP as the feature decoder. Specifically, ReRF employs a compact motion grid along with a residual feature grid to exploit inter-frame feature similarities. We show such a strategy can handle large motions without sacrificing quality. We further present a sequential training scheme to maintain the smoothness and the sparsity of the motion/residual grids. Based on ReRF, we design a special FVV codec that achieves a compression rate of three orders of magnitude and provides a companion ReRF player to support online streaming of long-duration FVVs of dynamic scenes. Extensive experiments demonstrate the effectiveness of ReRF for compactly representing dynamic radiance fields, enabling an unprecedented free-viewpoint viewing experience in speed and quality. Presents Residual Radiance Field (ReRF), a novel neural representation for streamable free-viewpoint viewing of long-duration dynamic scenes. Existing methods for free-viewpoint videos (FVVs) are either offline or limited to short sequences with minimal motion. ReRF aims to enable real-time FVV rendering on long, dynamic scenes with high compression. ReRF uses a global tiny MLP as a feature decoder and models feature space with explicit grids. It employs a compact motion grid for inter-frame position offsets and a sparse residual grid for error compensation and new regions. A two-stage sequential training scheme with motion pooling and sparsity regularizers is used. Achieves high-quality free-viewpoint rendering comparable to per-frame reconstructions but with significantly less storage. Outperforms other dynamic scene reconstruction methods in terms of visual quality, especially in long sequences with large motions. Enables real-time decoding and rendering (20fps) with a companion ReRF player, supporting traditional video controls like pause, play, seek, etc. Per-frame training time needs to be improved further. Reliance on multi-view capture systems for dynamic sequences. neural rendering, free-viewpoint video, dynamic scene reconstruction, neural compression, streaming
2304.04415 Report Meta Compositional Referring Expression Segmentation Li Xu, Mark He Huang, Xindi Shang, Zehuan Yuan, Ying Sun, Jun Liu Referring expression segmentation aims to segment an object described by a language expression from an image. Despite the recent progress on this task, existing models tackling this task may not be able to fully capture semantics and visual representations of individual concepts, which limits their generalization capability, especially when handling novel compositions of learned concepts. In this work, through the lens of meta learning, we propose a Meta Compositional Referring Expression Segmentation (MCRES) framework to enhance model compositional generalization performance. Specifically, to handle various levels of novel compositions, our framework first uses training data to construct a virtual training set and multiple virtual testing sets, where data samples in each virtual testing set contain a level of novel compositions w.r.t. the virtual training set. Then, following a novel meta optimization scheme to optimize the model to obtain good testing performance on the virtual testing sets after training on the virtual training set, our framework can effectively drive the model to better capture semantics and visual representations of individual concepts, and thus obtain robust generalization performance even when handling novel compositions. Extensive experiments on three benchmark datasets demonstrate the effectiveness of our framework. The paper proposes Meta Compositional Referring Expression Segmentation (MCRES), a meta-learning framework to improve generalization performance of RES models when handling novel compositions of learned concepts (e.g., "dark coffee"). Existing RES models struggle to generalize to testing samples containing novel compositions of learned concepts, limiting their practical application. MCRES constructs a virtual training set and multiple virtual testing sets representing different levels of novel compositions. A meta-optimization scheme then optimizes the model on these sets, encouraging it to learn semantics and visual representations of individual concepts for better generalization. MCRES achieves state-of-the-art performance on RefCOCO, RefCOCO+, and RefCOCOg benchmarks. The framework consistently improves performance across various RES model architectures (transformer and CNN based). Ablation studies validate the effectiveness of handling different levels of novel compositions, the meta-optimization scheme, and the virtual sets construction strategy. The framework introduces additional training time overhead due to the meta-optimization process. Future work could explore automatically identifying the most effective level of novel composition for each sample. referring expression segmentation, meta learning, compositional generalization, computer vision, natural language processing
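The meta-optimization scheme can be approximated with a first-order, MAML-style update: adapt a copy of the model on the virtual training batch, measure its loss on virtual testing batches that contain novel compositions, and fold both signals into the real update. This is a generic first-order sketch under assumed placeholders (`model`, `loss_fn`, toy batches), not the authors' exact procedure.

```python
# First-order meta-optimization sketch: virtual-train step, then virtual-test losses.
import copy
import torch
import torch.nn as nn

def meta_step(model, loss_fn, virtual_train_batch, virtual_test_batches,
              inner_lr=1e-2, outer_lr=1e-2):
    # Inner step: adapt a copy of the model on the virtual training batch.
    fast = copy.deepcopy(model)
    inner_loss = loss_fn(fast, virtual_train_batch)
    grads = torch.autograd.grad(inner_loss, list(fast.parameters()))
    with torch.no_grad():
        for p, g in zip(fast.parameters(), grads):
            p -= inner_lr * g

    # Outer step (first-order): virtual-test gradients, taken at the adapted
    # weights, are accumulated and applied to the original model's weights.
    outer_grads = [torch.zeros_like(p) for p in model.parameters()]
    for batch in virtual_test_batches:
        test_loss = loss_fn(fast, batch)
        for acc, g in zip(outer_grads, torch.autograd.grad(test_loss, list(fast.parameters()))):
            acc += g
    train_grads = torch.autograd.grad(loss_fn(model, virtual_train_batch), list(model.parameters()))
    with torch.no_grad():
        for p, g_train, g_test in zip(model.parameters(), train_grads, outer_grads):
            p -= outer_lr * (g_train + g_test / max(len(virtual_test_batches), 1))

# Toy usage with a linear "model" and an MSE stand-in for the segmentation loss.
model = nn.Linear(8, 1)
loss_fn = lambda m, b: nn.functional.mse_loss(m(b[0]), b[1])
batch = (torch.randn(4, 8), torch.randn(4, 1))
meta_step(model, loss_fn, batch, [batch, batch])
```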
2304.04395 Report Instance Neural Radiance Field Yichen Liu, Benran Hu, Junkai Huang, Yu-Wing Tai, Chi-Keung Tang This paper presents one of the first learning-based NeRF 3D instance segmentation pipelines, dubbed as Instance Neural Radiance Field, or Instance NeRF. Taking a NeRF pretrained from multi-view RGB images as input, Instance NeRF can learn 3D instance segmentation of a given scene, represented as an instance field component of the NeRF model. To this end, we adopt a 3D proposal-based mask prediction network on the sampled volumetric features from NeRF, which generates discrete 3D instance masks. The coarse 3D mask prediction is then projected to image space to match 2D segmentation masks from different views generated by existing panoptic segmentation models, which are used to supervise the training of the instance field. Notably, beyond generating consistent 2D segmentation maps from novel views, Instance NeRF can query instance information at any 3D point, which greatly enhances NeRF object segmentation and manipulation. Our method is also one of the first to achieve such results in pure inference. Experimented on synthetic and real-world NeRF datasets with complex indoor scenes, Instance NeRF surpasses previous NeRF segmentation works and competitive 2D segmentation methods in segmentation performance on unseen views. Watch the demo video at https://youtu.be/wW9Bme73coI. Code and data are available at https://github.com/lyclyc52/Instance_NeRF. Presents Instance-NeRF (iNeRF), one of the first learning-based NeRF pipelines for 3D instance segmentation, which learns 3D instance segmentation from a pre-trained NeRF without ground truth segmentation. Addresses the limitations of 3D instance segmentation relying on depth sensors or custom equipment by leveraging the ability of NeRF to associate 2D images with 3D. Employs a 3D proposal-based mask prediction network on NeRF volumetric features, projects coarse 3D masks to image space, and uses 2D segmentation from existing models to match instances across views and supervise the training of a 3D instance field component within the NeRF model. Achieves state-of-the-art 3D instance segmentation in NeRF without requiring ground-truth segmentation during inference. Introduces a Neural Instance Field capable of generating multi-view consistent 2D segmentation and continuous 3D segmentation using NeRF representation. Outperforms competitive 2D segmentation methods and prior NeRF segmentation approaches on synthetic indoor scenes. Relies on existing 2D panoptic segmentation models, which may impact performance if the models are inaccurate. Future work includes extending the method to handle dynamic scenes and more complex real-world scenarios. nerf, 3d instance segmentation, neural instance field, multi-view consistency, unsupervised segmentation
2304.03344 Report Towards Real-time Text-driven Image Manipulation with Unconditional Diffusion Models Nikita Starodubcev, Dmitry Baranchuk, Valentin Khrulkov, Artem Babenko Recent advances in diffusion models enable many powerful instruments for image editing. One of these instruments is text-driven image manipulations: editing semantic attributes of an image according to the provided text description. Existing diffusion-based methods already achieve high-quality image manipulations for a broad range of text prompts. However, in practice, these methods require high computation costs even with a high-end GPU. This greatly limits potential real-world applications of diffusion-based image editing, especially when running on user devices. In this paper, we address efficiency of the recent text-driven editing methods based on unconditional diffusion models and develop a novel algorithm that learns image manipulations 4.5-10 times faster and applies them 8 times faster. We carefully evaluate the visual quality and expressiveness of our approach on multiple datasets using human annotators. Our experiments demonstrate that our algorithm achieves the quality of much more expensive methods. Finally, we show that our approach can adapt the pretrained model to the user-specified image and text description on the fly just for 4 seconds. In this setting, we notice that more compact unconditional diffusion models can be considered as a rational alternative to the popular text-conditional counterparts. This paper introduces a novel algorithm for text-driven image manipulation using unconditional diffusion models that significantly improves efficiency without sacrificing visual quality. Existing diffusion-based methods for text-driven image editing, while effective, are computationally expensive, limiting their practical applications, especially on user devices. The paper leverages two main ingredients: 1) replacing the sequential DDIM encoding with a closed-form, stochastic encoding at a single time step, and 2) updating the model parameters at a single decoding step per training iteration. The proposed algorithm learns image manipulations 4.5-10x faster and applies them 8x faster than previous diffusion-based methods. Despite using approximate encoding and decoding, the approach achieves comparable visual and editing quality to DiffusionCLIP, significantly outperforming GAN-based alternatives. The paper demonstrates that unconditional diffusion models can learn text-guided manipulations from a single image, enabling fast, on-the-fly editing. The proposed method, while more efficient, still requires careful hyperparameter tuning for optimal results. The reliance on CLIP for semantic guidance can limit the expressiveness and success of certain text-driven manipulations. image manipulation, diffusion models, text-guided editing, unconditional diffusion models, clip
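The "closed-form, stochastic encoding at a single time step" refers to the standard forward-diffusion marginal q(x_t | x_0), which can be sampled directly instead of running the sequential DDIM encoder. A minimal sketch, assuming a common linear beta schedule (the paper's exact schedule may differ):

```python
# Closed-form forward diffusion: sample x_t from x_0 in one step.
import torch

def stochastic_encode(x0, t, betas):
    """x0: (B, C, H, W) image batch; t: integer timestep; betas: (T,) noise schedule."""
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t]
    eps = torch.randn_like(x0)
    xt = alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * eps
    return xt, eps

betas = torch.linspace(1e-4, 2e-2, 1000)  # assumed linear schedule
x0 = torch.randn(2, 3, 64, 64)
xt, eps = stochastic_encode(x0, t=400, betas=betas)
print(xt.shape)
```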
2304.04269 Report HumanSD: A Native Skeleton-Guided Diffusion Model for Human Image Generation Xuan Ju, Ailing Zeng, Chenchen Zhao, Jianan Wang, Lei Zhang, Qiang Xu Controllable human image generation (HIG) has numerous real-life applications. State-of-the-art solutions, such as ControlNet and T2I-Adapter, introduce an additional learnable branch on top of the frozen pre-trained stable diffusion (SD) model, which can enforce various conditions, including skeleton guidance of HIG. While such a plug-and-play approach is appealing, the inevitable and uncertain conflicts between the original images produced from the frozen SD branch and the given condition incur significant challenges for the learnable branch, which essentially conducts image feature editing for condition enforcement. In this work, we propose a native skeleton-guided diffusion model for controllable HIG called HumanSD. Instead of performing image editing with dual-branch diffusion, we fine-tune the original SD model using a novel heatmap-guided denoising loss. This strategy effectively and efficiently strengthens the given skeleton condition during model training while mitigating the catastrophic forgetting effects. HumanSD is fine-tuned on the assembly of three large-scale human-centric datasets with text-image-pose information, two of which are established in this work. As shown in Figure 1, HumanSD outperforms ControlNet in terms of accurate pose control and image quality, particularly when the given skeleton guidance is sophisticated. This paper introduces HumanSD, a novel skeleton-guided diffusion model for controllable human image generation that directly fine-tunes the Stable Diffusion model with skeleton conditions, enhancing pose control and image quality. Controllable human image generation is crucial for various applications, but current diffusion-based methods struggle with accurate pose control, especially in complex scenarios. HumanSD addresses these limitations by enabling native skeleton guidance during image generation. The authors propose a novel Heatmap-guided Denoising Loss to mitigate catastrophic forgetting during fine-tuning. They also establish two large-scale human-centric datasets, GHI and LAION-Human, to train HumanSD. HumanSD significantly outperforms state-of-the-art methods like ControlNet in terms of pose accuracy, particularly with challenging poses. The model demonstrates high fidelity in replicating desired human poses while preserving image quality and text-image consistency. The proposed Heatmap-guided Denoising Loss proves effective in improving both pose control and background preservation compared to vanilla fine-tuning. HumanSD still faces challenges with extremely crowded scenes and complex/rare actions. The evaluation system for text and pose-guided image generation needs further development to be more comprehensive and robust. human image generation, diffusion models, pose control, stable diffusion, heatmap-guided denoising loss
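One way to picture a heatmap-guided denoising loss is to re-weight the usual epsilon-prediction MSE so that pixels near body keypoints count more. The Gaussian heatmap construction and the (1 + w·heatmap) weighting below are illustrative assumptions, not HumanSD's exact formulation.

```python
# Sketch: diffusion MSE re-weighted by a keypoint heatmap (assumed weighting form).
import torch

def keypoint_heatmap(keypoints, size, sigma=4.0):
    """keypoints: (K, 2) pixel coords (x, y); returns an (H, W) map in [0, 1]."""
    H, W = size
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    heat = torch.zeros(H, W)
    for x, y in keypoints:
        heat = torch.maximum(heat, torch.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2)))
    return heat

def heatmap_guided_loss(eps_pred, eps_true, heatmap, w=4.0):
    weight = 1.0 + w * heatmap                 # (H, W), broadcast over batch/channels
    per_pixel = (eps_pred - eps_true) ** 2
    return (weight * per_pixel).mean()

eps_true = torch.randn(1, 4, 64, 64)
eps_pred = eps_true + 0.1 * torch.randn_like(eps_true)
heat = keypoint_heatmap(torch.tensor([[32.0, 16.0], [40.0, 40.0]]), size=(64, 64))
print(heatmap_guided_loss(eps_pred, eps_true, heat))
```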
2304.04231 Report CrowdCLIP: Unsupervised Crowd Counting via Vision-Language Model Dingkang Liang, Jiahao Xie, Zhikang Zou, Xiaoqing Ye, Wei Xu, Xiang Bai Supervised crowd counting relies heavily on costly manual labeling, which is difficult and expensive, especially in dense scenes. To alleviate the problem, we propose a novel unsupervised framework for crowd counting, named CrowdCLIP. The core idea is built on two observations: 1) the recent contrastive pre-trained vision-language model (CLIP) has presented impressive performance on various downstream tasks; 2) there is a natural mapping between crowd patches and count text. To the best of our knowledge, CrowdCLIP is the first to investigate the vision language knowledge to solve the counting problem. Specifically, in the training stage, we exploit the multi-modal ranking loss by constructing ranking text prompts to match the size-sorted crowd patches to guide the image encoder learning. In the testing stage, to deal with the diversity of image patches, we propose a simple yet effective progressive filtering strategy to first select the highly potential crowd patches and then map them into the language space with various counting intervals. Extensive experiments on five challenging datasets demonstrate that the proposed CrowdCLIP achieves superior performance compared to previous unsupervised state-of-the-art counting methods. Notably, CrowdCLIP even surpasses some popular fully-supervised methods under the cross-dataset setting. The source code will be available at https://github.com/dk-liang/CrowdCLIP. This paper proposes CrowdCLIP, a novel unsupervised crowd counting framework that leverages a vision-language model (CLIP) to estimate the number of people in images without any labeled data. Existing supervised crowd counting methods rely heavily on expensive and time-consuming manual labeling. This paper explores the potential of vision-language models for unsupervised crowd counting, aiming to alleviate the dependence on labeled data. CrowdCLIP fine-tunes the image encoder of CLIP using a ranking-based contrastive loss with size-sorted image patches and corresponding count text prompts. During inference, it employs a progressive filtering strategy to select highly potential crowd patches and map them into appropriate count intervals. CrowdCLIP significantly outperforms the current state-of-the-art unsupervised method (CSS-CCNN) by a large margin on five challenging datasets. CrowdCLIP even surpasses some popular fully-supervised methods in cross-dataset evaluation scenarios. Ablation studies validate the effectiveness of the ranking-based contrastive fine-tuning, the proposed progressive filtering strategy, and the design choices of text prompts. CrowdCLIP currently only provides count-level estimations and lacks the ability to generate point-level localization information. Future work can focus on exploring unsupervised localization techniques for crowd counting to provide more comprehensive crowd analysis. crowd counting, unsupervised learning, vision-language model, clip, contrastive learning
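A simplified reading of the multi-modal ranking objective: crowd crops sorted by size are paired with count prompts sorted by magnitude, and each crop should match its own rank's prompt better than the others. The hinge loss, the prompt wording, and the random stand-in features below are assumptions; a real setup would use CLIP's image and text encoders.

```python
# Sketch of a rank-matching hinge loss between size-sorted crops and count prompts.
import torch
import torch.nn.functional as F

def ranking_loss(patch_feats, text_feats, margin=0.1):
    """patch_feats, text_feats: (R, D), both sorted by rank and L2-normalized."""
    sim = patch_feats @ text_feats.T              # (R, R) crop-to-prompt similarities
    pos = sim.diag().unsqueeze(1)                 # similarity to the matching rank
    neg = sim + torch.eye(sim.size(0)) * -1e4     # mask out the diagonal
    return F.relu(margin - pos + neg).mean()      # pos should beat every neg by `margin`

ranked_prompts = [f"There are around {n} people in the crowd." for n in (10, 50, 100, 200)]
patch_feats = F.normalize(torch.randn(4, 512), dim=-1)  # stand-in for CLIP image features
text_feats = F.normalize(torch.randn(4, 512), dim=-1)   # stand-in for CLIP text features
print(ranking_loss(patch_feats, text_feats))
```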
2304.03950 Report GANHead: Towards Generative Animatable Neural Head Avatars Sijing Wu, Yichao Yan, Yunhao Li, Yuhao Cheng, Wenhan Zhu, Ke Gao, Xiaobo Li, Guangtao Zhai To bring digital avatars into people's lives, it is highly desirable to efficiently generate complete, realistic, and animatable head avatars. This task is challenging, and it is difficult for existing methods to satisfy all the requirements at once. To achieve these goals, we propose GANHead (Generative Animatable Neural Head Avatar), a novel generative head model that takes advantage of both the fine-grained control over the explicit expression parameters and the realistic rendering results of implicit representations. Specifically, GANHead represents coarse geometry, fine-grained details and texture via three networks in canonical space to obtain the ability to generate complete and realistic head avatars. To achieve flexible animation, we define the deformation field by standard linear blend skinning (LBS), with the learned continuous pose and expression bases and LBS weights. This allows the avatars to be directly animated by FLAME parameters and generalize well to unseen poses and expressions. Compared to state-of-the-art (SOTA) methods, GANHead achieves superior performance on head avatar generation and raw scan fitting. GANHead, a novel generative model for creating realistic and animatable 3D head avatars, is presented. Generating complete, realistic, and animatable 3D head avatars efficiently is crucial for various applications like VR/AR and the metaverse, but remains a challenge for existing methods. GANHead leverages implicit representations with three neural networks to model coarse geometry, fine details, and texture in canonical space. It employs a deformation module based on FLAME parameters for animation and generalizability. GANHead generates high-quality head avatars with detailed geometry and realistic textures. The generated avatars are controllable by FLAME parameters, enabling animation and generalization to unseen poses and expressions. GANHead outperforms SOTA methods in head avatar generation and raw scan fitting, exhibiting superior reconstruction quality in both shape and texture. The current model still struggles to generate realistic hair with complex topology. Training requires significant GPU memory. generative model, 3d head avatar, implicit representation, animatable avatar, flame parameters
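The deformation module is built on standard linear blend skinning, x' = Σ_k w_k (R_k x + t_k). The snippet below is the textbook formula with toy inputs; in GANHead the skinning weights and the pose/expression bases are learned, which is not shown here.

```python
# Standard linear blend skinning (LBS) on a batch of canonical points.
import torch

def lbs(points, weights, rotations, translations):
    """
    points:       (N, 3) canonical-space points
    weights:      (N, K) skinning weights, rows sum to 1
    rotations:    (K, 3, 3) per-joint rotation matrices
    translations: (K, 3) per-joint translations
    returns:      (N, 3) deformed points  x' = sum_k w_k (R_k x + t_k)
    """
    transformed = torch.einsum("kij,nj->nki", rotations, points) + translations  # (N, K, 3)
    return (weights.unsqueeze(-1) * transformed).sum(dim=1)

pts = torch.randn(5, 3)
w = torch.softmax(torch.randn(5, 2), dim=-1)   # 2 toy joints
R = torch.eye(3).expand(2, 3, 3).clone()
t = torch.tensor([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
print(lbs(pts, w, R, t))
```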
2304.03869 Report Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis Qiucheng Wu, Yujian Liu, Handong Zhao, Trung Bui, Zhe Lin, Yang Zhang, Shiyu Chang Diffusion-based models have achieved state-of-the-art performance on text-to-image synthesis tasks. However, one critical limitation of these models is the low fidelity of generated images with respect to the text description, such as missing objects, mismatched attributes, and mislocated objects. One key reason for such inconsistencies is the inaccurate cross-attention to text in both the spatial dimension, which controls at what pixel region an object should appear, and the temporal dimension, which controls how different levels of details are added through the denoising steps. In this paper, we propose a new text-to-image algorithm that adds explicit control over spatial-temporal cross-attention in diffusion models. We first utilize a layout predictor to predict the pixel regions for objects mentioned in the text. We then impose spatial attention control by combining the attention over the entire text description and that over the local description of the particular object in the corresponding pixel region of that object. The temporal attention control is further added by allowing the combination weights to change at each denoising step, and the combination weights are optimized to ensure high fidelity between the image and the text. Experiments show that our method generates images with higher fidelity compared to diffusion-model-based baselines without fine-tuning the diffusion model. Our code is publicly available at https://github.com/UCSB-NLP-Chang/Diffusion-SpaceTime-Attn. This paper proposes a new text-to-image synthesis algorithm that addresses the low fidelity issue of diffusion models by explicitly controlling spatial-temporal cross-attention. Existing diffusion models often generate images inconsistent with text descriptions (e.g., missing objects, mismatched attributes), particularly for complex scenes. The algorithm uses a layout predictor to determine object positions and optimizes a novel spatial-temporal attention mechanism in the diffusion model. This guides the model to attend to both global and local text descriptions, focusing on overall composition initially and refining object details in later stages. The method significantly outperforms baseline diffusion models in generating images faithful to complex text descriptions. Both automatic and human evaluations demonstrate the effectiveness of the proposed spatial-temporal attention control. The method generalizes well to novel object combinations, suggesting its potential for creative applications. The current optimization scheme is time-consuming, taking around 10 minutes per image. The layout predictor's performance might be improved, especially for object positions at the image edge. text-to-image synthesis, diffusion models, cross-attention, image fidelity, layout prediction
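A hedged sketch of the spatial-temporal control: within an object's predicted layout region, attention from the full prompt is blended with attention from that object's local description, using a weight that changes per denoising step. The shapes, the blending rule, and the direction of the per-step schedule are simplified assumptions.

```python
# Blending global and local cross-attention inside a layout region (sketch).
import torch

def blend_attention(attn_global, attn_local, region_mask, lam_t):
    """
    attn_global, attn_local: (HW,) attention over pixels for one text token
    region_mask:             (HW,) 1 inside the object's layout box, else 0
    lam_t:                   scalar in [0, 1], one value per denoising step
    """
    blended = (1 - lam_t) * attn_global + lam_t * attn_local
    out = region_mask * blended + (1 - region_mask) * attn_global
    return out / out.sum()  # keep it a valid attention distribution

hw = 16 * 16
attn_g = torch.softmax(torch.randn(hw), dim=0)
attn_l = torch.softmax(torch.randn(hw), dim=0)
mask = torch.zeros(hw); mask[:64] = 1.0                # toy layout region
per_step_lambda = torch.linspace(0.9, 0.1, steps=50)   # assumed schedule over denoising steps
print(blend_attention(attn_g, attn_l, mask, per_step_lambda[0]).sum())
```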
2304.03768 Report SparseFormer: Sparse Visual Recognition via Limited Latent Tokens Ziteng Gao, Zhan Tong, Limin Wang, Mike Zheng Shou Human visual recognition is a sparse process, where only a few salient visual cues are attended to rather than traversing every detail uniformly. However, most current vision networks follow a dense paradigm, processing every single visual unit (e.g., pixel or patch) in a uniform manner. In this paper, we challenge this dense paradigm and present a new method, coined SparseFormer, to imitate human sparse visual recognition in an end-to-end manner. SparseFormer learns to represent images using a highly limited number of tokens (down to 49) in the latent space with a sparse feature sampling procedure instead of processing dense units in the original pixel space. Therefore, SparseFormer circumvents most of the dense operations on the image space and has much lower computational costs. Experiments on the ImageNet classification benchmark dataset show that SparseFormer achieves performance on par with canonical or well-established models while offering a better accuracy-throughput tradeoff. Moreover, the design of our network can be easily extended to video classification with promising performance at lower computational costs. We hope that our work can provide an alternative way for visual modeling and inspire further research on sparse neural architectures. The code will be publicly available at https://github.com/showlab/sparseformer SparseFormer is a novel vision architecture that sparsely represents images using a limited number of latent tokens and transformers in the latent space, mimicking human sparse visual recognition. This approach addresses the limitations of dense processing in conventional vision networks, offering a computationally efficient alternative. SparseFormer employs sparse feature sampling and adaptive feature decoding to build latent tokens and iteratively refines their region of interest (RoI) using a focusing transformer. A cortex transformer then processes these tokens for recognition. SparseFormer achieves comparable performance to dense counterparts on ImageNet classification with a better accuracy-throughput trade-off. It effectively focuses on foregrounds in an end-to-end manner using only classification signals. The architecture extends well to video classification, demonstrating efficiency on Kinetics-400. The performance of SparseFormer heavily relies on the number of latent tokens. Further exploration of token initialization and scaling strategies is needed. sparse visual recognition, vision transformer, latent tokens, image classification, video classification
2304.03752 Report V3Det: Vast Vocabulary Visual Detection Dataset Jiaqi Wang, Pan Zhang, Tao Chu, Yuhang Cao, Yujie Zhou, Tong Wu, Bin Wang, Conghui He, Dahua Lin Recent advances in detecting arbitrary objects in the real world are trained and evaluated on object detection datasets with a relatively restricted vocabulary. To facilitate the development of more general visual object detection, we propose V3Det, a vast vocabulary visual detection dataset with precisely annotated bounding boxes on massive images. V3Det has several appealing properties: 1) Vast Vocabulary: It contains bounding boxes of objects from 13,204 categories on real-world images, which is 10 times larger than the existing large vocabulary object detection dataset, e.g., LVIS. 2) Hierarchical Category Organization: The vast vocabulary of V3Det is organized by a hierarchical category tree which annotates the inclusion relationship among categories, encouraging the exploration of category relationships in vast and open vocabulary object detection. 3) Rich Annotations: V3Det comprises precisely annotated objects in 243k images and professional descriptions of each category written by human experts and a powerful chatbot. By offering a vast exploration space, V3Det enables extensive benchmarks on both vast and open vocabulary object detection, leading to new observations, practices, and insights for future research. It has the potential to serve as a cornerstone dataset for developing more general visual perception systems. V3Det is available at https://v3det.openxlab.org.cn/. V3Det, a vast vocabulary visual detection dataset with 13,204 categories hierarchically organized with a category tree, is introduced. Existing object detection datasets have a restricted vocabulary, limiting the development of more general visual object detection systems capable of detecting arbitrary objects. V3Det leverages the Bamboo classification dataset and web data for image and category acquisition, employs a coarse-to-fine annotation pipeline with multiple verification stages, and provides rich category descriptions. V3Det contains bounding boxes of objects from 13,204 categories, 10 times larger than existing datasets like LVIS. Benchmarks on V3Det reveal insights and best practices for vast and open vocabulary object detection. Pretraining on V3Det significantly improves the class generalizability of detectors, highlighting its value for open-vocabulary algorithms. Limited resources restricted the evaluation of all potential object detectors. Further exploration of techniques for efficiently training and evaluating models on such a vast vocabulary dataset is needed. object detection, vast vocabulary, open vocabulary, dataset, benchmark
2304.03659 Report Probing Conceptual Understanding of Large Visual-Language Models Madeline Schiappa, Raiyaan Abdullah, Shehreen Azad, Jared Claypoole, Michael Cogswell, Ajay Divakaran, Yogesh Rawat In recent years large visual-language (V+L) models have achieved great success in various downstream tasks. However, it is not well studied whether these models have a conceptual grasp of the visual content. In this work we focus on conceptual understanding of these large V+L models. To facilitate this study, we propose novel benchmarking datasets for probing three different aspects of content understanding, 1) relations, 2) composition, and 3) context. Our probes are grounded in cognitive science and help determine if a V+L model can, for example, determine if snow garnished with a man is implausible, or if it can identify beach furniture by knowing it is located on a beach. We experimented with many recent state-of-the-art V+L models and observe that these models mostly fail to demonstrate a conceptual understanding. This study reveals several interesting insights such as that cross-attention helps learning conceptual understanding, and that CNNs are better with texture and patterns, while Transformers are better at color and shape. We further utilize some of these insights and investigate a simple finetuning technique that rewards the three conceptual understanding measures with promising initial results. The proposed benchmarks will drive the community to delve deeper into conceptual understanding and foster advancements in the capabilities of large V+L models. The code and dataset are available at: https://tinyurl.com/vlm-robustness This paper proposes three novel benchmark datasets (Probe-R, Probe-C, Probe-B) to evaluate the conceptual understanding of large visual-language (V+L) models in terms of relations, composition, and context. It is crucial for real-world applications that V+L models develop an understanding of visual content beyond memorization, similar to the 'conceptual maps' used by humans. The study evaluates various state-of-the-art V+L models using these datasets. Probe-R uses image-text matching with swapped predicates or objects. Probe-C assesses compositional understanding through image-prompt matching with swapped compositions or objects. Probe-B analyzes contextual understanding by observing performance changes after background removal or replacement. Existing V+L models largely fail to demonstrate robust conceptual understanding, particularly struggling with relational and contextual reasoning. Cross-attention mechanisms in V+L models are found to be beneficial for learning conceptual understanding. CNN-based models show strength in texture and pattern recognition, while Transformer-based models excel in color and shape understanding. The study primarily focuses on visual perception and could be extended to incorporate subjective inference. Future work can explore the impact of larger and more diverse training datasets on V+L models' conceptual understanding. visual-language models, conceptual understanding, benchmarking datasets, compositionality, contextual reasoning
2304.03542 Report Better "CMOS" Produces Clearer Images: Learning Space-Variant Blur Estimation for Blind Image Super-Resolution Xuhai Chen, Jiangning Zhang, Chao Xu, Yabiao Wang, Chengjie Wang, Yong Liu Most of the existing blind image Super-Resolution (SR) methods assume that the blur kernels are space-invariant. However, the blur involved in real applications is usually space-variant due to object motion, out-of-focus, etc., resulting in a severe performance drop for advanced SR methods. To address this problem, we first introduce two new datasets with out-of-focus blur, i.e., NYUv2-BSR and Cityscapes-BSR, to support further research on blind SR with space-variant blur. Based on the datasets, we design a novel Cross-MOdal fuSion network (CMOS) that estimates both blur and semantics simultaneously, which leads to improved SR results. It involves a feature Grouping Interactive Attention (GIA) module to make the two modalities interact more effectively and avoid inconsistency. GIA can also be used for the interaction of other features because of the universality of its structure. Qualitative and quantitative experiments compared with state-of-the-art methods on the above datasets and real-world images demonstrate the superiority of our method, e.g., improving PSNR/SSIM by +1.91/+0.0048 on NYUv2-BSR over MANet. This paper introduces two new datasets with out-of-focus blur for blind image super-resolution and proposes CMOS, a novel cross-modal fusion network, for estimating space-variant blur. Real-world blur is often space-variant, which significantly degrades the performance of existing SR methods that assume space-invariant blur. This work aims to address this limitation and improve blind SR in more realistic scenarios. The authors propose CMOS, a multi-scale network that leverages semantic information to improve space-variant blur estimation. It uses a novel Grouping Interactive Attention (GIA) module for effective interaction between blur and semantic features. They also introduce two new datasets with synthetic out-of-focus blur for training and evaluation. CMOS, combined with a non-blind SR model, achieves state-of-the-art performance on the proposed datasets, outperforming existing blind SR methods by a significant margin. The effectiveness of using semantic information and the proposed GIA module is demonstrated through ablation studies. CMOS also shows superior visual results on real-world images with space-variant blur. The current work focuses on out-of-focus blur as a representative case of space-variant blur. Future work can explore the integration of CMOS with more advanced non-blind SR techniques and extend the approach to other types of spatially variant blur. blind image super-resolution, space-variant blur, out-of-focus blur, cross-modal fusion, grouping interactive attention
2304.03526 Report Lift3D: Synthesize 3D Training Data by Lifting 2D GAN to 3D Generative Radiance Field Leheng Li, Qing Lian, Luozhou Wang, Ningning Ma, Ying-Cong Chen This work explores the use of 3D generative models to synthesize training data for 3D vision tasks. The key requirements of the generative models are that the generated data should be photorealistic to match the real-world scenarios, and the corresponding 3D attributes should be aligned with given sampling labels. However, we find that the recent NeRF-based 3D GANs hardly meet the above requirements due to their designed generation pipeline and the lack of explicit 3D supervision. In this work, we propose Lift3D, an inverted 2D-to-3D generation framework to achieve the data generation objectives. Lift3D has several merits compared to prior methods: (1) Unlike previous 3D GANs that the output resolution is fixed after training, Lift3D can generalize to any camera intrinsic with higher resolution and photorealistic output. (2) By lifting well-disentangled 2D GAN to 3D object NeRF, Lift3D provides explicit 3D information of generated objects, thus offering accurate 3D annotations for downstream tasks. We evaluate the effectiveness of our framework by augmenting autonomous driving datasets. Experimental results demonstrate that our data generation framework can effectively improve the performance of 3D object detectors. Project page: https://len-li.github.io/lift3d-web. Lift3D, a novel 2D-to-3D generation framework, synthesizes 3D training data by lifting pretrained 2D GAN to 3D generative radiance field. Current 3D GANs struggle to generate high-resolution, multi-view consistent images with accurate 3D annotations, limiting their use for data augmentation in 3D vision tasks. Lift3D disentangles a 2D GAN to generate multi-view images with pseudo pose labels, then lifts them to a 3D object NeRF using a shared conditional NeRF and optimized latent codes. Outperforms state-of-the-art data augmentation methods for 3D object detection on KITTI and nuScenes datasets. Demonstrates superior visual quality and multi-view consistency compared to previous 3D GANs. Enables unsupervised training of 3D object detectors with promising results. Current method lacks explicit relation reasoning between generated objects and the environment. Illumination gaps exist between synthetic objects and real-world backgrounds. data augmentation, 3d object detection, generative adversarial networks, neural radiance fields, autonomous driving
2304.03486 Report Can we learn better with hard samples? Subin Sahayam, John Zakkam, Umarani Jayaraman In deep learning, mini-batch training is commonly used to optimize network parameters. However, the traditional mini-batch method may not learn the under-represented samples and complex patterns in the data, leading to a longer time for generalization. To address this problem, a variant of the traditional algorithm has been proposed, which trains the network focusing on mini-batches with high loss. The study evaluates the effectiveness of the proposed training using various deep neural networks trained on three benchmark datasets (CIFAR-10, CIFAR-100, and STL-10). The deep neural networks used in the study are ResNet-18, ResNet-50, EfficientNet-B4, EfficientNetV2-S, and MobilenetV3-S. The experimental results showed that the proposed method can significantly improve the test accuracy and speed up the convergence compared to the traditional mini-batch training method. Furthermore, we introduce a hyper-parameter delta (δ) that decides how many mini-batches are considered for training. Experiments on various values of δ found that the performance of the proposed method for smaller δ values generally results in similar test accuracy and faster generalization. We show that the proposed method generalizes in 26.47% fewer epochs than the traditional mini-batch method in EfficientNet-B4 on STL-10. The proposed method also improves the test top-1 accuracy by 7.26% in ResNet-18 on CIFAR-100. This paper proposes a novel mini-batch training method that prioritizes learning from hard samples to accelerate the convergence of deep neural networks. The ability to efficiently learn from hard samples is crucial for improving the performance and generalization of deep learning models, especially in terms of faster convergence. The method introduces a hyper-parameter (δ) that selects a fraction of mini-batches with the highest loss values for training in each iteration. The proposed method significantly reduces convergence time while maintaining or even improving test accuracy compared to traditional mini-batch training on benchmark datasets like CIFAR-10, CIFAR-100, and STL-10. Smaller δ values, focusing on the hardest samples, often lead to the most significant acceleration in convergence. The effectiveness of the method varies depending on the network architecture and dataset size, with larger networks and smaller datasets showing greater benefits. While the method accelerates convergence, it doesn't guarantee improved accuracy in every case. The assumption of sample independence limits the method's applicability to datasets with inherent dependencies, such as time series or 3D images. deep learning, mini-batch training, hard sample mining, convergence acceleration, image classification
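The selection rule itself is simple to sketch: score every mini-batch by its current loss and back-propagate only through the δ fraction with the highest loss. How often the original work re-scores batches is not specified here, so the one-pass schedule below is an assumption.

```python
# Select and train on the delta fraction of mini-batches with the highest loss.
import torch
import torch.nn as nn

def train_on_hard_batches(model, batches, optimizer, delta=0.25):
    criterion = nn.CrossEntropyLoss()
    # 1) Score all mini-batches without updating the model.
    with torch.no_grad():
        losses = [criterion(model(x), y).item() for x, y in batches]
    k = max(1, int(delta * len(batches)))
    hard_idx = sorted(range(len(batches)), key=lambda i: losses[i], reverse=True)[:k]
    # 2) Train only on the k hardest mini-batches.
    for i in hard_idx:
        x, y = batches[i]
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    return [losses[i] for i in hard_idx]

model = nn.Linear(32, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
batches = [(torch.randn(16, 32), torch.randint(0, 10, (16,))) for _ in range(8)]
print(train_on_hard_batches(model, batches, opt, delta=0.25))
```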
2304.03411 Report InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning Jing Shi, Wei Xiong, Zhe Lin, Hyun Joon Jung Recent advances in personalized image generation allow a pre-trained text-to-image model to learn a new concept from a set of images. However, existing personalization approaches usually require heavy test-time finetuning for each concept, which is time-consuming and difficult to scale. We propose InstantBooth, a novel approach built upon pre-trained text-to-image models that enables instant text-guided image personalization without any test-time finetuning. We achieve this with several major components. First, we learn the general concept of the input images by converting them to a textual token with a learnable image encoder. Second, to keep the fine details of the identity, we learn rich visual feature representation by introducing a few adapter layers to the pre-trained model. We train our components only on text-image pairs without using paired images of the same concept. Compared to test-time finetuning-based methods like DreamBooth and Textual-Inversion, our model can generate competitive results on unseen concepts concerning language-image alignment, image fidelity, and identity preservation while being 100 times faster. This paper introduces InstantBooth, a novel approach for personalized text-to-image generation that eliminates the need for time-consuming test-time finetuning. Existing personalization methods often require heavy test-time finetuning for each new concept, making them inefficient and difficult to scale. This paper tackles this limitation, enabling instant personalization. InstantBooth leverages a pre-trained text-to-image diffusion model and incorporates: (1) an image encoder to convert input images into a textual concept embedding, (2) adapter layers to inject rich visual features for identity preservation, and (3) techniques like balanced sampling and concept token renormalization for balancing identity and language alignment. InstantBooth achieves comparable results to test-time finetuning-based methods like DreamBooth and Textual-Inversion. It demonstrates superior performance in language-image alignment and identity preservation. The method is significantly faster, being approximately 100 times faster than DreamBooth. The current model requires separate training for each category. The adapter design only allows for a single concept to provide identity details. text-to-image generation, personalized image synthesis, test-time finetuning, diffusion models, adapter layers
2304.03373 Report Training-Free Layout Control with Cross-Attention Guidance Minghao Chen, Iro Laina, Andrea Vedaldi Recent diffusion-based generators can produce high-quality images from textual prompts. However, they often disregard textual instructions that specify the spatial layout of the composition. We propose a simple approach that achieves robust layout control without the need for training or fine-tuning of the image generator. Our technique manipulates the cross-attention layers that the model uses to interface textual and visual information and steers the generation in the desired direction given, e.g., a user-specified layout. To determine how to best guide attention, we study the role of attention maps and explore two alternative strategies, forward and backward guidance. We thoroughly evaluate our approach on three benchmarks and provide several qualitative examples and a comparative analysis of the two strategies that demonstrate the superiority of backward guidance compared to forward guidance, as well as prior work. We further demonstrate the versatility of layout guidance by extending it to applications such as editing the layout and context of real images. This paper proposes a training-free layout control method for diffusion-based image generators using cross-attention guidance, enabling user-specified layout control without retraining. Existing text-to-image generators struggle to accurately interpret and represent spatial relationships specified in text prompts, limiting their controllability. The method introduces two strategies: forward guidance (directly biasing attention maps) and backward guidance (using backpropagation to optimize latent codes for desired layout). Backward guidance significantly outperforms forward guidance and prior arts in achieving layout control while maintaining image quality. Analysis reveals the importance of all tokens, including special tokens like start and padding tokens, in shaping the layout. The method effectively extends to real-image layout editing, enabling manipulation of object position and context within generated scenes. The impact of initial noise on layout needs further investigation. Exploring alternative optimization strategies for improved speed and performance. layout control, text-to-image generation, diffusion models, cross-attention, image editing
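Backward guidance can be sketched as an energy on the cross-attention maps that is minimized by nudging the diffusion latent: the energy is low when a token's attention mass lies inside its target box. The energy form, step size, and the toy stand-in for the attention function are assumptions; wiring this into a real diffusion pipeline's attention layers is not shown.

```python
# Backward guidance sketch: gradient of a layout energy w.r.t. the latent.
import torch

def layout_energy(attn_map, box_mask):
    """attn_map: (H, W) cross-attention for one token; box_mask: (H, W) binary target region."""
    inside = (attn_map * box_mask).sum()
    return (1.0 - inside / (attn_map.sum() + 1e-8)) ** 2

def guide_latent(latent, attn_fn, box_mask, step_size=10.0):
    latent = latent.detach().requires_grad_(True)
    energy = layout_energy(attn_fn(latent), box_mask)
    grad, = torch.autograd.grad(energy, latent)
    return (latent - step_size * grad).detach()

# Toy stand-in: "attention" = softmax over a linear projection of the latent.
H = W = 16
proj = torch.randn(H * W, H * W)
attn_fn = lambda z: torch.softmax(z.flatten() @ proj, dim=0).reshape(H, W)
mask = torch.zeros(H, W); mask[:8, :8] = 1.0
latent = torch.randn(H, W)
print(layout_energy(attn_fn(latent), mask),
      layout_energy(attn_fn(guide_latent(latent, attn_fn, mask)), mask))
```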
2304.03307 Report Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting Syed Talal Wasim, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, Mubarak Shah Adopting contrastive image-text pretrained models like CLIP towards video classification has gained attention due to its cost-effectiveness and competitive performance. However, recent works in this area face a trade-off. Finetuning the pretrained model to achieve strong supervised performance results in low zero-shot generalization. Similarly, freezing the backbone to retain zero-shot capability causes significant drop in supervised accuracy. Because of this, recent works in literature typically train separate models for supervised and zero-shot action recognition. In this work, we propose a multimodal prompt learning scheme that works to balance the supervised and zero-shot performance under a single unified training. Our prompting approach on the vision side caters for three aspects: 1) Global video-level prompts to model the data distribution; 2) Local frame-level prompts to provide per-frame discriminative conditioning; and 3) a summary prompt to extract a condensed video representation. Additionally, we define a prompting scheme on the text side to augment the textual context. Through this prompting scheme, we can achieve state-of-the-art zero-shot performance on Kinetics-600, HMDB51 and UCF101 while remaining competitive in the supervised setting. By keeping the pretrained backbone frozen, we optimize a much lower number of parameters and retain the existing general representation which helps achieve the strong zero-shot performance. Our codes/models are released at https://github.com/TalalWasim/Vita-CLIP. This paper proposes Vita-CLIP, a multimodal prompt learning approach for adapting CLIP to video recognition, balancing supervised learning with zero-shot generalization using a single unified training scheme. Existing methods for adapting CLIP to video recognition often sacrifice zero-shot generalization for supervised performance, or vice versa. This work aims to address this trade-off with a unified model. Vita-CLIP introduces a multimodal prompting scheme involving: 1) video-level prompts for global distribution learning, 2) frame-level prompts for per-frame discriminative conditioning, 3) a summary prompt for condensed video representation, and 4) text prompts for augmented textual context. Vita-CLIP achieves state-of-the-art zero-shot performance on Kinetics-600, HMDB51, and UCF101, outperforming previous methods by significant margins. It maintains competitive supervised performance on Kinetics-400 and Something-Something-V2 compared to methods fine-tuning the entire CLIP backbone. The method effectively captures per-frame variations and overall video distribution, as shown through ablations and visualizations. The performance on fine-grained datasets like Something-Something-V2, while improved over previous vision-language models, is still lower than cross-entropy based methods, suggesting future work in this area. Exploring more efficient prompting techniques and extending the method to other video understanding tasks like retrieval could be interesting research directions. video recognition, zero-shot learning, prompt learning, clip, vision-language models
2304.03285 Report $\text{DC}^2$: Dual-Camera Defocus Control by Learning to Refocus Hadi Alzayer, Abdullah Abuolaim, Leung Chun Chan, Yang Yang, Ying Chen Lou, Jia-Bin Huang, Abhishek Kar Smartphone cameras today are increasingly approaching the versatility and quality of professional cameras through a combination of hardware and software advancements. However, fixed aperture remains a key limitation, preventing users from controlling the depth of field (DoF) of captured images. At the same time, many smartphones now have multiple cameras with different fixed apertures -- specifically, an ultra-wide camera with wider field of view and deeper DoF and a higher resolution primary camera with shallower DoF. In this work, we propose $\text{DC}^2$, a system for defocus control for synthetically varying camera aperture, focus distance and arbitrary defocus effects by fusing information from such a dual-camera system. Our key insight is to leverage real-world smartphone camera dataset by using image refocus as a proxy task for learning to control defocus. Quantitative and qualitative evaluations on real-world data demonstrate our system's efficacy where we outperform state-of-the-art on defocus deblurring, bokeh rendering, and image refocus. Finally, we demonstrate creative post-capture defocus control enabled by our method, including tilt-shift and content-based defocus effects. The paper introduces $DC^2$, a learning-based system for depth-of-field control in dual-camera smartphones, enabling defocus deblurring, depth-based blur rendering, and image refocusing. Current smartphone cameras lack post-capture depth-of-field control due to fixed apertures, and existing methods for defocus manipulation often focus on isolated aspects like deblurring or bokeh rendering. $DC^2$ offers a unified approach to address these limitations using readily available dual-camera systems. The system uses a novel training strategy by leveraging image refocusing as a proxy task. It is trained on a dataset of real-world dual-camera images with varying focus distances, learning to fuse information from wide and ultra-wide cameras to control defocus. $DC^2$ outperforms state-of-the-art methods on defocus deblurring, even without being explicitly trained on all-in-focus images. It achieves competitive performance in synthesizing shallow depth-of-field effects compared to dedicated bokeh rendering methods. The system excels at image refocusing, surpassing baselines that rely on sequential deblurring and blurring steps. The method relies on the ultra-wide camera having a deeper depth-of-field than the wide camera, limiting its effectiveness for systems with similar camera configurations. It depends on the accuracy of pre-existing optical flow and stereo depth algorithms, which can be unreliable in the presence of defocus blur, presenting an area for improvement in future work. depth-of-field control, dual-camera systems, defocus deblurring, image refocusing, bokeh rendering
2304.03284 Report SegGPT: Segmenting Everything In Context Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, Tiejun Huang We present SegGPT, a generalist model for segmenting everything in context. We unify various segmentation tasks into a generalist in-context learning framework that accommodates different kinds of segmentation data by transforming them into the same format of images. The training of SegGPT is formulated as an in-context coloring problem with random color mapping for each data sample. The objective is to accomplish diverse tasks according to the context, rather than relying on specific colors. After training, SegGPT can perform arbitrary segmentation tasks in images or videos via in-context inference, such as object instance, stuff, part, contour, and text. SegGPT is evaluated on a broad range of tasks, including few-shot semantic segmentation, video object segmentation, semantic segmentation, and panoptic segmentation. Our results show strong capabilities in segmenting in-domain and out-of-domain targets, either qualitatively or quantitatively. SegGPT is a generalist model for segmenting everything in context, unifying various segmentation tasks into an in-context learning framework. Existing specialist segmentation models are limited to specific tasks, requiring new models and expensive annotation for different settings. This work aims to train a single model capable of solving diverse segmentation tasks. The model views segmentation as a general format for visual perception, accommodating different data types by transforming them into images. It uses an in-context coloring training scheme with random color mapping to foster generalizability. SegGPT achieves comparable or better performance than state-of-the-art specialist models on few-shot semantic segmentation benchmarks, including out-of-domain tasks. Despite not being specifically trained for video object segmentation, SegGPT achieves competitive results on benchmarks like YouTube-VOS 2018 and DAVIS 2017. The model shows strong qualitative results on arbitrary object/part segmentation, text segmentation, and close-set instance/semantic segmentation with learnable prompt tuning. While the random color scheme enhances generalization, it makes training more challenging, potentially leading to inferior performance on in-domain tasks with large datasets. Future work includes scaling up the model size and exploring self-supervised learning for improved performance and addressing data limitations. segmentation, generalist model, in-context learning, computer vision, vision transformer
2304.03283 Report Diffusion Models as Masked Autoencoders Chen Wei, Karttikeya Mangalam, Po-Yao Huang, Yanghao Li, Haoqi Fan, Hu Xu, Huiyu Wang, Cihang Xie, Alan Yuille, Christoph Feichtenhofer There has been a longstanding belief that generation can facilitate a true understanding of visual data. In line with this, we revisit generatively pre-training visual representations in light of recent interest in denoising diffusion models. While directly pre-training with diffusion models does not produce strong representations, we condition diffusion models on masked input and formulate diffusion models as masked autoencoders (DiffMAE). Our approach is capable of (i) serving as a strong initialization for downstream recognition tasks, (ii) conducting high-quality image inpainting, and (iii) being effortlessly extended to video where it produces state-of-the-art classification accuracy. We further perform a comprehensive study on the pros and cons of design choices and build connections between diffusion models and masked autoencoders. This paper introduces Diffusion Masked Autoencoders (DiffMAE), a novel self-supervised learning framework that unifies diffusion models with masked autoencoders. This work addresses the challenge of effectively utilizing generative pre-training for visual recognition tasks, inspired by the success of generative language models. DiffMAE incorporates masking into diffusion models, training the model to predict the pixel distribution of masked image regions conditioned on visible regions. The model leverages a ViT-based architecture and explores various decoder designs and training strategies. DiffMAE achieves strong performance on ImageNet classification, comparable to leading self-supervised learning methods, while also enabling high-quality image inpainting. DiffMAE demonstrates superior inpainting capabilities compared to specialized inpainting algorithms, quantitatively and qualitatively. The framework effortlessly extends to video, achieving state-of-the-art results on Kinetics-400 video classification and demonstrating promising video inpainting capabilities. There is a trade-off between optimal settings for recognition and inpainting tasks, requiring further exploration for a unified approach. Future work includes investigating the potential of incorporating techniques like tokenization and layer scale to further enhance performance. diffusion models, masked autoencoders, generative pre-training, self-supervised learning, image and video inpainting
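At a pixel level, the conditioning-on-masked-input idea can be sketched as: keep visible regions clean, diffuse only the masked regions, and score the denoiser only where the mask is on. The dummy convolutional denoiser and per-pixel (rather than patch-token) masking below are simplifying assumptions; DiffMAE itself operates on ViT patch tokens.

```python
# Pixel-level sketch of a masked-diffusion training loss with a dummy denoiser.
import torch
import torch.nn as nn

def diffmae_loss(denoiser, x0, mask, t, betas):
    """x0: (B, C, H, W); mask: (B, 1, H, W) with 1 = masked; t: int timestep."""
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t]
    eps = torch.randn_like(x0)
    noisy = alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * eps
    x_in = mask * noisy + (1.0 - mask) * x0          # visible pixels stay clean as conditioning
    eps_pred = denoiser(x_in)
    per_pixel = ((eps_pred - eps) ** 2) * mask       # score only the masked region
    return per_pixel.sum() / (mask.sum() * x0.size(1)).clamp(min=1.0)

denoiser = nn.Conv2d(3, 3, kernel_size=3, padding=1)  # stand-in epsilon predictor
x0 = torch.randn(2, 3, 32, 32)
mask = (torch.rand(2, 1, 32, 32) > 0.25).float()      # ~75% of pixels masked
betas = torch.linspace(1e-4, 2e-2, 1000)              # assumed linear schedule
print(diffmae_loss(denoiser, x0, mask, t=300, betas=betas))
```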
2304.03266 Report Neural Fields meet Explicit Geometric Representation for Inverse Rendering of Urban Scenes Zian Wang, Tianchang Shen, Jun Gao, Shengyu Huang, Jacob Munkberg, Jon Hasselgren, Zan Gojcic, Wenzheng Chen, Sanja Fidler Reconstruction and intrinsic decomposition of scenes from captured imagery would enable many applications such as relighting and virtual object insertion. Recent NeRF based methods achieve impressive fidelity of 3D reconstruction, but bake the lighting and shadows into the radiance field, while mesh-based methods that facilitate intrinsic decomposition through differentiable rendering have not yet scaled to the complexity and scale of outdoor scenes. We present a novel inverse rendering framework for large urban scenes capable of jointly reconstructing the scene geometry, spatially-varying materials, and HDR lighting from a set of posed RGB images with optional depth. Specifically, we use a neural field to account for the primary rays, and use an explicit mesh (reconstructed from the underlying neural field) for modeling secondary rays that produce higher-order lighting effects such as cast shadows. By faithfully disentangling complex geometry and materials from lighting effects, our method enables photorealistic relighting with specular and shadow effects on several outdoor datasets. Moreover, it supports physics-based scene manipulations such as virtual object insertion with ray-traced shadow casting. Presents FEGR, a novel hybrid-rendering pipeline for inverse rendering of large urban scenes, combining neural fields and explicit mesh representations for efficient and high-quality intrinsic decomposition. Enables realistic relighting and virtual object insertion in large-scale environments by disentangling geometry, materials, and HDR lighting, which is crucial for applications like AR/VR and digital twins. Uses a neural field for primary ray rendering and volumetrically renders a G-buffer, then extracts a mesh from the signed distance field for efficient physics-based rendering of secondary rays, enabling high-quality shadow and specular effects. Significantly outperforms state-of-the-art in novel-view synthesis under varying lighting on the NeRF-OSR dataset. Demonstrates high-quality intrinsic decomposition on challenging single-illumination driving scenes, surpassing baseline methods in albedo, geometry, and environment map accuracy. Enables photorealistic virtual object insertion with accurate cast shadows, confirmed by a user study where participants significantly preferred FEGR over baseline methods. Relies on manually designed priors for regularization due to the ill-posed nature of inverse rendering, potentially limiting generalizability. Currently limited to static scenes, requiring future extensions with dynamic NeRF techniques to handle dynamic environments. inverse rendering, neural rendering, neural fields, explicit mesh representation, urban scenes
2304.03246 Report Inst-Inpaint: Instructing to Remove Objects with Diffusion Models Ahmet Burak Yildirim, Vedat Baday, Erkut Erdem, Aykut Erdem, Aysegul Dundar Image inpainting task refers to erasing unwanted pixels from images and filling them in a semantically consistent and realistic way. Traditionally, the pixels that are wished to be erased are defined with binary masks. From the application point of view, a user needs to generate the masks for the objects they would like to remove which can be time-consuming and prone to errors. In this work, we are interested in an image inpainting algorithm that estimates which object to be removed based on natural language input and removes it, simultaneously. For this purpose, first, we construct a dataset named GQA-Inpaint for this task. Second, we present a novel inpainting framework, Inst-Inpaint, that can remove objects from images based on the instructions given as text prompts. We set various GAN and diffusion-based baselines and run experiments on synthetic and real image datasets. We compare methods with different evaluation metrics that measure the quality and accuracy of the models and show significant quantitative and qualitative improvements. This paper presents Inst-Inpaint, a novel end-to-end image inpainting framework that removes objects from images based solely on textual instructions, eliminating the need for binary masks. The proposed instructional image inpainting task offers a more natural and user-friendly way to control image inpainting compared to traditional mask-based methods. The authors create a new real image dataset, GQA-Inpaint, derived from the GQA dataset, and propose Inst-Inpaint, a conditional Latent Diffusion Model trained on this dataset. Inst-Inpaint leverages text prompts and source image encoding to perform object removal. Inst-Inpaint achieves superior FID scores and CLIP-based inpainting scores compared to baseline methods like Instruct X-Decoder, InstPix2Pix, and CLIPSeg on the GQA-Inpaint dataset. The model effectively removes objects in complex scenarios and demonstrates accurate attention to objects targeted for removal. Analysis of attention maps reveals that Inst-Inpaint implicitly identifies objects for removal with higher accuracy than methods explicitly predicting masks based on prompts, such as CLIPSeg and X-Decoder. The reliance on autoencoders in the LDM architecture can lead to poor reconstruction of complex patterns in the output, even when object removal is successful. Future work can explore more powerful autoencoders or alternative optimization strategies to address reconstruction quality limitations. instruction-based inpainting, diffusion models, image editing, text-to-image synthesis, gqa dataset
2304.03199 Report Face Animation with an Attribute-Guided Diffusion Model Bohan Zeng, Xuhui Liu, Sicheng Gao, Boyu Liu, Hong Li, Jianzhuang Liu, Baochang Zhang Face animation has achieved much progress in computer vision. However, prevailing GAN-based methods suffer from unnatural distortions and artifacts due to sophisticated motion deformation. In this paper, we propose a Face Animation framework with an attribute-guided Diffusion Model (FADM), which is the first work to exploit the superior modeling capacity of diffusion models for photo-realistic talking-head generation. To mitigate the uncontrollable synthesis effect of the diffusion model, we design an Attribute-Guided Conditioning Network (AGCN) to adaptively combine the coarse animation features and 3D face reconstruction results, which can incorporate appearance and motion conditions into the diffusion process. These specific designs help FADM rectify unnatural artifacts and distortions, and also enrich high-fidelity facial details through iterative diffusion refinements with accurate animation attributes. FADM can flexibly and effectively improve existing animation videos. Extensive experiments on widely used talking-head benchmarks validate the effectiveness of FADM over prior arts. This paper introduces FADM, a novel face animation framework utilizing an attribute-guided diffusion model to enhance the quality of animation results, rectifying distortions and artifacts common in GAN-based methods. Existing GAN-based face animation methods often produce unnatural distortions and artifacts. This work leverages the superior modeling capacity of diffusion models to generate more photo-realistic talking-head videos. FADM consists of a coarse generative module, 3D face reconstruction, an attribute-guided conditioning network (AGCN), and a diffusion rendering module. AGCN combines coarse animation features with 3D face reconstruction to guide the diffusion process, ensuring accurate animation attributes and high-fidelity facial details. FADM achieves state-of-the-art performance on widely used talking-head benchmarks like VoxCeleb and CelebA. It effectively rectifies distortions and enriches facial details while preserving accurate appearance and motion. The framework can also be applied to improve the quality of existing animation videos. The performance of FADM on datasets with low-resolution images and blurred textures, like VoxCeleb2, can be further improved. Exploring more effective attribute-guided strategies to further enhance the controllability and fidelity of face animation is a promising direction. face animation, diffusion models, generative models, deep learning, computer vision
2304.03119 Report Zero-shot Generative Model Adaptation via Image-specific Prompt Learning Jiayi Guo, Chaofei Wang, You Wu, Eric Zhang, Kai Wang, Xingqian Xu, Shiji Song, Humphrey Shi, Gao Huang Recently, CLIP-guided image synthesis has shown appealing performance on adapting a pre-trained source-domain generator to an unseen target domain. It does not require any target-domain samples but only the textual domain labels. The training is highly efficient, e.g., a few minutes. However, existing methods still have some limitations in the quality of generated images and may suffer from the mode collapse issue. A key reason is that a fixed adaptation direction is applied for all cross-domain image pairs, which leads to identical supervision signals. To address this issue, we propose an Image-specific Prompt Learning (IPL) method, which learns specific prompt vectors for each source-domain image. This produces a more precise adaptation direction for every cross-domain image pair, endowing the target-domain generator with greatly enhanced flexibility. Qualitative and quantitative evaluations on various domains demonstrate that IPL effectively improves the quality and diversity of synthesized images and alleviates the mode collapse. Moreover, IPL is independent of the structure of the generative model, such as generative adversarial networks or diffusion models. Code is available at https://github.com/Picsart-AI-Research/IPL-Zero-Shot-Generative-Model-Adaptation. This paper proposes Image-specific Prompt Learning (IPL), a novel approach for zero-shot generative model adaptation that addresses the limitations of existing methods relying on fixed adaptation directions. Existing CLIP-guided zero-shot image synthesis methods, while efficient, suffer from limited image quality and mode collapse due to fixed adaptation directions applied to all cross-domain image pairs. IPL aims to overcome these limitations by introducing image-specific prompt learning. IPL is a two-stage method. Stage 1 trains a latent mapper to generate image-specific prompt vectors for each source image using a contrastive learning scheme and a domain regularization loss. Stage 2 incorporates the trained mapper to generate adaptive, image-specific adaptation directions for training the target-domain generator. IPL effectively improves the quality and diversity of synthesized images compared to existing methods like NADA. IPL alleviates the mode collapse issue observed in previous approaches. IPL is model-agnostic and can be applied to both GAN-based and diffusion-based generative models. The visualization and interpretability of the learned prompt vectors remain challenging. The performance of IPL in scenarios with large domain shifts requires further investigation. generative model adaptation, zero-shot learning, clip, prompt learning, image synthesis
2304.02978 Report Simplifying Low-Light Image Enhancement Networks with Relative Loss Functions Yu Zhang, Xiaoguang Di, Junde Wu, Rao Fu, Yong Li, Yue Wang, Yanwu Xu, Guohui Yang, Chunhui Wang Image enhancement is a common technique used to mitigate issues such as severe noise, low brightness, low contrast, and color deviation in low-light images. However, providing an optimal high-light image as a reference for low-light image enhancement tasks is impossible, which makes the learning process more difficult than other image processing tasks. As a result, although several low-light image enhancement methods have been proposed, most of them are either too complex or insufficient in addressing all the issues in low-light images. In this paper, to make the learning easier in low-light image enhancement, we introduce FLW-Net (Fast and LightWeight Network) and two relative loss functions. Specifically, we first recognize the challenges of the need for a large receptive field to obtain global contrast and the lack of an absolute reference, which limits the simplification of network structures in this task. Then, we propose an efficient global feature information extraction component and two loss functions based on relative information to overcome these challenges. Finally, we conducted comparative experiments to demonstrate the effectiveness of the proposed method, and the results confirm that the proposed method can significantly reduce the complexity of supervised low-light image enhancement networks while improving processing effect. The code is available at https://github.com/hitzhangyu/FLW-Net. This paper presents FLW-Net, a fast and lightweight network for low-light image enhancement, along with two novel relative loss functions to simplify the learning process. Low-light image enhancement suffers from a lack of optimal reference images, making existing methods either too complex or insufficient. FLW-Net addresses this by simplifying network structure and using relative loss functions that don't require exact output-reference matching. FLW-Net utilizes a Global Feature Extraction (GFE) component to efficiently extract global information from image histograms. It employs two relative loss functions: L_brightness for brightness order similarity and L_structure for similar gradient orders between enhanced and reference images. FLW-Net achieves comparable or better performance than state-of-the-art methods while maintaining faster processing speed. Relative loss functions, particularly L_brightness and L_structure, significantly improve PSNR and SSIM, demonstrating effectiveness in noise removal and structural preservation. Combining the proposed loss functions with other networks, like RetinexNet and KIND, enhances their performance with fewer parameters or operations. Enhancement results depend on the desired brightness parameter (μ_test). Training requires paired data, limiting applicability to unpaired datasets. low-light image enhancement, lightweight network, relative loss functions, global feature extraction, image restoration
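The two relative losses are described only at a high level above; the sketch below shows one plausible reading, in which the enhanced image is asked to preserve the reference's ordering of brightness values and of local gradients rather than their absolute values. The pairwise sampling, the tanh surrogate, and the scale factor are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def relative_brightness_loss(pred, ref, num_pairs=4096):
    """Penalize disagreement in pairwise brightness order between pred and ref.
    pred, ref: (B, 1, H, W) luminance maps."""
    B, _, H, W = pred.shape
    flat_pred, flat_ref = pred.flatten(1), ref.flatten(1)
    i = torch.randint(0, H * W, (B, num_pairs), device=pred.device)
    j = torch.randint(0, H * W, (B, num_pairs), device=pred.device)
    dp = flat_pred.gather(1, i) - flat_pred.gather(1, j)
    dr = flat_ref.gather(1, i) - flat_ref.gather(1, j)
    # Differentiable surrogate for "same sign": push tanh of both differences together.
    return F.mse_loss(torch.tanh(10 * dp), torch.tanh(10 * dr))

def relative_structure_loss(pred, ref):
    """Encourage a similar ordering of horizontal/vertical gradients (structure)."""
    gx = lambda x: x[..., :, 1:] - x[..., :, :-1]
    gy = lambda x: x[..., 1:, :] - x[..., :-1, :]
    return (F.mse_loss(torch.tanh(10 * gx(pred)), torch.tanh(10 * gx(ref)))
            + F.mse_loss(torch.tanh(10 * gy(pred)), torch.tanh(10 * gy(ref))))
```

The design intent is that an imperfect reference with the right relative structure can still supervise the network, which is what lets the architecture itself stay small.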
2304.02827 Report DITTO-NeRF: Diffusion-based Iterative Text To Omni-directional 3D Model Hoigi Seo, Hayeon Kim, Gwanghyun Kim, Se Young Chun The increasing demand for high-quality 3D content creation has motivated the development of automated methods for creating 3D object models from a single image and/or from a text prompt. However, the reconstructed 3D objects using state-of-the-art image-to-3D methods still exhibit low correspondence to the given image and low multi-view consistency. Recent state-of-the-art text-to-3D methods are also limited, yielding 3D samples with low diversity per prompt with long synthesis time. To address these challenges, we propose DITTO-NeRF, a novel pipeline to generate a high-quality 3D NeRF model from a text prompt or a single image. Our DITTO-NeRF consists of constructing high-quality partial 3D object for limited in-boundary (IB) angles using the given or text-generated 2D image from the frontal view and then iteratively reconstructing the remaining 3D NeRF using inpainting latent diffusion model. We propose progressive 3D object reconstruction schemes in terms of scales (low to high resolution), angles (IB angles initially to outer-boundary (OB) later), and masks (object to background boundary) in our DITTO-NeRF so that high-quality information on IB can be propagated into OB. Our DITTO-NeRF outperforms state-of-the-art methods in terms of fidelity and diversity qualitatively and quantitatively with much faster training times than prior arts on image/text-to-3D such as DreamFusion, and NeuralLift-360. DITTO-NeRF, a novel pipeline, generates high-quality 3D NeRF models from text prompts or single images by iteratively reconstructing partial 3D objects using inpainting latent diffusion models. Existing image-to-3D methods struggle with low correspondence and multi-view consistency, while text-to-3D methods suffer from low sample diversity and long synthesis times. DITTO-NeRF aims to address these challenges. DITTO-NeRF constructs a high-quality partial 3D object for limited in-boundary angles using a text-generated or user-provided 2D image. It then iteratively reconstructs the remaining 3D NeRF using an inpainting latent diffusion model, employing progressive schemes for scales, angles, and masks. Outperforms state-of-the-art image-to-3D methods in fidelity and multi-view consistency. Exceeds existing text-to-3D methods in output fidelity and diversity. Achieves these improvements with significantly faster training times compared to prior arts like DreamFusion and NeuralLift-360. Limited depth estimation accuracy for images with minimal shadows or generated from out-of-distribution data. 3D object quality is dependent on the quality of images generated by the diffusion model. nerf, text-to-3d, image-to-3d, diffusion models, 3d object generation
2304.02797 Report DeLiRa: Self-Supervised Depth, Light, and Radiance Fields Vitor Guizilini, Igor Vasiljevic, Jiading Fang, Rares Ambrus, Sergey Zakharov, Vincent Sitzmann, Adrien Gaidon Differentiable volumetric rendering is a powerful paradigm for 3D reconstruction and novel view synthesis. However, standard volume rendering approaches struggle with degenerate geometries in the case of limited viewpoint diversity, a common scenario in robotics applications. In this work, we propose to use the multi-view photometric objective from the self-supervised depth estimation literature as a geometric regularizer for volumetric rendering, significantly improving novel view synthesis without requiring additional information. Building upon this insight, we explore the explicit modeling of scene geometry using a generalist Transformer, jointly learning a radiance field as well as depth and light fields with a set of shared latent codes. We demonstrate that sharing geometric information across tasks is mutually beneficial, leading to improvements over single-task learning without an increase in network complexity. Our DeLiRa architecture achieves state-of-the-art results on the ScanNet benchmark, enabling high quality volumetric rendering as well as real-time novel view and depth synthesis in the limited viewpoint diversity setting. Introduces multi-view photometric loss as regularization for volumetric rendering to improve 3D geometry learning, especially in limited viewpoint scenarios, and proposes DeLiRa, a novel architecture that jointly learns depth, light, and radiance fields from a shared latent space. Addresses the challenge of degenerate geometries in volumetric rendering due to limited viewpoint diversity, a common issue in applications like robotics. Combines volumetric rendering with a self-supervised multi-view photometric loss, using depth inferred from rendering to enforce multi-view consistency. DeLiRa utilizes a shared latent representation and cross-attention decoders for efficient and effective multi-task learning. Achieves state-of-the-art depth and view synthesis on ScanNet, outperforming methods reliant on ground truth or pre-trained networks. Demonstrates that multi-view photometric loss effectively regularizes volumetric rendering, enabling accurate geometry recovery in limited viewpoint settings. Shows joint learning of depth, light, and radiance fields in a shared latent space improves performance across tasks compared to single-task networks. Remains scene-specific and requires retraining for new scenes. Requires image overlap for multi-view photometric self-supervision, limiting applicability in very sparse view scenarios. volumetric rendering, depth estimation, neural radiance fields, self-supervised learning, multi-view photometric loss
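The multi-view photometric regularizer can be sketched as a standard warp-and-compare step from the self-supervised depth literature: depth rendered by the radiance field back-projects target pixels into 3D, they are reprojected into a source view, and the sampled colors are compared with the target image. The camera conventions (pose direction, intrinsics) below are assumptions, and the actual objective typically adds an SSIM term.

```python
import torch
import torch.nn.functional as F

def warp_source_to_target(src_img, tgt_depth, K, K_inv, T_tgt_to_src):
    """src_img: (B,3,H,W); tgt_depth: (B,1,H,W); K, K_inv: (B,3,3); T_tgt_to_src: (B,4,4)."""
    B, _, H, W = src_img.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()          # (3, H, W)
    pix = pix.view(1, 3, -1).expand(B, -1, -1).to(src_img.device)            # (B, 3, HW)
    cam = (K_inv @ pix) * tgt_depth.view(B, 1, -1)                           # back-project
    cam_h = torch.cat([cam, torch.ones(B, 1, cam.shape[-1], device=cam.device)], dim=1)
    src_cam = (T_tgt_to_src @ cam_h)[:, :3]                                  # into source frame
    src_pix = K @ src_cam
    src_pix = src_pix[:, :2] / src_pix[:, 2:3].clamp(min=1e-6)
    grid = torch.stack([src_pix[:, 0] / (W - 1) * 2 - 1,                     # normalize to [-1, 1]
                        src_pix[:, 1] / (H - 1) * 2 - 1], dim=-1).view(B, H, W, 2)
    return F.grid_sample(src_img, grid, align_corners=True)

def photometric_loss(tgt_img, src_img, rendered_depth, K, K_inv, T_tgt_to_src):
    warped = warp_source_to_target(src_img, rendered_depth, K, K_inv, T_tgt_to_src)
    return (warped - tgt_img).abs().mean()   # L1; SSIM is commonly added in practice
```

Because the loss only depends on the rendered depth and known poses, it regularizes geometry without any extra supervision, which is the point emphasized above.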
2304.02744 Report StyleGAN Salon: Multi-View Latent Optimization for Pose-Invariant Hairstyle Transfer Sasikarn Khwanmuang, Pakkapon Phongthawee, Patsorn Sangkloy, Supasorn Suwajanakorn Our paper seeks to transfer the hairstyle of a reference image to an input photo for virtual hair try-on. We target a variety of challenging scenarios, such as transforming a long hairstyle with bangs to a pixie cut, which requires removing the existing hair and inferring how the forehead would look, or transferring partially visible hair from a hat-wearing person in a different pose. Past solutions leverage StyleGAN for hallucinating any missing parts and producing a seamless face-hair composite through so-called GAN inversion or projection. However, there remains a challenge in controlling the hallucinations to accurately transfer hairstyle and preserve the face shape and identity of the input. To overcome this, we propose a multi-view optimization framework that uses "two different views" of reference composites to semantically guide occluded or ambiguous regions. Our optimization shares information between two poses, which allows us to produce high fidelity and realistic results from incomplete references. Our framework produces high-quality results and outperforms prior work in a user study that consists of significantly more challenging hair transfer scenarios than previously studied. Project page: https://stylegan-salon.github.io/. This paper presents StyleGAN Salon, a novel pose-invariant hairstyle transfer pipeline that leverages multi-view latent optimization to transfer hairstyles between images with significant pose differences. Hairstyle transfer in the wild is challenging, particularly when dealing with large pose discrepancies between input face and reference hair images. Existing methods often struggle with preserving hair texture, facial features, and background details. The method involves constructing two guide images from different viewpoints using EG3D for geometric consistency and employs a multi-stage optimization strategy. First, it hallucinates missing details by optimizing in the W latent space of StyleGAN2. Subsequently, it refines the output by optimizing in the extended W+ space to recover fine-grained details of both face and hair. Lastly, it utilizes Pivotal Tuning Inversion (PTI) to further enhance the fidelity of the final output. Outperforms state-of-the-art methods like StyleYourHair, Barbershop, and LOHO in user studies, demonstrating superior hairstyle transfer quality, especially in challenging scenarios like pose misalignment, bangs removal, and hat removal. Exhibits better preservation of input facial shape compared to other methods, as indicated by lower RMSE scores on facial landmarks. Successfully handles a variety of challenging scenarios, including transitions from long to short hairstyles, bangs/hat removal, and background inpainting. Struggles with eccentric hairstyles and faces. Relies on multiple pretrained networks, which can introduce biases and limitations. hairstyle transfer, generative adversarial networks, stylegan, multi-view optimization, pose invariance
2304.02642 Report Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models Xuhui Jia, Yang Zhao, Kelvin C. K. Chan, Yandong Li, Han Zhang, Boqing Gong, Tingbo Hou, Huisheng Wang, Yu-Chuan Su This paper proposes a method for generating images of customized objects specified by users. The method is based on a general framework that bypasses the lengthy optimization required by previous approaches, which often employ a per-object optimization paradigm. Our framework adopts an encoder to capture high-level identifiable semantics of objects, producing an object-specific embedding with only a single feed-forward pass. The acquired object embedding is then passed to a text-to-image synthesis model for subsequent generation. To effectively blend an object-aware embedding space into a well-developed text-to-image model under the same generation context, we investigate different network designs and training strategies, and propose a simple yet effective regularized joint training scheme with an object identity preservation loss. Additionally, we propose a caption generation scheme that becomes a critical piece in ensuring the object-specific embedding is faithfully reflected in the generation process, while keeping control and editing abilities. Once trained, the network is able to produce diverse content and styles, conditioned on both texts and objects. We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity, without the need of test-time optimization. Systematic studies are also conducted to analyze our models, providing insights for future work. This paper presents a novel method for personalized image synthesis that bypasses the need for per-object optimization, enabling efficient generation of customized images from a single reference image and a text prompt. Existing personalized image synthesis methods heavily rely on time-consuming per-object optimization, hindering their scalability and practicality. This work addresses this limitation by introducing an efficient and generalizable framework. The proposed framework leverages a pre-trained text-to-image diffusion model augmented with an object encoder. It employs a regularized joint training scheme with a cross-reference regularization to preserve object identity while maintaining editing capability. Additionally, a caption generation scheme enhances personalization by providing diverse text captions. The method generates high-quality, diverse images with strong object fidelity, outperforming existing state-of-the-art approaches. It exhibits superior efficiency by eliminating the need for per-object optimization, achieving comparable performance in a single forward pass. The framework demonstrates generalizability, effectively synthesizing images of various objects, even those unseen during training. The method may struggle to generate accurate details when the reference image lacks complete information. Potential biases in training data could lead to biased image generation, requiring further investigation and mitigation strategies. image synthesis, text-to-image generation, personalized image synthesis, diffusion models, object encoding
2304.02637 Report GenPhys: From Physical Processes to Generative Models Ziming Liu, Di Luo, Yilun Xu, Tommi Jaakkola, Max Tegmark Since diffusion models (DM) and the more recent Poisson flow generative models (PFGM) are inspired by physical processes, it is reasonable to ask: Can physical processes offer additional new generative models? We show that the answer is yes. We introduce a general family, Generative Models from Physical Processes (GenPhys), where we translate partial differential equations (PDEs) describing physical processes to generative models. We show that generative models can be constructed from s-generative PDEs (s for smooth). GenPhys subsume the two existing generative models (DM and PFGM) and even give rise to new families of generative models, e.g., "Yukawa Generative Models" inspired from weak interactions. On the other hand, some physical processes by default do not belong to the GenPhys family, e.g., the wave equation and the Schrödinger equation, but could be made into the GenPhys family with some modifications. Our goal with GenPhys is to explore and expand the design space of generative models. This paper introduces GenPhys, a framework that converts partial differential equations (PDEs) describing physical processes into generative models. GenPhys expands the design space of generative models by leveraging the dynamics of diverse physical structures. The framework leverages the connection between PDEs and density flow. PDEs describing physical processes are rewritten as density flow equations, which then serve as the foundation for generative models. The s-generative property (smoothness, well-behaved density) is introduced as a requirement for a PDE to be suitable for GenPhys. GenPhys subsumes existing generative models like Diffusion Models (DM) and Poisson Flow Generative Models (PFGM). The framework can leverage new physical processes to create new generative models (e.g., "Yukawa Generative Models" inspired by weak interactions). Dispersion relations can serve as a rigorous criterion to determine whether a PDE is suitable for conversion into a generative model. The paper primarily focuses on linear PDEs, leaving the exploration of non-linear PDEs for future work. Further investigation is needed to analyze and test the practical performance of the newly proposed GenPhys models. generative models, physics-inspired ai, partial differential equations, density flow, dispersion relation
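The "density flow" translation can be summarized, in hedged form, by a continuity equation plus a sampling ODE; the notation below is a paraphrase rather than the paper's own, and the boundary conditions are stated loosely.

```latex
% Hedged paraphrase of the PDE-to-generative-model recipe (symbols p, v, T assumed):
% a physical PDE is usable for generation when its solution can be normalized into a
% density p(x, t) obeying a continuity equation for some velocity field v(x, t),
\[
  \partial_t \, p(x, t) + \nabla \cdot \big( p(x, t)\, v(x, t) \big) = 0,
  \qquad p(x, 0) \approx p_{\text{data}}(x), \quad p(x, T) \approx p_{\text{prior}}(x),
\]
% where p(x, T) is smooth and easy to sample. Generation then integrates the flow
% backward in time from the prior:
\[
  \frac{\mathrm{d} x}{\mathrm{d} t} = v\big(x(t), t\big),
  \qquad x(T) \sim p_{\text{prior}} \;\Longrightarrow\; x(0) \sim p_{\text{data}}.
\]
```

Under this reading, diffusion models and PFGM are the special cases arising from the heat/diffusion equation and the Poisson equation respectively, and the "s-generative" and dispersion-relation criteria decide which other PDEs admit such a density-flow rewriting.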
2304.02633 Report HNeRV: A Hybrid Neural Representation for Videos Hao Chen, Matt Gwilliam, Ser-Nam Lim, Abhinav Shrivastava Implicit neural representations store videos as neural networks and have performed well for various vision tasks such as video compression and denoising. With frame index or positional index as input, implicit representations (NeRV, E-NeRV, etc.) reconstruct video from fixed and content-agnostic embeddings. Such embedding largely limits the regression capacity and internal generalization for video interpolation. In this paper, we propose a Hybrid Neural Representation for Videos (HNeRV), where a learnable encoder generates content-adaptive embeddings, which act as the decoder input. Besides the input embedding, we introduce HNeRV blocks, which ensure model parameters are evenly distributed across the entire network, such that higher layers (layers near the output) can have more capacity to store high-resolution content and video details. With content-adaptive embeddings and re-designed architecture, HNeRV outperforms implicit methods in video regression tasks for both reconstruction quality (+4.7 PSNR) and convergence speed (16x faster), and shows better internal generalization. As a simple and efficient video representation, HNeRV also shows decoding advantages for speed, flexibility, and deployment, compared to traditional codecs (H.264, H.265) and learning-based compression methods. Finally, we explore the effectiveness of HNeRV on downstream tasks such as video compression and video inpainting. We provide project page at https://haochen-rye.github.io/HNeRV, and Code at https://github.com/haochen-rye/HNeRV This paper introduces HNeRV, a hybrid neural representation for videos that combines a learnable encoder for content-adaptive embeddings with a redesigned decoder architecture for even parameter distribution. Existing implicit neural representations for videos suffer from limited generalizability and regression capacity due to content-agnostic embeddings and uneven parameter distribution in decoders. HNeRV addresses these limitations, aiming for improved quality, speed, and generalization in video representation. HNeRV consists of a learnable encoder (ConvNeXt blocks) to generate compact frame embeddings and a decoder (HNeRV blocks) that takes embeddings as input. HNeRV blocks are designed to balance parameters across layers, enhancing the representation of high-resolution content. HNeRV significantly outperforms implicit methods (NeRV, E-NeRV) in video reconstruction quality (+4.7 PSNR) and convergence speed (16x faster). The even parameter distribution strategy in HNeRV's decoder, where later layers have more parameters, proves crucial for reconstructing high-resolution videos. HNeRV demonstrates strong performance in downstream tasks like video compression (competing with H.264, H.265) and video inpainting (comparable to SOTA). HNeRV requires training a new model for each video and cannot amortize this cost through pre-training on a large dataset. Determining optimal embedding size, model size, and network architecture for HNeRV remains an open challenge. neural representation, video compression, video regression, video inpainting, internal generalization
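A toy version of the hybrid representation is sketched below: a small convolutional encoder maps each frame to a tiny content-adaptive embedding, and a stack of conv + PixelShuffle decoder blocks reconstructs the frame. The channel widths, strides (the input resolution must be divisible by the overall stride, 80 here), and the use of plain convolutions instead of ConvNeXt blocks are illustrative assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class HNeRVBlock(nn.Module):
    """Conv followed by PixelShuffle upsampling; channel widths are chosen so that
    parameters stay roughly balanced across the decoder instead of concentrating
    in the earliest layer (illustrative numbers only)."""
    def __init__(self, c_in, c_out, up=2, k=3):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out * up * up, k, padding=k // 2)
        self.up = nn.PixelShuffle(up)
        self.act = nn.GELU()

    def forward(self, x):
        return self.act(self.up(self.conv(x)))

class TinyHNeRV(nn.Module):
    def __init__(self, embed_dim=16):
        super().__init__()
        # Content-adaptive encoder: frame -> small spatial embedding (e.g., 16 x H/80 x W/80).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=4, padding=1), nn.GELU(),
            nn.Conv2d(64, 64, 3, stride=4, padding=1), nn.GELU(),
            nn.Conv2d(64, embed_dim, 3, stride=5, padding=1),
        )
        self.decoder = nn.Sequential(
            HNeRVBlock(embed_dim, 96, up=5),
            HNeRVBlock(96, 48, up=4),
            HNeRVBlock(48, 24, up=4),
            nn.Conv2d(24, 3, 3, padding=1),
        )

    def forward(self, frame):
        return self.decoder(self.encoder(frame))

# Usage: fit one such model to a single video; the per-frame embeddings plus decoder
# weights together form the video representation.
model = TinyHNeRV()
recon = model(torch.randn(1, 3, 320, 640))   # resolution divisible by 80
```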
2304.02626 Report Dynamic Point Fields Sergey Prokudin, Qianli Ma, Maxime Raafat, Julien Valentin, Siyu Tang Recent years have witnessed significant progress in the field of neural surface reconstruction. While the extensive focus was put on volumetric and implicit approaches, a number of works have shown that explicit graphics primitives such as point clouds can significantly reduce computational complexity, without sacrificing the reconstructed surface quality. However, less emphasis has been put on modeling dynamic surfaces with point primitives. In this work, we present a dynamic point field model that combines the representational benefits of explicit point-based graphics with implicit deformation networks to allow efficient modeling of non-rigid 3D surfaces. Using explicit surface primitives also allows us to easily incorporate well-established constraints such as as-isometric-as-possible regularisation. While learning this deformation model is prone to local optima when trained in a fully unsupervised manner, we propose to additionally leverage semantic information such as keypoint dynamics to guide the deformation learning. We demonstrate our model with an example application of creating an expressive animatable human avatar from a collection of 3D scans. Here, previous methods mostly rely on variants of the linear blend skinning paradigm, which fundamentally limits the expressivity of such models when dealing with complex cloth appearances such as long skirts. We show the advantages of our dynamic point field framework in terms of its representational power, learning efficiency, and robustness to out-of-distribution novel poses. This paper introduces Dynamic Point Fields (DPF), a novel model combining point-based graphics and deformation networks to efficiently model non-rigid 3D surfaces. DPFs offer a more efficient and compact alternative to implicit models for representing and animating complex dynamic surfaces, with benefits like faster training, better reconstruction quality, and lower memory requirements. DPF learns a deformation field represented by a neural network to warp points from a canonical point cloud to target shapes. It leverages constraints like as-isometric-as-possible regularization and keypoint guidance to learn plausible deformations. DPF outperforms state-of-the-art implicit methods in static surface reconstruction, achieving better quality with smaller model size. DPF demonstrates superior performance in learning deformation fields compared to SDF-based methods and non-rigid registration techniques. DPF enables high-quality animation of clothed humans, particularly excelling with challenging garments like skirts, outperforming LBS-based methods. The deformation optimization can struggle with large deformations and topological changes when guidance is limited. The current per-frame optimization is computationally expensive, limiting real-time applicability. Future work could explore meta-learning for faster inference. dynamic surface reconstruction, deformation learning, point cloud processing, neural surface representation, 3d human animation
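The as-isometric-as-possible constraint mentioned above admits a compact sketch: distances from each point to its nearest neighbors, computed in the canonical pose, should be preserved after deformation. The k-NN construction and uniform weighting below are simplifying assumptions.

```python
import torch

def aiap_loss(canonical_pts, deformed_pts, k=8):
    """canonical_pts, deformed_pts: (N, 3) point sets, deformed_pts = deform(canonical_pts).
    Penalize changes in the distances to the k nearest neighbors (found in canonical space)."""
    d_can = torch.cdist(canonical_pts, canonical_pts)            # (N, N) pairwise distances
    knn = d_can.topk(k + 1, largest=False).indices[:, 1:]        # drop the point itself
    nn_can = canonical_pts[knn]                                  # (N, k, 3)
    nn_def = deformed_pts[knn]                                   # (N, k, 3)
    dist_can = (canonical_pts.unsqueeze(1) - nn_can).norm(dim=-1)
    dist_def = (deformed_pts.unsqueeze(1) - nn_def).norm(dim=-1)
    return ((dist_can - dist_def) ** 2).mean()
```

This term only needs the explicit point positions, which is exactly why the paper argues that point primitives make such classical regularizers easy to plug in.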
2304.02602 Report Generative Novel View Synthesis with 3D-Aware Diffusion Models Eric R. Chan, Koki Nagano, Matthew A. Chan, Alexander W. Bergman, Jeong Joon Park, Axel Levy, Miika Aittala, Shalini De Mello, Tero Karras, Gordon Wetzstein We present a diffusion-based model for 3D-aware generative novel view synthesis from as few as a single input image. Our model samples from the distribution of possible renderings consistent with the input and, even in the presence of ambiguity, is capable of rendering diverse and plausible novel views. To achieve this, our method makes use of existing 2D diffusion backbones but, crucially, incorporates geometry priors in the form of a 3D feature volume. This latent feature field captures the distribution over possible scene representations and improves our method's ability to generate view-consistent novel renderings. In addition to generating novel views, our method has the ability to autoregressively synthesize 3D-consistent sequences. We demonstrate state-of-the-art results on synthetic renderings and room-scale scenes; we also show compelling results for challenging, real-world objects. This paper introduces a diffusion-based model for novel view synthesis that leverages 3D geometry priors, enabling realistic view synthesis from a single input image. Existing few-shot view synthesis methods struggle with long-range extrapolation and handling complex, real-world scenes due to their reliance on regression objectives. The method combines 2D diffusion models with a 3D feature volume capturing the scene representation. Input image(s) are encoded into the 3D feature volume, rendered from the target viewpoint, and fed to a U-Net denoiser along with a noisy image, iteratively generating the novel view. Outperforms state-of-the-art methods on ShapeNet and Matterport3D datasets in terms of image quality and view consistency. Generates plausible and realistic novel views for challenging, real-world objects in the CO3D dataset, a first for single-shot NVS on this dataset. Demonstrates high geometric consistency, as evidenced by dense COLMAP reconstructions from generated sequences. Limited output resolution (128x128) and slow inference speed due to the diffusion process. Potential for minor inconsistencies and drift in challenging real-world datasets. novel view synthesis, diffusion models, 3d geometry priors, generative models, single image view synthesis
2304.02364 Report What's in a Name? Beyond Class Indices for Image Recognition Kai Han, Yandong Li, Sagar Vaze, Jie Li, Xuhui Jia Existing machine learning models demonstrate excellent performance in image object recognition after training on a large-scale dataset under full supervision. However, these models only learn to map an image to a predefined class index, without revealing the actual semantic meaning of the object in the image. In contrast, vision-language models like CLIP are able to assign semantic class names to unseen objects in a `zero-shot' manner, although they still rely on a predefined set of candidate names at test time. In this paper, we reconsider the recognition problem and task a vision-language model to assign class names to images given only a large and essentially unconstrained vocabulary of categories as prior information. We use non-parametric methods to establish relationships between images which allow the model to automatically narrow down the set of possible candidate names. Specifically, we propose iteratively clustering the data and voting on class names within them, showing that this enables a roughly 50\% improvement over the baseline on ImageNet. Furthermore, we tackle this problem both in unsupervised and partially supervised settings, as well as with a coarse-grained and fine-grained search space as the unconstrained dictionary. This paper proposes a new task termed *Semantic Category Discovery (SCD)*, where the goal is to automatically assign semantic class names to images given a large, unconstrained vocabulary of categories, going beyond traditional class index prediction. Existing image recognition models rely on predefined class indices, limiting their ability to handle unseen objects and adapt to new categories. This work aims to bridge the gap to human-like perception, where we can assign semantic names to objects directly. The proposed method leverages non-parametric clustering (e.g., k-means) on image features and a pre-trained vision-language model (e.g., CLIP). It iteratively refines cluster assignments and votes on class names within clusters to narrow down the vocabulary to the most relevant concepts. The method significantly outperforms baseline zero-shot transfer approaches on datasets like ImageNet, Stanford Dogs, and CUB, roughly doubling sACC on ImageNet-100. Surprisingly, the proposed method can also improve clustering accuracy compared to strong baselines like DINO, indicating its effectiveness in grouping semantically similar images. The paper shows that choosing appropriate initial clustering algorithms (constrained semi-supervised k-means) and leveraging strong features (DINO) are crucial for good performance. Despite improvements, the absolute accuracy of the method remains relatively low, highlighting the challenge of unconstrained semantic naming and the need for further research. The reliance on pre-trained models like CLIP introduces potential biases from the internet-scale data they are trained on, necessitating further investigation into transparency and controllability for real-world deployment. image recognition, semantic category discovery, vision-language models, zero-shot learning, clustering
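The cluster-then-vote procedure can be sketched with off-the-shelf tools: cluster image embeddings, let each image nominate its top vocabulary names by similarity to text embeddings, and assign each cluster the majority name. The clustering backend, the per-image nomination rule, and the single voting round below are simplifying assumptions rather than the paper's exact pipeline, which iterates and can use (semi-supervised) constrained k-means.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_vote(image_feats, text_feats, vocab, num_clusters, top_per_image=5):
    """image_feats: (N, D) L2-normalized image embeddings;
       text_feats:  (V, D) L2-normalized embeddings of every name in `vocab` (list of str)."""
    labels = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(image_feats)
    sims = image_feats @ text_feats.T                          # (N, V) cosine similarities
    top_names = np.argsort(-sims, axis=1)[:, :top_per_image]   # each image nominates names

    cluster_names = {}
    for c in range(num_clusters):
        votes = np.bincount(top_names[labels == c].ravel(), minlength=len(vocab))
        cluster_names[c] = vocab[int(votes.argmax())]          # majority name for the cluster
    return {i: cluster_names[l] for i, l in enumerate(labels)}
```

Voting within a cluster is what narrows the essentially unconstrained vocabulary down to the handful of names that many visually similar images agree on.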
2304.02330 Report SMPConv: Self-moving Point Representations for Continuous Convolution Sanghyeon Kim, Eunbyung Park Continuous convolution has recently gained prominence due to its ability to handle irregularly sampled data and model long-term dependency. Also, the promising experimental results of using large convolutional kernels have catalyzed the development of continuous convolution since they can construct large kernels very efficiently. Leveraging neural networks, more specifically multilayer perceptrons (MLPs), is by far the most prevalent approach to implementing continuous convolution. However, there are a few drawbacks, such as high computational costs, complex hyperparameter tuning, and limited descriptive power of filters. This paper suggests an alternative approach to building a continuous convolution without neural networks, resulting in more computationally efficient and improved performance. We present self-moving point representations where weight parameters freely move, and interpolation schemes are used to implement continuous functions. When applied to construct convolutional kernels, the experimental results have shown improved performance with drop-in replacement in the existing frameworks. Due to its lightweight structure, we are first to demonstrate the effectiveness of continuous convolution in a large-scale setting, e.g., ImageNet, presenting the improvements over the prior arts. Our code is available on https://github.com/sangnekim/SMPConv This paper proposes SMPConv, a novel method for continuous convolution that utilizes self-moving point representations and interpolation schemes, eliminating the need for computationally expensive neural networks. Current continuous convolution methods rely heavily on neural networks, leading to high computational costs, complex hyperparameter tuning, and limitations in the descriptive power of filters. SMPConv aims to address these issues by offering a more efficient and effective alternative. SMPConv represents convolutional kernels as continuous functions using self-moving points. These points, associated with weight parameters and radii, are learned during training and interpolated to generate kernel values at arbitrary locations. This approach allows for constructing large, adaptive receptive fields with minimal computational overhead. SMPConv achieves state-of-the-art results on sequential image classification tasks, such as sMNIST and pMNIST, demonstrating its capability in handling long-term dependencies. On CIFAR10 image classification, SMPConv outperforms its MLP-based counterparts with fewer parameters and significantly faster training times, indicating its effectiveness for 2D image data. For the first time, a continuous convolution method, SMPConv, is successfully applied to ImageNet-scale image classification, achieving competitive results with fewer parameters compared to existing models. Limited computational budget restricted the number of experiments for large-scale image classification. Further exploration of regularization techniques and prior knowledge integration could potentially improve performance for tasks requiring long-term dependency modeling. continuous convolution, self-moving point representation, large kernel convolution, efficient deep learning, image classification
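The self-moving point idea can be illustrated in 1D: kernel weights live at learnable positions with learnable radii, and the continuous kernel is obtained by interpolating them onto whatever coordinate grid the desired kernel size requires. The hat-function interpolation and the normalization below are assumptions; the point being illustrated is that the parameter count is decoupled from the kernel size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SMPKernel1d(nn.Module):
    """Builds a (out_ch, in_ch, kernel_size) kernel from a handful of movable points."""
    def __init__(self, in_ch, out_ch, num_points=16):
        super().__init__()
        self.pos = nn.Parameter(torch.empty(num_points).uniform_(-1, 1))    # learnable positions
        self.radius = nn.Parameter(torch.full((num_points,), 0.2))          # learnable radii
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, num_points) * 0.02)

    def forward(self, kernel_size):
        # Query coordinates covering [-1, 1]; interpolate point weights onto them.
        coords = torch.linspace(-1, 1, kernel_size, device=self.pos.device)
        dist = (coords[None, :] - self.pos[:, None]).abs()                   # (P, K)
        influence = (1 - dist / self.radius.abs()[:, None].clamp(min=1e-3)).clamp(min=0)
        influence = influence / influence.sum(dim=0, keepdim=True).clamp(min=1e-6)
        return torch.einsum("oip,pk->oik", self.weight, influence)           # (out, in, K)

# Usage: materialize a large kernel on the fly and apply it with the standard conv op.
smp = SMPKernel1d(in_ch=8, out_ch=8, num_points=16)
x = torch.randn(2, 8, 128)
y = F.conv1d(x, smp(kernel_size=63), padding=31)
```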
2304.02051 Report Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing Alberto Baldrati, Davide Morelli, Giuseppe Cartella, Marcella Cornia, Marco Bertini, Rita Cucchiara Fashion illustration is used by designers to communicate their vision and to bring the design idea from conceptualization to realization, showing how clothes interact with the human body. In this context, computer vision can thus be used to improve the fashion design process. Differently from previous works that mainly focused on the virtual try-on of garments, we propose the task of multimodal-conditioned fashion image editing, guiding the generation of human-centric fashion images by following multimodal prompts, such as text, human body poses, and garment sketches. We tackle this problem by proposing a new architecture based on latent diffusion models, an approach that has not been used before in the fashion domain. Given the lack of existing datasets suitable for the task, we also extend two existing fashion datasets, namely Dress Code and VITON-HD, with multimodal annotations collected in a semi-automatic manner. Experimental results on these new datasets demonstrate the effectiveness of our proposal, both in terms of realism and coherence with the given multimodal inputs. Source code and collected multimodal annotations are publicly available at: https://github.com/aimagelab/multimodal-garment-designer. Introduces Multimodal Garment Designer (MGD), a novel human-centric latent diffusion model for fashion image editing, enabling the generation of novel fashion images conditioned on text, human poses, and garment sketches. Addresses the limitations of existing fashion image editing methods by introducing a multimodal approach that leverages text, human poses, and garment sketches to guide the generation process, enhancing control and personalization in fashion design. Presents a novel MGD architecture based on latent diffusion models, incorporating a denoising network conditioned on text embeddings, pose maps, and garment sketches. Extends existing fashion datasets (Dress Code and VITON-HD) with multimodal annotations, including textual descriptions and garment sketches, collected semi-automatically. MGD consistently outperforms competitor models in terms of image realism and coherence with input modalities on the newly collected multimodal datasets. The model effectively combines and utilizes text, pose, and sketch information in a disentangled manner, making each modality optional during generation. User studies confirm that MGD generates more realistic and coherent images compared to baseline methods. MGD occasionally struggles to generate accurate hand details, particularly when hands occupy a small portion of the input image. The performance of the model is reliant on the quality of the input sketch, and inaccuracies in the sketch can lead to artifacts in the generated image. Future work will focus on addressing these limitations, potentially through improved hand modeling techniques or sketch refinement methods. fashion image editing, latent diffusion models, multimodal conditioning, garment sketch guidance, human-centric generation
2304.02012 Report EGC: Image Generation and Classification via a Diffusion Energy-Based Model Qiushan Guo, Chuofan Ma, Yi Jiang, Zehuan Yuan, Yizhou Yu, Ping Luo Learning image classification and image generation using the same set of network parameters is a challenging problem. Recent advanced approaches that perform well in one task often exhibit poor performance in the other. This work introduces an energy-based classifier and generator, namely EGC, which can achieve superior performance in both tasks using a single neural network. Unlike a conventional classifier that outputs a label given an image (i.e., a conditional distribution p(y|x)), the forward pass in EGC is a classifier that outputs a joint distribution p(x, y), enabling an image generator in its backward pass by marginalizing out the label y. This is done by estimating the energy and classification probability given a noisy image in the forward pass, while denoising it using the score function estimated in the backward pass. EGC achieves competitive generation results compared with state-of-the-art approaches on ImageNet-1k, CelebA-HQ and LSUN Church, while achieving superior classification accuracy and robustness against adversarial attacks on CIFAR-10. This work represents the first successful attempt to simultaneously excel in both tasks using a single set of network parameters. We believe that EGC bridges the gap between discriminative and generative learning. This paper proposes EGC, a novel energy-based model that unifies image classification and generation using a single neural network. Bridging the gap between discriminative and generative learning is a challenging problem, and existing models often excel in one task while performing poorly in the other. EGC aims to achieve superior performance in both tasks simultaneously. EGC leverages the diffusion process to improve the accuracy of estimated scores for stable training and image sampling. The forward pass acts as a classifier, predicting the joint distribution of noisy image and label. The backward pass acts as a generator, denoising data using unconditional and conditional scores. EGC achieves competitive generation results compared to state-of-the-art approaches on ImageNet, CelebA-HQ, and LSUN Church datasets. EGC achieves superior classification accuracy and robustness against adversarial attacks on CIFAR-10, surpassing existing explicit EBMs. EGC demonstrates promising applications in image inpainting, semantic interpolation, and high-resolution image generation. The paper acknowledges that incorporating stronger data augmentation techniques could further improve the results. Exploring network architectures specifically designed for optimizing the gradient of the energy function could further enhance performance. energy-based model, diffusion model, image generation, image classification, generative learning
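The classifier/generator duality can be sketched as follows: the network's class logits define both the conditional p(y|x_t) (via softmax) and, through their log-sum-exp, an unnormalized log-density whose input gradient serves as the score used for denoising. The backbone signature and the LazyLinear head below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class EnergyClassifier(nn.Module):
    """Forward pass = classifier over noisy images; input gradient = score for denoising."""
    def __init__(self, backbone: nn.Module, num_classes: int):
        super().__init__()
        self.backbone = backbone                      # assumed: backbone(x, t) -> (B, feat)
        self.head = nn.LazyLinear(num_classes)

    def forward(self, x, t):
        with torch.enable_grad():                     # the score needs input gradients, even at sampling time
            x = x.detach().requires_grad_(True)
            logits = self.head(self.backbone(x, t))   # softmax over these logits gives p(y | x_t)
            log_p_x = torch.logsumexp(logits, dim=-1) # unnormalized log p(x_t): joint marginalized over y
            # Score function: gradient of the log-density w.r.t. the input.
            score = torch.autograd.grad(log_p_x.sum(), x, create_graph=self.training)[0]
        return logits, score
```

Training then combines a cross-entropy term on the logits with a denoising-score term on the gradient, which is how a single set of parameters serves both tasks.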
2304.01999 Report Revisiting the Evaluation of Image Synthesis with GANs Mengping Yang, Ceyuan Yang, Yichi Zhang, Qingyan Bai, Yujun Shen, Bo Dai A good metric, which promises a reliable comparison between solutions, is essential for any well-defined task. Unlike most vision tasks that have per-sample ground-truth, image synthesis tasks target generating unseen data and hence are usually evaluated through a distributional distance between one set of real samples and another set of generated samples. This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models. In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set. Extensive experiments conducted on multiple datasets and settings reveal several important findings. Firstly, a group of models that include both CNN-based and ViT-based architectures serve as reliable and robust feature extractors for measurement evaluation. Secondly, Centered Kernel Alignment (CKA) provides a better comparison across various extractors and hierarchical layers in one model. Finally, CKA is more sample-efficient and enjoys better agreement with human judgment in characterizing the similarity between two internal data correlations. These findings contribute to the development of a new measurement system, which enables a consistent and reliable re-evaluation of current state-of-the-art generative models. This paper presents an empirical study investigating evaluation paradigms for generative adversarial networks (GANs) in image synthesis, focusing on the feature extractor and distributional distance. Accurately evaluating the performance of GANs is crucial for assessing progress in image synthesis. Existing metrics like FID have limitations, necessitating a systematic investigation for reliable comparisons. The study explores various feature extractors (CNNs, ViTs, MLPs) and distributional distances (FID, CKA) using techniques like heatmap visualization, histogram matching attacks, and human evaluations. A combination of CNN-based and ViT-based architectures provides reliable and robust feature extraction for evaluating GANs. Centered Kernel Alignment (CKA) offers better comparison across different extractors and hierarchical layers than FID. CKA demonstrates greater sample efficiency and stronger agreement with human judgment in assessing GAN-generated image quality. The study primarily focuses on image-level evaluation, without addressing potential biases from low-level image processing. Future work could explore the impact of image resolution and dataset size on the evaluation results. image synthesis, generative adversarial networks, evaluation metrics, feature extractors, distributional distance
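For reference, linear CKA between two feature sets is a compact computation; this is the standard formulation, with the assumption that the real and generated sets are subsampled to the same size n before comparison.

```python
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """X: (n, d1) features of one sample set, Y: (n, d2) features of the other.
    Returns a similarity in [0, 1]; higher means the two feature correlations agree more."""
    X = X - X.mean(dim=0, keepdim=True)      # center each feature dimension
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic_xy = (Y.T @ X).norm(p="fro") ** 2   # cross-covariance energy
    hsic_xx = (X.T @ X).norm(p="fro") ** 2
    hsic_yy = (Y.T @ Y).norm(p="fro") ** 2
    return hsic_xy / (hsic_xx.sqrt() * hsic_yy.sqrt())
```

Because CKA is normalized, it can compare representations of different dimensionality, which is what allows the comparison across extractors and layers noted above.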
2304.01900 Report PODIA-3D: Domain Adaptation of 3D Generative Model Across Large Domain Gap Using Pose-Preserved Text-to-Image Diffusion Gwanghyun Kim, Ji Ha Jang, Se Young Chun Recently, significant advancements have been made in 3D generative models, however training these models across diverse domains is challenging and requires a huge amount of training data and knowledge of pose distribution. Text-guided domain adaptation methods have allowed the generator to be adapted to the target domains using text prompts, thereby obviating the need for assembling numerous data. Recently, DATID-3D presents impressive quality of samples in text-guided domain, preserving diversity in text by leveraging text-to-image diffusion. However, adapting 3D generators to domains with significant domain gaps from the source domain still remains challenging due to issues in current text-to-image diffusion models as follows: 1) shape-pose trade-off in diffusion-based translation, 2) pose bias, and 3) instance bias in the target domain, resulting in inferior 3D shapes, low text-image correspondence, and low intra-domain diversity in the generated samples. To address these issues, we propose a novel pipeline called PODIA-3D, which uses pose-preserved text-to-image diffusion-based domain adaptation for 3D generative models. We construct a pose-preserved text-to-image diffusion model that allows the use of extremely high-level noise for significant domain changes. We also propose specialized-to-general sampling strategies to improve the details of the generated samples. Moreover, to overcome the instance bias, we introduce a text-guided debiasing method that improves intra-domain diversity. Consequently, our method successfully adapts 3D generators across significant domain gaps. Our qualitative results and user study demonstrate that our approach outperforms existing 3D text-guided domain adaptation methods in terms of text-image correspondence, realism, diversity of rendered images, and sense of depth of 3D shapes in the generated samples. Presents PODIA-3D, a pose-preserved text-to-image diffusion-based domain adaptation method for 3D generative models, enabling adaptation across large domain gaps (e.g., from human faces to animals) with strong text-image correspondence and high-quality 3D shapes. Training 3D generative models on diverse domains is challenging due to the need for vast amounts of training data and pose information. Existing domain adaptation methods struggle to handle significant domain shifts and often lead to low text-image correspondence and poor 3D shapes. 1. Construct pose-preserved text-to-image diffusion models (PPD) by fine-tuning depth-guided diffusion models on data preserving source poses but with target shapes. 2. Propose a specialized-to-general sampling strategy to generate target images, leveraging PPD for structure and pose and general diffusion models for details. 3. Fine-tune 3D generators adversarially on the generated pose-aware target dataset. 4. Introduce text-guided debiasing to improve intra-domain diversity. Achieves superior text-image correspondence and 3D shape quality compared to existing 3D text-guided domain adaptation methods like StyleGANFusion and DATID-3D. Demonstrates successful adaptation of EG3D to various animal and character domains with large domain gaps from the source FFHQ dataset. Shows the effectiveness of the proposed PPD and specialized-to-general sampling in generating high-quality, pose-consistent target images. Domain adaptation to non-living objects (e.g., chairs) with less directional information can result in low text-image correspondence. Reliance on text-to-image diffusion models means inheriting their limitations, such as potential biases or difficulty in generating certain image features. 3d generative models, domain adaptation, text-to-image synthesis, diffusion models, pose preservation
2304.01716 Report Decoupling Dynamic Monocular Videos for Dynamic View Synthesis Meng You, Junhui Hou The challenge of dynamic view synthesis from dynamic monocular videos, i.e., synthesizing novel views for free viewpoints given a monocular video of a dynamic scene captured by a moving camera, mainly lies in accurately modeling the dynamic objects of a scene using limited 2D frames, each with a varying timestamp and viewpoint. Existing methods usually require pre-processed 2D optical flow and depth maps by off-the-shelf methods to supervise the network, making them suffer from the inaccuracy of the pre-processed supervision and the ambiguity when lifting the 2D information to 3D. In this paper, we tackle this challenge in an unsupervised fashion. Specifically, we decouple the motion of the dynamic objects into object motion and camera motion, respectively regularized by proposed unsupervised surface consistency and patch-based multi-view constraints. The former enforces the 3D geometric surfaces of moving objects to be consistent over time, while the latter regularizes their appearances to be consistent across different viewpoints. Such a fine-grained motion formulation can alleviate the learning difficulty for the network, thus enabling it to produce not only novel views with higher quality but also more accurate scene flows and depth than existing methods requiring extra supervision. This paper proposes an unsupervised learning approach for dynamic view synthesis from monocular videos, which eliminates the need for pre-processed depth and optical flow as supervision by introducing two novel unsupervised regularization terms: surface consistency and patch-based multi-view consistency. Existing methods heavily rely on pre-processed 2D optical flow and depth maps, leading to limitations such as performance degradation due to inaccurate pre-processed data, ambiguity in lifting 2D information to 3D, and computational expenses. The method decouples object and camera motion, using a static NeRF for the background and an NSFF for dynamic objects and scene flow. The surface consistency constraint enforces temporal consistency of moving object surfaces, while the patch-based multi-view constraint ensures consistency between rendered novel views and the input view. The unsupervised method outperforms state-of-the-art supervised methods on the NVIDIA Dynamic Scene Dataset, achieving superior PSNR and comparable SSIM and LPIPS. It effectively handles dynamic scenes, particularly those with significant motion, showing substantial improvements over existing methods. The method produces accurate depth and flow maps for novel views in an unsupervised manner, comparable to those generated by supervised methods. The method struggles with non-rigid deformations due to the surface consistency constraint's limitations in handling such cases. The reliance on separate modeling for static and dynamic parts and the use of mask supervision represent areas for future simplification. dynamic view synthesis, neural radiance fields (nerf), unsupervised learning, scene flow estimation, monocular video
2304.01515 Report Text-Conditioned Sampling Framework for Text-to-Image Generation with Masked Generative Models Jaewoong Lee, Sangwon Jang, Jaehyeong Jo, Jaehong Yoon, Yunji Kim, Jin-Hwa Kim, Jung-Woo Ha, Sung Ju Hwang Token-based masked generative models are gaining popularity for their fast inference time with parallel decoding. While recent token-based approaches achieve competitive performance to diffusion-based models, their generation performance is still suboptimal as they sample multiple tokens simultaneously without considering the dependence among them. We empirically investigate this problem and propose a learnable sampling model, Text-Conditioned Token Selection (TCTS), to select optimal tokens via localized supervision with text information. TCTS improves not only the image quality but also the semantic alignment of the generated images with the given texts. To further improve the image quality, we introduce a cohesive sampling strategy, Frequency Adaptive Sampling (FAS), to each group of tokens divided according to the self-attention maps. We validate the efficacy of TCTS combined with FAS with various generative tasks, demonstrating that it significantly outperforms the baselines in image-text alignment and image quality. Our text-conditioned sampling framework further reduces the original inference time by more than 50% without modifying the original generative model. This paper introduces a text-conditioned sampling framework for text-to-image generation with masked generative models, aiming to improve text alignment and image quality. Current token-based diffusion models for text-to-image generation, while fast, struggle with inconsistency in generated images due to simultaneous token sampling, leading to a trade-off between speed and quality. This is particularly problematic for text alignment. The authors propose Text-Conditioned Token Selection (TCTS), a learnable model trained to identify and resample misaligned tokens based on text conditions. They further introduce Frequency Adaptive Sampling (FAS) to address over-simplification in low-frequency image areas by selectively applying persistent sampling based on self-attention maps. Revocable sampling strategies like TCTS improve text alignment compared to fixed methods, mitigating error accumulation. TCTS, especially when combined with FAS, outperforms baselines in text alignment metrics (MID-L) and maintains competitive image quality (FID) on MS-COCO and CUB datasets. The framework facilitates fast local image refinement and mask-free object editing leveraging cross-attention maps. The computational overhead of TCTS, while marginal, could be further optimized. The paper primarily focuses on single-object datasets, and further exploration is needed for more complex multi-object scenes. text-to-image generation, token-based diffusion models, revocable sampling, text alignment, image refinement
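The revocable sampling idea in the entry above (fill all masked tokens, then let a learned, text-conditioned selector re-mask the least aligned ones) can be illustrated with toy components. The generator, selector, vocabulary size, and re-mask schedule below are all assumptions, not the paper's TCTS/FAS models.

```python
# Hedged sketch of revocable token sampling with a learned token selector.
import torch
import torch.nn as nn

VOCAB, MASK_ID, SEQ_LEN = 1024, 1024, 256

class ToyGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB + 1, 64)
        self.head = nn.Linear(64, VOCAB)
    def forward(self, tokens, text_emb):
        h = self.emb(tokens) + text_emb.unsqueeze(1)
        return self.head(h)                       # (B, L, VOCAB) logits

class ToySelector(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB + 1, 64)
        self.score = nn.Linear(64, 1)
    def forward(self, tokens, text_emb):
        h = self.emb(tokens) + text_emb.unsqueeze(1)
        return self.score(h).squeeze(-1)          # (B, L) text-alignment scores

@torch.no_grad()
def revocable_decode(gen, sel, text_emb, steps=8):
    tokens = torch.full((1, SEQ_LEN), MASK_ID)
    for step in range(steps):
        logits = gen(tokens, text_emb)
        sampled = torch.distributions.Categorical(logits=logits).sample()
        tokens = torch.where(tokens == MASK_ID, sampled, tokens)   # fill all masks
        if step < steps - 1:
            # Revocable step: re-mask the least text-aligned tokens for resampling.
            n_remask = int(SEQ_LEN * (1 - (step + 1) / steps))
            scores = sel(tokens, text_emb)
            _, worst = scores.topk(n_remask, largest=False)
            tokens.scatter_(1, worst, MASK_ID)
    return tokens

out = revocable_decode(ToyGenerator(), ToySelector(), torch.randn(1, 64))
print(out.shape)
```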
2304.01489 Report Improved Visual Fine-tuning with Natural Language Supervision Junyang Wang, Yuanhong Xu, Juhua Hu, Ming Yan, Jitao Sang, Qi Qian Fine-tuning a visual pre-trained model can leverage the semantic information from large-scale pre-training data and mitigate the over-fitting problem on downstream vision tasks with limited training examples. While the problem of catastrophic forgetting in pre-trained backbone has been extensively studied for fine-tuning, its potential bias from the corresponding pre-training task and data, attracts less attention. In this work, we investigate this problem by demonstrating that the obtained classifier after fine-tuning will be close to that induced by the pre-trained model. To reduce the bias in the classifier effectively, we introduce a reference distribution obtained from a fixed text classifier, which can help regularize the learned vision classifier. The proposed method, Text Supervised fine-tuning (TeS), is evaluated with diverse pre-trained vision models including ResNet and ViT, and text encoders including BERT and CLIP, on 11 downstream tasks. The consistent improvement with a clear margin over distinct scenarios confirms the effectiveness of our proposal. Code is available at \url{https://github.com/idstcv/TeS}. This paper proposes TeS, a method using text supervision from a fixed text encoder to improve fine-tuning of pre-trained vision models for image classification, reducing bias without catastrophic forgetting. Fine-tuning pre-trained vision models can suffer from bias due to the pre-training task and data, while tackling this issue often leads to catastrophic forgetting. Text supervision offers a readily available source of information to address this challenge. TeS introduces a reference distribution from a fixed text classifier (using class names). It minimizes KL-divergence between class-level distributions from vision and text encoders and further introduces instance-level regularization by approximating text representations for each image. TeS consistently outperforms conventional fine-tuning and label smoothing methods across various pre-trained vision models (ResNet, ViT), text encoders (BERT, CLIP), and datasets. Text encoders pre-trained with visual data (CLIP) show superior performance in supervising visual fine-tuning compared to pure language models (BERT). TeS shows significant improvements in few-shot learning scenarios and for datasets with long-tailed distributions. Current method requires the exact class names, limiting its applicability in scenarios with restricted access to such information. Exploring the combination of TeS with state-of-the-art methods specifically designed for class imbalance learning. fine-tuning, text supervision, vision-language pre-training, image classification, bias reduction
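The class-level regularization described above admits a compact sketch: a frozen text classifier built from class-name embeddings provides a reference distribution, and a KL term pulls the fine-tuned vision classifier toward it. The form below is an assumption (the paper also adds an instance-level term not shown here), and the weights, temperature, and shapes are illustrative.

```python
# Hedged sketch: cross-entropy plus KL to a fixed text-classifier distribution.
import torch
import torch.nn.functional as F

def text_supervised_loss(img_feat, labels, vision_head, text_class_emb, tau=0.07, lam=0.5):
    """img_feat: (B, D) visual features; text_class_emb: (C, D) frozen class-name embeddings."""
    logits_v = vision_head(img_feat)                               # learned vision classifier
    # Reference distribution from the frozen text classifier (cosine similarity / tau).
    sim_t = F.normalize(img_feat, dim=-1) @ F.normalize(text_class_emb, dim=-1).T
    p_text = F.softmax(sim_t / tau, dim=-1)
    ce = F.cross_entropy(logits_v, labels)
    kl = F.kl_div(F.log_softmax(logits_v, dim=-1), p_text.detach(), reduction="batchmean")
    return ce + lam * kl

B, C, D = 8, 10, 512
head = torch.nn.Linear(D, C)
loss = text_supervised_loss(torch.randn(B, D), torch.randint(0, C, (B,)), head, torch.randn(C, D))
loss.backward()
print(float(loss))
```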
2304.01436 Report Learning Personalized High Quality Volumetric Head Avatars from Monocular RGB Videos Ziqian Bai, Feitong Tan, Zeng Huang, Kripasindhu Sarkar, Danhang Tang, Di Qiu, Abhimitra Meka, Ruofei Du, Mingsong Dou, Sergio Orts-Escolano, Rohit Pandey, Ping Tan, Thabo Beeler, Sean Fanello, Yinda Zhang We propose a method to learn a high-quality implicit 3D head avatar from a monocular RGB video captured in the wild. The learnt avatar is driven by a parametric face model to achieve user-controlled facial expressions and head poses. Our hybrid pipeline combines the geometry prior and dynamic tracking of a 3DMM with a neural radiance field to achieve fine-grained control and photorealism. To reduce over-smoothing and improve out-of-model expressions synthesis, we propose to predict local features anchored on the 3DMM geometry. These learnt features are driven by 3DMM deformation and interpolated in 3D space to yield the volumetric radiance at a designated query point. We further show that using a Convolutional Neural Network in the UV space is critical in incorporating spatial context and producing representative local features. Extensive experiments show that we are able to reconstruct high-quality avatars, with more accurate expression-dependent details, good generalization to out-of-training expressions, and quantitatively superior renderings compared to other state-of-the-art approaches. This paper presents a method to create high-quality, controllable 3D head avatars from monocular RGB videos, leveraging a 3DMM-anchored neural radiance field. Creating realistic and controllable avatars from accessible data like monocular videos is crucial for AR/VR, gaming, and visual effects. The method combines a 3DMM for facial tracking with a neural radiance field for detailed rendering. It uses a CNN in UV space to predict expression-dependent features attached to 3DMM vertices, enhancing detail and generalization to unseen expressions. Reconstructs high-quality avatars from short monocular videos. Captures fine-grained details and accurate articulations, outperforming previous methods in terms of visual quality. Demonstrates good generalization to out-of-training expressions and novel view synthesis. Training is subject-specific and time-consuming, similar to other NeRF-based methods. Limited ability to model components completely missing in the 3DMM, such as the tongue. head avatar, neural radiance field, 3d morphable model, monocular reconstruction, expression transfer
2304.01200 Report Video Instance Segmentation in an Open-World Omkar Thawakar, Sanath Narayan, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Jorma Laaksonen, Mubarak Shah, Fahad Shahbaz Khan Existing video instance segmentation (VIS) approaches generally follow a closed-world assumption, where only seen category instances are identified and spatio-temporally segmented at inference. Open-world formulation relaxes the closed-world static-learning assumption as follows: (a) first, it distinguishes a set of known categories as well as labels an unknown object as `unknown' and then (b) it incrementally learns the class of an unknown as and when the corresponding semantic labels become available. We propose the first open-world VIS approach, named OW-VISFormer, that introduces a novel feature enrichment mechanism and a spatio-temporal objectness (STO) module. The feature enrichment mechanism based on a light-weight auxiliary network aims at accurate pixel-level (unknown) object delineation from the background as well as distinguishing category-specific known semantic classes. The STO module strives to generate instance-level pseudo-labels by enhancing the foreground activations through a contrastive loss. Moreover, we also introduce an extensive experimental protocol to measure the characteristics of OW-VIS. Our OW-VISFormer performs favorably against a solid baseline in the OW-VIS setting. Further, we evaluate our contributions in the standard fully-supervised VIS setting by integrating them into the recent SeqFormer, achieving an absolute gain of 1.6\% AP on Youtube-VIS 2019 val. set. Lastly, we show the generalizability of our contributions for the open-world detection (OWOD) setting, outperforming the best existing OWOD method in the literature. Code, models, and OW-VIS splits are available at \url{https://github.com/OmkarThawakar/OWVISFormer}. This paper introduces OW-VISFormer, the first approach for open-world video instance segmentation (OW-VIS). OW-VISFormer employs a novel feature enrichment mechanism and a spatio-temporal objectness module to identify and segment both known and unknown object instances in videos, allowing for incremental learning of new object categories. Existing VIS approaches operate under a closed-world assumption, limiting their ability to handle novel object categories. OW-VIS addresses this by enabling the model to identify unknown objects and incrementally learn their categories as annotations become available, crucial for real-world applications where new objects are constantly encountered. OW-VISFormer leverages a light-weight auxiliary network (ScratchNet) to generate shallow features that complement standard pre-trained features, improving pixel-level object delineation. A spatio-temporal objectness (STO) module with a contrastive loss enhances foreground activations, facilitating the identification of candidate unknown objects and improving mask prediction for both known and unknown instances. OW-VISFormer consistently outperforms the baseline on various OW-VIS splits, demonstrating its effectiveness in segmenting both known and unknown objects. Integrating the proposed feature enrichment and STO module into SeqFormer, a fully-supervised VIS method, yields a 1.6% absolute gain in AP on the YouTube-VIS 2019 val. set. The proposed approach generalizes well to open-world object detection (OWOD), surpassing the state-of-the-art OW-DETR method on the MS COCO OWOD split. The current OW-VISFormer framework focuses on single-stage instance segmentation. Exploring its integration with two-stage VIS methods could be beneficial. The impact of different memory replay strategies for incremental learning in OW-VIS warrants further investigation. video instance segmentation, open-world learning, incremental learning, object detection, computer vision
2304.01198 Report Open-Vocabulary Semantic Segmentation with Decoupled One-Pass Network Cong Han, Yujie Zhong, Dengjie Li, Kai Han, Lin Ma Recently, the open-vocabulary semantic segmentation problem has attracted increasing attention and the best performing methods are based on two-stream networks: one stream for proposal mask generation and the other for segment classification using a pretrained visual-language model. However, existing two-stream methods require passing a great number of (up to a hundred) image crops into the visual-language model, which is highly inefficient. To address the problem, we propose a network that only needs a single pass through the visual-language model for each input image. Specifically, we first propose a novel network adaptation approach, termed patch severance, to restrict the harmful interference between the patch embeddings in the pre-trained visual encoder. We then propose classification anchor learning to encourage the network to spatially focus on more discriminative features for classification. Extensive experiments demonstrate that the proposed method achieves outstanding performance, surpassing state-of-the-art methods while being 4 to 7 times faster at inference. Code: https://github.com/CongHan0808/DeOP.git This paper proposes Decoupled One-Pass Network (DeOP) for open-vocabulary semantic segmentation, which maintains the zero-shot ability of VLMs while being computationally efficient. Existing two-stream methods for open-vocabulary semantic segmentation are computationally expensive, as they require passing many image crops through the visual-language model. DeOP uses a decoupled, two-stream architecture with a class-agnostic mask proposal network and a mask classification network based on a frozen CLIP visual encoder. It introduces two novel components: Generalized Patch Severance (GPS) to reduce interference between patch embeddings and Classification Anchor Learning (CAL) to identify discriminative features for classification. DeOP achieves state-of-the-art performance on COCO-Stuff and Pascal VOC in both intra- and cross-dataset evaluations. DeOP is 4 to 7 times faster than multi-pass methods at inference. GPS and CAL significantly contribute to the improvement of segmentation performance. Applying GPS to shallower layers of the visual encoder can hurt performance. Exploring more sophisticated CAL modules could further enhance performance. semantic segmentation, open-vocabulary learning, zero-shot learning, vision-language models, clip
2304.01197 Report Bringing Telepresence to Every Desk Shengze Wang, Ziheng Wang, Ryan Schmelzle, Liujie Zheng, YoungJoong Kwon, Soumyadip Sengupta, Henry Fuchs In this paper, we work to bring telepresence to every desktop. Unlike commercial systems, personal 3D video conferencing systems must render high-quality videos while remaining financially and computationally viable for the average consumer. To this end, we introduce a capturing and rendering system that only requires 4 consumer-grade RGBD cameras and synthesizes high-quality free-viewpoint videos of users as well as their environments. Experimental results show that our system renders high-quality free-viewpoint videos without using object templates or heavy pre-processing. While not real-time, our system is fast and does not require per-video optimizations. Moreover, our system is robust to complex hand gestures and clothing, and it can generalize to new users. This work provides a strong basis for further optimization, and it will help bring telepresence to every desk in the near future. The code and dataset will be made available on our website https://mcmvmc.github.io/PersonalTelepresence/. This paper presents a novel view synthesis system for personal 3D telepresence using only 4 consumer-grade RGBD cameras, offering a cost-effective way to synthesize high-quality free-viewpoint videos. This system addresses the limitations of current commercial 3D telepresence solutions, which are often expensive and require dedicated physical spaces, making them inaccessible to the average user. The system utilizes a novel volumetric representation called Multi-layer Point Cloud (MPC) to address depth biases in RGBD cameras, improving reconstruction accuracy, especially for slanted surfaces. It also incorporates a temporal renderer for temporal smoothing and Spatial Skip Connections for high-resolution rendering under limited GPU memory. The system produces high-quality free-viewpoint videos of users and their environments, outperforming baseline methods in terms of accuracy and stability. It accurately reconstructs challenging details like hand gestures and fast body movements without relying on object templates or heavy pre-processing. The system demonstrates generalizability to new users and environments, highlighting its potential for wider adoption. The system is not yet real-time, with cost volume construction being a bottleneck. The current system does not support immersive display technologies like autostereo displays. novel view synthesis, 3d telepresence, rgbd cameras, volumetric rendering, personal telepresence
2304.01186 Report Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Siran Chen, Ying Shan, Xiu Li, Qifeng Chen Generating text-editable and pose-controllable character videos is in high demand for creating various digital humans. Nevertheless, this task has been restricted by the absence of a comprehensive dataset featuring paired video-pose captions and of generative prior models for videos. In this work, we design a novel two-stage training scheme that can utilize easily obtained datasets (i.e., image-pose pairs and pose-free videos) and the pre-trained text-to-image (T2I) model to obtain pose-controllable character videos. Specifically, in the first stage, only keypoint-image pairs are used for controllable text-to-image generation. We learn a zero-initialized convolutional encoder to encode the pose information. In the second stage, we fine-tune the motion of the above network via a pose-free video dataset by adding the learnable temporal self-attention and reformed cross-frame self-attention blocks. Powered by our new designs, our method successfully generates continuously pose-controllable character videos while keeping the editing and concept composition ability of the pre-trained T2I model. The code and models will be made publicly available. This paper presents a novel two-stage training scheme for generating text-editable and pose-controllable character videos, leveraging pre-trained text-to-image models and easily obtainable datasets. Generating such videos is crucial for various digital human applications but limited by the lack of paired video-pose captions and effective video generative prior models. The method uses a pose encoder to incorporate pose information into a pre-trained text-to-image model (Stage 1), followed by fine-tuning on a pose-free video dataset to ensure temporal consistency (Stage 2). The approach successfully generates high-quality character videos controllable by both text prompts and pose sequences. It inherits robust concept generation and composition capabilities from pre-trained T2I models. The method outperforms existing techniques in terms of generation quality, text-video alignment, pose-video alignment, and temporal coherence. The model's performance on complex scenes with multiple interacting characters needs further investigation. Future work may explore incorporating additional control signals, such as depth or style, for more versatile video generation. text-to-video generation, pose control, character animation, diffusion models, deep learning
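The zero-initialized convolutional pose encoder mentioned above has a simple core idea: the final projection starts at zero, so injecting pose features initially leaves the pretrained text-to-image UNet unchanged, and training gradually opens up the pose pathway. The layer sizes, injection point, and residual addition in the sketch below are assumptions for illustration, not the paper's exact architecture.

```python
# Hedged sketch of a zero-initialized pose encoder feeding a UNet feature map.
import torch
import torch.nn as nn

def zero_module(m: nn.Module) -> nn.Module:
    for p in m.parameters():
        nn.init.zeros_(p)
    return m

class PoseEncoder(nn.Module):
    def __init__(self, in_ch=3, feat_ch=320):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, feat_ch, 3, stride=2, padding=1), nn.SiLU(),
        )
        self.out = zero_module(nn.Conv2d(feat_ch, feat_ch, 1))  # zero-initialized projection

    def forward(self, pose_map, unet_feat):
        # pose_map: rendered keypoint image; unet_feat: a UNet feature map of matching size.
        return unet_feat + self.out(self.body(pose_map))

enc = PoseEncoder()
feat = torch.randn(1, 320, 64, 64)
out = enc(torch.randn(1, 3, 512, 512), feat)
print(torch.allclose(out, feat))  # True at initialization: the pose branch starts as a no-op
```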
2304.01184 Report WeakTr: Exploring Plain Vision Transformer for Weakly-supervised Semantic Segmentation Lianghui Zhu, Yingyue Li, Jiemin Fang, Yan Liu, Hao Xin, Wenyu Liu, Xinggang Wang This paper explores the properties of the plain Vision Transformer (ViT) for Weakly-supervised Semantic Segmentation (WSSS). The class activation map (CAM) is of critical importance for understanding a classification network and launching WSSS. We observe that different attention heads of ViT focus on different image areas. Thus a novel weight-based method is proposed to end-to-end estimate the importance of attention heads, while the self-attention maps are adaptively fused for high-quality CAM results that tend to have more complete objects. Besides, we propose a ViT-based gradient clipping decoder for online retraining with the CAM results to complete the WSSS task. We name this plain Transformer-based Weakly-supervised learning framework WeakTr. It achieves the state-of-the-art WSSS performance on standard benchmarks, i.e., 78.4% mIoU on the val set of PASCAL VOC 2012 and 50.3% mIoU on the val set of COCO 2014. Code is available at https://github.com/hustvl/WeakTr. This paper proposes WeakTr, a novel weakly supervised semantic segmentation framework using a plain Vision Transformer (ViT), introducing a weight-based method for fusing attention heads to generate high-quality class activation maps. Weakly supervised semantic segmentation typically relies on class activation maps (CAMs) generated from classification networks, but traditional methods for CAM generation have limitations. This paper addresses those limitations by leveraging the multi-head attention mechanism of ViTs. The authors introduce a weight-based method to estimate the importance of different attention heads in ViT for adaptive fusion, leading to improved CAM quality. They also propose a ViT-based gradient clipping decoder for online retraining using the generated CAMs to achieve semantic segmentation. WeakTr achieves state-of-the-art WSSS performance, reaching 78.4% mIoU on the PASCAL VOC 2012 val set and 50.3% mIoU on the COCO 2014 val set. The proposed weight-based attention head fusion method generates higher-quality CAMs compared to traditional mean-sum approaches. The ViT-based gradient clipping decoder effectively leverages the generated CAMs for improved semantic segmentation. The computational cost of WeakTr is not addressed, which is crucial for real-world applications. The paper focuses on single-label WSSS; exploring its effectiveness in multi-label settings would be beneficial. Future work could explore the impact of different ViT architectures and pretraining strategies on WeakTr's performance. weakly supervised semantic segmentation, vision transformer, class activation map, attention mechanism, computer vision
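The weight-based fusion of attention heads described above can be read, in simplified form, as one learnable importance weight per head, normalized across all heads and used to fuse the class-to-patch attention maps that then refine a coarse CAM. The parameterization, shapes, and refinement step below are an illustrative reading, not WeakTr's exact formulation.

```python
# Hedged sketch: learnable per-head weights fuse ViT attention maps to refine a CAM.
import torch
import torch.nn as nn

class HeadWeightedCAM(nn.Module):
    def __init__(self, num_layers=12, num_heads=6):
        super().__init__()
        self.head_logits = nn.Parameter(torch.zeros(num_layers * num_heads))

    def forward(self, attn_maps, coarse_cam):
        # attn_maps: (layers, heads, N, N) self-attention over CLS + patch tokens
        # coarse_cam: (num_classes, N-1) class scores per patch token
        L, H, N, _ = attn_maps.shape
        w = torch.softmax(self.head_logits, dim=0).view(L, H, 1, 1)   # head importance
        fused = (w * attn_maps).sum(dim=(0, 1))                        # (N, N) fused affinity
        patch_affinity = fused[1:, 1:]                                 # drop CLS row/column
        refined = coarse_cam @ patch_affinity                          # propagate CAM along affinities
        return refined / (refined.max(dim=-1, keepdim=True).values + 1e-6)

cam_module = HeadWeightedCAM()
attn = torch.rand(12, 6, 197, 197)   # e.g. ViT-S/16 on 224x224: 196 patches + CLS
coarse = torch.rand(20, 196)         # 20 classes
print(cam_module(attn, coarse).shape)   # (20, 196)
```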
2304.01172 Report Generative Multiplane Neural Radiance for 3D-Aware Image Generation Amandeep Kumar, Ankan Kumar Bhunia, Sanath Narayan, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan We present a method to efficiently generate 3D-aware high-resolution images that are view-consistent across multiple target views. The proposed multiplane neural radiance model, named GMNR, consists of a novel {\alpha}-guided view-dependent representation ({\alpha}-VdR) module for learning view-dependent information. The {\alpha}-VdR module, facilitated by an {\alpha}-guided pixel sampling technique, computes the view-dependent representation efficiently by learning viewing direction and position coefficients. Moreover, we propose a view-consistency loss to enforce photometric similarity across multiple views. The GMNR model can generate 3D-aware high-resolution images that are view-consistent across multiple camera poses, while maintaining the computational efficiency in terms of both training and inference time. Experiments on three datasets demonstrate the effectiveness of the proposed modules, leading to favorable results in terms of both generation quality and inference time, compared to existing approaches. Our GMNR model generates 3D-aware images of 1024 x 1024 pixels at 17.6 FPS on a single V100. Code: https://github.com/VIROBO-15/GMNR This paper proposes Generative Multiplane Neural Radiance (GMNR), an efficient approach for synthesizing 3D-aware and view-consistent high-resolution images across different camera poses. Generating 3D-aware images that maintain consistency across views is challenging due to the lack of 3D geometry supervision and the need for high-resolution outputs at extrapolated views. GMNR leverages multiplane images and introduces an α-guided view-dependent representation (α-VdR) module. This module learns view-dependent information by efficiently sampling pixels using an α-guided technique and computing view-dependent pixel representations. It also incorporates a view-consistency loss to enforce photometric similarity across multiple rendered views. GMNR outperforms the baseline GMPI method in terms of FID, KID, identity consistency, depth accuracy, and pose accuracy on FFHQ and AFHQv2-Cats datasets. The α-VdR module significantly improves image quality at extrapolated views by learning view-dependent information. GMNR achieves comparable or better performance than state-of-the-art methods like EG3D and StyleNeRF while maintaining high inference speed (17.6 FPS for 1024x1024 images on a single V100). The maximum sampling rate within the α-VdR module is limited by training batch size. Future work can explore extending GMNR to handle more complex scenes and object categories. 3d-aware image generation, view-consistency, multiplane images, generative adversarial networks, view-dependent representation
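The α-guided pixel sampling idea above can be illustrated in a few lines: pixels are drawn with probability proportional to their alpha values, and only those pixels receive a view-dependent residual predicted from position and viewing direction, which keeps the extra cost low. The MLP, sample count, and residual formulation below are assumed for illustration; they are not GMNR's actual α-VdR module.

```python
# Hedged sketch: alpha-guided pixel sampling for a view-dependent residual branch.
import torch
import torch.nn as nn

class ViewDependentResidual(nn.Module):
    def __init__(self, feat_dim=32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim + 2 + 3, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, feat_map, alpha_map, view_dir, n_samples=2048):
        # feat_map: (C, H, W); alpha_map: (H, W) in [0, 1]; view_dir: (3,)
        C, H, W = feat_map.shape
        probs = alpha_map.flatten().clamp_min(1e-6)
        idx = torch.multinomial(probs / probs.sum(), n_samples, replacement=False)
        ys = torch.div(idx, W, rounding_mode="floor")
        xs = idx % W
        pos = torch.stack([ys / (H - 1), xs / (W - 1)], dim=-1)          # normalized positions
        feats = feat_map[:, ys, xs].T                                     # (n, C) sampled features
        dirs = view_dir.expand(n_samples, 3)
        residual = self.mlp(torch.cat([feats, pos, dirs], dim=-1))        # (n, 3) RGB residual
        return (ys, xs), residual                                         # added back at sampled pixels

module = ViewDependentResidual()
coords, res = module(torch.randn(32, 256, 256), torch.rand(256, 256), torch.tensor([0.0, 0.0, 1.0]))
print(res.shape)
```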
2304.01114 Report Associating Spatially-Consistent Grouping with Text-supervised Semantic Segmentation Yabo Zhang, Zihao Wang, Jun Hao Liew, Jingjia Huang, Manyu Zhu, Jiashi Feng, Wangmeng Zuo In this work, we investigate performing semantic segmentation solely through training on image-sentence pairs. Due to the lack of dense annotations, existing text-supervised methods can only learn to group an image into semantic regions via pixel-insensitive feedback. As a result, their grouped results are coarse and often contain small spurious regions, limiting the upper-bound performance of segmentation. On the other hand, we observe that grouped results from self-supervised models are more semantically consistent and break the bottleneck of existing methods. Motivated by this, we propose to associate self-supervised spatially-consistent grouping with text-supervised semantic segmentation. Considering the part-like grouped results, we further adapt a text-supervised model from image-level to region-level recognition with two core designs. First, we encourage fine-grained alignment with a one-way noun-to-region contrastive loss, which reduces mismatched noun-region pairs. Second, we adopt a contextually aware masking strategy to enable simultaneous recognition of all grouped regions. Coupled with spatially-consistent grouping and region-adapted recognition, our method achieves 59.2% mIoU and 32.4% mIoU on Pascal VOC and Pascal Context benchmarks, significantly surpassing the state-of-the-art methods. This paper proposes a novel text-supervised semantic segmentation method that leverages spatially-consistent grouping from self-supervised vision models to improve segmentation performance. Existing text-supervised methods struggle to produce spatially consistent segmentation results due to relying solely on pixel-insensitive image-sentence matching losses. This limits their upper-bound performance as incorrectly grouped pixels are difficult to separate during recognition. The proposed method utilizes self-supervised features for consistent region grouping and adapts a text-supervised model (CLIP) for region-level recognition with two key designs: 1) a context-aware masking strategy for efficient and effective encoding of grouped regions, and 2) a one-way noun-region contrastive loss to encourage fine-grained alignment while minimizing mismatched noun-region pairs. The method achieves state-of-the-art performance on Pascal VOC, Pascal Context, and COCO benchmarks for text-supervised semantic segmentation. Qualitative results demonstrate higher quality segmentation masks with fewer spurious regions and more accurate boundaries compared to previous methods. Ablation studies confirm the effectiveness of the proposed masking strategy, fine-tuning approach, and one-way noun-region alignment loss. The method relies on an external self-supervised model for grouping, which adds complexity. The one-way alignment, while effective, might overlook potential matches where regions are described indirectly in the paired sentence. semantic segmentation, text supervision, self-supervised learning, vision-language models, clip
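One plausible reading of the one-way noun-to-region contrastive loss above is sketched below: each noun embedding is pulled toward its best-matching region in the paired image, with regions from other images in the batch as negatives, and no symmetric region-to-noun term is applied, so regions that no noun describes stay unconstrained. The positive-selection rule and shapes are assumptions, not the paper's exact loss.

```python
# Hedged sketch of a one-way noun-to-region contrastive loss.
import torch
import torch.nn.functional as F

def noun_to_region_loss(noun_emb, region_emb, region_image_ids, noun_image_ids, tau=0.07):
    """noun_emb: (N, D); region_emb: (R, D); *_image_ids: which image each item comes from."""
    noun_emb = F.normalize(noun_emb, dim=-1)
    region_emb = F.normalize(region_emb, dim=-1)
    sim = noun_emb @ region_emb.T / tau                                  # (N, R)
    same_image = noun_image_ids[:, None] == region_image_ids[None, :]
    # Positive region for each noun: its most similar region within the same image.
    pos_idx = sim.masked_fill(~same_image, float("-inf")).argmax(dim=-1)
    # Exclude other same-image regions from the negatives (the caption may also describe them).
    logits = sim.masked_fill(same_image, float("-inf"))
    logits.scatter_(1, pos_idx[:, None], sim.gather(1, pos_idx[:, None]))
    return F.cross_entropy(logits, pos_idx)

nouns = torch.randn(6, 256)
regions = torch.randn(12, 256)
loss = noun_to_region_loss(nouns, regions, torch.arange(12) % 4, torch.arange(6) % 4)
print(float(loss))
```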
2304.00964 Report Robust Text-driven Image Editing Method that Adaptively Explores Directions in Latent Spaces of StyleGAN and CLIP Tsuyoshi Baba, Kosuke Nishida, Kyosuke Nishida Automatic image editing is in great demand because of its numerous applications, and the use of natural language instructions is essential to achieving flexible and intuitive editing as the user imagines. A pioneering work in text-driven image editing, StyleCLIP, finds an edit direction in the CLIP space and then edits the image by mapping the direction to the StyleGAN space. At the same time, it is difficult to tune appropriate inputs other than the original image and the text instruction for image editing. In this study, we propose a method to construct the edit direction adaptively in the StyleGAN and CLIP spaces with an SVM. Our model represents the edit direction as a normal vector in the CLIP space obtained by training an SVM to classify positive and negative images. The images are retrieved from a large-scale image corpus, originally used for pre-training StyleGAN, according to the CLIP similarity between the images and the text instruction. We confirmed that our model performs as well as the StyleCLIP baseline while allowing simpler inputs and without increasing computation time. This paper introduces StyleCLIP-FEU, a text-driven image editing method that eliminates the need for a neutral text description required by StyleCLIP. This addresses the limitations of StyleCLIP, which requires users to manually provide a neutral text description of the original image, making it less user-friendly and intuitive. StyleCLIP-FEU leverages a Support Vector Machine (SVM) to adaptively construct the editing direction in the latent spaces of StyleGAN and CLIP. This involves retrieving positive and negative images from a large corpus based on CLIP similarity to the instruction, and training an SVM to find a hyperplane separating them. StyleCLIP-FEU achieves comparable editing performance to StyleCLIP without requiring neutral text input. The method is robust to changes in hyperparameters controlling editing strength, unlike StyleCLIP. Subjective evaluation shows that StyleCLIP-FEU outperforms StyleCLIP in both accuracy of edits and naturalness of generated images. Adaptive selection of the sparsity hyperparameter for finer control over editing needs further exploration. The method currently focuses on facial images and could be extended to other domains. image editing, text-guided synthesis, stylegan, clip, support vector machine
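The SVM-based edit direction above has a compact core: split corpus images into positives and negatives by their CLIP similarity to the instruction, fit a linear SVM, and take its (normalized) weight vector as the edit direction in CLIP space. In the sketch below, random vectors stand in for real CLIP image embeddings, the top-k retrieval rule is an assumption, and the subsequent mapping of the direction into the StyleGAN latent space is not shown.

```python
# Hedged sketch: an edit direction as the normal vector of a linear SVM in CLIP space.
import numpy as np
from sklearn.svm import LinearSVC

def edit_direction_from_embeddings(image_embs, text_emb, top_k=256):
    """image_embs: (N, D) CLIP image embeddings; text_emb: (D,) embedding of the instruction."""
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb)
    sims = image_embs @ text_emb
    order = np.argsort(sims)
    pos, neg = image_embs[order[-top_k:]], image_embs[order[:top_k]]   # most / least similar
    X = np.concatenate([pos, neg], axis=0)
    y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    svm = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
    direction = svm.coef_[0]
    return direction / np.linalg.norm(direction)   # unit edit direction in CLIP space

rng = np.random.default_rng(0)
d = edit_direction_from_embeddings(rng.normal(size=(2048, 512)), rng.normal(size=512))
print(d.shape)
```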
2304.00962 Report RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding Jihan Yang, Runyu Ding, Weipeng Deng, Zhe Wang, Xiaojuan Qi We propose a lightweight and scalable Regional Point-Language Contrastive learning framework, namely \textbf{RegionPLC}, for open-world 3D scene understanding, aiming to identify and recognize open-set objects and categories. Specifically, based on our empirical studies, we introduce a 3D-aware SFusion strategy that fuses 3D vision-language pairs derived from multiple 2D foundation models, yielding high-quality, dense region-level language descriptions without human 3D annotations. Subsequently, we devise a region-aware point-discriminative contrastive learning objective to enable robust and effective 3D learning from dense regional language supervision. We carry out extensive experiments on ScanNet, ScanNet200, and nuScenes datasets, and our model outperforms prior 3D open-world scene understanding approaches by an average of 17.2\% and 9.1\% for semantic and instance segmentation, respectively, while maintaining greater scalability and lower resource demands. Furthermore, our method has the flexibility to be effortlessly integrated with language models to enable open-ended grounded 3D reasoning without extra task-specific training. Code is available at https://github.com/CVMI-Lab/PLA. This paper introduces RegionPLC, a novel regional point-language contrastive learning framework for open-world 3D scene understanding, enabling recognition and localization of unseen object categories. Open-world 3D scene understanding is crucial for real-world applications but challenging due to the scarcity of dense 3D semantic annotations. Existing methods suffer from limitations such as constrained vocabulary space and high resource requirements. RegionPLC leverages diverse 2D foundation models to generate dense region-level 3D-language pairs using a novel supplementary-oriented fusion strategy (SFusion). It then employs a region-aware point-discriminative contrastive learning objective to train a 3D backbone for open-world understanding. RegionPLC significantly outperforms prior open-world methods, achieving an average of 17.2% gains in unseen category mIoU for semantic segmentation and 9.1% in mAP50 for instance segmentation. It demonstrates promising zero-shot segmentation performance, achieving 40.5% higher foreground mIoU compared to previous state-of-the-art with only language supervision. RegionPLC is also lightweight, requiring only 17% of OpenScene's training cost and 5% of its storage, while being easily integrated with language models for open-ended grounded 3D reasoning. Current integration of 2D image features is straightforward and can be further improved with more advanced strategies. Visual prompts are pre-defined and can benefit from more adaptive techniques. open-world learning, 3d scene understanding, point cloud segmentation, vision-language learning, contrastive learning
2304.00838 Report MetaHead: An Engine to Create Realistic Digital Head Dingyun Zhang, Chenglai Zhong, Yudong Guo, Yang Hong, Juyong Zhang Collecting and labeling training data is an important step for learning-based methods, but the process is time-consuming and prone to bias. For face analysis tasks, although some generative models can be used to generate face data, they can only achieve a subset of generation diversity, reconstruction accuracy, 3D consistency, high-fidelity visual quality, and easy editability. One recent line of related work is graphics-based generation, but it can only render low-realism heads at high computational cost. In this paper, we propose MetaHead, a unified and full-featured controllable digital head engine, which consists of a controllable head radiance field (MetaHead-F) to super-realistically generate or reconstruct view-consistent 3D controllable digital heads and a generic top-down image generation framework LabelHead to generate digital heads consistent with the given customizable feature labels. Experiments validate that our controllable digital head engine achieves state-of-the-art generation visual quality and reconstruction accuracy. Moreover, the generated labeled data can assist real training data and significantly surpass the labeled data generated by graphics-based methods in terms of training effect. MetaHead, a unified and full-featured controllable digital head engine, enabling realistic head reconstruction, control, generation, and label-consistent synthesis. Addresses limitations of existing methods in generation diversity, reconstruction accuracy, 3D consistency, fidelity, and editability, especially in challenging scenarios, and aims to streamline data collection and annotation for face analysis tasks. Combines a controllable head radiance field (MetaHead-F) with a generic top-down image generation framework (LabelHead). MetaHead-F leverages a novel decoder-GAN combination strategy with hierarchical attention for high-quality, disentangled control, while LabelHead uses embedded features for label-consistent generation and bidirectional label estimation. Achieves state-of-the-art generation quality and reconstruction accuracy, outperforming existing methods on standard metrics. Enables precise and decoupled 3D control over identity, expression, texture, illumination, pose, and other customizable features like gaze and hair color. Demonstrates the ability to generate labeled data significantly better than graphics-based methods, improving downstream tasks like landmark estimation and gaze estimation, even with limited real data. Potential misuse of highly realistic generated heads necessitates safeguarding measures. Current implementation focuses on head generation; expanding to full-body generation presents further challenges. digital human, generative model, 3d head, radiance field, data augmentation
2304.00793 Report FinnWoodlands Dataset Juan Lagos, Urho Lempiö, Esa Rahtu While the availability of large and diverse datasets has contributed to significant breakthroughs in autonomous driving and indoor applications, forestry applications are still lagging behind and new forest datasets would most certainly contribute to achieving significant progress in the development of data-driven methods for forest-like scenarios. This paper introduces a forest dataset called \textit{FinnWoodlands}, which consists of RGB stereo images, point clouds, and sparse depth maps, as well as ground truth manual annotations for semantic, instance, and panoptic segmentation. \textit{FinnWoodlands} comprises a total of 4226 objects manually annotated, out of which 2562 objects (60.6\%) correspond to tree trunks classified into three different instance categories, namely "Spruce Tree", "Birch Tree", and "Pine Tree". Besides tree trunks, we also annotated "Obstacles" objects as instances as well as the semantic stuff classes "Lake", "Ground", and "Track". Our dataset can be used in forestry applications where a holistic representation of the environment is relevant. We provide an initial benchmark using three models for instance segmentation, panoptic segmentation, and depth completion, and illustrate the challenges that such unstructured scenarios introduce. Introduces *FinnWoodlands*, a forest dataset with RGB stereo images, point clouds, sparse depth maps, and ground truth annotations for semantic, instance, and panoptic segmentation, aiming to advance holistic scene understanding in forestry applications. Forest datasets are limited compared to urban or indoor datasets, hindering the development of data-driven methods for forestry applications like autonomous navigation and resource management. Collected data from Finnish forests using a backpack-mounted LIDAR and stereo camera setup. Manually annotated 300 frames with semantic, instance, and panoptic segmentation ground truths. Provided initial benchmark results using Mask R-CNN, EfficientPS, and FuseNet for instance, panoptic segmentation, and depth completion respectively. FinnWoodlands contains 4226 manually annotated objects, with tree trunks constituting 60.6% of the dataset. Mask R-CNN and EfficientPS show promising results for object detection but struggle with accurate segmentation, especially in dense forest areas. FuseNet demonstrates good generalization in depth completion but loses fine details of objects like trees. Limited diversity in terms of geographical location and seasonal variation. Extend the dataset with more annotated frames and explore other computer vision tasks relevant to forestry. forestry, dataset, panoptic segmentation, instance segmentation, depth completion
2304.00784 Report Disentangled Pre-training for Image Matting Yanda Li, Zilong Huang, Gang Yu, Ling Chen, Yunchao Wei, Jianbo Jiao Image matting requires high-quality pixel-level human annotations to support the training of a deep model in recent literature. Whereas such annotation is costly and hard to scale, significantly holding back the development of the research. In this work, we make the first attempt towards addressing this problem, by proposing a self-supervised pre-training approach that can leverage infinite numbers of data to boost the matting performance. The pre-training task is designed in a similar manner as image matting, where random trimap and alpha matte are generated to achieve an image disentanglement objective. The pre-trained model is then used as an initialisation of the downstream matting task for fine-tuning. Extensive experimental evaluations show that the proposed approach outperforms both the state-of-the-art matting methods and other alternative self-supervised initialisation approaches by a large margin. We also show the robustness of the proposed approach over different backbone architectures. Our project page is available at https://crystraldo.github.io/dpt_mat/. This paper proposes Disentangled Pre-training (DPT), a self-supervised pre-training approach for image matting to leverage large-scale unlabeled data. High-quality pixel-level annotations for image matting are costly and limit the development of data-driven deep learning methods. DPT simulates the matting process with synthetic data. It generates random trimaps for guidance and alpha mattes as pseudo labels. It then trains an encoder-decoder network to predict alpha mattes from composited images, mimicking the image disentanglement objective of image matting. DPT outperforms state-of-the-art image matting methods on Composition-1k and Distinct-646 datasets. The method shows consistent performance improvements across different network backbones (CNN and Transformer). The pre-trained model effectively learns contour information, as shown by its ability to extract object contours even without fine-tuning. The training data is class-agnostic and may limit its applicability to semantic-related tasks. Future work could explore incorporating semantic information into the pre-training process. image matting, self-supervised learning, pre-training, disentanglement, trimap
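The pre-training task above (predict a pseudo alpha from a composite guided by a random trimap) can be illustrated by synthesizing a (composite, trimap, alpha) triple from two unlabeled images. The soft-blob generator and the trimap thresholds below are illustrative choices of mine, not the paper's exact recipe.

```python
# Hedged sketch: synthesize a pseudo (composite, trimap, alpha) triple for matting pre-training.
import numpy as np

def random_alpha(h, w, sigma=20.0, thresh=0.0, soft=25.0):
    # Smooth random field -> soft pseudo alpha in [0, 1] via low-pass-filtered noise + sigmoid.
    noise = np.random.randn(h, w)
    x = np.fft.fft2(noise)
    yy, xx = np.mgrid[0:h, 0:w]
    dist2 = (np.minimum(yy, h - yy) ** 2 + np.minimum(xx, w - xx) ** 2).astype(float)
    x *= np.exp(-dist2 / (2 * (h / sigma) ** 2))          # keep only low frequencies
    field = np.real(np.fft.ifft2(x))
    field = (field - field.mean()) / (field.std() + 1e-6)
    return 1.0 / (1.0 + np.exp(-soft * (field - thresh)))

def make_sample(fg, bg):
    h, w, _ = fg.shape
    alpha = random_alpha(h, w)
    trimap = np.full((h, w), 0.5)                  # unknown band
    trimap[alpha > 0.95] = 1.0                     # definite foreground
    trimap[alpha < 0.05] = 0.0                     # definite background
    comp = alpha[..., None] * fg + (1 - alpha[..., None]) * bg
    return comp, trimap, alpha                     # network input: (comp, trimap); target: alpha

fg, bg = np.random.rand(128, 128, 3), np.random.rand(128, 128, 3)
comp, trimap, alpha = make_sample(fg, bg)
print(comp.shape, trimap.shape, alpha.shape)
```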
2304.00749 Report Small but Mighty: Enhancing 3D Point Clouds Semantic Segmentation with U-Next Framework Ziyin Zeng, Qingyong Hu, Zhong Xie, Jian Zhou, Yongyang Xu We study the problem of semantic segmentation of large-scale 3D point clouds. In recent years, significant research efforts have been directed toward local feature aggregation, improved loss functions, and sampling strategies, while the fundamental framework of point cloud semantic segmentation has been largely overlooked, with most existing approaches relying on the U-Net architecture by default. In this paper, we propose U-Next, a small but mighty framework designed for point cloud semantic segmentation. The key to this framework is to learn multi-scale hierarchical representations from semantically similar feature maps. Specifically, we build our U-Next by stacking multiple U-Net $L^1$ codecs in a nested and densely arranged manner to minimize the semantic gap, while simultaneously fusing the feature maps across scales to effectively recover the fine-grained details. We also devise a multi-level deep supervision mechanism to further smooth gradient propagation and facilitate network optimization. Extensive experiments conducted on three large-scale benchmarks including S3DIS, Toronto3D, and SensatUrban demonstrate the superiority and effectiveness of the proposed U-Next architecture. Our U-Next architecture shows consistent and visible performance improvements across different tasks and baseline models, indicating its great potential to serve as a general framework for future research. This paper proposes U-Next, a novel architecture for 3D point cloud semantic segmentation, designed to learn multi-scale hierarchical representations from semantically similar feature maps. Existing point cloud segmentation approaches heavily rely on the U-Net architecture, overlooking its limitations in handling information loss during aggressive downsampling and upsampling inherent to 3D point cloud data. U-Next leverages multiple stacked U-Net L1 sub-networks, minimizing semantic gaps between feature maps. It incorporates multi-level deep supervision to facilitate smooth gradient propagation and enhance network optimization. U-Next consistently outperforms U-Net and U-Net++ architectures on benchmarks like S3DIS, Toronto3D, and SensatUrban. The architecture shows improvements across different baseline models (RandLA-Net, PointNet++, BAAF-Net, LACV-Net), highlighting its generalizability. U-Next demonstrates significant performance gains without incurring substantial computational overhead. The optimal level of U-Next requires consideration based on accuracy and computational cost trade-offs. Future work will explore U-Next's applicability across different data modalities and tasks to further assess its potential. 3d point cloud, semantic segmentation, deep learning, u-net, multi-scale feature fusion
2304.00719 Report Multi-Modal Representation Learning with Text-Driven Soft Masks Jaeyoo Park, Bohyung Han We propose a visual-linguistic representation learning approach within a self-supervised learning framework by introducing a new operation, loss, and data augmentation strategy. First, we generate diverse features for the image-text matching (ITM) task via soft-masking the regions in an image, which are most relevant to a certain word in the corresponding caption, instead of completely removing them. Since our framework relies only on image-caption pairs with no fine-grained annotations, we identify the relevant regions to each word by computing the word-conditional visual attention using multi-modal encoder. Second, we encourage the model to focus more on hard but diverse examples by proposing a focal loss for the image-text contrastive learning (ITC) objective, which alleviates the inherent limitations of overfitting and bias issues. Last, we perform multi-modal data augmentations for self-supervised learning via mining various examples by masking texts and rendering distortions on images. We show that the combination of these three innovations is effective for learning a pretrained model, leading to outstanding performance on multiple vision-language downstream tasks. This paper proposes a novel visual-linguistic representation learning framework that utilizes soft feature masking and diverse regularizations to enhance performance in vision-language tasks. Existing vision-language models, pretrained only on image-caption pairs, tend to overfit to discriminative image regions and lack understanding of finer details. This paper addresses this limitation. The proposed method introduces three key components: 1) Text-driven soft feature masking to diversify visual features by suppressing activations at important regions based on word-conditional Grad-CAM. 2) Focal image-text contrastive learning to emphasize hard examples and address overfitting and bias. 3) Multi-modal data augmentation with strong augmentations and binary caption masking to further diversify training samples. The proposed approach achieves state-of-the-art performance among detector-free methods on various vision-language downstream tasks, including image-text retrieval, visual entailment, and visual question answering. Ablation studies demonstrate the effectiveness of each component, with soft masking, focal ITC loss, and multi-modal data augmentation all contributing to performance gains. Qualitative analysis of word-conditional Grad-CAM visualizations highlights the model's ability to capture more accurate and comprehensive object and attribute representations compared to the baseline. The model's tendency to learn biases towards objects and scenes when the most salient parts are heavily masked. Further exploration of optimal masking strategies and their impact on specific downstream tasks. multi-modal representation learning, vision-language pretraining, soft feature masking, focal contrastive learning, data augmentation
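The focal image-text contrastive (ITC) objective above can be sketched as a standard InfoNCE loss whose per-pair terms are re-weighted by (1 - p)^gamma, so well-matched (easy) pairs contribute less and hard pairs dominate. The exact weighting form and temperature are assumptions for illustration.

```python
# Hedged sketch: focal re-weighting of the symmetric image-text contrastive loss.
import torch
import torch.nn.functional as F

def focal_itc_loss(img_emb, txt_emb, tau=0.07, gamma=2.0):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / tau                     # (B, B); diagonal = matched pairs
    targets = torch.arange(logits.size(0), device=logits.device)

    def one_way(l):
        logp = F.log_softmax(l, dim=-1)
        logp_pos = logp.gather(1, targets[:, None]).squeeze(1)   # log-prob of the true pair
        p_pos = logp_pos.exp()
        return (-(1 - p_pos) ** gamma * logp_pos).mean()         # focal re-weighting

    return 0.5 * (one_way(logits) + one_way(logits.T))    # image->text and text->image

loss = focal_itc_loss(torch.randn(16, 256, requires_grad=True), torch.randn(16, 256))
loss.backward()
print(float(loss))
```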
2304.00341 Report JacobiNeRF: NeRF Shaping with Mutual Information Gradients Xiaomeng Xu, Yanchao Yang, Kaichun Mo, Boxiao Pan, Li Yi, Leonidas Guibas We propose a method that trains a neural radiance field (NeRF) to encode not only the appearance of the scene but also semantic correlations between scene points, regions, or entities -- aiming to capture their mutual co-variation patterns. In contrast to the traditional first-order photometric reconstruction objective, our method explicitly regularizes the learning dynamics to align the Jacobians of highly-correlated entities, which proves to maximize the mutual information between them under random scene perturbations. By paying attention to this second-order information, we can shape a NeRF to express semantically meaningful synergies when the network weights are changed by a delta along the gradient of a single entity, region, or even a point. To demonstrate the merit of this mutual information modeling, we leverage the coordinated behavior of scene entities that emerges from our shaping to perform label propagation for semantic and instance segmentation. Our experiments show that a JacobiNeRF is more efficient in propagating annotations among 2D pixels and 3D points compared to NeRFs without mutual information shaping, especially in extremely sparse label regimes -- thus reducing annotation burden. The same machinery can further be used for entity selection or scene modifications. This paper introduces JacobiNeRF, a novel method for shaping Neural Radiance Fields (NeRFs) to encode semantic correlations between scene elements by aligning their Jacobians in the network's tangent space. Standard NeRFs excel at scene appearance and geometry but often lack awareness of semantic relationships, hindering tasks like entity selection, annotation, and editing. JacobiNeRF addresses this by encoding semantic correlations directly into the NeRF representation. The method leverages the equivalence between mutual information and Jacobian cosine similarity. By applying contrastive learning on NeRF gradients, it aligns the tangent space with semantic correlations derived from self-supervised features (e.g., DINO). JacobiNeRF effectively propagates sparse annotations for semantic and instance segmentation, outperforming methods relying solely on first-order information or 2D features. The approach generalizes well to novel views, demonstrating superior performance on distant viewpoints compared to methods overfitting to source views. Beyond segmentation, JacobiNeRF enables tasks like entity selection and consistent scene recoloring by leveraging the encoded semantic correlations. The current implementation relies on self-supervised visual features, which may limit performance compared to incorporating stronger semantic priors. The dense label propagation strategy could benefit from more sophisticated gradient subsampling techniques to improve efficiency and accuracy. neural radiance fields, nerf, semantic segmentation, instance segmentation, mutual information
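A simplified reading of the Jacobian-alignment idea above: for a handful of query points, compute the gradient of the network output with respect to a chosen weight tensor, then apply a contrastive loss that raises the cosine similarity of gradients from semantically matching points and lowers it for mismatched ones. In the sketch below a toy MLP stands in for the NeRF, and the hard-coded partner indices stand in for correlations that the paper derives from self-supervised features.

```python
# Hedged sketch: contrastive shaping on per-point Jacobians (second-order gradients).
import torch
import torch.nn as nn
import torch.nn.functional as F

net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
shaped_weight = net[2].weight          # align Jacobians w.r.t. this weight tensor

def point_jacobian(x):
    out = net(x[None]).sum()
    (g,) = torch.autograd.grad(out, shaped_weight, create_graph=True)
    return g.flatten()

points = torch.randn(6, 3)             # 3 pairs; pair i = (2i, 2i+1) is treated as "same entity"
jacs = torch.stack([point_jacobian(p) for p in points])
sim = F.cosine_similarity(jacs[:, None], jacs[None, :], dim=-1) / 0.1   # (6, 6) similarity / temp
mask = torch.eye(6, dtype=torch.bool)
logits = sim.masked_fill(mask, float("-inf"))                            # drop self-similarity
targets = torch.tensor([1, 0, 3, 2, 5, 4])                               # index of each point's partner
shaping_loss = F.cross_entropy(logits, targets)
shaping_loss.backward()                # gradients flow back into the network weights
print(float(shaping_loss))
```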
2304.00334 Report TalkCLIP: Talking Head Generation with Text-Guided Expressive Speaking Styles Yifeng Ma, Suzhen Wang, Yu Ding, Bowen Ma, Tangjie Lv, Changjie Fan, Zhipeng Hu, Zhidong Deng, Xin Yu In order to produce facial-expression-specified talking head videos, previous audio-driven one-shot talking head methods need to use a reference video with a matching speaking style (i.e., facial expressions). However, finding videos with a desired style may not be easy, potentially restricting their application. In this work, we propose an expression-controllable one-shot talking head method, dubbed TalkCLIP, where the expression in a speech is specified by the natural language. This would significantly ease the difficulty of searching for a video with a desired speaking style. Here, we first construct a text-video paired talking head dataset, in which each video has alternative prompt-alike descriptions. Specifically, our descriptions involve coarse-level emotion annotations and facial action unit (AU) based fine-grained annotations. Then, we introduce a CLIP-based style encoder that first projects natural language descriptions to the CLIP text embedding space and then aligns the textual embeddings to the representations of speaking styles. As extensive textual knowledge has been encoded by CLIP, our method can even generalize to infer a speaking style whose description has not been seen during training. Extensive experiments demonstrate that our method achieves the advanced capability of generating photo-realistic talking heads with vivid facial expressions guided by text descriptions. This paper proposes TalkCLIP, a novel one-shot talking head generation framework that produces photo-realistic videos with speaking styles controlled by natural language descriptions. Existing methods for generating expressive talking heads rely on reference videos with matching speaking styles, which can be difficult and time-consuming to find. TalkCLIP addresses this limitation by enabling direct control of expressions via text, making the process more user-friendly and flexible. The authors construct TA-MEAD, a text-annotated talking head dataset based on MEAD, with coarse-level emotion and fine-grained AU-based descriptions. TalkCLIP utilizes a CLIP-based text encoder, trained with guidance from a video-to-speaking-style encoder, to map text descriptions to latent speaking style codes. These codes, along with audio features, drive a facial animation decoder and image renderer to generate the final video. TalkCLIP generates photo-realistic talking heads with vivid facial expressions accurately reflecting the input text descriptions. The method exhibits strong generalization capabilities, effectively handling out-of-domain text descriptions not seen during training. TalkCLIP achieves comparable or superior performance to state-of-the-art methods that rely on reference videos for style control. TalkCLIP may struggle to generate accurate speaking styles for abstract text descriptions, like idioms. The method does not currently consider the emotional content of the input audio, potentially leading to inconsistencies between audio and generated video. talking head generation, text-guided synthesis, expressive facial animation, clip, visual-language learning
2304.00287 Report Vision Transformers with Mixed-Resolution Tokenization Tomer Ronen, Omer Levy, Avram Golbert Vision Transformer models process input images by dividing them into a spatially regular grid of equal-size patches. Conversely, Transformers were originally introduced over natural language sequences, where each token represents a subword - a chunk of raw data of arbitrary size. In this work, we apply this approach to Vision Transformers by introducing a novel image tokenization scheme, replacing the standard uniform grid with a mixed-resolution sequence of tokens, where each token represents a patch of arbitrary size. Using the Quadtree algorithm and a novel saliency scorer, we construct a patch mosaic where low-saliency areas of the image are processed in low resolution, routing more of the model's capacity to important image regions. Using the same architecture as vanilla ViTs, our Quadformer models achieve substantial accuracy gains on image classification when controlling for the computational budget. Code and models are publicly available at https://github.com/TomerRonen34/mixed-resolution-vit . This paper introduces Quadformer, a Vision Transformer that utilizes a novel mixed-resolution tokenization scheme based on Quadtrees and a saliency scorer, allowing it to process important image regions in high resolution. The standard uniform grid tokenization in ViTs can be inefficient, treating all image regions equally. Quadformer aims to improve efficiency by focusing computational resources on salient image areas. Quadformer replaces the uniform grid with a Quadtree-based patch mosaic, using a saliency scorer to determine patch sizes. It employs 2D position embeddings and adapts the standard ViT architecture to process mixed-resolution tokens. Quadformer with feature-based saliency consistently outperforms vanilla ViTs in accuracy by up to 0.88% when controlling for the number of patches or GMACs. Despite not using accelerated inference techniques, Quadformer also shows gains when controlling for inference speed. Quadformers exhibit less sensitivity to out-of-distribution input lengths compared to vanilla ViTs, enabling a better inference-time compute-accuracy trade-off with a single model. The runtime overhead of the feature-based saliency scorer can be significant for small ViT models. Finding faster high-quality saliency estimators is crucial for efficient mixed-resolution tokenization in small ViTs. vision transformers, tokenization, quadtrees, saliency, mixed-resolution
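Illustrative sketch for Quadformer (2304.00287): a minimal version of the Quadtree tokenization step, assuming a saliency map is already available (the paper's feature-based saliency scorer is not reproduced); the function name, threshold, and minimum patch size are illustrative choices.

```python
import torch

def quadtree_patches(saliency, y=0, x=0, size=None, min_size=16, thresh=0.1):
    """Recursively split a square region into 4 while its mean saliency is high.

    Returns a list of (y, x, size) patches: low-saliency areas stay coarse,
    salient areas are refined down to `min_size` (mixed-resolution tokens).
    """
    if size is None:
        size = saliency.shape[-1]
    region = saliency[y:y + size, x:x + size]
    if size <= min_size or region.mean() < thresh:
        return [(y, x, size)]
    half = size // 2
    patches = []
    for dy in (0, half):
        for dx in (0, half):
            patches += quadtree_patches(saliency, y + dy, x + dx, half,
                                        min_size, thresh)
    return patches

# Toy 64x64 saliency map with a salient blob in the top-left corner.
sal = torch.zeros(64, 64)
sal[:24, :24] = 1.0
mosaic = quadtree_patches(sal)
print(len(mosaic), "patches, sizes:", sorted({s for _, _, s in mosaic}))
```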
2304.00186 Report Subject-driven Text-to-Image Generation via Apprenticeship Learning Wenhu Chen, Hexiang Hu, Yandong Li, Nataniel Ruiz, Xuhui Jia, Ming-Wei Chang, William W. Cohen Recent text-to-image generation models like DreamBooth have made remarkable progress in generating highly customized images of a target subject, by fine-tuning an "expert model" for a given subject from a few examples. However, this process is expensive, since a new expert model must be learned for each subject. In this paper, we present SuTI, a Subject-driven Text-to-Image generator that replaces subject-specific fine tuning with in-context learning. Given a few demonstrations of a new subject, SuTI can instantly generate novel renditions of the subject in different scenes, without any subject-specific optimization. SuTI is powered by apprenticeship learning, where a single apprentice model is learned from data generated by a massive number of subject-specific expert models. Specifically, we mine millions of image clusters from the Internet, each centered around a specific visual subject. We adopt these clusters to train a massive number of expert models, each specializing in a different subject. The apprentice model SuTI then learns to imitate the behavior of these fine-tuned experts. SuTI can generate high-quality and customized subject-specific images 20x faster than optimization-based SoTA methods. On the challenging DreamBench and DreamBench-v2, our human evaluation shows that SuTI significantly outperforms existing models like InstructPix2Pix, Textual Inversion, Imagic, Prompt2Prompt, Re-Imagen and DreamBooth, especially on the subject and text alignment aspects. This paper introduces SuTI, a subject-driven text-to-image generation model that uses in-context learning to generate customized images of a target subject without subject-specific optimization. Existing subject-driven image generation methods require fine-tuning specific models for each subject, which is slow and expensive. SuTI employs apprenticeship learning, where it's trained on a massive dataset of image clusters to imitate the behavior of millions of specialized expert models. SuTI generates high-quality, customized subject-specific images 20x faster than optimization-based methods. On DreamBench and DreamBench-v2, SuTI significantly outperforms existing models in human evaluations. SuTI shows strong capabilities in subject re-contextualization, attribute editing, artistic style transfer, and accessorization. SuTI's generations are less diverse than DreamBooth and less faithful to low-level visual details. SuTI struggles with highly compositional prompts. text-to-image generation, subject-driven generation, in-context learning, apprenticeship learning, diffusion models
2304.00049 Report Ranking Regularization for Critical Rare Classes: Minimizing False Positives at a High True Positive Rate Mohammadi Kiarash, Zhao He, Mengyao Zhai, Frederick Tung In many real-world settings, the critical class is rare and a missed detection carries a disproportionately high cost. For example, tumors are rare and a false negative diagnosis could have severe consequences on treatment outcomes; fraudulent banking transactions are rare and an undetected occurrence could result in significant losses or legal penalties. In such contexts, systems are often operated at a high true positive rate, which may require tolerating high false positives. In this paper, we present a novel approach to address the challenge of minimizing false positives for systems that need to operate at a high true positive rate. We propose a ranking-based regularization (RankReg) approach that is easy to implement, and show empirically that it not only effectively reduces false positives, but also complements conventional imbalanced learning losses. With this novel technique in hand, we conduct a series of experiments on three broadly explored datasets (CIFAR-10&100 and Melanoma) and show that our approach lifts the previous state-of-the-art performance by notable margins. Presents RankReg, a ranking-based regularization method to minimize false positives in imbalanced classification tasks where a high true positive rate is critical. In critical applications like medical diagnosis and fraud detection, minimizing false positives at high true positive rates is crucial, which conventional methods often fail to address. RankReg adds a regularization term that penalizes lower rankings of critical positives, encouraging the model to rank them higher than non-critical negatives. This is optimized using a differentiable ranking method based on a combinatorial solver. RankReg consistently outperforms existing methods, including the state-of-the-art ALM, on CIFAR-10, CIFAR-100, and Melanoma datasets. The method is complementary to various conventional imbalanced learning losses, demonstrating its general applicability. RankReg shows robustness to label noise, making it suitable for real-world scenarios. The paper primarily focuses on binary classification tasks, and extending it to multi-class scenarios with more than one critical class requires further investigation. The buffer size for storing positive samples impacts performance and might require task-specific tuning. imbalanced classification, false positive rate minimization, critical rare classes, ranking regularization, differentiable ranking
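Illustrative sketch for RankReg (2304.00049): the paper optimizes ranks with a blackbox combinatorial solver; the sketch below substitutes a simple sigmoid-based soft-rank surrogate to convey the regularizer's intent (penalize critical positives that negatives outrank). It is a simplified stand-in, not the paper's exact formulation.

```python
import torch

def soft_rank_reg(scores, labels, tau=0.1):
    """Soft ranking penalty: for each critical positive, softly count how many
    negatives score above it. Minimizing this pushes positives toward the top
    of the ranking, lowering false positives at a high true-positive rate."""
    pos = scores[labels == 1]          # critical rare class
    neg = scores[labels == 0]
    # soft indicator that a negative outranks a positive
    outranked = torch.sigmoid((neg[None, :] - pos[:, None]) / tau)
    return outranked.mean()

# Toy example: predicted confidences for the critical class.
scores = torch.tensor([0.9, 0.3, 0.8, 0.6, 0.2], requires_grad=True)
labels = torch.tensor([1, 0, 0, 1, 0])

# Total loss = conventional imbalanced-learning loss + lambda * ranking regularizer.
reg = soft_rank_reg(scores, labels)
reg.backward()
print(float(reg), scores.grad)
```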
2303.18193 Report GVP: Generative Volumetric Primitives Mallikarjun B R, Xingang Pan, Mohamed Elgharib, Christian Theobalt Advances in 3D-aware generative models have pushed the boundary of image synthesis with explicit camera control. To achieve high-resolution image synthesis, several attempts have been made to design efficient generators, such as hybrid architectures with both 3D and 2D components. However, such a design compromises multiview consistency, and the design of a pure 3D generator with high resolution is still an open problem. In this work, we present Generative Volumetric Primitives (GVP), the first pure 3D generative model that can sample and render 512-resolution images in real-time. GVP jointly models a number of volumetric primitives and their spatial information, both of which can be efficiently generated via a 2D convolutional network. The mixture of these primitives naturally captures the sparsity and correspondence in the 3D volume. The training of such a generator with a high degree of freedom is made possible through a knowledge distillation technique. Experiments on several datasets demonstrate superior efficiency and 3D consistency of GVP over the state-of-the-art. Presents Generative Volumetric Primitives (GVP), the first 3D-aware generative model based on pure 3D representation that can render 512-resolution images in real-time. Addresses the limitations of existing 3D-aware GANs in achieving high-resolution image synthesis with real-time rendering and multiview consistency. Utilizes a mixture of volumetric primitives (MVP) representation, where each primitive models a local volume's color and density. Employs a 2D convolutional network to efficiently generate primitives and their spatial information. Leverages knowledge distillation from a pretrained 3D-aware GAN (EG3D) for stable training. Achieves much faster rendering than previous pure 3D GANs. Preserves better multiview consistency compared to hybrid architectures that rely on 2D upsampling. Learned primitives effectively capture 3D volume sparsity and adapt to different samples, hinting at correspondence learning. Struggles to generate high-quality details for certain features like curly hair due to the discontinuous 3D scene representation. Purely adversarial training proves unstable due to the discontinuous representation. generative models, 3d-aware gans, volumetric primitives, real-time rendering, multiview consistency
2303.18181 Report A Closer Look at Parameter-Efficient Tuning in Diffusion Models Chendong Xiang, Fan Bao, Chongxuan Li, Hang Su, Jun Zhu Large-scale diffusion models like Stable Diffusion are powerful and find various real-world applications while customizing such models by fine-tuning is both memory and time inefficient. Motivated by the recent progress in natural language processing, we investigate parameter-efficient tuning in large diffusion models by inserting small learnable modules (termed adapters). In particular, we decompose the design space of adapters into orthogonal factors -- the input position, the output position as well as the function form, and perform Analysis of Variance (ANOVA), a classical statistical approach for analyzing the correlation between discrete (design options) and continuous variables (evaluation metrics). Our analysis suggests that the input position of adapters is the critical factor influencing the performance of downstream tasks. Then, we carefully study the choice of the input position, and we find that putting the input position after the cross-attention block can lead to the best performance, validated by additional visualization analyses. Finally, we provide a recipe for parameter-efficient tuning in diffusion models, which is comparable if not superior to the fully fine-tuned baseline (e.g., DreamBooth) with only 0.75% extra parameters, across various customized tasks. This paper presents a systematic study on parameter-efficient tuning of large-scale diffusion models using lightweight trainable modules called adapters. Fine-tuning entire large diffusion models like Stable Diffusion for customization is computationally and memory intensive. Parameter-efficient tuning offers a more efficient alternative. The authors decompose the adapter design space into input position, output position, and function form. They utilize Analysis of Variance (ANOVA) to identify the most impactful factor on downstream task performance. Input position of the adapter is the most critical factor for effective parameter-efficient tuning. Placing the adapter after the cross-attention block in the U-Net architecture yields the best performance. The proposed adapter-based tuning achieves comparable or better results than full fine-tuning (DreamBooth) with only 0.75% extra parameters. The study primarily focuses on Stable Diffusion due to its open-source nature. Exploration of adapter placement beyond a single position is left for future work. diffusion models, parameter-efficient tuning, adapters, stable diffusion, transfer learning
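Illustrative sketch for 2303.18181: a zero-initialized bottleneck adapter whose input is the cross-attention output inside a hypothetical U-Net transformer block (the placement the paper's ANOVA study favors). Module names and dimensions are assumptions, not Stable Diffusion's actual class layout.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small bottleneck module: down-project, nonlinearity, up-project, residual.
    Only these weights are trained; the base model stays frozen."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # start as a near-identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

class TransformerBlockWithAdapter(nn.Module):
    """Hypothetical U-Net transformer block: the adapter's *input position* is
    right after the cross-attention output."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))
        self.adapter = Adapter(dim)

    def forward(self, x, context):
        x = x + self.self_attn(x, x, x, need_weights=False)[0]
        x = x + self.adapter(self.cross_attn(x, context, context,
                                             need_weights=False)[0])
        x = x + self.ff(x)
        return x

block = TransformerBlockWithAdapter(dim=320)
tokens = torch.randn(2, 64, 320)   # latent image tokens
text = torch.randn(2, 77, 320)     # text conditioning (same width here for simplicity)
print(block(tokens, text).shape)
```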
2303.18144 Report Siamese DETR Zeren Chen, Gengshi Huang, Wei Li, Jianing Teng, Kun Wang, Jing Shao, Chen Change Loy, Lu Sheng Recent self-supervised methods are mainly designed for representation learning with the base model, e.g., ResNets or ViTs. They cannot be easily transferred to DETR, with task-specific Transformer modules. In this work, we present Siamese DETR, a Siamese self-supervised pretraining approach for the Transformer architecture in DETR. We consider learning view-invariant and detection-oriented representations simultaneously through two complementary tasks, i.e., localization and discrimination, in a novel multi-view learning framework. Two self-supervised pretext tasks are designed: (i) Multi-View Region Detection aims at learning to localize regions-of-interest between augmented views of the input, and (ii) Multi-View Semantic Discrimination attempts to improve object-level discrimination for each region. The proposed Siamese DETR achieves state-of-the-art transfer performance on COCO and PASCAL VOC detection using different DETR variants in all setups. Code is available at https://github.com/Zx55/SiameseDETR. This paper introduces Siamese DETR, a novel self-supervised pretraining method for DETR that utilizes a Siamese network architecture. This approach addresses the challenge of pretraining DETR's task-specific Transformer modules for object detection, which conventional self-supervised methods struggle with. Siamese DETR employs two pretext tasks: (i) Multi-View Region Detection, which learns to locate corresponding regions in augmented views, and (ii) Multi-View Semantic Discrimination, which enhances object-level discrimination by maximizing global and regional semantic consistency. Siamese DETR surpasses previous methods like UP-DETR and DETReg in transfer learning performance for object detection on COCO and PASCAL VOC benchmarks. The method exhibits robustness across various DETR variants (Vanilla, Conditional, Deformable). Siamese DETR demonstrates faster convergence and stronger objectness priors compared to its counterparts. Siamese DETR still depends on a pre-trained CNN backbone (e.g., SwAV) and does not yet encompass a unified pretraining strategy for both CNN and Transformer components. Future work could explore more efficient frameworks for end-to-end DETR pretraining. self-supervised learning, object detection, detr, transformer, siamese networks
2303.18080 Report One-shot Unsupervised Domain Adaptation with Personalized Diffusion Models Yasser Benigmim, Subhankar Roy, Slim Essid, Vicky Kalogeiton, Stéphane Lathuilière Adapting a segmentation model from a labeled source domain to a target domain, where a single unlabeled datum is available, is one of the most challenging problems in domain adaptation and is otherwise known as one-shot unsupervised domain adaptation (OSUDA). Most of the prior works have addressed the problem by relying on style transfer techniques, where the source images are stylized to have the appearance of the target domain. Departing from the common notion of transferring only the target "texture" information, we leverage text-to-image diffusion models (e.g., Stable Diffusion) to generate a synthetic target dataset with photo-realistic images that not only faithfully depict the style of the target domain, but are also characterized by novel scenes in diverse contexts. The text interface in our method Data AugmenTation with diffUsion Models (DATUM) endows us with the possibility of guiding the generation of images towards desired semantic concepts while respecting the original spatial context of a single training image, which is not possible in existing OSUDA methods. Extensive experiments on standard benchmarks show that our DATUM surpasses the state-of-the-art OSUDA methods by up to +7.1%. The implementation is available at https://github.com/yasserben/DATUM. DATUM, a data augmentation pipeline powered by diffusion models for one-shot unsupervised domain adaptation in semantic segmentation. Addresses the challenging problem of adapting a segmentation model to a target domain when only a single unlabeled target image is available, a scenario known as one-shot unsupervised domain adaptation (OSUDA). 1. **Personalization Stage**: Fine-tune a pre-trained text-to-image diffusion model (e.g., Stable Diffusion) on the single target image to capture its style. 2. **Data Generation Stage**: Generate a synthetic target dataset using the fine-tuned model, guided by prompts containing class names (e.g., 'car', 'bus') to increase diversity. 3. **Adaptive Segmentation Stage**: Train a segmentation model using a standard UDA method on the labeled source data and the generated synthetic target data. DATUM significantly outperforms existing OSUDA methods on standard benchmarks (GTA to Cityscapes, SYNTHIA to Cityscapes) by up to +7.1% mIoU. Generating synthetic images that resemble the target domain's style and content is more effective than simply stylizing source images with target texture. Class-aware prompts in the data generation stage lead to more diverse and informative synthetic images, boosting performance. Potential for generating nonsensical objects due to the diffusion model's limitations, requiring caution in deployment. Reliance on pre-trained diffusion models that might encode biases. unsupervised domain adaptation, one-shot learning, semantic segmentation, diffusion models, data augmentation
2303.17968 Report VDN-NeRF: Resolving Shape-Radiance Ambiguity via View-Dependence Normalization Bingfan Zhu, Yanchao Yang, Xulong Wang, Youyi Zheng, Leonidas Guibas We propose VDN-NeRF, a method to train neural radiance fields (NeRFs) for better geometry under non-Lambertian surface and dynamic lighting conditions that cause significant variation in the radiance of a point when viewed from different angles. Instead of explicitly modeling the underlying factors that result in the view-dependent phenomenon, which could be complex yet not inclusive, we develop a simple and effective technique that normalizes the view-dependence by distilling invariant information already encoded in the learned NeRFs. We then jointly train NeRFs for view synthesis with view-dependence normalization to attain quality geometry. Our experiments show that even though shape-radiance ambiguity is inevitable, the proposed normalization can minimize its effect on geometry, which essentially aligns the optimal capacity needed for explaining view-dependent variations. Our method applies to various baselines and significantly improves geometry without changing the volume rendering pipeline, even if the data is captured under a moving light source. Code is available at: https://github.com/BoifZ/VDN-NeRF. Proposes VDN-NeRF, a method to improve geometry reconstruction in neural radiance fields (NeRFs) under non-Lambertian surface and dynamic lighting, by normalizing view-dependent radiance variations. Addresses the challenge of shape-radiance ambiguity in NeRFs, where inaccurate geometry can be compensated by overly complex radiance functions, especially under varying lighting. Normalizes view-dependence by distilling invariant features from rendered images using a depth prediction network and incorporating these features into a neural feature field alongside the radiance field. Achieves state-of-the-art geometry reconstruction compared to various baselines, evidenced by higher IoU and lower Chamfer Distance. Demonstrates robustness to varying lighting conditions, maintaining better geometry quality than baselines under dynamic illumination. Shows effectiveness in challenging real-world scenarios with dynamic lighting, such as underwater scenes and intra-oral scans. Limited exploration of the interplay between the level of feature invariance and the complexity of the scene. Future work could investigate extending the method to explicitly model and leverage temporal information for dynamic scene reconstruction. Future work could investigate the choice of depth prediction network and its impact on performance. neural radiance fields, view-dependence normalization, shape-radiance ambiguity, dynamic lighting, geometry reconstruction
2303.17905 Report 3D-aware Image Generation using 2D Diffusion Models Jianfeng Xiang, Jiaolong Yang, Binbin Huang, Xin Tong In this paper, we introduce a novel 3D-aware image generation method that leverages 2D diffusion models. We formulate the 3D-aware image generation task as multiview 2D image set generation, and further to a sequential unconditional-conditional multiview image generation process. This allows us to utilize 2D diffusion models to boost the generative modeling power of the method. Additionally, we incorporate depth information from monocular depth estimators to construct the training data for the conditional diffusion model using only still images. We train our method on a large-scale dataset, i.e., ImageNet, which is not addressed by previous methods. It produces high-quality images that significantly outperform prior methods. Furthermore, our approach showcases its capability to generate instances with large view angles, even though the training images are diverse and unaligned, gathered from "in-the-wild" real-world environments. This paper introduces a novel 3D-aware image generation method utilizing 2D diffusion models, framing the task as a sequential unconditional-conditional multiview image generation process. This approach leverages the power of 2D diffusion models for high-quality image generation, addressing the limitations of previous GAN-based methods in handling large-scale, in-the-wild datasets. The method uses monocular depth estimators to construct multiview training data from still images. It then employs an unconditional diffusion model to generate the initial view and a conditional model to iteratively generate subsequent views conditioned on previous ones. The method significantly outperforms state-of-the-art 3D-aware GANs on ImageNet in terms of image quality and diversity. It demonstrates comparable or better performance on smaller single-category datasets while producing more realistic 3D geometry. The method showcases the capability to generate scenes with large view angles, even up to 360 degrees, from unaligned training data. The reliance on estimated depth maps can introduce inaccuracies in the generated geometry. Generating images with very large view angles can lead to degraded quality due to data bias and domain drift. 3d-aware image generation, diffusion models, multiview image synthesis, depth estimation, generative modeling
2303.17803 Report Rethinking Local Perception in Lightweight Vision Transformer Qihang Fan, Huaibo Huang, Jiyang Guan, Ran He Vision Transformers (ViTs) have been shown to be effective in various vision tasks. However, resizing them to a mobile-friendly size leads to significant performance degradation. Therefore, developing lightweight vision transformers has become a crucial area of research. This paper introduces CloFormer, a lightweight vision transformer that leverages context-aware local enhancement. CloFormer explores the relationship between globally shared weights often used in vanilla convolutional operators and token-specific context-aware weights appearing in attention, then proposes an effective and straightforward module to capture high-frequency local information. In CloFormer, we introduce AttnConv, a convolution operator in attention's style. The proposed AttnConv uses shared weights to aggregate local information and deploys carefully designed context-aware weights to enhance local features. The combination of the AttnConv and vanilla attention which uses pooling to reduce FLOPs in CloFormer enables the model to perceive high-frequency and low-frequency information. Extensive experiments were conducted in image classification, object detection, and semantic segmentation, demonstrating the superiority of CloFormer. The code is available at \url{https://github.com/qhfan/CloFormer}. This paper introduces CloFormer, a lightweight vision transformer that enhances local perception through a context-aware convolution operator called AttnConv and a two-branch structure capturing both high and low-frequency information. Designing lightweight vision transformers suitable for mobile devices with minimal performance degradation compared to larger models is crucial. CloFormer employs a two-branch architecture. The local branch leverages AttnConv, combining shared-weight convolution for local feature aggregation and context-aware weights for enhancement. The global branch utilizes downsampled vanilla attention for capturing low-frequency global information. These branches are then fused for a comprehensive representation. CloFormer achieves state-of-the-art performance on ImageNet classification, surpassing competitors with similar model sizes and FLOPs. In COCO object detection and instance segmentation tasks, CloFormer consistently outperforms other backbones, demonstrating its effectiveness in dense prediction tasks. Spectral analysis confirms CloFormer's ability to capture both high-frequency and low-frequency information effectively through its two-branch design. The gating mechanism in AttnConv, while introducing stronger nonlinearity, might require careful tuning for optimal performance. Exploring alternative fusion strategies for the local and global branches could potentially further enhance CloFormer's performance. vision transformer, lightweight model, local perception, context-aware, attnconv
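Illustrative sketch for CloFormer (2303.17803): a rough version of the local branch, where a shared-weight depthwise convolution aggregates local information from V and context-aware (token-specific) weights derived from Q and K modulate the result. The gating and dimensions are assumptions, not the exact AttnConv operator.

```python
import torch
import torch.nn as nn

class AttnConvSketch(nn.Module):
    """Local-branch sketch: convolution in attention's style."""
    def __init__(self, dim, kernel=3):
        super().__init__()
        self.q = nn.Conv2d(dim, dim, 1)
        self.k = nn.Conv2d(dim, dim, 1)
        self.v = nn.Conv2d(dim, dim, 1)
        self.local_agg = nn.Conv2d(dim, dim, kernel, padding=kernel // 2,
                                   groups=dim)            # globally shared weights
        self.ctx = nn.Sequential(                          # context-aware weights
            nn.Conv2d(dim, dim, kernel, padding=kernel // 2, groups=dim),
            nn.Tanh(),
        )
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):                                  # x: (B, C, H, W)
        q, k, v = self.q(x), self.k(x), self.v(x)
        agg = self.local_agg(v)                            # shared-weight aggregation
        gate = self.ctx(q * k)                             # token-specific modulation
        return self.proj(gate * agg)

x = torch.randn(1, 64, 56, 56)
print(AttnConvSketch(64)(x).shape)   # torch.Size([1, 64, 56, 56])
```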
2303.17604 Report Token Merging for Fast Stable Diffusion Daniel Bolya, Judy Hoffman The landscape of image generation has been forever changed by open vocabulary diffusion models. However, at their core these models use transformers, which makes generation slow. Better implementations to increase the throughput of these transformers have emerged, but they still evaluate the entire model. In this paper, we instead speed up diffusion models by exploiting natural redundancy in generated images by merging redundant tokens. After making some diffusion-specific improvements to Token Merging (ToMe), our ToMe for Stable Diffusion can reduce the number of tokens in an existing Stable Diffusion model by up to 60% while still producing high quality images without any extra training. In the process, we speed up image generation by up to 2x and reduce memory consumption by up to 5.6x. Furthermore, this speed-up stacks with efficient implementations such as xFormers, minimally impacting quality while being up to 5.4x faster for large images. Code is available at https://github.com/dbolya/tomesd. This paper introduces 'ToMe for Stable Diffusion', a technique applying Token Merging (ToMe) to speed up Stable Diffusion without retraining. Open-vocabulary diffusion models like Stable Diffusion revolutionized image generation but are computationally expensive, limiting their accessibility. The authors adapt ToMe for dense prediction tasks by introducing 'unmerging', enabling token reduction during processing and reconstruction afterwards. They improve upon the naive ToMe application by optimizing token partitioning and experimentally evaluate different design choices for when, where, and how to apply ToMe. ToMe for Stable Diffusion speeds up image generation by up to 2x. It reduces memory consumption by up to 5.6x. It maintains high visual quality comparable to the original Stable Diffusion model, even when merging 60% of tokens. The current unmerging strategy is simple and could be improved for better information retention. Exploration of proportional attention or key-based similarity for token merging in diffusion models is left for future work. image generation, stable diffusion, token merging, speed optimization, memory reduction
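Illustrative sketch for ToMe for Stable Diffusion (2303.17604): a simplified version of the merge/unmerge idea, where the most similar token pairs between two alternating sets are averaged before a block and the merged result is copied back to both positions afterwards. The released tomesd code handles matching, proportions, and batching more carefully; this is only a toy version.

```python
import torch

def merge_unmerge(tokens, r):
    """tokens: (N, C). Merge the r most similar (a -> b) pairs by averaging into
    the b token, and return the reduced set plus an `unmerge` closure that copies
    each merged token back to both source positions. Destination collisions are
    ignored for brevity."""
    a, b = tokens[0::2], tokens[1::2]                     # bipartite split
    an = a / a.norm(dim=-1, keepdim=True)
    bn = b / b.norm(dim=-1, keepdim=True)
    sim = an @ bn.t()                                     # cosine similarity
    best_sim, best_b = sim.max(dim=-1)                    # best b-partner per a token
    merge_a = best_sim.topk(r).indices                    # a tokens merged away

    keep_a = torch.ones(a.shape[0], dtype=torch.bool)
    keep_a[merge_a] = False
    merged_b = b.clone()
    merged_b[best_b[merge_a]] = (a[merge_a] + b[best_b[merge_a]]) / 2
    out = torch.cat([a[keep_a], merged_b], dim=0)         # N - r tokens

    def unmerge(processed):
        n_keep = int(keep_a.sum())
        pa, pb = processed[:n_keep], processed[n_keep:]
        ra = torch.empty_like(a)
        ra[keep_a] = pa
        ra[merge_a] = pb[best_b[merge_a]]                 # copy merged result back
        full = torch.empty_like(tokens)
        full[0::2], full[1::2] = ra, pb
        return full

    return out, unmerge

x = torch.randn(64, 320)                                  # e.g. one block's latent tokens
merged, unmerge = merge_unmerge(x, r=16)
print(merged.shape)                                       # torch.Size([48, 320])
print(unmerge(merged).shape)                              # torch.Size([64, 320])
```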
2303.17603 Report NeRF-Supervised Deep Stereo Fabio Tosi, Alessio Tonioni, Daniele De Gregorio, Matteo Poggi We introduce a novel framework for training deep stereo networks effortlessly and without any ground-truth. By leveraging state-of-the-art neural rendering solutions, we generate stereo training data from image sequences collected with a single handheld camera. On top of them, a NeRF-supervised training procedure is carried out, from which we exploit rendered stereo triplets to compensate for occlusions and depth maps as proxy labels. This results in stereo networks capable of predicting sharp and detailed disparity maps. Experimental results show that models trained under this regime yield a 30-40% improvement over existing self-supervised methods on the challenging Middlebury dataset, filling the gap to supervised models and, most times, outperforming them at zero-shot generalization. Introduces NeRF-Supervised (NS) learning, a novel framework for training deep stereo networks without ground truth by leveraging neural rendering to generate stereo training data from image sequences captured with a single handheld camera. Addresses the challenge of obtaining flexible and scalable training data for deep stereo networks, a major limitation despite advancements in self-supervised learning and synthetic data. Utilizes Neural Radiance Fields (NeRF) to generate stereo triplets and depth maps from sparse image sequences. Employs a NeRF-supervised training protocol that combines a triplet photometric loss (addressing occlusions) and a rendered disparity loss (enhancing details) to train stereo networks. Models trained with NS achieve a 30-40% improvement over existing self-supervised methods on the Middlebury dataset. NS-trained networks demonstrate state-of-the-art zero-shot generalization, outperforming models trained on synthetic datasets and existing self-supervised methods. The ease of data collection and rendering allows for building extensive and scalable training datasets, potentially leading to even better results with more collected scenes. Current data collection is limited to small-scale, static scenes. NS-trained networks, like other stereo networks, face challenges in handling complex scenarios such as transparent surfaces or nighttime images. Future work may explore larger-scale data collection and specialized NeRF variants to address these limitations. deep stereo matching, self-supervised learning, zero-shot generalization, neural rendering, neural radiance fields
2303.17599 Report Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models Wen Wang, Yan Jiang, Kangyang Xie, Zide Liu, Hao Chen, Yue Cao, Xinlong Wang, Chunhua Shen Large-scale text-to-image diffusion models achieve unprecedented success in image generation and editing. However, how to extend such success to video editing is unclear. Recent initial attempts at video editing require significant text-to-video data and computation resources for training, which is often not accessible. In this work, we propose vid2vid-zero, a simple yet effective method for zero-shot video editing. Our vid2vid-zero leverages off-the-shelf image diffusion models, and doesn't require training on any video. At the core of our method is a null-text inversion module for text-to-video alignment, a cross-frame modeling module for temporal consistency, and a spatial regularization module for fidelity to the original video. Without any training, we leverage the dynamic nature of the attention mechanism to enable bi-directional temporal modeling at test time. Experiments and analyses show promising results in editing attributes, subjects, places, etc., in real-world videos. Code is made available at https://github.com/baaivision/vid2vid-zero. This paper presents vid2vid-zero, a zero-shot video editing method using off-the-shelf image diffusion models without requiring video training data. Existing video editing methods need substantial text-video data and computation for training, limiting their accessibility. This method addresses this by leveraging pre-trained image diffusion models for efficient video editing. vid2vid-zero utilizes: 1) a null-text inversion module for aligning text prompts with video content, 2) a spatial regularization module for maintaining fidelity to the original video, and 3) a cross-frame modeling module for ensuring temporal consistency. Notably, it introduces a spatial-temporal attention module for bi-directional temporal modeling during testing. vid2vid-zero effectively edits video styles, attributes, backgrounds, and subjects while preserving temporal consistency and faithfulness to the source. The proposed spatial-temporal attention mechanism is crucial for bidirectional temporal modeling, outperforming methods like Sparse-Causal Attention and Temporal-only Attention. The method excels in user preference tests, demonstrating superior quality, text-to-video alignment, and fidelity compared to techniques like Tune-A-Video and Plug-and-Play. The method may inherit limitations of the base image diffusion model, lacking specific temporal and motion priors not present in image-only training data. Directly editing actions in videos remains a challenge due to the absence of explicit motion modeling capabilities. diffusion models, video editing, zero-shot learning, vision-language models, temporal consistency
2303.17598 Report Consistent View Synthesis with Pose-Guided Diffusion Models Hung-Yu Tseng, Qinbo Li, Changil Kim, Suhib Alsisan, Jia-Bin Huang, Johannes Kopf Novel view synthesis from a single image has been a cornerstone problem for many Virtual Reality applications that provide immersive experiences. However, most existing techniques can only synthesize novel views within a limited range of camera motion or fail to generate consistent and high-quality novel views under significant camera movement. In this work, we propose a pose-guided diffusion model to generate a consistent long-term video of novel views from a single image. We design an attention layer that uses epipolar lines as constraints to facilitate the association between different viewpoints. Experimental results on synthetic and real-world datasets demonstrate the effectiveness of the proposed diffusion model against state-of-the-art transformer-based and GAN-based approaches. This paper introduces a pose-guided diffusion model for synthesizing long-term, consistent novel view videos from a single image. Existing view synthesis methods struggle to generate consistent and high-quality novel views under significant camera movement, limiting immersive VR applications. The proposed method utilizes a UNet-based diffusion model with a novel epipolar attention layer. This layer leverages epipolar line constraints to associate features between input and output viewpoints, ensuring geometric consistency. Stochastic conditioning and fixed noise injection during inference further enhance temporal coherence. Outperforms state-of-the-art transformer and GAN-based methods in synthesizing realistic and consistent novel views. Demonstrates superior performance in both short-term and long-term view synthesis scenarios. Effectiveness of the epipolar attention layer is validated through ablation studies. Limited capability in handling scenes with significantly different scales compared to training data. Inference speed is computationally expensive due to the multi-step denoising process. view synthesis, diffusion models, epipolar geometry, single-image view synthesis, long-term view synthesis
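Illustrative sketch for 2303.17598: the epipolar constraint can be realized as an additive attention bias that favors source-view positions near each target pixel's epipolar line. The fundamental matrix below is a toy value and the Gaussian weighting is an illustrative choice, not the paper's exact attention layer.

```python
import torch

def epipolar_attention_bias(F_mat, H, W, sigma=2.0):
    """F_mat: (3, 3) fundamental matrix mapping target pixels to epipolar lines
    in the source view (l = F x). Returns an (H*W, H*W) additive attention bias
    favoring source pixels close to the target pixel's epipolar line."""
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs.reshape(-1), ys.reshape(-1),
                       torch.ones(H * W)], dim=-1)         # (N, 3) homogeneous coords

    lines = pix @ F_mat.t()                                # (N, 3) epipolar lines
    # point-line distance |ax + by + c| / sqrt(a^2 + b^2) for every source pixel
    num = (lines @ pix.t()).abs()                          # (N_target, N_source)
    den = lines[:, :2].norm(dim=-1, keepdim=True).clamp_min(1e-8)
    dist = num / den
    return -(dist / sigma) ** 2                            # added to attention logits

H = W = 16
F_mat = torch.tensor([[0., -0.001, 0.01],
                      [0.001, 0., -0.02],
                      [-0.01, 0.02, 0.]])                  # toy fundamental matrix
bias = epipolar_attention_bias(F_mat, H, W)
print(bias.shape)                                          # torch.Size([256, 256])
# usage: attn = softmax(q @ k.transpose(-2, -1) / sqrt(d) + bias)
```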
2303.17591 Report Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models Eric Zhang, Kai Wang, Xingqian Xu, Zhangyang Wang, Humphrey Shi The unlearning problem of deep learning models, once primarily an academic concern, has become a prevalent issue in the industry. The significant advances in text-to-image generation techniques have prompted global discussions on privacy, copyright, and safety, as numerous unauthorized personal IDs, content, artistic creations, and potentially harmful materials have been learned by these models and later utilized to generate and distribute uncontrolled content. To address this challenge, we propose Forget-Me-Not, an efficient and low-cost solution designed to safely remove specified IDs, objects, or styles from a well-configured text-to-image model in as little as 30 seconds, without impairing its ability to generate other content. Alongside our method, we introduce the Memorization Score (M-Score) and ConceptBench to measure the models' capacity to generate general concepts, grouped into three primary categories: ID, object, and style. Using M-Score and ConceptBench, we demonstrate that Forget-Me-Not can effectively eliminate targeted concepts while maintaining the model's performance on other concepts. Furthermore, Forget-Me-Not offers two practical extensions: a) removal of potentially harmful or NSFW content, and b) enhancement of model accuracy, inclusion and diversity through concept correction and disentanglement. It can also be adapted as a lightweight model patch for Stable Diffusion, allowing for concept manipulation and convenient distribution. To encourage future research in this critical area and promote the development of safe and inclusive generative models, we will open-source our code and ConceptBench at https://github.com/SHI-Labs/Forget-Me-Not. This paper introduces Forget-Me-Not, an efficient and low-cost method for removing specific concepts (IDs, objects, styles) from trained text-to-image diffusion models without retraining the entire model. Addresses growing concerns about privacy, copyright, safety, and bias present in large-scale text-to-image models by enabling targeted removal of sensitive or unwanted information. Forget-Me-Not employs an "attention resteering" technique, minimizing the influence of target concept embeddings on the model's cross-attention layers through targeted fine-tuning. Successfully removes targeted concepts like identities (e.g., Elon Musk) and styles (e.g., Van Gogh) while preserving the model's ability to generate other content. Enables concept correction and disentanglement, allowing suppressed concepts to emerge and correcting biased representations. Can be used for removing harmful or NSFW content, as demonstrated with the removal of nudity triggered by specific prompts. Faces challenges in identifying and forgetting abstract concepts. May require manual intervention, such as concept-specific hyperparameter tuning. concept forgetting, text-to-image synthesis, diffusion models, stable diffusion, privacy, copyright, safety, bias
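Illustrative sketch for Forget-Me-Not (2303.17591): the attention-resteering objective drives cross-attention probabilities at the target concept's token positions toward zero during a short fine-tune of the attention projections. The hooked tensor shapes and layer handling below are assumptions; the toy tensors only demonstrate the loss computation.

```python
import torch

def attention_resteering_loss(cross_attn_maps, concept_token_ids):
    """cross_attn_maps: list of (batch, heads, image_tokens, text_tokens)
    attention probabilities collected from the UNet's cross-attention layers.
    The loss suppresses attention to the target concept's text tokens so the
    concept can no longer steer generation."""
    loss = 0.0
    for attn in cross_attn_maps:
        loss = loss + attn[..., concept_token_ids].pow(2).mean()
    return loss / len(cross_attn_maps)

# Toy tensors standing in for hooked attention maps from two layers.
maps = [torch.rand(1, 8, 64, 77, requires_grad=True),
        torch.rand(1, 8, 256, 77, requires_grad=True)]
concept_token_ids = torch.tensor([4, 5])   # positions of the concept's tokens in the prompt
loss = attention_resteering_loss(maps, concept_token_ids)
loss.backward()   # in practice the gradient updates the cross-attention K/V projections
print(float(loss))
```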
2303.17569 Report Iterative Prompt Learning for Unsupervised Backlit Image Enhancement Zhexin Liang, Chongyi Li, Shangchen Zhou, Ruicheng Feng, Chen Change Loy We propose a novel unsupervised backlit image enhancement method, abbreviated as CLIP-LIT, by exploring the potential of Contrastive Language-Image Pre-Training (CLIP) for pixel-level image enhancement. We show that the open-world CLIP prior not only aids in distinguishing between backlit and well-lit images, but also in perceiving heterogeneous regions with different luminance, facilitating the optimization of the enhancement network. Unlike high-level and image manipulation tasks, directly applying CLIP to enhancement tasks is non-trivial, owing to the difficulty in finding accurate prompts. To solve this issue, we devise a prompt learning framework that first learns an initial prompt pair by constraining the text-image similarity between the prompt (negative/positive sample) and the corresponding image (backlit image/well-lit image) in the CLIP latent space. Then, we train the enhancement network based on the text-image similarity between the enhanced result and the initial prompt pair. To further improve the accuracy of the initial prompt pair, we iteratively fine-tune the prompt learning framework to reduce the distribution gaps between the backlit images, enhanced results, and well-lit images via rank learning, boosting the enhancement performance. Our method alternates between updating the prompt learning framework and enhancement network until visually pleasing results are achieved. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in terms of visual quality and generalization ability, without requiring any paired data. This paper introduces CLIP-LIT, a novel unsupervised backlit image enhancement method leveraging Contrastive Language-Image Pre-Training (CLIP) for pixel-level enhancement. Existing methods struggle to effectively enhance backlit images due to the challenge of preserving well-lit regions while improving underexposed areas. This method explores the open-world CLIP prior to address these limitations. The methodology involves two stages: 1) Initializing prompts by constraining text-image similarity in CLIP space, and training an initial enhancement network. 2) Refining prompts via rank learning using backlit images, enhanced results, and well-lit images, iteratively improving the enhancement network. CLIP-LIT outperforms state-of-the-art methods in visual quality and quantitative metrics on both BAID and Backlit300 datasets. Iterative prompt learning effectively guides the network to focus on luminance and color distribution, leading to superior enhancement. The method generalizes well to unseen data, demonstrating robustness to diverse lighting conditions and scene content. The method may not handle extreme over-/under-exposed regions due to sRGB limitations. Current implementation doesn't address noise; future work could explore noise augmentation. backlit image enhancement, unsupervised learning, clip, prompt learning, image restoration
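Illustrative sketch for CLIP-LIT (2303.17569): the prompt-initialization stage classifies a learnable positive/negative prompt pair against well-lit and backlit image features in a CLIP-like embedding space. The `encode_image` stand-in and toy batches are hypothetical; the real method uses frozen CLIP encoders, learnable text-token embeddings, and an iterative rank-based refinement.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-in for a frozen CLIP image encoder: any function mapping
# images to unit-norm vectors of the shared embedding dimension works here.
def encode_image(x):                                  # (B, 3, H, W) -> (B, 512)
    return F.normalize(x.flatten(1)[:, :512], dim=-1)

prompt_pos = torch.nn.Parameter(torch.randn(512))     # learnable "well-lit" prompt embedding
prompt_neg = torch.nn.Parameter(torch.randn(512))     # learnable "backlit" prompt embedding
opt = torch.optim.Adam([prompt_pos, prompt_neg], lr=1e-3)

well_lit = torch.rand(4, 3, 16, 16) + 0.5             # toy well-lit batch
backlit = torch.rand(4, 3, 16, 16) * 0.3              # toy backlit batch
labels = torch.tensor([0] * 4 + [1] * 4)              # well-lit -> pos prompt, backlit -> neg prompt

for _ in range(100):
    feats = encode_image(torch.cat([well_lit, backlit]))
    prompts = F.normalize(torch.stack([prompt_pos, prompt_neg]), dim=-1)
    logits = feats @ prompts.t() / 0.07                # CLIP-style similarities
    loss = F.cross_entropy(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()

# The learned prompt pair then scores the enhancement network's outputs and is
# iteratively refined against backlit / enhanced / well-lit images.
print(float(loss))
```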
2303.17561 Report SoftCLIP: Softer Cross-modal Alignment Makes CLIP Stronger Yuting Gao, Jinfeng Liu, Zihan Xu, Tong Wu, Enwei Zhang, Wei Liu, Jie Yang, Ke Li, Xing Sun During the preceding biennium, vision-language pre-training has achieved noteworthy success on several downstream tasks. Nevertheless, acquiring high-quality image-text pairs, where the pairs are entirely exclusive of each other, remains a challenging task, and noise exists in the commonly used datasets. To address this issue, we propose SoftCLIP, a novel approach that relaxes the strict one-to-one constraint and achieves a soft cross-modal alignment by introducing a softened target, which is generated from the fine-grained intra-modal self-similarity. The intra-modal guidance is indicative to enable two pairs to have some local similarities and model many-to-many relationships between the two modalities. Besides, since the positive still dominates in the softened target distribution, we disentangle the negatives in the distribution to further boost the relation alignment with the negatives in the cross-modal learning. Extensive experiments demonstrate the effectiveness of SoftCLIP. In particular, on ImageNet zero-shot classification task, using CC3M/CC12M as pre-training dataset, SoftCLIP brings a top-1 accuracy improvement of 6.8%/7.2% over the CLIP baseline. SoftCLIP, a novel approach that leverages intra-modal self-similarity to achieve soft cross-modal alignment in vision-language pre-training, thereby improving performance. Acquiring perfectly matched image-text pairs for contrastive learning is challenging due to noise and many-to-many relationships between modalities in existing datasets, leading to suboptimal alignment. SoftCLIP utilizes: (1) Fine-grained intra-modal self-similarities (between ROIs and tags) as softened targets for cross-modal alignment. (2) Disentanglement of negatives in the softened target distribution to enhance relation alignment. SoftCLIP significantly outperforms CLIP baselines on ImageNet zero-shot classification, achieving a top-1 accuracy improvement of 6.8%/7.2% with CC3M/CC12M pre-training. It also exhibits significant gains on other zero-shot classification datasets, zero-shot image-text retrieval (Flickr30K, MS-COCO), instance retrieval (Oxford, Paris Buildings), and copy detection (INRIA Copydays). Ablation studies confirm the effectiveness of each proposed component (soft alignment, relation enhancement, symmetric KL-Divergence). The reliance on a pre-trained object-attribute detector introduces additional complexity. Exploration of alternative softened target sources and aggregation methods could further enhance performance. vision-language pre-training, contrastive learning, soft targets, cross-modal alignment, intra-modal self-similarity
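Illustrative sketch for SoftCLIP (2303.17561): intra-modal self-similarities (computed from fine-grained ROI/tag features in the paper) are softmax-normalized into soft alignment targets and matched to the cross-modal logits with a KL term. The toy similarity matrix, shapes, and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def softclip_loss(img_feat, txt_feat, intra_img_sim, intra_txt_sim, tau=0.07):
    """img_feat, txt_feat: (B, D) normalized global features.
    intra_img_sim, intra_txt_sim: (B, B) intra-modal self-similarities whose
    softmax replaces the usual one-hot contrastive target."""
    logits = img_feat @ txt_feat.t() / tau                  # cross-modal similarities
    target_i2t = F.softmax(intra_img_sim / tau, dim=-1)     # soft, not one-hot
    target_t2i = F.softmax(intra_txt_sim / tau, dim=-1)
    loss_i2t = F.kl_div(F.log_softmax(logits, dim=-1), target_i2t,
                        reduction="batchmean")
    loss_t2i = F.kl_div(F.log_softmax(logits.t(), dim=-1), target_t2i,
                        reduction="batchmean")
    return 0.5 * (loss_i2t + loss_t2i)

B, D = 8, 256
img = F.normalize(torch.randn(B, D), dim=-1)
txt = F.normalize(torch.randn(B, D), dim=-1)
# Toy intra-modal similarities; the diagonal dominates so positives still lead.
intra = torch.eye(B) * 3 + torch.rand(B, B) * 0.5
print(float(softclip_loss(img, txt, intra, intra)))
```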
2303.17559 Report DDP: Diffusion Model for Dense Visual Prediction Yuanfeng Ji, Zhe Chen, Enze Xie, Lanqing Hong, Xihui Liu, Zhaoqiang Liu, Tong Lu, Zhenguo Li, Ping Luo We propose a simple, efficient, yet powerful framework for dense visual predictions based on the conditional diffusion pipeline. Our approach follows a "noise-to-map" generative paradigm for prediction by progressively removing noise from a random Gaussian distribution, guided by the image. The method, called DDP, efficiently extends the denoising diffusion process into the modern perception pipeline. Without task-specific design and architecture customization, DDP is easy to generalize to most dense prediction tasks, e.g., semantic segmentation and depth estimation. In addition, DDP shows attractive properties such as dynamic inference and uncertainty awareness, in contrast to previous single-step discriminative methods. We show top results on three representative tasks with six diverse benchmarks; without tricks, DDP achieves state-of-the-art or competitive performance on each task compared to the specialist counterparts. For example, semantic segmentation (83.9 mIoU on Cityscapes), BEV map segmentation (70.6 mIoU on nuScenes), and depth estimation (0.05 REL on KITTI). We hope that our approach will serve as a solid baseline and facilitate future research. This paper presents DDP, a new framework for dense visual prediction tasks using conditional diffusion models. It iteratively removes noise from a random Gaussian distribution guided by the input image. Existing diffusion-based perception models are inefficient and complex. DDP offers a simple, effective, and general framework applicable to diverse dense prediction tasks. DDP separates image encoding from map decoding. During training, Gaussian noise is progressively added to the ground truth. The map decoder, conditioned on the encoded image, learns to reverse this process. Inference involves refining an initial noise map using the decoder and input image. DDP achieves state-of-the-art or competitive results on semantic segmentation (e.g., 83.9 mIoU on Cityscapes), BEV map segmentation (70.6 mIoU on nuScenes), and depth estimation (0.05 REL on KITTI). The method supports dynamic inference, trading off computation for prediction quality by adjusting the number of sampling steps. DDP naturally provides uncertainty estimations by analyzing the consistency of predictions across sampling steps. Multi-step inference increases computational cost compared to single-step methods. While effective on tested benchmarks, further research is needed to assess DDP's performance on a wider range of tasks. diffusion models, dense prediction, semantic segmentation, bev map segmentation, depth estimation
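Illustrative sketch for DDP (2303.17559): one "noise-to-map" training step for segmentation, with the label encoding simplified to scaled one-hot maps, a toy noise schedule, and single conv layers standing in for the image encoder and map decoder; not the released DDP code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, scale = 19, 0.1
image_encoder = nn.Conv2d(3, 64, 3, padding=1)                 # stand-in backbone
map_decoder = nn.Conv2d(64 + num_classes, num_classes, 3, padding=1)
opt = torch.optim.Adam(list(image_encoder.parameters()) +
                       list(map_decoder.parameters()), lr=1e-4)

image = torch.randn(2, 3, 64, 64)
gt = torch.randint(0, num_classes, (2, 64, 64))

# One training step of the noise-to-map objective.
x0 = F.one_hot(gt, num_classes).permute(0, 3, 1, 2).float() * scale
t = torch.rand(2, 1, 1, 1)                                     # diffusion time in [0, 1)
noise = torch.randn_like(x0)
xt = torch.sqrt(1 - t) * x0 + torch.sqrt(t) * noise            # corrupted map (simplified schedule)

feats = image_encoder(image)                                   # image condition, computed once at inference
pred = map_decoder(torch.cat([feats, xt], dim=1))              # predict the clean label map
loss = F.cross_entropy(pred, gt)
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
# At inference, the decoder is applied for several steps, refining a random map;
# fewer steps trade accuracy for speed (dynamic inference).
```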
2303.17546 Report PAIR-Diffusion: A Comprehensive Multimodal Object-Level Image Editor Vidit Goel, Elia Peruzzo, Yifan Jiang, Dejia Xu, Xingqian Xu, Nicu Sebe, Trevor Darrell, Zhangyang Wang, Humphrey Shi Generative image editing has recently witnessed extremely fast-paced growth. Some works use high-level conditioning such as text, while others use low-level conditioning. Nevertheless, most of them lack fine-grained control over the properties of the different objects present in the image, i.e. object-level image editing. In this work, we tackle the task by perceiving the images as an amalgamation of various objects and aim to control the properties of each object in a fine-grained manner. Out of these properties, we identify structure and appearance as the most intuitive to understand and useful for editing purposes. We propose PAIR Diffusion, a generic framework that can enable a diffusion model to control the structure and appearance properties of each object in the image. We show that having control over the properties of each object in an image leads to comprehensive editing capabilities. Our framework allows for various object-level editing operations on real images such as reference image-based appearance editing, free-form shape editing, adding objects, and variations. Thanks to our design, we do not require any inversion step. Additionally, we propose multimodal classifier-free guidance which enables editing images using both reference images and text when using our approach with foundational diffusion models. We validate the above claims by extensively evaluating our framework on both unconditional and foundational diffusion models. Please refer to https://vidit98.github.io/publication/conference-paper/pair_diff.html for code and model release. PAIR Diffusion, a generic framework enabling object-level structure and appearance editing in diffusion models. Most existing image editing methods lack fine-grained control over individual object properties, limiting comprehensive editing capabilities. PAIR Diffusion extracts per-object structure (shape, category) using panoptic segmentation and appearance representations using pre-trained image encoders (VGG, DINOv2). It conditions diffusion models on these representations, enabling object-level manipulation. Enables diverse object-level edits: appearance and shape editing, object addition, and variations. Achieves realistic and faithful appearance editing, outperforming baselines in quantitative evaluations (FID, L1, SSIM). Demonstrates precise structure control, surpassing previous methods (SEAN) in mIoU and SSIM scores. Current architecture modifications for incorporating structure and appearance are simple and can be further improved. Future work includes extending explicit control to other object aspects (illumination, pose) and improving identity preservation during editing. image editing, diffusion models, object-level editing, generative models, multimodal inference
2303.17225 Report FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation Jie Qin, Jie Wu, Pengxiang Yan, Ming Li, Ren Yuxi, Xuefeng Xiao, Yitong Wang, Rui Wang, Shilei Wen, Xin Pan, Xingang Wang Recently, open-vocabulary learning has emerged to accomplish segmentation for arbitrary categories of text-based descriptions, which popularizes the segmentation system to more general-purpose application scenarios. However, existing methods devote to designing specialized architectures or parameters for specific segmentation tasks. These customized design paradigms lead to fragmentation between various segmentation tasks, thus hindering the uniformity of segmentation models. Hence in this paper, we propose FreeSeg, a generic framework to accomplish Unified, Universal and Open-Vocabulary Image Segmentation. FreeSeg optimizes an all-in-one network via one-shot training and employs the same architecture and parameters to handle diverse segmentation tasks seamlessly in the inference procedure. Additionally, adaptive prompt learning facilitates the unified model to capture task-aware and category-sensitive concepts, improving model robustness in multi-task and varied scenarios. Extensive experimental results demonstrate that FreeSeg establishes new state-of-the-art results in performance and generalization on three segmentation tasks, which outperforms the best task-specific architectures by a large margin: 5.5% mIoU on semantic segmentation, 17.6% mAP on instance segmentation, 20.1% PQ on panoptic segmentation for the unseen class on COCO. FreeSeg is a novel framework for Unified, Universal and Open-Vocabulary Image Segmentation, using a single model to handle semantic, instance, and panoptic segmentation of arbitrary categories. Existing open-vocabulary segmentation methods are task-specific, hindering model uniformity and resource efficiency. FreeSeg addresses this by enabling a single model to handle diverse segmentation tasks and arbitrary categories. FreeSeg employs a two-stage framework: 1) extracting universal mask proposals via a unified network trained with multi-task labels and 2) zero-shot classification on masks using CLIP with adaptive prompt learning for task and category awareness. FreeSeg achieves state-of-the-art performance on open-vocabulary semantic, instance, and panoptic segmentation, outperforming previous methods by a large margin on unseen classes. The method shows strong generalization across datasets, demonstrating its ability to handle different data distributions and domains. FreeSeg's multi-task training reduces training costs by two-thirds compared to single-task training while achieving superior performance and generalization. FreeSeg's performance on instance segmentation, while exceeding previous open-vocabulary methods, is lower than specialized instance segmentation models due to the use of mask supervision. Future work could explore incorporating box-level supervision to further improve instance segmentation performance without compromising the framework's universality. open-vocabulary segmentation, universal segmentation, multi-task learning, prompt learning, zero-shot learning
2303.17189 Report LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation Guangcong Zheng, Xianpan Zhou, Xuewei Li, Zhongang Qi, Ying Shan, Xi Li Recently, diffusion models have achieved great success in image synthesis. However, when it comes to the layout-to-image generation where an image often has a complex scene of multiple objects, how to make strong control over both the global layout map and each detailed object remains a challenging task. In this paper, we propose a diffusion model named LayoutDiffusion that can obtain higher generation quality and greater controllability than the previous works. To overcome the difficult multimodal fusion of image and layout, we propose to construct a structural image patch with region information and transform the patched image into a special layout to fuse with the normal layout in a unified form. Moreover, Layout Fusion Module (LFM) and Object-aware Cross Attention (OaCA) are proposed to model the relationship among multiple objects and designed to be object-aware and position-sensitive, allowing for precisely controlling the spatial related information. Extensive experiments show that our LayoutDiffusion outperforms the previous SOTA methods on FID, CAS by relatively 46.35%, 26.70% on COCO-stuff and 44.29%, 41.82% on VG. Code is available at https://github.com/ZGCTroy/LayoutDiffusion. This paper proposes LayoutDiffusion, a diffusion model for layout-to-image generation that achieves high quality and controllability by fusing layout and image information in a unified form. Existing text-to-image generation models struggle with precise control over object placement and scene composition, while GAN-based layout-to-image methods suffer from limitations like unstable training. LayoutDiffusion addresses these challenges with a novel diffusion-based approach. The method introduces a structural image patch with region information, treating each patch as a special object. It employs a Layout Fusion Module (LFM) to model relationships between layout objects and an Object-aware Cross Attention (OaCA) mechanism to fuse multi-resolution image patches with the layout. LayoutDiffusion significantly outperforms state-of-the-art GAN-based and diffusion-based methods on FID, DS, and CAS metrics. The model demonstrates accurate control over object placement, size, and category, as evidenced by high YOLOScore. LayoutDiffusion generates high-quality images with diverse appearances for a given layout. Generating realistic images with no distortion and overlap for complex multi-object layouts remains challenging. The model is trained from scratch on specific datasets, and future work could explore combining it with text-guided diffusion models pre-trained on large text-image datasets. layout-to-image generation, diffusion models, image synthesis, multimodal fusion, controllable generation
2303.17158 Report KD-DLGAN: Data Limited Image Generation via Knowledge Distillation Kaiwen Cui, Yingchen Yu, Fangneng Zhan, Shengcai Liao, Shijian Lu, Eric Xing Generative Adversarial Networks (GANs) rely heavily on large-scale training data for training high-quality image generation models. With limited training data, the GAN discriminator often suffers from severe overfitting which directly leads to degraded generation especially in generation diversity. Inspired by the recent advances in knowledge distillation (KD), we propose KD-DLGAN, a knowledge-distillation based generation framework that introduces pre-trained vision-language models for training effective data-limited generation models. KD-DLGAN consists of two innovative designs. The first is aggregated generative KD that mitigates the discriminator overfitting by challenging the discriminator with harder learning tasks and distilling more generalizable knowledge from the pre-trained models. The second is correlated generative KD that improves the generation diversity by distilling and preserving the diverse image-text correlation within the pre-trained models. Extensive experiments over multiple benchmarks show that KD-DLGAN achieves superior image generation with limited training data. In addition, KD-DLGAN complements the state-of-the-art with consistent and substantial performance gains. KD-DLGAN, a novel image generation framework, leverages knowledge distillation from vision-language models to improve GAN training with limited data. Effective GAN training typically requires large-scale datasets. This work addresses the challenge of data-limited image generation, particularly the discriminator overfitting issue. The paper introduces two novel generative KD techniques: (1) Aggregated Generative KD (AGKD) challenges the discriminator with harder learning tasks by aggregating real/fake sample features and distilling knowledge from a pretrained CLIP model. (2) Correlated Generative KD (CGKD) distills and preserves diverse image-text correlations from CLIP to the GAN discriminator, improving generation diversity. KD-DLGAN consistently outperforms existing state-of-the-art data-limited image generation methods on various benchmarks (CIFAR, ImageNet, 100-shot, AFHQ). Both AGKD and CGKD techniques individually improve performance, and their combination yields the best results. The method generalizes well across different GAN architectures (StyleGAN-v2, BigGAN), generation tasks (object, face), and training data sizes. The study primarily focuses on CLIP as the teacher model; exploring other vision-language models is left for future work. Future research could explore the applications of KD-DLGAN in other image generation tasks like translation and editing. generative adversarial networks, knowledge distillation, data-limited image generation, vision-language models, discriminator overfitting
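The two distillation terms above can be pictured with a short sketch. The encoders below are tiny stand-ins (the real framework uses a GAN discriminator head and a frozen CLIP image/text encoder), and `aggregated_kd_loss` / `correlated_kd_loss` are one plausible reading of AGKD and CGKD rather than the paper's exact losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for the real networks (hypothetical shapes; the paper uses a GAN
# discriminator and a frozen pretrained CLIP image/text encoder).
class TinyEncoder(nn.Module):
    def __init__(self, in_dim=3 * 32 * 32, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, out_dim))
    def forward(self, x):
        return self.net(x)

disc_feat = TinyEncoder(out_dim=128)            # discriminator feature head (trainable)
clip_image = TinyEncoder(out_dim=128).eval()    # frozen "CLIP" image encoder stand-in
for p in clip_image.parameters():
    p.requires_grad_(False)

def aggregated_kd_loss(real, fake, alpha=0.5):
    """Distill features of mixed real/fake samples from the frozen encoder
    into the discriminator (one reading of 'aggregated generative KD')."""
    mixed = alpha * real + (1 - alpha) * fake
    with torch.no_grad():
        target = F.normalize(clip_image(mixed), dim=-1)
    student = F.normalize(disc_feat(mixed), dim=-1)
    return F.mse_loss(student, target)

def correlated_kd_loss(images, text_emb):
    """Match the teacher's image-text similarity structure in the discriminator's
    feature space (one reading of 'correlated generative KD')."""
    with torch.no_grad():
        sim_teacher = F.normalize(clip_image(images), dim=-1) @ text_emb.t()
    sim_student = F.normalize(disc_feat(images), dim=-1) @ text_emb.t()
    return F.mse_loss(sim_student.softmax(-1), sim_teacher.softmax(-1))

real = torch.randn(4, 3, 32, 32)
fake = torch.randn(4, 3, 32, 32)
text_emb = F.normalize(torch.randn(8, 128), dim=-1)   # placeholder text features
loss = aggregated_kd_loss(real, fake) + correlated_kd_loss(fake, text_emb)
loss.backward()   # gradients reach only the discriminator head
```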
2303.17155 Report Discriminative Class Tokens for Text-to-Image Diffusion Models Idan Schwartz, Vésteinn Snæbjarnarson, Hila Chefer, Ryan Cotterell, Serge Belongie, Lior Wolf, Sagie Benaim Recent advances in text-to-image diffusion models have enabled the generation of diverse and high-quality images. While impressive, the images often fall short of depicting subtle details and are susceptible to errors due to ambiguity in the input text. One way of alleviating these issues is to train diffusion models on class-labeled datasets. This approach has two disadvantages: (i) supervised datasets are generally small compared to large-scale scraped text-image datasets on which text-to-image models are trained, affecting the quality and diversity of the generated images, or (ii) the input is a hard-coded label, as opposed to free-form text, limiting the control over the generated images. In this work, we propose a non-invasive fine-tuning technique that capitalizes on the expressive potential of free-form text while achieving high accuracy through discriminative signals from a pretrained classifier. This is done by iteratively modifying the embedding of an added input token of a text-to-image diffusion model, by steering generated images toward a given target class according to a classifier. Our method is fast compared to prior fine-tuning methods and does not require a collection of in-class images or retraining of a noise-tolerant classifier. We evaluate our method extensively, showing that the generated images are: (i) more accurate and of higher quality than standard diffusion models, (ii) can be used to augment training data in a low-resource setting, and (iii) reveal information about the data used to train the guiding classifier. The code is available at \url{https://github.com/idansc/discriminative_class_tokens}. The paper proposes a novel fine-tuning technique for text-to-image diffusion models that introduces a discriminative class token representing specific classes from a pre-trained classifier. This technique addresses the limitations of existing text-to-image models in generating images with subtle details and resolving ambiguity in input text, leading to more accurate and higher-quality image generation. The method iteratively optimizes the embedding of the added class token by generating images and using feedback from the pre-trained classifier to steer the token towards generating images of the target class. A gradient skipping technique is used for efficient training. Generated images are more accurate and of higher quality (lower FID scores) compared to standard diffusion models. The technique can be used for data augmentation, improving classifier performance in low-resource settings. The optimized tokens can reveal information about the data used to train the guiding classifier. The method may still exhibit limitations in resolving highly ambiguous classes. Further research is needed to explore deeper backpropagation for potentially enhanced results. text-to-image synthesis, diffusion models, classifier guidance, fine-grained details, lexical ambiguity
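The core optimization loop — steer a single added token embedding with classifier feedback while everything else stays frozen — can be sketched as follows. The `generator` and `classifier` are toy differentiable stand-ins for the text-to-image diffusion model and the pretrained classifier; the diffusion sampling loop and the paper's gradient-skipping trick are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins: a differentiable "generator" mapping a prompt embedding
# to an image, and a frozen classifier giving the discriminative signal.
embed_dim, num_classes, target_class = 16, 10, 3
generator = nn.Sequential(nn.Linear(embed_dim, 3 * 8 * 8), nn.Tanh())
classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, num_classes)).eval()
for module in (generator, classifier):
    for p in module.parameters():
        p.requires_grad_(False)

# The only trainable quantity: the embedding of the added class token.
class_token = nn.Parameter(torch.zeros(embed_dim))
prompt_base = torch.randn(embed_dim)          # embedding of the fixed free-form prompt
optimizer = torch.optim.Adam([class_token], lr=1e-2)

for step in range(200):
    prompt = prompt_base + class_token        # "a photo of a <token> ..." analogue
    image = generator(prompt).view(1, 3, 8, 8)
    logits = classifier(image)
    loss = F.cross_entropy(logits, torch.tensor([target_class]))
    optimizer.zero_grad()
    loss.backward()                           # gradient reaches only the token embedding
    optimizer.step()

print("target-class probability:", logits.softmax(-1)[0, target_class].item())
```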
2303.17123 Report Masked and Adaptive Transformer for Exemplar Based Image Translation Chang Jiang, Fei Gao, Biao Ma, Yuhao Lin, Nannan Wang, Gang Xu We present a novel framework for exemplar based image translation. Recent advanced methods for this task mainly focus on establishing cross-domain semantic correspondence, which sequentially dominates image generation in the manner of local style control. Unfortunately, cross-domain semantic matching is challenging; and matching errors ultimately degrade the quality of generated images. To overcome this challenge, we improve the accuracy of matching on the one hand, and diminish the role of matching in image generation on the other hand. To achieve the former, we propose a masked and adaptive transformer (MAT) for learning accurate cross-domain correspondence, and executing context-aware feature augmentation. To achieve the latter, we use source features of the input and global style codes of the exemplar, as supplementary information, for decoding an image. Besides, we devise a novel contrastive style learning method, for acquire quality-discriminative style representations, which in turn benefit high-quality image generation. Experimental results show that our method, dubbed MATEBIT, performs considerably better than state-of-the-art methods, in diverse image translation tasks. The codes are available at \url{https://github.com/AiArt-HDU/MATEBIT}. This paper proposes MATEBIT, a novel framework for exemplar-based image translation that improves cross-domain semantic matching and integrates local and global style control for high-fidelity image generation. Exemplar-based image translation is challenging because cross-domain semantic matching is difficult and errors degrade the quality of generated images. This work aims to address this challenge for higher quality image translation. The proposed MATEBIT framework utilizes a Masked and Adaptive Transformer (MAT) for accurate cross-domain correspondence learning and feature augmentation. It also introduces a Contrastive Style Learning (CSL) method for discriminative style representation and employs a U-Net architecture with skip connections for preserving semantic information. MATEBIT consistently outperforms state-of-the-art methods in terms of FID, SWD, and style relevance metrics on various datasets. The proposed MAT effectively refines cross-domain correspondence and augments features, leading to improved image quality compared to baseline models. The CSL method enhances style control and generates high-quality images by learning to discriminate subtle differences in perceptual quality. Artistic portraits generated from facial photos show some degradation, potentially due to differences in edge maps between photos and paintings. Future work will explore semi-supervised learning or domain transfer techniques to address the limitations in edge map representation. image translation, exemplar-based, transformer, contrastive learning, style control
2303.17076 Report DiffCollage: Parallel Generation of Large Content with Diffusion Models Qinsheng Zhang, Jiaming Song, Xun Huang, Yongxin Chen, Ming-Yu Liu We present DiffCollage, a compositional diffusion model that can generate large content by leveraging diffusion models trained on generating pieces of the large content. Our approach is based on a factor graph representation where each factor node represents a portion of the content and a variable node represents their overlap. This representation allows us to aggregate intermediate outputs from diffusion models defined on individual nodes to generate content of arbitrary size and shape in parallel without resorting to an autoregressive generation procedure. We apply DiffCollage to various tasks, including infinite image generation, panorama image generation, and long-duration text-guided motion generation. Extensive experimental results with a comparison to strong autoregressive baselines verify the effectiveness of our approach. This paper proposes DiffCollage, a novel method for generating large content (e.g., images, videos, panoramas) using diffusion models trained on smaller pieces of content. This is important because collecting large-scale datasets for training diffusion models can be prohibitively expensive for certain content types. DiffCollage leverages the abundance of smaller pieces to synthesize high-quality large content. DiffCollage uses a factor graph representation of the large content, where each node represents a portion and is associated with a pre-trained diffusion model. The method approximates the joint distribution of the large content using Bethe approximation and leverages this to generate pieces in parallel and merge them seamlessly. DiffCollage outperforms autoregressive baselines in infinite image generation, achieving better FID+ scores and significantly faster generation speeds. It enables text-to-motion generation of long sequences with complex actions, exceeding the capabilities of models trained on shorter sequences. DiffCollage successfully synthesizes 360-degree panoramas from normal perspective images conditioned on semantic segmentation maps, showcasing its ability to handle complex dependency structures. DiffCollage relies on conditional independence assumptions between content pieces, potentially limiting its ability to capture long-range dependencies. Parallel computation in DiffCollage comes at the cost of increased memory footprint compared to autoregressive methods. diffusion models, large content generation, factor graphs, bethe approximation, parallel sampling
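A 1-D toy version of the score aggregation helps make the factor-graph idea concrete: factor scores are added on their windows and the scores of the shared overlap variables are subtracted so they are not counted twice, which is what lets all pieces be denoised in parallel. In the paper both sets of scores come from the same pretrained model; here two toy denoisers and a placeholder update rule stand in.

```python
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in for a pretrained diffusion model's noise predictor on a short segment."""
    def __init__(self, width):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(width, 64), nn.SiLU(), nn.Linear(64, width))
    def forward(self, x_t, t):
        return self.net(x_t)                   # the toy net ignores t for brevity

piece, overlap, n_factors = 16, 8, 4
stride = piece - overlap
length = n_factors * stride + overlap          # length of the long sample
factor_model = ToyDenoiser(piece)              # score model for factor nodes
overlap_model = ToyDenoiser(overlap)           # score model for variable (overlap) nodes

def joint_eps(x_t, t):
    """Linear-chain factor graph: add every factor's noise prediction on its window,
    subtract the overlap predictions so shared variables are not counted twice."""
    eps = torch.zeros_like(x_t)
    for i in range(n_factors):
        s = i * stride
        eps[:, s:s + piece] += factor_model(x_t[:, s:s + piece], t)
        if i > 0:
            eps[:, s:s + overlap] -= overlap_model(x_t[:, s:s + overlap], t)
    return eps

x = torch.randn(2, length)                     # all pieces are denoised in parallel
eps_hat = joint_eps(x, t=0.5)
x = x - 0.1 * eps_hat                          # placeholder update; a real sampler uses DDPM/DDIM
print(x.shape)                                 # torch.Size([2, 40])
```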
2303.16891 Report Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations Vibashan VS, Ning Yu, Chen Xing, Can Qin, Mingfei Gao, Juan Carlos Niebles, Vishal M. Patel, Ran Xu Existing instance segmentation models learn task-specific information using manual mask annotations from base (training) categories. These mask annotations require tremendous human effort, limiting the scalability to annotate novel (new) categories. To alleviate this problem, Open-Vocabulary (OV) methods leverage large-scale image-caption pairs and vision-language models to learn novel categories. In summary, an OV method learns task-specific information using strong supervision from base annotations and novel category information using weak supervision from image-captions pairs. This difference between strong and weak supervision leads to overfitting on base categories, resulting in poor generalization towards novel categories. In this work, we overcome this issue by learning both base and novel categories from pseudo-mask annotations generated by the vision-language model in a weakly supervised manner using our proposed Mask-free OVIS pipeline. Our method automatically generates pseudo-mask annotations by leveraging the localization ability of a pre-trained vision-language model for objects present in image-caption pairs. The generated pseudo-mask annotations are then used to supervise an instance segmentation model, freeing the entire pipeline from any labour-expensive instance-level annotations and overfitting. Our extensive experiments show that our method trained with just pseudo-masks significantly improves the mAP scores on the MS-COCO dataset and OpenImages dataset compared to the recent state-of-the-art methods trained with manual masks. Codes and models are provided in https://vibashan.github.io/ovis-web/. This paper proposes Mask-free OVIS, a novel pipeline for open-vocabulary instance segmentation that does not require any human-annotated instance-level labels. Existing instance segmentation methods rely on expensive manual annotations, limiting their scalability to novel categories. Open-Vocabulary methods, while promising, often overfit to base categories due to the discrepancy between strong base supervision and weak novel category supervision. The method generates pseudo-mask annotations for both base and novel categories using a pre-trained vision-language model (VLM). It leverages a weakly-supervised proposal network and iterative masking with GradCAM to localize objects and generate accurate masks. These pseudo-masks are then used to train a Mask-RCNN model for open-vocabulary instance segmentation. Mask-free OVIS achieves state-of-the-art performance on MS-COCO and OpenImages datasets for open-vocabulary instance segmentation, even without using any manual mask annotations. The weakly-supervised proposal network effectively generalizes to novel categories compared to fully-supervised counterparts. Iterative masking with GradCAM significantly improves the quality of pseudo-mask generation by capturing less discriminative object regions. The performance of Mask-free OVIS, while surpassing existing methods trained without base annotations, is still lower than those fine-tuned with base annotations, suggesting room for improvement in pseudo-mask quality. The iterative masking strategy, while effective, may introduce redundant activations if performed for too many iterations, requiring careful hyperparameter tuning. open-vocabulary learning, instance segmentation, weakly-supervised learning, vision-language models, pseudo-labeling
2303.16513 Report Cascaded Local Implicit Transformer for Arbitrary-Scale Super-Resolution Hao-Wei Chen, Yu-Syuan Xu, Min-Fong Hong, Yi-Min Tsai, Hsien-Kai Kuo, Chun-Yi Lee Implicit neural representation has recently shown a promising ability in representing images with arbitrary resolutions. In this paper, we present a Local Implicit Transformer (LIT), which integrates the attention mechanism and frequency encoding technique into a local implicit image function. We design a cross-scale local attention block to effectively aggregate local features. To further improve representative power, we propose a Cascaded LIT (CLIT) that exploits multi-scale features, along with a cumulative training strategy that gradually increases the upsampling scales during training. We have conducted extensive experiments to validate the effectiveness of these components and analyze various training strategies. The qualitative and quantitative results demonstrate that LIT and CLIT achieve favorable results and outperform the prior works in arbitrary super-resolution tasks. This paper introduces Local Implicit Transformer (LIT) and Cascaded LIT (CLIT) for arbitrary-scale super-resolution, integrating attention mechanisms and frequency encoding into local implicit image functions for improved performance. Existing super-resolution methods often require separate models for different upsampling scales. LIT and CLIT address this limitation by enabling arbitrary-scale super-resolution with a single model, expanding potential applications. LIT employs a cross-scale local attention block to aggregate local features and a decoder to generate residual images. CLIT extends this by using a cascaded framework with multi-scale feature embeddings, trained with a cumulative strategy that progressively increases upsampling scales. CLIT outperforms prior local implicit neural representation methods on standard benchmarks like DIV2K, Set5, Set14, B100, and Urban100. Qualitative results demonstrate CLIT's ability to reconstruct sharp details and continuous structures more effectively than LIIF and LTE. Ablation studies confirm the contribution of each component in LIT and the effectiveness of the cumulative training strategy for both LIT and CLIT. Increasing the local grid size in LIT improves performance but also increases training time. Future work could explore extending CLIT with more sophisticated encoders or exploring its application in other image restoration tasks. super-resolution, arbitrary-scale, implicit neural representation, local attention, cascaded framework
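LIT builds on local implicit image functions, so a LIIF-style query is the natural minimal example: for each continuous output coordinate, look up the nearest latent feature, append the relative offset and the target cell size, and decode with an MLP. The cross-scale attention block, frequency encoding, and cascading are omitted; shapes and the (y, x) coordinate convention are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalImplicitDecoder(nn.Module):
    """Minimal LIIF-style decoder: predicts RGB at continuous coordinates from the
    nearest latent feature, the relative offset, and the target cell size."""
    def __init__(self, feat_dim=64, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2 + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, feat, coord, cell):
        # feat: (B, C, H, W); coord: (B, Q, 2) as (y, x) in [-1, 1]; cell: (B, Q, 2)
        B, C, H, W = feat.shape
        # Nearest-neighbour feature lookup (grid_sample expects (x, y) order).
        q_feat = F.grid_sample(feat, coord.flip(-1).unsqueeze(1),
                               mode='nearest', align_corners=False)       # (B, C, 1, Q)
        q_feat = q_feat.squeeze(2).permute(0, 2, 1)                       # (B, Q, C)
        # Coordinates of the latent codes that were sampled, for the relative offset.
        feat_coord = torch.stack(torch.meshgrid(
            torch.linspace(-1 + 1 / H, 1 - 1 / H, H),
            torch.linspace(-1 + 1 / W, 1 - 1 / W, W), indexing='ij'), dim=-1)
        feat_coord = feat_coord.to(feat).view(1, H, W, 2).permute(0, 3, 1, 2).expand(B, -1, -1, -1)
        q_coord = F.grid_sample(feat_coord, coord.flip(-1).unsqueeze(1),
                                mode='nearest', align_corners=False).squeeze(2).permute(0, 2, 1)
        rel = (coord - q_coord) * torch.tensor([H, W], dtype=feat.dtype)
        inp = torch.cat([q_feat, rel, cell * torch.tensor([H, W], dtype=feat.dtype)], dim=-1)
        return self.mlp(inp)

decoder = LocalImplicitDecoder()
feat = torch.randn(1, 64, 16, 16)              # low-resolution feature map
coord = torch.rand(1, 1024, 2) * 2 - 1         # 1024 arbitrary query points
cell = torch.full((1, 1024, 2), 2 / 48)        # target pixel size (e.g. a 48x48 output)
print(decoder(feat, coord, cell).shape)        # torch.Size([1, 1024, 3])
```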
2303.16509 Report HoloDiffusion: Training a 3D Diffusion Model using 2D Images Animesh Karnewar, Andrea Vedaldi, David Novotny, Niloy Mitra Diffusion models have emerged as the best approach for generative modeling of 2D images. Part of their success is due to the possibility of training them on millions if not billions of images with a stable learning objective. However, extending these models to 3D remains difficult for two reasons. First, finding a large quantity of 3D training data is much more complex than for 2D images. Second, while it is conceptually trivial to extend the models to operate on 3D rather than 2D grids, the associated cubic growth in memory and compute complexity makes this infeasible. We address the first challenge by introducing a new diffusion setup that can be trained, end-to-end, with only posed 2D images for supervision; and the second challenge by proposing an image formation model that decouples model memory from spatial memory. We evaluate our method on real-world data, using the CO3D dataset which has not been used to train 3D generative models before. We show that our diffusion models are scalable, train robustly, and are competitive in terms of sample quality and fidelity to existing approaches for 3D generative modeling. This paper introduces HoloDiffusion, the first 3D-aware generative diffusion model trained with posed 2D images, producing 3D-consistent images. Extending diffusion models to 3D enhances generative capabilities, offering view consistency and direct manipulation potential for applications like object placement and content creation. The method uses a hybrid explicit-implicit 3D feature grid, decoupling model memory from spatial memory. A novel diffusion process learns the distribution of these grids using only posed 2D images by generating intermediate 3D features and applying a denoising 3D UNet trained with a photometric loss. HoloDiffusion generates high-quality, 3D-consistent samples, outperforming baselines qualitatively. The model demonstrates robustness and scalability, training effectively on a large dataset of real-world videos. Quantitative metrics like FID and KID confirm the superior performance of HoloDiffusion compared to existing methods. The method currently relies on camera information during training, requiring future work to explore joint viewpoint estimation. Exploration of conditional generation, editing capabilities for shape and appearance, and multi-class training are promising future directions. diffusion models, 3d generative models, view synthesis, neural rendering, 3d reconstruction
2303.16493 Report AnyFlow: Arbitrary Scale Optical Flow with Implicit Neural Representation Hyunyoung Jung, Zhuo Hui, Lei Luo, Haitao Yang, Feng Liu, Sungjoo Yoo, Rakesh Ranjan, Denis Demandolx To apply optical flow in practice, it is often necessary to resize the input to smaller dimensions in order to reduce computational costs. However, downsizing inputs makes the estimation more challenging because objects and motion ranges become smaller. Even though recent approaches have demonstrated high-quality flow estimation, they tend to fail to accurately model small objects and precise boundaries when the input resolution is lowered, restricting their applicability to high-resolution inputs. In this paper, we introduce AnyFlow, a robust network that estimates accurate flow from images of various resolutions. By representing optical flow as a continuous coordinate-based representation, AnyFlow generates outputs at arbitrary scales from low-resolution inputs, demonstrating superior performance over prior works in capturing tiny objects with detail preservation on a wide range of scenes. We establish a new state-of-the-art performance of cross-dataset generalization on the KITTI dataset, while achieving comparable accuracy on the online benchmarks to other SOTA methods. AnyFlow, a novel neural network architecture for optical flow estimation that can handle arbitrary image resolutions, leading to robust performance even with low-resolution inputs. Existing optical flow methods struggle with low-resolution images, limiting their use on devices where resizing to smaller sizes is often necessary for reduced computational cost. AnyFlow addresses this by producing high-quality flow estimations at arbitrary scales, even from downsampled input. The proposed AnyFlow builds upon RAFT and introduces: 1) a neural implicit flow upsampler for arbitrary scale output; 2) a multi-scale feature warping module to leverage high-resolution representations; 3) a dynamic lookup scheme with region encoding to adapt to diverse motions and input sizes. The model is trained with multi-scale inputs. Achieves state-of-the-art cross-dataset generalization performance on KITTI, surpassing previous methods by a significant margin. Demonstrates robustness to downsampling, maintaining high accuracy even with 50% downsampled images, outperforming existing methods that degrade substantially. Generates high-resolution optical flow directly from low-resolution input, avoiding artifacts introduced by interpolation or super-resolution techniques used in other methods. The dynamic lookup with region encoding, while improving performance on the training set, did not show consistent improvements on the test set, suggesting further exploration is needed. Future work could explore the application of AnyFlow to downstream tasks like video super-resolution, which require accurate optical flow from low-resolution input. optical flow, arbitrary resolution, implicit neural representation, multi-scale feature warping, dynamic lookup
2303.16482 Report Point2Pix: Photo-Realistic Point Cloud Rendering via Neural Radiance Fields Tao Hu, Xiaogang Xu, Shu Liu, Jiaya Jia Synthesizing photo-realistic images from a point cloud is challenging because of the sparsity of point cloud representation. Recent Neural Radiance Fields and extensions are proposed to synthesize realistic images from 2D input. In this paper, we present Point2Pix as a novel point renderer to link the 3D sparse point clouds with 2D dense image pixels. Taking advantage of the point cloud 3D prior and NeRF rendering pipeline, our method can synthesize high-quality images from colored point clouds, generally for novel indoor scenes. To improve the efficiency of ray sampling, we propose point-guided sampling, which focuses on valid samples. Also, we present Point Encoding to build Multi-scale Radiance Fields that provide discriminative 3D point features. Finally, we propose Fusion Encoding to efficiently synthesize high-quality images. Extensive experiments on the ScanNet and ArkitScenes datasets demonstrate the effectiveness and generalization. Proposes Point2Pix, a novel point cloud renderer that synthesizes photo-realistic images from colored point clouds, particularly for indoor scenes, by bridging the gap between point clouds and Neural Radiance Fields (NeRF). Addresses the challenge of synthesizing realistic images from sparse point clouds, leveraging the strengths of point cloud 3D priors and NeRF rendering. Combines point-guided sampling for efficient ray focusing, Multi-scale Radiance Fields for extracting discriminative 3D point features, and a Fusion Decoder to generate high-quality images from rendered feature maps. Achieves state-of-the-art results on ScanNet and ARkitScenes datasets, outperforming existing point cloud renderers and demonstrating strong generalization to novel indoor scenes. Significantly reduces rendering time and memory consumption compared to traditional NeRF-based methods due to efficient point-guided sampling and multi-scale feature rendering. Demonstrates applicability in point cloud upsampling and in-painting by leveraging the learned 3D point features and attributes. Rendering speed is relatively slow compared to caching-based methods. Generalization to arbitrary environments beyond indoor scenes remains a challenge. point cloud rendering, neural radiance fields, 3d point features, point-guided sampling, fusion decoder
2303.16333 Report Flow supervision for Deformable NeRF Chaoyang Wang, Lachlan Ewen MacDonald, Laszlo A. Jeni, Simon Lucey In this paper we present a new method for deformable NeRF that can directly use optical flow as supervision. We overcome the major challenge with respect to the computationally inefficiency of enforcing the flow constraints to the backward deformation field, used by deformable NeRFs. Specifically, we show that inverting the backward deformation function is actually not needed for computing scene flows between frames. This insight dramatically simplifies the problem, as one is no longer constrained to deformation functions that can be analytically inverted. Instead, thanks to the weak assumptions required by our derivation based on the inverse function theorem, our approach can be extended to a broad class of commonly used backward deformation field. We present results on monocular novel view synthesis with rapid object motion, and demonstrate significant improvements over baselines without flow supervision. This paper presents a new method to apply optical flow supervision to deformable NeRF, by deriving an analytical solution to compute object velocities directly from the backward warping field. Current deformable NeRFs struggle with rapid object motion due to the lack of temporal regularization. This method allows for the use of optical flow as supervision to address this issue. The method leverages the inverse function theorem to compute velocity fields from the backward warping field without needing an explicit inverse function. Scene flows are then computed via time integration of the velocities, enabling the use of optical flow for supervision. Flow supervision significantly improves the convergence speed and reconstruction quality for rapid object motion. The method outperforms baseline deformable NeRFs on datasets with lower effective multi-view factors. The method enables clean separation of moving foreground objects and static background. The method suffers from scale ambiguity due to the lack of depth supervision. The choice of the canonical frame can significantly impact performance for highly deformable objects. deformable nerf, optical flow, novel view synthesis, dynamic scene reconstruction, temporal regularization
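The key identity described above — the scene velocity can be read off the backward warp without inverting it — follows from requiring the canonical coordinate of a moving point to stay constant over time: ∂T/∂x · v + ∂T/∂t = 0, hence v = −(∂T/∂x)⁻¹ ∂T/∂t. The autograd sketch below uses a hand-written toy warp in place of the network-predicted deformation field.

```python
import torch

def backward_warp(x, t):
    """Toy backward deformation T(x, t): observation space -> canonical space.
    Stands in for the network-predicted warp of a deformable NeRF."""
    c, sn = torch.cos(0.2 * t), torch.sin(0.2 * t)
    zero, one = torch.zeros_like(t), torch.ones_like(t)
    rot = torch.stack([torch.stack([c, -sn, zero]),
                       torch.stack([sn,  c, zero]),
                       torch.stack([zero, zero, one])])
    shift = torch.stack([torch.sin(t), torch.cos(t), 0.3 * t])
    return rot @ x + shift

def velocity(x, t):
    """v(x, t) = -(dT/dx)^{-1} dT/dt, obtained from dT/dx . v + dT/dt = 0:
    the canonical coordinate of a moving scene point stays constant."""
    J_x = torch.autograd.functional.jacobian(lambda p: backward_warp(p, t), x)   # (3, 3)
    J_t = torch.autograd.functional.jacobian(lambda s: backward_warp(x, s), t)   # (3,)
    return -torch.linalg.solve(J_x, J_t)

x = torch.tensor([0.5, -0.2, 1.0])
t = torch.tensor(0.3)
print(velocity(x, t))   # 3D velocity; integrating over time gives scene flow for flow supervision
```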
2303.16201 Report ASIC: Aligning Sparse in-the-wild Image Collections Kamal Gupta, Varun Jampani, Carlos Esteves, Abhinav Shrivastava, Ameesh Makadia, Noah Snavely, Abhishek Kar We present a method for joint alignment of sparse in-the-wild image collections of an object category. Most prior works assume either ground-truth keypoint annotations or a large dataset of images of a single object category. However, neither of the above assumptions hold true for the long-tail of the objects present in the world. We present a self-supervised technique that directly optimizes on a sparse collection of images of a particular object/object category to obtain consistent dense correspondences across the collection. We use pairwise nearest neighbors obtained from deep features of a pre-trained vision transformer (ViT) model as noisy and sparse keypoint matches and make them dense and accurate matches by optimizing a neural network that jointly maps the image collection into a learned canonical grid. Experiments on CUB and SPair-71k benchmarks demonstrate that our method can produce globally consistent and higher quality correspondences across the image collection when compared to existing self-supervised methods. Code and other material will be made available at \url{https://kampta.github.io/asic}. This paper presents ASIC, a method for obtaining consistent dense correspondences across a small collection of in-the-wild images of an object or object category, without requiring manual annotations. Dense image alignment in the wild is crucial for various applications, but most existing methods rely on annotated keypoints, large datasets, or specific object knowledge. ASIC addresses this gap by leveraging pre-trained self-supervised vision models to enable low-shot dense correspondence. ASIC utilizes sparse pseudo-correspondences from deep features of a pre-trained ViT model as noisy keypoint matches. It then jointly optimizes a neural network to map the image collection into a learned canonical grid, using a contrastive loss, equivariance regularization, and reconstruction terms. ASIC produces globally consistent and dense canonical space mappings for various object categories, effectively acting as a continuous co-segmentation. The method outperforms or achieves competitive results with existing unsupervised keypoint correspondence approaches on SPair-71k, CUB-200, PF-Willow, and SAMURAI datasets. ASIC exhibits superior consistency in keypoint propagation over image sequences compared to baselines, as demonstrated by the proposed k-cycle PCK metric. ASIC may struggle with left-right ambiguity in symmetric objects due to the nature of SSL models. The method might not handle large viewpoint changes well, especially when intermediate viewpoints are scarce. Future work will explore applications in few-shot tasks like reconstruction, pose estimation, and tracking. dense correspondence, image alignment, self-supervised learning, vision transformer, low-shot learning
2303.16187 Report Visual Chain-of-Thought Diffusion Models William Harvey, Frank Wood Recent progress with conditional image diffusion models has been stunning, and this holds true whether we are speaking about models conditioned on a text description, a scene layout, or a sketch. Unconditional image diffusion models are also improving but lag behind, as do diffusion models which are conditioned on lower-dimensional features like class labels. We propose to close the gap between conditional and unconditional models using a two-stage sampling procedure. In the first stage we sample an embedding describing the semantic content of the image. In the second stage we sample the image conditioned on this embedding and then discard the embedding. Doing so lets us leverage the power of conditional diffusion models on the unconditional generation task, which we show improves FID by 25-50% compared to standard unconditional generation. This paper introduces Visual Chain-of-Thought Diffusion Models (VCDM), a two-stage sampling procedure for improved unconditional and lightly-conditional image generation. Unconditional and lightly-conditional diffusion models lag behind their heavily-conditioned counterparts in terms of sample quality. This paper aims to bridge this performance gap. VCDM first samples a semantically-meaningful CLIP embedding and then, conditioned on this embedding, samples the final image using a conditional diffusion model. The CLIP embedding is discarded after image generation. VCDM consistently outperforms standard unconditional diffusion models (EDM) on AFHQ, FFHQ, and ImageNet datasets. The performance gap between VCDM and using an oracle CLIP embedding is small, indicating the effectiveness of the learned auxiliary model for embedding sampling. VCDM achieves significant FID improvements with minimal computational overhead compared to baselines. VCDM relies on the availability of pretrained CLIP embedders, which might be limited for certain domains. Exploring alternative self-supervised representations or joint diffusion models over both image and embedding spaces could further enhance VCDM. diffusion models, image generation, clip embeddings, unconditional generation, conditional generation
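The two-stage sampler is simple enough to write down directly: draw a semantic embedding from an auxiliary prior, draw an image conditioned on it, and discard the embedding. Both modules below are linear stand-ins for the actual diffusion models over CLIP embeddings and images.

```python
import torch
import torch.nn as nn

# Stand-ins for the two learned samplers; in the paper the first is a diffusion model
# over CLIP image embeddings and the second a conditional image diffusion model.
class EmbeddingPrior(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Linear(dim, dim)
    @torch.no_grad()
    def sample(self, n):
        return self.net(torch.randn(n, 512))

class ConditionalImageModel(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Linear(dim, 3 * 64 * 64)
    @torch.no_grad()
    def sample(self, emb):
        return self.net(emb).view(-1, 3, 64, 64)

prior, image_model = EmbeddingPrior(), ConditionalImageModel()

def sample_unconditional(n):
    emb = prior.sample(n)                 # stage 1: draw a semantic embedding
    images = image_model.sample(emb)      # stage 2: draw an image conditioned on it
    return images                         # the embedding is discarded

print(sample_unconditional(2).shape)      # torch.Size([2, 3, 64, 64])
```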
2303.15951 Report F$^{2}$-NeRF: Fast Neural Radiance Field Training with Free Camera Trajectories Peng Wang, Yuan Liu, Zhaoxi Chen, Lingjie Liu, Ziwei Liu, Taku Komura, Christian Theobalt, Wenping Wang This paper presents a novel grid-based NeRF called F2-NeRF (Fast-Free-NeRF) for novel view synthesis, which enables arbitrary input camera trajectories and only costs a few minutes for training. Existing fast grid-based NeRF training frameworks, like Instant-NGP, Plenoxels, DVGO, or TensoRF, are mainly designed for bounded scenes and rely on space warping to handle unbounded scenes. Existing two widely-used space-warping methods are only designed for the forward-facing trajectory or the 360-degree object-centric trajectory but cannot process arbitrary trajectories. In this paper, we delve deep into the mechanism of space warping to handle unbounded scenes. Based on our analysis, we further propose a novel space-warping method called perspective warping, which allows us to handle arbitrary trajectories in the grid-based NeRF framework. Extensive experiments demonstrate that F2-NeRF is able to use the same perspective warping to render high-quality images on two standard datasets and a new free trajectory dataset collected by us. Project page: https://totoro97.github.io/projects/f2-nerf. This paper presents F$^2$-NeRF, a novel grid-based NeRF method that enables fast training with free camera trajectories for novel view synthesis in unbounded scenes. Existing fast grid-based NeRF methods are limited to bounded scenes or specific camera trajectories (forward-facing or 360° object-centric) due to their reliance on space warping techniques. F$^2$-NeRF introduces a new perspective warping method that generalizes to arbitrary camera trajectories. It maps 3D points to a warp space based on their projections onto multiple input views using PCA. Additionally, it employs an adaptive space subdivision scheme and multiple hash functions to efficiently handle unbounded scenes. F$^2$-NeRF outperforms baseline methods on a new Free dataset with challenging free camera trajectories while achieving training times of around 12 minutes on a 2080Ti GPU. The proposed perspective warping is shown to be compatible with both forward-facing and 360° object-centric trajectories, achieving comparable results to specialized warping techniques on LLFF and NeRF-360-V2 datasets. Ablation studies validate the effectiveness of perspective warping and perspective sampling for improved rendering quality on free trajectories. The current implementation relies on a fixed number of cameras for perspective warping computation, potentially limiting its representation capacity for scenes with drastic changes in viewpoints. The proposed perspective warping utilizes a linear approximation for perspective sampling, which may lead to suboptimal sampling in certain cases with complex geometry. neural radiance fields, novel view synthesis, space warping, perspective warping, unbounded scenes
2303.15893 Report VIVE3D: Viewpoint-Independent Video Editing using 3D-Aware GANs Anna Frühstück, Nikolaos Sarafianos, Yuanlu Xu, Peter Wonka, Tony Tung We introduce VIVE3D, a novel approach that extends the capabilities of image-based 3D GANs to video editing and is able to represent the input video in an identity-preserving and temporally consistent way. We propose two new building blocks. First, we introduce a novel GAN inversion technique specifically tailored to 3D GANs by jointly embedding multiple frames and optimizing for the camera parameters. Second, besides traditional semantic face edits (e.g. for age and expression), we are the first to demonstrate edits that show novel views of the head enabled by the inherent properties of 3D GANs and our optical flow-guided compositing technique to combine the head with the background video. Our experiments demonstrate that VIVE3D generates high-fidelity face edits at consistent quality from a range of camera viewpoints which are composited with the original video in a temporally and spatially consistent manner. Presents VIVE3D, a novel method for viewpoint-independent video editing that leverages 3D-aware GANs to enable realistic and consistent edits of facial attributes and camera viewpoints. Existing video editing techniques struggle with maintaining consistency when altering viewpoints, limiting their application in scenarios requiring perspective changes. VIVE3D decomposes the input video into identity and offset latent codes, allowing for personalized generator fine-tuning and per-frame editing. It also employs optical flow correction to ensure seamless compositing of the edited face onto the original body, even under significant viewpoint changes. VIVE3D successfully performs attribute edits (like aging) and viewpoint adjustments while preserving temporal and spatial consistency. The method surpasses existing 2D GAN-based techniques in handling viewpoint changes, showcasing superior quantitative and qualitative results. VIVE3D demonstrates strong generalization ability by compositing faces and motions from different videos with plausible outcomes. Lighting inconsistencies between source and target videos can impact the realism of composites, suggesting an area for future improvement. The reliance on per-frame optimization, while effective, can be computationally intensive. Exploring encoding strategies for EG3D could improve efficiency. video editing, 3d gans, viewpoint synthesis, facial attribute editing, deep learning
2303.15892 Report Head3D: Complete 3D Head Generation via Tri-plane Feature Distillation Yuhao Cheng, Yichao Yan, Wenhan Zhu, Ye Pan, Bowen Pan, Xiaokang Yang Head generation with diverse identities is an important task in computer vision and computer graphics, widely used in multimedia applications. However, current full head generation methods require a large number of 3D scans or multi-view images to train the model, resulting in expensive data acquisition cost. To address this issue, we propose Head3D, a method to generate full 3D heads with limited multi-view images. Specifically, our approach first extracts facial priors represented by tri-planes learned in EG3D, a 3D-aware generative model, and then proposes feature distillation to deliver the 3D frontal faces into complete heads without compromising head integrity. To mitigate the domain gap between the face and head models, we present dual-discriminators to guide the frontal and back head generation, respectively. Our model achieves cost-efficient and diverse complete head generation with photo-realistic renderings and high-quality geometry representations. Extensive experiments demonstrate the effectiveness of our proposed Head3D, both qualitatively and quantitatively. This paper proposes Head3D, a novel method for generating complete 3D heads using limited multi-view images and a pre-trained 3D face generator. Existing 3D head generation methods are limited to frontal faces or require expensive 3D scans. Head3D addresses these limitations by leveraging a cost-effective approach. Head3D extracts facial priors from a pre-trained EG3D model via tri-plane feature distillation, transferring identity information while completing head geometry. A dual-discriminator approach addresses the distribution gap between frontal and back head images. Head3D generates high-fidelity complete heads with photo-realistic renderings and detailed geometry. The proposed tri-plane feature distillation effectively transfers identity information while maintaining head integrity. Dual-discriminators improve generation quality by addressing the distribution gap and quantity imbalance between front and back views. The quality of generated heads, while high, is slightly lower than the pre-trained face generator due to knowledge distillation and limited data. Future work could explore improving generation quality and extending the method to handle various head poses and expressions. 3d head generation, tri-plane feature distillation, dual-discriminator, knowledge distillation, 3d-aware gan
2303.15780 Report Instruct 3D-to-3D: Text Instruction Guided 3D-to-3D conversion Hiromichi Kamata, Yuiko Sakuma, Akio Hayakawa, Masato Ishii, Takuya Narihira We propose a high-quality 3D-to-3D conversion method, Instruct 3D-to-3D. Our method is designed for a novel task, which is to convert a given 3D scene to another scene according to text instructions. Instruct 3D-to-3D applies pretrained Image-to-Image diffusion models for 3D-to-3D conversion. This enables the likelihood maximization of each viewpoint image and high-quality 3D generation. In addition, our proposed method explicitly inputs the source 3D scene as a condition, which enhances 3D consistency and controllability of how much of the source 3D scene structure is reflected. We also propose dynamic scaling, which allows the intensity of the geometry transformation to be adjusted. We performed quantitative and qualitative evaluations and showed that our proposed method achieves higher quality 3D-to-3D conversions than baseline methods. Presents Instruct 3D-to-3D, a method for converting a 3D scene into another based on text instructions, leveraging pretrained Image-to-Image diffusion models and dynamic scaling for high quality and 3D consistency. Editing 3D scenes with text instructions is desirable for easier and more intuitive 3D content creation. Uses a pretrained Image-to-Image diffusion model (InstructPix2Pix) conditioned on the source 3D scene and the text instruction to optimize a target 3D model. Dynamic scaling is employed to control the strength of geometry transformation by gradually decreasing and increasing the 3D resolution. Achieves higher quality 3D-to-3D conversions than CLIP-NeRF and DreamFusion in qualitative and quantitative evaluations. Demonstrates better preservation of source 3D scene structure while reflecting text instructions. User study confirms a strong preference for Instruct 3D-to-3D over baseline methods in terms of overall conversion quality. Limitations in handling instructions requiring spatial reasoning, such as accurately placing objects. Future work to incorporate depth information and improve spatial reasoning capabilities. 3d-to-3d conversion, text-guided 3d editing, diffusion models, dynamic scaling, implicit 3d representation
2303.15768 Report RobustSwap: A Simple yet Robust Face Swapping Model against Attribute Leakage Jaeseong Lee, Taewoo Kim, Sunghyun Park, Younggun Lee, Jaegul Choo Face swapping aims at injecting a source image's identity (i.e., facial features) into a target image, while strictly preserving the target's attributes, which are irrelevant to identity. However, we observed that previous approaches still suffer from source attribute leakage, where the source image's attributes interfere with the target image's. In this paper, we analyze the latent space of StyleGAN and find the adequate combination of the latents geared for face swapping task. Based on the findings, we develop a simple yet robust face swapping model, RobustSwap, which is resistant to the potential source attribute leakage. Moreover, we exploit the coordination of 3DMM's implicit and explicit information as a guidance to incorporate the structure of the source image and the precise pose of the target image. Despite our method solely utilizing an image dataset without identity labels for training, our model has the capability to generate high-fidelity and temporally consistent videos. Through extensive qualitative and quantitative evaluations, we demonstrate that our method shows significant improvements compared with the previous face swapping models in synthesizing both images and videos. Project page is available at https://robustswap.github.io/ This paper introduces RobustSwap, a novel face swapping model that addresses the issue of source attribute leakage in previous approaches. Existing face swapping methods often exhibit source attribute leakage, where attributes from the source image, such as hair or pose, contaminate the swapped image. This leakage degrades the quality and realism of the results. The paper analyzes StyleGAN's latent space to find an optimal combination of latent codes for preserving target attributes while injecting source identity. The proposed RobustSwap model utilizes a target attribute encoder, source identity encoder, and a shape-guided identity injection mechanism based on 3DMM. RobustSwap demonstrates superior performance in preserving target attributes and minimizing source attribute leakage compared to existing methods. The method achieves state-of-the-art quantitative results on CelebA-HQ dataset across various metrics, including identity similarity, attribute preservation, and image quality. RobustSwap can generate high-quality and temporally consistent videos even without training on video datasets, highlighting its robustness and generalization ability. The model's reliance on a pre-trained StyleGAN might limit its generalizability to unseen domains or facial variations. Further research could explore enhancing identity preservation while maintaining attribute fidelity. face swapping, source attribute leakage, stylegan, 3d morphable model (3dmm), latent space analysis
2303.15649 Report StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing Senmao Li, Joost van de Weijer, Taihang Hu, Fahad Shahbaz Khan, Qibin Hou, Yaxing Wang, Jian Yang A significant research effort is focused on exploiting the amazing capacities of pretrained diffusion models for the editing of images. They either finetune the model, or invert the image in the latent space of the pretrained model. However, they suffer from two problems: (1) Unsatisfying results for selected regions, and unexpected changes in nonselected regions. (2) They require careful text prompt editing where the prompt should include all visual objects in the input image. To address this, we propose two improvements: (1) Only optimizing the input of the value linear network in the cross-attention layers, is sufficiently powerful to reconstruct a real image. (2) We propose attention regularization to preserve the object-like attention maps after editing, enabling us to obtain accurate style editing without invoking significant structural changes. We further improve the editing technique which is used for the unconditional branch of classifier-free guidance, as well as the conditional one as used by P2P. Extensive experimental prompt-editing results on a variety of images, demonstrate qualitatively and quantitatively that our method has superior editing capabilities than existing and concurrent works. This paper introduces StyleDiffusion, a method for accurate text-based editing of real images using pre-trained text-guided diffusion models. Existing methods struggle with achieving accurate edits in specific regions while preserving the rest of the image and often require complex prompt engineering. StyleDiffusion addresses these limitations. StyleDiffusion maps a real image to the input embeddings of the *value* linear layer in the cross-attention layers, keeping the text embedding for the *key* layer fixed. This preserves structure from the input image while enabling style editing. It also introduces an attention regularization to maintain attention map accuracy during editing and proposes P2Plus, an enhancement to the P2P editing technique for improved handling of large structural changes. StyleDiffusion achieves more accurate editing of real images compared to baselines like Null-text and SDEdit, as demonstrated qualitatively and quantitatively using metrics like Structure Dist, NS-LPIPS, and Clipscore. The method successfully preserves the structure of non-edited regions in the image while modifying the targeted elements. Attention regularization proves essential for maintaining the fidelity of attention maps during editing, resulting in more precise edits. StyleDiffusion may not perform optimally when the input image contains objects in uncommon poses or when the semantic gap between source and target prompts is too large. Future work could explore extending StyleDiffusion to handle multiple edits simultaneously and further improve its efficiency for real-time applications. image editing, diffusion models, text-guided synthesis, attention mechanisms, style transfer
2303.15446 Report SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan Self-attention has become a defacto choice for capturing global context in various vision applications. However, its quadratic computational complexity with respect to image resolution limits its use in real-time applications, especially for deployment on resource-constrained mobile devices. Although hybrid approaches have been proposed to combine the advantages of convolutions and self-attention for a better speed-accuracy trade-off, the expensive matrix multiplication operations in self-attention remain a bottleneck. In this work, we introduce a novel efficient additive attention mechanism that effectively replaces the quadratic matrix multiplication operations with linear element-wise multiplications. Our design shows that the key-value interaction can be replaced with a linear layer without sacrificing any accuracy. Unlike previous state-of-the-art methods, our efficient formulation of self-attention enables its usage at all stages of the network. Using our proposed efficient additive attention, we build a series of models called "SwiftFormer" which achieves state-of-the-art performance in terms of both accuracy and mobile inference speed. Our small variant achieves 78.5% top-1 ImageNet-1K accuracy with only 0.8 ms latency on iPhone 14, which is more accurate and 2x faster compared to MobileViT-v2. Code: https://github.com/Amshaker/SwiftFormer This paper introduces SwiftFormer, a novel efficient additive attention mechanism for transformer-based real-time mobile vision applications, which replaces quadratic matrix multiplications in self-attention with linear element-wise multiplications. Self-attention, while effective for capturing global context in vision applications, has a quadratic computational complexity that limits its use in real-time applications on mobile devices. SwiftFormer addresses this by providing an efficient alternative. The paper proposes an efficient additive attention mechanism that eliminates the need for matrix multiplications by focusing on query-key interactions and using a linear layer for context encoding. This allows for a consistent hybrid design with the attention block used in all stages of the network. SwiftFormer achieves state-of-the-art performance in terms of accuracy and mobile inference speed, outperforming existing ConvNets, transformer-based, and hybrid models. The small variant of SwiftFormer achieves 78.5% top-1 ImageNet-1K accuracy with only 0.8ms latency on iPhone 14, surpassing MobileViT-v2 in both accuracy and speed. The effectiveness of SwiftFormer is demonstrated across various tasks, including image classification, object detection, instance segmentation, and semantic segmentation. The paper primarily focuses on 224x224 image resolution for evaluation and comparison. Future work may explore the use of neural architecture search to potentially further optimize SwiftFormer's performance. computer vision, mobile vision, efficient deep learning, transformers, self-attention
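The following sketch conveys the linear-complexity idea: tokens interact only through a pooled global query and element-wise products, and a linear layer replaces the key-value matrix multiplication, so no N×N attention map is ever formed. It follows the spirit of the paper's efficient additive attention but is not the exact published module; all dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EfficientAdditiveAttention(nn.Module):
    """Additive-attention-style token mixer with linear complexity in the number
    of tokens. Sketch only: the real SwiftFormer block differs in details."""
    def __init__(self, dim=256):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.w_a = nn.Parameter(torch.randn(dim, 1))     # learned attention vector
        self.proj = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                                # x: (B, N, D)
        q = F.normalize(self.to_q(x), dim=-1)
        k = F.normalize(self.to_k(x), dim=-1)
        attn = (q @ self.w_a) * self.scale               # (B, N, 1): one score per token
        attn = attn.softmax(dim=1)
        global_q = (attn * q).sum(dim=1, keepdim=True)   # (B, 1, D) pooled global query
        interaction = self.proj(global_q * k)            # element-wise query-key interaction
        return self.out(interaction + q)                 # linear layer replaces key-value matmul

x = torch.randn(2, 196, 256)                             # e.g. 14x14 image tokens
print(EfficientAdditiveAttention(256)(x).shape)          # torch.Size([2, 196, 256])
```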
2303.15437 Report FaceLit: Neural 3D Relightable Faces Anurag Ranjan, Kwang Moo Yi, Jen-Hao Rick Chang, Oncel Tuzel We propose a generative framework, FaceLit, capable of generating a 3D face that can be rendered at various user-defined lighting conditions and views, learned purely from 2D images in-the-wild without any manual annotation. Unlike existing works that require careful capture setup or human labor, we rely on off-the-shelf pose and illumination estimators. With these estimates, we incorporate the Phong reflectance model in the neural volume rendering framework. Our model learns to generate shape and material properties of a face such that, when rendered according to the natural statistics of pose and illumination, produces photorealistic face images with multiview 3D and illumination consistency. Our method enables photorealistic generation of faces with explicit illumination and view controls on multiple datasets - FFHQ, MetFaces and CelebA-HQ. We show state-of-the-art photorealism among 3D aware GANs on FFHQ dataset achieving an FID score of 3.5. Proposes FaceLit, a generative framework that learns a disentangled 3D model of a face from 2D images, enabling rendering under various user-defined lighting conditions and views. Existing 3D generative models entangle geometry and illumination, limiting controllability. FaceLit addresses this by incorporating physics-based illumination for disentanglement. Embeds a simplified Phong reflectance model with Spherical Harmonics into the EG3D neural volume rendering pipeline. The model learns to generate shape and material properties for realistic rendering under varying pose and illumination. Achieves state-of-the-art FID score of 3.5 among 3D aware GANs on FFHQ dataset. Demonstrates photorealistic generation of faces with explicit illumination and view controls on FFHQ, MetFaces, and CelebA-HQ. Shows improved detail in challenging areas like lips and teeth compared to previous methods. Does not model all physical aspects of the scene, e.g., global illumination or subsurface scattering, which could further improve realism. Accuracy is limited by the performance of external pose and illumination estimation methods (DECA). generative model, 3d face reconstruction, relighting, neural volume rendering, disentanglement
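To make the rendering side concrete, here is a minimal Phong shading function of the kind FaceLit folds into neural volume rendering. The paper parameterizes illumination with spherical harmonics and learns per-point material properties; the single directional light and the `k_specular` / `shininess` values here are simplifications for illustration.

```python
import torch
import torch.nn.functional as F

def phong_radiance(albedo, normal, view_dir, light_dir, light_color,
                   k_specular=0.3, shininess=16.0):
    """Minimal Phong shading of per-point attributes: diffuse term from n.l,
    specular term from the reflected light direction."""
    n = F.normalize(normal, dim=-1)
    l = F.normalize(light_dir, dim=-1)
    v = F.normalize(view_dir, dim=-1)
    n_dot_l = n.mul(l).sum(-1, keepdim=True)
    diffuse = albedo * light_color * n_dot_l.clamp(min=0.0)
    r = 2.0 * n_dot_l * n - l                              # reflect l about n
    specular = k_specular * light_color * r.mul(v).sum(-1, keepdim=True).clamp(min=0.0) ** shininess
    return diffuse + specular

pts = 1024
albedo = torch.rand(pts, 3)                                # per-point material colour
normal = torch.randn(pts, 3)
view_dir = torch.randn(pts, 3)
light_dir = torch.tensor([0.0, 0.0, 1.0]).expand(pts, 3)
rgb = phong_radiance(albedo, normal, view_dir, light_dir, torch.ones(3))
print(rgb.shape)                                           # torch.Size([1024, 3])
```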
2303.15435 Report The Stable Signature: Rooting Watermarks in Latent Diffusion Models Pierre Fernandez, Guillaume Couairon, Hervé Jégou, Matthijs Douze, Teddy Furon Generative image modeling enables a wide range of applications but raises ethical concerns about responsible deployment. This paper introduces an active strategy combining image watermarking and Latent Diffusion Models. The goal is for all generated images to conceal an invisible watermark allowing for future detection and/or identification. The method quickly fine-tunes the latent decoder of the image generator, conditioned on a binary signature. A pre-trained watermark extractor recovers the hidden signature from any generated image and a statistical test then determines whether it comes from the generative model. We evaluate the invisibility and robustness of the watermarks on a variety of generation tasks, showing that Stable Signature works even after the images are modified. For instance, it detects the origin of an image generated from a text prompt, then cropped to keep $10\%$ of the content, with $90$+$\%$ accuracy at a false positive rate below 10$^{-6}$. This paper proposes Stable Signature, a method to embed invisible watermarks directly into images generated by Latent Diffusion Models (LDMs) for detection and identification purposes. The rise of AI-generated content necessitates reliable methods for detecting and tracing generated images to address concerns about authenticity, copyright, and misuse. Stable Signature fine-tunes the LDM decoder by back-propagating a combined loss of a perceptual image loss and a hidden message loss from a pre-trained watermark extractor. The extractor, trained using a simplified HiDDeN method, ensures robust watermark recovery even after image modifications. Stable Signature achieves high detection rates (e.g., 99% for unmodified images) with low false positive rates (e.g., 1 in 1 billion) on various image transformations. The method demonstrates accurate identification of the specific LDM model used for generation even with a large number of deployed models. Stable Signature maintains high image quality, with minimal perceptual differences between watermarked and original LDM outputs. The watermark's robustness against model purification attacks, where the model is fine-tuned to remove the watermark, requires further investigation. Exploring more powerful traitor tracing codes and accusation algorithms to enhance the identification of colluding users is an area for future work. image watermarking, latent diffusion models, ai-generated content detection, content authentication, responsible ai
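The fine-tuning recipe reduces to a two-term objective: a message loss that makes a frozen extractor recover the target signature from decoded images, plus an image loss that keeps the watermarked decoder close to the original one. The sketch below uses flat linear stand-ins for the VAE decoder and HiDDeN-style extractor, MSE in place of the perceptual (Watson) loss, and an illustrative loss weight.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins with hypothetical shapes; in the paper these are the Stable Diffusion
# VAE decoder and a HiDDeN-style watermark extractor.
latent_dim, num_bits = 4 * 8 * 8, 48
decoder = nn.Sequential(nn.Flatten(), nn.Linear(latent_dim, 3 * 32 * 32), nn.Tanh())
decoder_frozen = nn.Sequential(nn.Flatten(), nn.Linear(latent_dim, 3 * 32 * 32), nn.Tanh())
decoder_frozen.load_state_dict(decoder.state_dict())
extractor = nn.Sequential(nn.Linear(3 * 32 * 32, num_bits))
for module in (decoder_frozen, extractor):
    for p in module.parameters():
        p.requires_grad_(False)

signature = torch.randint(0, 2, (1, num_bits)).float()    # the key hidden in every image
optimizer = torch.optim.AdamW(decoder.parameters(), lr=1e-4)

for step in range(100):
    z = torch.randn(8, 4, 8, 8)                  # latents, as produced by the diffusion model
    img_w = decoder(z)                           # watermarked decode (being fine-tuned)
    with torch.no_grad():
        img_o = decoder_frozen(z)                # original decode, used as image target
    loss_msg = F.binary_cross_entropy_with_logits(extractor(img_w),
                                                  signature.expand(8, -1))
    loss_img = F.mse_loss(img_w, img_o)          # perceptual-loss stand-in
    loss = loss_msg + 0.2 * loss_img             # 0.2 is an illustrative weight
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

After fine-tuning, only the decoder changes; detection is a statistical test on the number of signature bits the extractor recovers from a suspect image.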
2303.15403 Report Training-free Content Injection using h-space in Diffusion Models Jaeseok Jeong, Mingi Kwon, Youngjung Uh Diffusion models (DMs) synthesize high-quality images in various domains. However, controlling their generative process remains unclear because the intermediate variables in the process have not been rigorously studied. Recently, the bottleneck feature of the U-Net, namely h-space, was found to convey the semantics of the resulting image. It enables StyleCLIP-like latent editing within DMs. In this paper, we explore further usage of h-space beyond attribute editing, and introduce a method to inject the content of one image into another image by combining their features in the generative processes. Briefly, given the original generative process of the other image, 1) we gradually blend the bottleneck feature of the content with proper normalization, and 2) we calibrate the skip connections to match the injected content. Unlike custom-diffusion approaches, our method does not require time-consuming optimization or fine-tuning. Instead, our method manipulates intermediate features within a feed-forward generative process. Furthermore, our method does not require supervision from external networks. The code is available at https://curryjung.github.io/InjectFusion/ This paper introduces InjectFusion, a training-free method for injecting content from one image into another using pretrained diffusion models by manipulating feature maps in the bottleneck of the U-Net. Controlling the generative process of diffusion models remains a challenge. This method provides a novel way to control content without requiring additional training or external networks, unlike previous approaches. InjectFusion uses spherical interpolation (Slerp) to blend bottleneck features of content and original images while preserving statistical correlations within the model. It further introduces "latent calibration" to fine-tune the blending and preserve image quality. Simply replacing bottleneck features leads to content injection but with significant distortion. Slerp with proper normalization effectively injects content while maintaining high image quality. InjectFusion successfully injects content even when using out-of-domain images as stylistic references. The small spatial dimensions of the bottleneck feature map limit the granularity of local content injection. Injecting content from out-of-domain images with drastically different semantics can lead to poor results. diffusion models, content injection, image synthesis, generative models, training-free
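The bottleneck-feature blending described above relies on spherical interpolation, which preserves norm statistics that plain linear blending would distort. Below is a minimal sketch of Slerp between two feature tensors, assuming they are treated as flattened vectors; the blending schedule across diffusion timesteps and the latent-calibration step are omitted.

```python
import torch

def slerp(h_orig: torch.Tensor, h_content: torch.Tensor, t: float, eps: float = 1e-7) -> torch.Tensor:
    """Spherical interpolation between two feature maps, treated as flattened vectors."""
    a, b = h_orig.flatten(), h_content.flatten()
    cos = torch.dot(a / (a.norm() + eps), b / (b.norm() + eps)).clamp(-1 + eps, 1 - eps)
    omega = torch.acos(cos)                                   # angle between the two features
    so = torch.sin(omega)
    mixed = (torch.sin((1 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b
    return mixed.view_as(h_orig)

h_mixed = slerp(torch.randn(1, 512, 8, 8), torch.randn(1, 512, 8, 8), t=0.3)
```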
2303.15389 Report EVA-CLIP: Improved Training Techniques for CLIP at Scale Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, Yue Cao Contrastive language-image pre-training, CLIP for short, has gained increasing attention for its potential in various scenarios. In this paper, we propose EVA-CLIP, a series of models that significantly improve the efficiency and effectiveness of CLIP training. Our approach incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance compared to previous CLIP models with the same number of parameters but significantly smaller training costs. Notably, our largest 5.0B-parameter EVA-02-CLIP-E/14+ with only 9 billion seen samples achieves 82.0 zero-shot top-1 accuracy on ImageNet-1K val. A smaller EVA-02-CLIP-L/14+ with only 430 million parameters and 6 billion seen samples achieves 80.4 zero-shot top-1 accuracy on ImageNet-1K val. To facilitate open access and open research, we release the complete suite of EVA-CLIP to the community at https://github.com/baaivision/EVA/tree/master/EVA-CLIP. This paper proposes EVA-CLIP, a family of models that significantly improves the efficiency and effectiveness of CLIP training, achieving superior performance compared to previous CLIP models with the same number of parameters but significantly smaller training costs. Training CLIP models is challenging due to high computational costs and training instability when scaling up. EVA-CLIP provides a feasible, efficient, and effective solution by significantly reducing training costs, stabilizing training, and improving zero-shot performance. The EVA-CLIP approach leverages pre-trained EVA representations for initialization, utilizes the LAMB optimizer, implements random dropping of input tokens (FLIP), and employs a speedup trick called flash attention. The 5.0B-parameter EVA-02-CLIP-E/14+ achieves 82.0% zero-shot top-1 accuracy on ImageNet-1K val with only 9 billion seen samples. The smaller EVA-02-CLIP-L/14+ achieves 80.4% zero-shot top-1 accuracy on ImageNet-1K val using only 430 million parameters and 6 billion seen samples. EVA-CLIP models demonstrate superior performance across various zero-shot benchmarks, including image classification, video classification, and image-text retrieval tasks. The zero-shot retrieval performance of EVA-02-CLIP-E/14, while competitive, is slightly lower than OpenCLIP-G/14, possibly due to the smaller text encoder capacity and fewer training samples. Future work could explore further scaling up EVA-CLIP models with larger text encoders and more training data, especially for retrieval tasks. clip, image classification, zero-shot learning, vision-language pre-training, efficient training
2303.15342 Report Exploring Continual Learning of Diffusion Models Michał Zając, Kamil Deja, Anna Kuzina, Jakub M. Tomczak, Tomasz Trzciński, Florian Shkurti, Piotr Miłoś Diffusion models have achieved remarkable success in generating high-quality images thanks to their novel training procedures applied to unprecedented amounts of data. However, training a diffusion model from scratch is computationally expensive. This highlights the need to investigate the possibility of training these models iteratively, reusing computation while the data distribution changes. In this study, we take the first step in this direction and evaluate the continual learning (CL) properties of diffusion models. We begin by benchmarking the most common CL methods applied to Denoising Diffusion Probabilistic Models (DDPMs), where we note the strong performance of the experience replay with the reduced rehearsal coefficient. Furthermore, we provide insights into the dynamics of forgetting, which exhibit diverse behavior across diffusion timesteps. We also uncover certain pitfalls of using the bits-per-dimension metric for evaluating CL. This paper presents the first study on the continual learning (CL) properties of diffusion models, benchmarking common CL methods on DDPMs and revealing insights into forgetting dynamics and evaluation metrics. Training diffusion models from scratch is computationally expensive, making iterative training with changing data distributions highly desirable. The authors evaluated various CL methods, including finetuning, L2 regularization, experience replay, and generative replay, on MNIST and Fashion-MNIST datasets. DDPMs exhibit significant catastrophic forgetting in CL settings. Experience replay with a reduced rehearsal coefficient effectively prevents catastrophic forgetting in DDPMs. Bits-per-dimension (BPD) is an unreliable metric for evaluating generative CL performance in diffusion models. The study was limited to MNIST and Fashion-MNIST datasets. Future work could explore novel CL strategies tailored for diffusion models and extend the analysis to more complex scenarios like text-to-image generation. continual learning, diffusion models, generative models, catastrophic forgetting, experience replay
2303.15234 Report Prompt Tuning based Adapter for Vision-Language Model Adaption Jingchen Sun, Jiayu Qin, Zihao Lin, Changyou Chen Large pre-trained vision-language (VL) models have shown significant promise in adapting to various downstream tasks. However, fine-tuning the entire network is challenging due to the massive number of model parameters. To address this issue, efficient adaptation methods such as prompt tuning have been proposed. We explore the idea of prompt tuning with multi-task pre-trained initialization and find it can significantly improve model performance. Based on our findings, we introduce a new model, termed Prompt-Adapter, that combines pre-trained prompt tuning with an efficient adaptation network. Our approach beats the state-of-the-art methods in few-shot image classification on 11 public datasets, especially in settings with limited data, such as the 1-, 2-, 4-, and 8-shot regimes. Our proposed method demonstrates the promise of combining prompt tuning and parameter-efficient networks for efficient vision-language model adaptation. The code is publicly available at: https://github.com/Jingchensun/prompt_adapter. This paper introduces Prompt-Adapter, a novel model for efficient vision-language model adaptation by combining pre-trained prompt tuning and an efficient adaptation network (cache model). Adapting large pre-trained vision-language models to downstream tasks is challenging due to their size and complexity. This paper addresses this challenge by improving efficiency while maintaining high performance, especially in few-shot learning scenarios. The paper proposes Prompt-Adapter, which leverages a pre-trained text prompt from CoOp and a cache model similar to Tip-Adapter. They explore both training-free (Prompt-Adapter) and fine-tuned (Prompt-Adapter-F) variants. Additionally, they investigate the impact of multi-task prompt initialization and different training strategies. Prompt-Adapter achieves superior few-shot image classification performance on 11 datasets, outperforming baselines like CoOp and Tip-Adapter. The study shows that multi-task prompt initialization significantly improves performance compared to random or manual initialization. Separately training the prompt and the cache model is found to be more effective than joint training. The method shows slight decreases in accuracy on datasets with high intra-class visual feature variance, such as EuroSAT and OxfordPets. Future work could explore combining other prompt learning methods or expanding the approach to other downstream tasks beyond image classification. vision-language models, prompt tuning, few-shot learning, image classification, parameter-efficient learning
2303.15067 Report Intersection over Union with smoothing for bounding box regression Petra Števuliáková, Petr Hurtik We focus on the construction of a loss function for bounding box regression. The Intersection over Union (IoU) metric is improved to converge faster, to make the surface of the loss function smooth and continuous over the whole searched space, and to reach a more precise approximation of the labels. The main principle is adding a smoothing part to the original IoU, where the smoothing part is given by a linear space whose values increase from the ground-truth bounding box to the border of the input image, and thus covers the whole spatial search space. We show the motivation and formalism behind this loss function and experimentally prove that it outperforms IoU, DIoU, CIoU, and SIoU by a large margin. We experimentally show that the proposed loss function is robust with respect to noise in the dimensions of the ground truth bounding boxes. The reference implementation is available at gitlab.com/irafm-ai/smoothing-iou. This paper introduces a novel smoothing modification to the standard Intersection over Union (IoU) loss function, aiming to improve bounding box regression in object detection. The proposed method addresses limitations of existing IoU-based losses by improving convergence speed and robustness against noisy labels, crucial for real-world applications with limited or imperfect data. The approach involves adding a smoothing part, a linear space with values increasing from the ground truth bounding box to the image border, to the standard IoU loss. This guides gradient descent and mitigates the effects of noisy labels. The smoothing IoU loss outperforms standard IoU, SIoU, DIoU, and CIoU in regression accuracy on both clean and noisy datasets. It exhibits lower overfitting on clean data and higher underfitting on noisy data, indicating robustness against label noise. The method shows stable regression accuracy even with high noise levels (up to 60%), surpassing the performance of other losses trained on clean data. The paper employs a custom dataset with limited diversity, potentially impacting the generalizability of the findings. Future work could explore the integration of the smoothing approach with other advanced IoU-based losses for enhanced performance. bounding box regression, intersection over union, object detection, noisy labels, loss function
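As a rough illustration of the idea above (a standard IoU term plus a smooth component that is non-zero everywhere in the image, so useful gradients exist even for non-overlapping boxes), here is a hedged sketch. The smoothing term used here, a normalized center distance, is an illustrative stand-in rather than the paper's exact linear-space formulation.

```python
import torch

def smoothed_iou_loss(pred: torch.Tensor, gt: torch.Tensor, img_w: int, img_h: int,
                      alpha: float = 0.5) -> torch.Tensor:
    """pred, gt: (N, 4) boxes as (x1, y1, x2, y2); alpha weights the smoothing term."""
    ix1 = torch.max(pred[:, 0], gt[:, 0]); iy1 = torch.max(pred[:, 1], gt[:, 1])
    ix2 = torch.min(pred[:, 2], gt[:, 2]); iy2 = torch.min(pred[:, 3], gt[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]).clamp(min=0) * (pred[:, 3] - pred[:, 1]).clamp(min=0)
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p + area_g - inter + 1e-7)
    # Illustrative smoothing term: normalized distance between box centers, keeping the
    # loss surface informative even when the predicted and ground-truth boxes do not overlap.
    cp = (pred[:, :2] + pred[:, 2:]) / 2
    cg = (gt[:, :2] + gt[:, 2:]) / 2
    diag = (img_w ** 2 + img_h ** 2) ** 0.5
    smooth = (cp - cg).norm(dim=1) / diag
    return (1.0 - iou) + alpha * smooth
```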
2303.15043 Report Joint Video Multi-Frame Interpolation and Deblurring under Unknown Exposure Time Wei Shang, Dongwei Ren, Yi Yang, Hongzhi Zhang, Kede Ma, Wangmeng Zuo Natural videos captured by consumer cameras often suffer from low framerate and motion blur due to the combination of dynamic scene complexity, lens and sensor imperfection, and less than ideal exposure setting. As a result, computational methods that jointly perform video frame interpolation and deblurring begin to emerge with the unrealistic assumption that the exposure time is known and fixed. In this work, we aim ambitiously for a more realistic and challenging task - joint video multi-frame interpolation and deblurring under unknown exposure time. Toward this goal, we first adopt a variant of supervised contrastive learning to construct an exposure-aware representation from input blurred frames. We then train two U-Nets for intra-motion and inter-motion analysis, respectively, adapting to the learned exposure representation via gain tuning. We finally build our video reconstruction network upon the exposure and motion representation by progressive exposure-adaptive convolution and motion refinement. Extensive experiments on both simulated and real-world datasets show that our optimized method achieves notable performance gains over the state-of-the-art on the joint video x8 interpolation and deblurring task. Moreover, on the seemingly implausible x16 interpolation task, our method outperforms existing methods by more than 1.5 dB in terms of PSNR. This paper proposes VIDUE, a novel method for jointly interpolating and deblurring videos with unknown exposure times. Existing video frame interpolation and deblurring methods often assume fixed and known exposure time, which is unrealistic for real-world videos captured by consumer cameras. This work addresses the more challenging and realistic setting of unknown exposure time. VIDUE leverages supervised contrastive learning to construct an exposure-aware representation from input blurred frames. It then uses two U-Nets for intra-motion and inter-motion analysis, adapting them to the learned exposure representation via gain tuning. Finally, it builds a video reconstruction network with exposure-adaptive convolution and motion refinement. VIDUE achieves state-of-the-art performance on joint video x8 interpolation and deblurring, outperforming existing methods by a significant margin on both synthetic and real-world datasets. The method demonstrates robust performance under different exposure time settings, effectively handling the challenges posed by unknown blur. VIDUE exhibits promising results on the challenging x16 interpolation task, surpassing previous approaches by more than 1.5 dB in PSNR. The computational complexity of VIDUE could be further reduced for practical applications. Future work can explore optimizing VIDUE using perceptual quality metrics to improve the temporal coherence of the reconstructed videos. video frame interpolation, video deblurring, unknown exposure time, adaptive computation, supervised contrastive learning
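The exposure-aware representation in the entry above is learned with a variant of supervised contrastive learning. A minimal single-view sketch of such a loss is below, where frames sharing the same (discretized) exposure class are treated as positives; VIDUE's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def supcon_loss(features: torch.Tensor, labels: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Supervised contrastive loss (single view). features: (N, D); labels: (N,) class ids."""
    z = F.normalize(features, dim=-1)
    sim = z @ z.t() / tau                                        # (N, N) similarity logits
    n = sim.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=sim.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    # Denominator sums over all samples except the anchor itself.
    denom = torch.logsumexp(sim.masked_fill(self_mask, float("-inf")), dim=1, keepdim=True)
    log_prob = sim - denom
    n_pos = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(dim=1) / n_pos
    return loss.mean()

# Toy usage: 16 embeddings with 4 hypothetical exposure classes.
print(supcon_loss(torch.randn(16, 128), torch.randint(0, 4, (16,))))
```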
2303.14968 Report Blind Image Quality Assessment via Vision-Language Correspondence: A Multitask Learning Perspective Weixia Zhang, Guangtao Zhai, Ying Wei, Xiaokang Yang, Kede Ma We aim at advancing blind image quality assessment (BIQA), which predicts the human perception of image quality without any reference information. We develop a general and automated multitask learning scheme for BIQA to exploit auxiliary knowledge from other tasks, in a way that the model parameter sharing and the loss weighting are determined automatically. Specifically, we first describe all candidate label combinations (from multiple tasks) using a textual template, and compute the joint probability from the cosine similarities of the visual-textual embeddings. Predictions of each task can be inferred from the joint distribution, and optimized by carefully designed loss functions. Through comprehensive experiments on learning three tasks - BIQA, scene classification, and distortion type identification, we verify that the proposed BIQA method 1) benefits from the scene classification and distortion type identification tasks and outperforms the state-of-the-art on multiple IQA datasets, 2) is more robust in the group maximum differentiation competition, and 3) realigns the quality annotations from different IQA datasets more effectively. The source code is available at https://github.com/zwx8981/LIQE. This paper proposes LIQE, a novel blind image quality assessment (BIQA) method that leverages multitask learning via vision-language correspondence. Addressing the challenge of limited human-annotated data in BIQA, this work explores incorporating auxiliary knowledge from other vision tasks like scene classification and distortion type identification to improve quality prediction accuracy. LIQE utilizes a pre-trained CLIP model to obtain visual and textual embeddings for input images and textual descriptions of scene, distortion, and quality. It jointly optimizes a multitask objective function with dynamically weighted fidelity losses for quality prediction, scene classification, and distortion type identification. LIQE outperforms state-of-the-art BIQA methods on multiple benchmark datasets, demonstrating the benefits of multitask learning with vision-language correspondence. It exhibits improved generalizability in cross-dataset evaluations and the group maximum differentiation (gMAD) competition, indicating better perceptual scale alignment across datasets. The method effectively leverages distortion type identification as an auxiliary task to aid BIQA, suggesting a cooperative relationship between them. The performance of LIQE on algorithm-dependent distortions is limited, suggesting a need for task-specific training in such scenarios. Future work includes exploring other auxiliary tasks and more sophisticated loss weighting schemes to further enhance BIQA performance. blind image quality assessment, multitask learning, vision-language correspondence, clip, perceptual quality
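The joint probability described above can be sketched as a softmax over the cosine similarities between an image embedding and text embeddings of all candidate (scene, distortion, quality) combinations, followed by marginalization to obtain a quality score. The label sets, textual template, and random embeddings below are illustrative stand-ins for CLIP outputs, not the paper's exact setup.

```python
import itertools
import torch

def joint_probability(image_emb: torch.Tensor, text_embs: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Softmax over all candidate texts from cosine similarities with the image embedding."""
    image_emb = image_emb / image_emb.norm()
    text_embs = text_embs / text_embs.norm(dim=-1, keepdim=True)
    return ((text_embs @ image_emb) / tau).softmax(dim=0)

# Illustrative label sets and prompt template.
scenes = ["a cityscape", "a portrait", "a landscape"]
distortions = ["noise", "blur", "jpeg compression"]
qualities = ["bad", "poor", "fair", "good", "perfect"]
combos = list(itertools.product(scenes, distortions, qualities))
prompts = [f"a photo of {s} with {d} artifacts, which is of {q} quality" for s, d, q in combos]

text_embs = torch.randn(len(prompts), 512)   # stand-in for CLIP text embeddings of `prompts`
image_emb = torch.randn(512)                 # stand-in for the CLIP image embedding
p_joint = joint_probability(image_emb, text_embs).view(len(scenes), len(distortions), len(qualities))
# Marginal quality distribution and a scalar score as its expectation over quality levels 1..5.
p_quality = p_joint.sum(dim=(0, 1))
quality_score = (p_quality * torch.arange(1, 6, dtype=torch.float)).sum()
```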
2303.14960 Report Ambiguity-Resistant Semi-Supervised Learning for Dense Object Detection Chang Liu, Weiming Zhang, Xiangru Lin, Wei Zhang, Xiao Tan, Junyu Han, Xiaomao Li, Errui Ding, Jingdong Wang With basic Semi-Supervised Object Detection (SSOD) techniques, one-stage detectors generally obtain limited gains compared with their two-stage counterparts. We experimentally find that the root lies in two kinds of ambiguities: (1) Selection ambiguity: selected pseudo labels are less accurate, since classification scores cannot properly represent the localization quality. (2) Assignment ambiguity: samples are matched with improper labels during pseudo-label assignment, as the strategy is misguided by missed objects and inaccurate pseudo boxes. To tackle these problems, we propose Ambiguity-Resistant Semi-supervised Learning (ARSL) for one-stage detectors. Specifically, to alleviate the selection ambiguity, Joint-Confidence Estimation (JCE) is proposed to jointly quantify the classification and localization quality of pseudo labels. As for the assignment ambiguity, Task-Separation Assignment (TSA) is introduced to assign labels based on pixel-level predictions rather than unreliable pseudo boxes. It employs a "divide-and-conquer" strategy and separately exploits positives for the classification and localization tasks, which is more robust to assignment ambiguity. Comprehensive experiments demonstrate that ARSL effectively mitigates the ambiguities and achieves state-of-the-art SSOD performance on MS COCO and PASCAL VOC. Codes can be found at https://github.com/PaddlePaddle/PaddleDetection. This paper proposes Ambiguity-Resistant Semi-supervised Learning (ARSL) to address the limited performance of one-stage detectors in semi-supervised object detection. One-stage detectors, despite their efficiency, lag behind two-stage counterparts in semi-supervised object detection due to ambiguities in pseudo-label selection and assignment. ARSL tackles two ambiguities: (1) Selection ambiguity is mitigated by Joint-Confidence Estimation (JCE), jointly quantifying classification and localization quality. (2) Assignment ambiguity is addressed by Task-Separation Assignment (TSA), assigning labels based on pixel-level predictions, separately leveraging potential positives for classification and localization. ARSL significantly boosts one-stage detector performance in semi-supervised settings, outperforming previous methods on COCO-Standard. JCE effectively reduces selection ambiguity, reflected in improved Top-5 IoU and correlation between classification and localization quality. TSA, assigning labels based on dense predictions instead of boxes, mitigates assignment ambiguity by increasing true positives and reducing false positives. The slight increase in false positives in TSA is attributed to treating all ambiguous candidates as positives for classification. Future work can explore more sophisticated strategies to further refine the selection of potential positives in TSA, potentially by incorporating contextual information or leveraging relationships between objects. semi-supervised object detection, one-stage detectors, pseudo-labeling, joint-confidence estimation, task-separation assignment
2303.14707 Report Clean-NeRF: Reformulating NeRF to account for View-Dependent Observations Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang While Neural Radiance Fields (NeRFs) have achieved unprecedented novel view synthesis results, they struggle with large-scale cluttered scenes with sparse input views and highly view-dependent appearances. Specifically, existing NeRF-based models tend to produce blurry renderings and inaccurate volumetric reconstructions, with many reconstruction errors appearing as foggy "floaters" hovering within the entire volume of an opaque 3D scene. Such inaccuracies impede NeRF's potential for accurate 3D NeRF registration, object detection, segmentation, etc., which may explain the limited research effort to date on directly addressing these fundamental 3D computer vision problems. This paper analyzes NeRF's struggles in such settings and proposes Clean-NeRF for accurate 3D reconstruction and novel view rendering in complex scenes. Our key insights consist of enforcing effective appearance and geometry constraints, which are absent in conventional NeRF reconstruction, by 1) automatically detecting and modeling view-dependent appearances in the training views to prevent them from interfering with density estimation, complemented by 2) a geometric correction procedure performed on each traced ray during inference. Clean-NeRF can be implemented as a plug-in that can immediately benefit existing NeRF-based methods without additional input. Codes will be released. This paper proposes Clean-NeRF, an extension to NeRF for accurate 3D reconstruction and novel view rendering in complex scenes with sparse view inputs. Existing NeRF-based models struggle with blurry rendering and inaccurate volumetric reconstruction in complex scenes, hindering their application in 3D computer vision tasks. Clean-NeRF enforces appearance and geometry constraints by 1) decomposing and modeling view-dependent and view-independent color components during training, and 2) applying a geometry correction procedure to eliminate density errors during inference. Effectively removes "floaters" (density errors) in challenging indoor scenes. Recovers intricate object details, especially for glossy surfaces. Outperforms baselines in quantitative metrics such as PSNR, SSIM, and LPIPS. Assumes fixed lighting conditions and no semi-transparent objects. May misinterpret consistently appearing specular highlights as view-independent colors. neural radiance fields, nerf, 3d reconstruction, novel view synthesis, appearance decomposition
2303.14662 Report OTAvatar: One-shot Talking Face Avatar with Controllable Tri-plane Rendering Zhiyuan Ma, Xiangyu Zhu, Guojun Qi, Zhen Lei, Lei Zhang Controllability, generalizability and efficiency are the major objectives of constructing face avatars represented by a neural implicit field. However, existing methods have not managed to accommodate the three requirements simultaneously. They either focus on static portraits, restricting the representation ability to a specific subject, or suffer from substantial computational cost, limiting their flexibility. In this paper, we propose One-shot Talking face Avatar (OTAvatar), which constructs face avatars by a generalized controllable tri-plane rendering solution so that each personalized avatar can be constructed from only one portrait as the reference. Specifically, OTAvatar first inverts a portrait image to a motion-free identity code. Second, the identity code and a motion code are utilized to modulate an efficient CNN to generate a tri-plane formulated volume, which encodes the subject in the desired motion. Finally, volume rendering is employed to generate an image in any view. The core of our solution is a novel decoupling-by-inverting strategy that disentangles identity and motion in the latent code via optimization-based inversion. Benefiting from the efficient tri-plane representation, we achieve controllable rendering of a generalized face avatar at 35 FPS on an A100. Experiments show promising performance of cross-identity reenactment on subjects out of the training set and better 3D consistency. OTAvatar, a one-shot talking face avatar generation method using controllable tri-plane rendering. Existing methods for creating talking face avatars struggle to balance controllability, generalizability, and efficiency. They are either limited to specific individuals, computationally expensive, or fail to produce high-quality animations. OTAvatar leverages a pre-trained 3D face generator and introduces a motion controller module. It utilizes a decoupling-by-inverting strategy to disentangle identity and motion in the latent code during training and inference. This allows for one-shot avatar creation from a single portrait and animation driven by 3DMM coefficients. Achieves one-shot reconstruction and animation of photo-realistic face avatars. Demonstrates superior performance in cross-identity reenactment and multi-view consistency compared to 2D and 3D baselines. Enables real-time inference speed at 35 FPS on A100 GPU due to efficient tri-plane representation and compact architecture. Relies on accurate 3DMM coefficient extraction for optimal performance. Further exploration of alternative motion representations beyond 3DMM coefficients. talking face avatar, one-shot learning, volume rendering, 3d face animation, generative adversarial networks
2303.14651 Report You Only Segment Once: Towards Real-Time Panoptic Segmentation Jie Hu, Linyan Huang, Tianhe Ren, Shengchuan Zhang, Rongrong Ji, Liujuan Cao In this paper, we propose YOSO, a real-time panoptic segmentation framework. YOSO predicts masks via dynamic convolutions between panoptic kernels and image feature maps, in which you only need to segment once for both instance and semantic segmentation tasks. To reduce the computational overhead, we design a feature pyramid aggregator for the feature map extraction, and a separable dynamic decoder for the panoptic kernel generation. The aggregator re-parameterizes interpolation-first modules in a convolution-first way, which significantly speeds up the pipeline without any additional costs. The decoder performs multi-head cross-attention via separable dynamic convolution for better efficiency and accuracy. To the best of our knowledge, YOSO is the first real-time panoptic segmentation framework that delivers competitive performance compared to state-of-the-art models. Specifically, YOSO achieves 46.4 PQ, 45.6 FPS on COCO; 52.5 PQ, 22.6 FPS on Cityscapes; 38.0 PQ, 35.4 FPS on ADE20K; and 34.1 PQ, 7.1 FPS on Mapillary Vistas. Code is available at https://github.com/hujiecpp/YOSO. YOSO, a real-time panoptic segmentation framework that predicts masks via dynamic convolutions between panoptic kernels and image feature maps, allowing for simultaneous instance and semantic segmentation. Real-time panoptic segmentation is challenging due to computationally intensive separate branches for semantic and instance segmentation. Existing methods struggle to achieve both speed and accuracy. YOSO employs a feature pyramid aggregator with convolution-first aggregation (CFA) for efficient feature extraction. It utilizes a separable dynamic decoder with separable dynamic convolution attention (SDCA) for lightweight and accurate panoptic kernel generation. YOSO achieves competitive speed and accuracy compared to state-of-the-art models on COCO, Cityscapes, ADE20K, and Mapillary Vistas datasets. CFA significantly reduces computational burden without re-training or sacrificing performance. SDCA outperforms traditional multi-head cross-attention in both accuracy and efficiency for panoptic kernel generation. YOSO's performance on instance segmentation of smaller objects can be further improved. Exploring alternative lightweight backbones for enhanced efficiency. panoptic segmentation, real-time, dynamic convolution, feature pyramid, separable attention
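The "segment once" step in the entry above amounts to a dynamic 1x1 convolution between the predicted panoptic kernels and the aggregated feature map. A minimal sketch is below; the tensor shapes (100 kernels, 256 channels, 128x128 features) are illustrative, not YOSO's exact configuration.

```python
import torch

def predict_masks(kernels: torch.Tensor, features: torch.Tensor) -> torch.Tensor:
    """Dynamic 1x1 convolution: each panoptic kernel produces one mask logit map.
    kernels: (Q, C) query kernels; features: (C, H, W) image feature map."""
    return torch.einsum("qc,chw->qhw", kernels, features)

masks = predict_masks(torch.randn(100, 256), torch.randn(256, 128, 128)).sigmoid()
```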
2303.14541 Report UnScene3D: Unsupervised 3D Instance Segmentation for Indoor Scenes David Rozenberszki, Or Litany, Angela Dai 3D instance segmentation is fundamental to geometric understanding of the world around us. Existing methods for instance segmentation of 3D scenes rely on supervision from expensive, manual 3D annotations. We propose UnScene3D, the first fully unsupervised 3D learning approach for class-agnostic 3D instance segmentation of indoor scans. UnScene3D first generates pseudo masks by leveraging self-supervised color and geometry features to find potential object regions. We operate on a basis of geometric oversegmentation, enabling efficient representation and learning on high-resolution 3D data. The coarse proposals are then refined through self-training our model on its predictions. Our approach improves over state-of-the-art unsupervised 3D instance segmentation methods by more than 300% in Average Precision, demonstrating effective instance segmentation even in challenging, cluttered 3D scenes. This paper introduces UnScene3D, the first fully unsupervised 3D learning approach for class-agnostic 3D instance segmentation of indoor scans. 3D instance segmentation is crucial for scene understanding but existing methods rely on expensive manual annotations. The method generates pseudo masks by leveraging self-supervised color and geometry features, then refines them through self-training on a 3D transformer-based model. UnScene3D significantly outperforms clustering-based unsupervised methods (over 300% improvement in AP). The method effectively leverages both color and geometric signals for improved pseudo mask generation. Self-training significantly enhances the density and completeness of instance proposals. The reliance on a mesh representation for graph coarsening could be relaxed in future work. Small objects might be missed during pseudo annotation generation. 3d instance segmentation, unsupervised learning, self-training, geometric primitives, rgb-d scans
2303.14536 Report SUDS: Scalable Urban Dynamic Scenes Haithem Turki, Jason Y. Zhang, Francesco Ferroni, Deva Ramanan We extend neural radiance fields (NeRFs) to dynamic large-scale urban scenes. Prior work tends to reconstruct single video clips of short durations (up to 10 seconds). Two reasons are that such methods (a) tend to scale linearly with the number of moving objects and input videos because a separate model is built for each and (b) tend to require supervision via 3D bounding boxes and panoptic labels, obtained manually or via category-specific models. As a step towards truly open-world reconstructions of dynamic cities, we introduce two key innovations: (a) we factorize the scene into three separate hash table data structures to efficiently encode static, dynamic, and far-field radiance fields, and (b) we make use of unlabeled target signals consisting of RGB images, sparse LiDAR, off-the-shelf self-supervised 2D descriptors, and most importantly, 2D optical flow. Operationalizing such inputs via photometric, geometric, and feature-metric reconstruction losses enables SUDS to decompose dynamic scenes into the static background, individual objects, and their motions. When combined with our multi-branch table representation, such reconstructions can be scaled to tens of thousands of objects across 1.2 million frames from 1700 videos spanning geospatial footprints of hundreds of kilometers, (to our knowledge) the largest dynamic NeRF built to date. We present qualitative initial results on a variety of tasks enabled by our representations, including novel-view synthesis of dynamic urban scenes, unsupervised 3D instance segmentation, and unsupervised 3D cuboid detection. To compare to prior work, we also evaluate on KITTI and Virtual KITTI 2, surpassing state-of-the-art methods that rely on ground truth 3D bounding box annotations while being 10x quicker to train. The paper introduces SUDS (Scalable Urban Dynamic Scenes), a novel approach extending neural radiance fields (NeRFs) to reconstruct large-scale, dynamic urban environments from multi-view videos, achieving scalability and handling dynamic elements like vehicles and pedestrians. Existing NeRF methods struggle with city-scale dynamic scenes due to limitations in handling numerous moving objects and reliance on labeled data like 3D bounding boxes. SUDS addresses these challenges, aiming for open-world dynamic city reconstructions. SUDS leverages a three-branch hash table structure to represent static background, dynamic objects, and far-field environment. It utilizes unlabeled inputs: RGB images, sparse LiDAR, optical flow, and self-supervised 2D descriptors. Spatial partitioning allows independent model training for different city areas, enabling scalability. SUDS achieves the first large-scale dynamic NeRF reconstruction, covering over 100 square kilometers. It outperforms baseline methods on City-1M dataset and standard benchmarks (KITTI, Virtual KITTI 2) for novel view synthesis, even with limited training views. The learned representation enables downstream tasks like unsupervised 3D instance segmentation and cuboid detection. Current implementation doesn't extrapolate object motion beyond the captured video boundaries. Reliance on accurate camera pose estimation is crucial, with joint optimization of camera parameters during training still underexplored. neural radiance fields, dynamic scene reconstruction, large-scale 3d modeling, unsupervised learning, urban environments
2303.14516 Report OVeNet: Offset Vector Network for Semantic Segmentation Stamatis Alexandropoulos, Christos Sakaridis, Petros Maragos Semantic segmentation is a fundamental task in visual scene understanding. We focus on the supervised setting, where ground-truth semantic annotations are available. Based on knowledge about the high regularity of real-world scenes, we propose a method for improving class predictions by learning to selectively exploit information from neighboring pixels. In particular, our method is based on the prior that for each pixel, there is a seed pixel in its close neighborhood sharing the same prediction with the former. Motivated by this prior, we design a novel two-head network, named Offset Vector Network (OVeNet), which generates both standard semantic predictions and a dense 2D offset vector field indicating the offset from each pixel to the respective seed pixel, which is used to compute an alternative, seed-based semantic prediction. The two predictions are adaptively fused at each pixel using a learnt dense confidence map for the predicted offset vector field. We supervise offset vectors indirectly via optimizing the seed-based prediction and via a novel loss on the confidence map. Compared to the baseline state-of-the-art architectures HRNet and HRNet+OCR on which OVeNet is built, the latter achieves significant performance gains on three prominent benchmarks for semantic segmentation, namely Cityscapes, ACDC and ADE20K. Code is available at https://github.com/stamatisalex/OVeNet OVeNet, a novel two-head network for semantic segmentation, leverages a learnt offset vector field to exploit information from neighboring pixels, thereby enhancing class predictions. Existing methods often misclassify pixels, especially near boundaries, due to overlooking the regularity of real-world scenes. OVeNet addresses this by learning to selectively use information from neighboring pixels. OVeNet comprises two heads: one predicts semantic logits, the other predicts an offset vector field and a confidence map. Offsets resample logits to generate a seed-based prediction, fused with the initial prediction using the confidence map. OVeNet significantly outperforms HRNet and HRNet+OCR baselines on Cityscapes, ACDC, and ADE20K. It demonstrates significant improvement in per-class accuracy, particularly in challenging conditions like fog and night in the ACDC dataset. The confidence map effectively guides the fusion of predictions, improving boundary delineation and overall segmentation quality. The model's performance is sensitive to the offset vector length, requiring careful hyperparameter tuning. The current implementation is limited by memory constraints, leading to a reduced number of blocks in the offset head. semantic segmentation, offset vector network, seed pixel, deep learning, computer vision
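The seed-based prediction in the entry above can be sketched as resampling the semantic logits at per-pixel offset locations and fusing them with the initial prediction via the learnt confidence map. The sketch below uses bilinear grid sampling for the resampling and is an illustration, not OVeNet's released implementation.

```python
import torch
import torch.nn.functional as F

def fuse_with_offsets(logits: torch.Tensor, offsets: torch.Tensor, confidence: torch.Tensor) -> torch.Tensor:
    """logits: (B, K, H, W) class logits; offsets: (B, 2, H, W) per-pixel (dx, dy) in pixels;
    confidence: (B, 1, H, W) in [0, 1] weighting the seed-based prediction."""
    B, K, H, W = logits.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=0).to(logits)             # (2, H, W) absolute (x, y) coordinates
    coords = base.unsqueeze(0) + offsets                        # absolute seed coordinates per pixel
    # Normalize to [-1, 1] as expected by grid_sample (x first, then y).
    norm_x = coords[:, 0] / (W - 1) * 2 - 1
    norm_y = coords[:, 1] / (H - 1) * 2 - 1
    grid = torch.stack((norm_x, norm_y), dim=-1)                # (B, H, W, 2)
    seed_logits = F.grid_sample(logits, grid, mode="bilinear", align_corners=True)
    return confidence * seed_logits + (1.0 - confidence) * logits
```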
2303.14471 Report HQ3DAvatar: High Quality Controllable 3D Head Avatar Kartik Teotia, Mallikarjun B R, Xingang Pan, Hyeongwoo Kim, Pablo Garrido, Mohamed Elgharib, Christian Theobalt Multi-view volumetric rendering techniques have recently shown great potential in modeling and synthesizing high-quality head avatars. A common approach to capture full head dynamic performances is to track the underlying geometry using a mesh-based template or 3D cube-based graphics primitives. While these model-based approaches achieve promising results, they often fail to learn complex geometric details such as the mouth interior, hair, and topological changes over time. This paper presents a novel approach to building highly photorealistic digital head avatars. Our method learns a canonical space via an implicit function parameterized by a neural network. It leverages multiresolution hash encoding in the learned feature space, allowing for high-quality, faster training and high-resolution rendering. At test time, our method is driven by a monocular RGB video. Here, an image encoder extracts face-specific features that also condition the learnable canonical space. This encourages deformation-dependent texture variations during training. We also propose a novel optical flow based loss that ensures correspondences in the learned canonical space, thus encouraging artifact-free and temporally consistent renderings. We show results on challenging facial expressions and show free-viewpoint renderings at interactive real-time rates for medium image resolutions. Our method outperforms all existing approaches, both visually and numerically. We will release our multiple-identity dataset to encourage further research. Our Project page is available at: https://vcai.mpi-inf.mpg.de/projects/HQ3DAvatar/ This paper presents HQ3DAvatar, a novel method for creating high-quality, controllable 3D head avatars from multi-view video data that can be driven by a monocular RGB video at test time. Creating realistic and controllable digital humans is crucial for various applications like VR/AR, VFX, and media production. The method learns a canonical space via an implicit function parameterized by a neural network, leveraging multiresolution hash encoding for efficiency and high resolution. It utilizes an image encoder to extract face-specific features that condition the canonical space, enabling deformation-dependent texture variations. A novel optical flow based loss ensures temporal coherence and reduces artifacts. HQ3DAvatar achieves state-of-the-art photorealism, outperforming existing methods in visual quality and accuracy, especially in challenging regions like hair and mouth interior. The method enables dynamic free-view synthesis from arbitrary monocular viewpoints, with promising results for generalization to in-the-wild videos. It allows for high-resolution rendering, showcasing the first 2K results in literature, and enables real-time performance at medium resolutions (480x270). The method may produce artifacts in regions with strong disocclusions, like the tongue moving out of the mouth. The current solution is person-specific, and future work could explore generalization to unseen identities. volumetric rendering, implicit representations, neural radiance fields, neural avatars, free-viewpoint rendering
2303.14420 Report Human Preference Score: Better Aligning Text-to-Image Models with Human Preference Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, Hongsheng Li Recent years have witnessed a rapid growth of deep generative models, with text-to-image models gaining significant attention from the public. However, existing models often generate images that do not align well with human preferences, such as awkward combinations of limbs and facial expressions. To address this issue, we collect a dataset of human choices on generated images from the Stable Foundation Discord channel. Our experiments demonstrate that current evaluation metrics for generative models do not correlate well with human choices. Thus, we train a human preference classifier with the collected dataset and derive a Human Preference Score (HPS) based on the classifier. Using HPS, we propose a simple yet effective method to adapt Stable Diffusion to better align with human preferences. Our experiments show that HPS outperforms CLIP in predicting human choices and has good generalization capability toward images generated from other models. By tuning Stable Diffusion with the guidance of HPS, the adapted model is able to generate images that are more preferred by human users. The project page is available here: https://tgxs002.github.io/align_sd_web/ . This paper introduces Human Preference Score (HPS) for aligning text-to-image models with human preferences by leveraging a large-scale dataset of human choices on generated images. Existing evaluation metrics like IS, FID, and CLIP score often fail to capture subtle human preferences in generated images, particularly concerning aspects like awkward compositions. This misalignment necessitates a new metric that better reflects human choices. The authors collect a large dataset of human choices on images generated by Stable Diffusion. They then fine-tune a CLIP model on this dataset to develop a human preference classifier, from which HPS is derived. This HPS is then used to adapt Stable Diffusion by explicitly training it to distinguish between preferred and non-preferred images. Existing evaluation metrics (IS, FID, CLIP score) show poor correlation with human choices on the collected dataset. The fine-tuned CLIP model, used to derive HPS, demonstrates superior performance in predicting human preferences compared to the original CLIP score. Adapting Stable Diffusion using HPS guidance leads to the generation of images that are significantly more preferred by human users, as evidenced by user studies. The collected dataset, though large, represents preferences from a limited user group active on the Stable Foundation Discord channel and might not reflect global diversity in preferences. The use of prompts engineered by experienced Stable Diffusion users might introduce bias and deviate from natural language patterns. text-to-image generation, human preference learning, stable diffusion, evaluation metrics, aesthetic quality
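Given image and prompt embeddings from the preference-tuned CLIP model, the score described above reduces to a scaled cosine similarity. A minimal sketch follows; the scale factor mirrors CLIP's logit scaling and is an illustrative choice, and the random tensors stand in for the encoder outputs.

```python
import torch

def human_preference_score(image_emb: torch.Tensor, prompt_emb: torch.Tensor, scale: float = 100.0) -> torch.Tensor:
    """Scaled cosine similarity between image and prompt embeddings; higher means more preferred."""
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    prompt_emb = prompt_emb / prompt_emb.norm(dim=-1, keepdim=True)
    return scale * (image_emb * prompt_emb).sum(dim=-1)

# Random stand-ins for embeddings produced by the preference-tuned CLIP encoders.
print(human_preference_score(torch.randn(4, 768), torch.randn(4, 768)))
```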
2303.14412 Report Freestyle Layout-to-Image Synthesis Han Xue, Zhiwu Huang, Qianru Sun, Li Song, Wenjun Zhang Typical layout-to-image synthesis (LIS) models generate images for a closed set of semantic classes, e.g., 182 common objects in COCO-Stuff. In this work, we explore the freestyle capability of the model, i.e., how far can it generate unseen semantics (e.g., classes, attributes, and styles) onto a given layout, and call the task Freestyle LIS (FLIS). Thanks to the development of large-scale pre-trained language-image models, a number of discriminative models (e.g., image classification and object detection) trained on limited base classes are empowered with the ability of unseen class prediction. Inspired by this, we opt to leverage large-scale pre-trained text-to-image diffusion models to achieve the generation of unseen semantics. The key challenge of FLIS is how to enable the diffusion model to synthesize images from a specific layout which very likely violates its pre-learned knowledge, e.g., the model never sees "a unicorn sitting on a bench" during its pre-training. To this end, we introduce a new module called Rectified Cross-Attention (RCA) that can be conveniently plugged in the diffusion model to integrate semantic masks. This "plug-in" is applied in each cross-attention layer of the model to rectify the attention maps between image and text tokens. The key idea of RCA is to enforce each text token to act on the pixels in a specified region, allowing us to freely put a wide variety of semantics from pre-trained knowledge (which is general) onto the given layout (which is specific). Extensive experiments show that the proposed diffusion network produces realistic and freestyle layout-to-image generation results with diverse text inputs, which has a high potential to spawn a bunch of interesting applications. Code is available at https://github.com/essunny310/FreestyleNet. This paper proposes Freestyle Layout-to-Image Synthesis (FLIS) which generates images with unseen semantics (classes, attributes, styles) onto a layout by leveraging pre-trained text-to-image diffusion models. Existing layout-to-image synthesis (LIS) methods are limited to generating images with semantics from a fixed set of classes. FLIS breaks this limitation and allows for more creative and diverse image generation. The paper introduces Rectified Cross-Attention (RCA) that integrates layout information into the pre-trained diffusion model. RCA rectifies the attention maps between image and text tokens, forcing each text token to act on pixels within its corresponding mask region. FreestyleNet can synthesize unseen objects, bind new attributes to objects, and render images in various styles. FreestyleNet outperforms state-of-the-art LIS methods in FID score, indicating high visual quality. RCA effectively enforces spatial alignment between generated images and input layouts. The model struggles to generate rare or unreasonable semantics that are not well-represented in the pre-trained knowledge. The method requires a user-defined set of category names which can be challenging to obtain for long-tailed datasets. image generation, layout-to-image synthesis, text-to-image synthesis, diffusion models, cross-attention
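Rectified Cross-Attention, as described above, can be sketched as masking the image-to-text attention logits so that each text token only influences the pixels inside its layout region. This is a simplified single-head illustration, assuming every pixel is covered by at least one token (e.g., a background token); it is not the released FreestyleNet implementation.

```python
import torch

def rectified_cross_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                              region_mask: torch.Tensor) -> torch.Tensor:
    """q: (N_pix, d) image-token queries; k, v: (N_txt, d) text keys/values;
    region_mask: (N_pix, N_txt) bool, True where a text token may act on a pixel."""
    logits = (q @ k.t()) / q.shape[-1] ** 0.5
    logits = logits.masked_fill(~region_mask, float("-inf"))   # rectify: no attention outside regions
    return logits.softmax(dim=-1) @ v
```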
2303.14407 Report LPFF: A Portrait Dataset for Face Generators Across Large Poses Yiqian Wu, Jing Zhang, Hongbo Fu, Xiaogang Jin The creation of 2D realistic facial images and 3D face shapes using generative networks has been a hot topic in recent years. Existing face generators exhibit exceptional performance on faces in small to medium poses (with respect to frontal faces) but struggle to produce realistic results for large poses. The distorted rendering results on large poses in 3D-aware generators further show that the generated 3D face shapes are far from the distribution of 3D faces in reality. We find that the above issues are caused by the training dataset's pose imbalance. In this paper, we present LPFF, a large-pose Flickr face dataset comprised of 19,590 high-quality real large-pose portrait images. We utilize our dataset to train a 2D face generator that can process large-pose face images, as well as a 3D-aware generator that can generate realistic human face geometry. To better validate our pose-conditional 3D-aware generators, we develop a new FID measure to evaluate the 3D-level performance. Through this novel FID measure and other experiments, we show that LPFF can help 2D face generators extend their latent space and better manipulate the large-pose data, and help 3D-aware face generators achieve better view consistency and more realistic 3D reconstruction results. This paper introduces LPFF, a large-pose face dataset containing 19,590 high-quality images, designed to address the pose imbalance in existing datasets used for training face generators. Existing face generators, both 2D and 3D-aware, struggle to generate realistic results for faces at large poses due to the lack of sufficient large-pose training data. The authors developed a pipeline to collect, process, and filter large-pose face images from Flickr. They then used this dataset to train a 2D face generator (StyleGAN2-ada) and a 3D-aware generator (EG3D). A new FID measure for 3D-aware generators is also proposed for evaluation. LPFF enables StyleGAN2-ada to generate and manipulate large-pose faces more realistically. LPFF leads to more realistic face geometry generation in EG3D, with better view consistency and higher quality rendering at large poses. A novel FID measure for evaluating pose-conditional 3D-aware generators is proposed. The dataset still suffers from semantic attribute imbalance (e.g., smiling faces are more prevalent in frontal views). The proposed processing pipeline cannot handle extreme poses where the face is fully occluded. face generation, large-pose faces, dataset, generative adversarial networks (gans), 3d-aware generators
2303.14389 Report MDTv2: Masked Diffusion Transformer is a Strong Image Synthesizer Shanghua Gao, Pan Zhou, Ming-Ming Cheng, Shuicheng Yan Despite its success in image synthesis, we observe that diffusion probabilistic models (DPMs) often lack contextual reasoning ability to learn the relations among object parts in an image, leading to a slow learning process. To solve this issue, we propose a Masked Diffusion Transformer (MDT) that introduces a mask latent modeling scheme to explicitly enhance the DPMs' ability to contextual relation learning among object semantic parts in an image. During training, MDT operates in the latent space to mask certain tokens. Then, an asymmetric diffusion transformer is designed to predict masked tokens from unmasked ones while maintaining the diffusion generation process. Our MDT can reconstruct the full information of an image from its incomplete contextual input, thus enabling it to learn the associated relations among image tokens. We further improve MDT with a more efficient macro network structure and training strategy, named MDTv2. Experimental results show that MDTv2 achieves superior image synthesis performance, e.g., a new SOTA FID score of 1.58 on the ImageNet dataset, and has more than 10x faster learning speed than the previous SOTA DiT. The source code is released at https://github.com/sail-sg/MDT. This paper introduces Masked Diffusion Transformer (MDT), a novel approach for enhancing contextual representation learning in diffusion probabilistic models (DPMs) for image synthesis, and its improved version MDTv2. DPMs often struggle to learn associated relations between object parts in an image, leading to slow training convergence. MDT addresses this by explicitly enhancing the contextual learning ability of DPMs. MDT employs a mask latent modeling scheme. It operates in the latent space, masking certain image tokens and using an asymmetric diffusion transformer to predict masked tokens from unmasked ones. MDTv2 further enhances MDT with long shortcuts in the encoder, dense input shortcuts in the decoder, and improved training strategies. MDT demonstrates superior image synthesis performance compared to previous state-of-the-art methods, achieving a new SOTA FID score of 1.58 on ImageNet for class-conditional image generation. It exhibits significantly faster learning progress during training, achieving about 3x faster convergence speed than DiT. MDTv2 further accelerates training, achieving up to 5x faster convergence than MDT and up to 18x faster convergence than DiT. The optimal masking ratio needs further investigation for different model sizes and datasets. Exploring the effectiveness of MDT on higher-resolution image generation and other downstream tasks is promising. image synthesis, diffusion models, masked modeling, transformer, contextual learning
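The mask latent modeling scheme above can be sketched as randomly hiding a fraction of the latent tokens and asking the network to reconstruct them from the visible ones. The masking ratio and shapes below are illustrative, not MDT's exact training configuration.

```python
import torch

def mask_latent_tokens(tokens: torch.Tensor, mask_ratio: float = 0.3):
    """tokens: (B, N, D) latent image tokens. Returns the visible subset and a boolean mask
    (True marks a masked token that the asymmetric decoder must reconstruct)."""
    B, N, D = tokens.shape
    n_keep = int(N * (1.0 - mask_ratio))
    keep_idx = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :n_keep]
    visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, dtype=torch.bool, device=tokens.device)
    mask.scatter_(1, keep_idx, False)
    return visible, mask

visible, mask = mask_latent_tokens(torch.randn(2, 256, 768), mask_ratio=0.3)
```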
2303.14386 Report Prompt-Guided Transformers for End-to-End Open-Vocabulary Object Detection Hwanjun Song, Jihwan Bang Prompt-OVD is an efficient and effective framework for open-vocabulary object detection that utilizes class embeddings from CLIP as prompts, guiding the Transformer decoder to detect objects in both base and novel classes. Additionally, our novel RoI-based masked attention and RoI pruning techniques help leverage the zero-shot classification ability of the Vision Transformer-based CLIP, resulting in improved detection performance at minimal computational cost. Our experiments on the OV-COCO and OV-LVIS datasets demonstrate that Prompt-OVD achieves an impressive 21.2 times faster inference speed than the first end-to-end open-vocabulary detection method (OV-DETR), while also achieving higher APs than four two-stage-based methods operating within similar inference time ranges. Code will be made available soon. Prompt-OVD, an end-to-end open-vocabulary object detection framework that uses class embeddings from CLIP as prompts to guide the Transformer decoder. To address the limitations of existing open-vocabulary object detection methods, such as overfitting to base classes and slow inference speeds. The framework leverages prompt-based decoding, RoI-based masked attention, and RoI pruning to achieve efficient and effective open-vocabulary object detection. Prompt-OVD achieves a 21.2 times faster inference speed than OV-DETR, a previous end-to-end method. Prompt-OVD achieves higher APs on base and novel classes compared to two-stage OVD methods with similar inference speeds. The framework shows strong performance on both OV-COCO and OV-LVIS datasets. The performance gap between object detection and instance segmentation could be further reduced. Exploring more advanced prediction ensemble strategies could enhance the synergy between CLIP and the detection model. open-vocabulary object detection, prompt-based decoding, vision transformer, clip, roi-based masked attention
2303.14377 Report Unsupervised Domain Adaption with Pixel-level Discriminator for Image-aware Layout Generation Chenchen Xu, Min Zhou, Tiezheng Ge, Yuning Jiang, Weiwei Xu Layout is essential for graphic design and poster generation. Recently, applying deep learning models to generate layouts has attracted increasing attention. This paper focuses on using the GAN-based model conditioned on image contents to generate advertising poster graphic layouts, which requires an advertising poster layout dataset with paired product images and graphic layouts. However, the paired images and layouts in the existing dataset are collected by inpainting and annotating posters, respectively. There exists a domain gap between inpainted posters (source domain data) and clean product images (target domain data). Therefore, this paper combines unsupervised domain adaption techniques to design a GAN with a novel pixel-level discriminator (PD), called PDA-GAN, to generate graphic layouts according to image contents. The PD is connected to the shallow level feature map and computes the GAN loss for each input-image pixel. Both quantitative and qualitative evaluations demonstrate that PDA-GAN can achieve state-of-the-art performances and generate high-quality image-aware graphic layouts for advertising posters. This paper presents PDA-GAN, a novel GAN-based model for generating image-aware graphic layouts of advertising posters, leveraging unsupervised domain adaptation to bridge the domain gap between clean product images and inpainted images. Generating advertising posters often relies on a paired dataset of product images and graphic layouts. Existing datasets suffer from a domain gap due to using inpainted poster images, leading to unrealistic layouts. PDA-GAN addresses this gap. PDA-GAN incorporates a pixel-level discriminator (PD) connected to shallow feature maps. This PD analyzes pixel-level discrepancies to align the feature spaces of inpainted and clean product images, enabling the generation of layouts consistent with image content details. PDA-GAN significantly outperforms state-of-the-art methods in generating image-aware layouts, evidenced by both quantitative and qualitative results. Compared to methods using Gaussian blur for domain adaptation, PDA-GAN achieves superior performance, particularly in metrics related to background complexity, subject occlusion, and product occlusion. The pixel-level discriminator proves to be more effective than global or patch-level discriminators, highlighting the importance of fine-grained domain adaptation at the pixel level. One limitation is the potential bias towards source domain data during training due to additional reconstruction loss. Future work can explore more balanced training strategies. Another limitation is the limited control over layout diversity and user constraints. Future research can focus on incorporating explicit controls for element categories, positions, and overall layout variations. layout generation, generative adversarial networks (gans), unsupervised domain adaptation, advertising posters, image-aware design
2303.14297 Report AgileGAN3D: Few-Shot 3D Portrait Stylization by Augmented Transfer Learning Guoxian Song, Hongyi Xu, Jing Liu, Tiancheng Zhi, Yichun Shi, Jianfeng Zhang, Zihang Jiang, Jiashi Feng, Shen Sang, Linjie Luo While substantial progress has been made in automated 2D portrait stylization, compelling 3D portrait stylization from a single user photo remains an unresolved challenge. One primary obstacle here is the lack of high-quality stylized 3D training data. In this paper, we propose a novel framework AgileGAN3D that can produce artistically appealing and personalized 3D portraits with detailed geometry. New stylization can be obtained with just a few (around 20) unpaired 2D exemplars. We achieve this by first leveraging existing 2D stylization capabilities, style prior creation, to produce a large number of augmented 2D style exemplars. These augmented exemplars are generated with accurate camera pose labels, as well as paired real face images, which prove to be critical for the downstream 3D stylization task. Capitalizing on the recent advancement of 3D-aware GAN models, we perform guided transfer learning on a pretrained 3D GAN generator to produce multi-view-consistent stylized renderings. In order to achieve 3D GAN inversion that preserves the subject's identity well, we incorporate a multi-view consistency loss in the training of our encoder. Our pipeline demonstrates strong capability in turning user photos into a diverse range of 3D artistic portraits. Both qualitative results and quantitative evaluations have been conducted to show the superior performance of our method. Code and pretrained models will be released for reproducibility. AgileGAN3D, a novel framework that generates high-quality 3D stylized portraits with detailed geometry from a single user photo and a few 2D style exemplars. Addresses the challenge of limited high-quality 3D data for 3D portrait stylization, enabling personalized and artistic 3D content creation. Combines style prior creation, guided transfer learning, and 3D GAN inversion with multi-view consistency loss. Generates visually appealing and multi-view consistent 3D stylized portraits. Outperforms baseline methods in terms of perceptual quality and identity preservation. Demonstrates robustness across different genders, face shapes, hairstyles, and illumination conditions. Gaze direction bias and occasional failure to preserve accessories require further improvement. Potential for misuse in generating fake images. 3d stylization, generative adversarial networks, neural radiance fields, few-shot learning, 3d portrait generation
2303.14038 Report Accelerating Vision-Language Pretraining with Free Language Modeling Teng Wang, Yixiao Ge, Feng Zheng, Ran Cheng, Ying Shan, Xiaohu Qie, Ping Luo The state of the art in vision-language pretraining (VLP) achieves exemplary performance but suffers from high training costs resulting from slow convergence and long training time, especially on large-scale web datasets. An essential obstacle to training efficiency lies in the entangled prediction rate (percentage of tokens for reconstruction) and corruption rate (percentage of corrupted tokens) in masked language modeling (MLM), that is, a proper corruption rate is achieved at the cost of a large portion of output tokens being excluded from the prediction loss. To accelerate the convergence of VLP, we propose a new pretraining task, namely, free language modeling (FLM), that enables a 100% prediction rate with arbitrary corruption rates. FLM frees the prediction rate from its coupling with the corruption rate while allowing the corruption spans to be customized for each token to be predicted. FLM-trained models are encouraged to learn better and faster given the same GPU time by exploiting bidirectional contexts more flexibly. Extensive experiments show that FLM achieves an impressive 2.5x pretraining time reduction in comparison to MLM-based methods, while keeping competitive performance on both vision-language understanding and generation tasks. Code will be made public at https://github.com/TencentARC/FLM. This paper proposes Free Language Modeling (FLM), a novel pre-training objective for Vision-Language Pretraining (VLP), to accelerate training by decoupling prediction rate from corruption rate and enabling flexible corruption patterns. Existing VLP methods using Masked Language Modeling (MLM) suffer from slow convergence and long training times, especially on large web datasets, due to the entangled prediction and corruption rates limiting the utilization of output tokens. FLM employs an encode-corrupt-predict framework. It first encodes input text bidirectionally. Then, it constructs independent corruption-prediction tasks by injecting random span corruptions into encoded features. Finally, a reconstructor predicts each token by reasoning over uncorrupted bidirectional contexts, achieving a 100% prediction rate. FLM achieves a 2.5x speed-up in pre-training time compared to MLM while maintaining comparable performance on various VL understanding tasks. FLM demonstrates superior performance on VL generation tasks, such as image captioning, compared to MLM, AR, and PrefixLM. Ablation studies validate the effectiveness of individual components of FLM, including decomposed bidirectional encoding, deep reconstructor, and flexible corruption rate. FLM's performance on cross-modal retrieval tasks currently lags behind MLM, requiring further exploration of better corruption strategies to enhance global feature alignment. The optimal corruption rate for FLM might vary across different corruption methods, suggesting future research on effectively combining different corruption types to improve context diversity. vision-language pretraining, free language modeling, training acceleration, corruption-prediction, bidirectional contextual representation
2303.13873 Report Fantasia3D: Disentangling Geometry and Appearance for High-quality Text-to-3D Content Creation Rui Chen, Yongwei Chen, Ningxin Jiao, Kui Jia Automatic 3D content creation has achieved rapid progress recently due to the availability of pre-trained, large language models and image diffusion models, forming the emerging topic of text-to-3D content creation. Existing text-to-3D methods commonly use implicit scene representations, which couple the geometry and appearance via volume rendering and are suboptimal in terms of recovering finer geometries and achieving photorealistic rendering; consequently, they are less effective for generating high-quality 3D assets. In this work, we propose a new method of Fantasia3D for high-quality text-to-3D content creation. Key to Fantasia3D is the disentangled modeling and learning of geometry and appearance. For geometry learning, we rely on a hybrid scene representation, and propose to encode surface normal extracted from the representation as the input of the image diffusion model. For appearance modeling, we introduce the spatially varying bidirectional reflectance distribution function (BRDF) into the text-to-3D task, and learn the surface material for photorealistic rendering of the generated surface. Our disentangled framework is more compatible with popular graphics engines, supporting relighting, editing, and physical simulation of the generated 3D assets. We conduct thorough experiments that show the advantages of our method over existing ones under different text-to-3D task settings. Project page and source codes: https://fantasia3d.github.io/. Fantasia3D, a novel text-to-3D generation method that disentangles geometry and appearance modeling, enabling high-quality surface and material generation. Existing text-to-3D methods struggle to generate high-quality surfaces and photorealistic rendering due to coupled geometry and appearance learning. Leverages a hybrid scene representation (DMTet) for geometry modeling, using rendered normal maps as input for a pre-trained image diffusion model. Introduces spatially varying BRDF for appearance modeling, enabling photorealistic rendering with learned surface materials. Disentangled geometry and appearance learning outperforms entangled approaches, producing superior 3D assets. Shape encoding of rendered normal maps proves crucial for high-quality geometry generation. Fantasia3D generates more realistic and higher-quality 3D content compared to state-of-the-art methods like DreamFusion and Magic3D. Limited ability to generate loose geometries like hair and fur. Primarily focuses on object generation, lacking support for complete scenes with backgrounds. text-to-3d, 3d content creation, disentangled representation learning, brdf, photorealistic rendering
2303.13843 Report CompoNeRF: Text-guided Multi-object Compositional NeRF with Editable 3D Scene Layout Haotian Bai, Yuanhuiyi Lyu, Lutao Jiang, Sijia Li, Haonan Lu, Xiaodong Lin, Lin Wang Recent advances have shown promise in merging neural radiance fields (NeRFs) with pre-trained diffusion models for text-to-3D object generation. However, one enduring challenge is their inadequate capability to accurately parse and regenerate consistent multi-object environments. Specifically, these models encounter difficulties in accurately representing the quantity and style prompted by multi-object texts, often resulting in a collapse of the rendering fidelity that fails to match the semantic intricacies. Moreover, amalgamating these elements into a coherent 3D scene is a substantial challenge, stemming from the generic distribution inherent in diffusion models. To tackle the issue of 'guidance collapse' and enhance consistency, we propose a novel framework, dubbed CompoNeRF, by integrating an editable 3D scene layout with object-specific and scene-wide guidance mechanisms. It begins by interpreting a complex text prompt into an editable 3D layout populated with multiple NeRFs, each paired with a corresponding subtext prompt for precise object depiction. Next, a tailored composition module seamlessly blends these NeRFs, promoting consistency, while the dual-level text guidance reduces ambiguity and boosts accuracy. Notably, the unique modularity of CompoNeRF permits NeRF decomposition. This enables flexible scene editing and recomposition into new scenes based on the edited layout or text prompts. Utilizing the open-source Stable Diffusion model, CompoNeRF not only generates scenes with high fidelity but also paves the way for innovative multi-object composition using editable 3D layouts. Remarkably, our framework achieves up to a 54% improvement in performance, as measured by the multi-view CLIP score metric. Code is available at https://github.com/hbai98/Componerf. Introduces CompoNeRF, a novel framework for synthesizing coherent multi-object 3D scenes from text descriptions by integrating an editable 3D scene layout with object-specific and scene-wide guidance mechanisms. Addresses the 'guidance collapse' problem in existing text-to-3D methods, which struggle to accurately represent and compose multiple objects in a scene as described by text. Interprets multi-object text prompts into editable 3D layouts with bounding boxes, each associated with a distinct NeRF and subtext. Employs a composition module to blend individual NeRFs while maintaining global consistency guided by dual-level text prompts (global and object-specific). Achieves superior object identity accuracy and context relevance compared to previous methods like Latent-NeRF and SJC. Demonstrates up to a 54% improvement in performance as measured by the multi-view CLIP score metric. Enables flexible scene editing and recomposition by decomposing and caching individual NeRFs for reuse. Limited in interpreting uncommon object integrations or scenes due to the reliance on the pre-trained diffusion model's knowledge. Faces occasionally exhibit the 'multi-face' issue, requiring further research into stronger geometric constraints or improved diffusion guidance. text-to-3d, neural radiance fields (nerfs), scene composition, 3d scene understanding, generative ai
2303.13791 Report Progressively Optimized Local Radiance Fields for Robust View Synthesis Andreas Meuleman, Yu-Lun Liu, Chen Gao, Jia-Bin Huang, Changil Kim, Min H. Kim, Johannes Kopf We present an algorithm for reconstructing the radiance field of a large-scale scene from a single casually captured video. The task poses two core challenges. First, most existing radiance field reconstruction approaches rely on accurate pre-estimated camera poses from Structure-from-Motion algorithms, which frequently fail on in-the-wild videos. Second, using a single, global radiance field with finite representational capacity does not scale to longer trajectories in an unbounded scene. For handling unknown poses, we jointly estimate the camera poses with radiance field in a progressive manner. We show that progressive optimization significantly improves the robustness of the reconstruction. For handling large unbounded scenes, we dynamically allocate new local radiance fields trained with frames within a temporal window. This further improves robustness (e.g., performs well even under moderate pose drifts) and allows us to scale to large scenes. Our extensive evaluation on the Tanks and Temples dataset and our collected outdoor dataset, Static Hikes, show that our approach compares favorably with the state-of-the-art. This paper introduces an algorithm for reconstructing large-scale scene radiance fields from casual videos using progressive joint optimization of camera poses and local radiance fields. Reconstructing radiance fields from casual videos is challenging due to inaccurate camera pose estimation and limitations of global radiance fields in large scenes. The method uses a progressive scheme to estimate camera poses and dynamically allocates local radiance fields to model the scene. It incorporates monocular depth and optical flow for robust optimization. The method achieves high-quality view synthesis on long video sequences. It outperforms existing methods in terms of robustness and scalability. The proposed progressive optimization scheme significantly improves pose estimation accuracy. The method assumes continuous video without shot changes. It currently doesn't handle dynamic elements in the scene. radiance fields, novel view synthesis, camera pose estimation, progressive optimization, local radiance fields
2303.13756 Report GP-VTON: Towards General Purpose Virtual Try-on via Collaborative Local-Flow Global-Parsing Learning Zhenyu Xie, Zaiyu Huang, Xin Dong, Fuwei Zhao, Haoye Dong, Xijin Zhang, Feida Zhu, Xiaodan Liang Image-based Virtual Try-ON aims to transfer an in-shop garment onto a specific person. Existing methods employ a global warping module to model the anisotropic deformation for different garment parts, which fails to preserve the semantic information of different parts when receiving challenging inputs (e.g., intricate human poses, difficult garments). Moreover, most of them directly warp the input garment to align with the boundary of the preserved region, which usually requires texture squeezing to meet the boundary shape constraint and thus leads to texture distortion. These shortcomings hinder the adoption of existing methods in real-world applications. To address these problems and take a step towards real-world virtual try-on, we propose a General-Purpose Virtual Try-ON framework, named GP-VTON, by developing an innovative Local-Flow Global-Parsing (LFGP) warping module and a Dynamic Gradient Truncation (DGT) training strategy. Specifically, compared with the previous global warping mechanism, LFGP employs local flows to warp garment parts individually, and assembles the local warped results via the global garment parsing, resulting in reasonable warped parts and a semantically correct intact garment even with challenging inputs. On the other hand, our DGT training strategy dynamically truncates the gradient in the overlap area and the warped garment is no longer required to meet the boundary constraint, which effectively avoids the texture squeezing problem. Furthermore, our GP-VTON can be easily extended to the multi-category scenario and jointly trained using data from different garment categories. Extensive experiments on two high-resolution benchmarks demonstrate our superiority over the existing state-of-the-art methods. This paper introduces GP-VTON, a unified framework for general-purpose virtual try-on, capable of generating realistic try-on results even in challenging scenarios (e.g., intricate human poses, complex garment inputs) and extendable to multi-category scenarios. Existing VTON methods face challenges in handling complex poses and garments, often leading to artifacts and texture distortions. They also primarily focus on upper-body try-on, limiting their practical applications. GP-VTON leverages a novel Local-Flow Global-Parsing (LFGP) warping module to deform garment parts individually and assemble them using global garment parsing. It also employs a Dynamic Gradient Truncation (DGT) training strategy to minimize texture distortion around preserved regions. GP-VTON outperforms state-of-the-art methods on the high-resolution benchmarks VITON-HD and DressCode, demonstrating its ability to generate more realistic and semantically accurate try-on results. The LFGP module effectively handles challenging poses and garments, reducing artifacts like damaged sleeves, blended pant legs, and adhesive regions. The DGT strategy successfully minimizes texture distortion around preserved regions, resulting in more visually appealing try-on results. The paper acknowledges the limitations and social impact of GP-VTON in the supplementary materials (not provided). Future work could explore extending GP-VTON to handle more diverse garment types and complex scenes. virtual try-on, image synthesis, generative adversarial networks, deep learning, computer vision
2303.13744 Report Conditional Image-to-Video Generation with Latent Flow Diffusion Models Haomiao Ni, Changhao Shi, Kai Li, Sharon X. Huang, Martin Renqiang Min Conditional image-to-video (cI2V) generation aims to synthesize a new plausible video starting from an image (e.g., a person's face) and a condition (e.g., an action class label like smile). The key challenge of the cI2V task lies in the simultaneous generation of realistic spatial appearance and temporal dynamics corresponding to the given image and condition. In this paper, we propose an approach for cI2V using novel latent flow diffusion models (LFDM) that synthesize an optical flow sequence in the latent space based on the given condition to warp the given image. Compared to previous direct-synthesis-based works, our proposed LFDM can better synthesize spatial details and temporal motion by fully utilizing the spatial content of the given image and warping it in the latent space according to the generated temporally-coherent flow. The training of LFDM consists of two separate stages: (1) an unsupervised learning stage to train a latent flow auto-encoder for spatial content generation, including a flow predictor to estimate latent flow between pairs of video frames, and (2) a conditional learning stage to train a 3D-UNet-based diffusion model (DM) for temporal latent flow generation. Unlike previous DMs operating in pixel space or latent feature space that couples spatial and temporal information, the DM in our LFDM only needs to learn a low-dimensional latent flow space for motion generation, thus being more computationally efficient. We conduct comprehensive experiments on multiple datasets, where LFDM consistently outperforms prior arts. Furthermore, we show that LFDM can be easily adapted to new domains by simply finetuning the image decoder. Our code is available at https://github.com/nihaomiao/CVPR23_LFDM. This paper proposes Latent Flow Diffusion Models (LFDM) for conditional image-to-video generation, which synthesizes temporally-coherent optical flow sequences in the latent space to warp the given image. Existing methods struggle to simultaneously maintain spatial details and temporal coherence. LFDM addresses this by reusing spatial content from the given image through warping guided by the generated latent flow. LFDM employs a two-stage training strategy: (1) Unsupervised training of a latent flow auto-encoder for spatial content and flow estimation. (2) Conditional training of a 3D-UNet diffusion model to generate temporal latent flow from class labels. LFDM outperforms previous state-of-the-art methods in conditional image-to-video generation on multiple datasets. LFDM exhibits smaller training-testing gaps, indicating better generalization to unseen images. LFDM can be easily adapted to new domains by simply finetuning the image decoder. LFDM currently focuses on single-subject videos and struggles with multiple moving subjects. The sampling process with DDPM is slow and can be improved by exploring fast sampling techniques. image-to-video generation, diffusion models, optical flow, latent space, conditional generation
2303.13743 Report TEGLO: High Fidelity Canonical Texture Mapping from Single-View Images Vishal Vinod, Tanmay Shah, Dmitry Lagun Recent work on Neural Fields (NFs) learns 3D representations from class-specific single-view image collections. However, these methods are unable to reconstruct the input data while preserving high-frequency details. Further, these methods do not disentangle appearance from geometry and hence are not suitable for tasks such as texture transfer and editing. In this work, we propose TEGLO (Textured EG3D-GLO) for learning 3D representations from single-view in-the-wild image collections for a given class of objects. We accomplish this by training a conditional Neural Radiance Field (NeRF) without any explicit 3D supervision. We equip our method with editing capabilities by creating a dense correspondence mapping to a 2D canonical space. We demonstrate that such a mapping enables texture transfer and texture editing without requiring meshes with shared topology. Our key insight is that by mapping the input image pixels onto the texture space we can achieve near-perfect reconstruction (>= 74 dB PSNR at 1024×1024 resolution). Our formulation allows for high-quality, 3D-consistent novel view synthesis with high-frequency details at megapixel image resolution. TEGLO learns textured 3D representations from single-view in-the-wild images of objects, enabling tasks like texture transfer and editing, without relying on 3D supervision or textured mesh datasets. Existing NeRF-based methods struggle to reconstruct high-frequency details and disentangle appearance from geometry, limiting their use in tasks like texture manipulation. TEGLO uses a two-stage approach: (1) trains a conditional NeRF using tri-planes and GLO to learn per-object latent codes; (2) learns dense correspondences between 3D surface points and a 2D canonical space using the rendered output from the first stage. Achieves near-perfect reconstruction of input images (>= 74 dB PSNR at 1024×1024 resolution). Enables high-fidelity single-view 3D reconstruction and novel view synthesis at arbitrary resolutions. Performs texture transfer and editing without requiring mesh-based methods or spatial fine-tuning. Requires significant computational resources for training and inference. Limited to mapping target image pixels, resulting in missing pixel artifacts for certain views. neural radiance fields, texture representation, dense correspondences, generative latent optimization, single-view 3d reconstruction
2303.13714 Report High Fidelity Image Synthesis With Deep VAEs In Latent Space Troy Luhman, Eric Luhman We present fast, realistic image generation on high-resolution, multimodal datasets using hierarchical variational autoencoders (VAEs) trained on a deterministic autoencoder's latent space. In this two-stage setup, the autoencoder compresses the image into its semantic features, which are then modeled with a deep VAE. With this method, the VAE avoids modeling the fine-grained details that constitute the majority of the image's code length, allowing it to focus on learning its structural components. We demonstrate the effectiveness of our two-stage approach, achieving a FID of 9.34 on the ImageNet-256 dataset which is comparable to BigGAN. We make our implementation available online. This paper introduces a two-stage approach for high-fidelity image generation using hierarchical variational autoencoders (VAEs) trained on the latent space of a pretrained deterministic autoencoder (DAE). This approach addresses the limitations of traditional VAEs in generating realistic images on large datasets by separating the modeling of high-frequency details from semantic structure. The DAE first compresses images into low-dimensional latent representations, removing imperceptible details. Then, a deep hierarchical VAE is trained on these latents to learn the underlying semantic relationships, leveraging classifier-free guidance for improved image fidelity. Achieved a FID of 9.34 on ImageNet-256, comparable to BigGAN and demonstrating significant improvement over previous hierarchical VAEs. Showed the importance of latent space compression by comparing different downsampling factors, with 4x and 8x performing best. Demonstrated the interpretability and flexibility of the latent space through image manipulations like interpolation and outpainting. Inability to compute data likelihood, limiting its use in tasks like density estimation. While improved, unguided sample quality still lags behind state-of-the-art diffusion models and GANs. image generation, variational autoencoders, latent space, classifier-free guidance, deep generative models
2303.13703 Report End-to-End Diffusion Latent Optimization Improves Classifier Guidance Bram Wallace, Akash Gokul, Stefano Ermon, Nikhil Naik Classifier guidance -- using the gradients of an image classifier to steer the generations of a diffusion model -- has the potential to dramatically expand the creative control over image generation and editing. However, currently classifier guidance requires either training new noise-aware models to obtain accurate gradients or using a one-step denoising approximation of the final generation, which leads to misaligned gradients and sub-optimal control. We highlight this approximation's shortcomings and propose a novel guidance method: Direct Optimization of Diffusion Latents (DOODL), which enables plug-and-play guidance by optimizing diffusion latents w.r.t. the gradients of a pre-trained classifier on the true generated pixels, using an invertible diffusion process to achieve memory-efficient backpropagation. Showcasing the potential of more precise guidance, DOODL outperforms one-step classifier guidance on computational and human evaluation metrics across different forms of guidance: using CLIP guidance to improve generations of complex prompts from DrawBench, using fine-grained visual classifiers to expand the vocabulary of Stable Diffusion, enabling image-conditioned generation with a CLIP visual encoder, and improving image aesthetics using an aesthetic scoring network. Code at https://github.com/salesforce/DOODL. Presents DOODL, a method enabling precise guidance of pretrained diffusion models by directly optimizing diffusion latents with respect to a model-based loss on the final generation. Overcomes limitations of existing classifier guidance techniques that rely on noise-aware classifiers or one-step denoising approximations, which lead to sub-optimal control over image generation. Leverages EDICT, an invertible diffusion algorithm, to compute gradients of model losses with respect to the final generated image and uses these gradients to iteratively optimize diffusion latents for improved control and flexibility. DOODL outperforms one-step classifier guidance on DrawBench, showing improved generation of images from complex prompts. Expands the vocabulary of Stable Diffusion by leveraging fine-grained visual classifiers, enabling generation of rare or unseen concepts. Enables image-conditioned generation with CLIP, demonstrating personalized entity generation without retraining or finetuning. Requires more optimization iterations for certain tasks, such as visual personalization. Can sometimes lead to warping or deformation of content during optimization, particularly when targeting aesthetic improvement. diffusion models, classifier guidance, image generation, invertible neural networks, direct latent optimization
2303.13518 Report Three ways to improve feature alignment for open vocabulary detection Relja Arandjelović, Alex Andonian, Arthur Mensch, Olivier J. Hénaff, Jean-Baptiste Alayrac, Andrew Zisserman The core problem in zero-shot open vocabulary detection is how to align visual and text features, so that the detector performs well on unseen classes. Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pretraining, and struggles to prevent the language model from forgetting unseen classes. We propose three methods to alleviate these issues. Firstly, a simple scheme is used to augment the text embeddings, which prevents overfitting to a small number of classes seen during training, while simultaneously saving memory and computation. Secondly, the feature pyramid network and the detection head are modified to include trainable gated shortcuts, which encourages vision-text feature alignment and guarantees it at the start of detection training. Finally, a self-training approach is used to leverage a larger corpus of image-text pairs, thus improving detection performance on classes with no human-annotated bounding boxes. Our three methods are evaluated on the zero-shot version of the LVIS benchmark, each of them showing clear and significant benefits. Our final network achieves the new state of the art on the mAP-all metric and demonstrates competitive performance for mAP-rare, as well as superior transfer to COCO and Objects365. This paper introduces three methods to enhance the alignment of visual and textual features for improved zero-shot open vocabulary object detection. Zero-shot open vocabulary detection, enabling the detection of objects not seen during training, heavily relies on strong alignment between visual and textual representations. Previous methods struggle to maintain this alignment, particularly when incorporating components trained from scratch. The paper proposes (1) efficient text augmentation during training using dropout or precomputed embedding variants, (2) an alignment-preserving architecture (APA) for the detector's feature pyramid network and head, using shortcuts and gates to retain feature alignment from pretraining, and (3) self-training with pseudo-labeling on a large image-text dataset to further improve alignment. Text augmentation with a frozen language model outperforms training the language model, improving speed and memory efficiency. APA significantly improves both overall detection (mAP-all) and detection of unseen objects (mAP-rare). Self-training considerably boosts performance, particularly for unseen objects, and surpasses the performance of previous self-training methods. Zero-shot detection still struggles with certain common objects frequently seen in training images but not annotated with bounding boxes. Future work could explore more efficient use of large image-text datasets through improved pseudo-labeling and self-supervised learning techniques. open vocabulary object detection, zero-shot learning, feature alignment, self-training, text augmentation
2303.13514 Report SAOR: Single-View Articulated Object Reconstruction Mehmet Aygün, Oisin Mac Aodha We introduce SAOR, a novel approach for estimating the 3D shape, texture, and viewpoint of an articulated object from a single image captured in the wild. Unlike prior approaches that rely on pre-defined category-specific 3D templates or tailored 3D skeletons, SAOR learns to articulate shapes from single-view image collections with a skeleton-free part-based model without requiring any 3D object shape priors. To prevent ill-posed solutions, we propose a cross-instance consistency loss that exploits disentangled object shape deformation and articulation. This is helped by a new silhouette-based sampling mechanism to enhance viewpoint diversity during training. Our method only requires estimated object silhouettes and relative depth maps from off-the-shelf pre-trained networks during training. At inference time, given a single-view image, it efficiently outputs an explicit mesh representation. We obtain improved qualitative and quantitative results on challenging quadruped animals compared to relevant existing work. SAOR, a self-supervised method for reconstructing the 3D shape, texture, and viewpoint of articulated objects from single images, without relying on 3D templates or skeletons. Reconstructing the 3D shape of articulated objects in the wild, particularly animals, from single images remains challenging due to limitations of existing methods, such as reliance on 3D templates or difficulty in modeling articulation. SAOR uses a skeleton-free, part-based model that learns to articulate shapes from single-view images. It utilizes a cross-instance consistency loss and a silhouette-based sampling mechanism to handle the ill-posed nature of 3D reconstruction and enhance viewpoint diversity during training. Outperforms previous methods that do not use explicit 3D supervision on keypoint transfer tasks for birds and quadrupeds. Demonstrates multi-view consistent 3D shape reconstructions, successfully capturing articulation and viewpoint differences. Exhibits generalization capabilities, reconstructing plausible 3D shapes from non-photorealistic images, such as drawings. Texture predictions, while promising, could be improved with refinement techniques. Struggles with images containing unusual viewpoints or significant object occlusion. 3d reconstruction, articulated objects, self-supervised learning, skeleton-free modeling, single-view reconstruction
2303.13455 Report CoBIT: A Contrastive Bi-directional Image-Text Generation Model Haoxuan You, Mandy Guo, Zhecan Wang, Kai-Wei Chang, Jason Baldridge, Jiahui Yu The field of vision and language has witnessed a proliferation of pre-trained foundation models. Most existing methods are independently pre-trained with contrastive objective like CLIP, image-to-text generative objective like PaLI, or text-to-image generative objective like Parti. However, the three objectives can be pre-trained on the same data, image-text pairs, and intuitively they complement each other as contrasting provides global alignment capacity and generation grants fine-grained understanding. In this work, we present a Contrastive Bi-directional Image-Text generation model (CoBIT), which attempts to unify the three pre-training objectives in one framework. Specifically, CoBIT employs a novel unicoder-decoder structure, consisting of an image unicoder, a text unicoder and a cross-modal decoder. The image/text unicoders can switch between encoding and decoding in different tasks, enabling flexibility and shared knowledge that benefits both image-to-text and text-to-image generations. CoBIT achieves superior performance in image understanding, image-text understanding (Retrieval, Captioning, VQA, SNLI-VE) and text-based content creation, particularly in zero-shot scenarios. For instance, 82.7% in zero-shot ImageNet classification, 9.37 FID score in zero-shot text-to-image generation and 44.8 CIDEr in zero-shot captioning. This paper introduces CoBIT, a novel vision-language model that unifies contrastive learning, image-to-text generation, and text-to-image generation within a single framework using a unicoder-decoder structure. This unification aims to consolidate the strengths of each pre-training objective and enable the model to excel in a wide range of vision and vision-language tasks. CoBIT utilizes a novel unicoder-decoder structure, enabling the image and text unicoders to switch between encoding and decoding modes depending on the task. The model is pre-trained on large-scale image-text datasets using contrastive loss, image-to-text generation loss, and text-to-image generation loss. CoBIT achieves state-of-the-art zero-shot performance on ImageNet classification (82.7% accuracy) and MS-COCO text-to-image generation (9.37 FID score). The model shows strong performance in other zero-shot tasks, such as image-text retrieval and image captioning. Ablation studies demonstrate the effectiveness of the proposed unicoder structure and the benefits of unifying the three pre-training objectives. The paper identifies a slight contradiction between the text-to-image and image-to-text generation objectives during training, suggesting a need for further exploration to better harmonize these tasks. Future work could investigate scaling up the model and exploring more diverse and challenging datasets to further enhance its capabilities. vision-language model, contrastive learning, image-to-text generation, text-to-image generation, unicoder-decoder
2303.13450 Report Set-the-Scene: Global-Local Training for Generating Controllable NeRF Scenes Dana Cohen-Bar, Elad Richardson, Gal Metzer, Raja Giryes, Daniel Cohen-Or Recent breakthroughs in text-guided image generation have led to remarkable progress in the field of 3D synthesis from text. By optimizing neural radiance fields (NeRF) directly from text, recent methods are able to produce remarkable results. Yet, these methods are limited in their control of each object's placement or appearance, as they represent the scene as a whole. This can be a major issue in scenarios that require refining or manipulating objects in the scene. To remedy this deficit, we propose a novel Global-Local training framework for synthesizing a 3D scene using object proxies. A proxy represents the object's placement in the generated scene and optionally defines its coarse geometry. The key to our approach is to represent each object as an independent NeRF. We alternate between optimizing each NeRF on its own and as part of the full scene. Thus, a complete representation of each object can be learned, while also creating a harmonious scene with matching style and lighting. We show that using proxies allows a wide variety of editing options, such as adjusting the placement of each independent object, removing objects from a scene, or refining an object. Our results show that Set-the-Scene offers a powerful solution for scene synthesis and manipulation, filling a crucial gap in controllable text-to-3D synthesis. Introduces Set-the-Scene, a framework for synthesizing controllable 3D scenes from text prompts and 3D object proxies using a Global-Local training approach with composable NeRFs. Current text-to-3D methods lack control over object placement and appearance, limiting scene customization and editing. Represents scenes as composable NeRFs, each built around a proxy defining placement and optionally coarse geometry. Employs a Global-Local training strategy, alternating between optimizing individual NeRFs locally and the entire scene globally using score distillation and shape loss. Generates scenes matching user-defined object placements and styles guided by text prompts. Enables post-training editing like object relocation, duplication, removal, geometry modification, and color scheme adjustments. Outperforms single-NeRF methods in generating complex scenes with consistent object relationships and styles, as demonstrated qualitatively and through a user study. Generation quality limited by the underlying single-object text-to-3D method (Latent-NeRF) and diffusion model. Occasional generation of objects as textures within the background NeRF instead of separate geometry. text-to-3d synthesis, neural radiance fields (nerf), composable scene representation, score distillation, 3d scene editing
2303.13396 Report Zero-guidance Segmentation Using Zero Segment Labels Pitchaporn Rewatbowornwong, Nattanat Chatthee, Ekapol Chuangsuwanich, Supasorn Suwajanakorn CLIP has enabled new and exciting joint vision-language applications, one of which is open-vocabulary segmentation, which can locate any segment given an arbitrary text query. In our research, we ask whether it is possible to discover semantic segments without any user guidance in the form of text queries or predefined classes, and label them using natural language automatically? We propose a novel problem zero-guidance segmentation and the first baseline that leverages two pre-trained generalist models, DINO and CLIP, to solve this problem without any fine-tuning or segmentation dataset. The general idea is to first segment an image into small over-segments, encode them into CLIP's visual-language space, translate them into text labels, and merge semantically similar segments together. The key challenge, however, is how to encode a visual segment into a segment-specific embedding that balances global and local context information, both useful for recognition. Our main contribution is a novel attention-masking technique that balances the two contexts by analyzing the attention layers inside CLIP. We also introduce several metrics for the evaluation of this new task. With CLIP's innate knowledge, our method can precisely locate the Mona Lisa painting among a museum crowd. Project page: https://zero-guide-seg.github.io/. Introduces "zero-guidance segmentation," a novel problem aiming to segment images and label segments in natural language without predefined classes or text guidance, and proposes the first baseline solution. Significantly advances semantic segmentation by eliminating the need for user input or predefined classes, enabling more flexible and comprehensive image understanding. Leverages pretrained DINO and CLIP models. First, over-segments an image using DINO features. Then, maps each segment to CLIP's visual-language embedding space using a novel attention-masking technique called "global subtraction" to balance global and local contexts. Finally, translates embeddings to text labels and merges semantically similar segments. Presents qualitative results demonstrating the method's ability to discover semantic segments and label them with diverse and meaningful text descriptions. Proposes new evaluation metrics to address the challenges of arbitrary label granularity and synonyms, enabling quantitative assessment of segmentation quality and text label accuracy. Shows promising results on Pascal Context and Pascal VOC datasets, particularly in discovering a wider range of objects compared to existing zero-shot open-vocabulary methods. Label reassignment during evaluation remains challenging due to the potential mismatch between predicted and ground-truth labels, highlighting the need for a better understanding of object parts and relationships. Global context leakage can still occur, particularly for background segments sharing boundaries with salient objects, suggesting avenues for improvement in attention masking and segment encoding. semantic segmentation, zero-shot learning, vision-language models, clip, dino
2303.13277 Report SINE: Semantic-driven Image-based NeRF Editing with Prior-guided Editing Field Chong Bao, Yinda Zhang, Bangbang Yang, Tianxing Fan, Zesong Yang, Hujun Bao, Guofeng Zhang, Zhaopeng Cui Despite the great success in 2D editing using user-friendly tools, such as Photoshop, semantic strokes, or even text prompts, similar capabilities in 3D areas are still limited, either relying on 3D modeling skills or allowing editing within only a few categories. In this paper, we present a novel semantic-driven NeRF editing approach, which enables users to edit a neural radiance field with a single image, and faithfully delivers edited novel views with high fidelity and multi-view consistency. To achieve this goal, we propose a prior-guided editing field to encode fine-grained geometric and texture editing in 3D space, and develop a series of techniques to aid the editing process, including cyclic constraints with a proxy mesh to facilitate geometric supervision, a color compositing mechanism to stabilize semantic-driven texture editing, and a feature-cluster-based regularization to preserve the irrelevant content unchanged. Extensive experiments and editing examples on both real-world and synthetic data demonstrate that our method achieves photo-realistic 3D editing using only a single edited image, pushing the bound of semantic-driven editing in 3D real-world scenes. Our project webpage: https://zju3dv.github.io/sine/. Proposes SINE, a semantic-driven image-based editing approach for NeRFs, enabling 3D scene editing using a single image or text prompts. Addresses limitations in 3D editing tools that require 3D modeling expertise or offer limited editing capabilities, aiming for effortless and realistic 3D scene manipulation. Learns a prior-guided editing field to encode geometric and texture modifications, utilizing shape priors (DIF, depth prediction) and semantic texture priors (ViT) for multi-view consistency. Introduces cyclic constraints with a proxy mesh, color compositing, and feature-cluster-based regularization to enhance editing quality and control. Achieves realistic geometric deformations from single-view edits, outperforming EG3D and EditNeRF in visual quality and generalization. Enables semantic-aware texture editing using target images or text prompts, surpassing ARF and CLIP-NeRF in detail and realism. Demonstrates effective editing control, preserving irrelevant scene parts through feature-cluster-based regularization. Current approach doesn't support edits involving topology changes (e.g., breaking objects). Assumes user edits are semantically meaningful, limiting the use of nonsensical target images. nerf editing, semantic editing, single-view editing, 3d scene manipulation, prior-guided learning
2303.13273 Report TAPS3D: Text-Guided 3D Textured Shape Generation from Pseudo Supervision Jiacheng Wei, Hao Wang, Jiashi Feng, Guosheng Lin, Kim-Hui Yap In this paper, we investigate an open research task of generating controllable 3D textured shapes from the given textual descriptions. Previous works either require ground truth caption labeling or extensive optimization time. To resolve these issues, we present a novel framework, TAPS3D, to train a text-guided 3D shape generator with pseudo captions. Specifically, based on rendered 2D images, we retrieve relevant words from the CLIP vocabulary and construct pseudo captions using templates. Our constructed captions provide high-level semantic supervision for generated 3D shapes. Further, in order to produce fine-grained textures and increase geometry diversity, we propose to adopt low-level image regularization to enable fake-rendered images to align with the real ones. During the inference phase, our proposed model can generate 3D textured shapes from the given text without any additional optimization. We conduct extensive experiments to analyze each of our proposed components and show the efficacy of our framework in generating high-fidelity 3D textured and text-relevant shapes. Introduces TAPS3D, a novel framework for generating controllable 3D textured shapes from text descriptions without ground truth captions or extensive optimization. Addresses limitations of prior text-to-3D methods that require labeled captions or suffer from long optimization times, making text-guided 3D generation practical. Generates pseudo captions from rendered 2D images using CLIP word retrieval and templates. Trains a text-conditioned 3D generator (pretrained GET3D) with high-level CLIP loss and low-level image regularization. Generates high-fidelity 3D textured shapes consistent with input text prompts. Significantly faster inference compared to optimization-based methods. Quantitative evaluation shows superior performance in image quality (FID), text-image relevance (CLIP-R-Precision), and geometry quality (FPD). Limited capacity to generate fine-grained details for different object parts. Reliance on diverse training images to handle complex text input. 3d shape generation, text-to-3d, pseudo supervision, clip, generative adversarial networks
2303.13232 Report Transforming Radiance Field with Lipschitz Network for Photorealistic 3D Scene Stylization Zicheng Zhang, Yinglu Liu, Congying Han, Yingwei Pan, Tiande Guo, Ting Yao Recent advances in 3D scene representation and novel view synthesis have witnessed the rise of Neural Radiance Fields (NeRFs). Nevertheless, it is not trivial to exploit NeRF for the photorealistic 3D scene stylization task, which aims to generate visually consistent and photorealistic stylized scenes from novel views. Simply coupling NeRF with photorealistic style transfer (PST) will result in cross-view inconsistency and degradation of stylized view syntheses. Through a thorough analysis, we demonstrate that this non-trivial task can be simplified in a new light: when transforming the appearance representation of a pre-trained NeRF with Lipschitz mapping, the consistency and photorealism across source views will be seamlessly encoded into the syntheses. This motivates us to build a concise and flexible learning framework, namely LipRF, which upgrades arbitrary 2D PST methods with Lipschitz mapping tailored for the 3D scene. Technically, LipRF first pre-trains a radiance field to reconstruct the 3D scene, and then stylizes each view with 2D PST as a prior to learn a Lipschitz network that stylizes the pre-trained appearance. Since the Lipschitz condition highly impacts the expressivity of the neural network, we devise an adaptive regularization to balance the reconstruction and stylization. A gradual gradient aggregation strategy is further introduced to optimize LipRF in a cost-efficient manner. We conduct extensive experiments to show the high quality and robust performance of LipRF on both photorealistic 3D stylization and object appearance editing. This paper presents LipRF, a novel framework that leverages a Lipschitz-constrained MLP to transform pre-trained NeRF appearance representations for photorealistic 3D scene stylization, ensuring consistency and photorealism in stylized novel views. Photorealistic 3D scene stylization, aiming to generate consistent and realistic stylized novel views, is challenging due to the lack of style loss tailored for NeRF training and the limitations of 2D PST methods causing inconsistencies across views. LipRF pre-trains a radiance field (Plenoxels) for scene reconstruction and learns a Lipschitz MLP to map the pre-trained appearance to stylized versions. It utilizes 2D PST stylization on individual views as guidance and employs adaptive regularization based on spectral normalization to balance reconstruction and stylization quality. Gradual gradient aggregation ensures cost-efficient optimization. LipRF successfully preserves photorealism and consistency in stylized scenes, outperforming existing 2D PST and 3D stylization methods. Adaptive regularization effectively balances stylization quality and adherence to the Lipschitz constraint, crucial for preserving image structure. Gradual gradient aggregation enables efficient training of LipRF, reducing memory footprint and computational cost. LipRF's performance depends on the accuracy of the pre-trained radiance field, potentially limiting its application to scenes where accurate NeRF reconstruction is challenging. Future work could explore joint optimization of the radiance field and the Lipschitz MLP for enhanced stylization. 3d scene stylization, neural radiance fields, lipschitz networks, photorealistic style transfer, novel view synthesis
2303.13126 Report MagicFusion: Boosting Text-to-Image Generation Performance by Fusing Diffusion Models Jing Zhao, Heliang Zheng, Chaoyue Wang, Long Lan, Wenjing Yang The advent of open-source AI communities has produced a cornucopia of powerful text-guided diffusion models that are trained on various datasets. However, few explorations have been conducted on ensembling such models to combine their strengths. In this work, we propose a simple yet effective method called Saliency-aware Noise Blending (SNB) that can empower the fused text-guided diffusion models to achieve more controllable generation. Specifically, we experimentally find that the responses of classifier-free guidance are highly related to the saliency of generated images. Thus, we propose to trust different models in their areas of expertise by blending the predicted noises of two diffusion models in a saliency-aware manner. SNB is training-free and can be completed within a DDIM sampling process. Additionally, it can automatically align the semantics of two noise spaces without requiring additional annotations such as masks. Extensive experiments show the impressive effectiveness of SNB in various applications. Project page is available at https://magicfusion.github.io/. This paper introduces Saliency-aware Noise Blending (SNB), a method to fuse pre-trained text-guided diffusion models for more controllable image generation. Leveraging the strengths of multiple pre-trained diffusion models allows for the creation of images that combine their individual capabilities, enabling fine-grained control, creative scene composition, and cross-domain fusion. SNB uses classifier-free guidance to generate saliency maps from two diffusion models, creating a mask that guides the blending of their predicted noises during the denoising sampling process. SNB allows the fusion of a general model with a fine-grained model, enabling the generation of specific objects within complex scenes. The method enables recontextualization by fusing a general model with a DreamBooth model, placing specific objects in new settings. SNB facilitates cross-domain fusion, combining the creative composition of cartoon models with the photorealism of general models. The method currently relies on manual tuning of hyperparameters for optimal blending. Future work could explore extending SNB to fuse more than two diffusion models simultaneously. image generation, diffusion models, model fusion, text-to-image synthesis, classifier-free guidance
2303.13076 Report CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching Xiaoshi Wu, Feng Zhu, Rui Zhao, Hongsheng Li Open-vocabulary detection (OVD) is an object detection task aiming at detecting objects from novel categories beyond the base categories on which the detector is trained. Recent OVD methods rely on large-scale visual-language pre-trained models, such as CLIP, for recognizing novel objects. We identify the two core obstacles that need to be tackled when incorporating these models into detector training: (1) the distribution mismatch that happens when applying a VL-model trained on whole images to region recognition tasks; (2) the difficulty of localizing objects of unseen classes. To overcome these obstacles, we propose CORA, a DETR-style framework that adapts CLIP for Open-vocabulary detection by Region prompting and Anchor pre-matching. Region prompting mitigates the whole-to-region distribution gap by prompting the region features of the CLIP-based region classifier. Anchor pre-matching helps learn generalizable object localization via a class-aware matching mechanism. We evaluate CORA on the COCO OVD benchmark, where we achieve 41.7 AP50 on novel classes, which outperforms the previous SOTA by 2.4 AP50 even without resorting to extra training data. When extra training data is available, we train CORA+ on both ground-truth base-category annotations and additional pseudo bounding box labels computed by CORA. CORA+ achieves 43.1 AP50 on the COCO OVD benchmark and 28.1 box APr on the LVIS OVD benchmark. CORA, a DETR-style framework that adapts CLIP for Open-Vocabulary Detection using Region Prompting and Anchor Pre-Matching. To address the obstacles of distribution mismatch and novel class localization when using CLIP for open-vocabulary detection. The paper introduces Region Prompting to adapt CLIP for region-level tasks and Anchor Pre-Matching for efficient and generalizable object localization. CORA achieves 41.7 AP50 on novel classes of the COCO OVD benchmark, outperforming the previous state-of-the-art by 2.4 AP50 without extra training data. Region Prompting effectively mitigates the distribution gap, boosting classification performance on novel classes from 63.9% to 74.1%. Anchor Pre-Matching enables efficient class-aware object localization, leading to better generalization to novel classes. The method relies on the performance of the CLIP model, which may limit its ability to detect objects with complex visual appearances or relationships. Future work could explore incorporating other modalities, such as depth or semantic segmentation, to further improve object localization. open-vocabulary detection, clip, region prompting, anchor pre-matching, detr
2303.13071 Report PanoHead: Geometry-Aware 3D Full-Head Synthesis in 360$^{\circ}$ Sizhe An, Hongyi Xu, Yichun Shi, Guoxian Song, Umit Ogras, Linjie Luo Synthesis and reconstruction of 3D human head has gained increasing interests in computer vision and computer graphics recently. Existing state-of-the-art 3D generative adversarial networks (GANs) for 3D human head synthesis are either limited to near-frontal views or hard to preserve 3D consistency in large view angles. We propose PanoHead, the first 3D-aware generative model that enables high-quality view-consistent image synthesis of full heads in $360^\circ$ with diverse appearance and detailed geometry using only in-the-wild unstructured images for training. At its core, we lift up the representation power of recent 3D GANs and bridge the data alignment gap when training from in-the-wild images with widely distributed views. Specifically, we propose a novel two-stage self-adaptive image alignment for robust 3D GAN training. We further introduce a tri-grid neural volume representation that effectively addresses front-face and back-head feature entanglement rooted in the widely-adopted tri-plane formulation. Our method instills prior knowledge of 2D image segmentation in adversarial learning of 3D neural scene structures, enabling compositable head synthesis in diverse backgrounds. Benefiting from these designs, our method significantly outperforms previous 3D GANs, generating high-quality 3D heads with accurate geometry and diverse appearances, even with long wavy and afro hairstyles, renderable from arbitrary poses. Furthermore, we show that our system can reconstruct full 3D heads from single input images for personalized realistic 3D avatars. PanoHead is the first 3D GAN framework to synthesize view-consistent, high-fidelity full-head images in 360° from only single-view images, enabling 3D portrait creation. Existing 3D GANs for head synthesis are either limited to near-frontal views or struggle to maintain 3D consistency across large view angles, hindering applications like digital avatars and telepresence. PanoHead builds upon EG3D and introduces: (1) foreground-aware tri-discriminator for separating foreground and background; (2) tri-grid volume representation to address feature entanglement in tri-plane; (3) two-stage image alignment with a self-adaptation module for robust training on in-the-wild images. Synthesizes high-fidelity 360° full-head images with detailed geometry, outperforming SOTA methods in qualitative and quantitative evaluations. Generates background-free 3D head geometry, even with diverse hairstyles. Demonstrates compelling single-view 3D head reconstruction and novel-view synthesis. Minor artifacts persist (e.g., teeth area, flickering textures). Lacks finer high-frequency geometric details (e.g., hair tips). 3d gan, full-head synthesis, 360° view synthesis, single-view reconstruction, neural rendering
2303.13062 Report SIEDOB: Semantic Image Editing by Disentangling Object and Background Wuyang Luo, Su Yang, Xinjian Zhang, Weishan Zhang Semantic image editing provides users with a flexible tool to modify a given image guided by a corresponding segmentation map. In this task, the features of the foreground objects and the backgrounds are quite different. However, all previous methods handle backgrounds and objects as a whole using a monolithic model. Consequently, they remain limited in processing content-rich images and suffer from generating unrealistic objects and texture-inconsistent backgrounds. To address this issue, we propose a novel paradigm, Semantic Image Editing by Disentangling Object and Background (SIEDOB), the core idea of which is to explicitly leverage several heterogeneous subnetworks for objects and backgrounds. First, SIEDOB disassembles the edited input into background regions and instance-level objects. Then, we feed them into the dedicated generators. Finally, all synthesized parts are embedded in their original locations, and a fusion network is used to obtain a harmonized result. Moreover, to produce high-quality edited images, we propose some innovative designs, including Semantic-Aware Self-Propagation Module, Boundary-Anchored Patch Discriminator, and Style-Diversity Object Generator, and integrate them into SIEDOB. We conduct extensive experiments on Cityscapes and ADE20K-Room datasets and show that our method remarkably outperforms the baselines, especially in synthesizing realistic and diverse objects and texture-consistent backgrounds. Presents SIEDOB, a novel semantic image editing framework that disentangles object and background generation for improved realism and texture consistency in complex scenes. Existing methods struggle with generating realistic and coherent edits in images with multiple, diverse objects and backgrounds, particularly in content-rich scenes. Employs a heterogeneous model that disassembles the edited image into background regions and instance-level objects. Different generators synthesize corresponding content, integrated via a fusion network. Introduces innovations like the Semantic-Aware Self-Propagation Module, Boundary-Anchored Patch Discriminator, and Style-Diversity Object Generator to enhance quality. Outperforms state-of-the-art methods in visual quality and metrics like FID, LPIPS, and mIoU. Effectively handles scenes with dense, overlapping objects, producing superior results compared to methods that treat the entire image uniformly. Demonstrates improved texture consistency between edited and known regions in both objects and backgrounds. Struggles with generating objects from rare categories due to limited training data. Object generation quality is challenged by extreme poses or large-scale occlusions. Future work could explore incorporating mechanisms to handle rare categories and challenging object configurations. semantic image editing, generative adversarial networks, image synthesis, disentanglement learning, texture consistency
2303.13005 Report From Knowledge Distillation to Self-Knowledge Distillation: A Unified Approach with Normalized Loss and Customized Soft Labels Zhendong Yang, Ailing Zeng, Zhe Li, Tianke Zhang, Chun Yuan, Yu Li Knowledge Distillation (KD) uses the teacher's prediction logits as soft labels to guide the student, while self-KD does not need a real teacher to provide the soft labels. This work unifies the formulations of the two tasks by decomposing and reorganizing the generic KD loss into a Normalized KD (NKD) loss and customized soft labels for both target class (image's category) and non-target classes named Universal Self-Knowledge Distillation (USKD). We decompose the KD loss and find that the non-target loss from it forces the student's non-target logits to match the teacher's, but the sum of the two non-target logits is different, preventing them from being identical. NKD normalizes the non-target logits to equalize their sum. It can be generally used for KD and self-KD to better use the soft labels for distillation loss. USKD generates customized soft labels for both target and non-target classes without a teacher. It smooths the target logit of the student as the soft target label and uses the rank of the intermediate feature to generate the soft non-target labels with Zipf's law. For KD with teachers, our NKD achieves state-of-the-art performance on CIFAR-100 and ImageNet datasets, boosting the ImageNet Top-1 accuracy of ResNet18 from 69.90% to 71.96% with a ResNet-34 teacher. For self-KD without teachers, USKD is the first self-KD method that can be effectively applied to both CNN and ViT models with negligible additional time and memory cost, resulting in new state-of-the-art results, such as 1.17% and 0.55% accuracy gains on ImageNet for MobileNet and DeiT-Tiny, respectively. Our codes are available at https://github.com/yzd-v/cls_KD. This paper proposes two novel methods: Normalized Knowledge Distillation (NKD), which improves upon traditional KD by normalizing non-target logits for better soft label utilization, and Universal Self-Knowledge Distillation (USKD), which introduces customized soft labels for both target and non-target classes, enabling self-KD for both CNN and ViT models. The paper aims to address limitations in knowledge distillation (KD) and self-knowledge distillation (self-KD) by improving the utilization of soft labels for distillation loss and proposing a more general and effective method for generating customized soft labels for self-KD. NKD normalizes the non-target logits in KD loss to equalize their sum with the teacher's, facilitating better knowledge transfer. USKD generates customized soft labels by smoothing the student's target logit and using the rank of intermediate features, determined through weak supervision, to generate soft non-target labels based on Zipf's law. NKD achieves state-of-the-art performance for KD, outperforming previous methods on CIFAR-100 and ImageNet, demonstrating significant accuracy gains. USKD, with its customized soft labels, effectively performs self-KD on both CNN and ViT models, achieving state-of-the-art results with negligible additional time and resource consumption compared to baseline training. The paper provides analysis and visualizations demonstrating the effectiveness of normalizing non-target logits, customizing soft labels, and the impact of different smoothing methods and rank determination approaches. The performance of USKD with varying hyperparameters for non-target loss requires further investigation. Exploration of the effectiveness of NKD and USKD on more diverse tasks beyond image classification and object detection. knowledge distillation, self-knowledge distillation, normalized loss, customized soft labels, cnn and vit models
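The normalization idea in NKD is easy to see in code: zero out the target class and renormalize the remaining probabilities of both student and teacher before matching them. This is a minimal PyTorch sketch, not the official cls_KD implementation; the target-term formulation, weighting, and temperature handling are assumptions.

```python
import torch
import torch.nn.functional as F

def nkd_style_loss(student_logits, teacher_logits, target, temperature=1.0, gamma=1.5):
    """Minimal sketch of a Normalized-KD-style loss (illustrative, not official)."""
    s = F.softmax(student_logits / temperature, dim=1)
    t = F.softmax(teacher_logits / temperature, dim=1)

    idx = target.unsqueeze(1)
    s_target = s.gather(1, idx)            # student probability of the ground-truth class
    t_target = t.gather(1, idx)            # teacher probability of the ground-truth class

    # Target part: teacher-weighted cross entropy on the ground-truth class.
    target_loss = -(t_target * torch.log(s_target + 1e-8)).mean()

    # Non-target part: zero the target class and renormalize, so both non-target
    # distributions sum to one before being matched (the "normalization" in NKD).
    s_non = s.scatter(1, idx, 0.0) / (1.0 - s_target + 1e-8)
    t_non = t.scatter(1, idx, 0.0) / (1.0 - t_target + 1e-8)
    non_target_loss = -(t_non * torch.log(s_non + 1e-8)).sum(dim=1).mean()

    return target_loss + gamma * (temperature ** 2) * non_target_loss
```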
2303.12950 Report LightPainter: Interactive Portrait Relighting with Freehand Scribble Yiqun Mei, He Zhang, Xuaner Zhang, Jianming Zhang, Zhixin Shu, Yilin Wang, Zijun Wei, Shi Yan, HyunJoon Jung, Vishal M. Patel Recent portrait relighting methods have achieved realistic results of portrait lighting effects given a desired lighting representation such as an environment map. However, these methods are not intuitive for user interaction and lack precise lighting control. We introduce LightPainter, a scribble-based relighting system that allows users to interactively manipulate portrait lighting effect with ease. This is achieved by two conditional neural networks, a delighting module that recovers geometry and albedo optionally conditioned on skin tone, and a scribble-based module for relighting. To train the relighting module, we propose a novel scribble simulation procedure to mimic real user scribbles, which allows our pipeline to be trained without any human annotations. We demonstrate high-quality and flexible portrait lighting editing capability with both quantitative and qualitative experiments. User study comparisons with commercial lighting editing tools also demonstrate consistent user preference for our method. This paper introduces LightPainter, a novel scribble-based interactive portrait relighting system that enables users to manipulate portrait lighting effects easily. Existing portrait relighting methods rely on lighting representations like environment maps or exemplar images, which are not intuitive for user interaction and lack precise lighting control. LightPainter uses two conditional neural networks: a delighting module to recover geometry and albedo optionally conditioned on skin tone, and a scribble-based module for relighting. It uses a novel scribble simulation procedure to mimic real user scribbles for training. LightPainter allows flexible lighting editing with scribbles and enables skin tone control with SkinFill. User study shows LightPainter is more user-friendly and generates more faithful relighting results than other methods. LightPainter outperforms state-of-the-art methods in terms of photorealism and fidelity on both light stage and in-the-wild images. LightPainter's performance relies on accurate geometry estimation. The current scribble simulation may not cover all real-world cases. portrait relighting, interactive image editing, scribble-based interface, deep learning, computer vision
2303.12865 Report NeRF-GAN Distillation for Efficient 3D-Aware Generation with Convolutions Mohamad Shahbazi, Evangelos Ntavelis, Alessio Tonioni, Edo Collins, Danda Pani Paudel, Martin Danelljan, Luc Van Gool Pose-conditioned convolutional generative models struggle with high-quality 3D-consistent image generation from single-view datasets, due to their lack of sufficient 3D priors. Recently, the integration of Neural Radiance Fields (NeRFs) and generative models, such as Generative Adversarial Networks (GANs), has transformed 3D-aware generation from single-view images. NeRF-GANs exploit the strong inductive bias of neural 3D representations and volumetric rendering at the cost of higher computational complexity. This study aims at revisiting pose-conditioned 2D GANs for efficient 3D-aware generation at inference time by distilling 3D knowledge from pretrained NeRF-GANs. We propose a simple and effective method, based on re-using the well-disentangled latent space of a pre-trained NeRF-GAN in a pose-conditioned convolutional network to directly generate 3D-consistent images corresponding to the underlying 3D representations. Experiments on several datasets demonstrate that the proposed method obtains results comparable with volumetric rendering in terms of quality and 3D consistency while benefiting from the computational advantage of convolutional networks. The code will be available at: https://github.com/mshahbazi72/NeRF-GAN-Distillation This paper introduces a novel method for distilling pretrained NeRF-GANs into pose-conditioned convolutional generators, enabling efficient 3D-aware image generation. While NeRF-GANs excel in 3D-aware generation from single-view images, their reliance on volumetric rendering makes them computationally expensive. This work addresses this limitation by transferring the 3D knowledge to a faster convolutional generator. The method leverages the disentangled latent space of a pretrained NeRF-GAN (EG3D) to supervise a convolutional generator. By sharing the latent space and training with a combination of reconstruction and adversarial losses, the convolutional generator learns to produce 3D-consistent images. The proposed method generates images with comparable quality and 3D consistency to the NeRF-GAN, as evidenced by FID/KID scores and pose/identity preservation metrics. It significantly outperforms traditional pose-conditioned GANs and a recent baseline (SURF) in terms of 3D consistency. The convolutional generator achieves superior efficiency compared to the volumetric rendering approach, allowing for larger batch sizes and faster inference. The quality and consistency of the generated images are inherently limited by the pretrained NeRF-GAN. Future work could focus on achieving even stronger correspondence between the convolutional and volumetric rendering in terms of semantic details. generative adversarial networks, neural radiance fields, 3d-aware generation, knowledge distillation, convolutional networks
2303.12790 Report $CrowdDiff$: Multi-hypothesis Crowd Density Estimation using Diffusion Models Yasiru Ranasinghe, Nithin Gopalakrishnan Nair, Wele Gedara Chaminda Bandara, Vishal M. Patel Crowd counting is a fundamental problem in crowd analysis which is typically accomplished by estimating a crowd density map and summing over the density values. However, this approach suffers from background noise accumulation and loss of density due to the use of broad Gaussian kernels to create the ground truth density maps. This issue can be overcome by narrowing the Gaussian kernel. However, existing approaches perform poorly when trained with ground truth density maps with broad kernels. To deal with this limitation, we propose using conditional diffusion models to predict density maps, as diffusion models show high fidelity to training data during generation. With that, we present $CrowdDiff$ that generates the crowd density map as a reverse diffusion process. Furthermore, as the intermediate time steps of the diffusion process are noisy, we incorporate a regression branch for direct crowd estimation only during training to improve the feature learning. In addition, owing to the stochastic nature of the diffusion model, we introduce producing multiple density maps to improve the counting performance contrary to the existing crowd counting pipelines. We conduct extensive experiments on publicly available datasets to validate the effectiveness of our method. $CrowdDiff$ outperforms existing state-of-the-art crowd counting methods on several public crowd analysis benchmarks with significant improvements. This paper introduces CrowdDiff, a novel crowd counting framework employing denoising diffusion probabilistic models to generate crowd density maps, enhancing accuracy by using narrow density kernels and enabling iterative improvement through multiple realizations. Existing density-based methods struggle with background noise and density loss, especially in congested scenes, while localization-based methods require crowd density heuristics. CrowdDiff addresses these limitations by combining the strengths of both approaches. CrowdDiff leverages a denoising diffusion process to generate crowd density maps, utilizing narrow Gaussian kernels for higher fidelity. A counting branch aids feature learning during training, and a novel fusion method combines multiple density map realizations for improved accuracy. CrowdDiff surpasses state-of-the-art crowd counting methods on public datasets, particularly excelling in dense scenes. Using narrow kernels with diffusion models enables accurate counting in congested regions and reduces background noise accumulation. The proposed crowd map fusion method significantly boosts counting performance by leveraging the stochastic nature of diffusion models. The iterative inference process of diffusion models leads to higher inference times compared to some existing methods. Future work could explore the use of consistency models to speed up inference without sacrificing accuracy. crowd counting, diffusion models, density map estimation, crowd analysis, computer vision
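The density-map counting convention the paper builds on, plus a naive stand-in for combining multiple stochastic realizations, can be sketched as follows (the paper's fusion operates on the maps with a more elaborate matching scheme; this only illustrates the multi-hypothesis idea).

```python
import torch

def crowd_count(density_map):
    """Summing a density map over its spatial dimensions yields the crowd count."""
    return density_map.sum(dim=(-2, -1))

def fuse_realizations(density_maps):
    """Naive stand-in for CrowdDiff's fusion: average K stochastic realizations
    sampled from the diffusion model before counting."""
    return torch.stack(density_maps, dim=0).mean(dim=0)
```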
2303.12786 Report FeatureNeRF: Learning Generalizable NeRFs by Distilling Foundation Models Jianglong Ye, Naiyan Wang, Xiaolong Wang Recent works on generalizable NeRFs have shown promising results on novel view synthesis from single or few images. However, such models have rarely been applied on other downstream tasks beyond synthesis such as semantic understanding and parsing. In this paper, we propose a novel framework named FeatureNeRF to learn generalizable NeRFs by distilling pre-trained vision foundation models (e.g., DINO, Latent Diffusion). FeatureNeRF leverages 2D pre-trained foundation models to 3D space via neural rendering, and then extract deep features for 3D query points from NeRF MLPs. Consequently, it allows to map 2D images to continuous 3D semantic feature volumes, which can be used for various downstream tasks. We evaluate FeatureNeRF on tasks of 2D/3D semantic keypoint transfer and 2D/3D object part segmentation. Our extensive experiments demonstrate the effectiveness of FeatureNeRF as a generalizable 3D semantic feature extractor. Our project page is available at https://jianglongye.com/featurenerf/ . Proposes FeatureNeRF, a framework that distills pre-trained 2D vision foundation models into generalizable NeRFs to enable 3D semantic understanding from 2D images. Existing generalizable NeRFs focus on novel view synthesis and lack semantic understanding capabilities, while 3D foundation models are limited by the availability of large-scale 3D datasets. FeatureNeRF adds a feature branch to the NeRF MLP and trains it to match the features extracted from a 2D foundation model (e.g., DINO, Latent Diffusion) of the rendered image. It further introduces internal NeRF features and a coordinate loss to improve 3D semantic understanding. FeatureNeRF outperforms baselines in 2D/3D semantic keypoint transfer and 2D/3D object part segmentation tasks. It enables novel view semantic keypoint transfer and part co-segmentation by rendering feature maps from unseen viewpoints. The learned 3D semantic feature representation can be applied to editing applications, such as 3D part texture swapping. The performance of FeatureNeRF relies on the quality of the pre-trained 2D foundation models. The current implementation requires known camera poses, limiting its application to in-the-wild images. neural radiance fields, nerf, foundation models, 3d semantic understanding, feature distillation
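The core distillation signal can be sketched as a per-pixel feature-matching loss between the NeRF's rendered feature map and the frozen 2D foundation model's features for the same view. The shapes, the bilinear resize, and the plain MSE objective below are assumptions of this sketch; the paper additionally uses internal NeRF features and a coordinate loss.

```python
import torch.nn.functional as F

def feature_distillation_loss(nerf_feat_map, foundation_feat_map):
    """Match rendered NeRF features (B, C, H, W) to frozen 2D teacher features
    (B, C, h, w), e.g. from DINO, assuming the channel dimensions already agree."""
    target = F.interpolate(foundation_feat_map, size=nerf_feat_map.shape[-2:],
                           mode="bilinear", align_corners=False)
    return F.mse_loss(nerf_feat_map, target)
```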
2303.12733 Report On the De-duplication of LAION-2B Ryan Webster, Julien Rabin, Loic Simon, Frederic Jurie Generative models, such as DALL-E, Midjourney, and Stable Diffusion, have societal implications that extend beyond the field of computer science. These models require large image databases like LAION-2B, which contain two billion images. At this scale, manual inspection is difficult and automated analysis is challenging. In addition, recent studies show that duplicated images pose copyright problems for models trained on LAION2B, which hinders its usability. This paper proposes an algorithmic chain that runs with modest compute, that compresses CLIP features to enable efficient duplicate detection, even for vast image volumes. Our approach demonstrates that roughly 700 million images, or about 30\%, of LAION-2B's images are likely duplicated. Our method also provides the histograms of duplication on this dataset, which we use to reveal more examples of verbatim copies by Stable Diffusion and further justify the approach. The current version of the de-duplicated set will be distributed online. This paper presents an algorithmic chain for efficient duplicate detection in large image datasets, focusing on LAION-2B, using compressed CLIP features. Duplicate images in massive datasets like LAION-2B pose copyright concerns for trained models, especially generative ones like Stable Diffusion. This work aims to address this issue by efficiently identifying and removing duplicates, improving dataset usability. The authors propose SNIP, a contrastive feature compression technique that preserves text-image alignment in CLIP features. They use SNIP with approximate nearest neighbor search (IVFPQ) to efficiently find duplicates in LAION-2B. An adaptive thresholding strategy based on asymmetric distances is used to identify duplicates. The SNIP compression method shows better semantic retention for multimodal tasks compared to MSE-based compression while maintaining competitive retrieval performance. The proposed method identifies roughly 700 million duplicate images in LAION-2B, approximately one-third of the dataset, with a precision of 91%. By synthesizing images from the most duplicated subset, the authors were able to identify additional cases of verbatim copying by Stable Diffusion with significantly fewer resources than previous studies. The current de-duplication method is based on a conservative threshold and may miss some duplicates. Future work could explore the impact of prompt variability and distinctiveness on image duplication. de-duplication, clip, image retrieval, generative models, copyright
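A rough picture of the retrieval machinery, assuming FAISS for the IVFPQ index: compress and normalize features, search each image's nearest neighbour, and flag pairs closer than a threshold. Dimensions, index parameters, and the fixed threshold are illustrative; the paper compresses with SNIP and uses an adaptive threshold based on asymmetric distances.

```python
import faiss
import numpy as np

d = 256                                        # compressed feature dimension (illustrative)
features = np.random.rand(100_000, d).astype("float32")
faiss.normalize_L2(features)                   # unit norm, so L2 distance tracks cosine similarity

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, 1024, 32, 8)   # 1024 lists, 32 sub-quantizers, 8 bits each
index.train(features)
index.add(features)
index.nprobe = 16                              # number of inverted lists probed at query time

dists, ids = index.search(features, 2)         # rank 0 is (approximately) the query itself
near_dup = np.where(dists[:, 1] < 0.1)[0]      # fixed threshold here; adaptive in the paper
```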
2303.12688 Report Pix2Video: Video Editing using Image Diffusion Duygu Ceylan, Chun-Hao Paul Huang, Niloy J. Mitra Image diffusion models, trained on massive image collections, have emerged as the most versatile image generator model in terms of quality and diversity. They support inverting real images and conditional (e.g., text) generation, making them attractive for high-quality image editing applications. We investigate how to use such pre-trained image models for text-guided video editing. The critical challenge is to achieve the target edits while still preserving the content of the source video. Our method works in two simple steps: first, we use a pre-trained structure-guided (e.g., depth) image diffusion model to perform text-guided edits on an anchor frame; then, in the key step, we progressively propagate the changes to the future frames via self-attention feature injection to adapt the core denoising step of the diffusion model. We then consolidate the changes by adjusting the latent code for the frame before continuing the process. Our approach is training-free and generalizes to a wide range of edits. We demonstrate the effectiveness of the approach by extensive experimentation and compare it against four different prior and parallel efforts (on ArXiv). We demonstrate that realistic text-guided video edits are possible, without any compute-intensive preprocessing or video-specific finetuning. This paper presents Pix2Video, a training-free method for text-guided video editing that leverages pre-trained image diffusion models, particularly a depth-conditioned Stable Diffusion model. Existing video editing techniques often require extensive training or per-video fine-tuning. This method aims to bridge this gap by leveraging the power of pre-trained image diffusion models for coherent and efficient video editing. The method employs a two-step process: (1) It uses a pre-trained depth-guided image diffusion model to perform text-guided editing on a selected anchor frame. (2) It propagates changes to other frames via self-attention feature injection in the diffusion model's denoising step and consolidates these changes by adjusting latent codes to ensure similarity with the preceding frame. Pix2Video can perform both localized (e.g., changing an object's color) and global (e.g., changing the overall style) edits on videos. The method exhibits superior performance compared to baseline methods, achieving higher faithfulness to the text prompt and maintaining better temporal consistency. User studies confirmed that Pix2Video generates edits that are more faithful to the prompts and are generally preferred over the results of other methods. The temporal coherency of the generated video can be further improved. Handling longer videos requires addressing challenges with increasing distance from the anchor frame. video editing, image diffusion models, text-guided editing, stable diffusion, self-attention
2303.12678 Report Uni-Fusion: Universal Continuous Mapping Yijun Yuan, Andreas Nuechter We present Uni-Fusion, a universal continuous mapping framework for surfaces, surface properties (color, infrared, etc.) and more (latent features in CLIP embedding space, etc.). We propose the first universal implicit encoding model that supports encoding of both geometry and different types of properties (RGB, infrared, features, etc.) without requiring any training. Based on this, our framework divides the point cloud into regular grid voxels and generates a latent feature in each voxel to form a Latent Implicit Map (LIM) for geometries and arbitrary properties. Then, by fusing a local LIM frame-wisely into a global LIM, an incremental reconstruction is achieved. Encoded with corresponding types of data, our Latent Implicit Map is capable of generating continuous surfaces, surface property fields, surface feature fields, and all other possible options. To demonstrate the capabilities of our model, we implement three applications: (1) incremental reconstruction for surfaces and color (2) 2D-to-3D transfer of fabricated properties (3) open-vocabulary scene understanding by creating a text CLIP feature field on surfaces. We evaluate Uni-Fusion by comparing it in corresponding applications, from which Uni-Fusion shows high-flexibility in various applications while performing best or being competitive. The project page of Uni-Fusion is available at https://jarrome.github.io/Uni-Fusion/ . Presents Uni-Fusion, a universal continuous mapping framework for surfaces, surface properties (color, infrared, etc.), and high-dimensional features like CLIP embeddings, without requiring any training. Addresses the need for a single, universal mapping model in robotics that can handle various types of information, including geometry and surface properties, for tasks like reconstruction and scene understanding. Decouples Gaussian Process Regression (GPR) using kernel function approximation to encode local point cloud data into latent vectors. These vectors form a Latent Implicit Map (LIM) that is incrementally reconstructed by fusing local LIMs frame-wise into a global LIM. Uni-Fusion achieves state-of-the-art surface reconstruction accuracy on ScanNet, outperforming previous methods like BNV-Fusion. It demonstrates high-quality color reconstruction on the Replica dataset, achieving results comparable to NeRF-SLAM in visual quality. Uni-Fusion successfully performs open-vocabulary scene understanding by constructing a surface field for CLIP embeddings, enabling it to respond to various semantic queries. Currently lacks support for remapping, which is necessary for bundle adjustment and loop closing. Future work involves exploring Visual Language Navigation (VLN) applications leveraging Uni-Fusion's ability to construct 3D embedding maps. continuous mapping, surface reconstruction, scene understanding, open-vocabulary, neural implicit maps
2303.12417 Report CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data Yihan Zeng, Chenhan Jiang, Jiageng Mao, Jianhua Han, Chaoqiang Ye, Qingqiu Huang, Dit-Yan Yeung, Zhen Yang, Xiaodan Liang, Hang Xu Contrastive Language-Image Pre-training, benefiting from large-scale unlabeled text-image pairs, has demonstrated great performance in open-world vision understanding tasks. However, due to the limited Text-3D data pairs, adapting the success of 2D Vision-Language Models (VLM) to the 3D space remains an open problem. Existing works that leverage VLM for 3D understanding generally resort to constructing intermediate 2D representations for the 3D data, but at the cost of losing 3D geometry information. To take a step toward open-world 3D vision understanding, we propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$^2$) to directly learn the transferable 3D point cloud representation in realistic scenarios with a novel proxy alignment mechanism. Specifically, we exploit naturally-existed correspondences in 2D and 3D scenarios, and build well-aligned and instance-based text-image-point proxies from those complex scenarios. On top of that, we propose a cross-modal contrastive objective to learn semantic and instance-level aligned point cloud representation. Experimental results on both indoor and outdoor scenarios show that our learned 3D representation has great transfer ability in downstream tasks, including zero-shot and few-shot 3D recognition, which boosts the state-of-the-art methods by large margins. Furthermore, we provide analyses of the capability of different representations in real scenarios and present the optional ensemble scheme. Presents CLIP², a novel framework for Contrastive Language-Image-Point Cloud Pretraining that learns transferable 3D point cloud representation directly from real-world scenarios with a new proxy alignment mechanism. Addresses the challenge of adapting the success of 2D Vision-Language Models (VLM) to 3D space due to limited Text-3D data pairs, aiming for open-world 3D vision understanding. Leverages existing large-scale point cloud datasets and constructs language-image-point triplets (Triplet Proxy Collection) to pretrain a point cloud encoder using a cross-modal contrastive objective (Cross-Modal Pretraining). Achieves state-of-the-art zero-shot transfer performance on 5 datasets, including indoor/outdoor scenes and single-object benchmarks. Significantly outperforms baseline methods on zero-shot 3D recognition tasks, demonstrating the effectiveness of the learned 3D representation. Showcases strong open-vocabulary recognition and localization abilities in both indoor and outdoor scenarios, recognizing objects beyond the predefined ground truth vocabulary. Current proxy generation process cannot provide accurate tight bounding boxes for 3D objects as dedicated detectors. Limited by the scale of current proxy data, further performance improvement is expected with larger and more diverse proxy datasets. 3d vision, vision-language model, zero-shot learning, point cloud representation learning, open-world recognition
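The cross-modal objective can be approximated by a symmetric InfoNCE loss that pulls each point-cloud embedding toward its paired text and image embeddings from the triplet proxies. This generic sketch stands in for the paper's semantic- and instance-level alignment losses; shapes and the temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def cross_modal_nce(point_emb, text_emb, image_emb, tau=0.07):
    """Symmetric InfoNCE over aligned (point, text, image) triplets, each (B, D)."""
    p = F.normalize(point_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    labels = torch.arange(p.size(0), device=p.device)

    def nce(a, b):
        logits = a @ b.t() / tau
        return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

    return nce(p, t) + nce(p, v)   # align points with both text and image modalities
```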
2303.12368 Report MAIR: Multi-view Attention Inverse Rendering with 3D Spatially-Varying Lighting Estimation JunYong Choi, SeokYeong Lee, Haesol Park, Seung-Won Jung, Ig-Jae Kim, Junghyun Cho We propose a scene-level inverse rendering framework that uses multi-view images to decompose the scene into geometry, a SVBRDF, and 3D spatially-varying lighting. Because multi-view images provide a variety of information about the scene, multi-view images in object-level inverse rendering have been taken for granted. However, owing to the absence of multi-view HDR synthetic dataset, scene-level inverse rendering has mainly been studied using single-view image. We were able to successfully perform scene-level inverse rendering using multi-view images by expanding OpenRooms dataset and designing efficient pipelines to handle multi-view images, and splitting spatially-varying lighting. Our experiments show that the proposed method not only achieves better performance than single-view-based methods, but also achieves robust performance on unseen real-world scene. Also, our sophisticated 3D spatially-varying lighting volume allows for photorealistic object insertion in any 3D location. This paper presents MAIR, the first multi-view inverse rendering framework for scene-level decomposition into geometry, spatially-varying BRDF, and 3D spatially-varying lighting. Existing single-view inverse rendering methods struggle with complex real-world scenes due to reliance on contextual information for specular reflectance. MAIR overcomes this limitation by exploiting multi-view images and MVS depth, enabling more accurate and robust scene decomposition. MAIR uses a three-stage training pipeline: 1) Estimate direct lighting and geometry from multi-view inputs. 2) Estimate material properties using the estimated direct lighting and multi-view aggregation. 3) Infer 3D spatially-varying lighting by combining all estimated components. The authors create OpenRooms FF, a multi-view extension of OpenRooms dataset, to train and evaluate MAIR. MAIR outperforms single-view methods in material and geometry estimation on OpenRooms FF dataset. Qualitative results on real-world images demonstrate MAIR's robustness in handling complex scenes and separating materials from lighting. MAIR enables realistic object insertion in both synthetic and real-world scenes by accurately reproducing 3D lighting. Cascaded pipeline structure makes MAIR susceptible to errors in depth estimation. Non-parametric VSG lighting representation limits its application in tasks such as light source editing. inverse rendering, multi-view stereo, spatially-varying lighting, scene understanding, object insertion
2303.12346 Report NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation Shengming Yin, Chenfei Wu, Huan Yang, Jianfeng Wang, Xiaodong Wang, Minheng Ni, Zhengyuan Yang, Linjie Li, Shuguang Liu, Fan Yang, Jianlong Fu, Gong Ming, Lijuan Wang, Zicheng Liu, Houqiang Li, Nan Duan In this paper, we propose NUWA-XL, a novel Diffusion over Diffusion architecture for eXtremely Long video generation. Most current work generates long videos segment by segment sequentially, which normally leads to the gap between training on short videos and inferring long videos, and the sequential generation is inefficient. Instead, our approach adopts a ``coarse-to-fine'' process, in which the video can be generated in parallel at the same granularity. A global diffusion model is applied to generate the keyframes across the entire time range, and then local diffusion models recursively fill in the content between nearby frames. This simple yet effective strategy allows us to directly train on long videos (3376 frames) to reduce the training-inference gap, and makes it possible to generate all segments in parallel. To evaluate our model, we build FlintstonesHD dataset, a new benchmark for long video generation. Experiments show that our model not only generates high-quality long videos with both global and local coherence, but also decreases the average inference time from 7.55min to 26s (by 94.26\%) at the same hardware setting when generating 1024 frames. The homepage link is \url{https://msra-nuwa.azurewebsites.net/} NUWA-XL, a "Diffusion over Diffusion" architecture for generating extremely long videos using a "coarse-to-fine" process. Existing methods, relying on "Autoregressive over X" architectures, struggle with training-inference gap and inefficient sequential generation, leading to incoherent and unrealistic long videos. 1. A global diffusion model generates keyframes spanning the entire video, creating a coarse storyline. 2. Local diffusion models recursively fill in content between adjacent keyframes with increasing detail. Directly trained on long videos (3376 frames), eliminating the training-inference gap. Generates higher-quality long videos with better global and local coherence compared to "Autoregressive over X" methods. Significantly faster inference (up to 94.26% speedup) due to parallel processing of local diffusions. Limited evaluation on open-domain long videos due to data availability; currently validated on a cartoon dataset. Requires significant GPU resources for parallel inference to achieve the speedup. video generation, long video generation, diffusion models, coarse-to-fine, parallel inference
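The "diffusion over diffusion" recursion reads naturally as: generate sparse keyframes once over the whole time range, then recursively infill between neighbouring frames until the frame gap is small enough. The sketch below is purely structural, with hypothetical `global_model` and `local_model` callables that return (timestamp, frame) pairs; in practice all infill calls at one recursion level run in parallel.

```python
def generate_long_video(prompt, t_start, t_end, global_model, local_model, min_gap):
    """Coarse-to-fine generation sketch with assumed model interfaces."""
    keyframes = global_model(prompt, t_start, t_end)        # [(t, frame), ...] over full range

    def fill(left, right):
        (t0, f0), (t1, f1) = left, right
        if t1 - t0 <= min_gap:                              # dense enough, stop recursing
            return [left, right]
        mids = local_model(prompt, f0, f1, t0, t1)          # frames between two keyframes
        frames, prev = [left], left
        for nxt in mids + [right]:
            frames += fill(prev, nxt)[1:]                   # recurse on each sub-interval
            prev = nxt
        return frames

    video = [keyframes[0]]
    for i in range(len(keyframes) - 1):
        video += fill(keyframes[i], keyframes[i + 1])[1:]
    return video
```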
2303.12343 Report LD-ZNet: A Latent Diffusion Approach for Text-Based Image Segmentation Koutilya Pnvr, Bharat Singh, Pallabi Ghosh, Behjat Siddiquie, David Jacobs Large-scale pre-training tasks like image classification, captioning, or self-supervised techniques do not incentivize learning the semantic boundaries of objects. However, recent generative foundation models built using text-based latent diffusion techniques may learn semantic boundaries. This is because they have to synthesize intricate details about all objects in an image based on a text description. Therefore, we present a technique for segmenting real and AI-generated images using latent diffusion models (LDMs) trained on internet-scale datasets. First, we show that the latent space of LDMs (z-space) is a better input representation compared to other feature representations like RGB images or CLIP encodings for text-based image segmentation. By training the segmentation models on the latent z-space, which creates a compressed representation across several domains like different forms of art, cartoons, illustrations, and photographs, we are also able to bridge the domain gap between real and AI-generated images. We show that the internal features of LDMs contain rich semantic information and present a technique in the form of LD-ZNet to further boost the performance of text-based segmentation. Overall, we show up to 6% improvement over standard baselines for text-to-image segmentation on natural images. For AI-generated imagery, we show close to 20% improvement compared to state-of-the-art techniques. The project is available at https://koutilya-pnvr.github.io/LD-ZNet/. This paper presents LD-ZNet, a novel text-based image segmentation technique leveraging latent diffusion models (LDMs) trained on large-scale datasets. The work addresses the limitation of large-scale pre-training tasks (e.g., image classification) in learning semantic boundaries of objects, which is crucial for open-world image segmentation, particularly in editing AI-generated content. The methodology involves analyzing the latent space (z-space) of LDMs as input representation for segmentation and incorporating internal LDM features via cross-attention into a segmentation network (ZNet) to create LD-ZNet. LD-ZNet shows up to 6% improvement over baselines for text-to-image segmentation on natural images. For AI-generated imagery, LD-ZNet achieves close to 20% improvement over state-of-the-art techniques. Analysis reveals that the internal features of LDMs contain rich semantic information, especially in middle layers and specific timesteps during the denoising process. The approach's reliance on LDMs increases inference time compared to some baselines. Future work could explore optimizing the trade-off between performance and computational cost. image segmentation, latent diffusion models, text-to-image synthesis, ai-generated images, semantic segmentation
2303.12326 Report Make Encoder Great Again in 3D GAN Inversion through Geometry and Occlusion-Aware Encoding Ziyang Yuan, Yiming Zhu, Yu Li, Hongyu Liu, Chun Yuan 3D GAN inversion aims to achieve high reconstruction fidelity and reasonable 3D geometry simultaneously from a single image input. However, existing 3D GAN inversion methods rely on time-consuming optimization for each individual case. In this work, we introduce a novel encoder-based inversion framework based on EG3D, one of the most widely-used 3D GAN models. We leverage the inherent properties of EG3D's latent space to design a discriminator and a background depth regularization. This enables us to train a geometry-aware encoder capable of converting the input image into corresponding latent code. Additionally, we explore the feature space of EG3D and develop an adaptive refinement stage that improves the representation ability of features in EG3D to enhance the recovery of fine-grained textural details. Finally, we propose an occlusion-aware fusion operation to prevent distortion in unobserved regions. Our method achieves impressive results comparable to optimization-based methods while operating up to 500 times faster. Our framework is well-suited for applications such as semantic editing. This paper introduces a novel encoder-based 3D GAN inversion method for EG3D that leverages a "canonical latent space" within the model for improved reconstruction fidelity and 3D geometry. Existing 3D GAN inversion methods are either time-consuming (optimization-based) or lack fidelity (encoder-based). This method aims to achieve both efficiency and high-quality inversion. The method utilizes a geometry-aware encoder trained with a canonical latent discriminator and background depth regularization. It also uses an adaptive feature alignment module to refine generator features and an occlusion-aware fusion operation for multi-view consistency. Achieves high-quality inversion comparable to optimization-based methods while being significantly faster (up to 500 times). Exhibits robust performance even when inverting images with extreme poses. Demonstrates effectiveness for 3D-aware semantic editing applications. The reliance on paired data for training may limit generalization. Future work could explore applying the method to other 3D GAN architectures. 3d gan inversion, eg3d, canonical latent space, adaptive feature alignment, occlusion-aware fusion
2303.12218 Report Compositional 3D Scene Generation using Locally Conditioned Diffusion Ryan Po, Gordon Wetzstein Designing complex 3D scenes has been a tedious, manual process requiring domain expertise. Emerging text-to-3D generative models show great promise for making this task more intuitive, but existing approaches are limited to object-level generation. We introduce \textbf{locally conditioned diffusion} as an approach to compositional scene diffusion, providing control over semantic parts using text prompts and bounding boxes while ensuring seamless transitions between these parts. We demonstrate a score distillation sampling--based text-to-3D synthesis pipeline that enables compositional 3D scene generation at a higher fidelity than relevant baselines. This paper introduces locally conditioned diffusion, a method for compositional 3D scene generation using diffusion models with control over semantic elements through text prompts and bounding boxes. Designing 3D scenes is a laborious process, and this method aims to simplify it while offering control over scene composition. The method leverages pre-trained text-conditioned 2D diffusion models and applies locally conditioned diffusion to a score distillation sampling-based 3D generation pipeline. It utilizes 3D bounding boxes and text prompts to guide the generation process. The method generates high-quality 3D scenes that adhere to the user-specified layout with seamless transitions between elements. It provides control over the size and position of individual assets within the scene. It outperforms baseline methods in terms of CLIP R-Precision, indicating better alignment with input prompts. The generation process can be slow, especially for scenes with multiple distinct elements, due to reliance on thousands of denoising iterations. The heavy reliance on high guidance scales for score distillation sampling can lead to limited diversity in generated outputs. 3d scene generation, diffusion models, compositional synthesis, text-to-3d, score distillation sampling
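Locally conditioned diffusion amounts to compositing per-prompt noise predictions with the user's box masks at every denoising step. A minimal sketch with an assumed `unet(x, t, cond)` interface and pre-rendered binary masks of shape (B, 1, H, W):

```python
import torch

def locally_conditioned_eps(unet, x_t, t, global_emb, prompt_embs, masks):
    """Composite noise predictions: each box-level prompt overrides the
    scene-level prediction inside its mask (interfaces are assumptions)."""
    eps = unet(x_t, t, global_emb)                  # prediction for the scene-level prompt
    for emb, mask in zip(prompt_embs, masks):
        local_eps = unet(x_t, t, emb)               # prediction for one box-level prompt
        eps = mask * local_eps + (1.0 - mask) * eps
    return eps
```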
2303.11938 Report 3D-CLFusion: Fast Text-to-3D Rendering with Contrastive Latent Diffusion Yu-Jhe Li, Tao Xu, Ji Hou, Bichen Wu, Xiaoliang Dai, Albert Pumarola, Peizhao Zhang, Peter Vajda, Kris Kitani We tackle the task of text-to-3D creation with pre-trained latent-based NeRFs (NeRFs that generate 3D objects given input latent code). Recent works such as DreamFusion and Magic3D have shown great success in generating 3D content using NeRFs and text prompts, but the current approach of optimizing a NeRF for every text prompt is 1) extremely time-consuming and 2) often leads to low-resolution outputs. To address these challenges, we propose a novel method named 3D-CLFusion which leverages the pre-trained latent-based NeRFs and performs fast 3D content creation in less than a minute. In particular, we introduce a latent diffusion prior network for learning the w latent from the input CLIP text/image embeddings. This pipeline allows us to produce the w latent without further optimization during inference and the pre-trained NeRF is able to perform multi-view high-resolution 3D synthesis based on the latent. We note that the novelty of our model lies in that we introduce contrastive learning during training the diffusion prior which enables the generation of the valid view-invariant latent code. We demonstrate through experiments the effectiveness of our proposed view-invariant diffusion process for fast text-to-3D creation, e.g., 100 times faster than DreamFusion. We note that our model is able to serve as the role of a plug-and-play tool for text-to-3D with pre-trained NeRFs. This paper introduces 3D-CLFusion, a novel method for fast text-to-3D creation that leverages pre-trained latent-based NeRFs and a latent diffusion prior network. Current text-to-3D methods using NeRFs are time-consuming (taking hours per object) and often produce low-resolution outputs due to optimizing a NeRF from scratch for each text prompt. 3D-CLFusion addresses these limitations by enabling fast generation (<1 minute) and high-resolution rendering. 3D-CLFusion consists of a diffusion prior network trained on CLIP image embeddings and a pre-trained latent-based NeRF. It uses contrastive learning during training to ensure the generated latent codes are view-invariant, allowing for consistent 3D object generation from various viewpoints. 3D-CLFusion generates 3D objects from text prompts significantly faster (around 100x) than methods like DreamFusion and Magic3D. The use of contrastive learning in the diffusion process is crucial for achieving view-invariant latent codes and thus, consistent 3D objects. The approach demonstrates promising results on different pre-trained NeRF generators, including StyleNeRF and EG3D, for various object classes like faces and cars. The generated 3D objects are limited to the domain of the pre-trained NeRF model used. Future work could explore extending the method to handle a wider variety of objects and scenes by incorporating more diverse pre-trained NeRF models or developing techniques for generalizing across domains. text-to-3d, neural radiance fields (nerfs), latent diffusion models, contrastive learning, view-invariance
2303.11916 Report CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion Geonmo Gu, Sanghyuk Chun, Wonjae Kim, HeeJae Jun, Yoohoon Kang, Sangdoo Yun This paper proposes a novel diffusion-based model, CompoDiff, for solving zero-shot Composed Image Retrieval (ZS-CIR) with latent diffusion. This paper also introduces a new synthetic dataset, named SynthTriplets18M, with 18.8 million reference images, conditions, and corresponding target image triplets to train CIR models. CompoDiff and SynthTriplets18M tackle the shortages of the previous CIR approaches, such as poor generalizability due to the small dataset scale and the limited types of conditions. CompoDiff not only achieves a new state-of-the-art on four ZS-CIR benchmarks, including FashionIQ, CIRR, CIRCO, and GeneCIS, but also enables a more versatile and controllable CIR by accepting various conditions, such as negative text, and image mask conditions. CompoDiff also shows the controllability of the condition strength between text and image queries and the trade-off between inference speed and performance, which are unavailable with existing CIR methods. The code and dataset are available at https://github.com/navervision/CompoDiff This paper introduces CompoDiff, a novel diffusion-based model for zero-shot Composed Image Retrieval (ZS-CIR) using latent diffusion. It also presents SynthTriplets18M, a large synthetic dataset for training CIR models. Existing CIR methods suffer from limited generalizability due to small datasets and restricted condition types. This work aims to address these limitations and enable versatile CIR with diverse conditions. CompoDiff leverages a latent diffusion model with classifier-free guidance to edit reference images in CLIP latent space. It is trained on a massive synthetic dataset, SynthTriplets18M, generated by automatically creating and filtering image-caption triplets. CompoDiff achieves state-of-the-art zero-shot performance on FashionIQ, CIRR, CIRCO, and GeneCIS benchmarks. Training existing CIR methods on SynthTriplets18M also leads to significant improvements, surpassing previous zero-shot methods. CompoDiff allows versatile CIR with multiple conditions (negative text, image masks) and enables controlling condition strength and inference speed. Current CIR benchmarks might not fully represent real-world queries. Quantitative evaluation of retrieval quality on large-scale databases requires further exploration. composed image retrieval, latent diffusion, zero-shot learning, synthetic dataset, classifier-free guidance
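One way to picture the multi-condition control is a generic two-condition classifier-free guidance combination, where separate weights trade off how strongly the text and the reference image steer the denoising. The sketch below is a stand-in for that idea, not CompoDiff's exact weighting or its negative-text and mask handling.

```python
def two_condition_cfg(eps_uncond, eps_text, eps_image, w_text=7.5, w_image=1.5):
    """Generic multi-condition classifier-free guidance over predicted noises
    (tensors of identical shape); adjusting w_text vs. w_image shifts control
    between the text and image queries."""
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)
            + w_image * (eps_image - eps_uncond))
```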
2303.11797 Report CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation Seokju Cho, Heeseong Shin, Sunghwan Hong, Anurag Arnab, Paul Hongsuck Seo, Seungryong Kim Open-vocabulary semantic segmentation presents the challenge of labeling each pixel within an image based on a wide range of text descriptions. In this work, we introduce a novel cost-based approach to adapt vision-language foundation models, notably CLIP, for the intricate task of semantic segmentation. Through aggregating the cosine similarity score, i.e., the cost volume between image and text embeddings, our method potently adapts CLIP for segmenting seen and unseen classes by fine-tuning its encoders, addressing the challenges faced by existing methods in handling unseen classes. Building upon this, we explore methods to effectively aggregate the cost volume considering its multi-modal nature of being established between image and text embeddings. Furthermore, we examine various methods for efficiently fine-tuning CLIP. This paper presents CAT-Seg, a novel cost-based framework for open-vocabulary semantic segmentation that leverages CLIP by aggregating cosine similarity scores between image and text embeddings. Existing methods struggle to adapt CLIP for pixel-level prediction due to overfitting issues when fine-tuning. This paper addresses this gap by proposing a cost aggregation approach that effectively adapts CLIP to the segmentation task. CAT-Seg computes a cost volume from image and text embeddings of CLIP and aggregates it through spatial and class aggregation modules. Additionally, it utilizes embedding guidance and efficiently fine-tunes CLIP encoders for optimal performance. CAT-Seg achieves state-of-the-art results on standard open-vocabulary benchmarks, outperforming previous methods by a large margin. The framework generalizes well to multi-domain datasets, showing robustness to domain shifts. CAT-Seg demonstrates strong efficiency in both training and inference compared to region-text methods. The reliability of evaluation datasets for open-vocabulary semantic segmentation is questionable due to ambiguities in ground truth. Further investigation into handcrafted text prompts for improved performance is a potential avenue for future work. open-vocabulary semantic segmentation, vision-language models, clip, cost aggregation, fine-tuning
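The starting point of the pipeline, the cost volume, is just the cosine similarity between dense CLIP image embeddings and the class-name text embeddings; the spatial and class aggregation modules then refine this volume. A minimal sketch with assumed shapes:

```python
import torch
import torch.nn.functional as F

def clip_cost_volume(image_feats, text_feats):
    """Cosine-similarity cost volume between dense image embeddings (B, D, H, W)
    and per-class text embeddings (N, D); returns (B, N, H, W)."""
    img = F.normalize(image_feats, dim=1)
    txt = F.normalize(text_feats, dim=-1)
    return torch.einsum("bdhw,nd->bnhw", img, txt)
```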
2303.11749 Report Detecting Everything in the Open World: Towards Universal Object Detection Zhenyu Wang, Yali Li, Xi Chen, Ser-Nam Lim, Antonio Torralba, Hengshuang Zhao, Shengjin Wang In this paper, we formally address universal object detection, which aims to detect every scene and predict every category. The dependence on human annotations, the limited visual information, and the novel categories in the open world severely restrict the universality of traditional detectors. We propose UniDetector, a universal object detector that has the ability to recognize enormous categories in the open world. The critical points for the universality of UniDetector are: 1) it leverages images of multiple sources and heterogeneous label spaces for training through the alignment of image and text spaces, which guarantees sufficient information for universal representations. 2) it generalizes to the open world easily while keeping the balance between seen and unseen classes, thanks to abundant information from both vision and language modalities. 3) it further promotes the generalization ability to novel categories through our proposed decoupling training manner and probability calibration. These contributions allow UniDetector to detect over 7k categories, the largest measurable category size so far, with only about 500 classes participating in training. Our UniDetector exhibits strong zero-shot generalization ability on large-vocabulary datasets like LVIS, ImageNetBoxes, and VisualGenome - it surpasses the traditional supervised baselines by more than 4\% on average without seeing any corresponding images. On 13 public detection datasets with various scenes, UniDetector also achieves state-of-the-art performance with only 3\% of the training data. Proposes UniDetector, a universal object detection framework capable of detecting a vast number of categories in open-world scenarios, even those not present during training. Addresses the limitations of traditional object detectors that struggle to generalize to unseen categories and diverse scenes, aiming for human-like generalization capabilities in object detection. Leverages image-text pre-training to align image and text spaces, enabling detection of novel categories. Employs decoupled training of proposal generation and RoI classification stages, along with probability calibration, to enhance open-world performance. Achieves state-of-the-art performance on 13 diverse object detection datasets with only 3% of the training data used by comparable methods. Outperforms traditional supervised methods on large-vocabulary datasets by over 4% AP, demonstrating strong zero-shot generalization. Achieves 49.3% AP on COCO with a ResNet50 backbone and 1x schedule, highlighting its effectiveness in both open and closed-world settings. Current implementation primarily focuses on object-centric datasets, with future work exploring more complex scenes. Relies on accurate language embeddings for novel category detection, which could be limited by the quality of pre-trained language models. object detection, open-world learning, zero-shot learning, image-text pre-training, universal object detection
2303.11686 Report Learning a 3D Morphable Face Reflectance Model from Low-cost Data Yuxuan Han, Zhibo Wang, Feng Xu Modeling non-Lambertian effects such as facial specularity leads to a more realistic 3D Morphable Face Model. Existing works build parametric models for diffuse and specular albedo using Light Stage data. However, only diffuse and specular albedo cannot determine the full BRDF. In addition, the requirement of Light Stage data is hard to fulfill for the research communities. This paper proposes the first 3D morphable face reflectance model with spatially varying BRDF using only low-cost publicly-available data. We apply linear shininess weighting into parametric modeling to represent spatially varying specular intensity and shininess. Then an inverse rendering algorithm is developed to reconstruct the reflectance parameters from non-Light Stage data, which are used to train an initial morphable reflectance model. To enhance the model's generalization capability and expressive power, we further propose an update-by-reconstruction strategy to finetune it on an in-the-wild dataset. Experimental results show that our method obtains decent rendering results with plausible facial specularities. Our code is released at https://yxuhan.github.io/ReflectanceMM/index.html. This paper introduces the first 3D morphable face reflectance model that incorporates spatially varying Bidirectional Reflectance Distribution Function (BRDF) learned from readily accessible, low-cost data. Existing morphable face models struggle to realistically represent facial specularity, a key element for lifelike rendering. This work addresses this limitation by using a novel BRDF model trained on widely available data, removing the reliance on expensive and complex Light Stage setups. The authors utilize a linear combination of Blinn-Phong BRDFs with predefined exponents to characterize the specular component of reflectance. They develop an inverse rendering method to estimate reflectance parameters from the Multi-PIE dataset, generating an initial model. This model is then refined on the FFHQ dataset through a joint face reconstruction and model update process. The model successfully captures spatially varying specular intensity and shininess on faces, leading to more realistic renderings compared to models using a global specular exponent. Fine-tuning the model on in-the-wild data significantly enhances its generalization ability, demonstrated through superior performance in photometric face reconstruction on the CelebA-HQ dataset. The model exhibits plausible disentanglement of diffuse and specular shading and shows promise for relighting applications, generating realistic specular reflections under novel lighting conditions. The model currently employs a Lambertian BRDF for diffuse reflectance, limiting its ability to represent subsurface scattering effects. Integrating a more sophisticated diffuse reflectance model could enhance realism. Representing the complex specular properties of the eye region remains a challenge. Further research is needed to effectively model specular reflections around the eyes. 3d morphable face model, reflectance modeling, brdf, inverse rendering, face relighting
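The linear shininess weighting can be written as a fixed bank of Blinn-Phong lobes mixed by per-texel weights, so specular intensity and shininess vary spatially while the exponents stay shared across the face. Variable names and shapes below are illustrative:

```python
import torch

def specular_blinn_phong_mixture(n_dot_h, weights, exponents):
    """Specular shading as a linear combination of Blinn-Phong lobes.

    n_dot_h:   (..., 1) clamped dot product between surface normal and half vector
    weights:   (..., K) per-texel, non-negative mixing weights (model output)
    exponents: (K,)     fixed, predefined shininess exponents shared by all texels
    """
    lobes = n_dot_h.clamp(min=0.0) ** exponents         # (..., K) via broadcasting
    return (weights * lobes).sum(dim=-1, keepdim=True)  # (..., 1) specular term
```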
2303.11424 Report Polynomial Implicit Neural Representations For Large Diverse Datasets Rajhans Singh, Ankita Shukla, Pavan Turaga Implicit neural representations (INR) have gained significant popularity for signal and image representation for many end-tasks, such as superresolution, 3D modeling, and more. Most INR architectures rely on sinusoidal positional encoding, which accounts for high-frequency information in data. However, the finite encoding size restricts the model's representational power. Higher representational power is needed to go from representing a single given image to representing large and diverse datasets. Our approach addresses this gap by representing an image with a polynomial function and eliminates the need for positional encodings. To achieve a progressively higher degree of polynomial representation, we use element-wise multiplications between features and affine-transformed coordinate locations after every ReLU layer. The proposed method is evaluated qualitatively and quantitatively on large datasets like ImageNet. The proposed Poly-INR model performs comparably to state-of-the-art generative models without any convolution, normalization, or self-attention layers, and with far fewer trainable parameters. With far fewer trainable parameters and higher representational power, our approach paves the way for broader adoption of INR models for generative modeling tasks in complex domains. The code is available at https://github.com/Rajhans0/Poly_INR. This paper proposes Poly-INR, a novel Implicit Neural Representation (INR) model that leverages polynomial functions to represent large and diverse image datasets. Existing INRs, often reliant on sinusoidal positional encoding, face limitations in representational power when scaled to large datasets like ImageNet. Poly-INR addresses this by using polynomials, enabling efficient parameterization and handling of high-frequency information. Poly-INR consists of a mapping network (converting latent codes to affine parameters) and a synthesis network. The latter progressively increases the degree of the polynomial representation by element-wise multiplication between features and affine-transformed coordinate locations after each ReLU layer. Poly-INR achieves comparable performance to state-of-the-art CNN-based GANs (e.g., StyleGAN-XL) on ImageNet, with 3-4 times fewer parameters. It outperforms previous INR-based GANs (CIPS, INR-GAN) on the FFHQ dataset with a smaller model size. The model demonstrates strong capabilities in image interpolation, extrapolation, style-mixing, high-resolution sampling, and inversion. Poly-INR's computational cost is higher than multi-scale CNN generators for high-resolution synthesis due to pixel-wise computation. The model sometimes exhibits GAN artifacts (e.g., multiple limbs) potentially due to limitations in the discriminator's shape understanding. implicit neural representations, generative models, polynomial representation, image synthesis, stylegan
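The Poly-INR synthesis rule described above is simple to sketch: each layer applies a linear map and ReLU to the features, then multiplies them element-wise with an affine transform of the pixel coordinates, so the represented function's polynomial degree grows by one per layer. A minimal PyTorch sketch under stated assumptions (hypothetical layer sizes; the mapping network that produces the affine parameters from a latent code is omitted for brevity), not the authors' implementation:

```python
# Hedged sketch of a Poly-INR-style synthesis network (assumptions, not official code).
import torch
import torch.nn as nn

class PolySynthesisLayer(nn.Module):
    def __init__(self, feat_dim: int):
        super().__init__()
        self.affine = nn.Linear(2, feat_dim)   # affine transform of (x, y) coordinates
        self.linear = nn.Linear(feat_dim, feat_dim)

    def forward(self, feats, coords):
        # feats: (N, feat_dim); coords: (N, 2) pixel locations in [-1, 1]
        feats = torch.relu(self.linear(feats))
        # Element-wise product with affine-transformed coordinates raises the polynomial degree.
        return feats * self.affine(coords)

class TinyPolyINR(nn.Module):
    def __init__(self, feat_dim: int = 64, depth: int = 4):
        super().__init__()
        self.stem = nn.Linear(2, feat_dim)
        self.layers = nn.ModuleList([PolySynthesisLayer(feat_dim) for _ in range(depth)])
        self.to_rgb = nn.Linear(feat_dim, 3)

    def forward(self, coords):
        feats = self.stem(coords)
        for layer in self.layers:
            feats = layer(feats, coords)       # degree increases with every layer
        return self.to_rgb(feats)

coords = torch.rand(1024, 2) * 2 - 1           # random query coordinates
rgb = TinyPolyINR()(coords)                    # (1024, 3) predicted colors
```

In the paper the affine parameters are produced per-image by the mapping network; here they are ordinary learnable layers purely to keep the sketch self-contained.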
2303.11396 Report Text2Tex: Text-driven Texture Synthesis via Diffusion Models Dave Zhenyu Chen, Yawar Siddiqui, Hsin-Ying Lee, Sergey Tulyakov, Matthias Nießner We present Text2Tex, a novel method for generating high-quality textures for 3D meshes from the given text prompts. Our method incorporates inpainting into a pre-trained depth-aware image diffusion model to progressively synthesize high resolution partial textures from multiple viewpoints. To avoid accumulating inconsistent and stretched artifacts across views, we dynamically segment the rendered view into a generation mask, which represents the generation status of each visible texel. This partitioned view representation guides the depth-aware inpainting model to generate and update partial textures for the corresponding regions. Furthermore, we propose an automatic view sequence generation scheme to determine the next best view for updating the partial texture. Extensive experiments demonstrate that our method significantly outperforms the existing text-driven approaches and GAN-based methods. Text2Tex: a novel method for generating high-quality 3D textures on meshes from text prompts. Automating 3D texture design using text guidance is important for efficient 3D content creation, but existing methods struggle to produce high-quality and consistent results. Text2Tex leverages a pretrained depth-aware text-to-image diffusion model to progressively generate and refine textures. It uses a view partitioning technique for consistent inpainting and an automatic viewpoint selection scheme for refinement. Significantly outperforms existing text-driven methods in FID and KID on Objaverse dataset. Outperforms category-specific GAN-based methods on ShapeNet car dataset. Preferred by human users in a user study for realism and fidelity to text prompts. Generated textures can exhibit shading effects inherited from the diffusion backbone. Future work could explore fine-tuning the diffusion model to remove shading artifacts. 3d texture synthesis, text-guided generation, depth-aware diffusion model, view partitioning, automatic viewpoint selection
2303.11324 Report Open-vocabulary Panoptic Segmentation with Embedding Modulation Xi Chen, Shuang Li, Ser-Nam Lim, Antonio Torralba, Hengshuang Zhao Open-vocabulary image segmentation is attracting increasing attention due to its critical applications in the real world. Traditional closed-vocabulary segmentation methods are not able to characterize novel objects, whereas several recent open-vocabulary attempts obtain unsatisfactory results, i.e., a notable performance reduction on the closed vocabulary and a massive demand for extra data. To this end, we propose OPSNet, an omnipotent and data-efficient framework for Open-vocabulary Panoptic Segmentation. Specifically, the exquisitely designed Embedding Modulation module, together with several meticulous components, enables adequate embedding enhancement and information exchange between the segmentation model and the visual-linguistically well-aligned CLIP encoder, resulting in superior segmentation performance under both open- and closed-vocabulary settings with far less need for additional data. Extensive experimental evaluations are conducted across multiple datasets (e.g., COCO, ADE20K, Cityscapes, and PascalContext) under various circumstances, where the proposed OPSNet achieves state-of-the-art results, demonstrating the effectiveness and generality of the proposed approach. The code and trained models will be made publicly available. This paper proposes OPSNet, an omnipotent and data-efficient framework for Open-vocabulary Panoptic Segmentation that uses an Embedding Modulation module for enhanced information exchange between the segmentation model and a visual-linguistic CLIP encoder. Open-vocabulary image segmentation is crucial for real-world applications as it allows for the segmentation and recognition of both known and unknown objects, overcoming limitations of traditional closed-vocabulary methods that fail to characterize novel objects. OPSNet predicts class-agnostic object masks and utilizes a Spatial Adapter to extract CLIP visual features. It employs Embedding Modulation, combining query and CLIP embeddings, to enhance recognition. Mask Filtering refines mask proposals, while Decoupled Supervision uses image-level labels for training, expanding training concepts. OPSNet achieves state-of-the-art results in open-vocabulary panoptic segmentation across multiple datasets (COCO, ADE20K, Cityscapes). It demonstrates superior performance compared to previous open-vocabulary semantic segmentation methods while maintaining strong performance on closed-vocabulary datasets. The proposed Embedding Modulation module effectively enhances object recognition by leveraging both query and CLIP embeddings, balancing in-domain accuracy and generalization to novel categories. The accuracy of open-vocabulary predictions can be further improved, as current results show occasional noise and misclassifications. Future work includes exploring the use of larger and more diverse datasets for training to further enhance the generalization ability of OPSNet to broader object categories. open-vocabulary segmentation, panoptic segmentation, clip, embedding modulation, cross-dataset generalization
2303.11316 Report Generative Semantic Segmentation Jiaqi Chen, Jiachen Lu, Xiatian Zhu, Li Zhang We present Generative Semantic Segmentation (GSS), a generative learning approach for semantic segmentation. Uniquely, we cast semantic segmentation as an image-conditioned mask generation problem. This is achieved by replacing the conventional per-pixel discriminative learning with a latent prior learning process. Specifically, we model the variational posterior distribution of latent variables given the segmentation mask. To that end, the segmentation mask is expressed with a special type of image (dubbed maskige). This posterior distribution allows us to generate segmentation masks unconditionally. To achieve semantic segmentation on a given image, we further introduce a conditioning network. It is optimized by minimizing the divergence between the posterior distribution of maskige (i.e., segmentation masks) and the latent prior distribution of input training images. Extensive experiments on standard benchmarks show that our GSS can perform competitively with prior art alternatives in the standard semantic segmentation setting, whilst achieving a new state of the art in the more challenging cross-domain setting. This paper introduces Generative Semantic Segmentation (GSS), a novel approach that formulates semantic segmentation as an image-conditioned mask generation problem, marking a shift from traditional discriminative learning paradigms. This new perspective allows leveraging the power of off-the-shelf big generative models pretrained on massive datasets, potentially leading to more efficient and domain-agnostic segmentation models. GSS uses a two-stage optimization process. First, it learns a latent posterior distribution for reconstructing segmentation masks efficiently using the concept of "maskige" and pretrained VQVAE. Second, it learns a latent prior distribution conditioned on input images to generate segmentation masks. GSS achieves competitive performance compared to state-of-the-art discriminative models on Cityscapes and ADE20K. GSS outperforms existing methods in cross-domain semantic segmentation on the MSeg dataset, demonstrating superior domain generalization capabilities. The proposed "maskige" mechanism proves to be efficient and domain-agnostic, enabling the use of pretrained generative models and transfer learning across datasets. While showing promising results, GSS's performance still lags behind the top discriminative models, indicating potential for improvement in segmentation accuracy. The current approach is limited by the color space used for "maskige" representation, particularly when dealing with a large number of categories, and exploring higher-dimensional representations could be beneficial. generative semantic segmentation, maskige, vqvae, cross-domain segmentation, domain generalization
2303.11313 Report CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition Deepti Hegde, Jeya Maria Jose Valanarasu, Vishal M. Patel Vision-Language models like CLIP have been widely adopted for various tasks due to their impressive zero-shot capabilities. However, CLIP is not suitable for extracting 3D geometric features as it was trained on only images and text by natural language supervision. We work on addressing this limitation and propose a new framework termed CG3D (CLIP Goes 3D) where a 3D encoder is learned to exhibit zero-shot capabilities. CG3D is trained using triplets of pointclouds, corresponding rendered 2D images, and texts using natural language supervision. To align the features in a multimodal embedding space, we utilize contrastive loss on 3D features obtained from the 3D encoder, as well as visual and text features extracted from CLIP. We note that the natural images used to train CLIP and the rendered 2D images in CG3D have a distribution shift. Attempting to train the visual and text encoder to account for this shift results in catastrophic forgetting and a notable decrease in performance. To solve this, we employ prompt tuning and introduce trainable parameters in the input space to shift CLIP towards the 3D pre-training dataset utilized in CG3D. We extensively test our pre-trained CG3D framework and demonstrate its impressive capabilities in zero-shot, open scene understanding, and retrieval tasks. Further, it also serves as strong starting weights for fine-tuning in downstream 3D recognition tasks. Presents CG3D, a new framework for training a 3D encoder with natural language supervision leveraging CLIP, enabling zero-shot 3D recognition and serving as strong initialization for fine-tuning 3D recognition tasks. Addresses the lack of 3D networks with zero-shot capabilities similar to CLIP, crucial for 3D understanding tasks and improving existing 3D backbones. Trains a 3D encoder with contrastive loss aligning 3D, image, and text features using ShapeNet data. Employs prompt tuning to shift CLIP's visual encoder towards rendered 3D objects. Significantly outperforms PointCLIP in zero-shot 3D recognition on ModelNet and ScanObjectNN. Demonstrates effective scene querying with language and cross-modal 3D retrieval. Provides competitive performance boost when used as pre-training for downstream 3D recognition tasks. Limited pre-training dataset size and reliance on simulated point cloud objects. Focus on objects rather than scenes, limiting full scene understanding capabilities. 3d vision, zero-shot learning, vision-language models, clip, prompt tuning
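The multimodal alignment at the heart of CG3D reduces to symmetric contrastive (InfoNCE-style) losses between the 3D encoder's point-cloud embeddings and the frozen CLIP image/text embeddings of the same objects. A minimal sketch of the 3D-text term, assuming the embeddings have already been extracted and come in matching batch order (hypothetical shapes, not the authors' implementation); the 3D-image term is analogous:

```python
# Hedged sketch of CLIP-style contrastive alignment between 3D and text features.
# point_feats / text_feats are assumed to come from a 3D encoder and CLIP's text
# encoder for the same batch of objects, row i of each matrix describing object i.
import torch
import torch.nn.functional as F

def contrastive_alignment(point_feats, text_feats, temperature=0.07):
    point_feats = F.normalize(point_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = point_feats @ text_feats.t() / temperature     # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: matched (3D, text) pairs on the diagonal are positives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = contrastive_alignment(torch.randn(8, 512), torch.randn(8, 512))
```

Prompt tuning, which the paper uses to adapt CLIP's visual branch to rendered views, is orthogonal to this loss and omitted here.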
2303.11306 Report Localizing Object-level Shape Variations with Text-to-Image Diffusion Models Or Patashnik, Daniel Garibi, Idan Azuri, Hadar Averbuch-Elor, Daniel Cohen-Or Text-to-image models give rise to workflows which often begin with an exploration step, where users sift through a large collection of generated images. The global nature of the text-to-image generation process prevents users from narrowing their exploration to a particular object in the image. In this paper, we present a technique to generate a collection of images that depicts variations in the shape of a specific object, enabling an object-level shape exploration process. Creating plausible variations is challenging as it requires control over the shape of the generated object while respecting its semantics. A particular challenge when generating object variations is accurately localizing the manipulation applied over the object's shape. We introduce a prompt-mixing technique that switches between prompts along the denoising process to attain a variety of shape choices. To localize the image-space operation, we present two techniques that use the self-attention layers in conjunction with the cross-attention layers. Moreover, we show that these localization techniques are general and effective beyond the scope of generating object variations. Extensive results and comparisons demonstrate the effectiveness of our method in generating object variations, and the competence of our localization techniques. This paper introduces a technique to generate variations in the shape of a specific object within an image using text-to-image diffusion models, enabling object-level shape exploration. Existing text-to-image generation methods lack object-level control, making it difficult to refine specific objects during exploration. This method addresses this limitation by allowing users to generate and explore variations of a chosen object within an image. The method uses a prompt-mixing technique that leverages the coarse-to-fine nature of the denoising process. It uses different prompts in different time intervals to control the image layout, object shape, and fine details. It also introduces two localization techniques: one based on injecting self-attention maps from the original image to preserve shapes of other objects, and another using a novel self-segmentation method based on attention maps to preserve the background and selected objects. The method generates a diverse range of plausible shape variations for a specific object in an image, while preserving the overall image composition. The introduced localization techniques effectively preserve the shapes of other objects in the image and the background, resulting in more coherent and realistic variations. Quantitative and qualitative comparisons with other methods demonstrate the effectiveness of the proposed method in generating diverse and faithful object-level variations while preserving the original image content. The automatic proxy word selection, while generally effective, may sometimes produce unexpected words that lead to less plausible shape variations. The method currently explores a discrete word space for proxy words, which could be extended to a continuous space for more nuanced control over shape variations. text-to-image synthesis, diffusion models, object-level control, shape exploration, image editing
2303.11162 Report Picture that Sketch: Photorealistic Image Generation from Abstract Sketches Subhadeep Koley, Ayan Kumar Bhunia, Aneeshan Sain, Pinaki Nath Chowdhury, Tao Xiang, Yi-Zhe Song Given an abstract, deformed, ordinary sketch from untrained amateurs like you and me, this paper turns it into a photorealistic image - just like those shown in Fig. 1(a), all non-cherry-picked. We differ significantly from prior art in that we do not dictate an edgemap-like sketch to start with, but aim to work with abstract free-hand human sketches. In doing so, we essentially democratise the sketch-to-photo pipeline, "picturing" a sketch regardless of how good you sketch. Our contribution at the outset is a decoupled encoder-decoder training paradigm, where the decoder is a StyleGAN trained on photos only. This importantly ensures that generated results are always photorealistic. The rest is then all centred around how best to deal with the abstraction gap between sketch and photo. For that, we propose an autoregressive sketch mapper trained on sketch-photo pairs that maps a sketch to the StyleGAN latent space. We further introduce specific designs to tackle the abstract nature of human sketches, including a fine-grained discriminative loss on the back of a trained sketch-photo retrieval model, and a partial-aware sketch augmentation strategy. Finally, we showcase a few downstream tasks our generation model enables, amongst them showing how fine-grained sketch-based image retrieval, a well-studied problem in the sketch community, can be reduced to an image (generated) to image retrieval task, surpassing the state of the art. We put forward generated results in the supplementary for everyone to scrutinise. This paper presents a novel method to generate photorealistic images from abstract, deformed, amateur sketches. Existing sketch-to-photo methods rely on pixel-aligned edgemaps and fail to produce realistic outputs from abstract human sketches. This work aims to democratize sketch-to-photo generation by enabling photorealistic generation from untrained amateur sketches. The proposed method employs a decoupled encoder-decoder training paradigm. A StyleGAN, pre-trained on photos, acts as a decoder, ensuring photorealistic outputs. An autoregressive sketch mapper, trained on sketch-photo pairs, encodes sketches into the StyleGAN's latent space. A fine-grained discriminative loss and a partial-aware sketch augmentation strategy further enhance the generation from abstract sketches. The proposed method significantly outperforms state-of-the-art methods in terms of photorealism and fidelity to the input sketch. The model exhibits strong generalization ability, effectively handling sketches with varying levels of abstraction and noise. The generated model enables downstream applications such as fine-grained sketch-based image retrieval and precise semantic editing. The quality of generated photos is limited by the diversity and quality of the training data. Future work can explore more sophisticated architectures for the sketch mapper and incorporate additional constraints for enhanced control over the generation process. sketch-to-photo generation, generative adversarial networks (gans), image-to-image translation, stylegan, fine-grained image retrieval
2303.11120 Report Positional Diffusion: Ordering Unordered Sets with Diffusion Probabilistic Models Francesco Giuliari, Gianluca Scarpellini, Stuart James, Yiming Wang, Alessio Del Bue Positional reasoning is the process of ordering unsorted parts contained in a set into a consistent structure. We present Positional Diffusion, a plug-and-play graph formulation with Diffusion Probabilistic Models to address positional reasoning. We use the forward process to map elements' positions in a set to random positions in a continuous space. Positional Diffusion learns to reverse the noising process and recover the original positions through an Attention-based Graph Neural Network. We conduct extensive experiments with benchmark datasets including two puzzle datasets, three sentence ordering datasets, and one visual storytelling dataset, demonstrating that our method outperforms long-lasting research on puzzle solving with up to +18% compared to the second-best deep learning method, and performs on par against the state-of-the-art methods on sentence ordering and visual storytelling. Our work highlights the suitability of diffusion models for ordering problems and proposes a novel formulation and method for solving various ordering tasks. Project website at https://iit-pavis.github.io/Positional_Diffusion/ Positional Diffusion is a novel graph-based Diffusion Probabilistic Model (DPM) for positional reasoning, addressing the challenge of ordering elements in unordered sets. Positional reasoning is a fundamental human skill crucial for various tasks, and a robust, task-agnostic method for addressing this challenge is highly desirable. The method uses a graph representation of the set, where each element is a node, and employs an Attention-based Graph Neural Network (GNN) within a DPM framework. During training, the model learns to reverse a noising process applied to node positions, guided by node features. At inference, it iteratively refines initially random positions to recover the correct order. Positional Diffusion achieves state-of-the-art performance on puzzle solving, outperforming previous methods by a significant margin. It demonstrates competitive results on sentence ordering, achieving state-of-the-art performance on a subset of benchmark datasets. The model also exhibits strong performance on visual storytelling, on par with state-of-the-art methods designed specifically for this task. The performance of Positional Diffusion on sentence ordering tasks with loosely structured text, such as ROCStories, is relatively weaker. Future work includes exploring different graph structures beyond fully connected graphs to potentially enhance performance. positional reasoning, diffusion probabilistic models, graph neural networks, puzzle solving, sentence ordering, visual storytelling
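Stripped of the GNN details, the core idea is that the diffusion forward process is applied to the elements' positions (e.g., 1D order indices or 2D puzzle coordinates) while the element features stay fixed, and the network is trained to predict the injected noise. A minimal sketch of one training step under standard DDPM noising, with a hypothetical `denoiser` standing in for the attention-based GNN (a sketch under stated assumptions, not the authors' code):

```python
# Hedged sketch of a Positional-Diffusion-style training step.
# Assumptions: positions are normalized to [-1, 1]; `denoiser(noisy_pos, feats, t)`
# is an attention-based GNN that predicts the noise added to the positions.
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def training_step(denoiser, positions, feats):
    # positions: (B, N, D) ground-truth element positions; feats: (B, N, F) element features
    B = positions.size(0)
    t = torch.randint(0, T, (B,))
    noise = torch.randn_like(positions)
    a = alphas_bar[t].view(B, 1, 1)
    noisy_pos = a.sqrt() * positions + (1 - a).sqrt() * noise   # forward (noising) process
    pred_noise = denoiser(noisy_pos, feats, t)                  # reverse-process network
    return F.mse_loss(pred_noise, noise)
```

At inference one would start from random positions and iteratively denoise them, conditioning on the fixed element features, until a consistent ordering emerges.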
2303.11108 Report CHATEDIT: Towards Multi-turn Interactive Facial Image Editing via Dialogue Xing Cui, Zekun Li, Peipei Li, Yibo Hu, Hailin Shi, Zhaofeng He This paper explores interactive facial image editing via dialogue and introduces the ChatEdit benchmark dataset for evaluating image editing and conversation abilities in this context. ChatEdit is constructed from the CelebA-HQ dataset, incorporating annotated multi-turn dialogues corresponding to user edit requests on the images. The dataset is challenging, as it requires the system to dynamically track user requests, edit images, and generate appropriate responses. Accordingly, we propose three benchmark tasks: (i) user edit request tracking, (ii) image editing, and (iii) response generation. We present a novel baseline framework that integrates a dialogue module for both tracking user requests and generating responses and an image editing module for image editing. Unlike previous approaches, our framework directly tracks user edit requests from the entire dialogue history up to the current turn and modifies the original image rather than adjusting the previous turn's output, thereby reducing error accumulation and preventing attribute forgetfulness. Extensive experiments on the ChatEdit dataset underline our framework's superior performance against prior models, while also highlighting potential room for further research. We will release the code and data publicly to facilitate advancements in complex interactive facial image editing. This paper introduces ChatEdit, a benchmark dataset for multi-turn interactive facial image editing via dialogue, and proposes a novel framework combining a task-oriented dialogue module and an image editing module. Existing approaches to interactive image editing suffer from error accumulation, attribute forgetting, and limited response generation capabilities, highlighting the need for a dedicated benchmark and improved methods. ChatEdit is constructed from CelebA-HQ, enriched with multi-turn dialogues and user belief states. The proposed framework leverages a T5-based dialogue module for request tracking and response generation, and StyleCLIP for image editing guided by extracted requests. The proposed framework outperforms single-turn editing methods, demonstrating reduced error accumulation and improved image quality. Extracting concise user requests from dialogues significantly improves performance compared to directly using raw dialogue context. Human evaluations confirm that the framework generates higher-quality images and more engaging, human-like responses compared to previous methods. The dataset currently considers a limited set of editable attributes, potentially hindering generalization to out-of-domain user requests. The current two-stage framework might benefit from end-to-end training to further enhance image editing quality. interactive image editing, facial image manipulation, dialogue systems, task-oriented dialogue, benchmark dataset
2303.11086 Report Pluralistic Aging Diffusion Autoencoder Peipei Li, Rui Wang, Huaibo Huang, Ran He, Zhaofeng He Face aging is an ill-posed problem because multiple plausible aging patterns may correspond to a given input. Most existing methods often produce one deterministic estimation. This paper proposes a novel CLIP-driven Pluralistic Aging Diffusion Autoencoder (PADA) to enhance the diversity of aging patterns. First, we employ diffusion models to generate diverse low-level aging details via a sequential denoising reverse process. Second, we present Probabilistic Aging Embedding (PAE) to capture diverse high-level aging patterns, which represents age information as probabilistic distributions in the common CLIP latent space. A text-guided KL-divergence loss is designed to guide this learning. Our method can achieve pluralistic face aging conditioned on open-world aging texts and arbitrary unseen face images. Qualitative and quantitative experiments demonstrate that our method can generate more diverse and high-quality plausible aging results. This paper proposes PADA, a CLIP-driven Pluralistic Aging Diffusion Autoencoder, to generate diverse and plausible face aging results. Face aging is an ill-posed problem as multiple plausible aging patterns can exist for a single input. Existing methods often produce only one deterministic estimation, limiting their realism. PADA uses diffusion models for diverse low-level aging details and introduces Probabilistic Aging Embedding (PAE) to capture high-level aging patterns as distributions in the CLIP latent space. Text-guided KL-divergence loss aids in learning PAE. PADA generates more diverse aging results than state-of-the-art methods, capturing both high-level (e.g., shape, skin color) and low-level (e.g., wrinkles) variations. PADA achieves superior aging accuracy and identity preservation compared to existing methods. The method allows for flexible user interaction, enabling aging based on open-world text descriptions and arbitrary reference images. The conflict between aging accuracy and identity preservation requires careful balancing. The reliance on pre-trained models might introduce biases. face aging, diffusion models, clip, probabilistic embedding, generative models
2303.11073 Report Discovering Interpretable Directions in the Semantic Latent Space of Diffusion Models René Haas, Inbar Huberman-Spiegelglas, Rotem Mulayoff, Tomer Michaeli Denoising Diffusion Models (DDMs) have emerged as a strong competitor to Generative Adversarial Networks (GANs). However, despite their widespread use in image synthesis and editing applications, their latent space is still not as well understood. Recently, a semantic latent space for DDMs, coined 'h-space', was shown to facilitate semantic image editing in a way reminiscent of GANs. The h-space is comprised of the bottleneck activations in the DDM's denoiser across all timesteps of the diffusion process. In this paper, we explore the properties of the h-space and propose several novel methods for finding meaningful semantic directions within it. We start by studying unsupervised methods for revealing interpretable semantic directions in pretrained DDMs. Specifically, we show that global latent directions emerge as the principal components in the latent space. Additionally, we provide a novel method for discovering image-specific semantic directions by spectral analysis of the Jacobian of the denoiser w.r.t. the latent code. Next, we extend the analysis by finding directions in a supervised fashion in unconditional DDMs. We demonstrate how such directions can be found by relying on either a labeled data set of real images or by annotating generated samples with a domain-specific attribute classifier. We further show how to semantically disentangle the found direction by simple linear projection. Our approaches are applicable without requiring any architectural modifications, text-based guidance, CLIP-based optimization, or model fine-tuning. This paper proposes several supervised and unsupervised methods for discovering interpretable directions in the semantic latent space of Denoising Diffusion Models (DDMs). The work is important because it provides new approaches for semantic image editing in DDMs, an area that has been less explored compared to Generative Adversarial Networks (GANs). The authors leverage the recently proposed 'h-space' for DDMs, which comprises the bottleneck activations of the denoiser across all timesteps. They explore unsupervised methods like principal component analysis (PCA) and spectral analysis of the denoiser's Jacobian. They also propose a supervised method that utilizes labeled data or attribute classifiers to find directions corresponding to specific attributes. Principal components in the h-space correspond to global semantic directions like pose, gender, and age. Spectral analysis of the denoiser's Jacobian reveals image-specific semantic directions, enabling localized edits like opening/closing of eyes and mouth. Supervised methods using labeled data or attribute classifiers effectively discover directions for specific attributes and can be disentangled using linear projection. Unsupervised methods show limited interpretability when applied to DDMs trained on less structured datasets. Future work includes exploring alternative unsupervised techniques for diverse datasets and extending the supervised method to more complex attributes. denoising diffusion models, semantic image editing, latent space manipulation, unsupervised learning, supervised learning
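The unsupervised PCA variant described above is straightforward to sketch: collect the denoiser's bottleneck (h-space) activations over many sampled latents and timesteps, run PCA, and use the leading components as global edit directions added to h during sampling. A minimal numpy sketch, assuming the activations have already been gathered into a matrix (the collection code and the DDM sampler are hypothetical and omitted):

```python
# Hedged sketch: global semantic directions as principal components of h-space.
# `h_activations` is assumed to be an (num_samples, h_dim) matrix of flattened
# bottleneck activations collected from the denoiser across many sampling runs.
import numpy as np

def principal_directions(h_activations, k=5):
    mean = h_activations.mean(axis=0, keepdims=True)
    centered = h_activations - mean
    # Right singular vectors = principal components of the h-space activations.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]                      # (k, h_dim) candidate semantic directions

def edit_h(h, direction, strength=2.0):
    # Shift the bottleneck activation along a discovered direction during sampling.
    return h + strength * direction

dirs = principal_directions(np.random.randn(2000, 512))   # stand-in data for illustration
```

The supervised variant in the paper replaces the PCA step with direction estimation from attribute labels; the editing step (adding a direction to h at sampling time) stays the same.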
2303.11052 Report ContraNeRF: Generalizable Neural Radiance Fields for Synthetic-to-real Novel View Synthesis via Contrastive Learning Hao Yang, Lanqing Hong, Aoxue Li, Tianyang Hu, Zhenguo Li, Gim Hee Lee, Liwei Wang Although many recent works have investigated generalizable NeRF-based novel view synthesis for unseen scenes, they seldom consider the synthetic-to-real generalization, which is desired in many practical applications. In this work, we first investigate the effects of synthetic data in synthetic-to-real novel view synthesis and surprisingly observe that models trained with synthetic data tend to produce sharper but less accurate volume densities. For pixels where the volume densities are correct, fine-grained details will be obtained. Otherwise, severe artifacts will be produced. To maintain the advantages of using synthetic data while avoiding its negative effects, we propose to introduce geometry-aware contrastive learning to learn multi-view consistent features with geometric constraints. Meanwhile, we adopt cross-view attention to further enhance the geometry perception of features by querying features across input views. Experiments demonstrate that under the synthetic-to-real setting, our method can render images with higher quality and better fine-grained details, outperforming existing generalizable novel view synthesis methods in terms of PSNR, SSIM, and LPIPS. When trained on real data, our method also achieves state-of-the-art results. This paper proposes ContraNeRF, a novel view synthesis method based on neural radiance fields (NeRF) that generalizes well from synthetic data to real data using contrastive learning with geometry consistency. Synthetic-to-real novel view synthesis, while desired for its cost-effectiveness, is seldom investigated in existing generalizable NeRF methods. This paper observes that models trained on synthetic data often produce sharper but less accurate volume densities on real data, leading to artifacts. ContraNeRF addresses this issue by incorporating geometry awareness during training. The proposed method consists of three key components: (1) Geometry Aware Feature Extraction enhances image features by exchanging information between source views through cross-view attention. (2) Geometry Aware Contrastive Learning utilizes geometric constraints to enhance multi-view consistency by comparing similarities of local features between pairs of source views. (3) Rendering utilizes a coarse-to-fine sampling strategy, accumulating colors along the ray weighted by densities after softmax. ContraNeRF outperforms existing generalizable NeRF methods in synthetic-to-real novel view synthesis, achieving higher PSNR, SSIM, and lower LPIPS on ScanNet dataset. The method also demonstrates state-of-the-art results on DTU and LLFF datasets under the real-to-real setting. Experiments reveal that a small proportion of real data combined with synthetic data can achieve performance comparable to using real data alone. Current methods, including ContraNeRF, struggle to generate high-quality images for highly blurred scenes, a common occurrence in real-world datasets. Future work can explore incorporating deblurring techniques within the synthetic-to-real generalization framework for improved performance. novel view synthesis, neural radiance fields, synthetic-to-real generalization, contrastive learning, geometry awareness
2303.10598 Report StyleRF: Zero-shot 3D Style Transfer of Neural Radiance Fields Kunhao Liu, Fangneng Zhan, Yiwen Chen, Jiahui Zhang, Yingchen Yu, Abdulmotaleb El Saddik, Shijian Lu, Eric Xing 3D style transfer aims to render stylized novel views of a 3D scene with multi-view consistency. However, most existing work suffers from a three-way dilemma over accurate geometry reconstruction, high-quality stylization, and being generalizable to arbitrary new styles. We propose StyleRF (Style Radiance Fields), an innovative 3D style transfer technique that resolves the three-way dilemma by performing style transformation within the feature space of a radiance field. StyleRF employs an explicit grid of high-level features to represent 3D scenes, with which high-fidelity geometry can be reliably restored via volume rendering. In addition, it transforms the grid features according to the reference style which directly leads to high-quality zero-shot style transfer. StyleRF consists of two innovative designs. The first is sampling-invariant content transformation that makes the transformation invariant to the holistic statistics of the sampled 3D points and accordingly ensures multi-view consistency. The second is deferred style transformation of 2D feature maps which is equivalent to the transformation of 3D points but greatly reduces memory footprint without degrading multi-view consistency. Extensive experiments show that StyleRF achieves superior 3D stylization quality with precise geometry reconstruction and it can generalize to various new styles in a zero-shot manner. Presents StyleRF, a method for zero-shot 3D style transfer of neural radiance fields. Enables stylization of 3D scenes using arbitrary artistic styles without requiring paired training data. Leverages a feature grid to represent the 3D scene and employs a novel style transfer network that operates on the feature grid, enabling view-consistent stylization. Achieves high-quality stylization of 3D scenes with fidelity to both the input content and reference styles. Exhibits strong multi-view consistency, producing stylized novel views that align seamlessly. Outperforms existing 2D and 3D style transfer methods in terms of visual quality and view consistency. Limited to relatively simple scenes due to the computational cost of NeRF-based representations. Future work could explore incorporating temporal consistency for stylizing dynamic scenes. 3d style transfer, neural radiance fields, zero-shot learning, computer vision, deep learning
2303.10340 Report 3D Data Augmentation for Driving Scenes on Camera Wenwen Tong, Jiangwei Xie, Tianyu Li, Hanming Deng, Xiangwei Geng, Ruoyi Zhou, Dingchen Yang, Bo Dai, Lewei Lu, Hongyang Li Driving scenes are so diverse and complicated that it is impossible to collect all cases with human effort alone. While data augmentation is an effective technique to enrich the training data, existing methods for camera data in autonomous driving applications are confined to the 2D image plane, which may not optimally increase data diversity in 3D real-world scenarios. To this end, we propose a 3D data augmentation approach termed Drive-3DAug, aiming at augmenting the driving scenes on camera in the 3D space. We first utilize Neural Radiance Field (NeRF) to reconstruct the 3D models of background and foreground objects. Then, augmented driving scenes can be obtained by placing the 3D objects with adapted location and orientation at the pre-defined valid region of backgrounds. As such, the training database could be effectively scaled up. However, the 3D object modeling is constrained by the image quality and the limited viewpoints. To overcome these problems, we modify the original NeRF by introducing a geometric rectified loss and a symmetric-aware training strategy. We evaluate our method for the camera-only monocular 3D detection task on the Waymo and nuScenes datasets. The proposed data augmentation approach contributes to a gain of 1.7% and 1.4% in terms of detection accuracy, on Waymo and nuScenes respectively. Furthermore, the constructed 3D models serve as digital driving assets and could be recycled for different detectors or other 3D perception tasks. This paper proposes Drive-3DAug, a novel 3D data augmentation approach for camera-based 3D perception in autonomous driving, which leverages NeRF to reconstruct and manipulate 3D models of driving scenes. Existing data augmentation methods for camera data are limited to 2D image manipulations, hindering diversity in generated driving scenes which is crucial for improving 3D perception, especially in handling challenging long-tail scenarios. The approach uses NeRF to reconstruct 3D models of backgrounds and foreground objects from driving scenes. It introduces a geometric rectified loss to handle imperfect object extraction and a symmetric-aware training strategy to enhance viewpoint diversity. The augmented scenes are created by placing objects in valid regions of the backgrounds while considering physical constraints. Drive-3DAug significantly improves monocular 3D detection performance, achieving a 1.7% gain on Waymo and 1.4% on nuScenes. It effectively addresses challenges in previous methods by enabling realistic object rotation and translation in 3D space. Reconstructed 3D models serve as reusable digital driving assets, benefiting various perception tasks. The current method primarily augments data under good illumination conditions and with limited object classes. Future work includes expanding the approach to encompass diverse weather conditions and a wider range of objects. data augmentation, 3d object detection, autonomous driving, neural radiance fields (nerf), digital driving assets
2303.10137 Report A Recipe for Watermarking Diffusion Models Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Ngai-Man Cheung, Min Lin Diffusion models (DMs) have demonstrated advantageous potential on generative tasks. Widespread interest exists in incorporating DMs into downstream applications, such as producing or editing photorealistic images. However, practical deployment and unprecedented power of DMs raise legal issues, including copyright protection and monitoring of generated content. In this regard, watermarking has been a proven solution for copyright protection and content monitoring, but it is underexplored in the DMs literature. Specifically, DMs generate samples from longer tracks and may have newly designed multimodal structures, necessitating the modification of conventional watermarking pipelines. To this end, we conduct comprehensive analyses and derive a recipe for efficiently watermarking state-of-the-art DMs (e.g., Stable Diffusion), via training from scratch or finetuning. Our recipe is straightforward but involves empirically ablated implementation details, providing a foundation for future research on watermarking DMs. The code is available at https://github.com/yunqing-me/WatermarkDM. This paper presents a comprehensive empirical study and derives a practical recipe for watermarking diffusion models (DMs) for copyright protection and content monitoring. The widespread use of DMs in generating realistic images raises legal concerns about copyright and the proliferation of generated content. - For unconditional/class-conditional DMs, embed predefined watermarks directly into training data using an encoder-decoder architecture. - For text-to-image DMs, finetune pretrained models to generate a specific watermark image in response to a trigger prompt, using a weight-constrained regularization to minimize performance degradation. Watermarks can be reliably embedded in and recovered from both unconditional/class-conditional and text-to-image DMs. Increasing the complexity of watermarks can lead to a degradation of the generative performance of DMs. Weight-constrained finetuning effectively mitigates performance degradation in text-to-image DMs during watermark embedding. Embedding complex watermarks can negatively impact the quality of generated images. Future research can explore more sophisticated watermarking techniques to further improve robustness and minimize performance impact. diffusion models, watermarking, copyright protection, content monitoring, generative models
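For the text-to-image case, the recipe above amounts to a standard diffusion finetuning loss on (trigger prompt, watermark image) pairs plus a penalty keeping the finetuned weights close to the pretrained ones. A heavily simplified sketch of that objective, with hypothetical `diffusion_loss` and `pretrained` placeholders (a sketch of the idea, not the repository's actual code):

```python
# Hedged sketch of weight-constrained watermark finetuning for a text-to-image DM.
# Assumptions: `diffusion_loss(model, image, prompt)` is the usual noise-prediction
# loss; `pretrained` holds a frozen copy of the original model's parameters.
import torch

def watermark_finetune_loss(model, pretrained, wm_image, trigger_prompt,
                            diffusion_loss, lam=1e-4):
    # Teach the model to emit the predefined watermark image for the trigger prompt.
    task_loss = diffusion_loss(model, wm_image, trigger_prompt)
    # Weight-constrained regularization: stay close to the pretrained parameters so
    # ordinary generation quality degrades as little as possible.
    drift = sum((p - q).pow(2).sum()
                for p, q in zip(model.parameters(), pretrained.parameters()))
    return task_loss + lam * drift
```

Verification then consists of prompting the finetuned model with the trigger and checking whether the watermark image is reproduced; the unconditional/class-conditional recipe instead embeds bit-string watermarks into the training images themselves.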
2303.10126 Report IRGen: Generative Modeling for Image Retrieval Yidan Zhang, Ting Zhang, Dong Chen, Yujing Wang, Qi Chen, Xing Xie, Hao Sun, Weiwei Deng, Qi Zhang, Fan Yang, Mao Yang, Qingmin Liao, Baining Guo While generative modeling has been ubiquitous in natural language processing and computer vision, its application to image retrieval remains unexplored. In this paper, we recast image retrieval as a form of generative modeling by employing a sequence-to-sequence model, contributing to the current unified theme. Our framework, IRGen, is a unified model that enables end-to-end differentiable search, thus achieving superior performance thanks to direct optimization. While developing IRGen we tackle the key technical challenge of converting an image into quite a short sequence of semantic units in order to enable efficient and effective retrieval. Empirical experiments demonstrate that our model yields significant improvement on three commonly used benchmarks, for example, 22.9% higher than the best baseline method in precision@10 on the In-shop dataset with a comparable recall@10 score. This paper recasts image retrieval as a generative modeling problem, proposing IRGen, a sequence-to-sequence model that predicts discrete visual tokens representing a query image's nearest neighbor. Existing image retrieval pipelines, relying on separate feature extraction and ANN search stages, lack end-to-end optimization. IRGen aims to bridge this gap, offering direct optimization for superior performance. IRGen utilizes a novel semantic image tokenizer that compresses image representations into short sequences of semantically meaningful tokens. A Transformer-based encoder-decoder architecture then predicts identifiers of nearest neighbors during retrieval. IRGen outperforms state-of-the-art image retrieval methods on the In-shop Clothes, CUB200, and Cars196 datasets, demonstrating superior precision. The model shows promising scalability, achieving excellent results on million-level datasets like ImageNet and Places365. The proposed semantic image tokenizer proves more effective than random identifiers or those derived from hierarchical k-means or RQ-VAE. Handling billion-scale datasets efficiently requires further research on balancing model capacity and inference speed. Efficiently updating the model with fresh data without retraining remains an open challenge. image retrieval, generative modeling, sequence-to-sequence, semantic image tokenizer, transformer
2303.10083 Report αSurf: Implicit Surface Reconstruction for Semi-Transparent and Thin Objects with Decoupled Geometry and Opacity Tianhao Wu, Hanxue Liang, Fangcheng Zhong, Gernot Riegler, Shimon Vainer, Cengiz Oztireli Implicit surface representations such as the signed distance function (SDF) have emerged as a promising approach for image-based surface reconstruction. However, existing optimization methods assume solid surfaces and are therefore unable to properly reconstruct semi-transparent surfaces and thin structures, which also exhibit low opacity due to the blending effect with the background. While neural radiance field (NeRF) based methods can model semi-transparency and achieve photo-realistic quality in synthesized novel views, their volumetric geometry representation tightly couples geometry and opacity, and therefore cannot be easily converted into surfaces without introducing artifacts. We present αSurf, a novel surface representation with decoupled geometry and opacity for the reconstruction of semi-transparent and thin surfaces where the colors mix. Ray-surface intersections on our representation can be found in closed-form via analytical solutions of cubic polynomials, avoiding Monte-Carlo sampling, and the formulation is fully differentiable by construction. Our qualitative and quantitative evaluations show that our approach can accurately reconstruct surfaces with semi-transparent and thin parts with fewer artifacts, achieving better reconstruction quality than state-of-the-art SDF and NeRF methods. Website: https://alphasurf.netlify.app/ αSurf is a novel grid-based surface representation for reconstructing semi-transparent and thin objects with decoupled geometry and opacity. Existing SDF-based methods struggle to reconstruct surfaces exhibiting semi-transparency or thin structures due to the assumption of solid surfaces. NeRF-based methods can model semi-transparency but their volumetric geometry representation tightly couples geometry and opacity. The representation utilizes separate values on a grid to model geometry, opacity, and appearance. It leverages a closed-form solution for finding ray-surface intersections via cubic polynomial root finding and employs differentiable alpha compositing during rendering. The method is initialized from a pre-trained Plenoxels model and incorporates surface-specific regularization for optimization. αSurf accurately reconstructs surfaces with semi-transparent and thin parts with fewer artifacts than state-of-the-art SDF and NeRF methods. It produces higher quality surfaces, as evidenced by lower Chamfer distance scores compared to baselines. The method effectively removes noisy inner surfaces and density floaters commonly present in NeRF reconstructions. The reconstructed surfaces tend to be less smooth compared to MLP-based SDF methods due to the lack of spatial smoothness encoded in the MLP. The method currently requires a separate background model for 360° real-world scenes. surface reconstruction, semi-transparent surfaces, thin structures, implicit surface representation, differentiable rendering
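Because the scalar field lives on a grid with trilinear interpolation, its value along a ray segment inside one cell is a cubic polynomial in the ray parameter, so the level-set crossing can be found analytically rather than by sampling. A small numpy sketch of that single step, assuming the cubic coefficients for the segment have already been derived from the cell's corner values (hypothetical helper, not the authors' implementation):

```python
# Hedged sketch: closed-form ray/level-set intersection inside one grid cell.
# Trilinear interpolation along the segment gives f(t) = c3*t^3 + c2*t^2 + c1*t + c0;
# the surface crossing is the smallest real root of f(t) = level with t in [0, 1].
import numpy as np

def first_intersection(c3, c2, c1, c0, level=0.0):
    roots = np.roots([c3, c2, c1, c0 - level])            # analytic cubic solve
    real = roots[np.isclose(roots.imag, 0.0)].real
    valid = real[(real >= 0.0) & (real <= 1.0)]           # stay inside the cell segment
    return valid.min() if valid.size else None            # None = no hit in this cell

t_hit = first_intersection(0.3, -1.2, 0.9, 0.1)           # example coefficients
```

In the full method this intersection search is repeated per cell along the ray, and the resulting hits are alpha-composited with the decoupled opacity values; that rendering loop is omitted here.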
2303.10073 Report DialogPaint: A Dialog-based Image Editing Model Jingxuan Wei, Shiyu Wu, Xin Jiang, Yequan Wang We introduce DialogPaint, a novel framework that bridges conversational interactions with image editing, enabling users to modify images through natural dialogue. By integrating a dialogue model with the Stable Diffusion image transformation technique, DialogPaint offers a more intuitive and interactive approach to image modifications. Our method stands out by effectively interpreting and executing both explicit and ambiguous instructions, handling tasks such as object replacement, style transfer, and color modification. Notably, DialogPaint supports iterative, multi-round editing, allowing users to refine image edits over successive interactions. Comprehensive evaluations highlight the robustness and versatility of our approach, marking a significant advancement in dialogue-driven image editing. Introduces DialogPaint, a novel framework for interactive image editing through multi-round dialogue. Addresses limitations of existing image editing methods that struggle with ambiguous instructions and lack intuitive human-computer interaction. Combines a dialogue model with Stable Diffusion image editing. Leverages self-instruct methodology to generate synthetic dialogue and image pairs for model training. Outperforms baseline models like InstructPix2Pix in preserving background details and isolating object edits. Successfully handles multi-turn edits, including object addition/removal, color modifications, and scene transformations. Demonstrates robustness and versatility across diverse image editing tasks, evidenced by quantitative metrics (Perplexity, FID, PRD) and positive user feedback (Overall Satisfaction, MOS). Limited dataset diversity and volume may hinder performance in complex editing scenarios. Future work includes refining the model's ability to balance transformation with preservation for more natural edits. dialogue-based image editing, natural language processing, image transformation, multi-round interactions, stable diffusion
2303.09833 Report FreeDoM: Training-Free Energy-Guided Conditional Diffusion Model Jiwen Yu, Yinhuai Wang, Chen Zhao, Bernard Ghanem, Jian Zhang Recently, conditional diffusion models have gained popularity in numerous applications due to their exceptional generation ability. However, many existing methods are training-required. They need to train a time-dependent classifier or a condition-dependent score estimator, which increases the cost of constructing conditional diffusion models and is inconvenient to transfer across different conditions. Some current works aim to overcome this limitation by proposing training-free solutions, but most can only be applied to a specific category of tasks and not to more general conditions. In this work, we propose a training-Free conditional Diffusion Model (FreeDoM) used for various conditions. Specifically, we leverage off-the-shelf pre-trained networks, such as a face detection model, to construct time-independent energy functions, which guide the generation process without requiring training. Furthermore, because the construction of the energy function is very flexible and adaptable to various conditions, our proposed FreeDoM has a broader range of applications than existing training-free methods. FreeDoM is advantageous in its simplicity, effectiveness, and low cost. Experiments demonstrate that FreeDoM is effective for various conditions and suitable for diffusion models of diverse data domains, including image and latent code domains. Proposes FreeDoM, a training-free method for conditional diffusion models using off-the-shelf pre-trained networks to construct time-independent energy functions for guidance. Addresses the inflexibility and high cost of retraining conditional diffusion models for new conditions. Approximates time-dependent energy functions using time-independent distance measuring functions based on pre-trained networks. Employs an efficient time-travel strategy for large data domains. Constructs energy functions by projecting conditions and intermediate results into the same feature space for distance measurement. Generates high-quality images consistent with diverse conditions including text, segmentation maps, sketches, landmarks, face IDs, and style images. Offers controllability by adjusting the learning rate of the energy function. Demonstrates faster inference speed and better alignment with conditioned style images compared to UGD. Sampling time is higher than training-required methods due to energy function derivative computation and time-travel strategy. Controlling fine-grained structure features in large data domains using the energy function is difficult. conditional diffusion models, training-free, energy guidance, image generation, time-travel strategy
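In practice the guidance reduces to: at each reverse step, estimate the clean image from the current noisy sample, measure a differentiable distance between a pre-trained network's view of that estimate and the condition, and push the sample along the negative gradient of that distance. A minimal PyTorch-style sketch of one guided step, with hypothetical `predict_x0`, `ddpm_step`, and `distance` helpers (a sketch of the idea, not the official code; the paper's time-travel strategy is omitted):

```python
# Hedged sketch of a FreeDoM-style training-free guidance step.
# Assumptions: `distance(x0, condition)` returns a scalar energy built from an
# off-the-shelf network (e.g., an ID distance from a face-recognition model);
# `predict_x0` and `ddpm_step` implement the standard DDPM formulas.
import torch

def guided_reverse_step(x_t, t, denoiser, condition, distance,
                        predict_x0, ddpm_step, guidance_scale=1.0):
    x_t = x_t.detach().requires_grad_(True)
    eps = denoiser(x_t, t)
    x0_hat = predict_x0(x_t, eps, t)                # approximate clean image
    energy = distance(x0_hat, condition)            # time-independent energy function
    grad = torch.autograd.grad(energy, x_t)[0]      # gradient of the energy w.r.t. x_t
    x_prev = ddpm_step(x_t, eps, t)                 # unconditional DDPM update
    return x_prev - guidance_scale * grad           # steer the sample toward the condition
```

Swapping in a different pre-trained network for `distance` is what makes the approach adaptable to new conditions without retraining.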
2303.09826 Report Learning Data-Driven Vector-Quantized Degradation Model for Animation Video Super-Resolution Zixi Tuo, Huan Yang, Jianlong Fu, Yujie Dun, Xueming Qian Existing real-world video super-resolution (VSR) methods focus on designing a general degradation pipeline for open-domain videos while ignoring data-intrinsic characteristics, which strongly limits their performance when applied to specific domains (e.g., animation videos). In this paper, we thoroughly explore the characteristics of animation videos and leverage the rich priors in real-world animation data for a more practical animation VSR model. In particular, we propose a multi-scale Vector-Quantized Degradation model for animation video Super-Resolution (VQD-SR) to decompose the local details from global structures and transfer the degradation priors in real-world animation videos to a learned vector-quantized codebook for degradation modeling. A rich-content Real Animation Low-quality (RAL) video dataset is collected for extracting the priors. We further propose a data enhancement strategy for high-resolution (HR) training videos based on our observation that existing HR videos are mostly collected from the Web and contain conspicuous compression artifacts. The proposed strategy is effective in lifting the upper bound of animation VSR performance, regardless of the specific VSR model. Experimental results demonstrate the superiority of the proposed VQD-SR over state-of-the-art methods, through extensive quantitative and qualitative evaluations on the latest animation video super-resolution benchmark. The code and pre-trained models can be downloaded at https://github.com/researchmm/VQD-SR. This paper introduces VQD-SR, a novel multi-scale vector-quantized degradation model for animation video super-resolution. Existing real-world video super-resolution methods often fail to generalize to the animation domain as they disregard the inherent characteristics of such videos, resulting in subpar outcomes. The authors collected a large-scale Real Animation Low-quality (RAL) video dataset to study real-world animation degradation. They designed a multi-scale VQGAN trained on RAL to learn and transfer degradation priors. A stochastic top-k VQ strategy expands the degradation space for better generalization. Lastly, they proposed an HR-SR data enhancement strategy to improve the quality of HR training videos. VQD-SR outperforms state-of-the-art methods in quantitative metrics (MANIQA) on the AVC-RealLQ benchmark. Qualitative comparisons demonstrate VQD-SR's ability to restore sharper lines, reduce artifacts, and handle intended scenarios like out-of-focus blur more naturally. Extensive ablation studies validate the effectiveness of the VQ degradation model, HR-SR enhancement strategy, and other design choices. VQD-SR may still struggle with extreme cases of degradation, such as severe color distortions. The HR-SR enhancement strategy, while effective for animation, may not directly apply to natural videos due to their complexity. animation video super-resolution, degradation modeling, vector quantization, vqgan, data enhancement
2303.09813 Report DiffusionSeg: Adapting Diffusion Towards Unsupervised Object Discovery Chaofan Ma, Yuhuan Yang, Chen Ju, Fei Zhang, Jinxiang Liu, Yu Wang, Ya Zhang, Yanfeng Wang Learning from a large corpus of data, pre-trained models have achieved impressive progress nowadays. As popular generative pre-training, diffusion models capture both low-level visual knowledge and high-level semantic relations. In this paper, we propose to exploit such knowledgeable diffusion models for mainstream discriminative tasks, i.e., unsupervised object discovery: saliency segmentation and object localization. However, the challenges exist as there is one structural difference between generative and discriminative models, which limits the direct use. Besides, the lack of explicitly labeled data significantly limits performance in unsupervised settings. To tackle these issues, we introduce DiffusionSeg, one novel synthesis-exploitation framework containing two-stage strategies. To alleviate data insufficiency, we synthesize abundant images, and propose a novel training-free AttentionCut to obtain masks in the first synthesis stage. In the second exploitation stage, to bridge the structural gap, we use the inversion technique, to map the given image back to diffusion features. These features can be directly used by downstream architectures. Extensive experiments and ablation studies demonstrate the superiority of adapting diffusion for unsupervised object discovery. This paper proposes DiffusionSeg, a novel synthesis-exploitation framework leveraging the visual knowledge of pre-trained text-to-image diffusion models for unsupervised object discovery, including saliency segmentation and object localization. This work is significant as it explores the potential of generative pre-trained models for mainstream discriminative tasks, aiming to bridge the gap between generative and discriminative modeling. The synthesis stage generates abundant image-mask pairs by leveraging cross- and self-attention within a pre-trained diffusion model, utilizing a novel training-free method called AttentionCut. The exploitation stage employs diffusion inversion to map a given image back to diffusion features, which are then used by a lightweight decoder trained on the synthetic data for object discovery. DiffusionSeg achieves state-of-the-art performance on six standard object discovery benchmarks, surpassing previous methods in both saliency segmentation and object localization. The analysis of the synthesized dataset demonstrates its ability to reliably simulate real-world data properties, enabling effective training of object discovery models. Ablation studies validate the effectiveness of each component, highlighting the importance of AttentionCut and the CLIP-classifiable prior for knowledge extraction and object discovery. The current method primarily focuses on single-object discovery, and extending it to multi-object scenarios requires further investigation. The computational cost of diffusion inversion remains a challenge for real-time applications, requiring exploration of efficient inference strategies. diffusion models, unsupervised object discovery, saliency segmentation, object localization, generative pre-training
2303.09604 Report DS-Fusion: Artistic Typography via Discriminated and Stylized Diffusion Maham Tanveer, Yizhi Wang, Ali Mahdavi-Amiri, Hao Zhang We introduce a novel method to automatically generate an artistic typography by stylizing one or more letter fonts to visually convey the semantics of an input word, while ensuring that the output remains readable. To address an assortment of challenges with our task at hand including conflicting goals (artistic stylization vs. legibility), lack of ground truth, and immense search space, our approach utilizes large language models to bridge texts and visual images for stylization and build an unsupervised generative model with a diffusion model backbone. Specifically, we employ the denoising generator in Latent Diffusion Model (LDM), with the key addition of a CNN-based discriminator to adapt the input style onto the input text. The discriminator uses rasterized images of a given letter/word font as real samples and output of the denoising generator as fake samples. Our model is coined DS-Fusion for discriminated and stylized diffusion. We showcase the quality and versatility of our method through numerous examples, qualitative and quantitative evaluation, as well as ablation studies. User studies comparing to strong baselines including CLIPDraw and DALL-E 2, as well as artist-crafted typographies, demonstrate strong performance of DS-Fusion. DS-Fusion, a novel method for automatically generating artistic typography by stylizing letter fonts to visually convey the semantics of an input word while maintaining readability. Artistic typography is challenging due to conflicting goals (artistic stylization vs. legibility), lack of ground truth, and an immense search space. Employs Latent Diffusion Model with a CNN-based discriminator. Generates style images from the input word using LDM. Fine-tunes the denoising generator on these images using diffusion loss and discriminator loss to adapt the input style onto the input text. Generates artistic typography by blending styles into glyph shapes. Outperforms baselines like DALL-E 2 and CLIPDraw in user studies for style and legibility. Demonstrates versatility in accommodating different semantics, letters, and styles. Struggles with multi-letter inputs when style images and letters are dissimilar. Current implementation optimizes for specific style and glyph combinations, limiting generalizability. artistic typography, diffusion models, generative design, text-to-image synthesis, adversarial learning
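A rough sketch of how a CNN discriminator can be combined with the usual denoising objective, as the DS-Fusion row above describes. The discriminator interface, the loss balance `lam`, and the assumption that the generator output is decoded to glyph images are illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def ds_fusion_style_losses(eps_pred, noise, disc, fake_glyphs, real_glyphs, lam=0.1):
    """Denoising loss plus an adversarial term from a CNN discriminator (sketch).

    real_glyphs : rasterized images of the input letter/word font (real samples)
    fake_glyphs : decoded outputs of the denoising generator (fake samples)
    """
    diff_loss = F.mse_loss(eps_pred, noise)

    # Generator side: try to make the stylized glyphs look like the real font rasters.
    fake_logits = disc(fake_glyphs)
    gen_adv = F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))

    # Discriminator side: separate real font rasterizations from generated ones.
    real_logits = disc(real_glyphs)
    fake_logits_d = disc(fake_glyphs.detach())
    disc_loss = (
        F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
        + F.binary_cross_entropy_with_logits(fake_logits_d, torch.zeros_like(fake_logits_d))
    )
    return diff_loss + lam * gen_adv, disc_loss
```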
2303.09556 Report Efficient Diffusion Training via Min-SNR Weighting Strategy Tiankai Hang, Shuyang Gu, Chen Li, Jianmin Bao, Dong Chen, Han Hu, Xin Geng, Baining Guo Denoising diffusion models have been a mainstream approach for image generation, however, training these models often suffers from slow convergence. In this paper, we discovered that the slow convergence is partly due to conflicting optimization directions between timesteps. To address this issue, we treat the diffusion training as a multi-task learning problem, and introduce a simple yet effective approach referred to as Min-SNR-$\gamma$. This method adapts loss weights of timesteps based on clamped signal-to-noise ratios, which effectively balances the conflicts among timesteps. Our results demonstrate a significant improvement in converging speed, 3.4$\times$ faster than previous weighting strategies. It is also more effective, achieving a new record FID score of 2.06 on the ImageNet $256\times256$ benchmark using smaller architectures than that employed in previous state-of-the-art. The code is available at https://github.com/TiankaiHang/Min-SNR-Diffusion-Training. This paper introduces Min-SNR-γ, a novel loss weighting strategy for diffusion model training that addresses the issue of slow convergence caused by conflicting optimization directions between timesteps. Training diffusion models is computationally expensive and slow convergence is a major bottleneck for research. This work tackles this issue, potentially enabling faster experimentation and development of diffusion models. The authors treat diffusion training as a multi-task learning problem and propose the Min-SNR-γ strategy, which assigns loss weights to each timestep based on a clamped signal-to-noise ratio. This approach aims to balance the optimization conflicts between different noise levels during training. Min-SNR-γ significantly accelerates convergence speed, achieving a 3.4x speedup compared to previous weighting strategies. The method effectively balances loss across different noise levels, resulting in a more efficient training process. It achieves state-of-the-art FID score of 2.06 on ImageNet 256x256 benchmark. The paper mainly focuses on image generation and further exploration is needed for other applications of diffusion models. The optimal value for the hyperparameter γ may require task-specific tuning. diffusion models, image generation, multi-task learning, loss weighting, signal-to-noise ratio
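The clamped-SNR weighting described in the Min-SNR-γ row above is simple enough to show directly; the sketch below assumes ε-prediction and a standard `alphas_cumprod` schedule, with illustrative variable names.

```python
import torch

def min_snr_gamma_weights(timesteps, alphas_cumprod, gamma=5.0):
    """Per-timestep loss weights following the Min-SNR-gamma idea (sketch).

    SNR(t) = alpha_bar_t / (1 - alpha_bar_t); for epsilon-prediction the
    clamped weight works out to min(SNR(t), gamma) / SNR(t).
    """
    a_bar = alphas_cumprod[timesteps]
    snr = a_bar / (1.0 - a_bar)
    return snr.clamp(max=gamma) / snr

# Usage inside a diffusion training step (eps_pred and noise are [B, C, H, W] tensors):
#   w = min_snr_gamma_weights(t, alphas_cumprod)                 # shape [B]
#   loss = (w.view(-1, 1, 1, 1) * (eps_pred - noise) ** 2).mean()
```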
2303.09551 Report SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Jie Zhou, Jiwen Lu 3D scene understanding plays a vital role in vision-based autonomous driving. While most existing methods focus on 3D object detection, they have difficulty describing real-world objects of arbitrary shapes and infinite classes. Towards a more comprehensive perception of a 3D scene, in this paper, we propose a SurroundOcc method to predict the 3D occupancy with multi-camera images. We first extract multi-scale features for each image and adopt spatial 2D-3D attention to lift them to the 3D volume space. Then we apply 3D convolutions to progressively upsample the volume features and impose supervision on multiple levels. To obtain dense occupancy prediction, we design a pipeline to generate dense occupancy ground truth without expensive occupancy annotations. Specifically, we fuse multi-frame LiDAR scans of dynamic objects and static scenes separately. Then we adopt Poisson Reconstruction to fill the holes and voxelize the mesh to get dense occupancy labels. Extensive experiments on nuScenes and SemanticKITTI datasets demonstrate the superiority of our method. Code and dataset are available at https://github.com/weiyithu/SurroundOcc This paper proposes SurroundOcc, a method to predict dense and accurate 3D occupancy from multi-camera images for autonomous driving. 3D occupancy prediction provides a more comprehensive understanding of the scene compared to 3D object detection, which can struggle with real-world objects of arbitrary shapes and classes. The method uses a 2D backbone to extract multi-scale features from each image, then employs 2D-3D spatial attention to lift the information to 3D volume features. A 3D convolution network upsamples and fuses these features to predict occupancy. To train the network, the authors created a pipeline to generate dense occupancy ground truth from sparse LiDAR point clouds and existing 3D detection labels. SurroundOcc achieves state-of-the-art performance on 3D semantic occupancy prediction on the nuScenes dataset. The method also excels in 3D scene reconstruction, outperforming depth estimation methods and other 3D reconstruction approaches. Experiments demonstrate the effectiveness of dense occupancy supervision over sparse LiDAR points for training. The current method only focuses on single-frame occupancy prediction, limiting its application to scenarios like motion prediction where occupancy flow is crucial. The authors plan to extend the framework to predict occupancy flow from multi-frame surrounding images and explore self-supervised occupancy prediction without LiDAR data. 3d occupancy prediction, autonomous driving, multi-camera perception, dense ground truth generation, 2d-3d spatial attention
2303.09522 Report P+: Extended Textual Conditioning in Text-to-Image Generation Andrey Voynov, Qinghao Chu, Daniel Cohen-Or, Kfir Aberman We introduce an Extended Textual Conditioning space in text-to-image models, referred to as $P+$. This space consists of multiple textual conditions, derived from per-layer prompts, each corresponding to a layer of the denoising U-net of the diffusion model. We show that the extended space provides greater disentangling and control over image synthesis. We further introduce Extended Textual Inversion (XTI), where the images are inverted into $P+$, and represented by per-layer tokens. We show that XTI is more expressive and precise, and converges faster than the original Textual Inversion (TI) space. The extended inversion method does not involve any noticeable trade-off between reconstruction and editability and induces more regular inversions. We conduct a series of extensive experiments to analyze and understand the properties of the new space, and to showcase the effectiveness of our method for personalizing text-to-image models. Furthermore, we utilize the unique properties of this space to achieve previously unattainable results in object-style mixing using text-to-image models. Project page: https://prompt-plus.github.io This paper introduces $P+$, an extended textual conditioning space for text-to-image models, which uses multiple textual conditions corresponding to different layers of the denoising U-net, allowing for greater disentanglement and control over image synthesis. This is important because it allows for more fine-grained control over image generation and enables new possibilities for personalization and style mixing. The authors analyze the properties of different U-net layers, revealing their varying influence on image attributes. They then leverage this insight to develop Extended Textual Inversion (XTI), which learns per-layer token embeddings for representing specific concepts. Finally, they showcase the capabilities of $P+$ and XTI through various experiments and a user study. Different layers of the U-net demonstrate distinct sensitivities to image attributes, with coarse layers predominantly influencing shape and structure, while fine layers primarily affect appearance. XTI outperforms the original Textual Inversion (TI) in terms of both subject fidelity and text similarity, while also demonstrating faster convergence. $P+$ enables effective object-style mixing by combining token embeddings from different XTI inversions across various layers. While XTI achieves impressive results, it still falls short of the reconstruction quality achievable by fine-tuning the entire model. The disentanglement of attributes across U-net layers is not perfect, which can limit the level of control in style mixing. text-to-image synthesis, diffusion models, textual inversion, style mixing, personalization
2303.09472 Report DiffIR: Efficient Diffusion Model for Image Restoration Bin Xia, Yulun Zhang, Shiyin Wang, Yitong Wang, Xinglong Wu, Yapeng Tian, Wenming Yang, Luc Van Gool Diffusion model (DM) has achieved SOTA performance by modeling the image synthesis process into a sequential application of a denoising network. However, different from image synthesis, image restoration (IR) has a strong constraint to generate results in accordance with ground-truth. Thus, for IR, traditional DMs that run massive iterations on a large model to estimate whole images or feature maps are inefficient. To address this issue, we propose an efficient DM for IR (DiffIR), which consists of a compact IR prior extraction network (CPEN), dynamic IR transformer (DIRformer), and denoising network. Specifically, DiffIR has two training stages: pretraining and training DM. In pretraining, we input ground-truth images into CPEN$_{S1}$ to capture a compact IR prior representation (IPR) to guide DIRformer. In the second stage, we train the DM to directly estimate the same IPR as pretrained CPEN$_{S1}$ using only LQ images. We observe that since the IPR is only a compact vector, DiffIR can use fewer iterations than traditional DM to obtain accurate estimations and generate more stable and realistic results. Since the iterations are few, our DiffIR can adopt a joint optimization of CPEN$_{S2}$, DIRformer, and denoising network, which can further reduce the estimation error influence. We conduct extensive experiments on several IR tasks and achieve SOTA performance while consuming less computational cost. Code is available at https://github.com/Zj-BinXia/DiffIR. This paper introduces DiffIR, an efficient Diffusion Model (DM) designed for Image Restoration (IR) tasks like inpainting, super-resolution, and deblurring. Traditional DMs, highly effective for image synthesis, are computationally expensive for IR tasks where most input pixels are already present. DiffIR addresses this inefficiency by leveraging the strength of DMs in estimating data distributions to guide the restoration process. DiffIR consists of a compact IR prior extraction network (CPEN), a dynamic IR transformer (DIRformer), and a denoising network. It operates in two stages: (1) pretraining where CPEN extracts a compact IR prior representation (IPR) from ground-truth images to guide the DIRformer, and (2) training the DM to estimate the IPR solely from LQ images. This allows for joint optimization of DM and DIRformer, enhancing robustness against estimation errors. DiffIR achieves state-of-the-art performance on benchmark datasets for inpainting, super-resolution, and deblurring tasks. It outperforms other DM-based methods significantly in terms of efficiency, consuming considerably less computational resources while achieving better or comparable results. DiffIR demonstrates faster convergence speed compared to traditional DMs due to its focus on estimating a compact IPR rather than generating entire images. The current implementation of DiffIR primarily focuses on single-image restoration tasks. Exploration of more complex diffusion processes and network architectures within the DiffIR framework could be beneficial. image restoration, diffusion model, deep learning, image inpainting, super-resolution
2303.09431 Report NeRFMeshing: Distilling Neural Radiance Fields into Geometrically-Accurate 3D Meshes Marie-Julie Rakotosaona, Fabian Manhardt, Diego Martin Arroyo, Michael Niemeyer, Abhijit Kundu, Federico Tombari With the introduction of Neural Radiance Fields (NeRFs), novel view synthesis has recently made a big leap forward. At the core, NeRF proposes that each 3D point can emit radiance, allowing to conduct view synthesis using differentiable volumetric rendering. While neural radiance fields can accurately represent 3D scenes for computing the image rendering, 3D meshes are still the main scene representation supported by most computer graphics and simulation pipelines, enabling tasks such as real time rendering and physics-based simulations. Obtaining 3D meshes from neural radiance fields still remains an open challenge since NeRFs are optimized for view synthesis, not enforcing an accurate underlying geometry on the radiance field. We thus propose a novel compact and flexible architecture that enables easy 3D surface reconstruction from any NeRF-driven approach. Upon having trained the radiance field, we distill the volumetric 3D representation into a Signed Surface Approximation Network, allowing easy extraction of the 3D mesh and appearance. Our final 3D mesh is physically accurate and can be rendered in real time on an array of devices. Presents NeRFMeshing, a novel method for extracting geometrically accurate and compact 3D meshes from trained NeRF models, enabling real-time rendering and integration with existing graphics pipelines. NeRFs excel at view synthesis but lack accurate underlying geometry needed for tasks like real-time rendering, physics simulations, and integration with standard computer graphics pipelines. Introduces a Signed Surface Approximation Network (SSAN) trained on pre-trained NeRF data to approximate a Truncated Signed Distance Field (TSDF) and appearance features. It leverages NeRF's rendered depth distribution and enforces smoothness and normal consistency. A 3D mesh is extracted using marching cubes and rendered in real-time using an appearance network. NeRFMeshing achieves superior geometric accuracy compared to baselines like SNeRG and MobileNeRF on the Blender Synthetic dataset. The method demonstrates high-quality mesh reconstruction even on challenging unbounded scenes from the Mip-NeRF 360 dataset. The extracted meshes are suitable for real-time rendering and can be readily used in physics-based simulations and scene editing. Rendering highly detailed surfaces can lead to large mesh sizes, suggesting a need for adaptive mesh reconstruction. Large and detailed scenes are limited by resolution constraints to manage model size. neural radiance fields, 3d mesh reconstruction, real-time rendering, novel view synthesis, computer graphics
2303.09412 Report NeRFtrinsic Four: An End-To-End Trainable NeRF Jointly Optimizing Diverse Intrinsic and Extrinsic Camera Parameters Hannah Schieber, Fabian Deuser, Bernhard Egger, Norbert Oswald, Daniel Roth Novel view synthesis using neural radiance fields (NeRF) is the state-of-the-art technique for generating high-quality images from novel viewpoints. Existing methods require a priori knowledge about extrinsic and intrinsic camera parameters. This limits their applicability to synthetic scenes, or real-world scenarios with the necessity of a preprocessing step. Current research on the joint optimization of camera parameters and NeRF focuses on refining noisy extrinsic camera parameters and often relies on the preprocessing of intrinsic camera parameters. Further approaches are limited to covering only a single camera intrinsic. To address these limitations, we propose a novel end-to-end trainable approach called NeRFtrinsic Four. We utilize Gaussian Fourier features to estimate extrinsic camera parameters and dynamically predict varying intrinsic camera parameters through the supervision of the projection error. Our approach outperforms existing joint optimization methods on LLFF and BLEFF. In addition to these existing datasets, we introduce a new dataset called iFF with varying intrinsic camera parameters. NeRFtrinsic Four is a step forward in the joint optimization of NeRF-based view synthesis and enables more realistic and flexible rendering in real-world scenarios with varying camera parameters. Presents NeRFtrinsic Four, an end-to-end trainable neural radiance field (NeRF) framework for novel view synthesis that jointly optimizes diverse intrinsic and extrinsic camera parameters, eliminating the need for preprocessing steps like SfM. Existing NeRF methods require a priori knowledge of camera parameters, limiting their applicability to synthetic scenes or requiring preprocessing. This work addresses limitations in current joint optimization methods, enabling more realistic and flexible rendering for real-world scenarios with varying camera parameters. Utilizes Gaussian Fourier features to estimate extrinsic camera parameters and dynamically predicts varying intrinsic camera parameters through the supervision of projection error. Introduces a novel dataset, iFF, with varying intrinsic camera parameters. Outperforms existing joint optimization methods (NeRF-- and SiNeRF) on LLFF and BLEFF benchmarks in terms of image quality and camera parameter estimation. Demonstrates superior performance on the newly introduced iFF dataset, highlighting the advantage of handling diverse intrinsic camera parameters. Shows improved stability and accuracy in camera pose prediction through the use of Gaussian Fourier features and an SSIM loss function. Limited to forward-facing scenes and not yet suitable for 360° scenes. Pose MLP initialization remains challenging, suggesting potential for future work in regularization methods. neural radiance fields, novel view synthesis, camera parameter estimation, gaussian fourier features, joint optimization
2303.09319 Report Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation Yiyang Ma, Huan Yang, Wenjing Wang, Jianlong Fu, Jiaying Liu Language-guided image generation has achieved great success nowadays by using diffusion models. However, texts can be less detailed to describe highly-specific subjects such as a particular dog or a certain car, which makes pure text-to-image generation not accurate enough to satisfy user requirements. In this work, we present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences and generates customized images with the subjects. To be more specific, both input texts and images are encoded into one unified multi-modal latent space, in which the input images are learned to be projected to pseudo word embedding and can be further combined with text to guide image generation. Besides, to eliminate the irrelevant parts of the input images such as background or illumination, we propose a novel sampling technique of diffusion models used by the image generator which fuses the results guided by multi-modal input and pure text input. By leveraging the large-scale pre-trained text-to-image generator and the designed image encoder, our method is able to generate high-quality images with complex semantics from both aspects of input texts and images. This paper introduces UMM-Diffusion, a novel framework for generating images from both text descriptions and specific subjects provided as images, encoding them into a unified multimodal latent space. Current text-to-image generation models struggle to accurately depict highly specific subjects. This work allows for greater customization and control over the generated images by incorporating visual subjects. The method uses a Text-and-Image Unified Encoder (TIUE) that leverages pre-trained CLIP encoders to project both text and images into a shared latent space. A novel fusing sampling technique combines multi-modal and text-only guidance to mitigate overfitting on irrelevant image details. UMM-Diffusion generates high-quality, customizable images with diverse novel views of the target subjects, aligning with text descriptions while preserving visual features. The model successfully disentangles style information, allowing for stylized image generation guided by either text or input image styles. UMM-Diffusion supports multiple image guidance within a single input, enabling composition of several subjects into one image. The model may mix features of multiple subjects when provided, leading to fused visual representations. Generating rare or highly-fictitious subjects can result in distorted or inaccurate visual details. image generation, multi-modal learning, diffusion models, text-to-image synthesis, clip
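A sketch of the fused guidance idea in the UMM-Diffusion row above: the noise predictions obtained under multi-modal and text-only conditioning are blended before classifier-free guidance is applied. The blending rule and the weights `w_guide`/`w_fuse` are illustrative assumptions; the paper's exact sampling schedule differs, this only shows the structure of the idea.

```python
import torch

def fused_guidance_eps(eps_model, x_t, t, emb_multimodal, emb_text_only, emb_uncond,
                       w_guide=7.5, w_fuse=0.5):
    """Fuse multi-modal and text-only guidance in a classifier-free style (sketch).

    emb_multimodal : text embedding with the subject image injected as pseudo-word tokens
    emb_text_only  : embedding of the plain text prompt
    emb_uncond     : unconditional (empty-prompt) embedding
    """
    eps_mm = eps_model(x_t, t, emb_multimodal)     # follows the subject image
    eps_txt = eps_model(x_t, t, emb_text_only)     # ignores irrelevant image details
    eps_un = eps_model(x_t, t, emb_uncond)

    # Blend the two conditional predictions, then apply the usual guidance scale.
    eps_cond = w_fuse * eps_mm + (1.0 - w_fuse) * eps_txt
    return eps_un + w_guide * (eps_cond - eps_un)
```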
2303.09295 Report DIRE for Diffusion-Generated Image Detection Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, Houqiang Li Diffusion models have shown remarkable success in visual synthesis, but have also raised concerns about potential abuse for malicious purposes. In this paper, we seek to build a detector for telling apart real images from diffusion-generated images. We find that existing detectors struggle to detect images generated by diffusion models, even if we include generated images from a specific diffusion model in their training data. To address this issue, we propose a novel image representation called DIffusion Reconstruction Error (DIRE), which measures the error between an input image and its reconstruction counterpart by a pre-trained diffusion model. We observe that diffusion-generated images can be approximately reconstructed by a diffusion model while real images cannot. It provides a hint that DIRE can serve as a bridge to distinguish generated and real images. DIRE provides an effective way to detect images generated by most diffusion models, and it is general for detecting generated images from unseen diffusion models and robust to various perturbations. Furthermore, we establish a comprehensive diffusion-generated benchmark including images generated by eight diffusion models to evaluate the performance of diffusion-generated image detectors. Extensive experiments on our collected benchmark demonstrate that DIRE exhibits superiority over previous generated-image detectors. The code and dataset are available at https://github.com/ZhendongWang6/DIRE. This paper proposes DIRE (Diffusion Reconstruction Error), a novel image representation for detecting diffusion-generated images, which leverages the distinct reconstruction errors between real and generated images. The rise of diffusion models in visual synthesis necessitates reliable detectors to prevent misuse for malicious purposes, such as generating deepfakes. DIRE measures the error between an input image and its reconstruction obtained by inverting and reconstructing the image using a pre-trained diffusion model (DDIM). A binary classifier is then trained on DIRE representations to distinguish real from generated images. DIRE exhibits superior generalization ability compared to existing generated image detectors, achieving high accuracy on unseen diffusion models. DIRE shows robustness to image perturbations, including Gaussian blur and JPEG compression. Analysis of noise patterns and frequency information in DIRE further supports its effectiveness in distinguishing real and generated images. The reliance on a pre-trained diffusion model for reconstruction might limit DIRE's applicability if the model is not robust or generalized. Future work could explore the use of DIRE for detecting images generated by other generative models beyond diffusion models. diffusion model, image generation, image forensics, deepfake detection, generalization
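The DIRE representation in the row above reduces to an absolute reconstruction error. The sketch below assumes hypothetical `ddim_invert` and `ddim_reconstruct` helpers wrapping a frozen pretrained diffusion model; they are placeholders for whatever DDIM inversion/sampling code is available.

```python
import torch

def compute_dire(image, ddim_invert, ddim_reconstruct):
    """DIffusion Reconstruction Error (sketch).

    ddim_invert(image)      -> latent noise from deterministic DDIM inversion
    ddim_reconstruct(noise) -> image regenerated from that noise with the same model
    """
    with torch.no_grad():
        noise = ddim_invert(image)
        recon = ddim_reconstruct(noise)
    # Diffusion-generated images reconstruct almost exactly; real images do not,
    # so the magnitude of this error map separates the two classes.
    return (image - recon).abs()

# A binary classifier (e.g. a ResNet) is then trained on DIRE maps
# instead of raw pixels to tell real images from diffusion-generated ones.
```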
2303.09270 Report SpectralCLIP: Preventing Artifacts in Text-Guided Style Transfer from a Spectral Perspective Zipeng Xu, Songlong Xing, Enver Sangineto, Nicu Sebe Owing to the power of vision-language foundation models, e.g., CLIP, the area of image synthesis has seen recent important advances. Particularly, for style transfer, CLIP enables transferring more general and abstract styles without collecting the style images in advance, as the style can be efficiently described with natural language, and the result is optimized by minimizing the CLIP similarity between the text description and the stylized image. However, directly using CLIP to guide style transfer leads to undesirable artifacts (mainly written words and unrelated visual entities) spread over the image. In this paper, we propose SpectralCLIP, which is based on a spectral representation of the CLIP embedding sequence, where most of the common artifacts occupy specific frequencies. By masking the band including these frequencies, we can condition the generation process to adhere to the target style properties (e.g., color, texture, paint stroke, etc.) while excluding the generation of larger-scale structures corresponding to the artifacts. Experimental results show that SpectralCLIP prevents the generation of artifacts effectively in quantitative and qualitative terms, without impairing the stylisation quality. We also apply SpectralCLIP to text-conditioned image generation and show that it prevents written words in the generated images. Our code is available at https://github.com/zipengxuc/SpectralCLIP. Proposes SpectralCLIP, a novel method that leverages spectral analysis to prevent artifact generation in CLIP-guided style transfer. Addresses the problem of undesirable visual and textual artifacts in CLIP-guided style transfer, enhancing the quality and realism of generated images. Transforms the CLIP embedding sequence into the frequency domain and employs band-stop filters to remove frequencies associated with artifact scales. Significantly reduces both visual and textual artifacts in stylized images. Maintains high consistency with target styles while preventing artifacts. Outperforms CLIPstyler and forget-to-spell CLIP in terms of artifact reduction and overall quality based on user study. Lacks a clear explanation for the relationship between artifact scales and target styles. Relies on empirically defined band combinations, requiring manual selection for new styles. style transfer, clip, artifact removal, spectral analysis, image generation
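A sketch of the band-stop filtering over the CLIP token sequence described in the SpectralCLIP row above, using `torch.fft`. The band indices are placeholders; the paper selects the suppressed bands empirically, and the exact location in the style loss where filtering is applied is an assumption here.

```python
import torch

def band_stop_clip_tokens(tokens, stop_bands):
    """Filter a CLIP embedding sequence in the frequency domain (sketch).

    tokens     : [num_tokens, dim] sequence of patch-token embeddings
    stop_bands : list of (low, high) frequency-index ranges to suppress
    """
    spec = torch.fft.rfft(tokens, dim=0)           # FFT along the token axis
    for lo, hi in stop_bands:
        spec[lo:hi] = 0                            # band-stop: zero out artifact frequencies
    return torch.fft.irfft(spec, n=tokens.shape[0], dim=0)

# Example: suppress a mid-frequency band before computing the CLIP style loss.
tokens = torch.randn(197, 512)
filtered = band_stop_clip_tokens(tokens, stop_bands=[(20, 40)])
```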
2303.09252 Report GridCLIP: One-Stage Object Detection by Grid-Level CLIP Representation Learning Jiayi Lin, Shaogang Gong A vision-language foundation model pretrained on very large-scale image-text paired data has the potential to provide generalizable knowledge representation for downstream visual recognition and detection tasks, especially on supplementing the undersampled categories in downstream model training. Recent studies utilizing CLIP for object detection have shown that a two-stage detector design typically outperforms a one-stage detector, while requiring more expensive training resources and longer inference time. In this work, we propose a one-stage detector GridCLIP that narrows its performance gap to those of two-stage detectors, while being approximately 43 and 5 times faster than its two-stage counterpart (ViLD) in the training and test processes, respectively. GridCLIP learns grid-level representations to adapt to the intrinsic principle of one-stage detection learning by expanding the conventional CLIP image-text holistic mapping to a more fine-grained, grid-text alignment. This differs from the region-text mapping in two-stage detectors that apply CLIP directly by treating regions as images. Specifically, GridCLIP performs Grid-level Alignment to adapt the CLIP image-level representations to grid-level representations by aligning to CLIP category representations to learn the annotated (especially frequent) categories. To learn generalizable visual representations of broader categories, especially undersampled ones, we perform Image-level Alignment during training to propagate broad pre-learned categories in the CLIP image encoder from the image-level to the grid-level representations. Experiments show that the learned CLIP-based grid-level representations boost the performance of undersampled (infrequent and novel) categories, reaching comparable detection performance on the LVIS benchmark. This paper introduces GridCLIP, a one-stage object detector that leverages CLIP's representation space to improve the detection of undersampled categories. Existing object detection datasets often suffer from long-tail distributions, where some categories have very few training samples, hindering the performance on these undersampled categories. GridCLIP aims to address this issue by transferring knowledge from the CLIP model. GridCLIP employs two key alignment strategies: (1) Grid-level Alignment maps localized grid-level image features to CLIP's text embeddings for base categories. (2) Image-level Alignment performs knowledge distillation from a fixed CLIP image encoder to guide the learning of both base and novel categories at the image level. GridCLIP achieves comparable performance to two-stage detectors on LVIS while being significantly faster in training and inference. Both grid-level and image-level alignments are shown to contribute to the improved detection of undersampled categories. Analysis reveals that CLIP's image encoder can effectively capture multiple categories within an image, supporting the effectiveness of image-level alignment. The gap between base and novel categories in GridCLIP suggests potential for further improvement by refining the alignment strategies for novel categories. Exploring alternative one-stage detectors or incorporating learnable prompts could further enhance GridCLIP's performance. object detection, vision-language models, clip, undersampled categories, long-tail distribution
2303.09181 Report Global Knowledge Calibration for Fast Open-Vocabulary Segmentation Kunyang Han, Yong Liu, Jun Hao Liew, Henghui Ding, Yunchao Wei, Jiajun Liu, Yitong Wang, Yansong Tang, Yujiu Yang, Jiashi Feng, Yao Zhao Recent advancements in pre-trained vision-language models, such as CLIP, have enabled the segmentation of arbitrary concepts solely from textual inputs, a process commonly referred to as open-vocabulary semantic segmentation (OVS). However, existing OVS techniques confront a fundamental challenge: the trained classifier tends to overfit on the base classes observed during training, resulting in suboptimal generalization performance to unseen classes. To mitigate this issue, recent studies have proposed the use of an additional frozen pre-trained CLIP for classification. Nonetheless, this approach incurs heavy computational overheads as the CLIP vision encoder must be repeatedly forward-passed for each mask, rendering it impractical for real-world applications. To address this challenge, our objective is to develop a fast OVS model that can perform comparably or better without the extra computational burden of the CLIP image encoder during inference. To this end, we propose a core idea of preserving the generalizable representation when fine-tuning on known classes. Specifically, we introduce a text diversification strategy that generates a set of synonyms for each training category, which prevents the learned representation from collapsing onto specific known category names. Additionally, we employ a text-guided knowledge distillation method to preserve the generalizable knowledge of CLIP. Extensive experiments demonstrate that our proposed model achieves robust generalization performance across various datasets. Furthermore, we perform a preliminary exploration of open-vocabulary video segmentation and present a benchmark that can facilitate future open-vocabulary research in the video domain. This paper proposes Global Knowledge Calibration (GKC), a method for fast open-vocabulary segmentation that preserves generalizability to unseen categories during training and doesn't require an additional frozen CLIP model during inference, leading to faster inference speed. Existing open-vocabulary segmentation (OVS) models often overfit to seen categories, limiting their generalization to unseen ones. While using a frozen CLIP model during inference helps, it introduces significant computational overhead. GKC introduces two key components: 1) a text diversification strategy using WordNet synonyms to prevent overfitting to specific category names and 2) a text-guided knowledge distillation approach that utilizes CLIP's multi-modal alignment to guide the training process. GKC achieves state-of-the-art performance on multiple benchmarks while being significantly faster than previous methods. Text diversification and text-guided distillation are shown to effectively improve generalization ability. The paper introduces a preliminary exploration of open-vocabulary video segmentation and constructs a new benchmark for future research. The video OVS model suffers from overfitting with increased training iterations. Future work will focus on addressing the overfitting issue in video OVS. open-vocabulary segmentation, knowledge distillation, text diversification, vision-language models, video segmentation
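The text-diversification idea in the row above can be approximated with WordNet synonyms via NLTK, as sketched below. The helper name and the sampling policy are illustrative; the paper may filter or weight synonyms differently, and the snippet requires the NLTK WordNet corpus to be downloaded once.

```python
# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet

def diversify_category_name(name, max_synonyms=5):
    """Collect WordNet synonyms for a training category name (sketch).

    During training, one of these synonyms can be sampled in place of the
    canonical class name so the learned representation does not collapse
    onto a single category token.
    """
    synonyms = [name]
    for synset in wordnet.synsets(name.replace(" ", "_")):
        for lemma in synset.lemma_names():
            lemma = lemma.replace("_", " ")
            if lemma not in synonyms:
                synonyms.append(lemma)
    return synonyms[: max_synonyms + 1]

print(diversify_category_name("sofa"))  # e.g. ['sofa', 'couch', 'lounge']
```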
2303.08914 Report MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge Wei Lin, Leonid Karlinsky, Nina Shvetsova, Horst Possegger, Mateusz Kozinski, Rameswar Panda, Rogerio Feris, Hilde Kuehne, Horst Bischof Large scale Vision-Language (VL) models have shown tremendous success in aligning representations between visual and text modalities. This enables remarkable progress in zero-shot recognition, image generation & editing, and many other exciting tasks. However, VL models tend to over-represent objects while paying much less attention to verbs, and require additional tuning on video data for best zero-shot action recognition performance. While previous work relied on large-scale, fully-annotated data, in this work we propose an unsupervised approach. We adapt a VL model for zero-shot and few-shot action recognition using a collection of unlabeled videos and an unpaired action dictionary. Based on that, we leverage Large Language Models and VL models to build a text bag for each unlabeled video via matching, text expansion and captioning. We use those bags in a Multiple Instance Learning setup to adapt an image-text backbone to video data. Although finetuned on unlabeled video data, our resulting models demonstrate high transferability to numerous unseen zero-shot downstream tasks, improving the base VL model performance by up to 14%, and even comparing favorably to fully-supervised baselines in both zero-shot and few-shot video recognition transfer. The code will be released later at https://github.com/wlin-at/MAXI. This paper presents MAtch, eXpand and Improve (MAXI), an unsupervised finetuning approach for zero-shot action recognition using unlabeled videos and language knowledge. Existing vision-language (VL) models often underperform in zero-shot action recognition due to their object-centric focus. Previous works relied on fully annotated video data for finetuning, which is costly and limits generalizability. This paper proposes an unsupervised approach to overcome these limitations. The proposed MAXI constructs a text bag for each unlabeled video by combining information from a predefined action dictionary, GPT-3 text expansion, and BLIP captioning. It then uses Multiple Instance Learning (MIL) to finetune a VL model on these unlabeled video-text bag pairs. Unsupervised finetuning with MAXI significantly improves the zero-shot action recognition performance of the VL model by up to 14% on seven unseen benchmarks. The proposed approach even outperforms several state-of-the-art supervised methods that are trained with full annotation on the same data. MAXI demonstrates strong few-shot learning capability, outperforming baselines in most cases, even with extremely limited data. Performance improvement is not consistent across all datasets due to varying domain shifts to the action dictionary used for training. Further exploration of temporal modeling in the unsupervised finetuning setting is needed. zero-shot learning, action recognition, vision-language models, unsupervised learning, multiple instance learning
2303.08817 Report DeepMIM: Deep Supervision for Masked Image Modeling Sucheng Ren, Fangyun Wei, Samuel Albanie, Zheng Zhang, Han Hu Deep supervision, which involves extra supervision on the intermediate features of a neural network, was widely used in image classification in the early deep learning era since it significantly reduces the training difficulty and eases optimization, e.g., by avoiding vanishing gradients compared to vanilla training. Nevertheless, with the emergence of normalization techniques and residual connections, deep supervision in image classification was gradually phased out. In this paper, we revisit deep supervision for masked image modeling (MIM) that pre-trains a Vision Transformer (ViT) via a mask-and-predict scheme. Experimentally, we find that deep supervision drives the shallower layers to learn more meaningful representations, accelerates model convergence, and expands attention diversities. Our approach, called DeepMIM, significantly boosts the representation capability of each layer. In addition, DeepMIM is compatible with many MIM models across a range of reconstruction targets. For instance, using ViT-B, DeepMIM on MAE achieves 84.2 top-1 accuracy on ImageNet, outperforming MAE by +0.6. By combining DeepMIM with a stronger tokenizer CLIP, our model achieves state-of-the-art performance on various downstream tasks, including image classification (85.6 top-1 accuracy on ImageNet-1K, outperforming MAE-CLIP by +0.8), object detection (52.8 APbox on COCO) and semantic segmentation (53.1 mIoU on ADE20K). Code and models are available at https://github.com/OliverRensu/DeepMIM. This paper revisits deep supervision for Masked Image Modeling (MIM) and proposes DeepMIM, a framework that applies deep supervision to intermediate features in the encoder of MIM models, enhancing representation learning, particularly in shallower layers. MIM pretraining often results in weaker informative feedback to shallower encoder layers due to the implicit deepening from the decoder. Deep supervision aims to address this by strengthening the representation learning capability of these layers. DeepMIM appends lightweight decoders to intermediate encoder blocks, introducing deep supervision. It optionally incorporates a hybrid target generator that produces progressively easier reconstruction targets for shallower layers, further enhancing learning. DeepMIM consistently improves performance across a range of MIM models and reconstruction targets. DeepMIM-MAE achieves 84.2% top-1 accuracy on ImageNet-1K, outperforming MAE by +0.6%. Combined with a CLIP tokenizer, DeepMIM achieves state-of-the-art results on ImageNet-1K (85.6%), COCO object detection (52.8 AP), and ADE20K segmentation (53.1 mIoU). The hybrid target generator, while beneficial, introduces additional computational overhead. Exploration of alternative hybrid target generation methods with less computational cost. self-supervised learning, masked image modeling, deep supervision, vision transformer, representation learning
2303.08767 Report Highly Personalized Text Embedding for Image Manipulation by Stable Diffusion Inhwa Han, Serin Yang, Taesung Kwon, Jong Chul Ye Diffusion models have shown superior performance in image generation and manipulation, but the inherent stochasticity presents challenges in preserving and manipulating image content and identity. While previous approaches like DreamBooth and Textual Inversion have proposed model or latent representation personalization to maintain the content, their reliance on multiple reference images and complex training limits their practicality. In this paper, we present a simple yet highly effective approach to personalization using highly personalized (HiPer) text embedding by decomposing the CLIP embedding space for personalization and content manipulation. Our method does not require model fine-tuning or identifiers, yet still enables manipulation of background, texture, and motion with just a single image and target text. Through experiments on diverse target texts, we demonstrate that our approach produces highly personalized and complex semantic image edits across a wide range of tasks. We believe that the novel understanding of the text embedding space presented in this work has the potential to inspire further research across various tasks. This paper introduces HiPer, a novel approach for personalized text-to-image generation using Stable Diffusion, which enables precise image manipulation while preserving subject identity from a single source image. Existing diffusion models struggle with maintaining content and identity during image manipulation. While methods like DreamBooth and Textual Inversion offer some personalization, they require multiple reference images and extensive training. HiPer leverages a novel understanding of CLIP embedding space. It decomposes embeddings, designating a portion for personalization (HiPer embedding) and optimizing it while maintaining source image semantics. This allows manipulation of background, texture, and motion using only a single image and target text. HiPer successfully manipulates images across various attributes like motion, background, and texture while preserving subject identity. Compared to Imagic, DreamBooth, and Textual Inversion, HiPer demonstrates superior performance in both qualitative and quantitative evaluations, showcasing better content preservation and semantic alignment. The method is computationally efficient, requiring only around 3 minutes for optimization. HiPer exhibits limitations in manipulating images with prompts requiring counting or specific color matching, and struggles with complex artificial objects. While preserving identity, the overall generated image may appear somewhat unnatural due to limitations of the base Stable Diffusion Model. image manipulation, text-to-image synthesis, diffusion models, personalization, clip embedding
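The core of the HiPer-style personalization described above is optimizing a small tail of the text embedding while the diffusion model stays frozen. The sketch below assumes a `diffusion_loss` callable wrapping a frozen Stable Diffusion denoising loss and an embedding width of 768; both are placeholders, and hyperparameters are illustrative.

```python
import torch

def optimize_hiper_embedding(src_image, prompt_emb, diffusion_loss,
                             n_personal=5, steps=1000, lr=5e-3, dim=768):
    """Optimize a few 'personalized' token embeddings appended to a prompt (sketch).

    prompt_emb     : [seq_len, dim] frozen text embedding of the source prompt
    diffusion_loss : callable (image, text_emb) -> denoising loss under a frozen model
    Only the last n_personal positions are learnable; U-Net and text encoder stay fixed.
    """
    hiper = torch.randn(n_personal, dim, requires_grad=True)
    opt = torch.optim.Adam([hiper], lr=lr)
    for _ in range(steps):
        text_emb = torch.cat([prompt_emb[:-n_personal], hiper], dim=0)
        loss = diffusion_loss(src_image, text_emb)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return hiper.detach()

# At inference, the same personalized tail is appended to a *target* prompt embedding,
# so the edit follows the new text while the subject identity is preserved.
```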
2303.08714 Report ResDiff: Combining CNN and Diffusion Model for Image Super-Resolution Shuyao Shang, Zhengyang Shan, Guangxing Liu, LunQian Wang, XingHua Wang, Zekai Zhang, Jinglin Zhang Adapting the Diffusion Probabilistic Model (DPM) for direct image super-resolution is wasteful, given that a simple Convolutional Neural Network (CNN) can recover the main low-frequency content. Therefore, we present ResDiff, a novel Diffusion Probabilistic Model based on Residual structure for Single Image Super-Resolution (SISR). ResDiff utilizes a combination of a CNN, which restores primary low-frequency components, and a DPM, which predicts the residual between the ground-truth image and the CNN predicted image. In contrast to the common diffusion-based methods that directly use LR images to guide the noise towards HR space, ResDiff utilizes the CNN's initial prediction to direct the noise towards the residual space between HR space and CNN-predicted space, which not only accelerates the generation process but also acquires superior sample quality. Additionally, a frequency-domain-based loss function for CNN is introduced to facilitate its restoration, and a frequency-domain guided diffusion is designed for DPM on behalf of predicting high-frequency details. The extensive experiments on multiple benchmark datasets demonstrate that ResDiff outperforms previous diffusion based methods in terms of shorter model convergence time, superior generation quality, and more diverse samples. This paper presents ResDiff, a novel residual Diffusion Probabilistic Model (DPM) for Single Image Super-Resolution (SISR). It leverages a CNN to recover low-frequency image components and a DPM to predict the residual high-frequency details, improving efficiency and quality. Current DPMs for SISR are inefficient as they attempt to recover the entire image from noise. This work addresses this by separating low and high-frequency recovery, leading to faster convergence and better quality. ResDiff employs a pre-trained CNN with frequency-domain loss functions for initial image restoration. A Frequency Domain-guided Diffusion (FD-guided Diffusion) then refines high-frequency details using novel modules: Frequency-Domain Information Splitter and high-frequency guided cross-attention. ResDiff achieves faster convergence than previous diffusion-based SISR methods. It generates higher-quality super-resolved images with finer details. The model produces more diverse samples compared to other approaches. While ResDiff improves convergence speed, operations like DWT remain computationally expensive. The model's performance is limited by its smaller size compared to some SOTA methods. Utilizing larger U-net models could bridge this gap. super-resolution, diffusion models, deep learning, computer vision, image restoration
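The residual formulation in the ResDiff row above is compact: the CNN supplies a coarse SR estimate and the diffusion model only models the residual to the ground truth. The sketch below shows the training target and the final composition; the `cnn_sr` interface is an assumption, and the frequency-domain guidance modules are omitted.

```python
import torch

def resdiff_training_pair(lr_image, hr_image, cnn_sr):
    """Build the residual-diffusion training target (sketch).

    cnn_sr : pretrained lightweight CNN that upscales the LR image and recovers
             most of the low-frequency content.
    The diffusion model is trained to generate `residual` conditioned on the
    CNN prediction, instead of generating the whole HR image from noise.
    """
    with torch.no_grad():
        coarse = cnn_sr(lr_image)
    residual = hr_image - coarse
    return coarse, residual

# At inference the sampled residual is simply added back:
#   sr_image = cnn_sr(lr_image) + sampled_residual
```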
2303.08695 Report RefiNeRF: Modelling dynamic neural radiance fields with inconsistent or missing camera parameters Shuja Khalid, Frank Rudzicz Novel view synthesis (NVS) is a challenging task in computer vision that involves synthesizing new views of a scene from a limited set of input images. Neural Radiance Fields (NeRF) have emerged as a powerful approach to address this problem, but they require accurate knowledge of camera intrinsic and extrinsic parameters. Traditionally, structure-from-motion (SfM) and multi-view stereo (MVS) approaches have been used to extract camera parameters, but these methods can be unreliable and may fail in certain cases. In this paper, we propose a novel technique that leverages unposed images from dynamic datasets, such as the NVIDIA dynamic scenes dataset, to learn camera parameters directly from data. Our approach is highly extensible and can be integrated into existing NeRF architectures with minimal modifications. We demonstrate the effectiveness of our method on a variety of static and dynamic scenes and show that it outperforms traditional SfM and MVS approaches. The code for our method is publicly available at https://github.com/redacted/refinerf. Our approach offers a promising new direction for improving the accuracy and robustness of NVS using NeRF, and we anticipate that it will be a valuable tool for a wide range of applications in computer vision and graphics. This paper introduces refiNeRF, a novel method to model dynamic neural radiance fields (NeRFs) even with inconsistent or missing camera parameters. Accurate camera parameters are crucial for NeRFs to synthesize novel views, but traditional methods like SfM can be unreliable. RefiNeRF aims to improve the accuracy and robustness of NeRFs in challenging real-world scenarios. The method refines camera parameters by jointly optimizing them with the NeRF model using a photometric loss. It employs a learning scheduler for stable training and leverages multi-resolution encoding for high-fidelity reconstruction. RefiNeRF outperforms state-of-the-art methods like BARF and NeRF-- in novel view synthesis quality on the NVIDIA dynamic scenes dataset. The method effectively refines even significantly perturbed camera poses, improving reconstruction metrics compared to using coarse initializations. RefiNeRF demonstrates generalizability by enabling novel view synthesis on the challenging Cholec80 dataset, where traditional SfM methods struggle. RefiNeRF inherits limitations of the original NeRF, such as slow optimization and rendering speed. The current method is computationally demanding, limiting the length of processable video clips, particularly for high-resolution data. neural radiance fields, novel view synthesis, camera pose estimation, dynamic scenes, deep learning
2303.08686 Report Weakly Supervised Monocular 3D Object Detection using Multi-View Projection and Direction Consistency Runzhou Tao, Wencheng Han, Zhongying Qiu, Cheng-zhong Xu, Jianbing Shen Monocular 3D object detection has become a mainstream approach in automatic driving for its easy application. A prominent advantage is that it does not need LiDAR point clouds during inference. However, most current methods still rely on 3D point cloud data for labeling the ground truths used in the training phase. This inconsistency between the training and inference makes it hard to utilize the large-scale feedback data and increases the data collection expenses. To bridge this gap, we propose a new weakly supervised monocular 3D object detection method, which can train the model with only 2D labels marked on images. To be specific, we explore three types of consistency in this task, i.e. the projection, multi-view and direction consistency, and design a weakly-supervised architecture based on these consistencies. Moreover, we propose a new 2D direction labeling method in this task to guide the model for accurate rotation direction prediction. Experiments show that our weakly-supervised method achieves comparable performance with some fully supervised methods. When used as a pre-training method, our model can significantly outperform the corresponding fully-supervised baseline with only 1/3 3D labels. https://github.com/weakmono3d/weakmono3d This paper proposes a weakly supervised method for monocular 3D object detection, eliminating the need for 3D point cloud annotations during training by using only 2D bounding box and direction labels on images. This is important because it enables the utilization of large-scale feedback data from production cars, which lack 3D annotations but are crucial for improving model robustness in real-world scenarios. The method leverages three types of consistency: 1) Projection Consistency ensures the projected 3D bounding boxes align with 2D labels. 2) Multi-view Consistency enforces consistency between predictions from different viewpoints of the same object. 3) Direction Consistency aligns predicted 3D box rotations with newly proposed 2D direction labels. The method achieves comparable performance to some fully supervised methods on the KITTI benchmark. It performs well on a newly collected dataset (ProdCars) from production cars. As a pre-training method, it outperforms the fully supervised baseline with only 1/3 of 3D labels. The performance using video sequences as multi-view data is not as good as using multi-camera data. The method assumes objects are stationary between frames for video sequence data, which might not always hold true in real-world scenarios. Future work could explore handling object motion in video sequences and extending the method to other label-efficient settings. 3d object detection, weakly supervised learning, monocular vision, autonomous driving, consistency loss
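A sketch of the projection-consistency term from the row above: the predicted 3D box corners are projected into the image with the camera intrinsics and penalized for disagreeing with the annotated 2D box. The corner convention and loss form are illustrative assumptions; the full method also uses multi-view and direction consistency terms.

```python
import torch

def projection_consistency_loss(corners_3d, K, box_2d):
    """Project predicted 3D box corners and compare to a 2D label (sketch).

    corners_3d : [8, 3] predicted box corners in camera coordinates
    K          : [3, 3] camera intrinsic matrix
    box_2d     : [4] annotated (x_min, y_min, x_max, y_max) in pixels
    """
    proj = corners_3d @ K.t()                           # [8, 3] homogeneous pixel coords
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)     # perspective divide
    pred_box = torch.cat([uv.min(dim=0).values, uv.max(dim=0).values])
    return torch.abs(pred_box - box_2d).mean()          # L1 between projected and labeled box
```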
2303.08622 Report Zero-Shot Contrastive Loss for Text-Guided Diffusion Image Style Transfer Serin Yang, Hyunmin Hwang, Jong Chul Ye Diffusion models have shown great promise in text-guided image style transfer, but there is a trade-off between style transformation and content preservation due to their stochastic nature. Existing methods require computationally expensive fine-tuning of diffusion models or additional neural network. To address this, here we propose a zero-shot contrastive loss for diffusion models that doesn't require additional fine-tuning or auxiliary networks. By leveraging patch-wise contrastive loss between generated samples and original image embeddings in the pre-trained diffusion model, our method can generate images with the same semantic content as the source image in a zero-shot manner. Our approach outperforms existing methods while preserving content and requiring no additional training, not only for image style transfer but also for image-to-image translation and manipulation. Our experimental results validate the effectiveness of our proposed method. This paper proposes ZeCon, a zero-shot contrastive loss for diffusion models, enabling image style transfer while preserving content without requiring fine-tuning or auxiliary networks. Existing diffusion-based style transfer methods struggle with the trade-off between style transformation and content preservation due to their stochastic nature. ZeCon leverages patch-wise contrastive loss between generated samples and original image embeddings within a pre-trained diffusion model. By incorporating this loss, the model maintains semantic consistency throughout the generation process. ZeCon outperforms GAN-based methods in terms of content preservation and style transformation quality. Compared to other diffusion models, ZeCon achieves superior style transfer from unseen domains without additional training. ZeCon is computationally efficient, requiring no training and demonstrating faster inference than methods like DiffusionCLIP. Finding optimal weights for different losses still requires user adjustment. The method occasionally exhibits limitations by displaying text prompts from the targeted style on the generated images. image style transfer, diffusion models, contrastive loss, content preservation, zero-shot learning
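The zero-shot contrastive term described above is essentially a PatchNCE-style loss between features of the generated sample and of the source image. The sketch below assumes the patch features have already been extracted (e.g. from intermediate U-Net activations) and pooled to matching locations; how the loss is mixed with the CLIP style loss during sampling is left out.

```python
import torch
import torch.nn.functional as F

def patch_contrastive_loss(feat_gen, feat_src, tau=0.07):
    """Patch-wise contrastive (InfoNCE) loss between two feature sets (sketch).

    feat_gen, feat_src : [N, C] features of N spatial patches from the generated
                         and source images at matching locations. Corresponding
                         rows are positives; all other rows act as negatives.
    """
    q = F.normalize(feat_gen, dim=1)
    k = F.normalize(feat_src, dim=1)
    logits = q @ k.t() / tau                              # [N, N] similarity matrix
    labels = torch.arange(q.shape[0], device=q.device)    # positive is the same patch index
    return F.cross_entropy(logits, labels)

# During sampling, this loss is differentiated w.r.t. the current sample to steer
# the frozen diffusion model toward the source content, with no fine-tuning.
```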
2303.08594 Report FastInst: A Simple Query-Based Model for Real-Time Instance Segmentation Junjie He, Pengyu Li, Yifeng Geng, Xuansong Xie Recent attention in instance segmentation has focused on query-based models. Despite being non-maximum suppression (NMS)-free and end-to-end, the superiority of these models on high-accuracy real-time benchmarks has not been well demonstrated. In this paper, we show the strong potential of query-based models on efficient instance segmentation algorithm designs. We present FastInst, a simple, effective query-based framework for real-time instance segmentation. FastInst can execute at a real-time speed (i.e., 32.5 FPS) while yielding an AP of more than 40 (i.e., 40.5 AP) on COCO test-dev without bells and whistles. Specifically, FastInst follows the meta-architecture of recently introduced Mask2Former. Its key designs include instance activation-guided queries, dual-path update strategy, and ground truth mask-guided learning, which enable us to use lighter pixel decoders, fewer Transformer decoder layers, while achieving better performance. The experiments show that FastInst outperforms most state-of-the-art real-time counterparts, including strong fully convolutional baselines, in both speed and accuracy. Code can be found at https://github.com/junjiehe96/FastInst . This paper proposes FastInst, a simple and efficient query-based model for real-time instance segmentation. Real-time instance segmentation is crucial for applications like self-driving cars and robotics, but existing query-based methods are often computationally expensive. FastInst addresses this gap by demonstrating the potential of query-based models for efficient instance segmentation. FastInst builds upon the Mask2Former architecture and introduces three key innovations: (1) Instance activation-guided queries that dynamically select pixel embeddings with high semantics as initial queries, (2) a dual-path Transformer decoder that alternately updates query and pixel features for richer embeddings, and (3) ground truth mask-guided learning to enhance the performance of masked attention. FastInst surpasses most state-of-the-art real-time instance segmentation methods in both speed and accuracy on the COCO dataset. With a ResNet-50 backbone, FastInst-D1 achieves 35.6 AP at 53.8 FPS, outperforming strong convolutional baselines. Using a ResNet-50-d-DCN backbone, FastInst-D3 achieves real-time performance (32.5 FPS) with an AP exceeding 40 (40.5 AP). Like other query-based models, FastInst struggles with segmenting small objects. While effective, ground truth mask-guided learning increases training costs. instance segmentation, query-based model, real-time, transformer, computer vision
2303.08566 Report Sensitivity-Aware Visual Parameter-Efficient Fine-Tuning Haoyu He, Jianfei Cai, Jing Zhang, Dacheng Tao, Bohan Zhuang Visual Parameter-Efficient Fine-Tuning (PEFT) has become a powerful alternative for full fine-tuning so as to adapt pre-trained vision models to downstream tasks, which only tunes a small number of parameters while freezing the vast majority ones to ease storage burden and optimization difficulty. However, existing PEFT methods introduce trainable parameters to the same positions across different tasks depending solely on human heuristics and neglect the domain gaps. To this end, we study where to introduce and how to allocate trainable parameters by proposing a novel Sensitivity-aware visual Parameter-efficient fine-Tuning (SPT) scheme, which adaptively allocates trainable parameters to task-specific important positions given a desired tunable parameter budget. Specifically, our SPT first quickly identifies the sensitive parameters that require tuning for a given task in a data-dependent way. Next, our SPT further boosts the representational capability for the weight matrices whose number of sensitive parameters exceeds a pre-defined threshold by utilizing existing structured tuning methods, e.g., LoRA [23] or Adapter [22], to replace directly tuning the selected sensitive parameters (unstructured tuning) under the budget. Extensive experiments on a wide range of downstream recognition tasks show that our SPT is complementary to the existing PEFT methods and largely boosts their performance, e.g., SPT improves Adapter with supervised pre-trained ViT-B/16 backbone by 4.2% and 1.4% mean Top-1 accuracy, reaching SOTA performance on FGVC and VTAB-1k benchmarks, respectively. Source code is at https://github.com/ziplab/SPT This paper presents SPT, a Sensitivity-aware visual Parameter-efficient fine-Tuning scheme that adaptively allocates trainable parameters to task-specific important positions. Existing PEFT methods introduce trainable parameters to the same positions across different tasks, neglecting domain gaps. SPT addresses this by identifying and leveraging task-specific parameter sensitivity. SPT identifies sensitive parameters using a data-dependent criterion based on loss reduction when tuned. It then allocates a budget of trainable parameters using both unstructured (directly tuning sensitive parameters) and structured (using methods like LoRA or Adapter) tuning. SPT consistently outperforms existing PEFT methods and full fine-tuning, especially with self-supervised backbones. Structured tuning is particularly effective for datasets with large domain gaps. SPT is robust to the number of training samples used to calculate parameter sensitivity. The fine-tuning memory cost of SPT is slightly higher than some reparameterization-based methods due to sparse gradient updates. Future work includes adapting SPT to more downstream tasks and improving training efficiency. parameter-efficient fine-tuning, transfer learning, vision transformers, sensitivity analysis, structured tuning
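A hedged sketch of the data-dependent sensitivity criterion described for SPT above: approximate how much the loss would change if a parameter were tuned by accumulating a first-order score over a few batches, then keep only the top-scoring parameters under the tunable-parameter budget. The exact criterion and thresholding in the paper may differ.

```python
import torch

def parameter_sensitivity(model, loss_fn, loader, num_batches=8):
    """Accumulate |grad * weight| per parameter over a few (x, y) batches as a proxy
    for how much the loss would drop if that parameter were tuned."""
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for step, (x, y) in enumerate(loader):
        if step >= num_batches:
            break
        model.zero_grad()
        loss_fn(model(x), y).backward()
        with torch.no_grad():
            for n, p in model.named_parameters():
                if p.grad is not None:
                    scores[n] += (p.grad * p).abs()
    return scores  # rank entries and tune only the top-scoring parameters under the budget
```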
2303.08370 Report Harnessing Low-Frequency Neural Fields for Few-Shot View Synthesis Liangchen Song, Zhong Li, Xuan Gong, Lele Chen, Zhang Chen, Yi Xu, Junsong Yuan Neural Radiance Fields (NeRF) have led to breakthroughs in the novel view synthesis problem. Positional Encoding (P.E.) is a critical factor that brings the impressive performance of NeRF, where low-dimensional coordinates are mapped to high-dimensional space to better recover scene details. However, blindly increasing the frequency of P.E. leads to overfitting when the reconstruction problem is highly underconstrained, e.g., few-shot images for training. We harness low-frequency neural fields to regularize high-frequency neural fields from overfitting to better address the problem of few-shot view synthesis. We propose reconstructing with a low-frequency only field and then finishing details with a high-frequency equipped field. Unlike most existing solutions that regularize the output space (i.e., rendered images), our regularization is conducted in the input space (i.e., signal frequency). We further propose a simple-yet-effective strategy for tuning the frequency to avoid overfitting few-shot inputs: enforcing consistency among the frequency domain of rendered 2D images. Thanks to the input space regularizing scheme, our method readily applies to inputs beyond spatial locations, such as the time dimension in dynamic scenes. Comparisons with state-of-the-art on both synthetic and natural datasets validate the effectiveness of our proposed solution for few-shot view synthesis. Code is available at https://github.com/lsongx/halo. This paper presents HALO, a novel method for few-shot view synthesis that leverages low-frequency neural fields to regularize high-frequency neural fields and prevent overfitting. NeRF struggles with few-shot view synthesis due to overfitting to limited training views, resulting in inaccurate scene representations. HALO addresses this limitation by harnessing the smooth geometry produced by low-frequency neural fields to guide the learning of high-frequency details. HALO consists of a three-stage training process: (1) Train a low-frequency NeRF (Lo-NeRF) with tuned frequency; (2) Train a ray-based field supervised by Lo-NeRF to efficiently predict rough depth for each ray; (3) Train a high-frequency NeRF (Hi-NeRF), guided by the ray-based field and regularized to maintain geometry consistency with Lo-NeRF. HALO achieves comparable results to state-of-the-art methods like DietNeRF on 360° rendering tasks, without relying on external semantic supervision. The method demonstrates superior extrapolation ability compared to DietNeRF, accurately reconstructing periodic structures and textures in unseen areas. HALO effectively improves novel view synthesis quality on forward-facing light field data and dynamic scenes, demonstrating its generalizability. The optimal frequency for Lo-NeRF is determined empirically, potentially limiting its applicability to diverse scenes. The method assumes the availability of at least three views for reconstructing a reasonable initial geometry. novel view synthesis, neural radiance fields (nerf), few-shot learning, positional encoding, frequency regularization
2303.08331 Report Towards High-Quality and Efficient Video Super-Resolution via Spatial-Temporal Data Overfitting Gen Li, Jie Ji, Minghai Qin, Wei Niu, Bin Ren, Fatemeh Afghah, Linke Guo, Xiaolong Ma As deep convolutional neural networks (DNNs) are widely used in various fields of computer vision, leveraging the overfitting ability of the DNN to achieve video resolution upscaling has become a new trend in the modern video delivery system. By dividing videos into chunks and overfitting each chunk with a super-resolution model, the server encodes videos before transmitting them to the clients, thus achieving better video quality and transmission efficiency. However, a large number of chunks are expected to ensure good overfitting quality, which substantially increases the storage and consumes more bandwidth resources for data transmission. On the other hand, decreasing the number of chunks through training optimization techniques usually requires high model capacity, which significantly slows down execution speed. To reconcile such, we propose a novel method for high-quality and efficient video resolution upscaling tasks, which leverages the spatial-temporal information to accurately divide video into chunks, thus keeping the number of chunks as well as the model size to minimum. Additionally, we advance our method into a single overfitting model by a data-aware joint training technique, which further reduces the storage requirement with negligible quality drop. We deploy our models on an off-the-shelf mobile phone, and experimental results show that our method achieves real-time video super-resolution with high video quality. Compared with the state-of-the-art, our method achieves 28 fps streaming speed with 41.6 PSNR, which is 14x faster and 2.29 dB better in the live video resolution upscaling tasks. Code available in https://github.com/coulsonlee/STDO-CVPR2023.git This paper proposes STDO, a novel spatial-temporal data overfitting approach for high-quality and efficient video resolution upscaling, which leverages spatial-temporal information to divide video into chunks for overfitting with independent or a single jointly trained SR model. Existing video resolution upscaling methods suffer from limited generalization ability or require large models with high computation costs for overfitting. STDO addresses these limitations by efficiently encoding HR videos into LR videos and compact SR models while maintaining high super-resolution quality. STDO divides video frames into patches and groups them into chunks based on PSNR values obtained from a pre-trained SR model. It then overfits each chunk with an independent SR model. Additionally, it introduces JSTDO, which utilizes a data-aware joint training technique to generate a single SR model for the entire video with minimal quality loss. STDO consistently outperforms state-of-the-art methods in video super-resolution quality (PSNR) while using smaller SR models with lower computation costs. JSTDO effectively reduces the model size while maintaining comparable PSNR to STDO, making it suitable for deployment on resource-constrained devices. Deploying on mobile devices, STDO achieves real-time video super-resolution performance with high video quality. The performance of STDO may be affected when encountering significant scene changes in long videos. Future work includes exploring more sophisticated data scheduling strategies for joint training in JSTDO to further improve efficiency and quality. video super-resolution, data overfitting, spatial-temporal information, joint training, mobile deployment
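The chunking step described for STDO above can be sketched as scoring every LR/HR patch pair by the PSNR of a pre-trained SR model and splitting patches into chunks along PSNR quantiles, so each chunk is later overfit by its own compact model. Function names and the quantile split are illustrative, not the released code.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(max_val ** 2 / max(mse, 1e-12))

def chunk_patches_by_psnr(lr_patches, hr_patches, sr_model, num_chunks=4):
    """Score each LR/HR patch pair with a pre-trained SR model and split the patches
    into num_chunks groups along PSNR quantiles."""
    scores = np.array([psnr(sr_model(lr), hr) for lr, hr in zip(lr_patches, hr_patches)])
    edges = np.quantile(scores, np.linspace(0.0, 1.0, num_chunks + 1))
    chunk_ids = np.clip(np.searchsorted(edges, scores, side="right") - 1, 0, num_chunks - 1)
    return [np.flatnonzero(chunk_ids == c) for c in range(num_chunks)]  # patch indices per chunk
```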
2303.08320 Report VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, Tieniu Tan A diffusion probabilistic model (DPM), which constructs a forward diffusion process by gradually adding noise to data points and learns the reverse denoising process to generate new samples, has been shown to handle complex data distribution. Despite its recent success in image synthesis, applying DPMs to video generation is still challenging due to high-dimensional data spaces. Previous methods usually adopt a standard diffusion process, where frames in the same video clip are destroyed with independent noises, ignoring the content redundancy and temporal correlation. This work presents a decomposed diffusion process via resolving the per-frame noise into a base noise that is shared among all frames and a residual noise that varies along the time axis. The denoising pipeline employs two jointly-learned networks to match the noise decomposition accordingly. Experiments on various datasets confirm that our approach, termed as VideoFusion, surpasses both GAN-based and diffusion-based alternatives in high-quality video generation. We further show that our decomposed formulation can benefit from pre-trained image diffusion models and well-support text-conditioned video creation. Presents VideoFusion, a decomposed diffusion probabilistic model for high-quality video generation that decomposes per-frame noise into shared base noise and time-varying residual noise, enabling efficient learning of spatial-temporal correlations. Addresses the challenge of applying diffusion models to high-dimensional video data by leveraging content redundancy and temporal correlations within video frames. Decomposes diffusion process into shared base noise and residual noise; employs two jointly-learned networks for denoising, leveraging pre-trained image diffusion models for base noise estimation. Outperforms GAN-based and diffusion-based methods on UCF101, Sky Time-lapse, and TaiChi-HD datasets in terms of FVD, KVD, and IS. Effectively leverages pre-trained image diffusion models, improving efficiency and results. Exhibits potential for content control and generation of longer coherent video sequences. Shared base noise might limit motion diversity in generated videos. Current implementation relies on pre-trained prior for conditioning, potentially limiting performance in long-text video generation. video generation, diffusion models, decomposed representation, pre-trained models, content control
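The noise decomposition at the core of VideoFusion (entry above) is easy to illustrate: every frame of a clip shares one base noise map and adds its own residual, so noisy frames stay temporally correlated. The mixing weight lambda_t below is an assumed scalar in [0, 1].

```python
import torch

def decomposed_video_noise(num_frames, shape, lambda_t=0.5, device="cpu"):
    """Per-frame noise = shared base noise + frame-specific residual noise."""
    base = torch.randn(1, *shape, device=device)                # shared across all frames
    residual = torch.randn(num_frames, *shape, device=device)   # varies along the time axis
    noise = (lambda_t ** 0.5) * base + ((1.0 - lambda_t) ** 0.5) * residual
    return noise, base, residual
```

Because lambda_t + (1 - lambda_t) = 1, each frame's noise keeps unit variance, so the standard DDPM forward process applies unchanged while adjacent frames remain correlated.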
2303.08132 Report InstMove: Instance Motion for Object-centric Video Segmentation Qihao Liu, Junfeng Wu, Yi Jiang, Xiang Bai, Alan Yuille, Song Bai Despite significant efforts, cutting-edge video segmentation methods still remain sensitive to occlusion and rapid movement, due to their reliance on the appearance of objects in the form of object embeddings, which are vulnerable to these disturbances. A common solution is to use optical flow to provide motion information, but essentially it only considers pixel-level motion, which still relies on appearance similarity and hence is often inaccurate under occlusion and fast movement. In this work, we study the instance-level motion and present InstMove, which stands for Instance Motion for Object-centric Video Segmentation. In comparison to pixel-wise motion, InstMove mainly relies on instance-level motion information that is free from image feature embeddings, and features physical interpretations, making it more accurate and robust toward occlusion and fast-moving objects. To better fit in with the video segmentation tasks, InstMove uses instance masks to model the physical presence of an object and learns the dynamic model through a memory network to predict its position and shape in the next frame. With only a few lines of code, InstMove can be integrated into current SOTA methods for three different video segmentation tasks and boost their performance. Specifically, we improve the previous arts by 1.5 AP on OVIS dataset, which features heavy occlusions, and 4.9 AP on YouTubeVIS-Long dataset, which mainly contains fast-moving objects. These results suggest that instance-level motion is robust and accurate, and hence serving as a powerful solution in complex scenarios for object-centric video segmentation. This paper introduces InstMove, a novel instance motion module for object-centric video segmentation that predicts object motion and deformation directly from instance masks. Existing video segmentation methods struggle with occlusion and rapid movement due to their reliance on appearance-based object embeddings. InstMove addresses this by leveraging instance-level motion information, which is more robust and accurate. InstMove utilizes an RNN-based module with a memory network to extract motion features from previous instance masks, store and retrieve dynamic information, and predict the position and shape of the object in the next frame. Image features can be incorporated to refine boundary prediction. InstMove significantly outperforms optical flow-based motion prediction, especially in challenging scenarios with occlusions or fast-moving objects. Integrating InstMove with SOTA methods for VIS, VOS, and MOTS tasks leads to consistent performance improvements on benchmarks like OVIS, YouTubeVIS-Long, and BDD100K. The improvements are particularly notable in complex scenarios with heavy occlusion and rapid motion, highlighting the effectiveness of incorporating instance-level motion information. The current implementation relies on low-level image features for boundary refinement, which might limit its generalizability. The computational cost of InstMove could be further optimized for real-time applications. video segmentation, instance motion, motion prediction, object tracking, occlusion handling
2303.08131 Report A Simple Framework for Open-Vocabulary Segmentation and Detection Hao Zhang, Feng Li, Xueyan Zou, Shilong Liu, Chunyuan Li, Jianfeng Gao, Jianwei Yang, Lei Zhang We present OpenSeeD, a simple Open-vocabulary Segmentation and Detection framework that jointly learns from different segmentation and detection datasets. To bridge the gap of vocabulary and annotation granularity, we first introduce a pre-trained text encoder to encode all the visual concepts in two tasks and learn a common semantic space for them. This gives us reasonably good results compared with the counterparts trained on segmentation task only. To further reconcile them, we locate two discrepancies: i) task discrepancy -- segmentation requires extracting masks for both foreground objects and background stuff, while detection merely cares about the former; ii) data discrepancy -- box and mask annotations are with different spatial granularity, and thus not directly interchangeable. To address these issues, we propose a decoupled decoding to reduce the interference between foreground/background and a conditioned mask decoding to assist in generating masks for given boxes. To this end, we develop a simple encoder-decoder model encompassing all three techniques and train it jointly on COCO and Objects365. After pre-training, our model exhibits competitive or stronger zero-shot transferability for both segmentation and detection. Specifically, OpenSeeD beats the state-of-the-art method for open-vocabulary instance and panoptic segmentation across 5 datasets, and outperforms previous work for open-vocabulary detection on LVIS and ODinW under similar settings. When transferred to specific tasks, our model achieves new SoTA for panoptic segmentation on COCO and ADE20K, and instance segmentation on ADE20K and Cityscapes. Finally, we note that OpenSeeD is the first to explore the potential of joint training on segmentation and detection, and hope it can be received as a strong baseline for developing a single model for both tasks in open world. This paper proposes OpenSeeD, a simple framework for building an open-vocabulary model that can perform both segmentation and detection by jointly learning from segmentation and detection datasets. Existing methods primarily focus on either open-vocabulary detection or segmentation, but not both. This work explores bridging the gap between detection and segmentation to achieve a single model for both tasks in the open world. OpenSeeD employs a shared text encoder to align visual and textual semantics. It tackles task discrepancies by using decoupled foreground/background decoding and addresses data discrepancies via conditioned mask decoding. OpenSeeD achieves state-of-the-art zero-shot segmentation performance on multiple datasets, outperforming methods like ODISE and X-Decoder. It exhibits competitive zero-shot detection performance, surpassing GLIP on LVIS under similar settings. OpenSeeD sets new state-of-the-art results for task-specific transfer on COCO and ADE20K panoptic segmentation and ADE20K and Cityscapes instance segmentation. The model currently doesn't incorporate referring/grounding data or large-scale image-text pairs, which could further enhance training data and semantic coverage. Future work will explore a more comprehensive joint training approach that leverages these additional data sources. open vocabulary, segmentation, detection, joint learning, conditioned mask decoding
2303.08129 Report PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection Anthony Chen, Kevin Zhang, Renrui Zhang, Zihan Wang, Yuheng Lu, Yandong Guo, Shanghang Zhang Masked Autoencoders learn strong visual representations and achieve state-of-the-art results in several independent modalities, yet very few works have addressed their capabilities in multi-modality settings. In this work, we focus on point cloud and RGB image data, two modalities that are often presented together in the real world, and explore their meaningful interactions. To improve upon the cross-modal synergy in existing works, we propose PiMAE, a self-supervised pre-training framework that promotes 3D and 2D interaction through three aspects. Specifically, we first notice the importance of masking strategies between the two sources and utilize a projection module to complementarily align the mask and visible tokens of the two modalities. Then, we utilize a well-crafted two-branch MAE pipeline with a novel shared decoder to promote cross-modality interaction in the mask tokens. Finally, we design a unique cross-modal reconstruction module to enhance representation learning for both modalities. Through extensive experiments performed on large-scale RGB-D scene understanding benchmarks (SUN RGB-D and ScannetV2), we discover it is nontrivial to interactively learn point-image features, where we greatly improve multiple 3D detectors, 2D detectors, and few-shot classifiers by 2.9%, 6.7%, and 2.4%, respectively. Code is available at https://github.com/BLVLab/PiMAE. PiMAE, a novel self-supervised pre-training framework for 3D object detection, is introduced. It leverages masked autoencoders to learn interactive point cloud and RGB image representations. Existing methods struggle to effectively bridge 3D and 2D data for enhanced feature learning in multi-modal settings. PiMAE addresses this limitation by maximizing cross-modal synergy between point cloud and image data. PiMAE employs a two-branch MAE architecture with a shared decoder, ensuring both modal-specific and cross-modal learning. It introduces a novel complementary masking strategy, aligning masks between projected point tokens and image patches, and incorporates a cross-modal reconstruction module to strengthen representation learning. PiMAE significantly improves the performance of 3D object detectors, outperforming state-of-the-art methods by a large margin on SUN RGB-D and ScanNetV2 datasets. PiMAE demonstrates strong generalization ability, enhancing 2D object detection on ScanNetV2 and monocular 3D detection on KITTI. PiMAE excels in few-shot image classification, demonstrating the effectiveness of its learned image representations on CIFAR-FS, FC100, and miniImageNet datasets. The reliance on projection for alignment might limit its applicability to scenarios with inaccurate camera poses. Future work can investigate alternative fusion mechanisms beyond the shared decoder to further enhance cross-modal learning. 3d object detection, multi-modal learning, masked autoencoders, self-supervised learning, point cloud, rgb image fusion
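A rough sketch of the complementary masking idea in the PiMAE entry above: project point tokens into the image, find the patches they land in, and mask exactly those patches in the image branch so the two modalities see complementary regions. The projection geometry and tensor layout are assumptions for illustration.

```python
import torch

def complementary_image_mask(point_xyz, visible_idx, K, patch_size, grid_hw):
    """point_xyz: (N, 3) points in camera coordinates; visible_idx: indices of the point
    tokens kept visible in the point branch; grid_hw: (rows, cols) of image patches.
    Returns a boolean mask where True means "mask this patch in the image branch"."""
    uvw = point_xyz @ K.T
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)                 # pixel coordinates
    cols = (uv[:, 0] // patch_size).long().clamp(0, grid_hw[1] - 1)
    rows = (uv[:, 1] // patch_size).long().clamp(0, grid_hw[0] - 1)
    patch_id = rows * grid_hw[1] + cols                           # flat patch index per point
    image_mask = torch.zeros(grid_hw[0] * grid_hw[1], dtype=torch.bool)
    image_mask[patch_id[visible_idx]] = True                      # complementary to point tokens
    return image_mask
```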
2303.08120 Report Blind Video Deflickering by Neural Filtering with a Flawed Atlas Chenyang Lei, Xuanchi Ren, Zhaoxiang Zhang, Qifeng Chen Many videos contain flickering artifacts. Common causes of flicker include video processing algorithms, video generation algorithms, and capturing videos under specific situations. Prior work usually requires specific guidance such as the flickering frequency, manual annotations, or extra consistent videos to remove the flicker. In this work, we propose a general flicker removal framework that only receives a single flickering video as input without additional guidance. Since it is blind to a specific flickering type or guidance, we name this "blind deflickering." The core of our approach is utilizing the neural atlas in cooperation with a neural filtering strategy. The neural atlas is a unified representation for all frames in a video that provides temporal consistency guidance but is flawed in many cases. To this end, a neural network is trained to mimic a filter to learn the consistent features (e.g., color, brightness) and avoid introducing the artifacts in the atlas. To validate our method, we construct a dataset that contains diverse real-world flickering videos. Extensive experiments show that our method achieves satisfying deflickering performance and even outperforms baselines that use extra guidance on a public benchmark. This paper proposes the first "blind deflickering" approach, capable of removing diverse flickering artifacts from videos without needing to know the specific flicker type or requiring extra guidance. Many videos suffer from flickering artifacts due to various reasons (e.g., old cameras, high-speed cameras, video processing algorithms), and existing methods are often task-specific or require additional guidance like consistent videos, which limits their applicability. The method leverages a neural atlas to represent all video frames in a unified manner, ensuring temporal consistency. Since the atlas can have flaws, it employs a neural filtering strategy to learn invariant features from distorted versions of the atlas and input frames, effectively removing flicker while preserving important details. The proposed method achieves state-of-the-art performance on a newly constructed dataset containing diverse flickering videos. It outperforms baselines designed for specific flickering types. Even without using extra input videos for guidance, it surpasses methods that rely on them, demonstrating its effectiveness and broader applicability. The method might not handle temporal inconsistencies arising from content variations (e.g., significant content differences in generated videos or large scratches in old films). Future work could explore extensions to address these limitations and apply the blind deflickering concept to other tasks like novel view synthesis. video deflickering, neural atlas, neural filtering, temporal consistency, video processing
2303.08096 Report MELON: NeRF with Unposed Images in SO(3) Axel Levy, Mark Matthews, Matan Sela, Gordon Wetzstein, Dmitry Lagun Neural radiance fields enable novel-view synthesis and scene reconstruction with photorealistic quality from a few images, but require known and accurate camera poses. Conventional pose estimation algorithms fail on smooth or self-similar scenes, while methods performing inverse rendering from unposed views require a rough initialization of the camera orientations. The main difficulty of pose estimation lies in real-life objects being almost invariant under certain transformations, making the photometric distance between rendered views non-convex with respect to the camera parameters. Using an equivalence relation that matches the distribution of local minima in camera space, we reduce this space to its quotient set, in which pose estimation becomes a more convex problem. Using a neural-network to regularize pose estimation, we demonstrate that our method - MELON - can reconstruct a neural radiance field from unposed images with state-of-the-art accuracy while requiring ten times fewer views than adversarial approaches. MELON infers a neural radiance field from unposed images by simultaneously training a CNN encoder that maps images to camera poses and a neural radiance field of the scene. This approach eliminates the need for known camera poses in neural rendering, which is a significant limitation in applications like novel view synthesis and scene reconstruction. MELON introduces a Modulo-Equivalent Loss (MEL) that replicates encoder outputs based on an equivalence relation in camera space. This allows the encoder to operate in a quotient set, simplifying pose estimation. The method is illustrated with a 1D toy problem and then applied to 3D inverse rendering. MELON demonstrates competitive reconstruction metrics on synthetic and real datasets, outperforming existing methods like GNeRF in terms of pose accuracy and novel view synthesis quality. It exhibits robustness to noise, generating noise-free novel views even from noisy input images. The method requires significantly fewer views compared to adversarial approaches, successfully reconstructing scenes from as few as six unposed images. Characterizing the full loss landscape of 3D inverse rendering with unknown poses is still an open question, and the theoretical analysis currently relies on simplifying assumptions. The assumption of a perfectly object-centered setup limits its applicability to real-world scenarios, and predicting camera extrinsics in SE(3) remains a challenge. neural radiance fields, pose estimation, novel view synthesis, inverse rendering, unposed images
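The modulo-equivalent idea in the MELON entry above can be sketched by replicating the encoder's predicted azimuth over a small set of fixed offsets, rendering each candidate, and back-propagating only through the lowest photometric error, so near-symmetric local minima no longer trap pose estimation. `encoder` and `render` are placeholders for the pose CNN and NeRF renderer; the replica count and loss are assumptions.

```python
import math
import torch

def modulo_equivalent_loss(image, encoder, render, num_replicas=4):
    """encoder: image -> (azimuth, elevation); render: (azimuth, elevation) -> image."""
    azimuth, elevation = encoder(image)
    offsets = torch.arange(num_replicas, dtype=torch.float32) * (2.0 * math.pi / num_replicas)
    losses = []
    for off in offsets:
        rendered = render(azimuth + off, elevation)               # one replicated candidate pose
        losses.append(((rendered - image) ** 2).mean())           # photometric error
    return torch.stack(losses).min()                              # optimize only the best replica
```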
2303.08085 Report Alias-Free Convnets: Fractional Shift Invariance via Polynomial Activations Hagay Michaeli, Tomer Michaeli, Daniel Soudry Although CNNs are believed to be invariant to translations, recent works have shown this is not the case, due to aliasing effects that stem from downsampling layers. The existing architectural solutions to prevent aliasing are partial since they do not solve these effects, that originate in non-linearities. We propose an extended anti-aliasing method that tackles both downsampling and non-linear layers, thus creating truly alias-free, shift-invariant CNNs. We show that the presented model is invariant to integer as well as fractional (i.e., sub-pixel) translations, thus outperforming other shift-invariant methods in terms of robustness to adversarial translations. This paper proposes Alias-Free Convnet (AFC), a convolutional neural network architecture that achieves shift-invariance by eliminating aliasing effects through the use of polynomial activations and alias-free downsampling layers. Shift-invariance is a desirable property in CNNs for image classification as it improves generalization, robustness to adversarial attacks, and consistency of predictions under image translations. The authors modify the ConvNeXt architecture by replacing standard activations with polynomial activations and strided convolutions with BlurPool layers. Polynomial activations have limited bandwidth expansion, which is addressed by upsampling and downsampling within the activation function. BlurPool layers implement alias-free downsampling using low-pass filtering before subsampling. AFC achieves 100% consistency to both integer and fractional pixel translations. AFC demonstrates superior robustness to adversarial attacks based on image translations, maintaining high accuracy even under fractional pixel shifts. The study provides the first demonstration of competitive performance with polynomial activations on ImageNet. The modifications for alias-free properties come at the cost of a slight reduction in standard test accuracy compared to the baseline ConvNeXt model. The guaranteed shift-invariance of AFC is limited to circular translations. While the paper shows improved robustness to other types of translations, perfect invariance is not guaranteed. convolutional neural networks, shift-invariance, aliasing, polynomial activations, adversarial robustness
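A rough illustration of the bandwidth-aware polynomial activation described above: upsample, apply a low-degree polynomial (whose bandwidth expansion is bounded), low-pass filter, and downsample. A faithful alias-free implementation would use an ideal (sinc/FFT) low-pass filter; the separable blur below is only a stand-in to keep the sketch short.

```python
import torch
import torch.nn.functional as F

def blur_downsample(x):
    """Approximate low-pass filtering followed by 2x subsampling (BlurPool-style)."""
    k = torch.tensor([1.0, 3.0, 3.0, 1.0])
    k = (k[:, None] * k[None, :]) / 64.0                          # 4x4 normalized kernel
    k = k.expand(x.shape[1], 1, 4, 4).contiguous().to(x)
    return F.conv2d(x, k, stride=2, padding=1, groups=x.shape[1])

def polynomial_activation(x):
    x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)  # upsample
    x = x + 0.5 * x * x                                           # degree-2 polynomial nonlinearity
    return blur_downsample(x)                                     # back to original resolution
```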
2303.08084 Report Editing Implicit Assumptions in Text-to-Image Diffusion Models Hadas Orgad, Bahjat Kawar, Yonatan Belinkov Text-to-image diffusion models often make implicit assumptions about the world when generating images. While some assumptions are useful (e.g., the sky is blue), they can also be outdated, incorrect, or reflective of social biases present in the training data. Thus, there is a need to control these assumptions without requiring explicit user input or costly re-training. In this work, we aim to edit a given implicit assumption in a pre-trained diffusion model. Our Text-to-Image Model Editing method, TIME for short, receives a pair of inputs: a "source" under-specified prompt for which the model makes an implicit assumption (e.g., "a pack of roses"), and a "destination" prompt that describes the same setting, but with a specified desired attribute (e.g., "a pack of blue roses"). TIME then updates the model's cross-attention layers, as these layers assign visual meaning to textual tokens. We edit the projection matrices in these layers such that the source prompt is projected close to the destination prompt. Our method is highly efficient, as it modifies a mere 2.2% of the model's parameters in under one second. To evaluate model editing approaches, we introduce TIMED (TIME Dataset), containing 147 source and destination prompt pairs from various domains. Our experiments (using Stable Diffusion) show that TIME is successful in model editing, generalizes well for related prompts unseen during editing, and imposes minimal effect on unrelated generations. TIME is a method for editing implicit assumptions in text-to-image diffusion models after training. Text-to-image models often make implicit assumptions that are useful but can also be outdated, incorrect, or reflect societal biases. TIME leverages the ability of diffusion models to generate different outputs based on explicit specification. It modifies the projection matrices in the cross-attention layers to map a user-specified 'source' prompt closer to a 'destination' prompt with the desired attribute. TIME successfully edits model assumptions for various prompts. The method generalizes well to related prompts unseen during editing. The overall generative quality of the model remains unaffected after editing as measured by FID and CLIP Score. TIME inherits the generative limitations of the diffusion model it edits, it cannot teach the model entirely new concepts. The method can sometimes apply an edit too mildly or aggressively, hindering its generalizability or specificity. text-to-image generation, diffusion models, model editing, implicit assumptions, social bias mitigation
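The editing step in the TIME entry above can be sketched as a ridge-regularized closed-form update of one cross-attention projection matrix: map the source-prompt token embeddings to the outputs the destination prompt would produce, while keeping the matrix close to its original value. The objective below mirrors the spirit of the method; the paper's exact formulation and hyper-parameters may differ.

```python
import torch

def edit_projection(W, src_emb, dst_emb, lam=0.1):
    """W: (d_out, d_in) key or value projection; src_emb, dst_emb: (n, d_in) token
    embeddings of the source and destination prompts."""
    targets = dst_emb @ W.T                                       # outputs we want for source tokens
    A = lam * torch.eye(W.shape[1], device=W.device, dtype=W.dtype) + src_emb.T @ src_emb
    B = lam * W + targets.T @ src_emb
    return B @ torch.linalg.inv(A)                                # edited projection matrix
```

This is the minimizer of ||W' c_i - v_i||^2 summed over source tokens c_i with desired outputs v_i, plus lam * ||W' - W||^2, so the edit is applied in under a second without any gradient-based fine-tuning.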
2303.08063 Report Interpretable ODE-style Generative Diffusion Model via Force Field Construction Weiyang Jin, Yongpei Zhu, Yuxi Peng For a considerable time, researchers have focused on developing a method that establishes a deep connection between the generative diffusion model and mathematical physics. Despite previous efforts, progress has been limited to the pursuit of a single specialized method. In order to advance the interpretability of diffusion models and explore new research directions, it is essential to establish a unified ODE-style generative diffusion model. Such a model should draw inspiration from physical models and possess a clear geometric meaning. This paper aims to identify various physical models that are suitable for constructing ODE-style generative diffusion models accurately from a mathematical perspective. We then summarize these models into a unified method. Additionally, we perform a case study where we use the theoretical model identified by our method to develop a range of new diffusion model methods, and conduct experiments. Our experiments on CIFAR-10 demonstrate the effectiveness of our approach. We have constructed a computational framework that attains highly proficient results with regards to image generation speed, alongside an additional model that demonstrates exceptional performance in both Inception score and FID score. These results underscore the significance of our method in advancing the field of diffusion models. This paper proposes a novel method for constructing interpretable ODE-style generative diffusion models by leveraging force field construction inspired by physical models. This work aims to enhance the interpretability of diffusion models and explore new research avenues by establishing a unified ODE-style generative diffusion model framework grounded in mathematical physics. The authors establish a connection between ODE-style diffusion models and the transport equation from physics. They utilize Green's functions to construct vector fields satisfying initial and final distribution conditions and provide solutions for specific cases like isotropic fields. Different trajectory types (linear, distribution-based, curve) are proposed and their learning process is formulated as a score matching objective. Sampling methods tailored for each trajectory type are also presented. The diffusion model with multi-sample linear superposition achieved the best Inception and FID scores on CIFAR-10. The diffusion model with a one-sample straight line demonstrated high efficiency within a limited number of iterations. The study revealed that excessively high numbers of fitted curves in multi-sample straight line models can lead to mode collapse. The paper primarily focuses on image generation, and further investigation is needed to extend the proposed method to other data modalities. The analysis of mode collapse in multi-sample straight-line trajectories, while insightful, could benefit from more detailed theoretical exploration. generative diffusion models, force field construction, ode-style diffusion models, interpretable machine learning, score matching
2303.07945 Report Edit-A-Video: Single Video Editing with Object-Aware Consistency Chaehun Shin, Heeseung Kim, Che Hyun Lee, Sang-gil Lee, Sungroh Yoon Despite the fact that text-to-video (TTV) model has recently achieved remarkable success, there have been few approaches on TTV for its extension to video editing. Motivated by approaches on TTV models adapting from diffusion-based text-to-image (TTI) models, we suggest the video editing framework given only a pretrained TTI model and a single pair, which we term Edit-A-Video. The framework consists of two stages: (1) inflating the 2D model into the 3D model by appending temporal modules and tuning on the source video (2) inverting the source video into the noise and editing with target text prompt and attention map injection. Each stage enables the temporal modeling and preservation of semantic attributes of the source video. One of the key challenges for video editing include a background inconsistency problem, where the regions not included for the edit suffer from undesirable and inconsistent temporal alterations. To mitigate this issue, we also introduce a novel mask blending method, termed as sparse-causal blending (SC Blending). We improve previous mask blending methods to reflect the temporal consistency so that the area where the editing is applied exhibits smooth transition while also achieving spatio-temporal consistency of the unedited regions. We present extensive experimental results over various types of text and videos, and demonstrate the superiority of the proposed method compared to baselines in terms of background consistency, text alignment, and video editing quality. This paper introduces Edit-A-Video, a novel framework for text-guided video editing using a pre-trained text-to-image (TTI) model and a single source video. Existing text-to-video editing methods often struggle to maintain temporal consistency, especially in background regions, leading to unrealistic and jarring edits. Edit-A-Video consists of two stages: 1) Inflating a 2D TTI model to a 3D model by adding temporal modules and finetuning on the source video, 2) Inverting the source video into noise and iteratively denoising it towards the target text while injecting attention maps. Crucially, it employs a novel "temporal-consistent blending" (TC Blending) method to ensure smooth and consistent edits across frames. Edit-A-Video successfully edits videos to match target text prompts while preserving background consistency and source video dynamics. The proposed TC Blending method significantly reduces background inconsistencies compared to traditional blending techniques. Quantitative and qualitative comparisons demonstrate Edit-A-Video's superiority over existing methods in terms of editing quality, text alignment, and background preservation. The method is currently limited to editing short video clips due to computational constraints. Future work could explore more sophisticated temporal modeling techniques and user-interactive editing tools. video editing, text-guided synthesis, diffusion models, temporal consistency, attention mechanisms
2303.07938 Report Controllable Mesh Generation Through Sparse Latent Point Diffusion Models Zhaoyang Lyu, Jinyi Wang, Yuwei An, Ya Zhang, Dahua Lin, Bo Dai Mesh generation is of great value in various applications involving computer graphics and virtual content, yet designing generative models for meshes is challenging due to their irregular data structure and inconsistent topology of meshes in the same category. In this work, we design a novel sparse latent point diffusion model for mesh generation. Our key insight is to regard point clouds as an intermediate representation of meshes, and model the distribution of point clouds instead. While meshes can be generated from point clouds via techniques like Shape as Points (SAP), the challenges of directly generating meshes can be effectively avoided. To boost the efficiency and controllability of our mesh generation method, we propose to further encode point clouds to a set of sparse latent points with point-wise semantic meaningful features, where two DDPMs are trained in the space of sparse latent points to respectively model the distribution of the latent point positions and features at these latent points. We find that sampling in this latent space is faster than directly sampling dense point clouds. Moreover, the sparse latent points also enable us to explicitly control both the overall structures and local details of the generated meshes. Extensive experiments are conducted on the ShapeNet dataset, where our proposed sparse latent point diffusion model achieves superior performance in terms of generation quality and controllability when compared to existing methods. This paper proposes SLIDE, a novel sparse latent point diffusion model for controllable 3D mesh generation. Mesh generation is crucial in computer graphics but challenging due to irregular mesh data and inconsistent topology. Existing methods struggle with limited topology and quality issues. This work aims to address these challenges by using point clouds as an intermediate representation and introducing a novel sparse latent point diffusion model. The approach involves: 1) Training an autoencoder that encodes a point cloud to features at a sparse set of latent points and decodes it back. 2) Training two DDPMs in the latent space, one for the distribution of sparse latent point positions and the other for the distribution of features at these points. SLIDE generates high-quality meshes with diverse topologies, outperforming baselines in visual quality and metrics like 1-NN, MMD, and COV. The model allows controllable mesh generation by manipulating the positions of sparse latent points, enabling control over overall structure and local details without part annotations. SLIDE is efficient, achieving faster generation speeds compared to DDPMs directly trained on point clouds. The correspondence between sparse latent points across different shapes needs improvement for better control. Exploring alternative surface reconstruction techniques beyond SAP might further enhance mesh quality. mesh generation, point cloud, diffusion models, deep learning, controllable generation
2303.07937 Report Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation Junyoung Seo, Wooseok Jang, Min-Seop Kwak, Hyeonsu Kim, Jaehoon Ko, Junho Kim, Jin-Hwa Kim, Jiyoung Lee, Seungryong Kim Text-to-3D generation has shown rapid progress in recent days with the advent of score distillation, a methodology of using pretrained text-to-2D diffusion models to optimize neural radiance field (NeRF) in the zero-shot setting. However, the lack of 3D awareness in the 2D diffusion models destabilizes score distillation-based methods from reconstructing a plausible 3D scene. To address this issue, we propose 3DFuse, a novel framework that incorporates 3D awareness into pretrained 2D diffusion models, enhancing the robustness and 3D consistency of score distillation-based methods. We realize this by first constructing a coarse 3D structure of a given text prompt and then utilizing projected, view-specific depth map as a condition for the diffusion model. Additionally, we introduce a training strategy that enables the 2D diffusion model learns to handle the errors and sparsity within the coarse 3D structure for robust generation, as well as a method for ensuring semantic consistency throughout all viewpoints of the scene. Our framework surpasses the limitations of prior arts, and has significant implications for 3D consistent generation of 2D diffusion models. This paper introduces 3DFuse, a novel framework that improves 3D consistency in text-to-3D generation by incorporating 3D awareness into pretrained 2D diffusion models. Existing score distillation-based text-to-3D generation methods often produce geometrically inconsistent scenes due to the lack of 3D awareness in 2D diffusion models. 3DFuse uses a consistency injection module to condition the diffusion model on sparse depth projections of a generated point cloud, effectively guiding the generation process with 3D information. It also employs semantic code sampling to reduce ambiguity in text prompts and enhance semantic consistency. 3DFuse significantly improves the geometric consistency and fidelity of generated 3D scenes compared to baselines like DreamFusion, SJC, and ProlificDreamer. Qualitative results and a proposed COLMAP-based quantitative metric demonstrate the effectiveness of 3DFuse in ensuring geometric consistency. User studies confirm that 3DFuse generates 3D scenes with higher fidelity and better geometric consistency than previous methods. The approach inherits the limitations of pretrained diffusion models in reflecting complex user prompts. Potential societal biases inherent in the training data may affect the generated 3D scenes. Future work could explore incorporating more sophisticated 3D priors and addressing the limitations of pretrained diffusion models. text-to-3d generation, score distillation sampling, 3d consistency, diffusion models, neural radiance fields (nerf)
2303.07820 Report Adaptive Rotated Convolution for Rotated Object Detection Yifan Pu, Yiru Wang, Zhuofan Xia, Yizeng Han, Yulin Wang, Weihao Gan, Zidong Wang, Shiji Song, Gao Huang Rotated object detection aims to identify and locate objects in images with arbitrary orientation. In this scenario, the oriented directions of objects vary considerably across different images, while multiple orientations of objects exist within an image. This intrinsic characteristic makes it challenging for standard backbone networks to extract high-quality features of these arbitrarily orientated objects. In this paper, we present Adaptive Rotated Convolution (ARC) module to handle the aforementioned challenges. In our ARC module, the convolution kernels rotate adaptively to extract object features with varying orientations in different images, and an efficient conditional computation mechanism is introduced to accommodate the large orientation variations of objects within an image. The two designs work seamlessly in rotated object detection problem. Moreover, ARC can conveniently serve as a plug-and-play module in various vision backbones to boost their representation ability to detect oriented objects accurately. Experiments on commonly used benchmarks (DOTA and HRSC2016) demonstrate that equipped with our proposed ARC module in the backbone network, the performance of multiple popular oriented object detectors is significantly improved (e.g., +3.03% mAP on Rotated RetinaNet and +4.16% on CFA). Combined with the highly competitive method Oriented R-CNN, the proposed approach achieves state-of-the-art performance on the DOTA dataset with 81.77% mAP. Code is available at https://github.com/LeapLabTHU/ARC. This paper proposes Adaptive Rotated Convolution (ARC), a plug-and-play module for boosting backbone network performance in rotated object detection. Standard backbone networks struggle to extract quality features from arbitrarily oriented objects, as their orientations differ significantly across and within images. ARC rotates convolution kernels adaptively based on input image features using a routing function. It employs conditional computation to efficiently handle multiple object orientations within a single image. ARC significantly improves the performance of various rotated object detectors (single-stage and two-stage) on DOTA and HRSC2016 datasets. Combined with Oriented R-CNN, ARC achieves state-of-the-art performance on DOTA. ARC maintains efficiency with a minimal increase in FLOPs and a slight drop in FPS compared to baseline models. The paper mainly focuses on replacing 3x3 convolutions and doesn't explore other kernel sizes extensively. Future work could investigate the integration of ARC with transformer-based backbones. rotated object detection, adaptive convolution, dynamic networks, conditional computation, backbone networks
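A sketch of the kernel-rotation step in the ARC entry above: resample a 3x3 kernel on a rotated grid before the convolution, with the angle supplied by a routing function (omitted here and passed in as `theta`). Bilinear resampling via `grid_sample` is a crude stand-in for the paper's implementation, and conditional computation over multiple kernels is not shown.

```python
import torch
import torch.nn.functional as F

def rotate_kernel(weight, theta):
    """weight: (out_ch, in_ch, k, k); theta: rotation angle in radians (0-d tensor)."""
    cos, sin, zero = torch.cos(theta), torch.sin(theta), torch.zeros_like(theta)
    rot = torch.stack([torch.stack([cos, -sin, zero]),
                       torch.stack([sin, cos, zero])])            # (2, 3) affine matrix
    grid = F.affine_grid(rot.expand(weight.shape[0], 2, 3), list(weight.shape),
                         align_corners=False)
    return F.grid_sample(weight, grid, align_corners=False)       # resampled (rotated) kernel

def adaptive_rotated_conv(x, weight, theta, padding=1):
    return F.conv2d(x, rotate_kernel(weight, theta), padding=padding)
```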
2303.07418 Report FreeNeRF: Improving Few-shot Neural Rendering with Free Frequency Regularization Jiawei Yang, Marco Pavone, Yue Wang Novel view synthesis with sparse inputs is a challenging problem for neural radiance fields (NeRF). Recent efforts alleviate this challenge by introducing external supervision, such as pre-trained models and extra depth signals, and by non-trivial patch-based rendering. In this paper, we present Frequency regularized NeRF (FreeNeRF), a surprisingly simple baseline that outperforms previous methods with minimal modifications to the plain NeRF. We analyze the key challenges in few-shot neural rendering and find that frequency plays an important role in NeRF's training. Based on the analysis, we propose two regularization terms. One is to regularize the frequency range of NeRF's inputs, while the other is to penalize the near-camera density fields. Both techniques are ``free lunches'' at no additional computational cost. We demonstrate that even with one line of code change, the original NeRF can achieve similar performance as other complicated methods in the few-shot setting. FreeNeRF achieves state-of-the-art performance across diverse datasets, including Blender, DTU, and LLFF. We hope this simple baseline will motivate a rethinking of the fundamental role of frequency in NeRF's training under the low-data regime and beyond. This paper introduces FreeNeRF, a simple yet effective baseline for few-shot neural rendering that leverages frequency and occlusion regularization. Few-shot neural rendering is challenging because NeRF models often overfit to limited training views and struggle to generalize to novel views. The authors analyze the failure modes of NeRF in few-shot settings and propose (1) frequency regularization to stabilize training by gradually introducing high-frequency components, and (2) occlusion regularization to penalize dense fields near the camera, mitigating artifacts like floaters. FreeNeRF outperforms previous state-of-the-art methods on Blender, DTU, and LLFF datasets in terms of novel view synthesis quality. The proposed method introduces minimal computational overhead compared to a plain NeRF, requiring no pre-training or additional rendering steps. Ablation studies validate the effectiveness of both frequency and occlusion regularization in improving few-shot neural rendering performance. A longer frequency curriculum can cause blurriness, resulting in lower LPIPS scores despite higher PSNR. Occlusion regularization might lead to over-regularization and incomplete representations of near-camera objects. neural rendering, nerf, few-shot learning, frequency regularization, occlusion regularization
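FreeNeRF's frequency regularization (entry above) amounts to masking the positional encoding so that only low frequencies are visible early in training and higher ones are revealed on a schedule. The linear ramp below is a simplified stand-in for the released schedule; in practice the raw coordinates are also concatenated unmasked.

```python
import torch

def freq_mask(num_freqs, step, total_steps):
    """Per-frequency weights that ramp from low-frequency-only to all frequencies."""
    ratio = min(step / max(total_steps, 1), 1.0) * num_freqs
    return torch.clamp(ratio - torch.arange(num_freqs, dtype=torch.float32), 0.0, 1.0)

def masked_positional_encoding(x, num_freqs, step, total_steps):
    """x: (..., D) coordinates; returns (..., 2 * D * num_freqs) frequency-masked encoding."""
    freqs = 2.0 ** torch.arange(num_freqs, dtype=torch.float32)          # (L,)
    xb = x[..., None, :] * freqs[:, None]                                # (..., L, D)
    enc = torch.cat([torch.sin(xb), torch.cos(xb)], dim=-1)              # (..., L, 2D)
    mask = freq_mask(num_freqs, step, total_steps)                       # (L,)
    return (enc * mask[:, None]).flatten(-2)                             # zero out high frequencies
```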
2303.07274 Report Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images Nitzan Bitton-Guetta, Yonatan Bitton, Jack Hessel, Ludwig Schmidt, Yuval Elovici, Gabriel Stanovsky, Roy Schwartz Weird, unusual, and uncanny images pique the curiosity of observers because they challenge commonsense. For example, an image released during the 2022 world cup depicts the famous soccer stars Lionel Messi and Cristiano Ronaldo playing chess, which playfully violates our expectation that their competition should occur on the football field. Humans can easily recognize and interpret these unconventional images, but can AI models do the same? We introduce WHOOPS!, a new dataset and benchmark for visual commonsense. The dataset is comprised of purposefully commonsense-defying images created by designers using publicly-available image generation tools like Midjourney. We consider several tasks posed over the dataset. In addition to image captioning, cross-modal matching, and visual question answering, we introduce a difficult explanation generation task, where models must identify and explain why a given image is unusual. Our results show that state-of-the-art models such as GPT3 and BLIP2 still lag behind human performance on WHOOPS!. We hope our dataset will inspire the development of AI models with stronger visual commonsense reasoning abilities. Data, models and code are available at the project website: whoops-benchmark.github.io Introduced WHOOPS!, a novel dataset of 500 synthetic images designed to challenge AI models' ability to reason about commonsense and compositionality in vision-and-language tasks. Existing vision-and-language models struggle to demonstrate commonsense reasoning and often rely on superficial correlations. WHOOPS! provides a challenging benchmark to foster development in this area. Employed a human-in-the-loop approach, using designers and text-to-image models (e.g., Midjourney) to craft images that violate commonsense expectations. Collected annotations for four tasks: explanation generation, image captioning, cross-modal matching, and visual question answering. State-of-the-art models lag significantly behind human performance on all tasks, particularly in generating explanations for the unusual images. Analysis reveals that the challenge stems from the 'weirdness' of the images, not their synthetic nature. Developed an automatic evaluation metric for explanation generation using GPT4, achieving over 81% accuracy compared to human judgment. The dataset size, while sufficient for the current study, could be expanded to encompass a wider range of commonsense violations. Despite efforts to filter for potentially offensive content, some images might still be perceived as such and require further refinement. commonsense reasoning, vision and language, image captioning, visual question answering, explanation generation
2303.07216 Report Parallel Vertex Diffusion for Unified Visual Grounding Zesen Cheng, Kehan Li, Peng Jin, Xiangyang Ji, Li Yuan, Chang Liu, Jie Chen Unified visual grounding pursues a simple and generic technical route to leverage multi-task data with less task-specific design. The most advanced methods typically present boxes and masks as vertex sequences to model referring detection and segmentation as an autoregressive sequential vertex generation paradigm. However, generating high-dimensional vertex sequences sequentially is error-prone because the upstream of the sequence remains static and cannot be refined based on downstream vertex information, even if there is a significant location gap. Besides, with limited vertexes, the inferior fitting of objects with complex contours restricts the performance upper bound. To deal with this dilemma, we propose a parallel vertex generation paradigm for superior high-dimension scalability with a diffusion model by simply modifying the noise dimension. An intuitive materialization of our paradigm is Parallel Vertex Diffusion (PVD) to directly set vertex coordinates as the generation target and use a diffusion model to train and infer. We claim that it has two flaws: (1) unnormalized coordinate caused a high variance of loss value; (2) the original training objective of PVD only considers point consistency but ignores geometry consistency. To solve the first flaw, Center Anchor Mechanism (CAM) is designed to convert coordinates as normalized offset values to stabilize the training loss value. For the second flaw, Angle summation loss (ASL) is designed to constrain the geometry difference of prediction and ground truth vertexes for geometry-level consistency. Empirical results show that our PVD achieves state-of-the-art in both referring detection and segmentation, and our paradigm is more scalable and efficient than sequential vertex generation with high-dimension data. This paper proposes Parallel Vertex Diffusion (PVD), a novel paradigm for unified visual grounding that leverages a diffusion model to generate vertexes of bounding boxes and masks in parallel, overcoming limitations of sequential generation methods. Existing sequential vertex generation methods for unified visual grounding suffer from error accumulation and struggle to scale to high-dimensional data (complex object boundaries). This paper addresses these issues with a parallel approach using diffusion models, enabling more accurate and efficient grounding. The proposed PVD method utilizes a diffusion model with a specifically designed "Denoiser" network. To further enhance performance, a Center Anchor Mechanism (CAM) is introduced for coordinate normalization and an Angle Summation Loss (ASL) for ensuring geometry consistency. PVD achieves state-of-the-art results on benchmark datasets for both referring expression comprehension (REC) and referring image segmentation (RIS). PVD demonstrates superior scalability compared to sequential methods, exhibiting improved performance and efficiency with an increasing number of vertexes. Quantitative analysis highlights the effectiveness of CAM and ASL in stabilizing training and enhancing geometry consistency. The current implementation of PVD is limited to generating a fixed number of vertexes. Adaptively determining the optimal number based on object complexity could be explored. Investigating more sophisticated geometry constraints beyond angle summation could further improve performance, particularly for highly irregular objects. visual grounding, referring expression comprehension, referring image segmentation, diffusion models, parallel vertex generation
2303.06994 Report Synthesizing Realistic Image Restoration Training Pairs: A Diffusion Approach Tao Yang, Peiran Ren, Xuansong Xie, Lei Zhang In supervised image restoration tasks, one key issue is how to obtain the aligned high-quality (HQ) and low-quality (LQ) training image pairs. Unfortunately, such HQ-LQ training pairs are hard to capture in practice, and hard to synthesize due to the complex unknown degradation in the wild. While several sophisticated degradation models have been manually designed to synthesize LQ images from their HQ counterparts, the distribution gap between the synthesized and real-world LQ images remains large. We propose a new approach to synthesizing realistic image restoration training pairs using the emerging denoising diffusion probabilistic model (DDPM). First, we train a DDPM, which could convert a noisy input into the desired LQ image, with a large amount of collected LQ images, which define the target data distribution. Then, for a given HQ image, we synthesize an initial LQ image by using an off-the-shelf degradation model, and iteratively add proper Gaussian noise to it. Finally, we denoise the noisy LQ image using the pre-trained DDPM to obtain the final LQ image, which falls into the target distribution of real-world LQ images. Thanks to the strong capability of DDPM in distribution approximation, the synthesized HQ-LQ image pairs can be used to train robust models for real-world image restoration tasks, such as blind face image restoration and blind image super-resolution. Experiments demonstrated the superiority of our proposed approach to existing degradation models. Code and data will be released. This paper presents a novel diffusion-based approach for synthesizing realistic image restoration training pairs, aiming to bridge the distribution gap between synthetic and real-world low-quality images. Acquiring aligned high-quality and low-quality image pairs for training supervised image restoration models is challenging. Existing degradation models often fail to capture the complexities of real-world degradations, leading to limited performance on real images. This diffusion-based approach generates more realistic training pairs, potentially improving the robustness of trained models. The proposed method first trains a denoising diffusion probabilistic model (DDPM) using a large dataset of real-world low-quality images. To synthesize a training pair, an initial low-quality image is generated from a high-quality image using an off-the-shelf degradation model. This initial image is then iteratively denoised using the pre-trained DDPM, guiding it towards the target distribution of real-world degradations. Synthesized image pairs using the proposed method achieve lower FID and higher PSNR/SSIM compared to pairs generated using only handcrafted degradation models, indicating closer distribution to real-world data and better structural preservation. Blind face restoration models trained on the proposed pairs demonstrate superior performance on both synthetic and real-world images, achieving better FID/LPIPS/PSNR/SSIM and improved visual quality with finer details, as evidenced by quantitative metrics and a user study. Blind image super-resolution models trained on the proposed pairs exhibit an enhanced ability to handle complex real-world degradations, generating higher quality reconstructions with fewer artifacts and better detail preservation, outperforming existing methods in FID/LPIPS/PSNR/SSIM and user preference. The quality of synthesized pairs is influenced by the initial handcrafted degradation model and the number of diffusion steps, requiring careful parameter selection. Collecting a diverse and representative LQ image dataset is crucial for training an effective DDPM, which can be time-consuming and laborious. Future work could explore unsupervised or semi-supervised methods to alleviate this reliance on large labeled datasets. image restoration, denoising diffusion probabilistic model (ddpm), degradation modeling, blind face restoration, blind image super-resolution
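The degrade-noise-denoise loop summarized above lends itself to a compact sketch. The snippet below is a minimal illustration, not the authors' released code: the TinyDenoiser, the linear beta schedule, and the starting timestep are placeholder assumptions standing in for the paper's DDPM pretrained on collected real-world LQ images.

```python
# Minimal sketch of the "degrade -> add noise -> denoise with a pretrained DDPM" pipeline.
# TinyDenoiser is a stand-in for the paper's DDPM trained on real-world LQ images.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

class TinyDenoiser(nn.Module):
    """Stand-in noise predictor eps_theta(x_t, t); the real model would be a U-Net DDPM."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, 3, 3, padding=1))
    def forward(self, x, t):
        return self.net(x)

@torch.no_grad()
def synthesize_lq(hq, degrade_fn, denoiser, t_start=250):
    # 1) initial LQ image from a handcrafted degradation model
    x0 = degrade_fn(hq)
    # 2) diffuse it to timestep t_start so the synthetic-vs-real gap is buried in Gaussian noise
    eps = torch.randn_like(x0)
    a_bar = alphas_cumprod[t_start]
    x = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
    # 3) denoise back with the (pretrained) DDPM, landing in the real-LQ distribution
    for t in range(t_start, -1, -1):
        eps_hat = denoiser(x, t)
        coef = betas[t] / (1 - alphas_cumprod[t]).sqrt()
        mean = (x - coef * eps_hat) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise
    return x  # synthetic LQ image still aligned with the HQ input

hq = torch.rand(1, 3, 64, 64)
lq = synthesize_lq(hq,
                   degrade_fn=lambda im: torch.clamp(im + 0.1 * torch.randn_like(im), 0, 1),
                   denoiser=TinyDenoiser())
print(lq.shape)
```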
2303.06930 Report Twin Contrastive Learning with Noisy Labels Zhizhong Huang, Junping Zhang, Hongming Shan Learning from noisy data is a challenging task that significantly degenerates the model performance. In this paper, we present TCL, a novel twin contrastive learning model to learn robust representations and handle noisy labels for classification. Specifically, we construct a Gaussian mixture model (GMM) over the representations by injecting the supervised model predictions into GMM to link label-free latent variables in GMM with label-noisy annotations. Then, TCL detects the examples with wrong labels as the out-of-distribution examples by another two-component GMM, taking into account the data distribution. We further propose a cross-supervision with an entropy regularization loss that bootstraps the true targets from model predictions to handle the noisy labels. As a result, TCL can learn discriminative representations aligned with estimated labels through mixup and contrastive learning. Extensive experimental results on several standard benchmarks and real-world datasets demonstrate the superior performance of TCL. In particular, TCL achieves 7.5% improvements on CIFAR-10 with 90% noisy label, an extremely noisy scenario. The source code is available at https://github.com/Hzzone/TCL. This paper proposes TCL, a twin contrastive learning model, to learn robust representations and handle noisy labels for image classification. Learning with noisy labels is a crucial problem as mislabeled data is prevalent and can significantly degrade model performance. TCL leverages a Gaussian mixture model (GMM) over contrastive learning representations, linking label-free latent variables with noisy annotations. It then detects mislabeled samples as out-of-distribution examples using another two-component GMM and utilizes cross-supervision with entropy regularization to estimate true labels. TCL demonstrates superior performance on CIFAR-10/100 with various noise ratios, especially achieving 7.5% improvement on CIFAR-10 with 90% noise. The proposed out-of-distribution label noise detection method proves effective in handling extremely noisy scenarios. TCL outperforms state-of-the-art methods on real-world noisy datasets like WebVision and Clothing1M. The assumption of uniform label distribution might not hold for all datasets. Future work includes incorporating semantic information for low noise ratios and exploring dynamic GMM updates. noisy labels, contrastive learning, out-of-distribution detection, cross-supervision, robust representation learning
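A small sketch of the label-noise detection step may help. TCL fits its mixture models over learned contrastive representations; the snippet below illustrates only the generic two-component GMM idea of splitting clean from mislabeled samples by a per-sample score (a synthetic 1-D loss here), which is a simplifying assumption rather than the paper's exact formulation.

```python
# Detecting likely mislabeled examples with a two-component Gaussian mixture over a
# per-sample score. Real methods fit the GMM over representations or training losses;
# here the scores are synthetic so the snippet runs standalone.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# synthetic per-sample losses: clean samples cluster low, mislabeled samples cluster high
losses = np.concatenate([rng.normal(0.3, 0.1, 900), rng.normal(2.0, 0.5, 100)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(losses)
clean_comp = int(np.argmin(gmm.means_.ravel()))        # component with the smaller mean = "clean"
p_clean = gmm.predict_proba(losses)[:, clean_comp]     # posterior probability of being clean
is_noisy = p_clean < 0.5

print(f"flagged {is_noisy.sum()} / {len(losses)} samples as likely mislabeled")
```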
2303.06919 Report NeRFLiX: High-Quality Neural View Synthesis by Learning a Degradation-Driven Inter-viewpoint MiXer Kun Zhou, Wenbo Li, Yi Wang, Tao Hu, Nianjuan Jiang, Xiaoguang Han, Jiangbo Lu Neural radiance fields (NeRF) show great success in novel view synthesis. However, in real-world scenes, recovering high-quality details from the source images is still challenging for the existing NeRF-based approaches, due to the potential imperfect calibration information and scene representation inaccuracy. Even with high-quality training frames, the synthetic novel views produced by NeRF models still suffer from notable rendering artifacts, such as noise and blur. To improve the synthesis quality of NeRF-based approaches, we propose NeRFLiX, a general NeRF-agnostic restorer paradigm that learns a degradation-driven inter-viewpoint mixer. Specifically, we design a NeRF-style degradation modeling approach and construct large-scale training data, enabling the possibility of effectively removing NeRF-native rendering artifacts for existing deep neural networks. Moreover, beyond the degradation removal, we propose an inter-viewpoint aggregation framework that is able to fuse highly related high-quality training images, pushing the performance of cutting-edge NeRF models to entirely new levels and producing highly photo-realistic synthetic views. This paper proposes NeRFLiX, a general-purpose NeRF-agnostic restoration method for improving the quality of neural view synthesis by learning a degradation-driven inter-viewpoint mixer. Existing NeRF models often produce synthetic views with notable artifacts due to imperfect camera calibration, scene representation inaccuracy, and other limitations. NeRFLiX addresses this issue by learning to remove these artifacts and enhance the quality of NeRF-rendered images. The authors introduce a NeRF-style degradation simulator (NDS) to generate a large-scale paired dataset of degraded and high-quality views. This dataset is used to train an inter-viewpoint mixer (IVM) that learns to restore a high-quality view by aggregating information from multiple neighboring high-quality reference views. A view selection strategy is also proposed to efficiently choose the most relevant reference views. NeRFLiX consistently improves the performance of various state-of-the-art NeRF models on different datasets, including LLFF, Tanks and Temples, and Noisy LLFF Synthetic. The proposed NDS effectively simulates NeRF-style degradations, outperforming existing image degradation methods. NeRFLiX enables training acceleration for NeRF models, achieving better results with reduced training time. The proposed NDS is one of many possible solutions for NeRF degradation simulation and can be further explored. Exploring real-time inter-viewpoint mixers would be beneficial for practical applications. neural radiance fields (nerf), novel view synthesis, image restoration, degradation simulation, inter-viewpoint aggregation
2303.06885 Report DR2: Diffusion-based Robust Degradation Remover for Blind Face Restoration Zhixin Wang, Xiaoyun Zhang, Ziying Zhang, Huangjie Zheng, Mingyuan Zhou, Ya Zhang, Yanfeng Wang Blind face restoration usually synthesizes degraded low-quality data with a pre-defined degradation model for training, while more complex cases could happen in the real world. This gap between the assumed and actual degradation hurts the restoration performance where artifacts are often observed in the output. However, it is expensive and infeasible to include every type of degradation to cover real-world cases in the training data. To tackle this robustness issue, we propose Diffusion-based Robust Degradation Remover (DR2) to first transform the degraded image to a coarse but degradation-invariant prediction, then employ an enhancement module to restore the coarse prediction to a high-quality image. By leveraging a well-performing denoising diffusion probabilistic model, our DR2 diffuses input images to a noisy status where various types of degradation give way to Gaussian noise, and then captures semantic information through iterative denoising steps. As a result, DR2 is robust against common degradation (e.g. blur, resize, noise and compression) and compatible with different designs of enhancement modules. Experiments in various settings show that our framework outperforms state-of-the-art methods on heavily degraded synthetic and real-world datasets. This paper introduces DR2E, a two-stage blind face restoration framework that first removes degradation from inputs using a pre-trained diffusion model and then enhances the coarse output for high-quality restoration. Existing blind face restoration methods struggle with real-world degraded images due to the reliance on pre-defined degradation models during training, leading to artifacts in the output. DR2E consists of a Diffusion-based Robust Degradation Remover (DR2) and an Enhancement module. DR2 leverages a pre-trained denoising diffusion probabilistic model (DDPM) to transform degraded images into coarse but degradation-invariant predictions by diffusing them into a noisy status where degradation becomes similar to Gaussian noise. The Enhancement module then refines the coarse prediction into a high-quality image. DR2E demonstrates robustness against various degradation types like blur, resize, noise, and compression. The framework is flexible and compatible with different Enhancement module designs, allowing for incorporating various restoration methods. Experiments on synthetic and real-world datasets show that DR2E outperforms state-of-the-art methods, particularly on heavily degraded images. The sampling process of DR2, relying on a DDPM, can be slow. Choosing the optimal controlling parameters for DR2 currently requires manual tuning. blind face restoration, denoising diffusion probabilistic model, degradation removal, robust image restoration, deep learning
2303.06880 Report Uni3D: A Unified Baseline for Multi-dataset 3D Object Detection Bo Zhang, Jiakang Yuan, Botian Shi, Tao Chen, Yikang Li, Yu Qiao Current 3D object detection models follow a single dataset-specific training and testing paradigm, which often leads to a serious detection accuracy drop when the models are directly deployed on another dataset. In this paper, we study the task of training a unified 3D detector from multiple datasets. We observe that this appears to be a challenging task, mainly because these datasets present substantial data-level differences and taxonomy-level variations caused by different LiDAR types and data acquisition standards. Inspired by such observation, we present Uni3D, which leverages a simple data-level correction operation and a designed semantic-level coupling-and-recoupling module to alleviate the unavoidable data-level and taxonomy-level differences, respectively. Our method is simple and easily combined with many 3D object detection baselines such as PV-RCNN and Voxel-RCNN, enabling them to effectively learn from multiple off-the-shelf 3D datasets to obtain more discriminative and generalizable representations. Experiments are conducted on many dataset consolidation settings including Waymo-nuScenes, nuScenes-KITTI, Waymo-KITTI, and Waymo-nuScenes-KITTI consolidations. Their results demonstrate that Uni3D exceeds a series of individual detectors trained on a single dataset, with a 1.04x parameter increase over a selected baseline detector. We expect this work will inspire research on 3D generalization, since it pushes the limits of perceptual performance. This paper proposes Uni3D, a unified 3D object detection framework trained on multiple datasets to address the accuracy drop when single-dataset models are tested on different datasets (the dataset-interference issue). Current 3D object detection models are trained and evaluated on single datasets, leading to significant accuracy drops when deployed on datasets with different distributions, hindering generalization. Uni3D uses a data-level correction operation to normalize features based on dataset-specific mean and variance. It also employs a semantic-level coupling-and-recoupling module to learn dataset-agnostic features using spatial-wise and dataset-level attention. Finally, it uses dataset-specific detection heads for prediction. Uni3D significantly improves cross-dataset detection accuracy compared to single-dataset training or pre-training. The data-level correction and semantic-level coupling-and-recoupling modules are shown to be effective in addressing data-level and taxonomy-level differences between datasets. Uni3D enhances the zero-shot learning ability of the baseline detector, making it more robust to unseen scenes. The parameter sharing of coordinate-origin shift across different classes may be suboptimal and needs further exploration. The BEV feature copy method, while ensuring training-and-testing consistency, is not the optimal solution for addressing the inconsistency between multi-dataset training and single-dataset inference. Further research is needed to explore better fusion strategies for BEV features from different datasets. 3d object detection, multi-dataset learning, domain generalization, lidar point cloud, autonomous driving
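The data-level correction is the easiest piece to picture in code. The sketch below shows one simplified interpretation, keeping dataset-specific normalization statistics so features from different LiDAR setups are whitened per source; the paper's exact correction operation and the feature level it acts on may differ.

```python
# Simplified sketch of dataset-specific statistics correction for multi-dataset training:
# each source dataset keeps its own running mean/variance, so features are normalized per
# source before the shared detector head sees them.
import torch
import torch.nn as nn

class DatasetLevelNorm(nn.Module):
    def __init__(self, num_datasets: int, channels: int):
        super().__init__()
        # one BatchNorm per source dataset -> dataset-specific running statistics
        self.norms = nn.ModuleList([nn.BatchNorm2d(channels) for _ in range(num_datasets)])

    def forward(self, feats: torch.Tensor, dataset_id: int) -> torch.Tensor:
        return self.norms[dataset_id](feats)

norm = DatasetLevelNorm(num_datasets=3, channels=64)
waymo_feats = torch.randn(2, 64, 128, 128)
nusc_feats = torch.randn(2, 64, 128, 128)
out_w = norm(waymo_feats, dataset_id=0)   # normalized with Waymo statistics
out_n = norm(nusc_feats, dataset_id=1)    # normalized with nuScenes statistics
print(out_w.shape, out_n.shape)
```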
2303.06840 Report DDFM: Denoising Diffusion Model for Multi-Modality Image Fusion Zixiang Zhao, Haowen Bai, Yuanzhi Zhu, Jiangshe Zhang, Shuang Xu, Yulun Zhang, Kai Zhang, Deyu Meng, Radu Timofte, Luc Van Gool Multi-modality image fusion aims to combine different modalities to produce fused images that retain the complementary features of each modality, such as functional highlights and texture details. To leverage strong generative priors and address challenges such as unstable training and lack of interpretability for GAN-based generative methods, we propose a novel fusion algorithm based on the denoising diffusion probabilistic model (DDPM). The fusion task is formulated as a conditional generation problem under the DDPM sampling framework, which is further divided into an unconditional generation subproblem and a maximum likelihood subproblem. The latter is modeled in a hierarchical Bayesian manner with latent variables and inferred by the expectation-maximization (EM) algorithm. By integrating the inference solution into the diffusion sampling iteration, our method can generate high-quality fused images with natural image generative priors and cross-modality information from source images. Note that all we required is an unconditional pre-trained generative model, and no fine-tuning is needed. Our extensive experiments indicate that our approach yields promising fusion results in infrared-visible image fusion and medical image fusion. The code is available at https://github.com/Zhaozixiang1228/MMIF-DDFM. This paper proposes DDFM, a novel multi-modality image fusion algorithm based on denoising diffusion probabilistic models (DDPM). Existing GAN-based image fusion methods suffer from unstable training and lack interpretability. DDFM leverages the strong generative priors of DDPM for high-quality fusion while addressing the limitations of GAN-based methods. DDFM formulates image fusion as a conditional generation problem within the DDPM sampling framework. It decomposes the problem into an unconditional generation part handled by a pre-trained DDPM and a likelihood rectification part. The latter utilizes a hierarchical Bayesian model with latent variables and is inferred by the Expectation-Maximization (EM) algorithm. The solution is then integrated into the DDPM loop for conditional image generation. DDFM effectively preserves structural and detail information from source images in both infrared-visible and medical image fusion tasks. DDFM consistently outperforms state-of-the-art methods on various datasets based on quantitative metrics including EN, SD, MI, VIF, Qabf, and SSIM. Ablation studies validate the contribution of individual components in DDFM, including the DDPM module and EM module. The current implementation of DDFM relies on a pre-trained DDPM, which might limit its performance on specific datasets. Future work could explore incorporating task-specific information during training for improved fusion results. image fusion, denoising diffusion probabilistic model, generative model, multi-modality, likelihood rectification
2303.06705 Report Retinexformer: One-stage Retinex-based Transformer for Low-light Image Enhancement Yuanhao Cai, Hao Bian, Jing Lin, Haoqian Wang, Radu Timofte, Yulun Zhang When enhancing low-light images, many deep learning algorithms are based on the Retinex theory. However, the Retinex model does not consider the corruptions hidden in the dark or introduced by the light-up process. Besides, these methods usually require a tedious multi-stage training pipeline and rely on convolutional neural networks, showing limitations in capturing long-range dependencies. In this paper, we formulate a simple yet principled One-stage Retinex-based Framework (ORF). ORF first estimates the illumination information to light up the low-light image and then restores the corruption to produce the enhanced image. We design an Illumination-Guided Transformer (IGT) that utilizes illumination representations to direct the modeling of non-local interactions of regions with different lighting conditions. By plugging IGT into ORF, we obtain our algorithm, Retinexformer. Comprehensive quantitative and qualitative experiments demonstrate that our Retinexformer significantly outperforms state-of-the-art methods on thirteen benchmarks. The user study and application on low-light object detection also reveal the latent practical values of our method. Code, models, and results are available at https://github.com/caiyuanhao1998/Retinexformer This paper proposes Retinexformer, the first Transformer-based algorithm for low-light image enhancement. Many deep learning methods for low-light image enhancement rely on the Retinex theory but ignore corruptions or require multi-stage training. Existing methods also struggle to capture long-range dependencies. This work formulates the One-stage Retinex-based Framework (ORF) and designs an Illumination-Guided Transformer (IGT). ORF estimates illumination and restores corruption in a single stage. IGT, plugged into ORF as the corruption restorer, uses illumination representations to guide long-range dependency modeling. Retinexformer significantly outperforms state-of-the-art methods on thirteen benchmarks, achieving up to 6dB improvement on SID and SDSD datasets. User study confirms Retinexformer's superior visual quality compared to competing algorithms. Retinexformer effectively preprocesses low-light images for object detection, improving average precision by 0.8 AP compared to the best fully supervised method. The model's performance on extremely dark images could be further improved. Exploring more efficient self-attention mechanisms for greater computational efficiency is a promising direction for future work. low-light image enhancement, retinex theory, transformer, illumination-guided attention, one-stage framework
2303.06678 Report PointPatchMix: Point Cloud Mixing with Patch Scoring Yi Wang, Jiaze Wang, Jinpeng Li, Zixu Zhao, Guangyong Chen, Anfeng Liu, Pheng-Ann Heng Data augmentation is an effective regularization strategy for mitigating overfitting in deep neural networks, and it plays a crucial role in 3D vision tasks, where the point cloud data is relatively limited. While mixing-based augmentation has shown promise for point clouds, previous methods mix point clouds either on the block level or the point level, which has constrained their ability to strike a balance between generating diverse training samples and preserving the local characteristics of point clouds. Additionally, the varying importance of each part of the point clouds has not been fully considered, because not all parts contribute equally to the classification task, and some parts may contain unimportant or redundant information. To overcome these challenges, we propose PointPatchMix, a novel approach that mixes point clouds at the patch level and integrates a patch scoring module to generate content-based targets for mixed point clouds. Our approach preserves local features at the patch level, while the patch scoring module assigns targets based on the content-based significance score from a pre-trained teacher model. We evaluate PointPatchMix on two benchmark datasets, ModelNet40 and ScanObjectNN, and demonstrate significant improvements over various baselines in both synthetic and real-world datasets, as well as few-shot settings. With Point-MAE as our baseline, our model surpasses previous methods by a significant margin, achieving 86.3% accuracy on ScanObjectNN and 94.1% accuracy on ModelNet40. Furthermore, our approach shows strong generalization across multiple architectures and enhances the robustness of the baseline model. This paper proposes PointPatchMix, a novel point cloud data augmentation method based on patch-level mixing and content-based target generation using a pre-trained teacher model. Data augmentation is crucial for point cloud processing due to limited data availability, and existing methods struggle to balance diversity and local feature preservation while assigning accurate targets to mixed point clouds. PointPatchMix divides point clouds into patches, mixes them using an optimal assignment algorithm based on Earth Mover's Distance (EMD), and assigns content-based targets using patch significance scores derived from a pre-trained teacher model. PointPatchMix significantly outperforms state-of-the-art methods on both synthetic (ModelNet40) and real-world (ScanObjectNN) point cloud classification datasets. The method demonstrates strong generalization ability across various network architectures (PointNet, PointNet++, Transformer) and improves performance in few-shot learning settings. Ablation studies confirm the effectiveness of patch-level mixing, content-based target generation, and the choice of optimal patch assignment strategy. The current study primarily focuses on point cloud classification, and future work could explore its application to other domains like segmentation. Investigating the computational cost and efficiency of PointPatchMix, particularly in resource-constrained environments, could be beneficial. point cloud, data augmentation, pointpatchmix, classification, transformer
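The content-based target assignment can be illustrated with a toy example. In the sketch below, the label weight contributed by each source cloud is the summed significance of the patches it keeps in the mix; the EMD-based patch assignment and the actual pretrained teacher are omitted and replaced by random scores, so treat this as an assumption-laden illustration only.

```python
# Toy sketch of content-based soft targets for patch-level mixing: each source's label weight
# equals the summed (teacher-assigned) significance of the patches it contributes to the mix.
import numpy as np

num_patches, num_classes = 8, 5
rng = np.random.default_rng(0)

# significance scores per patch from a pretrained teacher, normalized to sum to 1 per cloud
score_a = rng.random(num_patches); score_a /= score_a.sum()
score_b = rng.random(num_patches); score_b /= score_b.sum()

keep_from_a = rng.random(num_patches) < 0.5     # patch-wise mixing mask
label_a, label_b = 2, 4

weight_a = score_a[keep_from_a].sum()           # content from A that survives the mix
weight_b = score_b[~keep_from_a].sum()          # content from B filling the remaining patches
target = np.zeros(num_classes)
target[label_a] += weight_a / (weight_a + weight_b)
target[label_b] += weight_b / (weight_a + weight_b)
print("mixed soft target:", target.round(3))
```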
2303.06628 Report Preventing Zero-Shot Transfer Degradation in Continual Learning of Vision-Language Models Zangwei Zheng, Mingyuan Ma, Kai Wang, Ziheng Qin, Xiangyu Yue, Yang You Continual learning (CL) can help pre-trained vision-language models efficiently adapt to new or under-trained data distributions without re-training. Nevertheless, during the continual training of the Contrastive Language-Image Pre-training (CLIP) model, we observe that the model's zero-shot transfer ability significantly degrades due to catastrophic forgetting. Existing CL methods can mitigate forgetting by replaying previous data. However, since the CLIP dataset is private, replay methods cannot access the pre-training dataset. In addition, replaying data of previously learned downstream tasks can enhance their performance but comes at the cost of sacrificing zero-shot performance. To address this challenge, we propose a novel method ZSCL to prevent zero-shot transfer degradation in the continual learning of vision-language models in both feature and parameter space. In the feature space, a reference dataset is introduced for distillation between the current and initial models. The reference dataset should have semantic diversity but no need to be labeled, seen in pre-training, or matched image-text pairs. In parameter space, we prevent a large parameter shift by averaging weights during the training. We propose a more challenging Multi-domain Task Incremental Learning (MTIL) benchmark to evaluate different methods, where tasks are from various domains instead of class-separated in a single dataset. Our method outperforms other methods in the traditional class-incremental learning setting and the MTIL by 9.7% average score. Our code locates at https://github.com/Thunderbeee/ZSCL. This paper investigates and addresses the problem of zero-shot transfer degradation in Continual Learning (CL) of vision-language models, particularly the Contrastive Language-Image Pre-training (CLIP) model. Continual learning is crucial for efficiently adapting pre-trained vision-language models to new data distributions without costly retraining. However, current methods suffer from catastrophic forgetting, leading to degraded zero-shot transfer ability. The paper introduces ZSCL, a novel method that prevents zero-shot transfer degradation in both feature and parameter space. It employs distillation with a reference dataset for feature space preservation and weight ensemble during training for parameter space regularization. ZSCL effectively prevents zero-shot transfer degradation in CLIP, maintaining high performance on both previously learned and new tasks. The use of a reference dataset with diverse semantics for distillation proves crucial for preserving the feature space learned during pre-training. ZSCL consistently outperforms existing CL methods on both conventional class-incremental learning benchmarks and the proposed Multi-domain Task Incremental Learning (MTIL) benchmark. ZSCL currently relies on a reference dataset, and future work could explore methods to remove this dependency. The authors plan to expand ZSCL for next-token prediction tasks in multi-modality models utilizing large language models. continual learning, vision-language models, zero-shot transfer, catastrophic forgetting, clip
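The parameter-space part of ZSCL (averaging weights during training) is straightforward to sketch; the feature-space distillation against the frozen initial model on a reference dataset is not shown here. The linear layer below is a stand-in for the trainable CLIP component, so this is an illustration of the running weight ensemble only, not the full method.

```python
# Sketch of parameter-space regularization: keep a running average of the weights visited
# during fine-tuning so the model never drifts far from its pretrained solution.
import copy
import torch
import torch.nn as nn

model = nn.Linear(512, 10)                 # stand-in for the trainable part of CLIP
avg_model = copy.deepcopy(model)           # running weight ensemble, initialized at theta_0
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(1, 101):
    x, y = torch.randn(32, 512), torch.randint(0, 10, (32,))
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()

    # theta_avg <- (step * theta_avg + theta_step) / (step + 1)
    with torch.no_grad():
        for p_avg, p in zip(avg_model.parameters(), model.parameters()):
            p_avg.mul_(step / (step + 1)).add_(p, alpha=1 / (step + 1))

# avg_model is the weight-ensembled model one would evaluate with
print(avg_model(torch.randn(1, 512)).shape)
```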
2303.06547 Report Towards Universal Vision-language Omni-supervised Segmentation Bowen Dong, Jiaxi Gu, Jianhua Han, Hang Xu, Wangmeng Zuo Existing open-world universal segmentation approaches usually leverage CLIP and pre-computed proposal masks to treat open-world segmentation tasks as proposal classification. However, 1) these works cannot handle universal segmentation in an end-to-end manner, and 2) the limited scale of panoptic datasets restricts the open-world segmentation ability on things classes. In this paper, we present Vision-Language Omni-Supervised Segmentation (VLOSS). VLOSS starts from a Mask2Former universal segmentation framework with CLIP text encoder. To improve the open-world segmentation ability, we leverage omni-supervised data (i.e., panoptic segmentation data, object detection data, and image-text pairs data) into training, thus enriching the open-world segmentation ability and achieving better segmentation accuracy. To better improve the training efficiency and fully release the power of omni-supervised data, we propose several advanced techniques, i.e., FPN-style encoder, switchable training technique, and positive classification loss. Benefiting from the end-to-end training manner with proposed techniques, VLOSS can be applied to various open-world segmentation tasks without further adaptation. Experimental results on different open-world panoptic and instance segmentation benchmarks demonstrate the effectiveness of VLOSS. Notably, with fewer parameters, our VLOSS with Swin-Tiny backbone surpasses MaskCLIP by ~2% in terms of mask AP on LVIS v1 dataset. This paper proposes VLOSS, a universal open-world segmentation framework that leverages omni-supervised data (panoptic, detection, image-text pairs) and CLIP for enhanced recognition. Existing open-world segmentation methods are limited by end-to-end training capabilities and restricted open-world recognition due to dataset limitations. VLOSS utilizes a Mask2Former base with CLIP, trained on a mix of panoptic, detection, and image-text data. It introduces a FPN-style encoder, switchable training technique, and positive classification loss to improve training efficiency and leverage diverse annotations. VLOSS achieves comparable results to state-of-the-art MaskCLIP on ADE20K panoptic segmentation with fewer parameters. On LVIS v1, VLOSS with Swin-Tiny backbone surpasses MaskCLIP by ~2% in mask AP. Qualitative results showcase VLOSS's capability to segment and recognize both seen and unseen things and stuff classes. The current method underutilizes regions not present in annotations for weakly-supervised datasets. The work doesn't incorporate visual grounding datasets, which could further improve open-world recognition. open-world segmentation, universal segmentation, vision-language models, omni-supervised learning, clip
2303.06464 Report PARASOL: Parametric Style Control for Diffusion Image Synthesis Gemma Canet Tarrés, Dan Ruta, Tu Bui, John Collomosse We propose PARASOL, a multi-modal synthesis model that enables disentangled, parametric control of the visual style of the image by jointly conditioning synthesis on both content and a fine-grained visual style embedding. We train a latent diffusion model (LDM) using specific losses for each modality and adapt the classifier-free guidance for encouraging disentangled control over independent content and style modalities at inference time. We leverage auxiliary semantic and style-based search to create training triplets for supervision of the LDM, ensuring complementarity of content and style cues. PARASOL shows promise for enabling nuanced control over visual style in diffusion models for image creation and stylization, as well as generative search where text-based search results may be adapted to more closely match user intent by interpolating both content and style descriptors. PARASOL is a multi-modal synthesis model that enables disentangled, parametric control of visual style in images, jointly conditioning synthesis on both content and fine-grained visual style embedding. Current deep generative models lack fine-grained control over visual style, often limited by coarse-grained inputs like text descriptions or struggle to disentangle content from style information. PARASOL leverages a latent diffusion model (LDM) trained with specific losses for content and style, employing classifier-free guidance for disentangled control at inference. It uses auxiliary semantic and style-based search to create training triplets, ensuring complementarity of content and style cues. PARASOL achieves superior performance in transferring specific styles compared to existing multi-modal and style transfer models. The model enables fine-grained control over the degree of style transfer and content preservation via parameters like 'lambda' for inversion and 'g_s', 'g_y' for classifier-free guidance. PARASOL supports style and content interpolation, enabling the creation of novel images by combining different styles and semantic concepts. Challenges remain in disentangling style from content for certain ambiguous styles. Addressing challenging content like faces often requires additional specialized training. image synthesis, style control, diffusion models, multi-modal learning, generative search
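One way to picture the disentangled control is classifier-free guidance with separate scales for the content and style conditionings, as sketched below. The `guided_eps` function and the toy noise predictor are hypothetical stand-ins, and the exact guidance formulation in the paper may combine the terms differently.

```python
# Sketch of classifier-free guidance with separate content and style scales: the final noise
# prediction mixes an unconditional pass with passes conditioned on each modality.
import torch

def guided_eps(eps_model, x_t, t, content, style, g_content=3.0, g_style=2.0):
    eps_uncond = eps_model(x_t, t, content=None, style=None)
    eps_c = eps_model(x_t, t, content=content, style=None)
    eps_s = eps_model(x_t, t, content=None, style=style)
    # each scale independently strengthens its own conditioning signal
    return eps_uncond + g_content * (eps_c - eps_uncond) + g_style * (eps_s - eps_uncond)

# toy noise predictor so the sketch runs end to end (real models use the LDM's U-Net)
def eps_model(x_t, t, content=None, style=None):
    out = x_t * 0.1
    if content is not None:
        out = out + 0.01 * content.mean()
    if style is not None:
        out = out + 0.01 * style.mean()
    return out

x_t = torch.randn(1, 4, 32, 32)
eps = guided_eps(eps_model, x_t, t=500, content=torch.randn(1, 768), style=torch.randn(1, 512))
print(eps.shape)
```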
2303.06424 Report Regularized Vector Quantization for Tokenized Image Synthesis Jiahui Zhang, Fangneng Zhan, Christian Theobalt, Shijian Lu Quantizing images into discrete representations has been a fundamental problem in unified generative modeling. Predominant approaches learn the discrete representation either in a deterministic manner by selecting the best-matching token or in a stochastic manner by sampling from a predicted distribution. However, deterministic quantization suffers from severe codebook collapse and misalignment with inference stage while stochastic quantization suffers from low codebook utilization and perturbed reconstruction objective. This paper presents a regularized vector quantization framework that allows to mitigate above issues effectively by applying regularization from two perspectives. The first is a prior distribution regularization which measures the discrepancy between a prior token distribution and the predicted token distribution to avoid codebook collapse and low codebook utilization. The second is a stochastic mask regularization that introduces stochasticity during quantization to strike a good balance between inference stage misalignment and unperturbed reconstruction objective. In addition, we design a probabilistic contrastive loss which serves as a calibrated metric to further mitigate the perturbed reconstruction objective. Extensive experiments show that the proposed quantization framework outperforms prevailing vector quantization methods consistently across different generative models including auto-regressive models and diffusion models. This paper presents a regularized vector quantization framework for tokenized image synthesis that addresses limitations of existing deterministic and stochastic quantization methods, such as codebook collapse, low codebook utilization, and perturbed reconstruction objectives. Quantizing images into discrete representations is crucial for unified generative modeling. Existing methods struggle to balance accurate representation learning, efficient codebook usage, and high-fidelity image generation. The proposed framework employs a prior distribution regularization to encourage full codebook utilization and prevent collapse. It also introduces a stochastic mask regularization to balance deterministic and stochastic quantization, mitigating misalignment during inference. Finally, a probabilistic contrastive loss is designed for elastic image reconstruction, adapting to the varying discrepancies caused by stochastic sampling. The regularized quantization consistently outperforms existing methods in image reconstruction and generation quality across diverse datasets and generative models (auto-regressive and diffusion). Prior distribution regularization and stochastic mask regularization are shown to effectively mitigate codebook collapse and inference stage misalignment respectively. The probabilistic contrastive loss improves image reconstruction and generation quality by enabling elastic image reconstruction and adapting to the perturbations caused by stochastic sampling. The current method employs the same learning objective for the encoder and decoder, which may not be optimal for both accurate representation and realistic image generation. Exploring different prior distributions, such as Gaussian, for potential performance improvement. vector quantization, image synthesis, generative modeling, discrete representation learning, contrastive learning
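A plausible reading of the prior distribution regularization is a divergence term between a uniform prior over codebook entries and the batch-averaged predicted token distribution, which is what the toy snippet below computes; the paper's actual regularizer may be formulated differently.

```python
# Toy prior-distribution regularizer for vector quantization: penalize the divergence between
# a uniform prior over codebook entries and the batch-averaged predicted token distribution,
# which discourages codebook collapse and low codebook utilization.
import torch
import torch.nn.functional as F

codebook_size = 512
logits = torch.randn(4096, codebook_size)               # per-token logits over codebook entries

token_probs = logits.softmax(dim=-1).mean(dim=0)         # average codebook usage in the batch
uniform = torch.full_like(token_probs, 1.0 / codebook_size)

# KL(uniform || batch usage): large when only a few codes are ever selected
prior_reg = F.kl_div(token_probs.clamp_min(1e-8).log(), uniform, reduction="sum")
print(float(prior_reg))
```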
2303.06373 Report Recursive Generalization Transformer for Image Super-Resolution Zheng Chen, Yulun Zhang, Jinjin Gu, Linghe Kong, Xiaokang Yang Transformer architectures have exhibited remarkable performance in image super-resolution (SR). Since the quadratic computational complexity of the self-attention (SA) in Transformer, existing methods tend to adopt SA in a local region to reduce overheads. However, the local design restricts the global context exploitation, which is crucial for accurate image reconstruction. In this work, we propose the Recursive Generalization Transformer (RGT) for image SR, which can capture global spatial information and is suitable for high-resolution images. Specifically, we propose the recursive-generalization self-attention (RG-SA). It recursively aggregates input features into representative feature maps, and then utilizes cross-attention to extract global information. Meanwhile, the channel dimensions of attention matrices (query, key, and value) are further scaled to mitigate the redundancy in the channel domain. Furthermore, we combine the RG-SA with local self-attention to enhance the exploitation of the global context, and propose the hybrid adaptive integration (HAI) for module integration. The HAI allows the direct and effective fusion between features at different levels (local or global). Extensive experiments demonstrate that our RGT outperforms recent state-of-the-art methods quantitatively and qualitatively. Code and pre-trained models are available at https://github.com/zhengchen1999/RGT. This paper proposes the Recursive Generalization Transformer (RGT) for image super-resolution, which can effectively capture global spatial information with linear computational complexity, making it suitable for high-resolution images. Existing Transformer-based image SR methods rely on local attention mechanisms to reduce computational complexity, limiting their ability to capture global context crucial for accurate image reconstruction. This work addresses this limitation by enabling effective global information modeling with manageable complexity. The paper introduces the recursive-generalization self-attention (RG-SA) module. RG-SA first employs a recursive generalization module (RGM) to compress input features into representative feature maps. Then, it performs cross-attention between the input features and the representative maps to capture global dependencies. Additionally, it scales the channel dimensions of attention matrices to reduce redundancy and complexity. The RGT architecture combines RG-SA with local self-attention in an alternating arrangement, further enhancing global context utilization through the proposed hybrid adaptive integration (HAI) method. RGT quantitatively outperforms recent state-of-the-art image SR methods on benchmark datasets across different scaling factors. RGT qualitatively surpasses other methods in handling challenging cases, reconstructing more image details and alleviating blurring artifacts. RGT achieves a better trade-off between model complexity and performance compared to existing CNN-based and Transformer-based methods. The current design of RGM mainly utilizes depth-wise convolutions, which could be further explored for better feature aggregation. Exploring the application of RGT in other low-level vision tasks beyond image super-resolution. image super-resolution, vision transformer, global attention, recursive generalization, hybrid adaptive integration
2303.06329 Report MetaViewer: Towards A Unified Multi-View Representation Ren Wang, Haoliang Sun, Yuling Ma, Xiaoming Xi, Yilong Yin Existing multi-view representation learning methods typically follow a specific-to-uniform pipeline, extracting latent features from each view and then fusing or aligning them to obtain the unified object representation. However, the manually pre-specified fusion functions and the view-private redundant information mixed into the features potentially degrade the quality of the derived representation. To overcome these issues, we propose a novel bi-level-optimization-based multi-view learning framework, where the representation is learned in a uniform-to-specific manner. Specifically, we train a meta-learner, namely MetaViewer, to learn fusion and model the view-shared meta representation in the outer-level optimization. Starting from this meta representation, view-specific base-learners are then required to rapidly reconstruct the corresponding view in the inner level. MetaViewer eventually updates by observing reconstruction processes from uniform to specific over all views, and learns an optimal fusion scheme that separates and filters out view-private information. Extensive experimental results in downstream tasks such as classification and clustering demonstrate the effectiveness of our method. Proposes MetaViewer, a novel bi-level optimization framework for multi-view representation learning that learns a unified representation in a uniform-to-specific manner. Addresses limitations of traditional specific-to-uniform multi-view learning methods that struggle with data-driven fusion and filtering view-private redundant information. MetaViewer uses a meta-learner to learn the fusion of view-shared information and base-learners to reconstruct individual views, effectively separating view-private information. MetaViewer outperforms state-of-the-art methods in clustering tasks on multiple benchmarks. The learned unified representation achieves superior classification results, particularly in datasets with a large number of classes. Ablation studies confirm the effectiveness of meta-learning fusion and the robustness to hyperparameter settings. The current implementation primarily focuses on reconstruction-based self-supervision. Exploring alternative meta-learner architectures beyond convolutional layers could be beneficial. multi-view learning, representation learning, meta-learning, bi-level optimization, self-supervision
2303.06285 Report DeltaEdit: Exploring Text-free Training for Text-Driven Image Manipulation Yueming Lyu, Tianwei Lin, Fu Li, Dongliang He, Jing Dong, Tieniu Tan Text-driven image manipulation remains challenging in training or inference flexibility. Conditional generative models depend heavily on expensive annotated training data. Meanwhile, recent frameworks, which leverage pre-trained vision-language models, are limited by either per text-prompt optimization or inference-time hyper-parameter tuning. In this work, we propose a novel framework named DeltaEdit to address these problems. Our key idea is to investigate and identify a space, namely the delta image and text space, that has a well-aligned distribution between CLIP visual feature differences of two images and CLIP textual embedding differences of source and target texts. Based on the CLIP delta space, the DeltaEdit network is designed to map the CLIP visual feature differences to the editing directions of StyleGAN at the training phase. Then, in the inference phase, DeltaEdit predicts the StyleGAN editing directions from the differences of the CLIP textual features. In this way, DeltaEdit is trained in a text-free manner. Once trained, it can well generalize to various text prompts for zero-shot inference without bells and whistles. Code is available at https://github.com/Yueming6568/DeltaEdit. Proposes DeltaEdit, a novel framework for text-driven image manipulation that uses a text-free training paradigm, eliminating the need for expensive annotated text data during training. Addresses the limitations of previous text-driven image manipulation methods that suffer from training/inference inflexibility, poor generalization, and dependence on expensive annotated training data. Leverages the semantically aligned CLIP delta image-text feature space to train a Delta Mapper network. This network learns a mapping from image feature differences to StyleGAN's latent style space changes, enabling text-driven manipulation during inference by utilizing CLIP text embedding differences. Achieves state-of-the-art performance on various datasets (FFHQ, LSUN Cat, Church, Horse) with high-quality and disentangled editing results. Generalizes well to unseen text prompts for zero-shot inference without requiring per-prompt optimization or hyper-parameter tuning. Demonstrates superior efficiency compared to previous methods, with significantly reduced training and inference times. The quality of manipulation relies on the pre-trained StyleGAN and CLIP models. Struggles with manipulating images containing attributes not well-represented in the training dataset. text-driven image manipulation, text-free training, clip, stylegan, zero-shot learning
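The delta-space idea reduces to a small mapper network, sketched below with placeholder dimensions and random tensors in place of precomputed CLIP features and StyleGAN style codes: train the mapper on image-feature differences, then feed text-feature differences at inference. The MLP architecture is an assumption, not the paper's network.

```python
# Condensed sketch of the CLIP delta-space idea: train a mapper from CLIP *image* feature
# differences to style-code edits, then reuse it at inference with CLIP *text* feature differences.
import torch
import torch.nn as nn

clip_dim, style_dim = 512, 6048
mapper = nn.Sequential(nn.Linear(clip_dim, 1024), nn.ReLU(), nn.Linear(1024, style_dim))
opt = torch.optim.Adam(mapper.parameters(), lr=1e-4)

def train_step(clip_feat_src, clip_feat_tgt, style_src, style_tgt):
    """Text-free training: supervision comes from image-feature and style-code differences."""
    delta_clip = clip_feat_tgt - clip_feat_src
    delta_style = style_tgt - style_src
    loss = nn.functional.mse_loss(mapper(delta_clip), delta_style)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# toy tensors standing in for precomputed CLIP features and StyleGAN style codes
loss = train_step(torch.randn(8, clip_dim), torch.randn(8, clip_dim),
                  torch.randn(8, style_dim), torch.randn(8, style_dim))

# inference: swap in a CLIP text-feature difference (e.g. "face" -> "smiling face")
delta_text = torch.randn(1, clip_dim)          # placeholder for the text-embedding difference
edit_direction = mapper(delta_text)            # added to the source style code before synthesis
print(loss, edit_direction.shape)
```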
2303.05970 Report Exploring Recurrent Long-term Temporal Fusion for Multi-view 3D Perception Chunrui Han, Jinrong Yang, Jianjian Sun, Zheng Ge, Runpei Dong, Hongyu Zhou, Weixin Mao, Yuang Peng, Xiangyu Zhang Long-term temporal fusion is a crucial but often overlooked technique in camera-based Bird's-Eye-View (BEV) 3D perception. Existing methods mostly fuse frames in a parallel manner. While parallel fusion can benefit from long-term information, it suffers from increasing computational and memory overheads as the fusion window size grows. Alternatively, BEVFormer adopts a recurrent fusion pipeline so that history information can be efficiently integrated, yet it fails to benefit from longer temporal frames. In this paper, we explore an embarrassingly simple long-term recurrent fusion strategy built upon the LSS-based methods and find it already able to enjoy the merits from both sides, i.e., rich long-term information and an efficient fusion pipeline. A temporal embedding module is further proposed to improve the model's robustness against occasionally missed frames in practical scenarios. We name this simple but effective fusing pipeline VideoBEV. Experimental results on the nuScenes benchmark show that VideoBEV obtains strong performance on various camera-based 3D perception tasks, including object detection (55.4% mAP and 62.9% NDS), segmentation (48.6% vehicle mIoU), tracking (54.8% AMOTA), and motion prediction (0.80m minADE and 0.463 EPA). This paper proposes VideoBEV, a simple yet effective recurrent long-term temporal fusion framework for camera-based Bird's-Eye-View (BEV) 3D perception. Long-term temporal fusion is crucial for accurate 3D perception in autonomous driving but often overlooked or limited in existing methods. VideoBEV addresses this by efficiently fusing long-term information for comprehensive scene understanding. VideoBEV leverages a recurrent fusion module that sequentially integrates BEV features from a long sequence of frames. Additionally, a temporal embedding module is introduced to enhance robustness against missed frames in real-world scenarios. VideoBEV achieves state-of-the-art performance on the nuScenes benchmark across various 3D perception tasks, including 3D object detection (55.4% mAP and 62.9% NDS), map segmentation, and 3D object tracking (54.8% AMOTA). The study demonstrates, for the first time, that recurrent temporal fusion with longer sequences (e.g., 16 frames in 8s) brings further benefits for perception accuracy. VideoBEV maintains efficiency compared to parallel fusion methods, with consistently low overhead for memory and computation even with longer video inputs. The paper primarily focuses on high-level BEV feature fusion, leaving room for exploration of incorporating more advanced temporal fusion techniques at lower feature levels. Further research could investigate extending VideoBEV with more sophisticated motion modeling and prediction capabilities. 3d perception, autonomous driving, temporal fusion, "birds-eye-view (bev)", recurrent neural networks
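The recurrent fusion itself can be captured in a few lines. The sketch below carries a single fused BEV state across frames and merges it with each new frame's BEV features, so memory and compute stay constant with history length; ego-motion alignment of the history state and the temporal embedding module are deliberately omitted, and the fusion block is an assumed stand-in.

```python
# Bare-bones sketch of recurrent long-term BEV fusion: one fused BEV state is carried forward
# and merged with each frame's BEV features, keeping cost constant regardless of history length.
from typing import Optional
import torch
import torch.nn as nn

class RecurrentBEVFusion(nn.Module):
    def __init__(self, channels: int = 80):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, bev_t: torch.Tensor, state: Optional[torch.Tensor]) -> torch.Tensor:
        if state is None:                        # first frame: nothing to fuse yet
            return bev_t
        return self.fuse(torch.cat([state, bev_t], dim=1))

fusion = RecurrentBEVFusion()
state = None
for _ in range(16):                              # e.g. 16 frames spanning 8 seconds
    bev_t = torch.randn(1, 80, 128, 128)         # per-frame BEV features from the view transform
    state = fusion(bev_t, state)
print(state.shape)
```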
2303.05892 Report Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection Luting Wang, Yi Liu, Penghui Du, Zihan Ding, Yue Liao, Qiaosong Qi, Biaolong Chen, Si Liu Open-vocabulary object detection aims to provide object detectors trained on a fixed set of object categories with the generalizability to detect objects described by arbitrary text queries. Previous methods adopt knowledge distillation to extract knowledge from Pretrained Vision-and-Language Models (PVLMs) and transfer it to detectors. However, due to the non-adaptive proposal cropping and single-level feature mimicking processes, they suffer from information destruction during knowledge extraction and inefficient knowledge transfer. To remedy these limitations, we propose an Object-Aware Distillation Pyramid (OADP) framework, including an Object-Aware Knowledge Extraction (OAKE) module and a Distillation Pyramid (DP) mechanism. When extracting object knowledge from PVLMs, the former adaptively transforms object proposals and adopts object-aware mask attention to obtain precise and complete knowledge of objects. The latter introduces global and block distillation for more comprehensive knowledge transfer to compensate for the missing relation information in object distillation. Extensive experiments show that our method achieves significant improvement compared to current methods. Especially on the MS-COCO dataset, our OADP framework reaches 35.6 mAP^N_50, surpassing the current state-of-the-art method by 3.3 mAP^N_50. Code is released at https://github.com/LutingWang/OADP. This paper proposes OADP, an Object-Aware Distillation Pyramid framework for open-vocabulary object detection. Existing methods suffer from information loss during knowledge extraction from pre-trained vision-and-language models and inefficient knowledge transfer to detectors. OADP uses an OAKE module to adaptively transform object proposals and extract precise knowledge with masked attention. It also employs a DP mechanism with global, block, and object distillation for comprehensive knowledge transfer. OADP achieves 35.6 mAP^N_50 on OV-COCO, surpassing the previous state-of-the-art by 3.3 mAP^N_50. On OV-LVIS, OADP achieves 21.9 AP_r for object detection and 21.7 AP_r for instance segmentation, outperforming previous methods by more than 1.1 AP_r and 1.9 AP_r respectively. Ablation studies demonstrate the effectiveness of OAKE and the DP mechanism in improving detection performance. The training cost of OADP is high due to the use of a large-scale image-text model. The performance of OADP on novel categories is still lower than that on base categories. open-vocabulary object detection, knowledge distillation, vision and language, object proposal, adaptive learning
2303.05828 Report Adapting Contrastive Language-Image Pretrained (CLIP) Models for Out-of-Distribution Detection Nikolas Adaloglou, Felix Michels, Tim Kaiser, Markus Kollmann We present a comprehensive experimental study on pretrained feature extractors for visual out-of-distribution (OOD) detection, focusing on adapting contrastive language-image pretrained (CLIP) models. Without fine-tuning on the training data, we are able to establish a positive correlation (R^2 ≥ 0.92) between in-distribution classification and unsupervised OOD detection for CLIP models in 4 benchmarks. We further propose a new simple and scalable method called pseudo-label probing (PLP) that adapts vision-language models for OOD detection. Given a set of label names of the training set, PLP trains a linear layer using the pseudo-labels derived from the text encoder of CLIP. To test the OOD detection robustness of pretrained models, we develop a novel feature-based adversarial OOD data manipulation approach to create adversarial samples. Intriguingly, we show that (i) PLP outperforms the previous state-of-the-art (Ming et al., 2022) on all 5 large-scale benchmarks based on ImageNet, specifically by an average AUROC gain of 3.4% using the largest CLIP model (ViT-G), (ii) we show that linear probing outperforms fine-tuning by large margins for CLIP architectures (i.e. CLIP ViT-H achieves a mean gain of 7.3% AUROC on average on all ImageNet-based benchmarks), and (iii) billion-parameter CLIP models still fail at detecting adversarially manipulated OOD images. The code and adversarially created datasets will be made publicly available. This paper investigates the use of pretrained CLIP models for visual out-of-distribution (OOD) detection and proposes a new method called pseudo-label probing (PLP) for adapting CLIP to this task. Accurate OOD detection is crucial for real-world applications to ensure safety during deployment, and leveraging pretrained models like CLIP can significantly benefit this task. The paper conducts experiments with 25 pretrained feature extractors on various OOD benchmarks. PLP utilizes CLIP's text encoder to generate pseudo-labels for training a linear layer on top of CLIP's visual features. The authors also introduce a novel feature-based adversarial OOD data manipulation technique. CLIP models show a strong correlation between in-distribution classification accuracy and unsupervised OOD detection performance. PLP outperforms previous state-of-the-art methods on ImageNet benchmarks, achieving an average AUROC gain of 3.4% with CLIP ViT-G. Linear probing on CLIP features surpasses fine-tuning for OOD detection on ImageNet-based benchmarks, indicating that OOD-related information is readily available in large-scale models. The study primarily focuses on ImageNet-based benchmarks and might not generalize to other datasets. Further research is needed to understand the impact of PLP on in-distribution test accuracy and explore its applicability to other visual feature extractors. out-of-distribution detection, clip, contrastive language-image pretraining, pseudo-label probing, adversarial robustness
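Pseudo-label probing is simple enough to sketch end to end. Below, random tensors stand in for frozen CLIP image features and class-name text embeddings; pseudo-labels come from the nearest text embedding and a linear probe is trained on them. The final OOD score shown (maximum softmax probability) is one common choice, not necessarily the exact score used in the paper.

```python
# Sketch of pseudo-label probing (PLP) with placeholder features: class-name text embeddings
# assign each image a pseudo-label, and a linear layer is trained on the frozen visual features.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, dim, n = 10, 512, 2048
img_feats = F.normalize(torch.randn(n, dim), dim=-1)            # frozen CLIP image features
txt_feats = F.normalize(torch.randn(num_classes, dim), dim=-1)  # CLIP text embeddings of label names

# pseudo-labels: argmax cosine similarity to the class-name embeddings (zero-shot prediction)
pseudo_labels = (img_feats @ txt_feats.T).argmax(dim=-1)

probe = nn.Linear(dim, num_classes)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(100):
    loss = F.cross_entropy(probe(img_feats), pseudo_labels)
    opt.zero_grad(); loss.backward(); opt.step()

# at test time, the probe's maximum softmax probability can serve as an OOD score
scores = probe(img_feats).softmax(dim=-1).max(dim=-1).values
print(scores.mean().item())
```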
2303.05807 Report Aleth-NeRF: Low-light Condition View Synthesis with Concealing Fields Ziteng Cui, Lin Gu, Xiao Sun, Xianzheng Ma, Yu Qiao, Tatsuya Harada Commonly captured low-light scenes are challenging for most computer vision techniques, including Neural Radiance Fields (NeRF). Vanilla NeRF is viewer-centred: it simplifies the rendering process to light emission from 3D locations along the viewing direction, and thus fails to model low-illumination-induced darkness. Inspired by the emission theory of the ancient Greeks, which holds that visual perception is accomplished by rays cast from the eyes, we make slight modifications to vanilla NeRF so that, trained on multiple views of a low-light scene, it can render the well-lit scene in an unsupervised manner. We introduce a surrogate concept, Concealing Fields, that reduces the transport of light during the volume rendering stage. Specifically, our proposed method, Aleth-NeRF, directly learns from the dark images to understand the volumetric object representation and the concealing field under priors. By simply eliminating the Concealing Fields, we can render single or multi-view well-lit images and gain superior performance over other 2D low-light enhancement methods. Additionally, we collect the first paired LOw-light and normal-light Multi-view (LOM) dataset for future research. This version is invalid; please refer to our new AAAI version: arXiv:2312.09093. Presents Aleth-NeRF, the first NeRF-based method trained on dark multi-view sRGB images for unsupervised low-light enhancement. Vanilla NeRF struggles with low-light scenes due to its viewer-centered approach, failing to model light attenuation. Existing solutions either require known lighting conditions or rely on 2D enhancement methods that lack 3D consistency. Introduces 'Concealing Fields' into the NeRF framework to simulate light attenuation, effectively extending the transmittance function. Employs priors like value range, structure similarity, and color constancy to enable unsupervised learning of these fields from low-light images. Achieves state-of-the-art performance on the LOL dataset for single-image low-light enhancement, demonstrating high generation quality. Introduces LOM, the first paired low-light and normal-light multi-view dataset for benchmarking. Outperforms existing 2D enhancement methods on the LOM dataset, exhibiting superior image quality and multi-view consistency in low-light scene rendering. Requires separate training for each scene, limiting generalizability. May struggle with scenes exhibiting non-uniform lighting or strong shadows. Future work includes addressing these limitations and exploring applications in dynamic scene relighting. neural radiance fields, low-light image enhancement, unsupervised learning, novel view synthesis, 3d scene understanding
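A toy 1-D volume-rendering example conveys the concealing-field intuition: an extra non-negative field attenuates light transport when fitting the dark training views, and dropping it at inference yields a brighter rendering. The sketch below is only illustrative; the paper's exact formulation (and its revised AAAI version) differs.

```python
# Toy 1-D volume rendering with a "concealing" field: the extra extinction term reduces the
# accumulated transmittance for the dark training views; removing it brightens the rendering.
import numpy as np

n_samples = 64
delta = 0.05                                      # spacing between samples along the ray
sigma = np.random.rand(n_samples) * 2.0           # learned scene density
color = np.random.rand(n_samples, 3)              # learned per-sample color
conceal = np.random.rand(n_samples) * 1.5         # learned concealing field (non-negative)

def render(sigma, color, delta, conceal=None):
    atten = sigma if conceal is None else sigma + conceal    # concealing adds extinction
    alpha = 1.0 - np.exp(-sigma * delta)                      # per-sample opacity from density
    trans = np.exp(-np.cumsum(np.concatenate([[0.0], atten[:-1] * delta])))
    return (trans * alpha)[:, None] * color

dark_pixel = render(sigma, color, delta, conceal).sum(axis=0)       # matches the low-light capture
lit_pixel = render(sigma, color, delta, conceal=None).sum(axis=0)   # concealing removed at inference
print(dark_pixel, lit_pixel)
```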
2303.05775 Report Self-NeRF: A Self-Training Pipeline for Few-Shot Neural Radiance Fields Jiayang Bai, Letian Huang, Wen Gong, Jie Guo, Yanwen Guo Recently, Neural Radiance Fields (NeRF) have emerged as a potent method for synthesizing novel views from a dense set of images. Despite its impressive performance, NeRF is plagued by its necessity for numerous calibrated views and its accuracy diminishes significantly in a few-shot setting. To address this challenge, we propose Self-NeRF, a self-evolved NeRF that iteratively refines the radiance fields with very few number of input views, without incorporating additional priors. Basically, we train our model under the supervision of reference and unseen views simultaneously in an iterative procedure. In each iteration, we label unseen views with the predicted colors or warped pixels generated by the model from the preceding iteration. However, these expanded pseudo-views are afflicted by imprecision in color and warping artifacts, which degrades the performance of NeRF. To alleviate this issue, we construct an uncertainty-aware NeRF with specialized embeddings. Some techniques such as cone entropy regularization are further utilized to leverage the pseudo-views in the most efficient manner. Through experiments under various settings, we verified that our Self-NeRF is robust to input with uncertainty and surpasses existing methods when trained on limited training data. This paper introduces Self-NeRF, a novel iterative self-training pipeline for Neural Radiance Fields (NeRF) designed to enhance novel view synthesis from a limited set of input views (few-shot). NeRF often struggles with few-shot scenarios, leading to degenerate solutions and overfitting. This work aims to address this challenge by iteratively refining NeRF reconstructions without relying on additional priors. Self-NeRF operates by iteratively training an uncertainty-aware NeRF model using both seen views and synthesized pseudo-views. The pseudo-views are generated in two ways: by warping seen views based on predicted depth and by using direct predictions from the previous iteration's model. The uncertainty-aware nature of the model allows it to handle inaccuracies inherent in these pseudo-views. Self-NeRF effectively synthesizes novel views with superior detail compared to existing few-shot NeRF methods. The iterative training process demonstrably improves the quality of reconstructions over multiple iterations. Self-NeRF exhibits robustness to varying numbers of input views, demonstrating its effectiveness in few-shot settings. The reliance on iterative training increases the overall computational cost. While effective, the performance gains of Self-NeRF decrease as the number of input views increases. neural radiance fields, nerf, few-shot learning, novel view synthesis, self-training
2303.05724 Report 3D Cinemagraphy from a Single Image Xingyi Li, Zhiguo Cao, Huiqiang Sun, Jianming Zhang, Ke Xian, Guosheng Lin We present 3D Cinemagraphy, a new technique that marries 2D image animation with 3D photography. Given a single still image as input, our goal is to generate a video that contains both visual content animation and camera motion. We empirically find that naively combining existing 2D image animation and 3D photography methods leads to obvious artifacts or inconsistent animation. Our key insight is that representing and animating the scene in 3D space offers a natural solution to this task. To this end, we first convert the input image into feature-based layered depth images using predicted depth values, followed by unprojecting them to a feature point cloud. To animate the scene, we perform motion estimation and lift the 2D motion into the 3D scene flow. Finally, to resolve the problem of hole emergence as points move forward, we propose to bidirectionally displace the point cloud as per the scene flow and synthesize novel views by separately projecting them into target image planes and blending the results. Extensive experiments demonstrate the effectiveness of our method. A user study is also conducted to validate the compelling rendering results of our method. Presents 3D Cinemagraphy, a novel technique generating videos with plausible animation and camera motion from a single still image. Traditional cinemagraphs lack 3D immersion and parallax effects; this work aims to bridge the gap between 2D image animation and 3D photography for a more realistic experience. Converts input image to feature-based layered depth images, unprojects to a feature point cloud, estimates and lifts 2D motion to 3D scene flow, animates the point cloud bidirectionally to address holes, and renders novel views at each time step. Outperforms baselines combining 2D animation and novel view synthesis in quantitative metrics (PSNR, SSIM, LPIPS). Produces visually compelling results with fewer artifacts like flickering or jelly-like effects compared to alternative approaches. Demonstrates generalization ability on in-the-wild photos, paintings, and synthetic images, especially with user-provided masks and flow hints for controlled animation. Performance depends on the accuracy of depth estimation, particularly for challenging structures like thin objects. Currently focuses on fluid motion animation, leaving more complex motions like cyclic movements for future exploration. 3d cinemagraphy, image animation, novel view synthesis, point cloud animation, single image animation
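The hole-filling step described above, displacing the point cloud both forward and backward in time and blending the two renderings, can be sketched as follows; the linear blending weights are the common symmetric-splatting heuristic and are an assumption of this sketch rather than the paper's exact formula.

```python
# Rough sketch: bidirectional displacement of a point cloud by 3D scene flow for frame t
# of a loop of length num_frames, plus linear blending of the two rendered feature maps.
import torch

def bidirectional_points(points, scene_flow, t, num_frames):
    """points, scene_flow: (P, 3). Returns forward/backward displaced points and blend weights."""
    fwd = points + scene_flow * t                      # displaced forward from the start frame
    bwd = points - scene_flow * (num_frames - t)       # displaced backward from the end frame
    w_fwd = 1.0 - t / num_frames                       # trust the forward copy early in the loop
    w_bwd = t / num_frames                             # trust the backward copy late in the loop
    return fwd, bwd, w_fwd, w_bwd

def blend(feat_fwd, feat_bwd, w_fwd, w_bwd):
    """Blend per-pixel features rendered from the two displaced point clouds."""
    return w_fwd * feat_fwd + w_bwd * feat_bwd
```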
2303.05699 Report Feature Unlearning for Pre-trained GANs and VAEs Saemi Moon, Seunghyuk Cho, Dongwoo Kim We tackle the problem of feature unlearning from a pre-trained image generative model: GANs and VAEs. Unlike a common unlearning task where an unlearning target is a subset of the training set, we aim to unlearn a specific feature, such as hairstyle from facial images, from the pre-trained generative models. As the target feature is only presented in a local region of an image, unlearning the entire image from the pre-trained model may result in losing other details in the remaining region of the image. To specify which features to unlearn, we collect randomly generated images that contain the target features. We then identify a latent representation corresponding to the target feature and then use the representation to fine-tune the pre-trained model. Through experiments on MNIST, CelebA, and FFHQ datasets, we show that target features are successfully removed while keeping the fidelity of the original models. Further experiments with an adversarial attack show that the unlearned model is more robust under the presence of malicious parties. This paper presents a novel framework for unlearning specific features from pre-trained image generative models, such as GANs and VAEs. This addresses the problem of unwanted or harmful content generation while avoiding the need to retrain the entire model. The method involves identifying a latent representation of the target feature and then fine-tuning the pre-trained model to prevent the generation of images with that feature. This is achieved by collecting images with the target feature, identifying a corresponding latent vector, and then using that vector to guide the fine-tuning process. The unlearned models successfully reduce the generation of target features, achieving similar target feature ratios to oracle models trained without the target feature. The unlearning process maintains high image quality, as demonstrated by comparable Inception Score and Fréchet Inception Distance scores to the original and oracle models. The unlearned models show increased robustness against adversarial attacks, making them less susceptible to manipulation for generating unwanted content. The proposed method's effectiveness heavily relies on the quality of the latent space disentanglement. Future work includes exploring more sophisticated feature disentanglement algorithms to improve the precision of feature unlearning. generative adversarial networks, variational autoencoders, machine unlearning, feature unlearning, adversarial robustness
2303.05646 Report Iterative Few-shot Semantic Segmentation from Image Label Text Haohan Wang, Liang Liu, Wuhao Zhang, Jiangning Zhang, Zhenye Gan, Yabiao Wang, Chengjie Wang, Haoqian Wang Few-shot semantic segmentation aims to learn to segment unseen class objects with the guidance of only a few support images. Most previous methods rely on the pixel-level label of support images. In this paper, we focus on a more challenging setting, in which only the image-level labels are available. We propose a general framework to firstly generate coarse masks with the help of the powerful vision-language model CLIP, and then iteratively and mutually refine the mask predictions of support and query images. Extensive experiments on PASCAL-5i and COCO-20i datasets demonstrate that our method not only outperforms the state-of-the-art weakly supervised approaches by a significant margin, but also achieves comparable or better results to recent supervised methods. Moreover, our method owns an excellent generalization ability for the images in the wild and uncommon classes. Code will be available at https://github.com/Whileherham/IMR-HSNet. Proposes a general framework for few-shot semantic segmentation that uses only image-level label text for support images, generating coarse masks with CLIP and then iteratively refining the support and query predictions. Most few-shot segmentation methods depend on pixel-level masks for support images, which are expensive to annotate; relying only on image-level labels makes the setting far more practical but also more challenging. Coarse masks are first generated with the vision-language model CLIP from the image label text, after which the mask predictions of support and query images are iteratively and mutually refined. Outperforms state-of-the-art weakly supervised approaches by a significant margin on PASCAL-5i and COCO-20i. Achieves comparable or better results than recent fully supervised methods. Shows strong generalization to in-the-wild images and uncommon classes. few-shot semantic segmentation, weakly supervised learning, image-level labels, clip, vision-language models
2303.05503 Report Open-world Instance Segmentation: Top-down Learning with Bottom-up Supervision Tarun Kalluri, Weiyao Wang, Heng Wang, Manmohan Chandraker, Lorenzo Torresani, Du Tran Many top-down architectures for instance segmentation achieve significant success when trained and tested on pre-defined closed-world taxonomy. However, when deployed in the open world, they exhibit notable bias towards seen classes and suffer from significant performance drop. In this work, we propose a novel approach for open world instance segmentation called bottom-Up and top-Down Open-world Segmentation (UDOS) that combines classical bottom-up segmentation algorithms within a top-down learning framework. UDOS first predicts parts of objects using a top-down network trained with weak supervision from bottom-up segmentations. The bottom-up segmentations are class-agnostic and do not overfit to specific taxonomies. The part-masks are then fed into affinity-based grouping and refinement modules to predict robust instance-level segmentations. UDOS enjoys both the speed and efficiency from the top-down architectures and the generalization ability to unseen categories from bottom-up supervision. We validate the strengths of UDOS on multiple cross-category as well as cross-dataset transfer tasks from 5 challenging datasets including MS-COCO, LVIS, ADE20k, UVO and OpenImages, achieving significant improvements over state-of-the-art across the board. Our code and models are available on our project page. This paper introduces UDOS (Bottom-Up and Top-Down Open-World Segmentation), a novel method for open-world instance segmentation that combines the strengths of bottom-up and top-down approaches. Open-world instance segmentation is crucial for real-world applications where models encounter novel objects not present in the training taxonomy. UDOS leverages a top-down network trained with weak supervision from class-agnostic bottom-up segmentation to predict object parts. These parts are then grouped using affinity scores and refined for boundary accuracy, enabling the detection of both seen and unseen objects. UDOS outperforms state-of-the-art methods in cross-category generalization on COCO, achieving 33.5% box AR and 31.6% mask AR. The method excels in cross-dataset generalization, setting new state-of-the-art results on UVO, ADE20k, and OpenImages datasets without fine-tuning. Ablation studies validate the contribution of each module and the importance of design choices. UDOS faces challenges with densely clustered objects of similar appearance. Future work could explore more robust grouping methods or incorporate recent innovations like Segment Anything (SAM) for improved initial segmentation. open-world learning, instance segmentation, bottom-up segmentation, top-down learning, cross-category generalization
2303.05499 Report Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang In this paper, we present an open-set object detector, called Grounding DINO, by marrying Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects with human inputs such as category names or referring expressions. The key solution of open-set object detection is introducing language to a closed-set detector for open-set concept generalization. To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for cross-modality fusion. While previous works mainly evaluate open-set object detection on novel categories, we propose to also perform evaluations on referring expression comprehension for objects specified with attributes. Grounding DINO performs remarkably well on all three settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. Grounding DINO achieves a 52.5 AP on the COCO detection zero-shot transfer benchmark, i.e., without any training data from COCO. It sets a new record on the ODinW zero-shot benchmark with a mean 26.1 AP. Code will be available at https://github.com/IDEA-Research/GroundingDINO. The paper presents Grounding DINO, an open-set object detector that leverages grounded pre-training to enable the detection of arbitrary objects specified by human language input. Open-set object detection is crucial for developing visual intelligence systems capable of understanding novel concepts, with applications ranging from image editing to generic object detection. The paper proposes a tight fusion approach, incorporating language information into multiple phases of a DINO detector: a feature enhancer for cross-modality fusion, a language-guided query selection module, and a cross-modality decoder. Grounding DINO achieves state-of-the-art performance on open-set detection benchmarks, including 52.5 AP on COCO zero-shot transfer and 26.1 mean AP on ODinW zero-shot. The paper extends open-set object detection evaluation to Referring Expression Comprehension (REC) tasks, revealing a need for future work to focus on REC zero-shot performance. Ablation studies demonstrate the effectiveness of each proposed fusion component in enhancing open-set object detection performance. While achieving impressive open-set detection results, Grounding DINO lacks segmentation capabilities. The training data scale is limited compared to the largest GLIP models, potentially hindering further performance improvement. open-set object detection, referring expression comprehension, transformer-based detectors, multi-modal learning, grounded pre-training
2303.05498 Report Mark My Words: Dangers of Watermarked Images in ImageNet Kirill Bykov, Klaus-Robert Müller, Marina M. -C. Höhne The utilization of pre-trained networks, especially those trained on ImageNet, has become a common practice in Computer Vision. However, prior research has indicated that a significant number of images in the ImageNet dataset contain watermarks, making pre-trained networks susceptible to learning artifacts such as watermark patterns within their latent spaces. In this paper, we aim to assess the extent to which popular pre-trained architectures display such behavior and to determine which classes are most affected. Additionally, we examine the impact of watermarks on the extracted features. Contrary to the popular belief that the Chinese logographic watermarks impact the "carton" class only, our analysis reveals that a variety of ImageNet classes, such as "monitor", "broom", "apron" and "safe" rely on spurious correlations. Finally, we propose a simple approach to mitigate this issue in fine-tuned networks by ignoring the encodings from the feature-extractor layer of ImageNet pre-trained networks that are most susceptible to watermark imprints. This paper investigates the impact of watermarks on ImageNet pre-trained models and reveals that many ImageNet classes, beyond the previously known "carton" class, are susceptible to spurious correlations with watermarks, particularly Chinese logograms. This is important because the presence of watermarks in training data can lead to models learning unintended artifacts, hindering their generalization ability and potentially leading to incorrect predictions. The authors analyzed the activations of 20 popular ImageNet pre-trained architectures on datasets with and without watermarks (Chinese, Latin, Hindi, and Numeric). They then measured the models' ability to differentiate between watermarked and normal images using AUC ROC. Numerous ImageNet classes exhibit sensitivity to Chinese watermarks, not just "carton". This sensitivity is prevalent across all tested pre-trained architectures. Ignoring the most watermark-sensitive representations during fine-tuning can mitigate the reliance on watermarks without significantly impacting performance. The study primarily focuses on Chinese logographic watermarks. Future work can explore the impact of other watermark types and develop more sophisticated mitigation techniques. imagenet, watermarks, spurious correlations, deep learning, transfer learning
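The per-class sensitivity analysis described above boils down to asking how well each output unit separates watermarked from clean versions of the same images. A minimal sketch, assuming paired activations are already computed, is shown below; the choice of using final logits rather than intermediate features is an assumption of this sketch.

```python
# Hedged sketch: per-class AUROC between activations on clean and watermark-overlaid images.
# Values near 0.5 mean the unit ignores the watermark; values near 1.0 mean it keys on it.
import numpy as np
from sklearn.metrics import roc_auc_score

def watermark_sensitivity(acts_clean: np.ndarray, acts_marked: np.ndarray) -> np.ndarray:
    """acts_*: (N, C) activations for the same N images without / with a watermark overlay."""
    labels = np.concatenate([np.zeros(len(acts_clean)), np.ones(len(acts_marked))])
    scores = np.concatenate([acts_clean, acts_marked], axis=0)            # (2N, C)
    return np.array([roc_auc_score(labels, scores[:, c]) for c in range(scores.shape[1])])
```

Units with high sensitivity can then be dropped or masked when fine-tuning, which is the mitigation the paper proposes.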
2303.05456 Report Restoration based Generative Models Jaemoo Choi, Yesom Park, Myungjoo Kang Denoising diffusion models (DDMs) have recently attracted increasing attention by showing impressive synthesis quality. DDMs are built on a diffusion process that pushes data to the noise distribution and the models learn to denoise. In this paper, we establish the interpretation of DDMs in terms of image restoration (IR). Integrating IR literature allows us to use an alternative objective and diverse forward processes, not confining to the diffusion process. By imposing prior knowledge on the loss function grounded on MAP-based estimation, we eliminate the need for the expensive sampling of DDMs. Also, we propose a multi-scale training, which improves the performance compared to the diffusion process, by taking advantage of the flexibility of the forward process. Experimental results demonstrate that our model improves the quality and efficiency of both training and inference. Furthermore, we show the applicability of our model to inverse problems. We believe that our framework paves the way for designing a new type of flexible general generative model. This paper introduces Restoration-based Generative Models (RGMs), a flexible generative model family inspired by image restoration techniques, to enhance the efficiency and flexibility of Denoising Diffusion Models (DDMs). DDMs, while effective, suffer from slow and computationally expensive sampling processes due to their reliance on iterative denoising and Gaussian noising processes. The authors leverage a Maximum A Posteriori (MAP)-based estimation with a learned prior term to replace the MMSE objective of DDMs. This approach enables efficient sampling with fewer steps by alleviating the ill-posedness inherent in inverse problems. Furthermore, they introduce flexibility in designing the degradation process, proposing a multi-scale approach that progressively reduces image dimension for more efficient latent representation. RGMs achieve comparable image generation quality to state-of-the-art DDMs, with significantly faster inference speed (e.g., FID 2.47 on CIFAR10 with only seven network function evaluations). The framework demonstrates flexibility through successful implementation of various prior terms (KLD, MMD, DSWD) and degradation processes, showcasing its adaptability and potential for further exploration. Beyond image generation, RGMs exhibit promising results in solving inverse problems like super-resolution and colorization when incorporated into Plug-and-Play algorithms. While showing strong empirical performance, the paper lacks theoretical justification for the effectiveness of RGMs. The exploration of more sophisticated and effective forward processes beyond the multi-scale approach is left as future work. generative models, denoising diffusion models, image restoration, map estimation, plug-and-play algorithms
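For readers who want the image-restoration view spelled out, the generic MAP objective this family of methods builds on can be written as below; the specific degradation operator, noise model, and learned prior used in the paper may differ, so this is background notation rather than the paper's exact loss.

```latex
% Generic MAP-based restoration objective (illustrative background, not the paper's exact loss).
% Observation model: y = A(x) + n, with a learned prior g_theta and weight lambda.
\[
  \hat{x}_{\mathrm{MAP}}
    = \arg\min_{x}\;
      \frac{1}{2\sigma^{2}}\,\bigl\lVert A(x) - y \bigr\rVert_{2}^{2}
      \;+\; \lambda\, g_{\theta}(x)
\]
% whereas denoising diffusion models are trained toward the MMSE estimate \(\mathbb{E}[x \mid y]\).
```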
2303.05371 Report 3DGen: Triplane Latent Diffusion for Textured Mesh Generation Anchit Gupta, Wenhan Xiong, Yixin Nie, Ian Jones, Barlas Oğuz Latent diffusion models for image generation have crossed a quality threshold which enabled them to achieve mass adoption. Recently, a series of works have made advancements towards replicating this success in the 3D domain, introducing techniques such as point cloud VAE, triplane representation, neural implicit surfaces and differentiable rendering based training. We take another step along this direction, combining these developments in a two-step pipeline consisting of 1) a triplane VAE which can learn latent representations of textured meshes and 2) a conditional diffusion model which generates the triplane features. For the first time this architecture allows conditional and unconditional generation of high quality textured or untextured 3D meshes across multiple diverse categories in a few seconds on a single GPU. It outperforms previous work substantially on image-conditioned and unconditional generation on mesh quality as well as texture generation. Furthermore, we demonstrate the scalability of our model to large datasets for increased quality and diversity. We will release our code and trained models. Presents 3DGen, a two-stage pipeline for high-quality textured 3D mesh generation, utilizing a triplane VAE for latent representation learning and a conditional diffusion model for feature generation. Aims to address limitations in existing 3D generation methods, such as scalability, joint geometry and texture learning, and practical computational constraints, bridging the gap towards practical and high-quality 3D object generation. Combines a triplane VAE, trained with rendering-based reconstruction loss, with a conditional diffusion model incorporating 3D-aware convolutions and classifier-free guidance, enabling image-conditioned, text-conditioned, and unconditional generation. Achieves state-of-the-art performance in unconditional and image-conditioned mesh generation, outperforming competitors like NFD and 3DILG in FID scores and geometry fidelity. Demonstrates superior performance in textured mesh generation compared to GET3D, with significant FID score improvements, generating high-quality meshes with detailed geometry and textures. Shows scalability and improved quality by pre-training on the large-scale Objaverse dataset, particularly benefiting low-resource categories and enabling text-guided generation. Despite improvements, the model's generality still lags behind image generation models trained on massive datasets. Future work can explore utilizing 2D image datasets as weak supervision or leveraging 2D generative models to further enhance 3D generation capabilities. 3d mesh generation, textured mesh, latent diffusion model, triplane representation, variational autoencoder
2303.05342 Report Knowledge-augmented Few-shot Visual Relation Detection Tianyu Yu, Yangning Li, Jiaoyan Chen, Yinghui Li, Hai-Tao Zheng, Xi Chen, Qingbin Liu, Wenqiang Liu, Dongxiao Huang, Bei Wu, Yexin Wang Visual Relation Detection (VRD) aims to detect relationships between objects for image understanding. Most existing VRD methods rely on thousands of training samples of each relationship to achieve satisfactory performance. Some recent papers tackle this problem by few-shot learning with elaborately designed pipelines and pre-trained word vectors. However, the performance of existing few-shot VRD models is severely hampered by the poor generalization capability, as they struggle to handle the vast semantic diversity of visual relationships. Nonetheless, humans have the ability to learn new relationships with just few examples based on their knowledge. Inspired by this, we devise a knowledge-augmented, few-shot VRD framework leveraging both textual knowledge and visual relation knowledge to improve the generalization ability of few-shot VRD. The textual knowledge and visual relation knowledge are acquired from a pre-trained language model and an automatically constructed visual relation knowledge graph, respectively. We extensively validate the effectiveness of our framework. Experiments conducted on three benchmarks from the commonly used Visual Genome dataset show that our performance surpasses existing state-of-the-art models with a large improvement. This paper proposes a knowledge-augmented few-shot visual relation detection framework that leverages textual and visual relation knowledge to improve generalization ability. Existing few-shot VRD models struggle with the vast semantic diversity of visual relationships, limiting their performance. Humans, however, leverage knowledge to learn new relationships from few examples, motivating this work. The framework acquires textual knowledge from a pre-trained language model using prompt-based representations. It constructs a visual relation knowledge graph from image captions, encoded into a distributed representation using a pre-trained BERT model. A Mixture-of-Experts module fuses both knowledge sources to predict relationships. The proposed framework significantly outperforms state-of-the-art models on three VRD benchmarks, even without VRD pre-training or vision-language pre-training. Both textual and visual relation knowledge significantly improve performance, especially for unseen object pairs and triplets. A novel prompt template for textual knowledge and a visual relation knowledge graph constructed from a large corpus of image captions contribute to the performance gains. The current approach only utilizes image captions for visual relation knowledge; exploring other sources like videos could be beneficial. Future work can explore more sophisticated knowledge fusion approaches beyond the Mixture-of-Experts module. visual relation detection, few-shot learning, knowledge augmentation, textual knowledge, visual relation knowledge graph
2303.05323 Report Controllable Video Generation by Learning the Underlying Dynamical System with Neural ODE Yucheng Xu, Li Nanbo, Arushi Goel, Zijian Guo, Zonghai Yao, Hamidreza Kasaei, Mohammadreze Kasaei, Zhibin Li Videos depict the change of complex dynamical systems over time in the form of discrete image sequences. Generating controllable videos by learning the dynamical system is an important yet underexplored topic in the computer vision community. This paper presents a novel framework, TiV-ODE, to generate highly controllable videos from a static image and a text caption. Specifically, our framework leverages the ability of Neural Ordinary Differential Equations~(Neural ODEs) to represent complex dynamical systems as a set of nonlinear ordinary differential equations. The resulting framework is capable of generating videos with both desired dynamics and content. Experiments demonstrate the ability of the proposed method in generating highly controllable and visually consistent videos, and its capability of modeling dynamical systems. Overall, this work is a significant step towards developing advanced controllable video generation models that can handle complex and dynamic scenes. Presents TiV-ODE, a novel framework for generating highly controllable videos from a static image and text caption by leveraging Neural ODEs to represent complex dynamical systems. Addresses limitations of traditional video generation methods by enabling control over both motion and appearance, and modeling the underlying continuous dynamical system for flexible frame rate generation. Combines image and text embeddings using a transformer, uses these as initial conditions for a Neural ODE, solves the ODE at desired timesteps to generate latent vectors, and decodes these into video frames using a VQ-VAE. Outperforms state-of-the-art methods like MAGE in metrics like FID and LPIPS on datasets like CATER and a new synthetic robot pick-and-place dataset. Demonstrates controllability by accurately manipulating objects in videos based on text captions. Successfully models continuous dynamics, enabling video generation with arbitrary and non-uniform frame rates (e.g., slow-motion effects). Training and solving the Neural ODE can be time-consuming, especially for complex motions. Reliance on the first frame for visual information can lead to weaker constraints on later frames and potential blurring. video generation, controllable generation, neural ode, dynamical systems, text-to-video
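The central step, rolling out latent states with a Neural ODE and decoding them into frames, might look roughly like the sketch below, which uses the torchdiffeq package; the network sizes, how the fused image/text embedding forms the initial state, and the frame decoder are placeholders of mine, not the paper's modules.

```python
# Minimal sketch of continuous latent dynamics with a Neural ODE (via torchdiffeq),
# assuming the fused image+text embedding is the initial latent state z0.
import torch
import torch.nn as nn
from torchdiffeq import odeint  # pip install torchdiffeq

class LatentDynamics(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, t, z):                 # dz/dt = f(t, z)
        return self.net(z)

def rollout_latents(z0: torch.Tensor, timesteps: torch.Tensor, func: LatentDynamics):
    """z0: (B, D) fused condition; timesteps: (T,) arbitrary, possibly non-uniform times."""
    return odeint(func, z0, timesteps)       # (T, B, D) latent states, later decoded to frames
```

Because the dynamics are continuous, the timesteps need not be uniform, which is what makes arbitrary or slow-motion frame rates possible at generation time.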
2303.05275 Report Detecting Images Generated by Diffusers Davide Alessandro Coccomini, Andrea Esuli, Fabrizio Falchi, Claudio Gennaro, Giuseppe Amato This paper explores the task of detecting images generated by text-to-image diffusion models. To evaluate this, we consider images generated from captions in the MSCOCO and Wikimedia datasets using two state-of-the-art models: Stable Diffusion and GLIDE. Our experiments show that it is possible to detect the generated images using simple Multi-Layer Perceptrons (MLPs), starting from features extracted by CLIP, or traditional Convolutional Neural Networks (CNNs). We also observe that models trained on images generated by Stable Diffusion can detect images generated by GLIDE relatively well, however, the reverse is not true. Lastly, we find that incorporating the associated textual information with the images rarely leads to significant improvement in detection results but that the type of subject depicted in the image can have a significant impact on performance. This work provides insights into the feasibility of detecting generated images, and has implications for security and privacy concerns in real-world applications. The code to reproduce our results is available at: https://github.com/davide-coccomini/Detecting-Images-Generated-by-Diffusers This paper investigates the detection of images generated by text-to-image diffusion models, specifically Stable Diffusion and GLIDE, using simple MLPs and CNNs. The ability to detect synthetic images is crucial for addressing concerns related to misinformation, deepfakes, and the integrity of online information, especially with the increasing accessibility and realism of text-to-image generation models. The study uses MLPs and CNNs trained on images from MSCOCO and Wikimedia datasets. They evaluate the models' performance in detecting images generated by Stable Diffusion and GLIDE, both within and across training methods. Additionally, they analyze the impact of image category and linguistic features on detection. Pretrained CNNs and MLPs using CLIP-extracted features can effectively detect images generated by Stable Diffusion and GLIDE when trained on data generated by the same method. Models trained on Stable Diffusion-generated images show some ability to detect GLIDE-generated images, but not vice versa. Images depicting inanimate objects are more challenging to classify correctly, suggesting that generating believable animate objects is more difficult for current text-to-image models. The generalization ability of classifiers across different text-to-image generation methods is limited. Further research is needed to explore the impact of more sophisticated language models in a multimodal detection setup. image generation, diffusion models, synthetic image detection, stable diffusion, glide
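The simplest detector variant described above is easy to reproduce in outline: train a small MLP on precomputed CLIP image features to separate real from generated images. The sketch below assumes features are already extracted; layer sizes and hyperparameters are my choices, not the paper's.

```python
# Hedged sketch: MLP on CLIP features for real-vs-generated classification.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

def train_detector(feats_real: np.ndarray, feats_fake: np.ndarray) -> MLPClassifier:
    X = np.concatenate([feats_real, feats_fake], axis=0)
    y = np.concatenate([np.zeros(len(feats_real)), np.ones(len(feats_fake))])
    return MLPClassifier(hidden_layer_sizes=(256,), max_iter=300).fit(X, y)

def evaluate(clf: MLPClassifier, feats: np.ndarray, labels: np.ndarray) -> float:
    return accuracy_score(labels, clf.predict(feats))
```

Cross-generator evaluation (train on Stable Diffusion outputs, test on GLIDE outputs, and vice versa) is what exposes the asymmetric generalization reported above.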
2303.05266 Report From Visual Prompt Learning to Zero-Shot Transfer: Mapping Is All You Need Ziqing Yang, Zeyang Sha, Michael Backes, Yang Zhang Visual prompt learning, as a newly emerged technique, leverages the knowledge learned by a large-scale pre-trained model and adapts it to downstream tasks through the usage of prompts. While previous research has focused on designing effective prompts, in this work, we argue that compared to prompt design, a good mapping strategy matters more. In this sense, we propose SeMap, a more effective mapping using the semantic alignment between the pre-trained model's knowledge and the downstream task. Our experimental results show that SeMap can largely boost the performance of visual prompt learning. Moreover, our experiments show that SeMap is capable of achieving competitive zero-shot transfer, indicating that it can perform the downstream task without any fine-tuning on the corresponding dataset. This demonstrates the potential of our proposed method to be used in a broader range of applications where the zero-shot transfer is desired. Results suggest that our proposed SeMap could lead to significant advancements in both visual prompt learning and zero-shot transfer. We hope with SeMap, we can help the community move forward to more efficient and lightweight utilization of large vision models. This paper proposes SeMap, a semantics-based mapping strategy for visual prompt learning that leverages semantic alignment between pre-trained and downstream tasks. Existing visual prompt learning methods primarily focus on prompt design, neglecting the importance of effective mapping strategies for performance improvement. The paper introduces SeMap-1 (1-on-1 mapping based on highest semantic similarity) and SeMap-a (adaptive k-on-1 mapping based on semantic similarity clustering) to map pre-trained model outputs to downstream task labels using CLIP's text encoder for semantic similarity measurement. SeMap consistently outperforms existing visual prompt learning methods (RM-VP, FM-VP) by a large margin across various datasets. SeMap achieves competitive zero-shot transfer performance without prompt optimization, even surpassing some visual prompt learning methods. Mapping strategy shows a greater impact on performance compared to prompt design, highlighting its importance in visual prompt learning. The performance of zero-shot transfer using SeMap heavily relies on the similarity between downstream and pre-trained datasets. Future work includes exploring the effectiveness of SeMap on more challenging downstream tasks and investigating its generalization capabilities to other pre-trained models. visual prompt learning, zero-shot transfer, mapping strategy, semantic alignment, pre-trained models
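A rough sketch of a semantics-based mapping is given below: each pretrained-model class is assigned to the downstream label whose text embedding it is most similar to, and downstream logits are aggregated over the mapped classes. How the text embeddings are produced (e.g., with CLIP's text encoder) and the max-aggregation are assumptions of this sketch, not necessarily SeMap's exact procedure.

```python
# Hedged sketch of semantic mapping from pretrained classes to downstream labels.
import torch
import torch.nn.functional as F

def build_mapping(pretrained_txt: torch.Tensor, downstream_txt: torch.Tensor) -> torch.Tensor:
    """pretrained_txt: (P, D), downstream_txt: (K, D) text embeddings; returns (P,) label ids."""
    sim = F.normalize(pretrained_txt, dim=-1) @ F.normalize(downstream_txt, dim=-1).T
    return sim.argmax(dim=-1)

def map_logits(pretrained_logits: torch.Tensor, mapping: torch.Tensor, num_labels: int):
    """Aggregate (B, P) pretrained-class logits into (B, K) downstream logits."""
    out = pretrained_logits.new_full((pretrained_logits.shape[0], num_labels), float("-inf"))
    for k in range(num_labels):
        cols = (mapping == k).nonzero(as_tuple=True)[0]
        if len(cols) > 0:
            out[:, k] = pretrained_logits[:, cols].max(dim=-1).values   # max over mapped classes
    return out
```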
2303.05251 Report Masked Image Modeling with Local Multi-Scale Reconstruction Haoqing Wang, Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhi-Hong Deng, Kai Han Masked Image Modeling (MIM) achieves outstanding success in self-supervised representation learning. Unfortunately, MIM models typically have huge computational burden and slow learning process, which is an inevitable obstacle for their industrial applications. Although the lower layers play the key role in MIM, existing MIM models conduct reconstruction task only at the top layer of encoder. The lower layers are not explicitly guided and the interaction among their patches is only used for calculating new activations. Considering the reconstruction task requires non-trivial inter-patch interactions to reason target signals, we apply it to multiple local layers including lower and upper layers. Further, since the multiple layers expect to learn the information of different scales, we design local multi-scale reconstruction, where the lower and upper layers reconstruct fine-scale and coarse-scale supervision signals respectively. This design not only accelerates the representation learning process by explicitly guiding multiple layers, but also facilitates multi-scale semantical understanding to the input. Extensive experiments show that with significantly less pre-training burden, our model achieves comparable or better performance on classification, detection and segmentation tasks than existing MIM models. This paper proposes LocalMIM, a new Masked Image Modeling (MIM) technique that uses local multi-scale reconstruction to learn visual representations. It introduces reconstruction tasks at multiple layers of the encoder, with each layer focusing on different scales of the input image, rather than solely at the top layer like traditional MIM methods. Existing MIM models suffer from high computational cost and slow learning, hindering their practical use. The authors argue that lower encoder layers are crucial for learning but are not explicitly guided in existing MIM models. LocalMIM aims to address this by explicitly guiding the learning of both lower and upper layers through multi-scale reconstruction, leading to faster and more efficient representation learning. LocalMIM divides the input image into regions and extracts supervision signals (e.g., HOG features, normalized pixels) at multiple scales. It applies local reconstruction losses at specific layers of the encoder, with lower layers reconstructing fine-scale signals and upper layers reconstructing coarse-scale signals. The model uses an asymmetric encoder-decoder structure where the decoder is lightweight, minimizing computational overhead. LocalMIM significantly outperforms existing MIM models in terms of pre-training efficiency, achieving comparable or better results on ImageNet-1K classification with considerably less training time. The learned representations generalize well to downstream tasks, demonstrating superior performance on ADE20K semantic segmentation and COCO object detection/segmentation compared to previous MIM methods. Ablation studies validate the importance of local reconstructions, multi-scale supervisions, and the choice of reconstruction targets and decoder design. The selection of optimal locations for local reconstruction in the encoder is currently based on empirical observations and may require further investigation. While the paper primarily focuses on image-level representation learning, exploring the application of LocalMIM to other vision tasks like video understanding could be a promising future direction. self-supervised learning, masked image modeling, representation learning, vision transformers, multi-scale learning
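The local multi-scale objective can be sketched compactly: several tapped encoder layers each reconstruct a supervision signal at a matching scale through a lightweight decoder, with the loss applied only on masked regions. The sketch below uses unfolded pixel regions as targets (the paper also uses HOG) and abstracts away how tokens are aligned to regions, so treat the shapes and decoders as assumptions.

```python
# Hedged sketch of local multi-scale reconstruction losses for masked image modeling.
import torch
import torch.nn.functional as F

def multiscale_pixel_targets(images: torch.Tensor, scales=(8, 16, 32)):
    """images: (B, 3, H, W) -> list of (B, N_s, 3*s*s) region targets, fine to coarse."""
    targets = []
    for s in scales:
        regions = F.unfold(images, kernel_size=s, stride=s)      # (B, 3*s*s, N_s)
        targets.append(regions.transpose(1, 2))
    return targets

def local_mim_loss(layer_tokens, decoders, targets, masks):
    """layer_tokens[i]: (B, N_i, D) tokens aligned to the N_i regions of scale i;
    decoders[i]: e.g. torch.nn.Linear(D, 3*s_i*s_i); masks[i]: (B, N_i) bool, True = masked."""
    total = 0.0
    for tokens, dec, tgt, mask in zip(layer_tokens, decoders, targets, masks):
        pred = dec(tokens)                                        # (B, N_i, 3*s_i*s_i)
        total = total + F.mse_loss(pred[mask], tgt[mask])         # reconstruct masked regions only
    return total
```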
2303.05122 Report M-Tuning: Prompt Tuning with Mitigated Label Bias in Open-Set Scenarios Ning Liao, Xiaopeng Zhang, Min Cao, Junchi Yan, Qi Tian In realistic open-set scenarios where labels of a part of testing data are totally unknown, when vision-language (VL) prompt learning methods encounter inputs related to unknown classes (i.e., not seen during training), they always predict them as one of the training classes. The exhibited label bias causes difficulty in open set recognition (OSR), in which an image should be correctly predicted as one of the known classes or the unknown one. To achieve this goal, we propose a vision-language prompt tuning method with mitigated label bias (M-Tuning). It introduces open words from the WordNet to extend the range of words forming the prompt texts from only closed-set label words to more, and thus prompts are tuned in a simulated open-set scenario. Besides, inspired by the observation that classifying directly on large datasets causes a much higher false positive rate than on small datasets, we propose a Combinatorial Tuning and Testing (CTT) strategy for improving performance. CTT decomposes M-Tuning on large datasets as multiple independent group-wise tuning on fewer classes, then makes accurate and comprehensive predictions by selecting the optimal sub-prompt. Finally, given the lack of VL-based OSR baselines in the literature, especially for prompt methods, we contribute new baselines for fair comparisons. Our method achieves the best performance on datasets with various scales, and extensive ablation studies also validate its effectiveness. Proposes M-Tuning, a vision-language prompt tuning method that mitigates label bias for open-set recognition (OSR) by introducing open words from WordNet during training, simulating an open-set scenario and improving generalization to unknown classes. Existing VL prompt learning methods struggle with open-set scenarios where testing data includes unknown classes, leading to misclassification. This work addresses this limitation by enabling OSR within VL prompt learning. M-Tuning extends prompt vocabulary with open words unrelated to training/testing labels, simulating an open-set scenario. For large datasets, a Combinatorial Tuning and Testing (CTT) strategy divides the data into groups for independent tuning, improving accuracy. New baselines are constructed for fair comparison. M-Tuning significantly outperforms existing prompt learning and OSR methods in unknown detection tasks on various datasets. CTT strategy effectively improves performance on large-scale datasets by decomposing the tuning and inference processes. Analysis shows that open words less similar to closed-set classes improve performance, suggesting flexibility in open word selection. The performance of M-Tuning may vary with different choices of open words and grouping strategies. Further exploration is needed to optimize the selection and utilization of open words from potentially diverse sources beyond WordNet. vision-language, open set recognition, prompt tuning, label bias, combinatorial tuning and testing
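The "open words" construction can be sketched with WordNet via NLTK: sample noun lemmas that do not overlap the closed-set class names and append them as extra classes during prompt tuning. The filtering and sampling strategy below are my assumptions rather than M-Tuning's exact procedure.

```python
# Hedged sketch: sample open words from WordNet to extend the prompt classification space.
import random
from nltk.corpus import wordnet as wn  # requires: import nltk; nltk.download('wordnet')

def sample_open_words(closed_set, num_words=100, seed=0):
    closed = {c.lower() for c in closed_set}
    nouns = {lemma.name().replace("_", " ").lower()
             for syn in wn.all_synsets(pos="n") for lemma in syn.lemmas()}
    candidates = sorted(nouns - closed)
    return random.Random(seed).sample(candidates, num_words)

def prompt_classes(closed_set, open_words):
    # Prompts are tuned over closed-set labels plus open words, simulating an open-set scenario.
    return list(closed_set) + list(open_words)
```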
2303.05031 Report CoralStyleCLIP: Co-optimized Region and Layer Selection for Image Editing Ambareesh Revanur, Debraj Basu, Shradha Agrawal, Dhwanit Agarwal, Deepak Pai Edit fidelity is a significant issue in open-world controllable generative image editing. Recently, CLIP-based approaches have traded off simplicity to alleviate these problems by introducing spatial attention in a handpicked layer of a StyleGAN. In this paper, we propose CoralStyleCLIP, which incorporates a multi-layer attention-guided blending strategy in the feature space of StyleGAN2 for obtaining high-fidelity edits. We propose multiple forms of our co-optimized region and layer selection strategy to demonstrate the variation of time complexity with the quality of edits over different architectural intricacies while preserving simplicity. We conduct extensive experimental analysis and benchmark our method against state-of-the-art CLIP-based methods. Our findings suggest that CoralStyleCLIP results in high-quality edits while preserving the ease of use. CoralStyleCLIP, a novel method for text-driven image editing, co-optimizes region and layer selection in StyleGAN2 for high-fidelity edits with minimal manual intervention. Existing methods struggle to achieve both ease of use and edit fidelity, often requiring manual selection of layers and resulting in undesirable edits to unintended regions. CoralStyleCLIP introduces a multi-layer attention-guided blending strategy, learning both latent edit directions and spatial masks for each StyleGAN2 layer. Two variants are presented: one using pre-trained segment selection for faster training and another using a convolutional attention network for finer control. CoralStyleCLIP produces high-quality edits localized to relevant regions, outperforming baselines in accuracy and minimizing unwanted modifications. The method automatically learns appropriate layers for editing, selecting earlier layers for coarse edits (e.g., shape) and later layers for finer details (e.g., color). Segment selection offers significant speed advantages over the attention network, achieving comparable results for less complex edits. Segment selection can be limited by the pre-defined segments of the model, potentially leading to over- or under-selection. The attention network variant, while more accurate, incurs higher training costs. image editing, text-guided image manipulation, stylegan, clip, attention mechanisms
2303.04989 Report ARS-DETR: Aspect Ratio-Sensitive Detection Transformer for Aerial Oriented Object Detection Ying Zeng, Yushi Chen, Xue Yang, Qingyun Li, Junchi Yan Existing oriented object detection methods commonly use the metric AP50 to measure the performance of the model. We argue that AP50 is inherently unsuitable for oriented object detection due to its large tolerance in angle deviation. Therefore, we advocate using a high-precision metric, e.g. AP75, to measure the performance of models. In this paper, we propose an Aspect Ratio Sensitive Oriented Object Detector with Transformer, termed ARS-DETR, which exhibits a competitive performance in high-precision oriented object detection. Specifically, a new angle classification method, called Aspect Ratio aware Circle Smooth Label (AR-CSL), is proposed to smooth the angle label in a more reasonable way and discard the hyperparameter introduced by previous work (e.g. CSL). Then, a rotated deformable attention module is designed to rotate the sampling points with the corresponding angles and eliminate the misalignment between region features and sampling points. Moreover, a dynamic weight coefficient according to the aspect ratio is adopted to calculate the angle loss. Comprehensive experiments on several challenging datasets show that our method achieves competitive performance on the high-precision oriented object detection task. The paper proposes Aspect Ratio-Sensitive Detection Transformer (ARS-DETR) for high-precision oriented object detection in aerial images. Current oriented object detectors often neglect the sensitivity of objects with different aspect ratios to angle, and existing metrics like AP50 aren't sensitive enough to reflect angle prediction accuracy. This hinders high-precision oriented object detection crucial for tasks like fine-grained recognition. The authors propose Aspect Ratio Aware Circle Smooth Label (AR-CSL) to smooth angle labels dynamically based on object aspect ratio using SkewIoU. They also introduce a Rotated Deformable Attention module to align features with object orientation and use aspect ratio sensitive matching/loss during training. The effectiveness of these methods is demonstrated using Deformable DETR as the base architecture. AR-CSL outperforms CSL with various radii and angle discrete granularities, achieving better AP75 across different detectors. The Rotated Deformable Attention module aligns features effectively, leading to significant improvements in AP75. ARS-DETR achieves competitive performance on high-precision oriented object detection across DOTA-v1.0, DIOR-R, and OHD-SJTU datasets. The paper focuses on AP75 as the main metric for high-precision detection. While justified, exploring other metrics like AP90 could be interesting. The computational complexity of ARS-DETR compared to other oriented object detectors is not discussed. oriented object detection, high-precision detection, detection transformer, feature alignment, remote sensing
2303.04970 Report LMR: A Large-Scale Multi-Reference Dataset for Reference-based Super-Resolution Lin Zhang, Xin Li, Dongliang He, Errui Ding, Zhaoxiang Zhang It is widely agreed that reference-based super-resolution (RefSR) achieves superior results by referring to similar high quality images, compared to single image super-resolution (SISR). Intuitively, the more references, the better performance. However, previous RefSR methods have all focused on single-reference image training, while multiple reference images are often available in testing or practical applications. The root cause of such training-testing mismatch is the absence of publicly available multi-reference SR training datasets, which greatly hinders research efforts on multi-reference super-resolution. To this end, we construct a large-scale, multi-reference super-resolution dataset, named LMR. It contains 112,142 groups of 300x300 training images, which is 10x of the existing largest RefSR dataset. The image size is also much larger. More importantly, each group is equipped with 5 reference images with different similarity levels. Furthermore, we propose a new baseline method for multi-reference super-resolution: MRefSR, including a Multi-Reference Attention Module (MAM) for feature fusion of an arbitrary number of reference images, and a Spatial Aware Filtering Module (SAFM) for the fused feature selection. The proposed MRefSR achieves significant improvements over state-of-the-art approaches on both quantitative and qualitative evaluations. Our code and data would be made available soon. This paper introduces LMR, the first large-scale multi-reference dataset for reference-based super-resolution (RefSR), and proposes MRefSR, a novel multi-reference RefSR baseline method. Existing RefSR methods rely on single-reference training datasets, limiting their effectiveness in real-world applications where multiple reference images are often available. LMR is constructed from the MegaDepth dataset by selecting and cropping image patches with varying similarity levels. MRefSR leverages a Multi-Reference Attention Module (MAM) for feature fusion and a Spatial Aware Filtering Module (SAFM) for feature selection. MRefSR significantly outperforms state-of-the-art methods on both CUFED5 and LMR datasets. Models trained on LMR exhibit strong generalization ability on other RefSR datasets. MRefSR effectively utilizes multiple reference images, leading to superior visual quality compared to single-reference methods. The impact of different similarity levels of reference images requires further exploration. Investigating the effectiveness of MRefSR on other computer vision tasks is a promising future direction. reference-based super-resolution, multi-reference super-resolution, dataset, deep learning, computer vision
2303.04838 Report The Casual Conversations v2 Dataset Bilal Porgali, Vítor Albiero, Jordan Ryda, Cristian Canton Ferrer, Caner Hazirbas This paper introduces a new large consent-driven dataset aimed at assisting in the evaluation of algorithmic bias and robustness of computer vision and audio speech models in regards to 11 attributes that are self-provided or labeled by trained annotators. The dataset includes 26,467 videos of 5,567 unique paid participants, with an average of almost 5 videos per person, recorded in Brazil, India, Indonesia, Mexico, Vietnam, Philippines, and the USA, representing diverse demographic characteristics. The participants agreed for their data to be used in assessing fairness of AI models and provided self-reported age, gender, language/dialect, disability status, physical adornments, physical attributes and geo-location information, while trained annotators labeled apparent skin tone using the Fitzpatrick Skin Type and Monk Skin Tone scales, and voice timbre. Annotators also labeled for different recording setups and per-second activity annotations. Introduces Casual Conversations v2, a large and diverse dataset designed for evaluating fairness and robustness in audio, vision, and speech AI models. Addresses the lack of ethically constructed benchmarks for identifying fairness issues in AI models, particularly concerning demographic attributes. Collected 26,467 videos from 5,567 participants across 7 countries, encompassing self-reported demographics, annotated physical attributes, and diverse recording setups. Models trained on FairFace exhibit better accuracy across datasets and demographic groups. Significant performance bias exists in vision models towards household items from higher-income backgrounds. Strong correlation observed between native language and spoken language among participants. Limited number of race categories in some datasets like UTKFace and RFW. Reliance on binary gender categories in several datasets can be limiting and potentially discriminatory. fairness, robustness, dataset, audio-visual, speech
2303.04803 Report Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, Shalini De Mello We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation, which unifies pre-trained text-image diffusion and discriminative models to perform open-vocabulary panoptic segmentation. Text-to-image diffusion models have the remarkable ability to generate high-quality images with diverse open-vocabulary language descriptions. This demonstrates that their internal representation space is highly correlated with open concepts in the real world. Text-image discriminative models like CLIP, on the other hand, are good at classifying images into open-vocabulary labels. We leverage the frozen internal representations of both these models to perform panoptic segmentation of any category in the wild. Our approach outperforms the previous state of the art by significant margins on both open-vocabulary panoptic and semantic segmentation tasks. In particular, with COCO training only, our method achieves 23.4 PQ and 30.0 mIoU on the ADE20K dataset, with 8.3 PQ and 7.9 mIoU absolute improvement over the previous state of the art. We open-source our code and models at https://github.com/NVlabs/ODISE . This paper proposes ODISE, an open-vocabulary diffusion-based panoptic segmentation model that leverages the internal representations of pre-trained text-to-image diffusion models for open-vocabulary segmentation. Open-vocabulary recognition is crucial for real-world applications, as it allows models to recognize limitless categories, unlike closed-vocabulary approaches that are limited by their training data. ODISE uses a frozen text-to-image diffusion model to extract visual features from an image and its caption. It trains a mask generator on these features to predict panoptic masks and utilizes a mask classification module to categorize masks into open-vocabulary categories. ODISE achieves state-of-the-art performance on open-vocabulary panoptic segmentation, outperforming previous methods by a significant margin. It also excels in open-vocabulary semantic segmentation, object detection, and open-world instance segmentation tasks. The study finds that the internal representations of text-to-image diffusion models are better suited for open-vocabulary segmentation compared to traditional discriminative models. The category definitions in existing datasets can be ambiguous, affecting evaluation accuracy. Potential bias in the pre-trained diffusion model's internal representation due to web-crawled data. open-vocabulary, panoptic segmentation, diffusion models, text-to-image generation, computer vision
2303.04761 Report Video-P2P: Video Editing with Cross-attention Control Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, Jiaya Jia This paper presents Video-P2P, a novel framework for real-world video editing with cross-attention control. While attention control has proven effective for image editing with pre-trained image generation models, there are currently no large-scale video generation models publicly available. Video-P2P addresses this limitation by adapting an image generation diffusion model to complete various video editing tasks. Specifically, we propose to first tune a Text-to-Set (T2S) model to complete an approximate inversion and then optimize a shared unconditional embedding to achieve accurate video inversion with a small memory cost. For attention control, we introduce a novel decoupled-guidance strategy, which uses different guidance strategies for the source and target prompts. The optimized unconditional embedding for the source prompt improves reconstruction ability, while an initialized unconditional embedding for the target prompt enhances editability. Incorporating the attention maps of these two branches enables detailed editing. These technical designs enable various text-driven editing applications, including word swap, prompt refinement, and attention re-weighting. Video-P2P works well on real-world videos for generating new characters while optimally preserving their original poses and scenes. It significantly outperforms previous approaches. Video-P2P, a novel framework for realistic video editing using cross-attention control with pre-trained image generation diffusion models. Addresses the lack of publicly available large-scale video generation models for video editing tasks, enabling detailed control over object properties and actions within real-world videos. Adapts a text-to-image diffusion model into a text-to-set model for video processing. Optimizes a shared unconditional embedding for accurate video inversion. Introduces a decoupled-guidance strategy for attention control, utilizing separate guidance for source and target prompts. Successfully performs local and global video editing tasks like word swapping, prompt refinement, and attention re-weighting. Demonstrates superior performance in preserving temporal coherence and background details compared to existing methods like Tune-A-Video and Dreamix. Quantitative analysis using metrics like CLIP Score, Masked PSNR, LPIPS, and Object Semantic Variance (OSV) confirms the effectiveness of Video-P2P in achieving semantic consistency and high editing quality. Limited ability to edit video motion due to the use of an image-based diffusion model. Future work will focus on enhancing Video-P2P to handle more complex editing scenarios, such as adding new objects into the video. video editing, diffusion models, cross-attention control, text-to-video generation, unconditional embedding
2303.04748 Report CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP Junbo Zhang, Runpei Dong, Kaisheng Ma Training a 3D scene understanding model requires complicated human annotations, which are laborious to collect and result in a model only encoding closed-set object semantics. In contrast, vision-language pre-training models (e.g., CLIP) have shown remarkable open-world reasoning properties. To this end, we propose directly transferring CLIP's feature space to a 3D scene understanding model without any form of supervision. We first modify CLIP's input and forwarding process so that it can be adapted to extract dense pixel features for 3D scene contents. We then project multi-view image features to the point cloud and train a 3D scene understanding model with feature distillation. Without any annotations or additional training, our model achieves promising annotation-free semantic segmentation results on open-vocabulary semantics and long-tailed concepts. Besides, serving as a cross-modal pre-training framework, our method can be used to improve data efficiency during fine-tuning. Our model outperforms previous SOTA methods in various zero-shot and data-efficient learning benchmarks. Most importantly, our model successfully inherits CLIP's rich-structured knowledge, allowing 3D scene understanding models to recognize not only object concepts but also open-world semantics. This paper proposes CLIP-FO3D, a novel method for transferring CLIP's feature space to 3D scene understanding models without any human supervision, enabling open-world 3D scene understanding. Training 3D scene understanding models typically requires extensive human annotations, limiting them to recognizing only a fixed set of object semantics. CLIP-FO3D addresses this limitation by leveraging the open-world reasoning capabilities of CLIP. CLIP-FO3D extracts dense pixel features from 3D scene RGB views by modifying CLIP's input and forwarding process. It then projects these features to the point cloud and trains a 3D scene understanding model using feature distillation. CLIP-FO3D achieves impressive annotation-free semantic segmentation results on standard benchmarks (ScanNet, S3DIS) and open-vocabulary concepts. The model exhibits remarkable open-world properties, recognizing long-tailed categories and successfully identifying regions relevant to open-world text queries (e.g., color, affordance). CLIP-FO3D outperforms existing methods in zero-shot and data-efficient learning tasks, demonstrating its effectiveness in scenarios with limited or no annotations. CLIP-FO3D's performance on recognizing smaller objects and fine-grained details could be further improved. The computational cost of extracting dense pixel features from CLIP can be high, posing challenges for real-time applications. 3d scene understanding, open-world learning, zero-shot learning, data-efficient learning, clip
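A minimal sketch of the projection-and-distillation idea described in the CLIP-FO3D entry above, under simplifying assumptions: points are already expressed in the camera frame, a pinhole intrinsics matrix K is given, and nearest-pixel lookup replaces any interpolation. The function names and the cosine distillation loss are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def project_features_to_points(points_cam, pixel_feats, K):
    """points_cam: (N, 3) points already in the camera frame; pixel_feats: (C, H, W)
    dense 2D features (e.g., CLIP pixel features); K: (3, 3) pinhole intrinsics.
    Returns (N, C) per-point features and an (N,) validity mask."""
    C, H, W = pixel_feats.shape
    z = points_cam[:, 2].clamp(min=1e-6)
    u = (K[0, 0] * points_cam[:, 0] / z + K[0, 2]).round().long()  # nearest-pixel lookup
    v = (K[1, 1] * points_cam[:, 1] / z + K[1, 2]).round().long()
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (points_cam[:, 2] > 0)
    feats = torch.zeros(points_cam.shape[0], C)
    feats[valid] = pixel_feats[:, v[valid], u[valid]].T
    return feats, valid

def distill_loss(point_feats_3d, projected_2d_feats, valid):
    """Cosine feature-distillation loss between the 3D network's per-point features
    and the projected 2D features, computed only on points that hit the image."""
    s = F.normalize(point_feats_3d[valid], dim=-1)
    t = F.normalize(projected_2d_feats[valid], dim=-1)
    return (1 - (s * t).sum(dim=-1)).mean()

pts = torch.rand(2048, 3) * torch.tensor([4.0, 4.0, 6.0]) - torch.tensor([2.0, 2.0, 0.0])
K = torch.tensor([[50.0, 0.0, 40.0], [0.0, 50.0, 30.0], [0.0, 0.0, 1.0]])
pixel_feats = torch.randn(64, 60, 80)                   # stand-in for dense CLIP features
target, valid = project_features_to_points(pts, pixel_feats, K)
student = torch.randn(2048, 64, requires_grad=True)     # stand-in for the 3D model's output
print(distill_loss(student, target, valid))
```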
2303.04707 Report DiM: Distilling Dataset into Generative Model Kai Wang, Jianyang Gu, Daquan Zhou, Zheng Zhu, Wei Jiang, Yang You Dataset distillation reduces the network training cost by synthesizing small and informative datasets from large-scale ones. Despite the success of the recent dataset distillation algorithms, three drawbacks still limit their wider application: i) the synthetic images perform poorly on large architectures; ii) they need to be re-optimized when the distillation ratio changes; iii) the limited diversity restricts the performance when the distillation ratio is large. In this paper, we propose a novel distillation scheme to Distill information of large train sets into generative Models, named DiM. Specifically, DiM learns to use a generative model to store the information of the target dataset. During the distillation phase, we minimize the differences in logits predicted by a pool of models between real and generated images. At the deployment stage, the generative model synthesizes various training samples from random noise on the fly. Due to the simple yet effective designs, the trained DiM can be directly applied to different distillation ratios and large architectures without extra cost. We validate the proposed DiM across 4 datasets and achieve state-of-the-art results on all of them. To the best of our knowledge, we are the first to achieve higher accuracy on complex architectures than simple ones, such as 75.1% with ResNet-18 and 72.6% with ConvNet-3 on ten images per class of CIFAR-10. Besides, DiM outperforms previous methods by 10%~22% when images per class are 1 and 10 on the SVHN dataset. This paper proposes DiM, a novel dataset distillation method that distills information from a large training dataset into a generative model instead of synthetic images. Existing dataset distillation methods have limitations in cross-architecture generalization, redeployment efficiency (require redistillation when target data size changes), and often perform poorly on larger architectures. DiM employs a conditional GAN trained with a logits matching strategy using a pool of diverse models. This allows the generator to learn to synthesize discriminative images helpful for downstream tasks across different architectures. DiM significantly outperforms state-of-the-art methods on various benchmarks, especially for large architectures and low image-per-class settings. It achieves superior cross-architecture generalization, effectively distilling knowledge from simple architectures to larger ones. DiM demonstrates high redeployment efficiency, needing only a single training for various target data sizes, unlike previous methods. Generating training samples during deployment introduces extra computational effort compared to using static synthetic images. Future work includes exploring lighter generative models and applying DiM to large-scale datasets and tasks like object detection and semantic segmentation. dataset distillation, generative adversarial networks, cross-architecture generalization, logits matching, model pooling
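A toy sketch of the logits-matching idea in the DiM entry above: a randomly sampled network from a model pool scores real and generated images, and the conditional generator is updated so the two sets of logits agree. The generator, the pool of linear classifiers, the L1 objective, and all sizes here are stand-ins chosen for brevity, not the paper's architecture.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyCondGenerator(nn.Module):
    """Stand-in conditional generator: noise + class embedding -> 32x32 image."""
    def __init__(self, z_dim=64, n_classes=10):
        super().__init__()
        self.z_dim = z_dim
        self.embed = nn.Embedding(n_classes, z_dim)
        self.net = nn.Linear(z_dim, 3 * 32 * 32)

    def forward(self, z, y):
        return self.net(z + self.embed(y)).view(-1, 3, 32, 32)

def logits_matching_step(generator, model_pool, real_images, labels, opt_g):
    """One simplified DiM-style update: a randomly drawn judge network scores real
    and generated images of the same classes, and the generator is pushed to
    match the two sets of logits."""
    net = random.choice(model_pool)
    z = torch.randn(real_images.size(0), generator.z_dim)
    fake = generator(z, labels)
    with torch.no_grad():
        real_logits = net(real_images)
    loss = F.l1_loss(net(fake), real_logits)   # logits-matching objective
    opt_g.zero_grad()
    loss.backward()
    opt_g.step()
    return loss.item()

gen = ToyCondGenerator()
pool = [nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)) for _ in range(3)]
opt = torch.optim.Adam(gen.parameters(), lr=1e-4)
print(logits_matching_step(gen, pool, torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,)), opt))
```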
2303.04664 Report Centroid-centered Modeling for Efficient Vision Transformer Pre-training Xin Yan, Zuchao Li, Lefei Zhang, Bo Du, Dacheng Tao Masked Image Modeling (MIM) is a new self-supervised vision pre-training paradigm using Vision Transformer (ViT). Previous works can be pixel-based or token-based, using original pixels or discrete visual tokens from parametric tokenizer models, respectively. Our proposed approach, CCViT, leverages k-means clustering to obtain centroids for image modeling without supervised training of a tokenizer model. The centroids represent patch pixels and index tokens and have the property of local invariance. The non-parametric centroid tokenizer takes only seconds to create and is faster for token inference. Specifically, we adopt patch masking and centroid replacement strategies to construct corrupted inputs, and two stacked encoder blocks to predict corrupted patch tokens and reconstruct original patch pixels. Experiments show that the ViT-B model with only 300 epochs achieves 84.3% top-1 accuracy on ImageNet-1K classification and 51.6% on ADE20K semantic segmentation. Our approach achieves competitive results with BEiTv2 without distillation training from other models and outperforms other methods such as MAE. This paper introduces CCViT, a novel Vision Transformer pre-training framework called centroid-centered MIM using k-means clustering for efficient image modeling without training a separate tokenizer. Existing token-based MIM methods are computationally expensive due to the need for training a separate tokenizer, while pixel-based methods require a redundant decoder. CCViT addresses these limitations by leveraging centroids for both token and pixel learning. CCViT uses k-means clustering on a small subset of the pre-training data to obtain centroids, which act as both token indices and representative patch pixels. The model is pre-trained using blockwise masking and centroid replacement strategies, with a dual objective of predicting centroid tokens and reconstructing original patch pixels. CCViT achieves 84.3% top-1 accuracy on ImageNet-1K classification and 51.6% mIoU on ADE20K segmentation with only 300 epochs. The centroid-based tokenizer is significantly faster to train and infer compared to parametric tokenizers used in previous works. CCViT demonstrates better noise resistance compared to BEiT and BEiTv2, suggesting its ability to learn more robust and locally invariant representations. The study is limited to a base-size ViT model and 300 pre-training epochs due to resource constraints. Future work will investigate scaling up the model and data size, and exploring knowledge distillation for potential improvement. vision transformer, self-supervised learning, masked image modeling, k-means clustering, centroid-based representation
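The non-parametric centroid tokenizer described in the CCViT entry can be sketched directly with k-means over raw patches; the patch size, centroid count, and the use of scikit-learn's MiniBatchKMeans are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def fit_centroid_tokenizer(images, patch=8, n_centroids=8192, seed=0):
    """Fit a non-parametric tokenizer: k-means centroids over raw image patches.
    images: (N, H, W, 3) array with H, W divisible by `patch`. Returns the fitted model."""
    N, H, W, C = images.shape
    patches = images.reshape(N, H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 1, 3, 2, 4, 5).reshape(-1, patch * patch * C)
    km = MiniBatchKMeans(n_clusters=n_centroids, random_state=seed)
    km.fit(patches.astype(np.float32))
    return km

def tokenize(images, km, patch=8):
    """Map each patch to the index of its nearest centroid (its 'visual token')."""
    N, H, W, C = images.shape
    patches = images.reshape(N, H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 1, 3, 2, 4, 5).reshape(-1, patch * patch * C)
    tokens = km.predict(patches.astype(np.float32))
    return tokens.reshape(N, (H // patch) * (W // patch))

imgs = np.random.rand(4, 32, 32, 3)                      # toy images
km = fit_centroid_tokenizer(imgs, patch=8, n_centroids=16)
print(tokenize(imgs, km, patch=8).shape)                 # (4, 16)
```

Because the centroids are also representative patch pixels (the cluster means), the same object can serve for the pixel-level reconstruction target, which is the dual role the entry above describes.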
2303.04587 Report A Prompt Log Analysis of Text-to-Image Generation Systems Yutong Xie, Zhaoying Pan, Jinge Ma, Luo Jie, Qiaozhu Mei Recent developments in large language models (LLM) and generative AI have unleashed the astonishing capabilities of text-to-image generation systems to synthesize high-quality images that are faithful to a given reference text, known as a "prompt". These systems have immediately received lots of attention from researchers, creators, and common users. Despite the plenty of efforts to improve the generative models, there is limited work on understanding the information needs of the users of these systems at scale. We conduct the first comprehensive analysis of large-scale prompt logs collected from multiple text-to-image generation systems. Our work is analogous to analyzing the query logs of Web search engines, a line of work that has made critical contributions to the glory of the Web search industry and research. Compared with Web search queries, text-to-image prompts are significantly longer, often organized into special structures that consist of the subject, form, and intent of the generation tasks and present unique categories of information needs. Users make more edits within creation sessions, which present remarkable exploratory patterns. There is also a considerable gap between the user-input prompts and the captions of the images included in the open training data of the generative models. Our findings provide concrete implications on how to improve text-to-image generation systems for creation purposes. This paper presents the first comprehensive analysis of large-scale prompt logs from text-to-image generation systems (Midjourney, Stable Diffusion, LDMs), revealing user information needs and workflows. Understanding user information needs is crucial for improving text-to-image generation systems and facilitating AI-powered creativity. The authors analyze millions of user prompts, comparing them to Web search queries and image captions in training datasets, examining term frequencies, prompt structures, session patterns, and correlations with user ratings. Prompts typically describe the subject, form, and intent of the desired image. Text-to-image prompts differ significantly from Web search queries, exhibiting greater length, exploratory patterns, and a new category of “exploratory prompts”. Longer prompts and specific terms correlate with higher-rated generated images. The analysis relies on open datasets, potentially excluding private training data. Further research is needed to develop tools and glossaries for extracting subject, form, and intent from prompts. text-to-image generation, ai-generated content (aigc), ai for creativity, prompt analysis, query log analysis
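For flavor, a tiny sketch of the kind of prompt-log statistics the paper reports (prompt length, term frequency, and comma-separated subject/form/intent structure); the toy prompts below are invented, whereas the actual analysis runs over millions of logged prompts.

```python
from collections import Counter

# Toy prompt log; real logs in the paper come from systems such as Midjourney and Stable Diffusion.
prompts = [
    "a portrait of an astronaut, oil painting, highly detailed, trending on artstation",
    "a portrait of a cat wearing a crown, watercolor, soft lighting",
    "cyberpunk city at night, neon lights, 4k, cinematic",
]

lengths = [len(p.split()) for p in prompts]
print("mean prompt length (words):", sum(lengths) / len(lengths))

term_freq = Counter(t.strip(",").lower() for p in prompts for t in p.split())
print("top terms:", term_freq.most_common(5))

# Prompts are often comma-separated phrases (subject, then form/style and intent modifiers);
# splitting on commas gives a rough view of that structure.
print([[s.strip() for s in p.split(",")] for p in prompts])
```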
2303.04248 Report TRACT: Denoising Diffusion Models with Transitive Closure Time-Distillation David Berthelot, Arnaud Autef, Jierui Lin, Dian Ang Yap, Shuangfei Zhai, Siyuan Hu, Daniel Zheng, Walter Talbott, Eric Gu Denoising Diffusion models have demonstrated their proficiency for generative sampling. However, generating good samples often requires many iterations. Consequently, techniques such as binary time-distillation (BTD) have been proposed to reduce the number of network calls for a fixed architecture. In this paper, we introduce TRAnsitive Closure Time-distillation (TRACT), a new method that extends BTD. For single-step diffusion, TRACT improves FID by up to 2.4x on the same architecture, and achieves new single-step Denoising Diffusion Implicit Models (DDIM) state-of-the-art FID (7.4 for ImageNet64, 3.8 for CIFAR10). Finally, we tease apart the method through extended ablations. The PyTorch implementation will be released soon. This paper proposes TRAnsitive Closure Time-distillation (TRACT), a novel method for distilling diffusion models to significantly improve the quality of generated samples within a few steps, ideally one. Generating high-quality samples from diffusion models typically demands a large number of inference steps, hindering their efficiency. TRACT addresses this limitation by enabling the generation of high-quality samples in just one or two steps. TRACT extends Binary Time-Distillation (BTD) by using a self-teaching approach with exponential moving average (EMA) to distill the output of a teacher model's inference from step t to t' with t' < t. The method is independent of specific noise schedules or samplers, demonstrating effectiveness with both variance preserving and variance exploding schedules, and with both DDIM and Runge-Kutta samplers. TRACT achieves state-of-the-art FID scores for single-step diffusion models, notably 7.4 for ImageNet64 and 3.8 for CIFAR10. Ablations confirm the importance of the self-teaching EMA momentum and demonstrate that 2-phase distillation schedules generally outperform schedules with more phases. Beyond time distillation, TRACT can also distill knowledge to other, potentially smaller, architectures with minimal performance loss. Self-teaching in TRACT might lead to less efficient objectives compared to supervised training in BTD. Future work could explore distilling from even higher step count teachers, enabled by TRACT's flexible reduction in steps between training phases, potentially unlocking new applications for diffusion models. diffusion models, generative sampling, knowledge distillation, time distillation, self-teaching
2303.04244 Report A Light-Weight Contrastive Approach for Aligning Human Pose Sequences Robert T. Collins We present a simple unsupervised method for learning an encoder mapping short 3D pose sequences into embedding vectors suitable for sequence-to-sequence alignment by dynamic time warping. Training samples consist of temporal windows of frames containing 3D body points such as mocap markers or skeleton joints. A light-weight, 3-layer encoder is trained using a contrastive loss function that encourages embedding vectors of augmented sample pairs to have cosine similarity 1, and similarity 0 with all other samples in a minibatch. When multiple scripted training sequences are available, temporal alignments inferred from an initial round of training are harvested to extract additional, cross-performance match pairs for a second phase of training to refine the encoder. In addition to being simple, the proposed method is fast to train, making it easy to adapt to new data using different marker sets or skeletal joint layouts. Experimental results illustrate ease of use, transferability, and utility of the learned embeddings for comparing and analyzing human behavior sequences. This paper presents a simple unsupervised contrastive learning approach for aligning 3D human pose sequences. Temporal alignment of pose sequences is crucial for various tasks in human motion understanding, including studying variability, identifying repeated actions, transferring labels, and detecting anomalies. A lightweight 3-layer encoder is trained using a contrastive loss function based on cosine similarity. The method utilizes data augmentation and a two-phase training approach, refining the encoder with cross-performance matching pairs obtained through dynamic time warping (DTW). The method produces discriminative pose sequence representations, effectively aligning complex, multi-action sequences. Learned representations demonstrate transferability across datasets with different sensors, point sets, and even movement styles (Taiji to Karate). Experiments on the Penn Action dataset show state-of-the-art alignment performance, outperforming several previous methods. The current method relies on DTW, limiting its ability to handle out-of-order action sequences. Manual correspondence mapping between different point sets is required, presenting a challenge for automation. pose sequence alignment, contrastive learning, dynamic time warping, unsupervised learning, human motion analysis
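The contrastive objective described in the entry above (augmented pairs pushed to cosine similarity 1, all other pairs in the minibatch to 0) admits a very literal sketch; the 3-layer encoder widths, window size, joint count, and jitter augmentation below are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowEncoder(nn.Module):
    """Light-weight 3-layer encoder mapping a window of 3D joints to an embedding.
    Layer widths are illustrative, not the paper's exact sizes."""
    def __init__(self, window=16, n_joints=25, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(window * n_joints * 3, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, dim),
        )

    def forward(self, x):                        # x: (B, window, n_joints, 3)
        return F.normalize(self.net(x.flatten(1)), dim=-1)

def contrastive_loss(z_a, z_b):
    """Cosine similarity of matched (augmented) pairs is pushed to 1,
    similarity with every other sample in the minibatch to 0."""
    sim = z_a @ z_b.T                            # (B, B) cosine similarities
    target = torch.eye(sim.size(0))
    return F.mse_loss(sim, target)

enc = WindowEncoder()
x = torch.randn(32, 16, 25, 3)
x_aug = x + 0.01 * torch.randn_like(x)           # e.g., a jitter augmentation
print(contrastive_loss(enc(x), enc(x_aug)).item())
```

The resulting per-window embeddings are what dynamic time warping then aligns across performances in the paper's second training phase.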
2303.04186 Report End-to-end Face-swapping via Adaptive Latent Representation Learning Chenhao Lin, Pengbin Hu, Chao Shen, Qian Li Taking full advantage of the excellent performance of StyleGAN, style transfer-based face swapping methods have been extensively investigated recently. However, these studies require separate face segmentation and blending modules for successful face swapping, and the fixed selection of the manipulated latent code in these works is reckless, thus degrading face swapping quality, generalizability, and practicability. This paper proposes a novel, end-to-end integrated framework for high-resolution, attribute-preserving face swapping via Adaptive Latent Representation Learning. Specifically, we first design a multi-task dual-space face encoder by sharing the underlying feature extraction network to simultaneously complete facial region perception and face encoding. This encoder enables us to control the face pose and attributes individually, thus enhancing the face swapping quality. Next, we propose an adaptive latent code swapping module to adaptively learn the mapping between the facial attributes and the latent codes and select effective latent codes for improved retention of facial attributes. Finally, the initial face swapping image generated by StyleGAN2 is blended with the facial region mask generated by our encoder to address the background blur problem. Our framework, which integrates facial perception and blending into the end-to-end training and testing process, achieves highly realistic face swapping on wild faces without segmentation masks. Experimental results demonstrate the superior performance of our approach over state-of-the-art methods. This paper introduces FS-ALL, an end-to-end framework for high-resolution face swapping that uses adaptive latent representation learning to improve identity transfer and attribute preservation. Existing face swapping methods struggle to achieve high generalizability and realism simultaneously, often requiring separate segmentation and blending modules. Fixed latent code manipulation in these methods leads to low-quality swapping and poor attribute preservation. The framework uses a multi-task dual-space encoder (MDE) to perceive facial regions and map faces into separate pose and attribute latent spaces. An adaptive latent code swapping module (ALS) then selects and swaps effective latent codes based on a learnable network, enhancing attribute retention. Finally, StyleGAN2 generates the swapped face, refined by an internal blending module. FS-ALL demonstrates superior performance over state-of-the-art methods in both qualitative and quantitative evaluations. The adaptive latent code swapping module improves identity transfer and attribute preservation compared to fixed latent code manipulation. The multi-task dual-space encoder effectively maintains facial details and generates accurate masks for seamless blending. The decoupling of latent codes for attribute control on certain datasets requires further improvement. The method currently relies on a pre-trained StyleGAN2 model, limiting its flexibility in generating specific face styles. face swapping, deepfake, adaptive latent representation learning, generative adversarial networks (gans), attribute preservation
2303.04105 Report Your representations are in the network: composable and parallel adaptation for large scale models Yonatan Dukler, Alessandro Achille, Hao Yang, Varsha Vivek, Luca Zancato, Benjamin Bowman, Avinash Ravichandran, Charless Fowlkes, Ashwin Swaminathan, Stefano Soatto We propose InCA, a lightweight method for transfer learning that cross-attends to any activation layer of a pre-trained model. During training, InCA uses a single forward pass to extract multiple activations, which are passed to external cross-attention adapters, trained anew and combined or selected for downstream tasks. We show that, even when selecting a single top-scoring adapter, InCA achieves performance comparable to full fine-tuning, at a cost comparable to fine-tuning just the last layer. For example, with a cross-attention probe 1.3% the size of a pre-trained ViT-L/16 model, we achieve performance within 0.2% of the full fine-tuning paragon at a computational training cost of 51% of the baseline, on average across 11 downstream classification tasks. Unlike other forms of efficient adaptation, InCA does not require backpropagating through the pre-trained model, thus leaving its execution unaltered at both training and inference. The versatility of InCA is best illustrated in fine-grained tasks, which may require accessing information absent in the last layer but accessible in intermediate layer activations. Since the backbone is fixed, InCA allows parallel ensembling as well as parallel execution of multiple tasks. InCA achieves state-of-the-art performance in the ImageNet-to-Sketch multi-task benchmark. This paper introduces InCA (Introspective-Cross-Attention), a novel transfer learning framework that adapts large pre-trained models by attaching lightweight cross-attention modules to intermediate activations. Full fine-tuning of large-scale models is computationally expensive and impractical for many real-world applications. InCA provides an efficient and versatile alternative for adapting these models to downstream tasks. InCA works by attaching and training multiple isolated, lightweight cross-attention adapters in parallel to different activations of a frozen pre-trained model. These adapters learn to extract task-relevant information from the activations, enabling efficient adaptation without modifying the original model. A single InCA adapter, only 1.3% the size of the full model, achieves comparable accuracy to full fine-tuning on 11 diverse downstream classification tasks. InCA's isolated adaptation is highly computationally efficient, allowing adaptation of massive models like ViT-G/14 on a single GPU. The method enables flexible learning scenarios, including multi-task learning and class-incremental learning, by combining or incrementally modifying learned adapters. The paper primarily focuses on image classification tasks. Further investigation is needed to evaluate InCA's performance on other vision tasks like object detection or segmentation. While the paper explores ensembling adapters, more sophisticated ensembling techniques and their impact on robustness and out-of-distribution performance remain to be explored. transfer learning, parameter-efficient fine-tuning, cross-attention, intermediate representations, multi-task learning
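A rough sketch of an InCA-style cross-attention probe: learned queries attend to the token activations of one intermediate layer of a frozen backbone, and only the adapter and its linear head are trained. The dimensions, single-query choice, and use of torch's built-in multi-head attention are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    """Light-weight probe: learned queries cross-attend to the token activations of
    one intermediate layer of a frozen backbone, followed by a linear classifier.
    No gradients flow into the backbone; only this module is trained."""
    def __init__(self, feat_dim=1024, n_queries=1, n_heads=8, n_classes=100):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, n_queries, feat_dim) * 0.02)
        self.attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.head = nn.Linear(feat_dim, n_classes)

    def forward(self, acts):                     # acts: (B, N_tokens, feat_dim)
        q = self.queries.expand(acts.size(0), -1, -1)
        out, _ = self.attn(q, acts, acts)        # cross-attention over activations
        return self.head(out.mean(dim=1))        # (B, n_classes)

acts = torch.randn(4, 197, 1024).detach()        # stand-in for ViT-L/16 layer activations
adapter = CrossAttentionAdapter()
print(adapter(acts).shape)                       # torch.Size([4, 100])
```

Because the backbone is never updated, one forward pass can feed activations from several layers to many such adapters in parallel, which is what makes the ensembling and multi-task use described above cheap.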
2303.04001 Report ELODIN: Naming Concepts in Embedding Spaces Rodrigo Mello, Filipe Calegario, Geber Ramalho Despite recent advancements, the field of text-to-image synthesis still suffers from lack of fine-grained control. Using only text, it remains challenging to deal with issues such as concept coherence and concept contamination. We propose a method to enhance control by generating specific concepts that can be reused throughout multiple images, effectively expanding natural language with new words that can be combined much like a painter's palette. Unlike previous contributions, our method does not copy visuals from input data and can generate concepts through text alone. We perform a set of comparisons that finds our method to be a significant improvement over text-only prompts. Introduces ELODIN, a method for generating and using 'named concepts' (namecons), custom keywords associated with specific visual concepts in the embedding space of text-to-image models, enhancing control over concept coherence and contamination in generated images. Addresses limitations of text-based prompts in achieving precise visual consistency and preventing unintended interactions between concepts in text-to-image synthesis. ELODIN searches the embedding space by optimizing an embedding vector through backpropagation, guided by a similarity loss (e.g., text-image or face similarity), and associates it with a user-defined keyword (namecon). Namecons are then integrated into prompts, replacing guiding concepts' embeddings during inference. ELODIN reduces concept contamination (e.g., maintaining distinct colors) compared to text-only prompts. ELODIN improves concept coherence (e.g., preserving consistent facial features) across multiple generations. Quantitative analysis using face similarity metrics shows higher coherence for images generated with namecons. Limited exploration of loss functions beyond text-image and face similarity. Further research needed on applicability to non-visual modalities and tasks like segmentation/object detection. text-to-image synthesis, concept naming, embedding space, fine-grained control, concept coherence
2303.03991 Report OpenOccupancy: A Large Scale Benchmark for Surrounding Semantic Occupancy Perception Xiaofeng Wang, Zheng Zhu, Wenbo Xu, Yunpeng Zhang, Yi Wei, Xu Chi, Yun Ye, Dalong Du, Jiwen Lu, Xingang Wang Semantic occupancy perception is essential for autonomous driving, as automated vehicles require a fine-grained perception of the 3D urban structures. However, existing relevant benchmarks lack diversity in urban scenes, and they only evaluate front-view predictions. Towards a comprehensive benchmarking of surrounding perception algorithms, we propose OpenOccupancy, which is the first surrounding semantic occupancy perception benchmark. In the OpenOccupancy benchmark, we extend the large-scale nuScenes dataset with dense semantic occupancy annotations. Previous annotations rely on LiDAR points superimposition, where some occupancy labels are missed due to sparse LiDAR channels. To mitigate the problem, we introduce the Augmenting And Purifying (AAP) pipeline to ~2x densify the annotations, where ~4000 human hours are involved in the labeling process. Besides, camera-based, LiDAR-based and multi-modal baselines are established for the OpenOccupancy benchmark. Furthermore, considering the complexity of surrounding occupancy perception lies in the computational burden of high-resolution 3D predictions, we propose the Cascade Occupancy Network (CONet) to refine the coarse prediction, which relatively enhances the performance by ~30% than the baseline. We hope the OpenOccupancy benchmark will boost the development of surrounding occupancy perception algorithms. This paper presents OpenOccupancy, the first benchmark designed for surrounding semantic occupancy perception in driving scenarios. Surrounding semantic occupancy perception is crucial for autonomous driving as it enables a fine-grained understanding of 3D urban structures, which is essential for safe navigation. Existing benchmarks lack diversity in urban scenes and focus on front-view predictions, limiting their effectiveness in evaluating surrounding perception algorithms. The benchmark extends the nuScenes dataset with dense semantic occupancy annotations using the Augmenting And Purifying (AAP) pipeline. It proposes camera-based, LiDAR-based, and multi-modal baselines, and introduces the Cascade Occupancy Network (CONet) to improve efficiency and accuracy of high-resolution occupancy predictions. Surrounding occupancy perception paradigm outperforms single-view methods. Multi-modal baseline significantly enhances performance by adaptively fusing camera and LiDAR data. CONet improves efficiency and accuracy by refining low-resolution predictions. Benchmark currently relies on the nuScenes dataset and could be expanded to include more diverse driving scenarios. Future work can explore more sophisticated fusion methods and architectures for improved performance. autonomous driving, semantic occupancy perception, benchmarking, multi-modal fusion, 3d perception
2303.03932 Report FFT-based Dynamic Token Mixer for Vision Yuki Tatsunami, Masato Taki Multi-head-self-attention (MHSA)-equipped models have achieved notable performance in computer vision. Their computational complexity is proportional to quadratic numbers of pixels in input feature maps, resulting in slow processing, especially when dealing with high-resolution images. New types of token-mixer are proposed as an alternative to MHSA to circumvent this problem: an FFT-based token-mixer involves global operations similar to MHSA but with lower computational complexity. However, despite its attractive properties, the FFT-based token-mixer has not been carefully examined in terms of its compatibility with the rapidly evolving MetaFormer architecture. Here, we propose a novel token-mixer called Dynamic Filter and novel image recognition models, DFFormer and CDFFormer, to close the gaps above. The results of image classification and downstream tasks, analysis, and visualization show that our models are helpful. Notably, their throughput and memory efficiency when dealing with high-resolution image recognition is remarkable. Our results indicate that Dynamic Filter is one of the token-mixer options that should be seriously considered. The code is available at https://github.com/okojoalg/dfformer This paper introduces the "dynamic filter," a novel mechanism for dynamically generating global filters in vision models, and proposes two new MetaFormer classes: DFFormer and CDFFormer. This work addresses the limitations of traditional global filters and aims to close the performance gap between FFT-based models and state-of-the-art vision models, particularly in handling high-resolution images. The authors develop a dynamic filter that generates global filters based on image content using an MLP. This dynamic filter is then integrated into a MetaFormer architecture, resulting in DFFormer and a hybrid model with convolutions, CDFFormer. Extensive experiments are conducted on ImageNet-1K, ADE20K, and COCO benchmarks. DFFormer and CDFFormer achieve competitive performance compared to state-of-the-art MHSA-free models on ImageNet-1K image classification. The models demonstrate superior performance in downstream tasks like semantic segmentation (ADE20K) and object detection (COCO) compared to ResNet and PoolFormer backbones. DFFormer and CDFFormer exhibit significantly faster processing and lower memory usage than MHSA-based models at high resolutions, making them beneficial for tasks requiring high-resolution inputs. The current implementation of the dynamic filter does not inherently support arbitrary resolutions due to its reliance on element-wise products. Further investigation is needed to understand the observed differences in representation learning between FFT-based token-mixers and MHSA within hierarchical architectures. computer vision, vision transformers, metaformer, fft, dynamic filter
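One plausible reading of a content-dependent FFT token mixer is sketched below: a small MLP predicts per-channel mixing weights over a learned bank of frequency-domain filters, which are applied globally via rfft2/irfft2. This is an illustrative sketch, not the DFFormer implementation; the filter-bank size and MLP width are assumptions, and note how the filter bank fixes the spatial size, echoing the resolution limitation mentioned in the entry above.

```python
import torch
import torch.nn as nn

class DynamicGlobalFilter(nn.Module):
    """Sketch of an FFT-based token mixer with content-dependent filters: a small MLP
    predicts per-channel mixing weights over a learned bank of global frequency-domain
    filters. The spatial size is fixed at construction time."""
    def __init__(self, channels=64, height=14, width=14, n_filters=4):
        super().__init__()
        w_freq = width // 2 + 1
        self.bank = nn.Parameter(torch.randn(n_filters, height, w_freq, 2) * 0.02)
        self.mlp = nn.Sequential(nn.Linear(channels, channels), nn.GELU(),
                                 nn.Linear(channels, channels * n_filters))
        self.n_filters = n_filters

    def forward(self, x):                                  # x: (B, C, H, W)
        B, C, H, W = x.shape
        weights = self.mlp(x.mean(dim=(-2, -1)))           # pooled content -> (B, C*n_filters)
        weights = weights.view(B, C, self.n_filters).softmax(dim=-1)
        bank = torch.view_as_complex(self.bank)            # (n_filters, H, W//2+1)
        filt = torch.einsum('bcn,nhw->bchw', weights.to(bank.dtype), bank)
        x_freq = torch.fft.rfft2(x, dim=(-2, -1))          # global mixing in frequency domain
        return torch.fft.irfft2(x_freq * filt, s=(H, W), dim=(-2, -1))

print(DynamicGlobalFilter()(torch.randn(2, 64, 14, 14)).shape)   # torch.Size([2, 64, 14, 14])
```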
2303.03887 Report How to Construct Energy for Images? Denoising Autoencoder Can Be Energy Based Model Weili Zeng Energy-based models parameterize the unnormalized log-probability of data samples, but there is a lack of guidance on how to construct the "energy". In this paper, we propose a Denoising-EBM which decomposes the image energy into "semantic energy" and "texture energy". We define the "semantic energy" in the latent space of DAE to model the high-level representations, and define the pixel-level reconstruction error for denoising as "texture energy". Inspired by score-based model, our model utilizes multi-scale noisy samples for maximum-likelihood training and it outputs a vector instead of a scalar for exploring a larger set of functions during optimization. After training, the semantics are first synthesized by fast MCMC through "semantic energy", and then the pixel-level refinement of semantic image will be performed to generate perfect samples based on "texture energy". Ultimately, our model can outperform most EBMs in image generation. And we also demonstrate that Denoising-EBM has top performance among EBMs for out-of-distribution detection. This paper proposes Denoising-EBM, a novel energy-based model framework for images, decomposing image energy into 'semantic energy' learned in the latent space of a Denoising Autoencoder (DAE) and 'texture energy' defined by pixel-level denoising reconstruction error. Existing energy-based models lack guidance on constructing physically meaningful energy functions for images and often face computational challenges in training and sampling due to high dimensionality. This work addresses these limitations by leveraging DAEs to learn energy functions in a more efficient and interpretable manner. Denoising-EBM utilizes a DAE with a U-Net structure and a semantic decoder. It models the latent distribution of noisy real data in the DAE's latent space as 'semantic energy' and defines 'texture energy' using denoising reconstruction error. The model is trained using maximum likelihood with a two-stage MCMC sampling strategy for efficient and stable optimization. Denoising-EBM outperforms most existing EBMs in image generation tasks on datasets like CIFAR-10 and CelebA, achieving comparable results to GAN-based methods. The two-stage sampling strategy allows for faster generation than traditional EBMs and score-based models, as demonstrated by significantly reduced sampling time on CIFAR-10. Denoising-EBM demonstrates top performance among EBMs in out-of-distribution detection tasks, indicating its ability to accurately estimate data likelihood and penalize non-data-like regions. The performance of Denoising-EBM is sensitive to the choice of noise levels and interval density during training, requiring careful tuning. Future work includes generalizing the energy function to continuous time and applying the framework to larger-scale images. energy-based models, denoising autoencoders, image generation, out-of-distribution detection, mcmc sampling
2303.03808 Report Multiscale Tensor Decomposition and Rendering Equation Encoding for View Synthesis Kang Han, Wei Xiang Rendering novel views from captured multi-view images has made considerable progress since the emergence of the neural radiance field. This paper aims to further advance the quality of view synthesis by proposing a novel approach dubbed the neural radiance feature field (NRFF). We first propose a multiscale tensor decomposition scheme to organize learnable features so as to represent scenes from coarse to fine scales. We demonstrate many benefits of the proposed multiscale representation, including more accurate scene shape and appearance reconstruction, and faster convergence compared with the single-scale representation. Instead of encoding view directions to model view-dependent effects, we further propose to encode the rendering equation in the feature space by employing the anisotropic spherical Gaussian mixture predicted from the proposed multiscale representation. The proposed NRFF improves state-of-the-art rendering results by over 1 dB in PSNR on both the NeRF and NSVF synthetic datasets. A significant improvement has also been observed on the real-world Tanks & Temples dataset. Code can be found at https://github.com/imkanghan/nrff. This paper introduces NRFF, a novel approach for view synthesis using neural radiance feature fields, employing multiscale tensor decomposition and rendering equation encoding for enhanced quality. View synthesis methods often compromise between compact representation and computational efficiency. NRFF addresses this by combining the strengths of neural and learnable feature representations. NRFF represents scenes at multiple scales using tensor decomposition, allowing for detailed reconstruction. It then encodes the rendering equation in feature space using anisotropic spherical Gaussians, enabling effective modeling of view-dependent effects. NRFF surpasses state-of-the-art methods by over 1 dB in PSNR on both synthetic and real-world datasets. The multiscale representation results in faster convergence and better rendering quality compared to single-scale methods. Encoding the rendering equation in feature space proves superior to traditional view direction encoding methods, leading to more accurate reflections and illumination effects. NRFF currently utilizes a larger MLP compared to some learnable feature methods, impacting training and testing time. Multiscale representation increases computational overhead due to interpolation weight calculations, which could be addressed through optimization and GPU texture memory. view synthesis, neural rendering, rendering equation, multiscale representation, tensor decomposition
2303.03667 Report Run, Don't Walk: Chasing Higher FLOPS for Faster Neural Networks Jierun Chen, Shiu-hong Kao, Hao He, Weipeng Zhuo, Song Wen, Chul-Ho Lee, S. -H. Gary Chan To design fast neural networks, many works have been focusing on reducing the number of floating-point operations (FLOPs). We observe that such reduction in FLOPs, however, does not necessarily lead to a similar level of reduction in latency. This mainly stems from inefficiently low floating-point operations per second (FLOPS). To achieve faster networks, we revisit popular operators and demonstrate that such low FLOPS is mainly due to frequent memory access of the operators, especially the depthwise convolution. We hence propose a novel partial convolution (PConv) that extracts spatial features more efficiently, by cutting down redundant computation and memory access simultaneously. Building upon our PConv, we further propose FasterNet, a new family of neural networks, which attains substantially higher running speed than others on a wide range of devices, without compromising on accuracy for various vision tasks. For example, on ImageNet-1k, our tiny FasterNet-T0 is 2.8x, 3.3x, and 2.4x faster than MobileViT-XXS on GPU, CPU, and ARM processors, respectively, while being 2.9% more accurate. Our large FasterNet-L achieves impressive 83.5% top-1 accuracy, on par with the emerging Swin-B, while having 36% higher inference throughput on GPU, as well as saving 37% compute time on CPU. Code is available at https://github.com/JierunChen/FasterNet. This paper introduces FasterNet, a novel family of neural networks designed for high-speed inference on various devices, and a new operator called Partial Convolution (PConv) as its core building block. Many neural network designs focus on reducing FLOPs, but this doesn't always translate to reduced latency due to inefficiently low FLOPS caused by frequent memory access in operators like Depthwise Convolution. The authors propose PConv, which extracts spatial features by applying a regular convolution on a subset of input channels while leaving others untouched. This reduces both FLOPs and memory access. FasterNet leverages PConv with Pointwise Convolutions in an inverted residual block structure, optimizing normalization and activation layer placement for further latency reduction. PConv achieves significantly higher FLOPS than Depthwise Convolution and Group Convolution with reduced FLOPs. PConv, combined with Pointwise Convolution, effectively approximates a regular convolution for feature transformation. FasterNet consistently outperforms state-of-the-art networks in terms of accuracy-latency/throughput trade-off on ImageNet-1k classification and COCO object detection/instance segmentation tasks. The stride of PConv is limited to 1 to ensure spatial resolution alignment. FasterNet's receptive field might be limited by its convolutional architecture. neural networks, efficient inference, convolutional neural networks, partial convolution, fasternet
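The partial convolution (PConv) idea is simple enough to sketch in a few lines: a regular convolution touches only the first fraction of channels and the rest pass through unchanged, cutting both FLOPs and memory access. The 1/4 split ratio and layer sizes below are illustrative assumptions, not FasterNet's exact configuration.

```python
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    """Apply a regular 3x3 conv to the first `1/div` fraction of channels;
    leave the remaining channels untouched (reduces FLOPs and memory access)."""
    def __init__(self, channels: int, div: int = 4):
        super().__init__()
        self.conv_ch = channels // div           # channels that get convolved
        self.conv = nn.Conv2d(self.conv_ch, self.conv_ch,
                              kernel_size=3, padding=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.conv_ch, x.shape[1] - self.conv_ch], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)

x = torch.randn(2, 64, 56, 56)
print(PartialConv(64)(x).shape)                  # torch.Size([2, 64, 56, 56])
```

In the full network, pointwise (1x1) convolutions that follow this operator mix information back across all channels, which is why the pair approximates a regular convolution as the entry above notes.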
2303.03595 Report LoGoNet: Towards Accurate 3D Object Detection with Local-to-Global Cross-Modal Fusion Xin Li, Tao Ma, Yuenan Hou, Botian Shi, Yuchen Yang, Youquan Liu, Xingjiao Wu, Qin Chen, Yikang Li, Yu Qiao, Liang He LiDAR-camera fusion methods have shown impressive performance in 3D object detection. Recent advanced multi-modal methods mainly perform global fusion, where image features and point cloud features are fused across the whole scene. Such practice lacks fine-grained region-level information, yielding suboptimal fusion performance. In this paper, we present the novel Local-to-Global fusion network (LoGoNet), which performs LiDAR-camera fusion at both local and global levels. Concretely, the Global Fusion (GoF) of LoGoNet is built upon previous literature, while we exclusively use point centroids to more precisely represent the position of voxel features, thus achieving better cross-modal alignment. As to the Local Fusion (LoF), we first divide each proposal into uniform grids and then project these grid centers to the images. The image features around the projected grid points are sampled to be fused with position-decorated point cloud features, maximally utilizing the rich contextual information around the proposals. The Feature Dynamic Aggregation (FDA) module is further proposed to achieve information interaction between these locally and globally fused features, thus producing more informative multi-modal features. Extensive experiments on both Waymo Open Dataset (WOD) and KITTI datasets show that LoGoNet outperforms all state-of-the-art 3D detection methods. Notably, LoGoNet ranks 1st on Waymo 3D object detection leaderboard and obtains 81.02 mAPH (L2) detection performance. It is noteworthy that, for the first time, the detection performance on three classes surpasses 80 APH (L2) simultaneously. Code will be available at https://github.com/sankin97/LoGoNet. This paper proposes LoGoNet, a novel local-to-global fusion network for 3D object detection in autonomous driving, improving LiDAR-camera fusion by incorporating both global and local feature interactions. Existing LiDAR-camera fusion methods for 3D object detection primarily focus on global fusion, lacking fine-grained local information crucial for accurate object localization and classification, especially in complex scenes. LoGoNet utilizes a global fusion module for scene-level fusion, a local fusion module for proposal-level fusion with position encoding, and a feature dynamic aggregation module for interaction between local and global features. Achieves state-of-the-art performance on Waymo Open Dataset and KITTI dataset, ranking 1st on Waymo 3D object detection leaderboard with 81.02 mAPH (L2). Demonstrates the effectiveness of local-to-global fusion by surpassing previous best methods, including BEVFusion, by a significant margin. Shows consistent improvements across different object classes (vehicle, pedestrian, cyclist) and difficulty levels on both benchmarks. The method utilizes a frozen image branch, potentially limiting further performance improvements from joint optimization. Future work can explore extending the local-to-global fusion strategy to other multi-modal tasks and incorporating temporal information for more robust detection in dynamic environments. 3d object detection, lidar-camera fusion, local-to-global fusion, autonomous driving, waymo open dataset
2303.03405 Report Neural Style Transfer for Vector Graphics Valeria Efimova, Artyom Chebykin, Ivan Jarsky, Evgenii Prosvirnin, Andrey Filchenkov Neural style transfer draws researchers' attention, but the interest focuses on bitmap images. Various models have been developed for bitmap image generation both online and offline with arbitrary and pre-trained styles. However, style transfer between vector images has hardly been considered. Our research shows that applying standard content and style losses insignificantly changes the vector image drawing style because the structure of vector primitives differs a lot from pixels. To handle this problem, we introduce new loss functions. We also develop a new method based on differentiable rasterization that uses these loss functions and can change the color and shape parameters of the content image corresponding to the drawing of the style image. Qualitative experiments demonstrate the effectiveness of the proposed VectorNST method compared with the state-of-the-art neural style transfer approaches for bitmap images and the only existing approach for stylizing vector images, DiffVG. Although the proposed model does not achieve the quality and smoothness of style transfer between bitmap images, we consider our work an important early step in this area. VectorNST code and demo service are available at https://github.com/IzhanVarsky/VectorNST. This paper introduces VectorNST, a novel neural style transfer method specifically designed for vector graphics, addressing the limitations of existing bitmap-based approaches. Style transfer has largely focused on bitmap images, neglecting the unique characteristics and advantages of vector graphics. VectorNST offers a solution for scalable style transfer without the drawbacks of rasterization and vectorization. The method leverages differentiable rasterization (DiffVG) to enable backpropagation through the vector image representation. It employs a modified LPIPS loss for style capture and a novel contour loss to preserve content fidelity during style transfer. VectorNST successfully transfers artistic styles to vector images, preserving sharp contours and object shapes. Qualitative comparisons demonstrate superior performance over existing vector and raster-based style transfer methods. User study confirms that VectorNST produces more aesthetically pleasing stylized vector images. The method inherits limitations from DiffVG, such as the inability to optimize vector topology (number of curves). The feature extractor (VGG-19) is trained on bitmap images, limiting its ability to fully capture vector image characteristics. neural style transfer, vector graphics, differentiable rasterization, perceptual loss, contour loss
2303.03361 Report Nerflets: Local Radiance Fields for Efficient Structure-Aware 3D Scene Representation from 2D Supervision Xiaoshuai Zhang, Abhijit Kundu, Thomas Funkhouser, Leonidas Guibas, Hao Su, Kyle Genova We address efficient and structure-aware 3D scene representation from images. Nerflets are our key contribution -- a set of local neural radiance fields that together represent a scene. Each nerflet maintains its own spatial position, orientation, and extent, within which it contributes to panoptic, density, and radiance reconstructions. By leveraging only photometric and inferred panoptic image supervision, we can directly and jointly optimize the parameters of a set of nerflets so as to form a decomposed representation of the scene, where each object instance is represented by a group of nerflets. During experiments with indoor and outdoor environments, we find that nerflets: (1) fit and approximate the scene more efficiently than traditional global NeRFs, (2) allow the extraction of panoptic and photometric renderings from arbitrary views, and (3) enable tasks rare for NeRFs, such as 3D panoptic segmentation and interactive editing. This paper introduces Nerflets, a novel 3D scene representation composed of multiple local neural radiance fields, to efficiently represent 3D scenes with structure awareness from 2D image supervision. Existing methods for 3D scene representation from images often require 3D ground truth, lack efficiency, or don't handle object instances well. Nerflets address these issues, offering a compact, efficient, and comprehensive solution. Each Nerflet is a small NeRF with spatial pose, influencing a local region. They are jointly optimized using photometric and 2D panoptic segmentation losses to decompose the scene. A greedy merge algorithm then groups Nerflets into object instances. Nerflets achieve state-of-the-art performance for panoptic novel view synthesis on KITTI-360, outperforming methods relying on 3D supervision. They demonstrate superior novel view synthesis quality on ScanNet compared to baselines, capturing object details better due to explicit parameter allocation. Nerflets enable interactive scene editing by directly manipulating individual Nerflets, leading to cleaner results than methods without explicit scene decomposition. Nerflets currently do not model dynamic scene content, which could be a potential future direction. The assumption of a fixed number of Nerflets regardless of scene complexity might be limiting. Dynamically adjusting the number of Nerflets based on scene complexity could be explored. 3d scene representation, neural radiance fields, nerflets, panoptic segmentation, scene editing
2303.03003 Report Efficient Large-scale Scene Representation with a Hybrid of High-resolution Grid and Plane Features Yuqi Zhang, Guanying Chen, Shuguang Cui Existing neural radiance fields (NeRF) methods for large-scale scene modeling require days of training using multiple GPUs, hindering their applications in scenarios with limited computing resources. Although fast-optimization NeRF variants based on explicit dense or hash-grid features have been proposed, their effectiveness has mainly been demonstrated in object-scale scene representation. In this paper, we point out that the low feature resolution in explicit representation is the bottleneck for large-scale unbounded scene representation. To address this problem, we introduce a new and efficient hybrid feature representation for NeRF that fuses the 3D hash-grids and high-resolution 2D dense plane features. Compared with the dense-grid representation, the resolution of a dense 2D plane can be scaled up more efficiently. Based on this hybrid representation, we propose a fast optimization NeRF variant, called GP-NeRF, that achieves better rendering results while maintaining a compact model size. Extensive experiments on multiple large-scale unbounded scene datasets show that our model can converge in 1.5 hours using a single GPU while achieving results comparable to or even better than the existing method that requires about one day's training with 8 GPUs. This paper introduces GP-NeRF, a novel neural radiance field variant that uses a hybrid feature representation of 3D hash-grids and high-resolution 2D dense plane features for efficient large-scale unbounded scene modeling. Existing large-scale scene modeling methods often require days of training with multiple GPUs due to low feature resolution, hindering their practicality for users with limited computational resources. The method combines a space contraction strategy for compact unbounded scene representation with a hybrid feature representation. This representation leverages the efficiency of hash-grids and enhances it with multi-resolution dense plane features to mitigate collision issues, allowing for high-resolution scene representation with low memory consumption. The model then uses a lightweight MLP to regress density and color from the interpolated hybrid features. GP-NeRF achieves comparable or better rendering quality than state-of-the-art Mega-NeRF while being significantly faster, converging in 1.5 hours on a single GPU compared to a day on 8 GPUs for Mega-NeRF. The proposed hybrid feature representation outperforms baselines using only dense-grids, hash-grids, or TensoRF representations in terms of rendering quality and efficiency. Ablation studies confirm the effectiveness of plane features in enhancing the hybrid representation, improving accuracy with minimal parameter increase. While significantly faster, GP-NeRF still doesn't achieve real-time scene reconstruction. The method lacks explicit modeling of dynamic objects, limiting its application to static scenes. neural radiance fields, large-scale scene modeling, 3d reconstruction, hybrid feature representation, fast nerf optimization
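A sketch of the hybrid-feature idea described in the GP-NeRF entry above, with a coarse dense 3D grid standing in for the hash grid (to keep the example short) and three high-resolution axis-aligned 2D planes sampled bilinearly; resolutions, channel counts, and the tiny decoder MLP are illustrative assumptions rather than GP-NeRF's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridField(nn.Module):
    """Sketch of a hybrid representation: a coarse 3D feature grid (stand-in for the
    hash grid) fused with three high-resolution axis-aligned 2D feature planes,
    decoded by a light MLP."""
    def __init__(self, grid_res=32, plane_res=512, c3d=8, c2d=8, hidden=64):
        super().__init__()
        self.grid = nn.Parameter(torch.randn(1, c3d, grid_res, grid_res, grid_res) * 0.1)
        self.planes = nn.Parameter(torch.randn(3, c2d, plane_res, plane_res) * 0.1)
        self.mlp = nn.Sequential(nn.Linear(c3d + 3 * c2d, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 4))       # e.g., density + color features

    def forward(self, pts):                                  # pts: (P, 3) in [-1, 1]
        P = pts.shape[0]
        g = F.grid_sample(self.grid, pts.view(1, P, 1, 1, 3),
                          align_corners=True).view(-1, P).T           # trilinear, (P, c3d)
        plane_feats = []
        for i, axes in enumerate([[0, 1], [0, 2], [1, 2]]):           # xy, xz, yz planes
            uv = pts[:, axes].view(1, P, 1, 2)
            f = F.grid_sample(self.planes[i:i + 1], uv,
                              align_corners=True).view(-1, P).T       # bilinear, (P, c2d)
            plane_feats.append(f)
        return self.mlp(torch.cat([g] + plane_feats, dim=-1))         # (P, 4)

print(HybridField()(torch.rand(1024, 3) * 2 - 1).shape)               # torch.Size([1024, 4])
```

The design choice the entry highlights is visible in the parameter counts: doubling the resolution of a 2D plane grows memory quadratically rather than cubically, so the planes can be made much finer than any dense 3D grid of similar size.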
2303.02995 Report HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention Shijie Geng, Jianbo Yuan, Yu Tian, Yuxiao Chen, Yongfeng Zhang The success of large-scale contrastive vision-language pretraining (CLIP) has benefited both visual recognition and multimodal content understanding. The concise design brings CLIP the advantage in inference efficiency against other vision-language models with heavier cross-attention fusion layers, making it a popular choice for a wide spectrum of downstream tasks. However, CLIP does not explicitly capture the hierarchical nature of high-level and fine-grained semantics conveyed in images and texts, which is arguably critical to vision-language understanding and reasoning. To this end, we equip both the visual and language branches in CLIP with hierarchy-aware attentions, namely Hierarchy-aware CLIP (HiCLIP), to progressively discover semantic hierarchies layer-by-layer from both images and texts in an unsupervised manner. As a result, such hierarchical aggregation significantly improves the cross-modal alignment. To demonstrate the advantages of HiCLIP, we conduct qualitative analysis on its unsupervised hierarchy induction during inference, as well as extensive quantitative experiments on both visual recognition and vision-language downstream tasks. The paper proposes HiCLIP, a model that incorporates hierarchy-aware attention into CLIP to explicitly capture hierarchical structures in vision and language. CLIP lacks explicit mechanisms to capture the hierarchical nature of semantics in images and texts, which is crucial for multimodal understanding and reasoning. HiCLIP introduces "hierarchy-aware attention" by utilizing affinity scores between adjacent patches/tokens to guide the attention mechanism. These scores, evolving layer-by-layer, encourage grouping spatially and semantically similar elements, progressively building hierarchical representations. HiCLIP significantly outperforms CLIP family models on zero-shot image-text retrieval. HiCLIP achieves superior performance on visual recognition tasks, especially when combined with self-supervised learning (HiDeCLIP). Visualization of HiCLIP's hierarchy induction process demonstrates its capability to discover meaningful visual and language hierarchies in an unsupervised manner. The current unsupervised hierarchy induction relies on manually set thresholds for visual parsing. The paper primarily focuses on evaluating up to 30 million image-text pairs due to computational limitations. multimodal learning, contrastive learning, vision-language models, hierarchy-aware attention, unsupervised hierarchy induction
2303.02984 Report Learning multi-scale local conditional probability models of images Zahra Kadkhodaie, Florentin Guth, Stéphane Mallat, Eero P Simoncelli Deep neural networks can learn powerful prior probability models for images, as evidenced by the high-quality generations obtained with recent score-based diffusion methods. But the means by which these networks capture complex global statistical structure, apparently without suffering from the curse of dimensionality, remain a mystery. To study this, we incorporate diffusion methods into a multi-scale decomposition, reducing dimensionality by assuming a stationary local Markov model for wavelet coefficients conditioned on coarser-scale coefficients. We instantiate this model using convolutional neural networks (CNNs) with local receptive fields, which enforce both the stationarity and Markov properties. Global structures are captured using a CNN with receptive fields covering the entire (but small) low-pass image. We test this model on a dataset of face images, which are highly non-stationary and contain large-scale geometric structures. Remarkably, denoising, super-resolution, and image synthesis results all demonstrate that these structures can be captured with significantly smaller conditioning neighborhoods than required by a Markov model implemented in the pixel domain. Our results show that score estimation for large complex images can be reduced to low-dimensional Markov conditional models across scales, alleviating the curse of dimensionality. This paper presents a low-dimensional image probability model based on a multi-scale decomposition with local Markov conditional probabilities of wavelet coefficients. The work aims to address the curse of dimensionality in score-based diffusion models for images, investigating how these models capture global structure despite high data dimensionality. The model factorizes the image probability distribution into conditional probabilities of wavelet coefficients conditioned on coarser scales, assuming stationarity and locality. These conditional scores are estimated using CNNs with local receptive fields. Multi-scale denoising with the proposed model significantly outperforms conventional pixel-domain denoisers, especially for high noise levels. The model captures long-range dependencies in face images even with small receptive fields (as small as 9x9) in the wavelet domain. Super-resolution and synthesis experiments demonstrate that the model generates more realistic face images compared to models based on local Markov assumptions in the pixel domain. The dimensionality of conditioning neighborhoods, while reduced, is still high and requires further reduction. Further research is needed to extend the model to more diverse datasets beyond centered faces. diffusion models, wavelet transform, markov random fields, image denoising, super-resolution
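A minimal sketch of the multi-scale conditional structure described in the entry above: an orthonormal Haar transform splits an image into a low-pass band and detail bands, and a small CNN with a limited receptive field operates on noisy detail coefficients conditioned on the coarser band. Widths, depths, and the Haar choice are assumptions for illustration; the paper's conditional score networks are more elaborate.

```python
import torch
import torch.nn as nn

def haar_dwt(x):
    """One level of an orthonormal 2D Haar transform. x: (B, C, H, W) with even H, W.
    Returns low-pass (B, C, H/2, W/2) and detail bands (B, 3C, H/2, W/2)."""
    a, b = x[..., 0::2, 0::2], x[..., 0::2, 1::2]
    c, d = x[..., 1::2, 0::2], x[..., 1::2, 1::2]
    low = (a + b + c + d) / 2
    details = torch.cat([(a + b - c - d) / 2,       # horizontal
                         (a - b + c - d) / 2,       # vertical
                         (a - b - c + d) / 2], 1)   # diagonal
    return low, details

class LocalConditionalDenoiser(nn.Module):
    """Small CNN with a limited receptive field (7x7 here) that denoises wavelet detail
    bands conditioned on the coarser-scale low-pass image."""
    def __init__(self, c=1, width=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4 * c, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, 3 * c, 3, padding=1))

    def forward(self, noisy_details, low):
        return self.net(torch.cat([noisy_details, low], dim=1))

x = torch.randn(2, 1, 64, 64)
low, det = haar_dwt(x)
noisy = det + 0.1 * torch.randn_like(det)
print(LocalConditionalDenoiser()(noisy, low).shape)    # torch.Size([2, 3, 32, 32])
```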
2303.02943 Report Adaptive Texture Filtering for Single-Domain Generalized Segmentation Xinhui Li, Mingjia Li, Yaxing Wang, Chuan-Xian Ren, Xiaojie Guo Domain generalization in semantic segmentation aims to alleviate the performance degradation on unseen domains through learning domain-invariant features. Existing methods diversify images in the source domain by adding complex or even abnormal textures to reduce the sensitivity to domain-specific features. However, these approaches depend heavily on the richness of the texture bank, and training them can be time-consuming. In contrast to importing textures arbitrarily or augmenting styles randomly, we focus on the single source domain itself to achieve generalization. In this paper, we present a novel adaptive texture filtering mechanism to suppress the influence of texture without using augmentation, thus eliminating the interference of domain-specific features. Further, we design a hierarchical guidance generalization network equipped with structure-guided enhancement modules, whose purpose is to learn the domain-invariant generalized knowledge. Extensive experiments together with ablation studies on widely-used datasets are conducted to verify the effectiveness of the proposed model, and reveal its superiority over other state-of-the-art alternatives. This paper proposes a novel adaptive filtering mechanism (AFM) and a hierarchical guidance generalization network (HGGN) for single-domain generalization in semantic segmentation, aiming to learn domain-invariant features by suppressing domain-specific textures. Domain generalization in semantic segmentation is crucial for real-world applications like autonomous driving where models need to generalize well to unseen domains. The AFM adaptively filters out textures from images to generate content-dependent representations, while the HGGN with structure-guided enhancement modules learns domain-invariant features under contour supervision. The proposed method outperforms state-of-the-art methods on benchmark datasets (GTA5, SYNTHIA, Cityscapes, BDD-100K, Mapillary) for domain generalization in semantic segmentation. The adaptive texture filtering in AFM proves to be more effective than fixed filtering levels. The hierarchical design of HGGN with contour supervision significantly improves the generalization ability compared to using only the backbone network. The method's performance relies on the pre-trained texture filtering generator, which might be limited by the diversity of the training data. Future work could explore incorporating other domain-invariant features beyond texture and shape information. domain generalization, semantic segmentation, texture suppression, adaptive filtering, hierarchical guidance
2303.02936 Report UniHCP: A Unified Model for Human-Centric Perceptions Yuanzheng Ci, Yizhou Wang, Meilin Chen, Shixiang Tang, Lei Bai, Feng Zhu, Rui Zhao, Fengwei Yu, Donglian Qi, Wanli Ouyang Human-centric perceptions (e.g., pose estimation, human parsing, pedestrian detection, person re-identification, etc.) play a key role in industrial applications of visual models. While specific human-centric tasks have their own relevant semantic aspect to focus on, they also share the same underlying semantic structure of the human body. However, few works have attempted to exploit such homogeneity and design a general-purpose model for human-centric tasks. In this work, we revisit a broad range of human-centric tasks and unify them in a minimalist manner. We propose UniHCP, a Unified Model for Human-Centric Perceptions, which unifies a wide range of human-centric tasks in a simplified end-to-end manner with the plain vision transformer architecture. With large-scale joint training on 33 human-centric datasets, UniHCP can outperform strong baselines on several in-domain and downstream tasks by direct evaluation. When adapted to a specific task, UniHCP achieves new SOTAs on a wide range of human-centric tasks, e.g., 69.8 mIoU on CIHP for human parsing, 86.18 mA on PA-100K for attribute prediction, 90.3 mAP on Market1501 for ReID, and 85.8 JI on CrowdHuman for pedestrian detection, performing better than specialized models tailored for each task. This paper proposes UniHCP, a unified model for human-centric perceptions that can simultaneously handle pose estimation, semantic part segmentation, pedestrian detection, ReID, and person attribute recognition using a single architecture. A unified model for human-centric tasks can exploit the shared underlying semantic structure of the human body to improve performance, enable fast adaptation to new tasks, and decrease memory cost in large-scale multitask system deployment. UniHCP utilizes a plain vision transformer as a shared encoder-decoder architecture and introduces task-specific queries and a task-guided interpreter to handle the diversity of data and output structures across different tasks. UniHCP achieves state-of-the-art performance on nine out of twelve human-centric benchmark datasets after fine-tuning. The model demonstrates strong performance even with direct evaluation on datasets included in pretraining and shows promising transferability to unseen datasets. Ablation studies show the effectiveness of the weight-sharing design and data-efficient transfer ability with prompt tuning. The person ReID task requires finetuning for optimal performance due to its disparity from other tasks. The model raises ethical concerns regarding potential identity information leaking in ReID, which requires careful handling and limited release of the pretrained model. human-centric perception, unified model, vision transformer, multitask learning, weight sharing
2303.02688 Report Text2Face: A Multi-Modal 3D Face Model Will Rowan, Patrik Huber, Nick Pears, Andrew Keeling We present the first 3D morphable modelling approach, whereby 3D face shape can be directly and completely defined using a textual prompt. Building on work in multi-modal learning, we extend the FLAME head model to a common image-and-text latent space. This allows for direct 3D Morphable Model (3DMM) parameter generation and therefore shape manipulation from textual descriptions. Our method, Text2Face, has many applications; for example: generating police photofits where the input is already in natural language. It further enables multi-modal 3DMM image fitting to sketches and sculptures, as well as images. Presents Text2Face, the first 3D morphable modelling approach that directly generates 3D face shapes from textual descriptions. Automates 3D face creation from text, enabling applications like generating police photofits directly from witness descriptions, multi-modal 3DMM fitting (sketches, sculptures), and improved initialization for model-to-image fitting. Trains a deep MLP (Text2Face) to map CLIP embeddings to FLAME model parameters, using a dataset of synthetic faces with corresponding CLIP embeddings and FLAME parameters extracted via DECA. Successfully generates 3D faces with identity, expression, and detail from text prompts. Demonstrates multi-modal fitting capabilities, generating 3D faces from sketches and sculptures. Enables texture mapping from DALL-E generated images onto the generated 3D meshes. Potential for inherited gender and racial biases from CLIP impacting 3D face generation. Limited exploration of text prompts for fine-grained shape manipulation. 3d morphable model, text-to-3d, clip, face generation, multi-modal learning
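The text-to-parameter mapping summarized above ("a deep MLP maps CLIP embeddings to FLAME parameters") can be pictured with the minimal sketch below; the layer widths, the split into shape/expression heads, and the dimensionalities are assumptions for illustration, not values from the paper.

```python
import torch
import torch.nn as nn

class TextToFLAME(nn.Module):
    """Minimal sketch: regress FLAME-style parameters from a CLIP embedding.
    Architecture and dimensions are illustrative assumptions."""
    def __init__(self, clip_dim: int = 512, shape_dim: int = 100, expr_dim: int = 50):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(clip_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
        )
        self.shape_head = nn.Linear(1024, shape_dim)  # identity/shape coefficients
        self.expr_head = nn.Linear(1024, expr_dim)    # expression coefficients

    def forward(self, clip_embedding: torch.Tensor):
        h = self.backbone(clip_embedding)
        return self.shape_head(h), self.expr_head(h)

# Usage: embed a prompt (or a sketch/sculpture image) with CLIP, then regress parameters.
model = TextToFLAME()
fake_clip_embedding = torch.randn(1, 512)  # stand-in for a real CLIP embedding
shape, expression = model(fake_clip_embedding)
print(shape.shape, expression.shape)  # torch.Size([1, 100]) torch.Size([1, 50])
```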
2303.02584 Report Super-Resolution Neural Operator Min Wei, Xuesong Zhang We propose Super-resolution Neural Operator (SRNO), a deep operator learning framework that can resolve high-resolution (HR) images at arbitrary scales from the low-resolution (LR) counterparts. Treating the LR-HR image pairs as continuous functions approximated with different grid sizes, SRNO learns the mapping between the corresponding function spaces. From the perspective of approximation theory, SRNO first embeds the LR input into a higher-dimensional latent representation space, trying to capture sufficient basis functions, and then iteratively approximates the implicit image function with a kernel integral mechanism, followed by a final dimensionality reduction step to generate the RGB representation at the target coordinates. The key characteristics distinguishing SRNO from prior continuous SR works are: 1) the kernel integral in each layer is efficiently implemented via the Galerkin-type attention, which possesses non-local properties in the spatial domain and therefore benefits the grid-free continuum; and 2) the multilayer attention architecture allows for the dynamic latent basis update, which is crucial for SR problems to "hallucinate" high-frequency information from the LR image. Experiments show that SRNO outperforms existing continuous SR methods in terms of both accuracy and running time. Our code is at https://github.com/2y7c3/Super-Resolution-Neural-Operator This paper proposes Super-Resolution Neural Operator (SRNO), a deep operator learning framework to resolve high-resolution (HR) images at arbitrary scales from low-resolution (LR) counterparts. Existing deep learning-based SR methods often require training separate models for each scaling factor, proving inefficient for arbitrary scale requirements. SRNO addresses this limitation by learning the mapping between continuous function spaces representing LR-HR image pairs. SRNO leverages a three-step methodology: 1) Lifting: Embeds LR input into a higher-dimensional latent space using a CNN encoder and spatial interpolation; 2) Iterative Kernel Integral: Approximates the image function with a kernel integral mechanism, efficiently implemented via Galerkin-type attention for non-local spatial relationship capturing; 3) Projection: Reduces the final dimensionality to generate the RGB representation at the target coordinates. SRNO outperforms existing continuous SR methods in both reconstruction accuracy and running time, irrespective of the encoder used. The Galerkin-type attention mechanism in SRNO contributes to its superior function approximation capability. SRNO effectively captures global image structures, leading to better visual quality with fewer artifacts compared to methods like LIIF and LTE. The impact of varying the number of basis functions and iterative updating layers requires further investigation. Exploring alternative sampling strategies beyond random and sequential methods could potentially yield further performance improvements. super-resolution, neural operator, deep learning, galerkin-type attention, continuous image representation
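The Galerkin-type attention mentioned above is a softmax-free linear attention: keys and values are layer-normalized and contracted into a small d-by-d matrix once, so the cost grows linearly with the number of query coordinates. The sketch below illustrates that generic mechanism, not SRNO's exact module.

```python
import torch
import torch.nn as nn

class GalerkinAttention(nn.Module):
    """Softmax-free (Galerkin-style) attention: normalize K and V, form the d x d
    matrix K^T V once, then apply it to every query. Cost is O(n * d^2) rather than
    O(n^2 * d). Illustrative sketch only."""
    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.norm_k = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, n_points, dim)
        q = self.to_q(x)
        k = self.norm_k(self.to_k(x))
        v = self.norm_v(self.to_v(x))
        context = k.transpose(1, 2) @ v / x.shape[1]  # (batch, dim, dim) kernel integral
        return x + q @ context                        # residual latent-basis update

tokens = torch.randn(2, 4096, 64)  # e.g. latent codes at 4096 query coordinates
print(GalerkinAttention(64)(tokens).shape)  # torch.Size([2, 4096, 64])
```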
2303.02416 Report PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling Yuan Liu, Songyang Zhang, Jiacheng Chen, Kai Chen, Dahua Lin Masked Image Modeling (MIM) has achieved promising progress with the advent of Masked Autoencoders (MAE) and BEiT. However, subsequent works have complicated the framework with new auxiliary tasks or extra pre-trained models, inevitably increasing computational overhead. This paper undertakes a fundamental analysis of MIM from the perspective of pixel reconstruction, which examines the input image patches and reconstruction target, and highlights two critical but previously overlooked bottlenecks. Based on this analysis, we propose a remarkably simple and effective method, PixMIM, that entails two strategies: 1) filtering the high-frequency components from the reconstruction target to de-emphasize the network's focus on texture-rich details and 2) adopting a conservative data transform strategy to alleviate the problem of missing foreground in MIM training. PixMIM can be easily integrated into most existing pixel-based MIM approaches (i.e., using raw images as the reconstruction target) with negligible additional computation. Without bells and whistles, our method consistently improves three MIM approaches, MAE, ConvMAE, and LSMAE, across various downstream tasks. We believe this effective plug-and-play method will serve as a strong baseline for self-supervised learning and provide insights for future improvements of the MIM framework. Code and models are available at https://github.com/open-mmlab/mmselfsup/tree/dev-1.x/configs/selfsup/pixmim. This paper presents PixMIM, a simple yet effective method for improving masked image modeling (MIM) by focusing on pixel reconstruction. Existing MIM methods either complicate the framework with extra tasks or rely on computationally expensive pre-trained models for target generation. This paper addresses these limitations by revisiting the fundamental aspects of pixel reconstruction. PixMIM introduces two key strategies: (1) Filtering high-frequency components from the reconstruction target to prioritize learning of low-frequency patterns like shapes and global structures. (2) Replacing Random Resized Crop (RRC) with Simple Resized Crop (SRC) to preserve more semantically important foreground information in input patches. PixMIM consistently improves the performance of three baselines (MAE, ConvMAE, LSMAE) on various downstream tasks like ImageNet classification, COCO object detection, and ADE20K semantic segmentation. The method enhances model robustness against domain shifts, evidenced by superior performance on out-of-distribution ImageNet variants. PixMIM leads to models with a stronger shape bias, aligning them more closely with human visual perception. Current experiments primarily focus on ViT-B architecture; further evaluation on larger models is needed. The bandwidth of the low-pass filter is a hyperparameter that might require tuning for different datasets and input resolutions. Investigating a self-adaptive bandwidth is a potential future direction. self-supervised learning, masked image modeling, pixel reconstruction, vision transformer, representation learning
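Strategy (1) amounts to low-pass filtering the reconstruction target before computing the MIM loss; a generic frequency-domain version is sketched below (the ideal circular mask and cutoff value are assumptions, not the paper's exact filter).

```python
import torch

def lowpass_target(images: torch.Tensor, cutoff: float = 0.25) -> torch.Tensor:
    """Keep only low spatial frequencies of the reconstruction target.
    images: (B, C, H, W); cutoff is an assumed fraction of the Nyquist radius."""
    B, C, H, W = images.shape
    freq = torch.fft.fftshift(torch.fft.fft2(images), dim=(-2, -1))
    fy = torch.fft.fftshift(torch.fft.fftfreq(H)).view(H, 1)
    fx = torch.fft.fftshift(torch.fft.fftfreq(W)).view(1, W)
    mask = (fy ** 2 + fx ** 2).sqrt() <= cutoff * 0.5  # ideal circular low-pass mask
    filtered = torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1)))
    return filtered.real

# The filtered image would replace the raw image as the MAE-style reconstruction target.
target = lowpass_target(torch.rand(4, 3, 224, 224))
```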
2303.02151 Report Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners Renrui Zhang, Xiangfei Hu, Bohao Li, Siyuan Huang, Hanqiu Deng, Hongsheng Li, Yu Qiao, Peng Gao Visual recognition in low-data regimes requires deep neural networks to learn generalized representations from limited training samples. Recently, CLIP-based methods have shown promising few-shot performance benefited from the contrastive language-image pre-training. We then question, if the more diverse pre-training knowledge can be cascaded to further assist few-shot representation learning. In this paper, we propose CaFo, a Cascade of Foundation models that incorporates diverse prior knowledge of various pre-training paradigms for better few-shot learning. Our CaFo incorporates CLIP's language-contrastive knowledge, DINO's vision-contrastive knowledge, DALL-E's vision-generative knowledge, and GPT-3's language-generative knowledge. Specifically, CaFo works by 'Prompt, Generate, then Cache'. Firstly, we leverage GPT-3 to produce textual inputs for prompting CLIP with rich downstream linguistic semantics. Then, we generate synthetic images via DALL-E to expand the few-shot training data without any manpower. At last, we introduce a learnable cache model to adaptively blend the predictions from CLIP and DINO. By such collaboration, CaFo can fully unleash the potential of different pre-training methods and unify them to perform state-of-the-art for few-shot classification. Code is available at https://github.com/ZrrSkywalker/CaFo. This paper proposes CaFo, a cascade of foundation models (CLIP, DINO, DALL-E, GPT-3) for few-shot image classification. Few-shot learning requires good generalization from limited data, and leveraging diverse pre-trained knowledge can enhance this ability. CaFo employs a 'Prompt, Generate, then Cache' pipeline: 1) GPT-3 generates semantic prompts for CLIP. 2) DALL-E synthesizes additional training images. 3) A learnable cache model fuses predictions from CLIP and DINO based on distribution similarity. CaFo achieves state-of-the-art few-shot classification performance on 11 datasets. Zero-shot CaFo, trained only on DALL-E generated images, demonstrates competitive results. Ablations confirm the contribution of each component and the effectiveness of the adaptive inference strategy. The current work explores a limited set of foundation models. Future work could investigate incorporating more diverse pre-trained models, such as masked-generative or 3D models. few-shot learning, foundation models, vision-language pre-training, data augmentation, knowledge ensemble
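The "cache" step above can be pictured as a key-value classifier in the spirit of Tip-Adapter: few-shot (and DALL-E-synthesized) image features serve as keys, their one-hot labels as values, and the cache prediction is blended with CLIP's zero-shot logits (CaFo applies the same idea to DINO features with adaptive weights). The sketch below shows that generic blend with made-up tensor names and hyperparameters, not CaFo's full adaptive-ensemble code.

```python
import torch
import torch.nn.functional as F

def cache_blend(test_feat, cache_keys, cache_values, clip_logits, alpha=1.0, beta=5.0):
    """test_feat: (B, D) L2-normalized query features.
    cache_keys: (N, D) normalized few-shot features; cache_values: (N, K) one-hot labels.
    clip_logits: (B, K) zero-shot logits from CLIP text prompts.
    alpha, beta are illustrative blending/sharpness hyperparameters."""
    affinity = test_feat @ cache_keys.t()                               # (B, N) cosine similarities
    cache_logits = torch.exp(-beta * (1.0 - affinity)) @ cache_values   # (B, K)
    return clip_logits + alpha * cache_logits

B, D, N, K = 8, 512, 160, 10
test_feat = F.normalize(torch.randn(B, D), dim=-1)
cache_keys = F.normalize(torch.randn(N, D), dim=-1)
cache_values = F.one_hot(torch.randint(0, K, (N,)), K).float()
logits = cache_blend(test_feat, cache_keys, cache_values, torch.randn(B, K))
```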
2303.02091 Report Delicate Textured Mesh Recovery from NeRF via Adaptive Surface Refinement Jiaxiang Tang, Hang Zhou, Xiaokang Chen, Tianshu Hu, Errui Ding, Jingdong Wang, Gang Zeng Neural Radiance Fields (NeRF) have constituted a remarkable breakthrough in image-based 3D reconstruction. However, their implicit volumetric representations differ significantly from the widely-adopted polygonal meshes and lack support from common 3D software and hardware, making their rendering and manipulation inefficient. To overcome this limitation, we present a novel framework that generates textured surface meshes from images. Our approach begins by efficiently initializing the geometry and view-dependency decomposed appearance with a NeRF. Subsequently, a coarse mesh is extracted, and an iterative surface refining algorithm is developed to adaptively adjust both vertex positions and face density based on re-projected rendering errors. We jointly refine the appearance with geometry and bake it into texture images for real-time rendering. Extensive experiments demonstrate that our method achieves superior mesh quality and competitive rendering quality. This paper proposes NeRF2Mesh, a framework for reconstructing textured surface meshes from multi-view RGB images, enabling compatibility with common 3D hardware and software. NeRF's implicit volumetric representations are inefficient for rendering and manipulation, lacking support from standard 3D tools. Polygonal meshes address these limitations, but their direct reconstruction poses challenges. NeRF2Mesh first initializes geometry and decomposed appearance (diffuse and specular) using a grid-based NeRF. It then extracts a coarse mesh, followed by iterative refinement of vertex positions and face density based on re-projected rendering errors. Finally, the appearance is baked into texture images. NeRF2Mesh achieves superior mesh quality with accurate thin structure reconstruction compared to previous methods. The method results in relatively smaller mesh sizes due to adaptive face density adjustment. The framework achieves competitive rendering quality and enables real-time rendering with standard 3D software and hardware. The current method bakes lighting into textures, limiting relighting capabilities. The relatively small appearance network struggles with complex view-dependent effects, impacting surface quality in those regions. neural radiance fields, surface reconstruction, mesh generation, texture baking, 3d reconstruction
2303.02001 Report Zero-shot Object Counting Jingyi Xu, Hieu Le, Vu Nguyen, Viresh Ranjan, Dimitris Samaras Class-agnostic object counting aims to count object instances of an arbitrary class at test time. It is challenging but also enables many potential applications. Current methods require human-annotated exemplars as inputs which are often unavailable for novel categories, especially for autonomous systems. Thus, we propose zero-shot object counting (ZSC), a new setting where only the class name is available during test time. Such a counting system does not require human annotators in the loop and can operate automatically. Starting from a class name, we propose a method that can accurately identify the optimal patches which can then be used as counting exemplars. Specifically, we first construct a class prototype to select the patches that are likely to contain the objects of interest, namely class-relevant patches. Furthermore, we introduce a model that can quantitatively measure how suitable an arbitrary patch is as a counting exemplar. By applying this model to all the candidate patches, we can select the most suitable patches as exemplars for counting. Experimental results on a recent class-agnostic counting dataset, FSC-147, validate the effectiveness of our method. Code is available at https://github.com/cvlab-stonybrook/zero-shot-counting This supplementary material provides additional experiments and analyses for the zero-shot object counting method presented in the main paper. This supplementary material aims to enhance the understanding and validate the effectiveness of the proposed zero-shot object counting method. The authors conduct ablation studies on different aspects of their method including exploring different methods for acquiring candidate patches, comparing the use of predicted counting errors versus objectness scores for selecting exemplars, and evaluating the performance of using correlation matching with a generated prototype as an alternative to patch selection. Using a combination of randomly sampled patches and RPN proposals as candidate patches yields the best performance. Selecting counting exemplars based on predicted counting error outperforms using objectness scores. The proposed patch selection method achieves better results compared to directly using a generated prototype for correlation matching. The study primarily focuses on the FSC-147 dataset and further evaluation on other datasets is needed. Future work could explore incorporating additional information, such as object scale and shape, to further improve exemplar selection. zero-shot learning, object counting, exemplar selection, patch selection, error prediction
2303.01559 Report Improving GAN Training via Feature Space Shrinkage Haozhe Liu, Wentian Zhang, Bing Li, Haoqian Wu, Nanjun He, Yawen Huang, Yuexiang Li, Bernard Ghanem, Yefeng Zheng Due to the outstanding capability for data generation, Generative Adversarial Networks (GANs) have attracted considerable attention in unsupervised learning. However, training GANs is difficult, since the training distribution is dynamic for the discriminator, leading to unstable image representation. In this paper, we address the problem of training GANs from a novel perspective, i.e., robust image classification. Motivated by studies on robust image representation, we propose a simple yet effective module, namely AdaptiveMix, for GANs, which shrinks the regions of training data in the image representation space of the discriminator. Considering it is intractable to directly bound feature space, we propose to construct hard samples and narrow down the feature distance between hard and easy samples. The hard samples are constructed by mixing a pair of training images. We evaluate the effectiveness of our AdaptiveMix with widely-used and state-of-the-art GAN architectures. The evaluation results demonstrate that our AdaptiveMix can facilitate the training of GANs and effectively improve the image quality of generated samples. We also show that our AdaptiveMix can be further applied to image classification and Out-Of-Distribution (OOD) detection tasks, by equipping it with state-of-the-art methods. Extensive experiments on seven publicly available datasets show that our method effectively boosts the performance of baselines. The code is publicly available at https://github.com/WentianZhang-ML/AdaptiveMix. This paper introduces AdaptiveMix, a novel module designed to enhance the training stability of Generative Adversarial Networks (GANs) by shrinking the feature space representation within the discriminator. Training GANs is inherently challenging due to the dynamic nature of the training distribution, often leading to unstable image representation and low-quality generated samples. This work addresses this challenge by enhancing the robustness of image representation in the discriminator. AdaptiveMix operates by constructing hard samples through the linear mixing of training images and then minimizes the feature distance between these hard samples and easy (original) training samples. This process effectively shrinks the regions occupied by training data in the discriminator’s feature space, enhancing representation robustness. AdaptiveMix significantly improves the performance of various GAN architectures, including DCGAN and StyleGAN-V2, achieving lower FID scores and generating higher-quality images. The module exhibits effectiveness across different datasets, particularly showcasing substantial improvements when trained on a limited number of samples. Beyond image generation, AdaptiveMix demonstrates applicability to image classification and Out-Of-Distribution (OOD) detection tasks, consistently boosting the performance of baseline models. The paper primarily focuses on linear mixing for hard sample generation; exploring other mixing strategies could be a potential avenue for future work. While the paper provides theoretical analysis connecting AdaptiveMix to Lipschitz continuity under the L1 norm, extending this analysis to other distance metrics could further strengthen the theoretical grounding. generative adversarial networks, image generation, robust image classification, out-of-distribution detection, feature space shrinkage
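A bare-bones version of the mix-and-shrink idea: mix a pair of training images into a "hard" sample and pull its discriminator feature toward the corresponding mix of the "easy" samples' features. The loss form, the L1 distance, and the mixing-coefficient sampling below are illustrative assumptions, not the paper's exact objective.

```python
import torch

def adaptive_mix_loss(feature_extractor, x_a, x_b):
    """Shrink the discriminator's feature space: the feature of a mixed (hard) image
    should stay close to the mix of the original (easy) images' features.
    Illustrative sketch; the exact loss in the paper may differ."""
    lam = torch.rand(x_a.size(0), 1, 1, 1, device=x_a.device)  # per-sample mixing ratio
    x_mix = lam * x_a + (1.0 - lam) * x_b                      # constructed hard sample
    f_a, f_b = feature_extractor(x_a), feature_extractor(x_b)
    f_mix = feature_extractor(x_mix)
    target = lam.view(-1, 1) * f_a + (1.0 - lam.view(-1, 1)) * f_b
    return (f_mix - target).abs().mean()  # L1 distance between hard and easy features

# Demo with a dummy feature extractor standing in for the discriminator's feature head:
dummy = lambda x: x.mean(dim=(2, 3))
loss = adaptive_mix_loss(dummy, torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64))
# In GAN training this term would be added to the discriminator loss with some weight.
```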
2303.01494 Report Image as Set of Points Xu Ma, Yuqian Zhou, Huan Wang, Can Qin, Bin Sun, Chang Liu, Yun Fu What is an image and how to extract latent features? Convolutional Networks (ConvNets) consider an image as organized pixels in a rectangular shape and extract features via convolutional operation in local region; Vision Transformers (ViTs) treat an image as a sequence of patches and extract features via attention mechanism in a global range. In this work, we introduce a straightforward and promising paradigm for visual representation, which is called Context Clusters. Context clusters (CoCs) view an image as a set of unorganized points and extract features via simplified clustering algorithm. In detail, each point includes the raw feature (e.g., color) and positional information (e.g., coordinates), and a simplified clustering algorithm is employed to group and extract deep features hierarchically. Our CoCs are convolution- and attention-free, and only rely on clustering algorithm for spatial interaction. Owing to the simple design, we show CoCs endow gratifying interpretability via the visualization of clustering process. Our CoCs aim at providing a new perspective on image and visual representation, which may enjoy broad applications in different domains and exhibit profound insights. Even though we are not targeting SOTA performance, COCs still achieve comparable or even better results than ConvNets or ViTs on several benchmarks. Codes are available at: https://github.com/ma-xu/Context-Cluster. This paper introduces Context Clusters (CoCs), a novel visual representation paradigm that uses a simplified clustering algorithm to extract features from images viewed as sets of unorganized points. This approach offers a new perspective on image understanding and feature extraction, distinct from Convolutional Networks (ConvNets) and Vision Transformers (ViTs). It provides promising interpretability through visualization of the clustering process and demonstrates strong generalization ability across different data domains. CoCs treat images as point clouds, with each point containing color and positional information. A hierarchical clustering algorithm groups these points into clusters, aggregates features within each cluster, and dispatches the aggregated information back to individual points. This process facilitates context-aware feature learning. CoCs achieve comparable or superior performance to ConvNets and ViTs on ImageNet-1K classification, demonstrating the effectiveness of clustering for visual representation. Visualization of the clustering process reveals that CoCs can effectively group semantically similar image regions, highlighting their interpretability. CoCs exhibit strong generalization ability by achieving promising results on 3D point cloud classification (ScanObjectNN), object detection and instance segmentation (MS COCO), and semantic segmentation (ADE20K). The fixed-center clustering strategy, adopted for computational efficiency, may limit the model's ability to capture complex relationships compared to dynamic center updates. The current CoC architecture requires compromises to accommodate the rectangular feature map format of common detection and segmentation heads, potentially limiting its performance for those tasks. visual representation learning, clustering algorithms, image understanding, point cloud analysis, interpretability
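The cluster-aggregate-dispatch step at the heart of CoCs can be sketched as follows: assign each point to its most similar center, average the features inside each cluster, and add the aggregated context back to the member points. Center selection, similarity weighting, and the value/projection layers of the real block are simplified away here.

```python
import torch
import torch.nn.functional as F

def context_cluster_step(points: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
    """points: (N, D) point features; centers: (M, D) cluster-center features.
    Simplified single-head clustering step (no value or projection layers)."""
    sim = F.normalize(points, dim=-1) @ F.normalize(centers, dim=-1).t()  # (N, M) cosine sim
    assign = sim.argmax(dim=-1)                                           # hard assignment
    out = points.clone()
    for m in range(centers.size(0)):
        idx = (assign == m).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue
        aggregated = points[idx].mean(dim=0)   # aggregate features within the cluster
        out[idx] = points[idx] + aggregated    # dispatch the cluster context back to each point
    return out

pts = torch.randn(196, 64)               # e.g. 14x14 image points with 64-dim features
ctrs = pts[torch.randperm(196)[:16]]     # 16 centers sampled from the points (an assumption)
print(context_cluster_step(pts, ctrs).shape)  # torch.Size([196, 64])
```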
2303.01416 Report 3D generation on ImageNet Ivan Skorokhodov, Aliaksandr Siarohin, Yinghao Xu, Jian Ren, Hsin-Ying Lee, Peter Wonka, Sergey Tulyakov Existing 3D-from-2D generators are typically designed for well-curated single-category datasets, where all the objects have (approximately) the same scale, 3D location, and orientation, and the camera always points to the center of the scene. This makes them inapplicable to diverse, in-the-wild datasets of non-alignable scenes rendered from arbitrary camera poses. In this work, we develop a 3D generator with Generic Priors (3DGP): a 3D synthesis framework with more general assumptions about the training data, and show that it scales to very challenging datasets, like ImageNet. Our model is based on three new ideas. First, we incorporate an inaccurate off-the-shelf depth estimator into 3D GAN training via a special depth adaptation module to handle the imprecision. Then, we create a flexible camera model and a regularization strategy for it to learn its distribution parameters during training. Finally, we extend the recent ideas of transferring knowledge from pre-trained classifiers into GANs for patch-wise trained models by employing a simple distillation-based technique on top of the discriminator. It achieves more stable training than the existing methods and speeds up the convergence by at least 40%. We explore our model on four datasets: SDIP Dogs 256x256, SDIP Elephants 256x256, LSUN Horses 256x256, and ImageNet 256x256, and demonstrate that 3DGP outperforms the recent state-of-the-art in terms of both texture and geometry quality. Code and visualizations: https://snap-research.github.io/3dgp. The paper presents 3DGP, a 3D-aware generative model capable of synthesizing diverse, in-the-wild images from datasets like ImageNet, overcoming the limitations of existing models designed for single-category, aligned datasets. Existing 3D-from-2D generators struggle with diverse, non-alignable datasets due to the lack of a single canonical pose and the variability in object scales and camera parameters. This work tackles these challenges to enable 3D synthesis for in-the-wild data. The model incorporates three key novelties: a learnable 'Ball-in-Sphere' camera distribution to handle diverse camera poses, adversarial depth supervision using an off-the-shelf depth estimator with a depth adaptor to guide geometry learning, and knowledge distillation from a pre-trained ResNet50 into the discriminator for improved image fidelity. 3DGP outperforms state-of-the-art 3D-aware generators in image appearance (FID) and geometry quality on non-aligned single-category datasets (SDIP Dogs, SDIP Elephants, LSUN Horses). The model successfully demonstrates multi-categorical 3D synthesis on the challenging ImageNet dataset, producing realistic images and outperforming baselines. Ablation studies demonstrate the effectiveness of each proposed component (learnable camera, adversarial depth supervision, knowledge distillation) in improving geometry and overall generation quality. The visual quality of 3DGP, while exceeding existing 3D generators, is still lower than state-of-the-art 2D generators. The model exhibits background sticking artifacts, potentially due to dataset bias towards frontal views and limitations of the tri-plane representation. 3d synthesis, generative adversarial networks, depth supervision, camera distribution learning, knowledge distillation
2303.01267 Report Token Contrast for Weakly-Supervised Semantic Segmentation Lixiang Ru, Heliang Zheng, Yibing Zhan, Bo Du Weakly-Supervised Semantic Segmentation (WSSS) using image-level labels typically utilizes Class Activation Map (CAM) to generate the pseudo labels. Limited by the local structure perception of CNN, CAM usually cannot identify the integral object regions. Though the recent Vision Transformer (ViT) can remedy this flaw, we observe it also brings the over-smoothing issue, i.e., the final patch tokens tend to become uniform. In this work, we propose Token Contrast (ToCo) to address this issue and further explore the virtue of ViT for WSSS. Firstly, motivated by the observation that intermediate layers in ViT can still retain semantic diversity, we designed a Patch Token Contrast module (PTC). PTC supervises the final patch tokens with the pseudo token relations derived from intermediate layers, allowing them to align the semantic regions and thus yield more accurate CAM. Secondly, to further differentiate the low-confidence regions in CAM, we devised a Class Token Contrast module (CTC) inspired by the fact that class tokens in ViT can capture high-level semantics. CTC facilitates the representation consistency between uncertain local regions and global objects by contrasting their class tokens. Experiments on the PASCAL VOC and MS COCO datasets show the proposed ToCo can remarkably surpass other single-stage competitors and achieve comparable performance with state-of-the-art multi-stage methods. Code is available at https://github.com/rulixiang/ToCo. This paper proposes Token Contrast (ToCo), a novel approach for weakly-supervised semantic segmentation (WSSS) that leverages Vision Transformer (ViT) and addresses the over-smoothing issue inherent in ViT. WSSS with image-level labels typically relies on Class Activation Map (CAM), but existing methods using CNNs or ViTs have limitations in accurately identifying integral object regions due to local structure perception or over-smoothing. ToCo introduces two novel modules: Patch Token Contrast (PTC) and Class Token Contrast (CTC). PTC utilizes intermediate layer knowledge to supervise and diversify final patch tokens, mitigating over-smoothing. CTC contrasts global and local class tokens to enhance representation consistency between less discriminative and global object regions. ToCo significantly outperforms state-of-the-art single-stage WSSS methods on PASCAL VOC and MS COCO datasets. The proposed method achieves comparable results to multi-stage WSSS methods while only using image-level labels. Extensive ablation studies validate the effectiveness of PTC and CTC in addressing over-smoothing and improving CAM quality. The paper mainly evaluates ToCo on natural image datasets, and its generalization to other domains is not extensively studied. The computational cost of ViT, especially for larger variants, may be a limitation for real-time applications. weakly-supervised semantic segmentation, vision transformer, over-smoothing, class activation map, token contrast
2303.01237 Report FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation Xiaoyu Shi, Zhaoyang Huang, Dasong Li, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, Jifeng Dai, Hongsheng Li FlowFormer introduces a transformer architecture into optical flow estimation and achieves state-of-the-art performance. The core component of FlowFormer is the transformer-based cost-volume encoder. Inspired by the recent success of masked autoencoding (MAE) pretraining in unleashing transformers' capacity of encoding visual representation, we propose Masked Cost Volume Autoencoding (MCVA) to enhance FlowFormer by pretraining the cost-volume encoder with a novel MAE scheme. Firstly, we introduce a block-sharing masking strategy to prevent masked information leakage, as the cost maps of neighboring source pixels are highly correlated. Secondly, we propose a novel pre-text reconstruction task, which encourages the cost-volume encoder to aggregate long-range information and ensures pretraining-finetuning consistency. We also show how to modify the FlowFormer architecture to accommodate masks during pretraining. Pretrained with MCVA, FlowFormer++ ranks 1st among published methods on both Sintel and KITTI-2015 benchmarks. Specifically, FlowFormer++ achieves 1.07 and 1.94 average end-point error (AEPE) on the clean and final pass of Sintel benchmark, leading to 7.76\% and 7.18\% error reductions from FlowFormer. FlowFormer++ obtains 4.52 F1-all on the KITTI-2015 test set, improving FlowFormer by 0.16. This paper proposes Masked Cost Volume Autoencoding (MCVA), a self-supervised pretraining scheme to enhance the cost-volume encoding of FlowFormer for better optical flow estimation. Pretraining transformers on large datasets is crucial for optical flow estimation, and MCVA enables pretraining of the FlowFormer's cost-volume encoder for improved performance. The paper introduces block-sharing masking to prevent information leakage and proposes a novel pre-text reconstruction task mimicking the FlowFormer's decoding process to ensure pretraining-finetuning consistency. FlowFormer++ with MCVA ranks 1st among published methods on Sintel and KITTI-2015 benchmarks. It achieves 1.07 and 1.94 AEPE on Sintel clean and final pass, a 7.76% and 7.18% error reduction from FlowFormer. On KITTI-2015, it obtains 4.52 F1-all, improving FlowFormer by 0.16 and outperforming the previous best model S-Flow by 0.12. The pretraining process requires large-scale video datasets like YouTube-VOS. Further investigation into more efficient pretraining strategies for optical flow estimation is needed. optical flow estimation, transformer, self-supervised learning, masked autoencoding, pretraining
2303.01091 Report OPE-SR: Orthogonal Position Encoding for Designing a Parameter-free Upsampling Module in Arbitrary-scale Image Super-Resolution Gaochao Song, Luo Zhang, Ran Su, Jianfeng Shi, Ying He, Qian Sun Implicit neural representation (INR) is a popular approach for arbitrary-scale image super-resolution (SR); as a key component of INR, position encoding improves its representation ability. Motivated by position encoding, we propose orthogonal position encoding (OPE) - an extension of position encoding - and an OPE-Upscale module to replace the INR-based upsampling module for arbitrary-scale image super-resolution. As with INR, our OPE-Upscale Module takes 2D coordinates and latent code as inputs; however, it does not require training parameters. This parameter-free feature allows the OPE-Upscale Module to directly perform linear combination operations to reconstruct an image in a continuous manner, achieving an arbitrary-scale image reconstruction. As a concise SR framework, our method has high computing efficiency and consumes less memory compared to the state-of-the-art (SOTA), which has been confirmed by extensive experiments and evaluations. In addition, our method has comparable results with SOTA in arbitrary scale image super-resolution. Last but not least, we show that OPE corresponds to a set of orthogonal basis functions, justifying our design principle. This paper proposes Orthogonal Position Encoding (OPE), a novel position encoding method inspired by 2D-Fourier Series, and uses it to design a parameter-free upsampling module (OPE-Upscale) for arbitrary-scale image super-resolution. Existing INR-based upsampling modules for arbitrary-scale SR increase network complexity and suffer from limitations in learning symmetric features. This work aims to address these issues by simplifying the SR framework and providing an interpretable image representation. OPE represents continuous image patches as linear combinations of orthogonal basis functions derived from 2D-Fourier Series. The OPE-Upscale module utilizes these basis functions and latent codes extracted from a feature map to reconstruct target image pixels at arbitrary scales. Patch ensemble is introduced to ensure seamless stitching of reconstructed patches. The proposed OPE method achieves comparable image super-resolution performance to state-of-the-art methods, with significantly reduced computational complexity and memory consumption. OPE-Upscale module demonstrates superior time efficiency, especially for larger scale factors, compared to INR-based counterparts. OPE effectively addresses the flipping consistency problem observed in INR-based methods, producing accurate symmetrical outputs for flipped inputs. The performance of OPE slightly degrades for low scale factors due to the simplified representation of larger grid regions in the continuous 2D domain. Future work will explore sampling strategies to enhance OPE's performance at low scale factors without significantly compromising its efficiency. Additionally, exploring other orthogonal basis functions, like Legendre or Chebyshev polynomials, for position encoding is of interest. image super-resolution, arbitrary-scale, position encoding, orthogonal basis, parameter-free
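A hedged sketch of the "linear combination of orthogonal basis functions" idea: with a separable 2D Fourier-style basis, the continuous patch value at a coordinate is just the latent code dotted with the basis evaluated there, so upsampling needs no learned parameters. The symbols below are illustrative, not the paper's exact definition.

```latex
% Illustrative OPE-style reconstruction: the latent code z supplies the coefficients
% of a separable, orthogonal 2D Fourier-type basis (notation is a sketch, not the paper's).
\begin{equation}
s(x, y) \;=\; \sum_{k=0}^{K}\sum_{l=0}^{K} z_{kl}\,\varphi_k(x)\,\varphi_l(y),
\qquad
\varphi_0(t) = 1,\;\; \varphi_{2n-1}(t) = \cos(n\pi t),\;\; \varphi_{2n}(t) = \sin(n\pi t),
\end{equation}
% so rendering at any target resolution reduces to evaluating the basis at the query
% coordinates and taking a parameter-free linear combination with the latent code.
```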
2303.00848 Report Understanding Diffusion Objectives as the ELBO with Simple Data Augmentation Diederik P. Kingma, Ruiqi Gao To achieve the highest perceptual quality, state-of-the-art diffusion models are optimized with objectives that typically look very different from the maximum likelihood and the Evidence Lower Bound (ELBO) objectives. In this work, we reveal that diffusion model objectives are actually closely related to the ELBO. Specifically, we show that all commonly used diffusion model objectives equate to a weighted integral of ELBOs over different noise levels, where the weighting depends on the specific objective used. Under the condition of monotonic weighting, the connection is even closer: the diffusion objective then equals the ELBO, combined with simple data augmentation, namely Gaussian noise perturbation. We show that this condition holds for a number of state-of-the-art diffusion models. In experiments, we explore new monotonic weightings and demonstrate their effectiveness, achieving state-of-the-art FID scores on the high-resolution ImageNet benchmark. This paper reveals a close relationship between various diffusion model objectives and the Evidence Lower Bound (ELBO), showing they are equivalent to a weighted integral of ELBOs over different noise levels. This connection provides a deeper understanding of diffusion models and their relationship to traditional likelihood-based generative models. The authors analyze the weighted diffusion loss, generalizing various objectives by expressing them as special cases with specific weighting functions. They prove that monotonic weighting functions lead to equivalence with the ELBO combined with Gaussian noise perturbation. All commonly used diffusion model objectives can be expressed as a weighted integral of ELBOs over noise levels. Monotonic weighting functions in diffusion objectives equate to maximizing the ELBO with Gaussian noise data augmentation. Experiments on ImageNet using novel monotonic weighting functions achieve state-of-the-art FID scores for high-resolution image generation. Empirical results can be sensitive to hyperparameter choices and may require re-tuning for different datasets or resolutions. Future work includes comparing diffusion models to other likelihood-based models using the established equivalence for a more comprehensive evaluation. diffusion models, generative models, evidence lower bound (elbo), data augmentation, image generation
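The abstract's central claim can be written schematically as below; signs, constants, and the exact noise-level parametrization are glossed over, so treat this as a paraphrase of the statement rather than the paper's theorem.

```latex
% Schematic form of the paper's main identity: a weighted diffusion loss is a weighted
% integral of per-noise-level ELBOs over noise levels \lambda, and for a monotonic
% weighting w it reduces (up to constants) to the expected ELBO of noise-perturbed data,
% i.e. the ELBO combined with Gaussian-noise data augmentation.
\begin{align}
\mathcal{L}_w(\mathbf{x}) \;&\propto\; \int w(\lambda)\,
    \frac{\mathrm{d}}{\mathrm{d}\lambda}\Bigl[-\,\mathrm{ELBO}_\lambda(\mathbf{x})\Bigr]\,\mathrm{d}\lambda
    \;+\; \text{const},\\
\text{monotonic } w:\quad
\mathcal{L}_w(\mathbf{x}) \;&\propto\; \mathbb{E}_{\lambda \sim p_w}\Bigl[-\,\mathrm{ELBO}_\lambda(\mathbf{x}_\lambda)\Bigr]
    \;+\; \text{const},
\qquad \mathbf{x}_\lambda = \text{Gaussian-noise-perturbed } \mathbf{x}.
\end{align}
```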
2303.00748 Report Efficient and Explicit Modelling of Image Hierarchies for Image Restoration Yawei Li, Yuchen Fan, Xiaoyu Xiang, Denis Demandolx, Rakesh Ranjan, Radu Timofte, Luc Van Gool The aim of this paper is to propose a mechanism to efficiently and explicitly model image hierarchies in the global, regional, and local range for image restoration. To achieve that, we start by analyzing two important properties of natural images including cross-scale similarity and anisotropic image features. Inspired by that, we propose the anchored stripe self-attention which achieves a good balance between the space and time complexity of self-attention and the modelling capacity beyond the regional range. Then we propose a new network architecture dubbed GRL to explicitly model image hierarchies in the Global, Regional, and Local range via anchored stripe self-attention, window self-attention, and channel attention enhanced convolution. Finally, the proposed network is applied to 7 image restoration types, covering both real and synthetic settings. The proposed method sets the new state-of-the-art for several of those. Code will be available at https://github.com/ofsoundof/GRL-Image-Restoration.git. This paper presents GRL, a transformer network for image restoration that efficiently models image hierarchies in the global, regional, and local ranges. Natural images exhibit features at various scales. Explicitly modelling these hierarchical dependencies is crucial for high-quality image restoration, especially with increasing image resolutions. The authors propose anchored stripe self-attention, inspired by cross-scale similarity and anisotropic image features, to efficiently capture long-range dependencies. This mechanism is integrated with window self-attention and channel attention enhanced convolutions in the GRL architecture. GRL achieves state-of-the-art performance on various image restoration tasks, including denoising, super-resolution, deblurring, and JPEG artifact removal. The method shows significant PSNR improvements over previous state-of-the-art methods, such as Restormer, on datasets like GoPro and RealBlur-R. A tiny version of the network, GRL-T, demonstrates high efficiency with significantly reduced model complexity while maintaining competitive accuracy. Theoretical guarantees for similarity propagation in the anchored self-attention mechanism require further investigation. Future work can explore the application of GRL to other image restoration tasks and more complex degradation scenarios. image restoration, transformer networks, self-attention, cross-scale similarity, anisotropic image features
2303.00521 Report Quality-aware Pre-trained Models for Blind Image Quality Assessment Kai Zhao, Kun Yuan, Ming Sun, Mading Li, Xing Wen Blind image quality assessment (BIQA) aims to automatically evaluate the perceived quality of a single image, whose performance has been improved by deep learning-based methods in recent years. However, the paucity of labeled data somewhat restrains deep learning-based BIQA methods from unleashing their full potential. In this paper, we propose to solve the problem by a pretext task customized for BIQA in a self-supervised learning manner, which enables learning representations from orders of magnitude more data. To constrain the learning process, we propose a quality-aware contrastive loss based on a simple assumption: the quality of patches from a distorted image should be similar, but vary from patches from the same image with different degradations and patches from different images. Further, we improve the existing degradation process and form a degradation space with the size of roughly $2\times10^7$. After pre-trained on ImageNet using our method, models are more sensitive to image quality and perform significantly better on downstream BIQA tasks. Experimental results show that our method obtains remarkable improvements on popular BIQA datasets. This paper proposes Quality-aware Pre-Trained (QPT) models for Blind Image Quality Assessment (BIQA) to address the challenge of limited labeled data by utilizing self-supervised learning on a massive scale. Existing BIQA datasets are too small to fully leverage the power of deep learning. This paper aims to overcome this limitation and improve the performance of BIQA models. The paper introduces a novel self-supervised learning framework based on MoCoV2. It involves a complex degradation process with shuffle order, high-order, and skip operations to generate diverse distorted images. A quality-aware contrastive loss distinguishes between patches with varying perceptual qualities, enabling the model to learn quality-aware representations. QPT models significantly outperform state-of-the-art BIQA methods on five benchmark datasets. QPT models demonstrate strong generalization ability and can be easily integrated with existing methods by replacing pre-trained weights. The degradation space and the quality-aware contrastive loss are crucial for the effectiveness of QPT. Larger-scale datasets like JFT-300M could further enhance QPT's performance. Exploring the trade-off between model capacity and pre-training time is crucial for practical applications. blind image quality assessment, self-supervised learning, contrastive learning, image degradation modeling, pre-training
2303.00404 Report Distilled Reverse Attention Network for Open-world Compositional Zero-Shot Learning Yun Li, Zhe Liu, Saurav Jha, Sally Cripps, Lina Yao Open-World Compositional Zero-Shot Learning (OW-CZSL) aims to recognize new compositions of seen attributes and objects. In OW-CZSL, methods built on the conventional closed-world setting degrade severely due to the unconstrained OW test space. While previous works alleviate the issue by pruning compositions according to external knowledge or correlations in seen pairs, they introduce biases that harm the generalization. Some methods thus predict state and object with independently constructed and trained classifiers, ignoring that attributes are highly context-dependent and visually entangled with objects. In this paper, we propose a novel Distilled Reverse Attention Network to address the challenges. We also model attributes and objects separately but with different motivations, capturing contextuality and locality, respectively. We further design a reverse-and-distill strategy that learns disentangled representations of elementary components in training data supervised by reverse attention and knowledge distillation. We conduct experiments on three datasets and consistently achieve state-of-the-art (SOTA) performance. This paper proposes DRANet for Open-World Compositional Zero-Shot Learning, which disentangles visual primitives of attributes and objects using a novel reverse-and-distill strategy. OW-CZSL, aiming to recognize unseen compositions of seen elements, is challenging due to the unconstrained output space. Existing methods either suffer from biases introduced by external knowledge or fail to address the context-dependent nature of attributes and visual entanglement. DRANet utilizes non-local attention for attributes to capture context and local attention for objects to enhance locality. It then leverages reverse attention and knowledge distillation to disentangle attribute and object features for improved generalization. DRANet achieves state-of-the-art performance on three benchmark datasets (MIT-States, UT-Zappos, C-GQA). The proposed reverse-and-distill strategy effectively disentangles attribute and object embeddings, improving recognition of unseen compositions. Employing different feature extractors tailored for attributes and objects, considering their distinct characteristics, further benefits the model's performance. Reverse attention might cause focal confusion or lead to inconsistencies between the predicted attributes and objects. Future work includes extending the disentanglement strategy to multi-object recognition and exploring alternative disentanglement methods to address limitations. compositional zero-shot learning, open-world learning, disentanglement, reverse attention, knowledge distillation
2303.00354 Report Unlimited-Size Diffusion Restoration Yinhuai Wang, Jiwen Yu, Runyi Yu, Jian Zhang Recently, using diffusion models for zero-shot image restoration (IR) has become a new hot paradigm. This type of method only needs to use the pre-trained off-the-shelf diffusion models, without any finetuning, and can directly handle various IR tasks. The upper limit of the restoration performance depends on the pre-trained diffusion models, which are in rapid evolution. However, current methods only discuss how to deal with fixed-size images, but dealing with images of arbitrary sizes is very important for practical applications. This paper focuses on how to use those diffusion-based zero-shot IR methods to deal with any size while maintaining the excellent characteristics of zero-shot. A simple way to solve arbitrary size is to divide it into fixed-size patches and solve each patch independently. But this may yield significant artifacts since it neither considers the global semantics of all patches nor the local information of adjacent patches. Inspired by the Range-Null space Decomposition, we propose the Mask-Shift Restoration to address local incoherence and propose the Hierarchical Restoration to alleviate out-of-domain issues. Our simple, parameter-free approaches can be used not only for image restoration but also for image generation of unlimited sizes, with the potential to be a general tool for diffusion models. Code: https://github.com/wyhuai/DDNM/tree/main/hq_demo This paper proposes two parameter-free methods, Mask-Shift Restoration (MSR) and Hierarchical Restoration (HiR), to enable diffusion-based zero-shot image restoration methods to handle images of unlimited size. Existing diffusion-based zero-shot image restoration methods primarily focus on fixed-size images, limiting their practical application in real-world scenarios where desired output sizes can vary. MSR addresses local incoherence by processing the image in overlapping patches and using restored regions as constraints. HiR tackles out-of-domain issues by first restoring a low-resolution version of the image, then using it as a global prior for the final restoration. MSR effectively eliminates boundary artifacts when processing large images in patches. HiR significantly improves the semantic correctness of the restored images, especially in large-scale inpainting and super-resolution tasks. Both MSR and HiR are parameter-free, training-free, and can be flexibly combined and applied to various diffusion models and zero-shot restoration methods. The proposed methods have a higher computational cost compared to supervised methods. Performance relies on the pre-trained diffusion models, limiting their effectiveness for tasks where suitable models are unavailable. image restoration, diffusion models, zero-shot learning, unlimited size, range-null space decomposition
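The Mask-Shift idea can be sketched as a sliding-window loop in which the already-restored overlap region is handed back to the restorer as a hard constraint. In the sketch below, restore_patch is a placeholder for one diffusion-based zero-shot restoration run on a fixed-size crop honoring a keep-mask; it is not an API from the paper's code, and the tiling assumes the image size aligns with the patch grid.

```python
import numpy as np

def mask_shift_restore(image, restore_patch, patch=512, overlap=128):
    """Restore an arbitrarily large image with a fixed-size restorer.
    restore_patch(crop, keep_mask) must return a restored crop that leaves pixels where
    keep_mask is True unchanged (they act as the coherence constraint between windows).
    Placeholder interface; H and W are assumed to align with the patch/stride grid."""
    H, W, C = image.shape
    out = np.zeros_like(image)
    done = np.zeros((H, W), dtype=bool)
    step = patch - overlap
    for top in range(0, max(H - patch, 0) + 1, step):
        for left in range(0, max(W - patch, 0) + 1, step):
            # Feed already-restored pixels where available, degraded pixels elsewhere.
            crop = np.where(done[top:top+patch, left:left+patch, None],
                            out[top:top+patch, left:left+patch],
                            image[top:top+patch, left:left+patch])
            keep = done[top:top+patch, left:left+patch]
            out[top:top+patch, left:left+patch] = restore_patch(crop, keep)
            done[top:top+patch, left:left+patch] = True
    return out
```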
2303.00165 Report Diffusion Probabilistic Fields Peiye Zhuang, Samira Abnar, Jiatao Gu, Alex Schwing, Joshua M. Susskind, Miguel Ángel Bautista Diffusion probabilistic models have quickly become a major approach for generative modeling of images, 3D geometry, video and other domains. However, to adapt diffusion generative modeling to these domains the denoising network needs to be carefully designed for each domain independently, oftentimes under the assumption that data lives in a Euclidean grid. In this paper we introduce Diffusion Probabilistic Fields (DPF), a diffusion model that can learn distributions over continuous functions defined over metric spaces, commonly known as fields. We extend the formulation of diffusion probabilistic models to deal with this field parametrization in an explicit way, enabling us to define an end-to-end learning algorithm that side-steps the requirement of representing fields with latent vectors as in previous approaches (Dupont et al., 2022a; Du et al., 2021). We empirically show that, while using the same denoising network, DPF effectively deals with different modalities like 2D images and 3D geometry, in addition to modeling distributions over fields defined on non-Euclidean metric spaces. This paper introduces Diffusion Probabilistic Fields (DPF), a novel diffusion model capable of learning distributions over continuous functions defined on metric spaces (fields), unifying generative modeling across different data domains. Existing diffusion models often assume data lies on a grid and require domain-specific denoising networks. DPF overcomes these limitations by unifying data representation as fields and enabling a single model to handle diverse domains. DPF uses an explicit field parameterization with context and query pairs, employing a PerceiverIO architecture as the score field network. This allows continuous evaluation and efficient handling of large numbers of context and query pairs during training and inference. DPF demonstrates compelling generative performance on diverse domains like images (CelebA-HQ, CIFAR-10), 3D geometry (ShapeNet), and spherical data, outperforming existing domain-agnostic methods. The explicit field parameterization enables end-to-end learning, surpassing the performance of two-stage approaches that rely on latent representations. DPF exhibits resolution-free generation capabilities, allowing for sampling at different resolutions than seen during training. The computational cost of the score network can be prohibitive for high-resolution data, necessitating further exploration of efficient transformer architectures. Sampling in DPF, similar to other diffusion models, requires iterating over all timesteps, leading to slower inference compared to GANs. Investigating faster sampling techniques while maintaining sample quality is crucial. diffusion models, generative modeling, fields, perceiverio, domain-agnostic
2303.00157 Report Semi-supervised Parametric Real-world Image Harmonization Ke Wang, Michaël Gharbi, He Zhang, Zhihao Xia, Eli Shechtman Learning-based image harmonization techniques are usually trained to undo synthetic random global transformations applied to a masked foreground in a single ground truth photo. This simulated data does not model many of the important appearance mismatches (illumination, object boundaries, etc.) between foreground and background in real composites, leading to models that do not generalize well and cannot model complex local changes. We propose a new semi-supervised training strategy that addresses this problem and lets us learn complex local appearance harmonization from unpaired real composites, where foreground and background come from different images. Our model is fully parametric. It uses RGB curves to correct the global colors and tone and a shading map to model local variations. Our method outperforms previous work on established benchmarks and real composites, as shown in a user study, and processes high-resolution images interactively. This paper introduces a novel semi-supervised dual-stream training strategy for real-world image harmonization, addressing limitations of existing methods trained on synthetic data. Existing methods struggle to generalize to real-world composites due to the domain gap between synthetic training data and real composites, which exhibit complex appearance mismatches. The proposed method alternates between supervised training on artist-retouched image pairs and unsupervised adversarial training on unpaired real composites. A parametric model with global RGB curves and a local shading map is employed for efficient and high-resolution processing. Outperforms state-of-the-art methods on iHarmony benchmark and real composite datasets. User study confirms superior performance on real-world composites. Enables local tonal adjustments, unlike previous methods limited to global corrections. The method's generalization to a wider range of image harmonization operations beyond color and shading is yet to be explored. Future work could focus on incorporating more attributes into the model to further enhance realism. image harmonization, semi-supervised learning, adversarial training, parametric model, shading correction
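The parametric model lends itself to a compact sketch: global per-channel tone curves sampled at a few control points plus a multiplicative low-resolution shading map. The curve and shading parameterization below is a simplification for illustration, not the paper's exact layer.

```python
import numpy as np

def apply_rgb_curves(img, curves):
    """img: (H, W, 3) in [0, 1]; curves: (3, K) target values at K uniform knots."""
    knots = np.linspace(0.0, 1.0, curves.shape[1])
    out = np.empty_like(img)
    for c in range(3):
        out[..., c] = np.interp(img[..., c], knots, curves[c])  # per-channel tone curve
    return out

def apply_shading(img, shading_lowres):
    """Nearest-neighbour upsample a coarse shading map and multiply it in."""
    h, w, _ = img.shape
    ys = np.linspace(0, shading_lowres.shape[0] - 1, h).astype(int)
    xs = np.linspace(0, shading_lowres.shape[1] - 1, w).astype(int)
    return np.clip(img * shading_lowres[np.ix_(ys, xs)][..., None], 0.0, 1.0)

foreground = np.random.rand(128, 128, 3)
curves = np.sort(np.random.rand(3, 8), axis=1)  # monotone global curves
shading = 0.5 + np.random.rand(16, 16)          # coarse local shading map
print(apply_shading(apply_rgb_curves(foreground, curves), shading).shape)
```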
2302.14859 Report BakedSDF: Meshing Neural SDFs for Real-Time View Synthesis Lior Yariv, Peter Hedman, Christian Reiser, Dor Verbin, Pratul P. Srinivasan, Richard Szeliski, Jonathan T. Barron, Ben Mildenhall We present a method for reconstructing high-quality meshes of large unbounded real-world scenes suitable for photorealistic novel view synthesis. We first optimize a hybrid neural volume-surface scene representation designed to have well-behaved level sets that correspond to surfaces in the scene. We then bake this representation into a high-quality triangle mesh, which we equip with a simple and fast view-dependent appearance model based on spherical Gaussians. Finally, we optimize this baked representation to best reproduce the captured viewpoints, resulting in a model that can leverage accelerated polygon rasterization pipelines for real-time view synthesis on commodity hardware. Our approach outperforms previous scene representations for real-time rendering in terms of accuracy, speed, and power consumption, and produces high quality meshes that enable applications such as appearance editing and physical simulation. BakedSDF presents a method for reconstructing high-quality meshes of large unbounded real-world scenes suitable for photorealistic novel view synthesis, enabling real-time rendering on commodity hardware. Existing NeRF-based methods struggle to balance high-quality reconstruction with real-time rendering capabilities, especially on commodity hardware. BakedSDF addresses this by baking a neural volumetric representation into an efficiently renderable mesh. BakedSDF utilizes a hybrid neural volume-surface representation optimized in contracted coordinate space. This representation is then baked into a high-quality triangle mesh, equipped with a view-dependent appearance model based on spherical Gaussians, and fine-tuned to reproduce captured viewpoints. Outperforms previous scene representations for real-time rendering in terms of accuracy, speed, and power consumption. Produces high-quality meshes suitable for applications like appearance editing and physical simulation. Demonstrates that spherical Gaussians are a practical representation for view-dependent appearance in view synthesis. Limitations in representing semi-transparent content and scenes with small or detailed geometry due to the use of a fully opaque mesh. The output meshes have a significant on-disk footprint, posing potential storage and streaming challenges. neural radiance fields, signed distance function, surface reconstruction, real-time rendering, view synthesis
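As a worked example of the view-dependent appearance model named above, the sketch below evaluates a diffuse color plus a few spherical Gaussian lobes against the view direction; the lobe count and exact parameterization are illustrative assumptions.

```python
import numpy as np

def sg_color(view_dir, diffuse, lobe_axes, lobe_sharpness, lobe_colors):
    """view_dir: (3,); lobe_axes: (N, 3) unit axes; lobe_sharpness: (N,);
    lobe_colors: (N, 3). Returns the RGB color seen from view_dir."""
    d = view_dir / np.linalg.norm(view_dir)
    cos = lobe_axes @ d
    weights = np.exp(lobe_sharpness * (cos - 1.0))  # spherical Gaussian lobe responses
    return np.clip(diffuse + weights @ lobe_colors, 0.0, 1.0)

rng = np.random.default_rng(0)
axes = rng.normal(size=(3, 3))
axes /= np.linalg.norm(axes, axis=1, keepdims=True)
print(sg_color(np.array([0.0, 0.0, 1.0]), rng.random(3) * 0.5,
               axes, rng.random(3) * 10.0, rng.random((3, 3)) * 0.5))
```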
2302.14771 Report Generic-to-Specific Distillation of Masked Autoencoders Wei Huang, Zhiliang Peng, Li Dong, Furu Wei, Jianbin Jiao, Qixiang Ye Large vision Transformers (ViTs) driven by self-supervised pre-training mechanisms achieved unprecedented progress. Lightweight ViT models limited by the model capacity, however, benefit little from those pre-training mechanisms. Knowledge distillation defines a paradigm to transfer representations from large (teacher) models to small (student) ones. However, the conventional single-stage distillation easily gets stuck on task-specific transfer, failing to retain the task-agnostic knowledge crucial for model generalization. In this study, we propose generic-to-specific distillation (G2SD), to tap the potential of small ViT models under the supervision of large models pre-trained by masked autoencoders. In generic distillation, decoder of the small model is encouraged to align feature predictions with hidden representations of the large model, so that task-agnostic knowledge can be transferred. In specific distillation, predictions of the small model are constrained to be consistent with those of the large model, to transfer task-specific features which guarantee task performance. With G2SD, the vanilla ViT-Small model respectively achieves 98.7%, 98.1% and 99.3% the performance of its teacher (ViT-Base) for image classification, object detection, and semantic segmentation, setting a solid baseline for two-stage vision distillation. Code will be available at https://github.com/pengzhiliang/G2SD. This paper introduces Generic-to-Specific Distillation (G2SD), a two-stage knowledge distillation approach for lightweight Vision Transformers (ViTs), transferring both task-agnostic and task-specific knowledge from large masked autoencoder pre-trained models. Lightweight ViTs struggle to benefit from self-supervised pre-training methods like Masked Image Modeling (MIM), limiting their performance. G2SD addresses this by effectively transferring knowledge from larger, MIM-pretrained teachers, bridging the performance gap with CNNs in resource-constrained settings. G2SD uses two stages: 1) **Generic Distillation:** Aligns student decoder feature predictions with hidden representations of the teacher's decoder from a pre-trained MAE, transferring task-agnostic knowledge. 2) **Specific Distillation:** Fine-tunes the student on a specific task using a fine-tuned teacher MAE, transferring task-specific knowledge via consistent prediction. Vanilla ViT-Small with G2SD achieves 98.7% the top-1 accuracy of its teacher (ViT-Base) on ImageNet-1k. G2SD surpasses single-stage distillation counterparts and competing methods in object detection and semantic segmentation tasks, demonstrating strong generalization ability. The method proves effective for lightweight ViTs, pushing their performance to a new height and establishing a solid baseline for two-stage vision model distillation. The study primarily focuses on transferring knowledge from MAE-pretrained teachers; exploring other MIM methods could further enhance performance. Investigating the impact of varying teacher-student model size ratios and more efficient distillation strategies remains for future work. knowledge distillation, vision transformers, masked image modeling, self-supervised learning, lightweight models
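The two stages can be summarized as two losses, sketched below under assumed loss choices (smooth L1 for the generic feature alignment, temperature-scaled KL plus cross-entropy for the specific stage); the projection head and hyperparameters are illustrative, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def generic_distill_loss(student_feats, teacher_feats, proj):
    """student_feats: (B, N, Ds); teacher_feats: (B, N, Dt); proj maps Ds -> Dt."""
    return F.smooth_l1_loss(proj(student_feats), teacher_feats.detach())

def specific_distill_loss(student_logits, teacher_logits, labels, tau=1.0, alpha=0.5):
    kd = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                  F.softmax(teacher_logits.detach() / tau, dim=-1),
                  reduction="batchmean") * tau * tau
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

proj = torch.nn.Linear(384, 768)  # ViT-Small features -> ViT-Base width
print(generic_distill_loss(torch.randn(2, 196, 384), torch.randn(2, 196, 768), proj).item(),
      specific_distill_loss(torch.randn(2, 10), torch.randn(2, 10), torch.randint(0, 10, (2,))).item())
```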
2302.14736 Report TextIR: A Simple Framework for Text-based Editable Image Restoration Yunpeng Bai, Cairong Wang, Shuzhao Xie, Chao Dong, Chun Yuan, Zhi Wang Most existing image restoration methods use neural networks to learn strong image-level priors from huge data to estimate the lost information. However, these works still struggle in cases when images have severe information deficits. Introducing external priors or using reference images to provide information also have limitations in the application domain. In contrast, text input is more readily available and provides information with higher flexibility. In this work, we design an effective framework that allows the user to control the restoration process of degraded images with text descriptions. We use the text-image feature compatibility of the CLIP to alleviate the difficulty of fusing text and image features. Our framework can be used for various image restoration tasks, including image inpainting, image super-resolution, and image colorization. Extensive experiments demonstrate the effectiveness of our method. This paper presents TextIR, a novel framework for text-based editable image restoration leveraging the text-image feature compatibility of CLIP. Existing image restoration methods struggle with severe information deficits, and while external priors or reference images can help, they have limitations. Text input offers a more flexible and accessible alternative. TextIR utilizes CLIP's shared embedding space to train a generator that takes degraded images and text descriptions as input. During training, ground truth images are translated into CLIP image embeddings to simulate text conditions. The generator incorporates multi-level features from the degraded image and modulates them with text-derived style codes. TextIR outperforms a diffusion-based method in text-guided inpainting, producing more natural and realistic results. The framework effectively colorizes grayscale images based on text descriptions, demonstrating accurate target localization and color matching. In super-resolution, TextIR surpasses a blind face restoration method, generating clearer results consistent with the provided text. The current implementation of TextIR relies on CLIP's pre-trained knowledge and may not generalize well to unseen concepts or domains. Future work could explore alternative text-image fusion mechanisms or incorporate additional constraints for improved control over the restoration process. image restoration, text-guided image editing, clip, image inpainting, super-resolution
2302.14728 Report Global Context-Aware Person Image Generation Prasun Roy, Saumik Bhattacharya, Subhankar Ghosh, Umapada Pal, Michael Blumenstein We propose a data-driven approach for context-aware person image generation. Specifically, we attempt to generate a person image such that the synthesized instance can blend into a complex scene. In our method, the position, scale, and appearance of the generated person are semantically conditioned on the existing persons in the scene. The proposed technique is divided into three sequential steps. At first, we employ a Pix2PixHD model to infer a coarse semantic mask that represents the new person's spatial location, scale, and potential pose. Next, we use a data-centric approach to select the closest representation from a precomputed cluster of fine semantic masks. Finally, we adopt a multi-scale, attention-guided architecture to transfer the appearance attributes from an exemplar image. The proposed strategy enables us to synthesize semantically coherent realistic persons that can blend into an existing scene without altering the global context. We conclude our findings with relevant qualitative and quantitative evaluations. This paper proposes a data-driven approach for generating person images that blend seamlessly into complex scenes, considering the global context of existing people. Existing person image generation methods often produce unrealistic results due to their reliance on local attributes and neglect of global contextual information. The proposed method uses a three-stage approach: (1) Estimating the target person's location and pose with a Pix2PixHD model, (2) Refining the semantic map using a data-driven approach with a clustered knowledge base, (3) Rendering the refined map by transferring appearance attributes from an exemplar image. The proposed method generates semantically coherent and realistic persons that blend well with existing scenes. A data-driven refinement strategy improves the visual quality and realism of the generated images. The method achieves state-of-the-art results on various qualitative and quantitative benchmarks. The method may struggle with unconventional poses or misclassified outliers during clustering. Future work could explore better ways to model global scene context and develop a more robust end-to-end approach. image generation, context-aware, person image synthesis, deep learning, computer vision
2302.14683 Report IntrinsicNGP: Intrinsic Coordinate based Hash Encoding for Human NeRF Bo Peng, Jun Hu, Jingtao Zhou, Xuan Gao, Juyong Zhang Recently, many works have been proposed to utilize the neural radiance field for novel view synthesis of human performers. However, most of these methods require hours of training, making them difficult for practical use. To address this challenging problem, we propose IntrinsicNGP, which can train from scratch and achieve high-fidelity results in few minutes with videos of a human performer. To achieve this target, we introduce a continuous and optimizable intrinsic coordinate rather than the original explicit Euclidean coordinate in the hash encoding module of instant-NGP. With this novel intrinsic coordinate, IntrinsicNGP can aggregate inter-frame information for dynamic objects with the help of proxy geometry shapes. Moreover, the results trained with the given rough geometry shapes can be further refined with an optimizable offset field based on the intrinsic coordinate. Extensive experimental results on several datasets demonstrate the effectiveness and efficiency of IntrinsicNGP. We also illustrate our approach's ability to edit the shape of reconstructed subjects. IntrinsicNGP, a novel view synthesis method for human bodies that can be trained from scratch in minutes on monocular videos using an intrinsic coordinate representation for hash encoding in INGP. Existing methods for human NeRF require hours of training, making them impractical for common users. IntrinsicNGP addresses this by enabling fast, high-fidelity novel view synthesis within minutes. IntrinsicNGP uses a UV-D mapping to represent query points with intrinsic coordinates based on nearest points on a rough human surface mesh and signed distance. It employs hash encoding on these coordinates for fast NeRF training and introduces an offset field to refine details. IntrinsicNGP achieves high-fidelity novel view synthesis comparable to state-of-the-art methods on ZJU-MoCap and custom datasets. It converges significantly faster (within minutes) than other methods, which typically take hours. IntrinsicNGP allows for shape editing of the reconstructed human body by manipulating the input surface mesh. The method's reliance on a template model (SMPL) can limit expressiveness despite using an offset field. Future work could explore combining IntrinsicNGP with more advanced human shape reconstruction methods for improved accuracy and detail. neural rendering, human performance capture, novel view synthesis, intrinsic coordinates, hash encoding
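A toy version of the UV-D intrinsic coordinate described above: for each query point, look up the nearest vertex of the proxy surface, take its UV coordinate, and append a signed distance along that vertex's normal. The brute-force nearest-neighbour search and normal-based sign are stand-ins for the real surface projection.

```python
import numpy as np

def intrinsic_coordinate(query, verts, uvs, normals):
    """query: (3,); verts: (V, 3); uvs: (V, 2); normals: (V, 3) unit vertex normals."""
    d2 = np.sum((verts - query) ** 2, axis=1)
    i = int(np.argmin(d2))                    # nearest proxy-surface vertex
    offset = query - verts[i]
    signed_dist = float(normals[i] @ offset)  # sign taken from the vertex normal
    return np.array([uvs[i, 0], uvs[i, 1], signed_dist])

rng = np.random.default_rng(0)
verts = rng.normal(size=(100, 3))
normals = rng.normal(size=(100, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
print(intrinsic_coordinate(rng.normal(size=3), verts, rng.random((100, 2)), normals))
```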
2302.14475 Report Benchmarking Deepart Detection Yabin Wang, Zhiwu Huang, Xiaopeng Hong Deepfake technologies have been blurring the boundaries between the real and unreal, likely resulting in malicious events. By leveraging newly emerged deepfake technologies, deepfake researchers have been making a great upending to create deepfake artworks (deeparts), which are further closing the gap between reality and fantasy. To address potentially appeared ethics questions, this paper establishes a deepart detection database (DDDB) that consists of a set of high-quality conventional art images (conarts) and five sets of deepart images generated by five state-of-the-art deepfake models. This database enables us to explore once-for-all deepart detection and continual deepart detection. For the two new problems, we suggest four benchmark evaluations and four families of solutions on the constructed DDDB. The comprehensive study demonstrates the effectiveness of the proposed solutions on the established benchmark dataset, which is capable of paving a way to more interesting directions of deepart detection. The constructed benchmark dataset and the source code will be made publicly available. This paper introduces DDDB, the first deepart detection database, and proposes two new deepart detection tasks: once-for-all deepart detection (ODD) and continual deepart detection (CDD). The emergence of highly realistic deepfake artworks (deeparts) necessitates detection and copyright identification to address ethical concerns. The authors construct DDDB with deeparts from five models and conarts from LAION-5B, designing four benchmark evaluations: one for ODD and three for CDD with varying rehearsal constraints. They propose solutions for each benchmark, including adapting existing methods and introducing a transformation framework to rescue rehearsal-free methods for the most challenging CDD scenario. Deeparts are significantly different from traditional deepfakes, rendering existing deepfake detectors ineffective. Continual deepart detection methods generally outperform once-for-all methods, particularly with the proposed transformation framework in rehearsal-free settings. The study highlights the challenge of deepart detection due to the high realism and closeness to real artworks. The paper acknowledges the limited availability of high-quality conarts and the reliance on Stable Diffusion's training data. Future work includes exploring the use of easily-acquired conarts, collecting more diverse data, and leveraging deepart prompts. deepfake detection, deepart, continual learning, benchmarking, copyright identification
2302.14452 Report An Effective Crop-Paste Pipeline for Few-shot Object Detection Shaobo Lin, Kun Wang, Xingyu Zeng, Rui Zhao Few-shot object detection (FSOD) aims to expand an object detector for novel categories given only a few instances for training. However, detecting novel categories with only a few samples usually leads to the problem of misclassification. In FSOD, we notice the false positive (FP) of novel categories is prominent, in which the base categories are often recognized as novel ones. To address this issue, a novel data augmentation pipeline that Crops the Novel instances and Pastes them on the selected Base images, called CNPB, is proposed. There are two key questions to be answered: (1) How to select useful base images? and (2) How to combine novel and base data? We design a multi-step selection strategy to find useful base data. Specifically, we first discover the base images which contain the FP of novel categories and select a certain amount of samples from them for the base and novel categories balance. Then the bad cases, such as the base images that have unlabeled ground truth or easily confused base instances, are removed by using CLIP. Finally, the same category strategy is adopted, in which a novel instance with category n is pasted on the base image with the FP of n. During combination, a novel instance is cropped and randomly down-sized, and thus pasted at the assigned optimal location from the randomly generated candidates in a selected base image. Our method is simple yet effective and can be easy to plug into existing FSOD methods, demonstrating significant potential for use. Extensive experiments on PASCAL VOC and MS COCO validate the effectiveness of our method. This paper proposes CNPB, a novel data augmentation pipeline for Few-Shot Object Detection (FSOD) that addresses the issue of misclassifying base categories as novel categories (false positives). FSOD models often struggle with misclassification, particularly false positives where base categories are incorrectly identified as novel categories. This limits their accuracy and practical applicability. CNPB works by cropping novel instances and pasting them onto carefully selected base images containing false positives. The key steps include: (1) Identifying base images with false positives using a trained FSOD model, (2) Selecting a balanced subset of these base images, (3) Removing unsuitable base images (e.g., containing unlabeled ground truth) using the CLIP model, and (4) Pasting a novel instance onto a base image containing a false positive of the same category. CNPB consistently reduces the false positive ratio of novel categories in FSOD models. CNPB significantly improves the performance of multiple baseline FSOD methods (TFA, FSCE, DeFRCN). CNPB achieves state-of-the-art performance on PASCAL VOC and MS COCO datasets. The improvement on MS COCO is less significant than PASCAL VOC due to higher shot settings used. Further exploration of advanced data augmentation techniques on the pasted novel instances might yield additional benefits. few-shot object detection, data augmentation, false positives, misclassification, computer vision
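The paste step itself is simple; the sketch below crops a novel instance by its box, randomly downscales it with nearest-neighbour resizing, and pastes it at a random location in a selected base image. The candidate-location scoring and CLIP-based filtering described above are omitted, and the scale range is an arbitrary choice.

```python
import numpy as np

def crop_paste(base_img, novel_img, novel_box, rng, min_scale=0.5):
    """Crop the novel instance given by novel_box = (x1, y1, x2, y2), downscale it,
    and paste it at a random location in base_img. Returns image and new box."""
    x1, y1, x2, y2 = novel_box
    crop = novel_img[y1:y2, x1:x2]
    s = rng.uniform(min_scale, 1.0)
    h, w = max(1, int(crop.shape[0] * s)), max(1, int(crop.shape[1] * s))
    idx_y = np.linspace(0, crop.shape[0] - 1, h).astype(int)
    idx_x = np.linspace(0, crop.shape[1] - 1, w).astype(int)
    crop = crop[np.ix_(idx_y, idx_x)]  # nearest-neighbour resize
    out = base_img.copy()
    py = rng.integers(0, out.shape[0] - h + 1)
    px = rng.integers(0, out.shape[1] - w + 1)
    out[py:py + h, px:px + w] = crop
    return out, (px, py, px + w, py + h)

rng = np.random.default_rng(0)
img, box = crop_paste(np.zeros((480, 640, 3)), np.ones((480, 640, 3)), (100, 100, 200, 260), rng)
print(img.shape, box)
```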
2302.14434 Report A Hierarchical Representation Network for Accurate and Detailed Face Reconstruction from In-The-Wild Images Biwen Lei, Jianqiang Ren, Mengyang Feng, Miaomiao Cui, Xuansong Xie Limited by the nature of the low-dimensional representational capacity of 3DMM, most of the 3DMM-based face reconstruction (FR) methods fail to recover high-frequency facial details, such as wrinkles, dimples, etc. Some attempt to solve the problem by introducing detail maps or non-linear operations, however, the results are still not vivid. To this end, we in this paper present a novel hierarchical representation network (HRN) to achieve accurate and detailed face reconstruction from a single image. Specifically, we implement the geometry disentanglement and introduce the hierarchical representation to fulfill detailed face modeling. Meanwhile, 3D priors of facial details are incorporated to enhance the accuracy and authenticity of the reconstruction results. We also propose a de-retouching module to achieve better decoupling of the geometry and appearance. It is noteworthy that our framework can be extended to a multi-view fashion by considering detail consistency of different views. Extensive experiments on two single-view and two multi-view FR benchmarks demonstrate that our method outperforms the existing methods in both reconstruction accuracy and visual effects. Finally, we introduce a high-quality 3D face dataset FaceHD-100 to boost the research of high-fidelity face reconstruction. The project homepage is at https://younglbw.github.io/HRN-homepage/. This paper introduces Hierarchical Representation Network (HRN), a novel method for accurate and detailed 3D face reconstruction from single and multi-view images. Current 3DMM-based face reconstruction methods struggle to recover high-frequency facial details. This paper aims to address this by introducing a novel hierarchical representation network that captures details in a coarse-to-fine manner. The method decouples facial geometry into low, mid, and high-frequency details, representing them with blendshape coefficients, a vertex-wise deformation map, and a pixel-wise displacement map, respectively. It utilizes two image translation networks to estimate detail maps and incorporates 3D priors of facial details for enhanced accuracy. A de-retouching module helps decouple geometry and appearance. HRN outperforms state-of-the-art methods on single-view face reconstruction benchmarks (FaceScape, REALY) in terms of detail capturing and shape accuracy. The method generalizes well to multi-view face reconstruction, achieving superior performance on FaceScape and ESRC datasets with only a few input views. Ablation studies validate the contribution of each component, including hierarchical modeling, contour-aware loss, 3D detail priors, and the de-retouching module. The paper acknowledges limitations regarding the handling of extreme poses and heavy occlusions. Future work will focus on extending the method for high-quality head reconstruction and exploring alternative detail modeling approaches. 3d face reconstruction, hierarchical representation, detail modeling, 3d morphable model (3dmm), de-retouching
2302.14431 Report Efficient Masked Autoencoders with Self-Consistency Zhaowen Li, Yousong Zhu, Zhiyang Chen, Wei Li, Chaoyang Zhao, Liwei Wu, Rui Zhao, Ming Tang, Jinqiao Wang Inspired by masked language modeling (MLM) in natural language processing, masked image modeling (MIM) has been recognized as a strong and popular self-supervised pre-training method in computer vision. However, its high random mask ratio would result in two serious problems: 1) the data are not efficiently exploited, which brings inefficient pre-training (e.g., 1600 epochs for MAE vs. 300 epochs for the supervised), and 2) the high uncertainty and inconsistency of the pre-trained model, i.e., the prediction of the same patch may be inconsistent under different mask rounds. To tackle these problems, we propose efficient masked autoencoders with self-consistency (EMAE), to improve the pre-training efficiency and increase the consistency of MIM. In particular, we progressively divide the image into K non-overlapping parts, each of which is generated by a random mask and has the same mask ratio. Then the MIM task is conducted parallelly on all parts in an iteration and generates predictions. Besides, we design a self-consistency module to further maintain the consistency of predictions of overlapping masked patches among parts. Overall, the proposed method is able to exploit the data more efficiently and obtains reliable representations. Experiments on ImageNet show that EMAE achieves even higher results with only 300 pre-training epochs under ViT-Base than MAE (1600 epochs). EMAE also consistently obtains state-of-the-art transfer performance on various downstream tasks, like object detection, and semantic segmentation. This paper proposes Efficient Masked Autoencoders with Self-Consistency (EMAE) to improve pre-training efficiency and consistency in Masked Image Modeling (MIM). High random mask ratios in MIM lead to inefficient pre-training and high inconsistency in the pre-trained model. EMAE divides the image into non-overlapping parts, performs MIM on each part parallelly, and utilizes a self-consistency module to maintain consistency among overlapping predictions. EMAE achieves higher accuracy on ImageNet linear evaluation with fewer epochs compared to MAE. EMAE consistently obtains state-of-the-art transfer performance on object detection, instance segmentation, and semantic segmentation. Ablation studies demonstrate the effectiveness of whole data utilization and the self-consistency module. The method's performance on larger datasets and architectures needs further investigation due to resource constraints. The model's reliance on training data statistics might lead to inheriting biases, potentially with negative social impacts. self-supervised learning, masked image modeling, vision transformer, pre-training, computer vision
2302.14368 Report Towards Enhanced Controllability of Diffusion Models Wonwoong Cho, Hareesh Ravi, Midhun Harikumar, Vinh Khuc, Krishna Kumar Singh, Jingwan Lu, David I. Inouye, Ajinkya Kale Denoising Diffusion models have shown remarkable capabilities in generating realistic, high-quality and diverse images. However, the extent of controllability during generation is underexplored. Inspired by techniques based on GAN latent space for image manipulation, we train a diffusion model conditioned on two latent codes, a spatial content mask and a flattened style embedding. We rely on the inductive bias of the progressive denoising process of diffusion models to encode pose/layout information in the spatial structure mask and semantic/style information in the style code. We propose two generic sampling techniques for improving controllability. We extend composable diffusion models to allow for some dependence between conditional inputs, to improve the quality of generations while also providing control over the amount of guidance from each latent code and their joint distribution. We also propose timestep dependent weight scheduling for content and style latents to further improve the translations. We observe better controllability compared to existing methods and show that without explicit training objectives, diffusion models can be used for effective image manipulation and image translation. This paper introduces a novel framework to enhance the controllability of image-conditioned diffusion models for image translation and manipulation. Diffusion models often lack the fine-grained controllability offered by GANs, limiting their use in applications like reference-based image translation. The proposed method learns disentangled content and style latent spaces by training separate encoders alongside the diffusion model. Two novel sampling techniques, Generalized Composable Diffusion Models (GCDM) and timestep-dependent weight scheduling, are introduced to improve controllability during generation. GCDM outperforms existing methods, including DiffuseIT and SAE, achieving better FID and LPIPS scores on image translation tasks. Timestep scheduling, leveraging the inductive bias of diffusion models, further enhances translation quality and control by weighting content and style information across timesteps. The learned latent spaces demonstrate desirable properties for manipulation, allowing for attribute-specific editing via PCA and smooth content/style interpolations. Further research is needed to explore training diffusion models with timestep scheduling to implicitly learn a mixture-of-experts model. Exploring the use of classifiers to potentially discover better directions for attribute manipulation in the latent space is a promising future direction. diffusion models, image translation, image manipulation, controllable generation, latent space
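A hedged sketch of composable guidance with a timestep-dependent weighting of the content and style terms; the linear schedule and the way the joint term is weighted below are illustrative assumptions rather than the paper's exact GCDM formulation.

```python
import torch

def guided_eps(eps_uncond, eps_content, eps_style, eps_joint, t, T,
               w_content=2.0, w_style=2.0, w_joint=1.0):
    """Combine noise predictions from the unconditional, content-conditioned,
    style-conditioned, and jointly-conditioned passes of the diffusion model."""
    a = t / T                    # content (layout) guidance dominates early (large t),
    wc = w_content * a           # style guidance dominates late (small t)
    ws = w_style * (1.0 - a)
    return (eps_uncond
            + wc * (eps_content - eps_uncond)
            + ws * (eps_style - eps_uncond)
            + w_joint * (eps_joint - eps_uncond))

e = [torch.randn(1, 3, 64, 64) for _ in range(4)]
print(guided_eps(*e, t=800, T=1000).shape)
```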
2302.14290 Report Learning to Retain while Acquiring: Combating Distribution-Shift in Adversarial Data-Free Knowledge Distillation Gaurav Patel, Konda Reddy Mopuri, Qiang Qiu Data-free Knowledge Distillation (DFKD) has gained popularity recently, with the fundamental idea of carrying out knowledge transfer from a Teacher neural network to a Student neural network in the absence of training data. However, in the Adversarial DFKD framework, the student network's accuracy, suffers due to the non-stationary distribution of the pseudo-samples under multiple generator updates. To this end, at every generator update, we aim to maintain the student's performance on previously encountered examples while acquiring knowledge from samples of the current distribution. Thus, we propose a meta-learning inspired framework by treating the task of Knowledge-Acquisition (learning from newly generated samples) and Knowledge-Retention (retaining knowledge on previously met samples) as meta-train and meta-test, respectively. Hence, we dub our method as Learning to Retain while Acquiring. Moreover, we identify an implicit aligning factor between the Knowledge-Retention and Knowledge-Acquisition tasks indicating that the proposed student update strategy enforces a common gradient direction for both tasks, alleviating interference between the two objectives. Finally, we support our hypothesis by exhibiting extensive evaluation and comparison of our method with prior arts on multiple datasets. This paper introduces a novel meta-learning inspired student update strategy for Adversarial Data-Free Knowledge Distillation (DFKD) that maintains student performance on past data (Knowledge-Retention) while learning from new data (Knowledge-Acquisition). In Adversarial DFKD, the student network's accuracy suffers due to the constantly changing distribution of generated pseudo-samples. The proposed method addresses this by encouraging the student to retain knowledge from previously encountered distributions. The method treats Knowledge-Acquisition (learning from new samples) and Knowledge-Retention (retaining knowledge from past samples) as meta-train and meta-test tasks, respectively. This strategy implicitly aligns the gradients of both tasks, enforcing a common optimization path. The proposed method demonstrates significant improvement in the learning evolution and peak accuracies compared to existing Adversarial DFKD methods. It exhibits global monotonicity in student learning, ensuring consistently high accuracy throughout the distillation process. The method is scalable across different network architectures and replay schemes, showing consistent improvements with both Memory Buffer and Generative Replay. The method's performance on complex datasets like Tiny-ImageNet with Generative Replay requires further investigation. Training a VAE for Generative Replay on a stream of synthetic samples can be challenging due to distribution drift. knowledge distillation, data-free learning, meta-learning, adversarial learning, distribution shift
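A first-order sketch of the update strategy, assuming an MSE-based distillation loss and a replay source for old samples: take a lookahead step on freshly generated samples (knowledge acquisition), evaluate the retention loss at the lookahead parameters on replayed samples, and back-propagate through both terms. Uses torch.func.functional_call (PyTorch 2.x); the tiny linear networks are placeholders.

```python
import torch
from torch.func import functional_call

student = torch.nn.Linear(8, 4)
teacher = torch.nn.Linear(8, 4)

def kd(student_out, teacher_out):
    return torch.nn.functional.mse_loss(student_out, teacher_out.detach())

def retain_while_acquiring_step(x_new, x_old, lr_inner=0.1):
    params = dict(student.named_parameters())
    acq = kd(student(x_new), teacher(x_new))  # knowledge acquisition on new pseudo-samples
    grads = torch.autograd.grad(acq, list(params.values()), create_graph=True)
    lookahead = {k: v - lr_inner * g for (k, v), g in zip(params.items(), grads)}
    ret = kd(functional_call(student, lookahead, (x_old,)), teacher(x_old))  # retention on replay
    return acq + ret

loss = retain_while_acquiring_step(torch.randn(16, 8), torch.randn(16, 8))
loss.backward()  # gradients flow through the lookahead step into the student
print(float(loss))
```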
2302.14007 Report Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training Ziyu Guo, Renrui Zhang, Longtian Qiu, Xianzhi Li, Pheng-Ann Heng Masked Autoencoders (MAE) have shown promising performance in self-supervised learning for both 2D and 3D computer vision. However, existing MAE-style methods can only learn from the data of a single modality, i.e., either images or point clouds, which neglect the implicit semantic and geometric correlation between 2D and 3D. In this paper, we explore how the 2D modality can benefit 3D masked autoencoding, and propose Joint-MAE, a 2D-3D joint MAE framework for self-supervised 3D point cloud pre-training. Joint-MAE randomly masks an input 3D point cloud and its projected 2D images, and then reconstructs the masked information of the two modalities. For better cross-modal interaction, we construct our JointMAE by two hierarchical 2D-3D embedding modules, a joint encoder, and a joint decoder with modal-shared and model-specific decoders. On top of this, we further introduce two cross-modal strategies to boost the 3D representation learning, which are local-aligned attention mechanisms for 2D-3D semantic cues, and a cross-reconstruction loss for 2D-3D geometric constraints. By our pre-training paradigm, Joint-MAE achieves superior performance on multiple downstream tasks, e.g., 92.4% accuracy for linear SVM on ModelNet40 and 86.07% accuracy on the hardest split of ScanObjectNN. This paper proposes Joint-MAE, a novel 2D-3D joint masked autoencoding framework for self-supervised 3D point cloud pre-training, leveraging readily available 2D images to enhance 3D representation learning. Existing MAE methods only learn from single modality data (images or point clouds) neglecting the implicit correlations between 2D and 3D. Joint-MAE addresses this by exploiting the dense, fine-grained information in 2D images to benefit 3D point cloud understanding. Joint-MAE projects 3D point clouds into 2D depth maps. It uses hierarchical modules for 2D and 3D token embedding, masks tokens, and employs a joint encoder for cross-modal interaction. A joint decoder with modal-shared and specific components reconstructs masked data. Further, it introduces local-aligned attention for better feature interaction and a cross-reconstruction loss for geometric constraint. Joint-MAE outperforms existing self-supervised methods, achieving 92.4% accuracy on ModelNet40 with linear SVM. It demonstrates superior performance on out-of-distribution data, surpassing Point-MAE by 0.89% on the challenging ScanObjectNN dataset. It excels in few-shot learning scenarios and achieves state-of-the-art results on part segmentation, highlighting its strong representation learning capability. While Joint-MAE demonstrates the benefit of 2D for 3D pre-training, exploring the reverse (3D benefiting 2D MAE) is left for future work. The current design of Joint-MAE relies on projecting point clouds into depth maps; incorporating other 2D modalities like RGB images could further enhance performance. self-supervised learning, masked autoencoding, point cloud representation learning, multi-modal learning, cross-modal interaction
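Projection of the point cloud into a 2D depth map is the entry point of the 2D branch; below is a toy orthographic projection (the paper's exact camera model and resolution are not assumed).

```python
import numpy as np

def project_to_depth(points, res=64):
    """points: (N, 3) roughly in [-1, 1]^3. Returns an (res, res) depth map."""
    depth = np.full((res, res), np.inf)
    uv = ((points[:, :2] + 1.0) * 0.5 * (res - 1)).round().astype(int)
    uv = np.clip(uv, 0, res - 1)
    z = points[:, 2]
    for (u, v), d in zip(uv, z):
        depth[v, u] = min(depth[v, u], d)  # keep the nearest point per pixel
    depth[np.isinf(depth)] = 1.0           # background value for empty pixels
    return depth

pts = np.random.uniform(-1, 1, size=(2048, 3))
print(project_to_depth(pts).shape)
```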
2302.13987 Report UMIFormer: Mining the Correlations between Similar Tokens for Multi-View 3D Reconstruction Zhenwei Zhu, Liying Yang, Ning Li, Chaohao Jiang, Yanyan Liang In recent years, many video tasks have achieved breakthroughs by utilizing the vision transformer and establishing spatial-temporal decoupling for feature extraction. Although multi-view 3D reconstruction also faces multiple images as input, it cannot immediately inherit their success due to completely ambiguous associations between unstructured views. There is not usable prior relationship, which is similar to the temporally-coherence property in a video. To solve this problem, we propose a novel transformer network for Unstructured Multiple Images (UMIFormer). It exploits transformer blocks for decoupled intra-view encoding and designed blocks for token rectification that mine the correlation between similar tokens from different views to achieve decoupled inter-view encoding. Afterward, all tokens acquired from various branches are compressed into a fixed-size compact representation while preserving rich information for reconstruction by leveraging the similarities between tokens. We empirically demonstrate on ShapeNet and confirm that our decoupled learning method is adaptable for unstructured multiple images. Meanwhile, the experiments also verify our model outperforms existing SOTA methods by a large margin. Code will be available at https://github.com/GaryZhu1996/UMIFormer. This paper introduces UMIFormer, a novel transformer network that decouples intra- and inter-view feature extraction for multi-view 3D reconstruction from unstructured images. Existing methods struggle to effectively extract features from unstructured multi-view images due to the lack of prior positional correspondence, like temporal coherence in videos. UMIFormer leverages transformer blocks for intra-view encoding and introduces Inter-View-Decoupled Blocks (IVDBs) based on similar token correlations for inter-view encoding. A Similar-Token Merger (STM) compresses features into a compact representation for the decoder. UMIFormer significantly outperforms previous state-of-the-art methods on ShapeNet benchmark. The proposed decoupled learning method is shown to be effective for unstructured multi-view images. The model exhibits robustness to varying numbers of input views. The model requires large memory and faces computational challenges with a high number of input views. Future work includes model compression and algorithm acceleration for higher resolution reconstruction and improved inference efficiency. 3d reconstruction, vision transformer, multi-view learning, deep learning, computer vision
2302.13770 Report Mask Reference Image Quality Assessment Pengxiang Xiao, Shuai He, Limin Liu, Anlong Ming Understanding semantic information is an essential step in knowing what is being learned in both full-reference (FR) and no-reference (NR) image quality assessment (IQA) methods. However, especially for many severely distorted images, even if there is an undistorted image as a reference (FR-IQA), it is difficult to perceive the lost semantic and texture information of distorted images directly. In this paper, we propose a Mask Reference IQA (MR-IQA) method that masks specific patches of a distorted image and supplements missing patches with the reference image patches. In this way, our model only needs to input the reconstructed image for quality assessment. First, we design a mask generator to select the best candidate patches from reference images and supplement the lost semantic information in distorted images, thus providing more reference for quality assessment; in addition, the different masked patches imply different data augmentations, which favors model training and reduces overfitting. Second, we provide a Mask Reference Network (MRNet): the dedicated modules can prevent disturbances due to masked patches and help eliminate the patch discontinuity in the reconstructed image. Our method achieves state-of-the-art performances on the benchmark KADID-10k, LIVE and CSIQ datasets and has better generalization performance across datasets. The code and results are available in the supplementary material. This paper proposes Mask Reference IQA (MR-IQA) to recover lost semantic and texture information in distorted images for better quality assessment. Existing FR-IQA methods struggle to recover lost semantic and texture details in distorted images, hindering accurate quality assessment. MR-IQA addresses this by directly incorporating reference image information into distorted regions. The method uses a Mask Generator (MG) to select severely distorted patches based on MAE difference with reference images. These patches are then replaced with corresponding reference patches, creating a masked image. This masked image is then fed into a Mask Reference Network (MRNet), a modified Swin Transformer, for quality prediction. The MRNet incorporates a Feature Mask Module (FMM) to mitigate interference from masked patches and enhance feature processing. MR-IQA achieves state-of-the-art performance on LIVE, CSIQ, and KADID-10k datasets. It outperforms both traditional and deep learning-based FR and NR IQA methods. The method shows strong generalization ability across different datasets. The performance improvement is not consistent across all datasets, with less pronounced gains on datasets with simpler distortion types. Future work can explore optimizing the masking strategy and adapting the approach for NR-IQA. image quality assessment, full-reference iqa, semantic information, mask reference image, swin transformer
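The mask-generator step reduces to a per-patch error ranking, sketched below: compute a per-patch MAE between the distorted and reference images and swap the worst patches for their reference counterparts. The patch size and masking ratio here are illustrative assumptions.

```python
import numpy as np

def mask_reference(distorted, reference, patch=32, ratio=0.25):
    """Replace the `ratio` most distorted patches of `distorted` with reference patches."""
    h, w, _ = distorted.shape
    out = distorted.copy()
    scores, coords = [], []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            mae = np.abs(distorted[y:y+patch, x:x+patch] - reference[y:y+patch, x:x+patch]).mean()
            scores.append(mae)
            coords.append((y, x))
    k = int(len(scores) * ratio)
    for i in np.argsort(scores)[-k:]:  # the most severely distorted patches get masked
        y, x = coords[i]
        out[y:y+patch, x:x+patch] = reference[y:y+patch, x:x+patch]
    return out

print(mask_reference(np.random.rand(224, 224, 3), np.random.rand(224, 224, 3)).shape)
```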
2302.13543 Report BLiRF: Bandlimited Radiance Fields for Dynamic Scene Modeling Sameera Ramasinghe, Violetta Shevchenko, Gil Avraham, Anton Van Den Hengel Reasoning the 3D structure of a non-rigid dynamic scene from a single moving camera is an under-constrained problem. Inspired by the remarkable progress of neural radiance fields (NeRFs) in photo-realistic novel view synthesis of static scenes, extensions have been proposed for dynamic settings. These methods heavily rely on neural priors in order to regularize the problem. In this work, we take a step back and reinvestigate how current implementations may entail deleterious effects, including limited expressiveness, entanglement of light and density fields, and sub-optimal motion localization. As a remedy, we advocate for a bridge between classic non-rigid-structure-from-motion (NRSfM) and NeRF, enabling the well-studied priors of the former to constrain the latter. To this end, we propose a framework that factorizes time and space by formulating a scene as a composition of bandlimited, high-dimensional signals. We demonstrate compelling results across complex dynamic scenes that involve changes in lighting, texture and long-range dynamics. This paper proposes BLiRF, a novel framework for dynamic 3D scene modeling that represents radiance fields as bandlimited signals, allowing for the integration of explicit and implicit priors and enabling efficient factorization of spatio-temporal dynamics. Existing dynamic NeRF extensions, heavily reliant on implicit neural priors, suffer from limitations like dependence on a canonical frame, entanglement of light and density fields, limited expressiveness, and sub-optimal motion localization. BLiRF models the scene as a composition of bandlimited, high-dimensional signals, factoring in spatio-temporal dynamics. An implementation enforces a low-rank constraint on shape space, a neural prior over the frequency domain, and a union-of-subspaces prior on shape deformation over time. BLiRF demonstrates superior modeling of long-range dynamics and motion localization compared to ray deformation-based methods. The framework effectively disentangles light and density fields, capturing scenes with dynamic lighting and textures. BLiRF exhibits faster training times and doesn't necessitate complex loss regularizers or optimization procedures common in other dynamic NeRF architectures. The volumetric representation limits reconstruction resolution, a trade-off for speed common in grid-based NeRF models. Exploration of alternative implementations and more complex priors within the generic framework is left for future work. neural radiance fields, dynamic scene modeling, novel view synthesis, non-rigid structure from motion, space-time factorization
2302.13331 Report Learning Input-agnostic Manipulation Directions in StyleGAN with Text Guidance Yoonjeon Kim, Hyunsu Kim, Junho Kim, Yunjey Choi, Eunho Yang With the advantages of fast inference and human-friendly flexible manipulation, image-agnostic style manipulation via text guidance enables new applications that were not previously available. The state-of-the-art text-guided image-agnostic manipulation method embeds the representation of each channel of StyleGAN independently in the Contrastive Language-Image Pre-training (CLIP) space, and provides it in the form of a Dictionary to quickly find out the channel-wise manipulation direction during inference time. However, in this paper we argue that this dictionary which is constructed by controlling single channel individually is limited to accommodate the versatility of text guidance since the collective and interactive relation among multiple channels are not considered. Indeed, we show that it fails to discover a large portion of manipulation directions that can be found by existing methods, which manually manipulates latent space without texts. To alleviate this issue, we propose a novel method that learns a Dictionary, whose entry corresponds to the representation of a single channel, by taking into account the manipulation effect coming from the interaction with multiple other channels. We demonstrate that our strategy resolves the inability of previous methods in finding diverse known directions from unsupervised methods and unknown directions from random text while maintaining the real-time inference speed and disentanglement ability. This paper proposes Multi2One, a novel method for text-guided image manipulation in StyleGAN that learns a dictionary to represent multi-channel manipulation effects in CLIP space. Existing text-guided manipulation methods, particularly StyleCLIP's GlobalDirection, fail to capture the full manipulation capabilities of StyleGAN due to their reliance on single-channel manipulation representations, leading to limited coverage of possible edits. Multi2One learns a dictionary by embedding the manipulation effects of known directions from unsupervised methods (GANspace, SeFa) into CLIP space. It leverages both the reconstruction of these known directions and the mapping of their multi-channel manipulation effects to CLIP space to learn a more comprehensive representation. Multi2One demonstrates superior performance in reconstructing unsupervised directions compared to StyleCLIP GlobalDirection. It achieves higher cosine similarity scores between manipulated images and text guidance in CLIP space, indicating better alignment with user intent. The method successfully discovers manipulation directions that were not present in the original unsupervised directions, highlighting its ability to generalize to unseen combinations of semantic attributes. The flexibility and diversity of text input are not fully utilized due to limitations in CLIP's encoding ability and deterministic representation. Future work could explore incorporating more advanced language models or alternative encoding schemes to enhance the expressiveness and controllability of text-guided manipulation. text-guided image manipulation, stylegan, clip, dictionary learning, unsupervised directions
2302.13279 Report Makeup Extraction of 3D Representation via Illumination-Aware Image Decomposition Xingchao Yang, Takafumi Taketomi, Yoshihiro Kanamori Facial makeup enriches the beauty of not only real humans but also virtual characters; therefore, makeup for 3D facial models is highly in demand in productions. However, painting directly on 3D faces and capturing real-world makeup are costly, and extracting makeup from 2D images often struggles with shading effects and occlusions. This paper presents the first method for extracting makeup for 3D facial models from a single makeup portrait. Our method consists of the following three steps. First, we exploit the strong prior of 3D morphable models via regression-based inverse rendering to extract coarse materials such as geometry and diffuse/specular albedos that are represented in the UV space. Second, we refine the coarse materials, which may have missing pixels due to occlusions. We apply inpainting and optimization. Finally, we extract the bare skin, makeup, and an alpha matte from the diffuse albedo. Our method offers various applications for not only 3D facial models but also 2D portrait images. The extracted makeup is well-aligned in the UV space, from which we build a large-scale makeup dataset and a parametric makeup model for 3D faces. Our disentangled materials also yield robust makeup transfer and illumination-aware makeup interpolation/removal without a reference image. This paper introduces a novel method for extracting facial makeup for 3D models from a single portrait image, enabling illumination-aware makeup manipulation in both 2D and 3D domains. Existing makeup transfer methods struggle with physical constraints like lighting and occlusions, while this method offers an integrated solution for realistic makeup application on 3D models. The method uses a three-step approach: (1) coarse facial material extraction using 3DMM fitting, (2) UV completion and material refinement via optimization, and (3) makeup extraction using a network trained on makeup and non-makeup albedo datasets. The method disentangles bare skin, makeup, and illumination components, enabling realistic makeup transfer while preserving lighting conditions. The extracted makeup, represented in UV space, facilitates building a large-scale makeup dataset and a PCA-based makeup model for 3D faces. The framework allows for various applications such as 3D makeup avatar creation, makeup editing, and illumination-aware makeup interpolation/removal. The method's reliance on 3DMM limits its ability to capture the full range of skin tones and subtle geometric details. The current approach focuses on diffuse albedo for makeup extraction, future work could explore specular albedo for more realistic makeup representation. makeup extraction, 3d face reconstruction, illumination-aware makeup transfer, uv completion, inverse rendering
2302.13153 Report Directed Diffusion: Direct Control of Object Placement through Attention Guidance Wan-Duo Kurt Ma, J. P. Lewis, Avisek Lahiri, Thomas Leung, W. Bastiaan Kleijn Text-guided diffusion models such as DALLE-2, Imagen, eDiff-I, and Stable Diffusion are able to generate an effectively endless variety of images given only a short text prompt describing the desired image content. In many cases the images are of very high quality. However, these models often struggle to compose scenes containing several key objects such as characters in specified positional relationships. The missing capability to "direct" the placement of characters and objects both within and across images is crucial in storytelling, as recognized in the literature on film and animation theory. In this work, we take a particularly straightforward approach to providing the needed direction. Drawing on the observation that the cross-attention maps for prompt words reflect the spatial layout of objects denoted by those words, we introduce an optimization objective that produces "activation" at desired positions in these cross-attention maps. The resulting approach is a step toward generalizing the applicability of text-guided diffusion models beyond single images to collections of related images, as in storybooks. Directed Diffusion provides easy high-level positional control over multiple objects, while making use of an existing pre-trained model and maintaining a coherent blend between the positioned objects and the background. Moreover, it requires only a few lines to implement. This paper introduces Directed Diffusion, a method to control object placement in text-to-image synthesis using pre-trained diffusion models without fine-tuning. Existing text-to-image models struggle to compose scenes with multiple objects in specific positions, hindering their use in storytelling and other applications requiring layout control. The method leverages the spatial interpretation of cross-attention maps in diffusion models. It optimizes a weight vector to re-weight trailing attention maps, guiding the placement of objects within user-specified bounding boxes during the denoising process. Directed Diffusion enables consistent control over the positioning of multiple objects, facilitating image generation for storytelling. The method ensures seamless integration of positioned objects with the background, maintaining contextual interactions like shadows and lighting. It offers a simple and efficient approach, requiring only bounding box specifications and a small optimization without extensive training or code changes. The method relies on the existing capabilities and limitations of pre-trained models, potentially inheriting their biases or struggling with complex prompts. While enabling object placement, the approach currently focuses on static images and does not address challenges in generating dynamic scenes or videos. denoising diffusion, text-to-image synthesis, object placement, cross-attention guidance, storytelling
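The guidance objective can be illustrated on a single attention map: maximize the fraction of a prompt token's cross-attention mass that falls inside the user's bounding box. The real method re-weights trailing attention maps inside the U-Net during sampling; the standalone loss below is only meant to show the shape of the objective.

```python
import torch

def bbox_attention_loss(attn_map, bbox):
    """attn_map: (H, W) non-negative cross-attention for one prompt token;
    bbox: (y0, y1, x0, x1) in attention-map coordinates."""
    y0, y1, x0, x1 = bbox
    inside = attn_map[y0:y1, x0:x1].sum()
    total = attn_map.sum() + 1e-8
    return 1.0 - inside / total  # 0 when all attention mass lies inside the box

# toy optimization target standing in for the trailing-map weights
attn = torch.rand(16, 16, requires_grad=True)
loss = bbox_attention_loss(torch.softmax(attn.flatten(), dim=0).view(16, 16), (4, 12, 4, 12))
loss.backward()
print(float(loss), attn.grad.shape)
```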
2302.12995 Report Raw Image Reconstruction with Learned Compact Metadata Yufei Wang, Yi Yu, Wenhan Yang, Lanqing Guo, Lap-Pui Chau, Alex Kot, Bihan Wen While raw images exhibit advantages over sRGB images (e.g., linearity and fine-grained quantization level), they are not widely used by common users due to the large storage requirements. Very recent works propose to compress raw images by designing the sampling masks in the raw image pixel space, leading to suboptimal image representations and redundant metadata. In this paper, we propose a novel framework to learn a compact representation in the latent space serving as the metadata in an end-to-end manner. Furthermore, we propose a novel sRGB-guided context model with improved entropy estimation strategies, which leads to better reconstruction quality, smaller size of metadata, and faster speed. We illustrate how the proposed raw image compression scheme can adaptively allocate more bits to image regions that are important from a global perspective. The experimental results show that the proposed method can achieve superior raw image reconstruction results using a smaller size of the metadata on both uncompressed sRGB images and JPEG images. This paper proposes a novel end-to-end deep encoding framework for raw image reconstruction that learns compact metadata in latent space with adaptive bit allocation, leading to high-fidelity reconstruction with less storage overhead. Raw images, despite advantages like linearity and fine-grained quantization, are not widely used due to large storage requirements. Existing compression methods suffer from suboptimal representations and metadata redundancy. The framework uses an sRGB-guided context model for efficient latent code encoding and a hyperprior model with improved entropy estimation strategies for further compression. It adaptively allocates bits based on image content, prioritizing complex regions. Achieves superior raw image reconstruction quality with lower storage overhead than previous state-of-the-art methods on AdobeFiveK and NUS datasets. The sRGB-guided context model allows for adaptive bit allocation, prioritizing complex regions and resulting in efficient compression. The proposed method shows robustness when reconstructing raw images from compressed JPEG images of varying quality factors. The current implementation only considers the information from a single sRGB image. Future work could explore incorporating information from adjacent frames in a video to further reduce redundancy. raw image reconstruction, image compression, latent space, adaptive bit allocation, context modeling
2302.12764 Report Modulating Pretrained Diffusion Models for Multimodal Image Synthesis Cusuh Ham, James Hays, Jingwan Lu, Krishna Kumar Singh, Zhifei Zhang, Tobias Hinz We present multimodal conditioning modules (MCM) for enabling conditional image synthesis using pretrained diffusion models. Previous multimodal synthesis works rely on training networks from scratch or fine-tuning pretrained networks, both of which are computationally expensive for large, state-of-the-art diffusion models. Our method uses pretrained networks but does not require any updates to the diffusion network's parameters. MCM is a small module trained to modulate the diffusion network's predictions during sampling using 2D modalities (e.g., semantic segmentation maps, sketches) that were unseen during the original training of the diffusion model. We show that MCM enables user control over the spatial layout of the image and leads to increased control over the image generation process. Training MCM is cheap as it does not require gradients from the original diffusion net, consists of only ~1% of the number of parameters of the base diffusion model, and is trained using only a limited number of training examples. We evaluate our method on unconditional and text-conditional models to demonstrate the improved control over the generated images and their alignment with respect to the conditioning inputs. This paper introduces Multimodal Conditioning Modules (MCM), a lightweight method for adapting pretrained diffusion models to perform multimodal image synthesis without requiring any updates to the original model's parameters. Training diffusion models from scratch or fine-tuning them for specific conditions is computationally expensive. This paper addresses this by enabling multimodal control of pretrained models in a computationally efficient manner. MCM is a small diffusion-like network trained to modulate the predictions of a pretrained diffusion model during sampling. It takes new modalities and the diffusion model's intermediate outputs as input and outputs parameters that modulate the noise prediction at each sampling timestep. MCM enables user control over spatial layout and generation process of images using new modalities like segmentation maps and sketches. It achieves high-quality results comparable to fine-tuned models while being significantly smaller and using less training data. MCM is flexible with respect to sampling methods and can be applied to both unconditional and conditional diffusion models. MCM currently only supports 2D modalities. It struggles to ground semantics with poor-quality training data. image synthesis, diffusion models, multimodal learning, conditional image generation, pretrained models
2302.12469 Report Unsupervised Discovery of Semantic Latent Directions in Diffusion Models Yong-Hyun Park, Mingi Kwon, Junghyo Jo, Youngjung Uh Despite the success of diffusion models (DMs), we still lack a thorough understanding of their latent space. While image editing with GANs builds upon latent space, DMs rely on editing the conditions such as text prompts. We present an unsupervised method to discover interpretable editing directions for the latent variables $\mathbf{x}_t \in \mathcal{X}$ of DMs. Our method adopts Riemannian geometry between $\mathcal{X}$ and the intermediate feature maps $\mathcal{H}$ of the U-Nets to provide a deep understanding over the geometrical structure of $\mathcal{X}$. The discovered semantic latent directions mostly yield disentangled attribute changes, and they are globally consistent across different samples. Furthermore, editing in earlier timesteps edits coarse attributes, while ones in later timesteps focus on high-frequency details. We define the curvedness of a line segment between samples to show that $\mathcal{X}$ is a curved manifold. Experiments on different baselines and datasets demonstrate the effectiveness of our method even on Stable Diffusion. Our source code will be publicly available for the future researchers. This paper presents an unsupervised method for discovering interpretable editing directions in the latent space of pre-trained diffusion models (DMs). Understanding the latent space of DMs is crucial for developing controllable image editing techniques, similar to what has been achieved with GANs. The method leverages Riemannian geometry by analyzing the Jacobian of the mapping between the latent space and the intermediate feature space of the U-Net. The discovered directions correspond to semantically meaningful image manipulations, such as changing age, gender, or breed. Editing in earlier timesteps affects coarse attributes, while later timesteps control fine details. The latent space of DMs exhibits a curved manifold structure. Some editing directions can be entangled due to dataset bias and model limitations. The method exhibits occasional abrupt changes when applied to Stable Diffusion, suggesting a more complex latent space structure. machine learning, diffusion model, latent space, image editing, unsupervised learning
2302.12464 Report RGI: robust GAN-inversion for mask-free image inpainting and unsupervised pixel-wise anomaly detection Shancong Mou, Xiaoyi Gu, Meng Cao, Haoping Bai, Ping Huang, Jiulong Shan, Jianjun Shi Generative adversarial networks (GANs), trained on a large-scale image dataset, can be a good approximator of the natural image manifold. GAN-inversion, using a pre-trained generator as a deep generative prior, is a promising tool for image restoration under corruptions. However, the performance of GAN-inversion can be limited by a lack of robustness to unknown gross corruptions, i.e., the restored image might easily deviate from the ground truth. In this paper, we propose a Robust GAN-inversion (RGI) method with a provable robustness guarantee to achieve image restoration under unknown gross corruptions, where a small fraction of pixels are completely corrupted. Under mild assumptions, we show that the restored image and the identified corrupted region mask converge asymptotically to the ground truth. Moreover, we extend RGI to Relaxed-RGI (R-RGI) for generator fine-tuning to mitigate the gap between the GAN learned manifold and the true image manifold while avoiding trivial overfitting to the corrupted input image, which further improves the image restoration and corrupted region mask identification performance. The proposed RGI/R-RGI method unifies two important applications with state-of-the-art (SOTA) performance: (i) mask-free semantic inpainting, where the corruptions are unknown missing regions, the restored background can be used to restore the missing content; (ii) unsupervised pixel-wise anomaly detection, where the corruptions are unknown anomalous regions, the retrieved mask can be used as the anomalous region's segmentation mask. This paper proposes Robust GAN-inversion (RGI) and Relaxed RGI (R-RGI) methods to improve robustness and accuracy of GAN-inversion for image restoration under unknown gross corruptions, where a small fraction of pixels are completely corrupted. Existing GAN-inversion methods lack robustness to gross corruptions and suffer from approximation gap between learned and true image manifolds, limiting their performance in image restoration and anomaly detection. RGI learns latent representation and corrupted region mask simultaneously by minimizing a reconstruction loss with sparsity penalty on the mask. R-RGI extends RGI by incorporating generator fine-tuning to mitigate the approximation gap. RGI/R-RGI provably converges to the true clean image and corrupted region mask asymptotically. RGI/R-RGI enables mask-free semantic inpainting, achieving comparable performance to methods requiring pre-configured masks. R-RGI significantly outperforms state-of-the-art unsupervised pixel-wise anomaly detection methods on a synthetic defect dataset. The computational cost of RGI/R-RGI is high due to the optimization process for each image. The performance of RGI/R-RGI relies on sufficient training data for GAN to learn a generalizable image manifold. gan-inversion, image restoration, anomaly detection, semantic inpainting, robust optimization
2302.12400 Report Towards Stable Test-Time Adaptation in Dynamic Wild World Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Zhiquan Wen, Yaofo Chen, Peilin Zhao, Mingkui Tan Test-time adaptation (TTA) has been shown to be effective at tackling distribution shifts between training and testing data by adapting a given model on test samples. However, the online model updating of TTA may be unstable and this is often a key obstacle preventing existing TTA methods from being deployed in the real world. Specifically, TTA may fail to improve or even harm the model performance when test data have: 1) mixed distribution shifts, 2) small batch sizes, and 3) online imbalanced label distribution shifts, which are quite common in practice. In this paper, we investigate the unstable reasons and find that the batch norm layer is a crucial factor hindering TTA stability. Conversely, TTA can perform more stably with batch-agnostic norm layers, i.e., group or layer norm. However, we observe that TTA with group and layer norms does not always succeed and still suffers many failure cases. By digging into the failure cases, we find that certain noisy test samples with large gradients may disturb the model adaptation and result in collapsed trivial solutions, i.e., assigning the same class label for all samples. To address the above collapse issue, we propose a sharpness-aware and reliable entropy minimization method, called SAR, for further stabilizing TTA from two aspects: 1) remove partial noisy samples with large gradients, 2) encourage model weights to go to a flat minimum so that the model is robust to the remaining noisy samples. Promising results demonstrate that SAR performs more stably over prior methods and is computationally efficient under the above wild test scenarios. This paper proposes a sharpness-aware and reliable entropy minimization method (SAR) to stabilize online test-time adaptation (TTA) under wild test settings (mix shifts, small batch, and imbalanced label shifts). Existing TTA methods often fail to improve or even harm model performance under these common real-world test scenarios, hindering their practical deployment. The paper first analyzes and verifies that batch-agnostic norm layers are more beneficial for stable TTA than batch norm. To address the model collapse issue of entropy-based methods on these models, SAR removes noisy samples with large gradients based on entropy and encourages optimization to a flat minimum for robustness to remaining noisy samples. Batch-agnostic norm layers (group and layer norm) are more beneficial for stable TTA under wild test settings than batch norm. Online entropy minimization on group/layer norm models may lead to collapsed trivial solutions. SAR stabilizes online TTA under wild test settings by effectively removing noisy samples and optimizing to a flat minimum, outperforming prior methods. The paper focuses on entropy-based online TTA methods and may not be directly applicable to other TTA strategies. Future work can explore incorporating other stability-enhancing techniques into SAR or investigating its effectiveness on broader tasks beyond image classification. test-time adaptation, domain shift, entropy minimization, sharpness-aware learning, model robustness
2302.12253 Report DisCO: Portrait Distortion Correction with Perspective-Aware 3D GANs Zhixiang Wang, Yu-Lun Liu, Jia-Bin Huang, Shin'ichi Satoh, Sizhuo Ma, Gurunandan Krishnan, Jian Wang Close-up facial images captured at short distances often suffer from perspective distortion, resulting in exaggerated facial features and unnatural/unattractive appearances. We propose a simple yet effective method for correcting perspective distortions in a single close-up face. We first perform GAN inversion using a perspective-distorted input facial image by jointly optimizing the camera intrinsic/extrinsic parameters and face latent code. To address the ambiguity of joint optimization, we develop several strategies: starting from a short distance, optimization scheduling, reparametrizations, and geometric regularization. Re-rendering the portrait at a proper focal length and camera distance effectively corrects perspective distortions and produces more natural-looking results. Our experiments show that our method compares favorably against previous approaches qualitatively and quantitatively. We showcase numerous examples validating the applicability of our method on in-the-wild portrait photos. We will release our code and the evaluation protocol to facilitate future work. This paper introduces DisCO, a novel method for correcting perspective distortions in close-up facial images using perspective-aware 3D GAN inversion. Close-up photos, like selfies, often suffer from undesirable perspective distortions that make facial features appear exaggerated. Existing correction methods struggle with severe distortions and cannot synthesize missing details. DisCO jointly optimizes camera parameters (focal length, distance) and face latent code. To address optimization ambiguity, it employs strategies like close-up distance initialization, separate optimization scheduling, parameter reparameterizations, and geometric constraints. It further utilizes a geometry-aware stitching technique to handle full images, ensuring consistent manipulation of both the face and body. DisCO outperforms previous methods qualitatively and quantitatively on benchmark datasets. The method effectively corrects severe distortions in in-the-wild images, generating more natural and visually pleasing results. It allows for additional visual effects like dolly-zoom videos. DisCO faces challenges with out-of-distribution faces, such as extreme expressions or occlusions. The current implementation relies on optimization-based inversion, which limits its speed. Future work will explore encoder-based solutions for real-time performance. perspective correction, 3d gan inversion, portrait distortion, face editing, dolly zoom
2302.12251 Report VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion Yiming Li, Zhiding Yu, Christopher Choy, Chaowei Xiao, Jose M. Alvarez, Sanja Fidler, Chen Feng, Anima Anandkumar Humans can easily imagine the complete 3D geometry of occluded objects and scenes. This appealing ability is vital for recognition and understanding. To enable such capability in AI systems, we propose VoxFormer, a Transformer-based semantic scene completion framework that can output complete 3D volumetric semantics from only 2D images. Our framework adopts a two-stage design where we start from a sparse set of visible and occupied voxel queries from depth estimation, followed by a densification stage that generates dense 3D voxels from the sparse ones. A key idea of this design is that the visual features on 2D images correspond only to the visible scene structures rather than the occluded or empty spaces. Therefore, starting with the featurization and prediction of the visible structures is more reliable. Once we obtain the set of sparse queries, we apply a masked autoencoder design to propagate the information to all the voxels by self-attention. Experiments on SemanticKITTI show that VoxFormer outperforms the state of the art with a relative improvement of 20.0% in geometry and 18.1% in semantics and reduces GPU memory during training to less than 16GB. Our code is available on https://github.com/NVlabs/VoxFormer. Proposes VoxFormer, a Transformer-based framework for camera-based 3D semantic scene completion, which outputs complete 3D volumetric semantics from only 2D images. Enabling AI systems to imagine the complete 3D geometry of occluded objects and scenes is vital for recognition and understanding in applications like autonomous driving. Adopts a two-stage design: (1) a query proposal network generates sparse occupied voxel queries from depth estimation, (2) a masked autoencoder-like Transformer densifies the sparse voxels and performs semantic segmentation. Outperforms state-of-the-art camera-based methods by a large margin on SemanticKITTI. Achieves comparable performance to LiDAR-based methods, especially in safety-critical short-range areas. Significantly improves the completion of small objects compared to baselines. Long-range performance needs further improvement due to unreliable depth estimation at far distances. Decoupling long-range and short-range scene completion is a potential future direction. semantic scene completion, 3d vision, autonomous driving, transformer, camera-based perception
2302.12248 Report Learning Visual Representations via Language-Guided Sampling Mohamed El Banani, Karan Desai, Justin Johnson Although an object may appear in numerous contexts, we often describe it in a limited number of ways. Language allows us to abstract away visual variation to represent and communicate concepts. Building on this intuition, we propose an alternative approach to visual representation learning: using language similarity to sample semantically similar image pairs for contrastive learning. Our approach diverges from image-based contrastive learning by sampling view pairs using language similarity instead of hand-crafted augmentations or learned clusters. Our approach also differs from image-text contrastive learning by relying on pre-trained language models to guide the learning rather than directly minimizing a cross-modal loss. Through a series of experiments, we show that language-guided learning yields better features than image-based and image-text representation learning approaches. This paper proposes a new method called language-guided contrastive learning for visual representation learning, which utilizes language similarity to sample semantically similar image pairs for contrastive learning. Current image-based contrastive learning methods rely on visual similarity as a proxy for conceptual similarity, which limits the learned visual invariances. This work uses language as a proxy for conceptual similarity to improve generalization. The method samples image pairs with similar captions using a pre-trained sentence encoder (SBERT) and uses those pairs for contrastive learning with SimCLR, SimSiam, or SLIP. Language-guided contrastive learning outperforms image-only and image-text contrastive learning on linear probe and few-shot classification tasks. The approach is robust to the choice of sampling strategy or language model, showing consistent performance gains with different sentence encoders. Sampling nearest neighbors in language space provides higher-quality pairs for training compared to sampling in visual feature space, especially for self-supervised visual models. Image captions can be noisy or vague, resulting in the retrieval of unrelated image pairs. A caption only captures one aspect of an image, potentially leading to similarity based on irrelevant factors. contrastive learning, visual representation learning, self-supervised learning, language-guided learning, image captioning
2302.12237 Report Learning Neural Volumetric Representations of Dynamic Humans in Minutes Chen Geng, Sida Peng, Zhen Xu, Hujun Bao, Xiaowei Zhou This paper addresses the challenge of quickly reconstructing free-viewpoint videos of dynamic humans from sparse multi-view videos. Some recent works represent the dynamic human as a canonical neural radiance field (NeRF) and a motion field, which are learned from videos through differentiable rendering. But the per-scene optimization generally requires hours. Other generalizable NeRF models leverage learned prior from datasets and reduce the optimization time by only finetuning on new scenes at the cost of visual fidelity. In this paper, we propose a novel method for learning neural volumetric videos of dynamic humans from sparse view videos in minutes with competitive visual quality. Specifically, we define a novel part-based voxelized human representation to better distribute the representational power of the network to different human parts. Furthermore, we propose a novel 2D motion parameterization scheme to increase the convergence rate of deformation field learning. Experiments demonstrate that our model can be learned 100 times faster than prior per-scene optimization methods while being competitive in the rendering quality. Training our model on a $512 \times 512$ video with 100 frames typically takes about 5 minutes on a single RTX 3090 GPU. The code will be released on our project page: https://zju3dv.github.io/instant_nvr This paper presents a novel dynamic human representation that significantly accelerates the optimization of neural human models from videos, achieving a 100x speedup compared to previous methods. Creating volumetric videos of human performers from multi-view videos has many applications, but existing methods suffer from lengthy optimization times, hindering their practical use. The proposed representation combines a part-based voxelized human model with a 2D motion parameterization scheme. The human body is decomposed into parts, each represented by an independent NeRF network with varying resolutions, optimizing representational power distribution. A 2D surface parameterization is used to predict motion, leveraging the fact that human motion primarily occurs at the surface level, which significantly reduces the dimensionality of the motion field and improves convergence rate. The proposed method achieves 100x faster optimization compared to previous neural human representations. It maintains competitive rendering quality with state-of-the-art methods on benchmark datasets like ZJU-MoCap and MonoCap. Training the model on a 100-frame monocular video with 512x512 resolution takes approximately 5 minutes on an RTX 3090 GPU. The method currently relies on accurate SMPL parameters, which may be difficult to obtain in unconstrained environments. It focuses on reconstructing foreground dynamic humans and cannot handle dynamic backgrounds. neural human modeling, volumetric video, nerf, motion parameterization, fast optimization
2302.12231 Report DiffusioNeRF: Regularizing Neural Radiance Fields with Denoising Diffusion Models Jamie Wynn, Daniyar Turmukhambetov Under good conditions, Neural Radiance Fields (NeRFs) have shown impressive results on novel view synthesis tasks. NeRFs learn a scene's color and density fields by minimizing the photometric discrepancy between training views and differentiable renderings of the scene. Once trained from a sufficient set of views, NeRFs can generate novel views from arbitrary camera positions. However, the scene geometry and color fields are severely under-constrained, which can lead to artifacts, especially when trained with few input views. To alleviate this problem we learn a prior over scene geometry and color, using a denoising diffusion model (DDM). Our DDM is trained on RGBD patches of the synthetic Hypersim dataset and can be used to predict the gradient of the logarithm of a joint probability distribution of color and depth patches. We show that, these gradients of logarithms of RGBD patch priors serve to regularize geometry and color of a scene. During NeRF training, random RGBD patches are rendered and the estimated gradient of the log-likelihood is backpropagated to the color and density fields. Evaluations on LLFF, the most relevant dataset, show that our learned prior achieves improved quality in the reconstructed geometry and improved generalization to novel views. Evaluations on DTU show improved reconstruction quality among NeRF methods. This paper introduces DiffusioNeRF, a novel approach for regularizing Neural Radiance Fields (NeRFs) using Denoising Diffusion Models (DDMs). NeRFs often produce low-quality or physically implausible geometries and appearances, particularly when trained on a limited number of input views. This method aims to address this issue and improve the quality of NeRF reconstructions. A DDM is trained on RGBD patches from the synthetic Hypersim dataset to learn a prior over scene geometry and color. The DDM provides gradients of the log-likelihood of RGBD patches, which are then used to regularize the NeRF's density and color fields during training. The learned prior improves the quality of reconstructed geometry, resulting in more plausible depth maps. DiffusioNeRF shows improved generalization to novel views, particularly in the few-view setting. On the DTU dataset, DiffusioNeRF achieves improved reconstruction quality compared to other NeRF methods, even surpassing some SDF-based methods. The DDM regularization can sometimes lead to over-smoothing of thin structures. Further research is needed on the principled combination of DDM gradients with the NeRF objective to optimize the scheduling of diffusion time (τ) and gradient weights. neural radiance fields, nerf, denoising diffusion models, novel view synthesis, 3d reconstruction
2302.12228 Report Encoder-based Domain Tuning for Fast Personalization of Text-to-Image Models Rinon Gal, Moab Arar, Yuval Atzmon, Amit H. Bermano, Gal Chechik, Daniel Cohen-Or Text-to-image personalization aims to teach a pre-trained diffusion model to reason about novel, user provided concepts, embedding them into new scenes guided by natural language prompts. However, current personalization approaches struggle with lengthy training times, high storage requirements or loss of identity. To overcome these limitations, we propose an encoder-based domain-tuning approach. Our key insight is that by underfitting on a large set of concepts from a given domain, we can improve generalization and create a model that is more amenable to quickly adding novel concepts from the same domain. Specifically, we employ two components: First, an encoder that takes as an input a single image of a target concept from a given domain, e.g. a specific face, and learns to map it into a word-embedding representing the concept. Second, a set of regularized weight-offsets for the text-to-image model that learn how to effectively ingest additional concepts. Together, these components are used to guide the learning of unseen concepts, allowing us to personalize a model using only a single image and as few as 5 training steps - accelerating personalization from dozens of minutes to seconds, while preserving quality. This paper proposes Encoder for Tuning (E4T), an encoder-based domain-tuning method for fast personalization of text-to-image models, enabling adaptation to novel concepts in seconds. Current personalization methods for text-to-image models are slow, require significant storage for each concept, and often lead to overfitting. E4T pretrains on a large dataset of a specific domain (e.g., faces, cats) to learn an encoder that maps concept images to word embeddings and weight offsets for efficient model tuning. At inference, it personalizes the model using a single image and few training steps. E4T achieves comparable or superior personalization quality to existing methods like Textual Inversion and DreamBooth, using only a single image and significantly less training time. The iterative refinement approach used in E4T allows the model to focus on high-level details first and progressively refine the concept representation during denoising. Quantitative evaluation demonstrates E4T's effectiveness in capturing identity while adhering to user prompts, placing it on the Pareto front for both metrics. The reliance on large, domain-specific datasets for encoder pretraining limits E4T's applicability to concepts with abundant training data. The need for inference-time tuning, while fast, requires capable hardware and more memory compared to direct fine-tuning methods. text-to-image synthesis, personalization, diffusion models, encoder-decoder architecture, domain adaptation
2302.12066 Report Teaching CLIP to Count to Ten Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, Tali Dekel Large vision-language models (VLMs), such as CLIP, learn rich joint image-text representations, facilitating advances in numerous downstream tasks, including zero-shot classification and text-to-image generation. Nevertheless, existing VLMs exhibit a prominent well-documented limitation - they fail to encapsulate compositional concepts such as counting. We introduce a simple yet effective method to improve the quantitative understanding of VLMs, while maintaining their overall performance on common benchmarks. Specifically, we propose a new counting-contrastive loss used to finetune a pre-trained VLM in tandem with its original objective. Our counting loss is deployed over automatically-created counterfactual examples, each consisting of an image and a caption containing an incorrect object count. For example, an image depicting three dogs is paired with the caption "Six dogs playing in the yard". Our loss encourages discrimination between the correct caption and its counterfactual variant which serves as a hard negative example. To the best of our knowledge, this work is the first to extend CLIP's capabilities to object counting. Furthermore, we introduce "CountBench" - a new image-text counting benchmark for evaluating a model's understanding of object counting. We demonstrate a significant improvement over state-of-the-art baseline models on this task. Finally, we leverage our count-aware CLIP model for image retrieval and text-conditioned image generation, demonstrating that our model can produce specific counts of objects more reliably than existing ones. This paper introduces a novel method for improving the quantitative understanding of large-scale vision-language models (VLMs) like CLIP, enabling them to better comprehend and process object counts in images and text. Existing VLMs struggle with compositional concepts like counting, limiting their performance in tasks such as image retrieval and text-to-image generation. This work addresses this limitation, enhancing VLMs' ability to accurately associate object counts in text with visual representations. The method involves creating a filtered counting training set with captions explicitly stating object counts. A novel counting-contrastive loss is introduced, training the VLM to distinguish between correct captions and counterfactual ones with incorrect object counts. The proposed method significantly improves zero-shot count classification accuracy on the newly introduced CountBench benchmark. The finetuned VLMs retain their performance on general zero-shot classification tasks, demonstrating the preservation of their original knowledge. The enhanced VLMs exhibit improved performance in text-to-image retrieval and generation, producing results that better adhere to specified object counts in text prompts. The method's performance is limited by the availability of training data, particularly for images with a large number of objects and corresponding captions accurately stating the count. The current implementation focuses on counting up to ten and may not generalize well to larger numbers without further adaptation. vision-language models, clip, counting, compositionality, text-to-image generation
2302.11831 Report Embedding Fourier for Ultra-High-Definition Low-Light Image Enhancement Chongyi Li, Chun-Le Guo, Man Zhou, Zhexin Liang, Shangchen Zhou, Ruicheng Feng, Chen Change Loy Ultra-High-Definition (UHD) photo has gradually become the standard configuration in advanced imaging devices. The new standard unveils many issues in existing approaches for low-light image enhancement (LLIE), especially in dealing with the intricate issue of joint luminance enhancement and noise removal while remaining efficient. Unlike existing methods that address the problem in the spatial domain, we propose a new solution, UHDFour, that embeds Fourier transform into a cascaded network. Our approach is motivated by a few unique characteristics in the Fourier domain: 1) most luminance information concentrates on amplitudes while noise is closely related to phases, and 2) a high-resolution image and its low-resolution version share similar amplitude patterns. Through embedding Fourier into our network, the amplitude and phase of a low-light image are separately processed to avoid amplifying noise when enhancing luminance. Besides, UHDFour is scalable to UHD images by implementing amplitude and phase enhancement under the low-resolution regime and then adjusting the high-resolution scale with few computations. We also contribute the first real UHD LLIE dataset, UHD-LL, that contains 2,150 low-noise/normal-clear 4K image pairs with diverse darkness and noise levels captured in different scenarios. With this dataset, we systematically analyze the performance of existing LLIE methods for processing UHD images and demonstrate the advantage of our solution. We believe our new framework, coupled with the dataset, would push the frontier of LLIE towards UHD. The code and dataset are available at https://li-chongyi.github.io/UHDFour. This paper proposes UHDFour, a novel UHD Low-Light Image Enhancement (LLIE) framework that leverages Fourier transform in a cascaded network for efficient joint luminance enhancement and noise removal, addressing limitations of existing spatial domain methods. Existing LLIE methods struggle to handle real-world UHD images due to limitations in noise removal, suboptimal enhancement, incompatibility with high-resolution inputs, and inefficiency. UHDFour tackles these challenges by processing images in the Fourier domain. UHDFour consists of LRNet and HRNet. LRNet processes downsampled images in Fourier domain (enhancing amplitude and phase separately) and estimates LR output. HRNet refines amplitude and phase in HR using LRNet outputs and estimates final HR output. UHDFour outperforms 14 state-of-the-art LLIE methods on the newly introduced UHD-LL dataset, achieving superior quantitative and qualitative results. The paper introduces UHD-LL, the first real-world UHD LLIE dataset with 2,150 low-noise/normal-clear 4K image pairs, addressing the lack of diverse, high-resolution benchmark data. Analysis reveals that existing LLIE models, even when retrained, fail to effectively handle noise and maintain image fidelity in UHD images. The study is limited to image enhancement, excluding video data and adversarial losses. Trained models on sRGB data might not generalize to extreme cases with information loss due to limited bit depth, necessitating exploration with HDR data. low-light image enhancement, uhd image processing, fourier transform, deep learning, image denoising
2302.11710 Report Controlled and Conditional Text to Image Generation with Diffusion Prior Pranav Aggarwal, Hareesh Ravi, Naveen Marri, Sachin Kelkar, Fengbin Chen, Vinh Khuc, Midhun Harikumar, Ritiz Tambi, Sudharshan Reddy Kakumanu, Purvak Lapsiya, Alvin Ghouas, Sarah Saber, Malavika Ramprasad, Baldo Faieta, Ajinkya Kale Denoising Diffusion models have shown remarkable performance in generating diverse, high quality images from text. Numerous techniques have been proposed on top of or in alignment with models like Stable Diffusion and Imagen that generate images directly from text. A lesser explored approach is DALLE-2's two step process comprising a Diffusion Prior that generates a CLIP image embedding from text and a Diffusion Decoder that generates an image from a CLIP image embedding. We explore the capabilities of the Diffusion Prior and the advantages of an intermediate CLIP representation. We observe that Diffusion Prior can be used in a memory and compute efficient way to constrain the generation to a specific domain without altering the larger Diffusion Decoder. Moreover, we show that the Diffusion Prior can be trained with additional conditional information such as color histogram to further control the generation. We show quantitatively and qualitatively that the proposed approaches perform better than prompt engineering for domain specific generation and existing baselines for color conditioned generation. We believe that our observations and results will instigate further research into the diffusion prior and uncover more of its capabilities. This paper explores the capabilities of Diffusion Prior, a component of DALLE-2, for controllable and conditional text-to-image generation by training it on specific domains and with additional conditional information like color histograms. This approach allows for domain-specific and conditional generation without modifying the larger Diffusion Decoder, making it memory and computationally efficient. The authors trained separate Diffusion Prior models on datasets of textures, vectors, isolated objects, and color histograms. They also trained a custom LDM conditioned on CLIP L/14 image embeddings as the Diffusion Decoder. Domain-specific priors effectively constrain image generation to the desired domain (textures, vectors, isolated objects) and outperform Stable Diffusion with prompt engineering. The color-conditioned prior generates images aligning with both the text prompt and color palette, surpassing color transfer methods applied to Stable Diffusion outputs in terms of quality and semantic relevance. The proposed method is more memory and computationally efficient than finetuning large diffusion models for similar tasks. The color prior might be biased towards generating vector images when trained on a dataset containing vector images with color histograms. Further research is needed to explore the approach's effectiveness on a wider range of domains and conditional inputs. diffusion models, text-to-image generation, conditional image generation, domain adaptation, diffusion prior
2302.11566 Report Vid2Avatar: 3D Avatar Reconstruction from Videos in the Wild via Self-supervised Scene Decomposition Chen Guo, Tianjian Jiang, Xu Chen, Jie Song, Otmar Hilliges We present Vid2Avatar, a method to learn human avatars from monocular in-the-wild videos. Reconstructing humans that move naturally from monocular in-the-wild videos is difficult. Solving it requires accurately separating humans from arbitrary backgrounds. Moreover, it requires reconstructing detailed 3D surface from short video sequences, making it even more challenging. Despite these challenges, our method does not require any groundtruth supervision or priors extracted from large datasets of clothed human scans, nor do we rely on any external segmentation modules. Instead, it solves the tasks of scene decomposition and surface reconstruction directly in 3D by modeling both the human and the background in the scene jointly, parameterized via two separate neural fields. Specifically, we define a temporally consistent human representation in canonical space and formulate a global optimization over the background model, the canonical human shape and texture, and per-frame human pose parameters. A coarse-to-fine sampling strategy for volume rendering and novel objectives are introduced for a clean separation of dynamic human and static background, yielding detailed and robust 3D human geometry reconstructions. We evaluate our methods on publicly available datasets and show improvements over prior art. This paper presents Vid2Avatar, a method to reconstruct detailed 3D avatars from monocular in-the-wild videos via self-supervised scene decomposition, without requiring groundtruth supervision, priors from large datasets, or external segmentation modules. Reconstructing humans from in-the-wild videos is challenging because it requires separating humans from arbitrary backgrounds and reconstructing detailed surfaces from short video sequences. The method jointly models the human and background with separate neural fields and optimizes them globally. It defines a temporally consistent human representation in canonical space and utilizes a coarse-to-fine sampling strategy with novel objectives for clean separation. Outperforms state-of-the-art methods in 2D segmentation, novel view synthesis, and 3D reconstruction. Achieves robust and detailed 3D reconstruction of humans with complex clothing and facial features. Demonstrates high-quality results on various in-the-wild videos from different sources. Relies on reasonable pose estimates as input. Faces challenges with loose clothing due to fast dynamics. 3d human reconstruction, scene decomposition, neural rendering, implicit neural representation, monocular video
2302.11562 Report Uncovering Bias in Face Generation Models Cristian Muñoz, Sara Zannone, Umar Mohammed, Adriano Koshiyama Recent advancements in GANs and diffusion models have enabled the creation of high-resolution, hyper-realistic images. However, these models may misrepresent certain social groups and present bias. Understanding bias in these models remains an important research question, especially for tasks that support critical decision-making and could affect minorities. The contribution of this work is a novel analysis covering architectures and embedding spaces for fine-grained understanding of bias over three approaches: generators, attribute modifier, and post-processing bias mitigators. This work shows that generators suffer from bias across all social groups with attribute preferences such as between 75%-85% for whiteness and 60%-80% for the female gender (for all trained CelebA models) and low probabilities of generating children and older men. Modifier and mitigators work as post-processor and change the generator performance. For instance, attribute channel perturbation strategies modify the embedding spaces. We quantify the influence of this change on group fairness by measuring the impact on image quality and group features. Specifically, we use the Fréchet Inception Distance (FID), the Face Matching Error and the Self-Similarity score. For Interfacegan, we analyze one and two attribute channel perturbations and examine the effect on the fairness distribution and the quality of the image. Finally, we analyzed the post-processing bias mitigators, which are the fastest and most computationally efficient way to mitigate bias. We find that these mitigation techniques show similar results on KL divergence and FID score, however, self-similarity scores show a different feature concentration on the new groups of the data distribution. The weaknesses and ongoing challenges described in this work must be considered in the pursuit of creating fair and unbiased face generation models. The paper presents a novel analysis of bias in face generation models, focusing on architectures and embedding spaces to understand bias in generators, attribute modifiers, and post-processing bias mitigators. Understanding bias in face generation models is crucial as biased datasets can lead to unfair representations and discriminatory outcomes, especially in critical decision-making tasks affecting minorities. The study analyzes bias across different generators (StyleGAN2, CIPS, LDM, DDPM), attribute channel modifiers (InterfaceGAN, GANSpace, StyleSpace), and bias mitigators (StyleFlow, FairGen, FairStyle) using metrics like FID, Face Matching Error, Self-Similarity score, and KL divergence. Generators exhibit bias across social groups, showing preferences for whiteness (75%-85%) and female gender (60%-80%) in CelebA-trained models, and low representation of children and older men. Attribute modifiers, while manipulating attribute boundaries, impact generator performance, as seen with InterfaceGAN and its effect on fairness distribution and image quality. Post-processing bias mitigators, while computationally efficient, show varying results, with similar KL divergence and FID scores but differing self-similarity scores, indicating varied feature concentration in mitigated datasets. The study primarily uses binary classifications for certain attributes like age (Young/Adult), which might not fully capture the nuances of age representation. Future work could explore intersectional bias across multiple attributes and develop more robust evaluation metrics for fairness in face generation models. bias analysis, face generation, bias mitigation, gans, diffusion models
2302.11552 Report Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and MCMC Yilun Du, Conor Durkan, Robin Strudel, Joshua B. Tenenbaum, Sander Dieleman, Rob Fergus, Jascha Sohl-Dickstein, Arnaud Doucet, Will Grathwohl Since their introduction, diffusion models have quickly become the prevailing approach to generative modeling in many domains. They can be interpreted as learning the gradients of a time-varying sequence of log-probability density functions. This interpretation has motivated classifier-based and classifier-free guidance as methods for post-hoc control of diffusion models. In this work, we build upon these ideas using the score-based interpretation of diffusion models, and explore alternative ways to condition, modify, and reuse diffusion models for tasks involving compositional generation and guidance. In particular, we investigate why certain types of composition fail using current techniques and present a number of solutions. We conclude that the sampler (not the model) is responsible for this failure and propose new samplers, inspired by MCMC, which enable successful compositional generation. Further, we propose an energy-based parameterization of diffusion models which enables the use of new compositional operators and more sophisticated, Metropolis-corrected samplers. Intriguingly we find these samplers lead to notable improvements in compositional generation across a wide set of problems such as classifier-guided ImageNet modeling and compositional text-to-image generation. This paper investigates the compositionality of diffusion models, focusing on why typical composition methods fail and introducing solutions based on MCMC sampling and an energy-based parameterization for diffusion models. Compositionality in generative models is crucial for efficiently repurposing learned priors and achieving flexible generation without retraining for complex scenarios. The authors analyze the failure of existing composition methods in diffusion models, proposing annealed MCMC sampling and an energy-based parameterization to address the issue. They evaluate their method on various datasets, including 2D synthetic data, CLEVR, ImageNet, and text-to-image generation. MCMC sampling significantly improves compositional generation quality compared to reverse diffusion. The energy-based parameterization enables more sophisticated MCMC sampling techniques with Metropolis corrections, leading to further improvements. The proposed methods demonstrate impressive results in complex compositional tasks like text-to-image generation with multiple concepts and generating image tapestries with spatially controlled content. The use of sophisticated MCMC sampling increases computational cost compared to standard diffusion sampling. The energy-based parameterization requires double the memory and compute compared to score-parameterized models. diffusion models, compositional generation, energy-based models, mcmc sampling, text-to-image generation
2302.11383 Report Entity-Level Text-Guided Image Manipulation Yikai Wang, Jianan Wang, Guansong Lu, Hang Xu, Zhenguo Li, Wei Zhang, Yanwei Fu Existing text-guided image manipulation methods aim to modify the appearance of the image or to edit a few objects in a virtual or simple scenario, which is far from practical applications. In this work, we study a novel task on text-guided image manipulation on the entity level in the real world (eL-TGIM). The task imposes three basic requirements, (1) to edit the entity consistent with the text descriptions, (2) to preserve the entity-irrelevant regions, and (3) to merge the manipulated entity into the image naturally. To this end, we propose an elegant framework, dubbed as SeMani, forming the Semantic Manipulation of real-world images that can not only edit the appearance of entities but also generate new entities corresponding to the text guidance. To solve eL-TGIM, SeMani decomposes the task into two phases: the semantic alignment phase and the image manipulation phase. In the semantic alignment phase, SeMani incorporates a semantic alignment module to locate the entity-relevant region to be manipulated. In the image manipulation phase, SeMani adopts a generative model to synthesize new images conditioned on the entity-irrelevant regions and target text descriptions. We discuss and propose two popular generation processes that can be utilized in SeMani, the discrete auto-regressive generation with transformers and the continuous denoising generation with diffusion models, yielding SeMani-Trans and SeMani-Diff, respectively. We conduct extensive experiments on the real datasets CUB, Oxford, and COCO datasets to verify that SeMani can distinguish the entity-relevant and -irrelevant regions and achieve more precise and flexible manipulation in a zero-shot manner compared with baseline methods. Our codes and models will be released at https://github.com/Yikai-Wang/SeMani. This paper introduces entity-Level Text-Guided Image Manipulation (eL-TGIM), a novel task aiming to manipulate specific entities within an image using text descriptions. eL-TGIM addresses the limitations of existing TGIM methods that struggle to precisely identify and edit entities in real-world images. The authors propose SeMani, a framework that decomposes eL-TGIM into semantic alignment and image manipulation phases. They present two variants: SeMani-Trans, employing discrete token-wise processing, and SeMani-Diff, utilizing continuous pixel-level manipulation with diffusion models. SeMani effectively distinguishes and manipulates entities based on text descriptions while preserving irrelevant image regions. SeMani-Trans demonstrates the ability to manipulate both appearance and structure of entities. Quantitative and qualitative evaluations on CUB, Oxford, and COCO datasets show SeMani's superiority over existing TGIM methods. SeMani-Trans's autoregressive generation may limit its capacity to fully leverage unmasked image regions. Future work could explore enhancing SeMani's ability to handle complex relationships and interactions between multiple entities. image manipulation, text-guided image editing, semantic alignment, diffusion models, vision and language
2302.11306 Report Human MotionFormer: Transferring Human Motions with Vision Transformers Hongyu Liu, Xintong Han, Chengbin Jin, Lihui Qian, Huawei Wei, Zhe Lin, Faqiang Wang, Haoye Dong, Yibing Song, Jia Xu, Qifeng Chen Human motion transfer aims to transfer motions from a target dynamic person to a source static one for motion synthesis. An accurate matching between the source person and the target motion in both large and subtle motion changes is vital for improving the transferred motion quality. In this paper, we propose Human MotionFormer, a hierarchical ViT framework that leverages global and local perceptions to capture large and subtle motion matching, respectively. It consists of two ViT encoders to extract input features (i.e., a target motion image and a source human image) and a ViT decoder with several cascaded blocks for feature matching and motion transfer. In each block, we set the target motion feature as Query and the source person as Key and Value, calculating the cross-attention maps to conduct a global feature matching. Further, we introduce a convolutional layer to improve the local perception after the global cross-attention computations. This matching process is implemented in both warping and generation branches to guide the motion transfer. During training, we propose a mutual learning loss to enable the co-supervision between warping and generation branches for better motion representations. Experiments show that our Human MotionFormer sets the new state-of-the-art performance both qualitatively and quantitatively. Project page: \url{https://github.com/KumapowerLIU/Human-MotionFormer} This paper proposes Human MotionFormer, a hierarchical Vision Transformer framework that leverages global and local perceptions for accurate motion matching in human motion transfer. Accurate matching between source person and target motion is crucial for high-quality motion transfer, especially in scenarios with both large and subtle motion changes. The method utilizes two ViT encoders for feature extraction and a ViT decoder with cascaded blocks for feature matching and motion transfer. It incorporates cross-attention for global matching and convolutional layers for local refinement. A mutual learning loss is introduced to enable co-supervision between warping and generation branches during training. MotionFormer achieves state-of-the-art performance both qualitatively and quantitatively on human motion transfer benchmarks. The method effectively captures both large and subtle motion changes, resulting in more realistic and natural motion transfer results. The proposed mutual learning loss effectively improves the quality of generated images by enhancing the complementariness of warping and generation branches. The model assumes a fixed background, which might limit its applicability in complex real-world scenes. The computational cost of the model is relatively high compared to some existing methods. motion transfer, vision transformer, global and local matching, mutual learning, image generation
2302.10893 Report Fair Diffusion: Instructing Text-to-Image Generation Models on Fairness Felix Friedrich, Manuel Brack, Lukas Struppek, Dominik Hintersdorf, Patrick Schramowski, Sasha Luccioni, Kristian Kersting Generative AI models have recently achieved astonishing results in quality and are consequently employed in a fast-growing number of applications. However, since they are highly data-driven, relying on billion-sized datasets randomly scraped from the internet, they also suffer from degenerated and biased human behavior, as we demonstrate. In fact, they may even reinforce such biases. To not only uncover but also combat these undesired effects, we present a novel strategy, called Fair Diffusion, to attenuate biases after the deployment of generative text-to-image models. Specifically, we demonstrate shifting a bias, based on human instructions, in any direction yielding arbitrarily new proportions for, e.g., identity groups. As our empirical evaluation demonstrates, this introduced control enables instructing generative image models on fairness, with no data filtering and additional training required. The paper introduces Fair Diffusion, a novel strategy to mitigate biases in deployed text-to-image generative models by allowing users to instruct the model on fairness using textual guidance. Generative AI models, despite their impressive capabilities, often perpetuate and amplify biases present in their training data, leading to unfair outcomes in applications. Fair Diffusion builds upon classifier-free guidance and introduces a fairness guidance term that allows users to steer image generation towards desired attribute proportions, enabling the implementation of different fairness definitions. The study reveals significant gender and racial biases in Stable Diffusion's training dataset (LAION-5B) and its pre-trained model (CLIP), which are mirrored in the generated images. Stable Diffusion's generated images exhibit amplification, reflection, or mitigation of biases compared to LAION-5B, with no clear tendency observed. Fair Diffusion successfully mitigates gender occupation biases in Stable Diffusion's output, shifting attribute proportions towards user-defined fairness goals while preserving overall image composition. The study relies on binary gender classification due to the limitations of current tools, while acknowledging the non-binary nature of gender. The evaluation of Fair Diffusion relies on a pre-trained classifier (FairFace) for gender classification, which may have its own inherent biases. fairness, bias mitigation, generative ai, text-to-image synthesis, diffusion models
2302.10781 Report Learning 3D Photography Videos via Self-supervised Diffusion on Single Images Xiaodong Wang, Chenfei Wu, Shengming Yin, Minheng Ni, Jianfeng Wang, Linjie Li, Zhengyuan Yang, Fan Yang, Lijuan Wang, Zicheng Liu, Yuejian Fang, Nan Duan 3D photography renders a static image into a video with appealing 3D visual effects. Existing approaches typically first conduct monocular depth estimation, then render the input frame to subsequent frames with various viewpoints, and finally use an inpainting model to fill those missing/occluded regions. The inpainting model plays a crucial role in rendering quality, but it is normally trained on out-of-domain data. To reduce the training and inference gap, we propose a novel self-supervised diffusion model as the inpainting module. Given a single input image, we automatically construct a training pair of the masked occluded image and the ground-truth image with random cycle-rendering. The constructed training samples are closely aligned to the testing instances, without the need of data annotation. To make full use of the masked images, we design a Masked Enhanced Block (MEB), which can be easily plugged into the UNet and enhance the semantic conditions. Towards real-world animation, we present a novel task: out-animation, which extends the space and time of input objects. Extensive experiments on real datasets show that our method achieves competitive results with existing SOTA methods. This paper proposes a novel self-supervised diffusion model for 3D photography that can generate high-quality 3D videos from single images, addressing the limitations of previous methods requiring large multi-view datasets. Existing 3D photography methods suffer from a gap between training and inference, particularly in complex scenes, leading to visual distortions. This work aims to bridge this gap and enable high-quality 3D video generation from single images. The proposed method uses a cycle-rendering technique to create self-supervised training pairs of masked and ground truth images. It then leverages a conditional diffusion model with a Masked Enhanced Block (MEB) to learn to inpaint the occluded regions of images, resulting in realistic 3D videos. The method outperforms previous state-of-the-art methods in novel view synthesis on RealEstate10k and MannequinChallenge datasets. Qualitative results demonstrate the model's ability to generate clearer, more realistic content with better detail preservation compared to baselines. The proposed out-animation task extends the capabilities of 3D photography by generating videos that extend the space and time of input objects, showing promising results on the MSCOCO dataset. The method currently relies on monocular depth estimation, which can introduce errors in complex scenes. Further exploration is needed to improve the temporal consistency and smoothness of generated 3D videos, particularly in the out-animation task. 3d photography, diffusion models, self-supervised learning, novel view synthesis, out-animation
2302.10688 Report On Calibrating Diffusion Probabilistic Models Tianyu Pang, Cheng Lu, Chao Du, Min Lin, Shuicheng Yan, Zhijie Deng Recently, diffusion probabilistic models (DPMs) have achieved promising results in diverse generative tasks. A typical DPM framework includes a forward process that gradually diffuses the data distribution and a reverse process that recovers the data distribution from time-dependent data scores. In this work, we observe that the stochastic reverse process of data scores is a martingale, from which concentration bounds and the optional stopping theorem for data scores can be derived. Then, we discover a simple way for calibrating an arbitrary pretrained DPM, with which the score matching loss can be reduced and the lower bounds of model likelihood can consequently be increased. We provide general calibration guidelines under various model parametrizations. Our calibration method is performed only once and the resulting models can be used repeatedly for sampling. We conduct experiments on multiple datasets to empirically validate our proposal. Our code is at https://github.com/thudzj/Calibrated-DPMs. This paper presents a simple calibration technique for pre-trained diffusion probabilistic models (DPMs) to enhance sample quality and model likelihood. Existing DPMs often suffer from mis-calibration due to dataset bias or sub-optimal training, leading to reduced performance. The method leverages the martingale property of data scores in DPMs and subtracts a time-dependent calibration term (expectation of the score model) from the pre-trained model's output. Calibrated DPMs demonstrate significantly improved sample quality (FID score) on CIFAR-10 and CelebA datasets, especially with high-order DPM-Solver samplers. Calibration reduces the score matching objective, leading to an increased lower bound for model likelihood, as evidenced by experiments on various datasets. The calibration term can be effectively estimated using a substantial portion of training data or generated data from the pre-trained model. While improving model likelihood, calibration does not always guarantee a lower FID score, highlighting the complex relationship between likelihood and sample quality. Post-training calibration is computationally challenging for text-to-image generation due to the vast number of conditions, requiring alternative strategies like dynamic recording. diffusion models, generative models, score matching, calibration, model likelihood
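A minimal sketch of the one-off calibration described above, assuming access to the pretrained noise predictor, a forward-diffusion helper, and (a subset of) the training data; names are illustrative and the estimator is simplified relative to the paper.

```python
import torch

@torch.no_grad()
def estimate_calibration_terms(model, data_loader, diffuse, timesteps):
    """Estimate E_x[eps_theta(x_t, t)] per timestep, once, offline."""
    terms = {}
    for t in timesteps:
        total, n = None, 0
        for x0 in data_loader:
            x_t = diffuse(x0, t)          # forward-noise clean data to level t (assumed helper)
            eps = model(x_t, t)           # pretrained DPM prediction
            total = eps.sum(0) if total is None else total + eps.sum(0)
            n += eps.shape[0]
        terms[t] = total / n
    return terms

def calibrated_eps(model, x_t, t, terms):
    """Calibrated prediction used at sampling time: subtract the time-dependent mean."""
    return model(x_t, t) - terms[t]
```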
2302.10668 Report $PC^2$: Projection-Conditioned Point Cloud Diffusion for Single-Image 3D Reconstruction Luke Melas-Kyriazi, Christian Rupprecht, Andrea Vedaldi Reconstructing the 3D shape of an object from a single RGB image is a long-standing and highly challenging problem in computer vision. In this paper, we propose a novel method for single-image 3D reconstruction which generates a sparse point cloud via a conditional denoising diffusion process. Our method takes as input a single RGB image along with its camera pose and gradually denoises a set of 3D points, whose positions are initially sampled randomly from a three-dimensional Gaussian distribution, into the shape of an object. The key to our method is a geometrically-consistent conditioning process which we call projection conditioning: at each step in the diffusion process, we project local image features onto the partially-denoised point cloud from the given camera pose. This projection conditioning process enables us to generate high-resolution sparse geometries that are well-aligned with the input image, and can additionally be used to predict point colors after shape reconstruction. Moreover, due to the probabilistic nature of the diffusion process, our method is naturally capable of generating multiple different shapes consistent with a single input image. In contrast to prior work, our approach not only performs well on synthetic benchmarks, but also gives large qualitative improvements on complex real-world data. This paper introduces Projection-Conditioned Point Cloud Diffusion, a novel method for reconstructing 3D objects from single RGB images. This is achieved by gradually denoising a randomly sampled point cloud into the shape of an object, guided by image features projected onto the points throughout the process. Reconstructing 3D shapes from single images is challenging but crucial for applications like AR/VR. Existing deep learning methods are often limited to low-resolution outputs or struggle with representing shape ambiguity. This method leverages the power of denoising diffusion models to produce high-resolution, diverse 3D reconstructions. The method utilizes a conditional denoising diffusion process on point clouds. Crucially, it introduces "projection conditioning", where image features are projected onto the intermediate point clouds at each denoising step, ensuring geometric consistency with the input image. This conditioning is also used for predicting point colors. The method achieves competitive results on the ShapeNet benchmark, particularly excelling in reconstructing objects with fine details. Qualitative results on the real-world Co3D dataset demonstrate the capability to generate high-quality, detailed 3D reconstructions, outperforming previous methods like NeRF-WCE in handling shape uncertainty. By exploiting the probabilistic nature of diffusion models, the method can produce multiple plausible 3D shapes per input image, enabling filtering strategies to select the most consistent reconstruction. The method currently relies on point cloud ground truth for training, although this can be obtained from multi-view data. While filtering strategies improve results, there's room for developing more sophisticated filtering criteria to further bridge the gap to the oracle upper bound. 3d reconstruction, diffusion models, point clouds, single-view reconstruction, conditional image synthesis
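A rough sketch of what the projection-conditioning step could look like, assuming a pinhole camera and a precomputed image feature map; the nearest-pixel sampling and all names are illustrative rather than the paper's implementation.

```python
import torch

def projection_condition(points, feat_map, K, Rt):
    """Attach 2D image features to 3D points via pinhole projection.

    points   : (N, 3) partially denoised point cloud in world coordinates
    feat_map : (C, H, W) feature map extracted from the input view
    K, Rt    : (3, 3) intrinsics and (3, 4) world-to-camera extrinsics
    Returns  : (N, 3 + C) points concatenated with nearest-pixel features, fed to the
               point-cloud denoiser at every diffusion step.
    """
    N = points.shape[0]
    homog = torch.cat([points, torch.ones(N, 1, dtype=points.dtype)], dim=1)  # (N, 4)
    cam = Rt @ homog.T                                # (3, N) camera-frame coordinates
    uv = K @ cam
    uv = uv[:2] / uv[2].clamp(min=1e-6)               # pixel coordinates
    C, H, W = feat_map.shape
    u = uv[0].round().long().clamp(0, W - 1)
    v = uv[1].round().long().clamp(0, H - 1)
    sampled = feat_map[:, v, u].T                     # (N, C) sampled image features
    return torch.cat([points, sampled], dim=1)
```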
2302.10663 Report RealFusion: 360° Reconstruction of Any Object from a Single Image Luke Melas-Kyriazi, Christian Rupprecht, Iro Laina, Andrea Vedaldi We consider the problem of reconstructing a full 360° photographic model of an object from a single image of it. We do so by fitting a neural radiance field to the image, but find this problem to be severely ill-posed. We thus take an off-the-shelf conditional image generator based on diffusion and engineer a prompt that encourages it to "dream up" novel views of the object. Using an approach inspired by DreamFields and DreamFusion, we fuse the given input view, the conditional prior, and other regularizers in a final, consistent reconstruction. We demonstrate state-of-the-art reconstruction results on benchmark images when compared to prior methods for monocular 3D reconstruction of objects. Qualitatively, our reconstructions provide a faithful match of the input view and a plausible extrapolation of its appearance and 3D shape, including to the side of the object not visible in the image. Introduces RealFusion, a method for reconstructing a full 360° photographic 3D model of an object from a single image using an off-the-shelf 2D diffusion image generator as a prior. Solves the severely ill-posed problem of single-image 3D reconstruction by leveraging the powerful statistical model of the 3D world captured in pre-trained 2D diffusion models. Uses a single-image textual inversion technique to condition the diffusion model to 'dream up' novel views of the object. These views, along with the input image, are used to train a neural radiance field in a coarse-to-fine manner with additional regularization for smooth surfaces. Achieves state-of-the-art reconstruction results on benchmark images and in-the-wild images compared to previous single-image reconstruction methods. Generates plausible 3D reconstructions that faithfully match the input view and provide a plausible extrapolation of appearance and 3D shape. Demonstrates the viability of leveraging pre-trained 2D diffusion models for single-image 3D reconstruction. Occasionally fails to converge to a plausible geometry or copies the front view to the back of the object. Future work includes specializing the diffusion model for new-view synthesis and incorporating dynamics for animated 3D scenes. 3d reconstruction, diffusion models, neural radiance fields, single-image reconstruction, textual inversion
2302.10586 Report Diffusion Models and Semi-Supervised Learners Benefit Mutually with Few Labels Zebin You, Yong Zhong, Fan Bao, Jiacheng Sun, Chongxuan Li, Jun Zhu In an effort to further advance semi-supervised generative and classification tasks, we propose a simple yet effective training strategy called dual pseudo training (DPT), built upon strong semi-supervised learners and diffusion models. DPT operates in three stages: training a classifier on partially labeled data to predict pseudo-labels; training a conditional generative model using these pseudo-labels to generate pseudo images; and retraining the classifier with a mix of real and pseudo images. Empirically, DPT consistently achieves SOTA performance of semi-supervised generation and classification across various settings. In particular, with one or two labels per class, DPT achieves a Fréchet Inception Distance (FID) score of 3.08 or 2.52 on ImageNet 256x256. Besides, DPT outperforms competitive semi-supervised baselines substantially on ImageNet classification tasks, achieving top-1 accuracies of 59.0 (+2.8), 69.5 (+3.0), and 74.4 (+2.0) with one, two, or five labels per class, respectively. Notably, our results demonstrate that diffusion can generate realistic images with only a few labels (e.g., <0.1%) and generative augmentation remains viable for semi-supervised classification. Our code is available at https://github.com/ML-GSAI/DPT. This paper introduces Dual Pseudo Training (DPT), a novel training method for improving semi-supervised image generation and classification by leveraging the synergy between diffusion models and semi-supervised classifiers. DPT addresses the challenge of limited labeled data in semi-supervised learning, aiming to improve the performance of both conditional image generation and classification tasks. DPT operates in three stages: 1) Training a classifier on partially labeled data to generate pseudo-labels for unlabeled data. 2) Training a conditional generative diffusion model on all data using the pseudo-labels. 3) Retraining or fine-tuning the classifier using augmented data generated by the diffusion model. DPT achieves state-of-the-art semi-supervised generation performance on CIFAR-10 and ImageNet, even outperforming some supervised methods. DPT significantly improves semi-supervised classification results on ImageNet benchmarks, demonstrating the efficacy of generative augmentation for classification. The paper provides evidence that diffusion models can generate high-quality images with very few labels (<0.1%). The paper acknowledges that directly using pseudo images and labels without further filtering based on semantic alignment is a limitation. Future work can explore the integration of semantic alignment techniques like CLIP to filter noisy image-label pairs. semi-supervised learning, diffusion models, image generation, image classification, generative augmentation
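A high-level sketch of the three DPT stages as summarized above; every object and method here is a placeholder rather than the released implementation.

```python
def dual_pseudo_training(classifier, diffusion, labeled, unlabeled, n_per_class):
    """Three-stage dual pseudo training loop (placeholder interfaces throughout)."""
    # Stage 1: semi-supervised classifier on the few labels, then pseudo-label everything
    classifier.fit(labeled, unlabeled)
    pseudo = [(x, classifier.predict(x)) for x in unlabeled]
    # Stage 2: class-conditional diffusion model trained on real images + pseudo-labels
    diffusion.fit(labeled + pseudo)
    # Stage 3: retrain / fine-tune the classifier on real + generated images
    synthetic = [(diffusion.sample(c), c) for c in diffusion.classes for _ in range(n_per_class)]
    classifier.fit(labeled + synthetic, unlabeled)
    return classifier, diffusion
```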
2302.10523 Report I2V: Towards Texture-Aware Self-Supervised Blind Denoising using Self-Residual Learning for Real-World Images Kanggeun Lee, Kyungryun Lee, Won-Ki Jeong Although the advances of self-supervised blind denoising are significantly superior to conventional approaches without clean supervision in synthetic noise scenarios, it shows poor quality in real-world images due to spatially correlated noise corruption. Recently, pixel-shuffle downsampling (PD) has been proposed to eliminate the spatial correlation of noise. A study combining a blind spot network (BSN) and asymmetric PD (AP) successfully demonstrated that self-supervised blind denoising is applicable to real-world noisy images. However, PD-based inference may degrade texture details in the testing phase because high-frequency details (e.g., edges) are destroyed in the downsampled images. To avoid such an issue, we propose self-residual learning without the PD process to maintain texture information. We also propose an order-variant PD constraint, noise prior loss, and an efficient inference scheme (progressive random-replacing refinement (PR^3)) to boost overall performance. The results of extensive experiments show that the proposed method outperforms state-of-the-art self-supervised blind denoising approaches, including several supervised learning methods, in terms of PSNR, SSIM, LPIPS, and DISTS in real-world sRGB images. This paper presents I2V, a self-supervised blind denoising framework for real-world sRGB images that preserves texture details better than existing methods. Existing self-supervised blind denoising methods struggle with real-world images due to spatially correlated noise and often degrade texture details. I2V aims to address these limitations. I2V leverages self-residual learning with a noise extractor network, order-variant pixel-shuffle downsampling, a noise prior loss, and a progressive random-replacing refinement (PR^3) inference scheme. I2V outperforms state-of-the-art self-supervised blind denoisers on SIDD, DND, and NIND datasets in terms of PSNR, SSIM, LPIPS, and DISTS. The proposed method preserves texture details better than some supervised learning methods, as demonstrated by LPIPS and DISTS. I2V achieves a faster inference speed compared to the AP-BSN+R^3 method. Training I2V requires more GPU memory than AP-BSN due to increased computational cost. Future work includes exploring different noise extractor structures like Restormer. image denoising, self-supervised learning, blind denoising, texture preservation, real-world images
2302.10326 Report Unsupervised Out-of-Distribution Detection with Diffusion Inpainting Zhenzhen Liu, Jin Peng Zhou, Yufan Wang, Kilian Q. Weinberger Unsupervised out-of-distribution detection (OOD) seeks to identify out-of-domain data by learning only from unlabeled in-domain data. We present a novel approach for this task - Lift, Map, Detect (LMD) - that leverages recent advancement in diffusion models. Diffusion models are one type of generative models. At their core, they learn an iterative denoising process that gradually maps a noisy image closer to their training manifolds. LMD leverages this intuition for OOD detection. Specifically, LMD lifts an image off its original manifold by corrupting it, and maps it towards the in-domain manifold with a diffusion model. For an out-of-domain image, the mapped image would have a large distance away from its original manifold, and LMD would identify it as OOD accordingly. We show through extensive experiments that LMD achieves competitive performance across a broad variety of datasets. Code can be found at https://github.com/zhenzhel/lift_map_detect. This paper presents Lift, Map, Detect (LMD), a novel unsupervised out-of-distribution (OOD) detection method leveraging the manifold mapping ability of diffusion models. Unsupervised OOD detection is crucial for deploying machine learning models in real-world settings where out-of-domain data can lead to unpredictable and potentially harmful consequences. LMD lifts an image off its original manifold by masking it. Then, it maps the lifted image towards the in-domain manifold using a diffusion model trained on in-domain data. Finally, it leverages the reconstruction distance between the original and mapped images to detect OOD data. LMD achieves competitive performance on various datasets, demonstrating its effectiveness and versatility. Using multiple reconstructions and an alternating checkerboard masking strategy consistently enhances LMD's performance. LPIPS distance metric proves to be a robust choice for measuring reconstruction dissimilarity across different datasets. The reliance on iterative denoising in diffusion models makes LMD computationally expensive for real-time applications. Future work could explore integrating fast diffusion model sampling techniques to improve LMD's speed. out-of-distribution detection, diffusion models, unsupervised learning, image inpainting, reconstruction error
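A minimal sketch of the lift-map-detect loop under the stated assumptions: an in-domain diffusion inpainter and an LPIPS-style distance are provided as callables; the mask layout, patch size, and repetition counts are illustrative, not the paper's code.

```python
import torch

@torch.no_grad()
def lmd_ood_score(image, inpaint, lpips_dist, num_masks=4, num_reps=2, patch=8):
    """Average reconstruction distance over complementary masks; larger means more OOD.

    image      : (3, H, W) test image
    inpaint    : callable(image, mask) -> reconstruction from an in-domain diffusion model
                 (mask == 1 marks pixels to regenerate)
    lpips_dist : callable(a, b) -> perceptual distance
    """
    _, H, W = image.shape
    grid = torch.arange(H)[:, None] // patch + torch.arange(W)[None, :] // patch
    dists = []
    for _ in range(num_reps):
        for k in range(num_masks):
            mask = ((grid + k) % num_masks == 0).float().unsqueeze(0)  # checkerboard-style mask
            recon = inpaint(image, mask)     # "lift" (mask) then "map" back to the manifold
            dists.append(torch.as_tensor(lpips_dist(image, recon)))
    return torch.stack(dists).mean()
```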
2302.10305 Report Analyzing Multimodal Objectives Through the Lens of Generative Diffusion Guidance Chaerin Kong, Nojun Kwak Recent years have witnessed astonishing advances in the field of multimodal representation learning, with contrastive learning being the cornerstone for major breakthroughs. Latest works delivered further improvements by incorporating different objectives such as masked modeling and captioning into the frameworks, but our understanding on how these objectives facilitate learning remains vastly incomplete. In this paper, we leverage the fact that classifier-guided diffusion models generate images that reflect the semantic signals provided by the classifier to study the characteristics of multimodal learning objectives. Specifically, we compare contrastive, matching and captioning loss in terms of their semantic signals, and introduce a simple baseline that not only supports our analyses but also improves the quality of generative guidance in a straightforward manner. This paper leverages the ability of classifier-guided diffusion models to reflect semantic signals to analyze the characteristics of various multimodal learning objectives (contrastive, matching, and captioning) for vision-language pretraining. Understanding how different objectives contribute to multimodal representation learning, particularly in vision-language tasks, is crucial for improving generative models and achieving better text-to-image generation. The authors utilize a pre-trained diffusion model and the BLIP model, which is capable of evaluating contrastive (ITC), matching (ITM), and captioning (CAP) losses. They analyze the generated images by using each objective as guidance in the diffusion process. ITC excels at generating fine details but struggles with global scene composition and semantic object relations, often omitting or incorrectly fusing attributes. CAP demonstrates strong scene understanding and generates images faithful to complex prompts, but it has higher optimization complexity compared to ITC. ITM, incorporating patch-token cross-attention, exhibits robust visual understanding and representation, generating coherent scenes with accurate object-attribute relations. The study primarily relies on qualitative analysis and a limited user study for evaluation. Future work could explore incorporating more powerful generative models and diverse datasets to further validate the findings. multimodal learning, vision-language pretraining, contrastive learning, diffusion models, text-to-image generation
2302.10174 Report Towards Universal Fake Image Detectors that Generalize Across Generative Models Utkarsh Ojha, Yuheng Li, Yong Jae Lee With generative models proliferating at a rapid rate, there is a growing need for general purpose fake image detectors. In this work, we first show that the existing paradigm, which consists of training a deep network for real-vs-fake classification, fails to detect fake images from newer breeds of generative models when trained to detect GAN fake images. Upon analysis, we find that the resulting classifier is asymmetrically tuned to detect patterns that make an image fake. The real class becomes a sink class holding anything that is not fake, including generated images from models not accessible during training. Building upon this discovery, we propose to perform real-vs-fake classification without learning; i.e., using a feature space not explicitly trained to distinguish real from fake images. We use nearest neighbor and linear probing as instantiations of this idea. When given access to the feature space of a large pretrained vision-language model, the very simple baseline of nearest neighbor classification has surprisingly good generalization ability in detecting fake images from a wide variety of generative models; e.g., it improves upon the SoTA by +15.07 mAP and +25.90% acc when tested on unseen diffusion and autoregressive models. This paper proposes a method for detecting fake images generated by various generative models by leveraging the feature space of a large pre-trained vision-language model (CLIP-ViT), which is not explicitly trained for real-vs-fake classification. Existing deep learning methods for fake image detection struggle to generalize to unseen families of generative models, often misclassifying fake images from diffusion models as real. The authors propose two simple methods: 1) Nearest Neighbor classification: finding the nearest neighbor of a test image in a feature bank of real and fake images embedded using CLIP-ViT, and 2) Linear Probing: training a linear classifier on top of the CLIP-ViT features. Both Nearest Neighbor and Linear Probing using CLIP-ViT's feature space significantly outperform state-of-the-art methods in detecting fake images from unseen generative model families. The approach is robust to the choice of training data source (GAN or diffusion models). The method maintains good performance even with a smaller training dataset. The study mainly focuses on detecting completely generated images and might not directly apply to images with localized manipulations. Further research is needed to understand the underlying similarity between fake images from different generative models that enables their detection. fake image detection, generalization, generative models, clip, vision-language models
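A minimal sketch of the nearest-neighbor variant described above, assuming a frozen CLIP image encoder and precomputed, L2-normalized feature banks; all names are illustrative.

```python
import torch

@torch.no_grad()
def nn_fake_score(clip_encode, bank_real, bank_fake, image):
    """Nearest-neighbor real-vs-fake score in a frozen CLIP-ViT feature space.

    clip_encode : callable(image) -> (D,) image feature from a frozen CLIP encoder
    bank_real   : (N_r, D) L2-normalized features of known real images
    bank_fake   : (N_f, D) L2-normalized features of fakes from one accessible generator
    Positive scores mean the nearest fake neighbor is closer than the nearest real one.
    """
    f = clip_encode(image)
    f = f / f.norm()
    d_real = 1 - bank_real @ f    # cosine distances to the real bank
    d_fake = 1 - bank_fake @ f    # cosine distances to the fake bank
    return (d_real.min() - d_fake.min()).item()
```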
2302.10167 Report Cross-domain Compositing with Pretrained Diffusion Models Roy Hachnochi, Mingrui Zhao, Nadav Orzech, Rinon Gal, Ali Mahdavi-Amiri, Daniel Cohen-Or, Amit Haim Bermano Diffusion models have enabled high-quality, conditional image editing capabilities. We propose to expand their arsenal, and demonstrate that off-the-shelf diffusion models can be used for a wide range of cross-domain compositing tasks. Among numerous others, these include image blending, object immersion, texture-replacement and even CG2Real translation or stylization. We employ a localized, iterative refinement scheme which infuses the injected objects with contextual information derived from the background scene, and enables control over the degree and types of changes the object may undergo. We conduct a range of qualitative and quantitative comparisons to prior work, and exhibit that our method produces higher quality and realistic results without requiring any annotations or training. Finally, we demonstrate how our method may be used for data augmentation of downstream tasks. This paper proposes a novel method for cross-domain compositing using off-the-shelf diffusion models, enabling realistic merging of image parts from different visual domains (e.g., photos and paintings). This technique addresses the challenge of combining objects from different visual domains while maintaining realism and coherency, expanding the capabilities of diffusion models beyond traditional image editing tasks. The method leverages a localized, iterative refinement scheme based on ILVR (in-domain latent space interpolation). It infuses injected objects with contextual information from the background, allowing control over the degree and types of object changes while ensuring domain consistency. The method outperforms baselines in qualitative and quantitative comparisons for image modification, object immersion, and data augmentation for SVR. It enables realistic blending of objects with their backgrounds, matching style and adding details while preserving object structure. The technique effectively bridges the domain gap between synthetic and real images, improving the performance of single-view 3D reconstruction models on real-world data. The method faces challenges when processing small objects or semantically complex images, requiring further exploration for optimal parameter selection. Future work includes extending the technique to video, addressing temporal consistency for cross-domain video compositing. diffusion models, cross-domain compositing, image editing, object immersion, data augmentation
2302.09923 Report Prompt Stealing Attacks Against Text-to-Image Generation Models Xinyue Shen, Yiting Qu, Michael Backes, Yang Zhang Text-to-Image generation models have revolutionized the artwork design process and enabled anyone to create high-quality images by entering text descriptions called prompts. Creating a high-quality prompt that consists of a subject and several modifiers can be time-consuming and costly. In consequence, a trend of trading high-quality prompts on specialized marketplaces has emerged. In this paper, we perform the first study on understanding the threat of a novel attack, namely prompt stealing attack, which aims to steal prompts from generated images by text-to-image generation models. Successful prompt stealing attacks directly violate the intellectual property of prompt engineers and jeopardize the business model of prompt marketplaces. We first perform a systematic analysis on a dataset collected by ourselves and show that a successful prompt stealing attack should consider a prompt's subject as well as its modifiers. Based on this observation, we propose a simple yet effective prompt stealing attack, PromptStealer. It consists of two modules: a subject generator trained to infer the subject and a modifier detector for identifying the modifiers within the generated image. Experimental results demonstrate that PromptStealer is superior over three baseline methods, both quantitatively and qualitatively. We also make some initial attempts to defend PromptStealer. In general, our study uncovers a new attack vector within the ecosystem established by the popular text-to-image generation models. We hope our results can contribute to understanding and mitigating this emerging threat. This paper presents the first study on "prompt stealing attacks," which aim to steal the text prompts used to generate images from text-to-image generation models. Successful attacks could violate the intellectual property of prompt engineers and impact the business model of prompt marketplaces. The authors collect a dataset of prompt-image pairs and propose a novel attack method, PromptStealer, which uses a subject generator and a modifier detector to infer the prompt from an image. PromptStealer outperforms baseline methods in recovering prompts, as measured by semantic, modifier, image, and pixel similarities. The attack remains effective on real-world prompts traded in marketplaces. A defense method based on adversarial examples shows promise but requires strong assumptions. The evaluation primarily focuses on Stable Diffusion, with limited testing on other text-to-image models. The defense method assumes white-box access to the attack model. prompt engineering, text-to-image generation, intellectual property, adversarial examples, ai security
2302.09778 Report Composer: Creative and Controllable Image Synthesis with Composable Conditions Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, Jingren Zhou Recent large-scale generative models learned on big data are capable of synthesizing incredible images yet suffer from limited controllability. This work offers a new generation paradigm that allows flexible control of the output image, such as spatial layout and palette, while maintaining the synthesis quality and model creativity. With compositionality as the core idea, we first decompose an image into representative factors, and then train a diffusion model with all these factors as the conditions to recompose the input. At the inference stage, the rich intermediate representations work as composable elements, leading to a huge design space (i.e., exponentially proportional to the number of decomposed factors) for customizable content creation. It is noteworthy that our approach, which we call Composer, supports various levels of conditions, such as text description as the global information, depth map and sketch as the local guidance, color histogram for low-level details, etc. Besides improving controllability, we confirm that Composer serves as a general framework and facilitates a wide range of classical generative tasks without retraining. Code and models will be made available. This paper introduces Composer, a novel compositional generative model framework for highly controllable image synthesis, allowing flexible control over spatial layout, palette, and other image aspects while maintaining high synthesis quality and creativity. Existing generative models, though capable of producing high-quality images, often lack the controllability needed for practical design applications. This work addresses this limitation by proposing a compositional approach that significantly expands the control space and enables more flexible image generation. Composer decomposes images into representative factors (e.g., caption, semantics, color, sketch, depth) and trains a diffusion model conditioned on these factors for image recomposition. This enables flexible customization by combining different representations during inference. Composer enables diverse image manipulations like variations, interpolations, reconfigurations, and region-specific editing. It can reformulate traditional generation tasks such as colorization, style transfer, image translation, and virtual try-on without retraining. The model achieves a zero-shot FID of 9.2 on COCO for text-to-image generation, demonstrating its competitive performance. The joint training strategy for multiple conditions, though effective, could potentially downweight the single-conditional generation performance. Conflicts might arise when incompatible conditions are used, requiring further investigation and potential mitigation strategies. image generation, controllable generation, compositionality, diffusion models, multi-modal generation
2302.09554 Report Mixed Hierarchy Network for Image Restoration Hu Gao, Depeng Dang Image restoration is a long-standing low-level vision problem, e.g., deblurring and deraining. In the process of image restoration, it is necessary to consider not only the spatial details and contextual information of restoration to ensure the quality, but also the system complexity. Although many methods have been able to guarantee the quality of image restoration, the system complexity of the state-of-the-art (SOTA) methods is increasing as well. Motivated by this, we present a mixed hierarchy network that can balance these competing goals. Our main proposal is a mixed hierarchy architecture, that progressively recovers contextual information and spatial details from degraded images while we design intra-blocks to reduce system complexity. Specifically, our model first learns the contextual information using encoder-decoder architectures, and then combines them with high-resolution branches that preserve spatial detail. In order to reduce the system complexity of this architecture for convenient analysis and comparison, we replace or remove the nonlinear activation function with multiplication and use a simple network structure. In addition, we replace spatial convolution with global self-attention for the middle block of encoder-decoder. The resulting tightly interlinked hierarchy architecture, named as MHNet, delivers strong performance gains on several image restoration tasks, including image deraining, and deblurring. This paper presents MHNet, a mixed hierarchy network for image restoration that balances high-quality restoration with low system complexity. Existing deep learning methods for image restoration, while achieving good performance, often suffer from high system complexity, requiring significant computational resources. MHNet uses a mixed hierarchy architecture. It first learns contextual information at a lower hierarchy using encoder-decoder subnetworks with a selective multi-head attention mechanism. It then refines spatial details at a higher hierarchy operating on full resolution with a full resolution subnetwork. An adaptive feature fusion mechanism facilitates information exchange between hierarchies. The network primarily utilizes nonlinear activation-free blocks to reduce system complexity. MHNet achieves state-of-the-art performance on several image restoration tasks, including image deraining and deblurring. MHNet demonstrates significant performance gains with lower computational resources compared to existing methods. The paper provides ablation studies demonstrating the contribution of each component in MHNet. The model's performance is limited by the reliance on a fixed hierarchy structure. Exploring more efficient attention mechanisms or alternative feature fusion strategies could further enhance the network's performance. image restoration, deblurring, deraining, mixed hierarchy network, efficient deep learning
2302.09486 Report LC-NeRF: Local Controllable Face Generation in Neural Randiance Field Wenyang Zhou, Lu Yuan, Shuyu Chen, Lin Gao, Shimin Hu 3D face generation has achieved high visual quality and 3D consistency thanks to the development of neural radiance fields (NeRF). Recently, to generate and edit 3D faces with NeRF representation, some methods are proposed and achieve good results in decoupling geometry and texture. The latent codes of these generative models affect the whole face, and hence modifications to these codes cause the entire face to change. However, users usually edit a local region when editing faces and do not want other regions to be affected. Since changes to the latent code affect global generation results, these methods do not allow for fine-grained control of local facial regions. To improve local controllability in NeRF-based face editing, we propose LC-NeRF, which is composed of a Local Region Generators Module and a Spatial-Aware Fusion Module, allowing for local geometry and texture control of local facial regions. Qualitative and quantitative evaluations show that our method provides better local editing than state-of-the-art face editing methods. Our method also performs well in downstream tasks, such as text-driven facial image editing. This paper introduces LC-NeRF, a novel NeRF-based face generation and editing method that provides local control over geometry and texture, enabling fine-grained modifications to specific facial regions. Existing NeRF-based face editing methods often struggle to modify local regions without affecting the entire face, limiting their control and potentially leading to inconsistent facial identities. This method addresses this limitation by providing more fine-grained control over local features. LC-NeRF utilizes a local region generator module to decompose global 3D representations into local regions for separate geometry and texture control. A spatial-aware fusion module then aggregates these regions into a final image, ensuring seamless integration. The method is trained with a double discriminator supervision strategy to ensure high-quality generation and consistency between the image and the semantic mask. LC-NeRF can edit local facial regions accurately without affecting non-editing regions, as demonstrated by qualitative and quantitative evaluations. The method successfully decouples geometry and texture, allowing for independent modification of each aspect. LC-NeRF excels in downstream tasks such as text-driven facial image editing, showing its versatility and potential for various applications. While LC-NeRF allows for local region and geometry/texture decoupling, it currently lacks the ability to control local internal textures finely, such as hair texture or wrinkles. Future work will focus on enabling finer control over local texture content. face editing, neural radiance fields (nerf), generative adversarial networks (gans), local control, 3d face generation
2302.09311 Report Temporal Interpolation Is All You Need for Dynamic Neural Radiance Fields Sungheon Park, Minjung Son, Seokhwan Jang, Young Chun Ahn, Ji-Yeon Kim, Nahyup Kang Temporal interpolation often plays a crucial role to learn meaningful representations in dynamic scenes. In this paper, we propose a novel method to train spatiotemporal neural radiance fields of dynamic scenes based on temporal interpolation of feature vectors. Two feature interpolation methods are suggested depending on underlying representations, neural networks or grids. In the neural representation, we extract features from space-time inputs via multiple neural network modules and interpolate them based on time frames. The proposed multi-level feature interpolation network effectively captures features of both short-term and long-term time ranges. In the grid representation, space-time features are learned via four-dimensional hash grids, which remarkably reduces training time. The grid representation shows more than 100 times faster training speed than the previous neural-net-based methods while maintaining the rendering quality. Concatenating static and dynamic features and adding a simple smoothness term further improve the performance of our proposed models. Despite the simplicity of the model architectures, our method achieved state-of-the-art performance both in rendering quality for the neural representation and in training speed for the grid representation. This paper introduces a novel method for training dynamic Neural Radiance Fields (NeRFs) using temporal interpolation of feature vectors, enabling the representation of dynamic scenes without explicit deformation or scene flow estimation. Existing dynamic NeRF methods struggle with ambiguities in scene changes (appearance, movement, color change) and often rely on complex deformation modules. This method provides a simpler, more effective approach for learning dynamic scene representations. The method uses two representations: 1) Neural Representation: Features are extracted from space-time inputs via multiple MLPs and temporally interpolated. 2) Grid Representation: Space-time features are learned using 4D hash grids. Both representations are enhanced by concatenating static and dynamic features and incorporating a smoothness term. The neural representation achieves state-of-the-art rendering quality on D-NeRF and competitive results on HyperNeRF datasets. The grid representation achieves remarkably faster training speeds (over 100x) compared to neural-network-based methods while maintaining competitive rendering quality. The proposed smoothness regularizer, encouraging feature similarity between adjacent frames, consistently improves the performance of both representations. The method struggles to recover 3D structures of small, rapidly moving objects and unseen dynamic regions during training. Exploring hybrid representations combining the strengths of neural and grid representations could be promising for future work. neural radiance fields, dynamic scene reconstruction, temporal interpolation, hash grids, novel view synthesis
2302.09260 Report Attribute-Specific Manipulation Based on Layer-Wise Channels Yuanjie Yan, Jian Zhao, Furao Shen Image manipulation on the latent space of the pre-trained StyleGAN can control the semantic attributes of the generated images. Recently, some studies have focused on detecting channels with specific properties to directly manipulate the latent code, which is limited by the entanglement of the latent space. To detect the attribute-specific channels, we propose a novel detection method in the context of pre-trained classifiers. We analyse the gradients layer by layer on the style space. The intensities of the gradients indicate the channel's responses to specific attributes. The latent style codes of channels control separate attributes in the layers. We choose channels with top-$k$ gradients to control specific attributes in the maximum response layer. We implement single-channel and multi-channel manipulations with a certain attribute. Our methods can accurately detect relevant channels for a large number of face attributes. Extensive qualitative and quantitative results demonstrate that the proposed methods outperform state-of-the-art methods in generalization and scalability. This paper proposes a novel gradient-based method for detecting and manipulating attribute-specific channels in the style space of StyleGAN for semantic image editing. Existing methods for manipulating StyleGAN latent space are limited by entanglement, difficulty in pinpointing attribute-specific channels, and lack of flexibility in multi-attribute and continuous manipulation. The method leverages pre-trained classifiers to analyze gradients of style codes with respect to specific attributes. It selects the top-k channels with the largest gradients for single-channel or multi-channel manipulation. The method accurately detects relevant channels for a large number of face attributes (over 35), including both regions and semantic attributes. It enables both single-channel and multi-channel manipulation, allowing for fine-grained control over attribute editing. Quantitative and qualitative evaluations demonstrate superior performance over state-of-the-art methods in terms of generalization and scalability. Multi-channel manipulation requires further research on balancing the editing intensity of multiple channels. The method focuses on facial attributes and could be extended to other domains. semantic manipulation, face editing, generative adversarial networks (gans), stylegan, stylespace
2302.09057 Report Consistent Diffusion Models: Mitigating Sampling Drift by Learning to be Consistent Giannis Daras, Yuval Dagan, Alexandros G. Dimakis, Constantinos Daskalakis Imperfect score-matching leads to a shift between the training and the sampling distribution of diffusion models. Due to the recursive nature of the generation process, errors in previous steps yield sampling iterates that drift away from the training distribution. Yet, the standard training objective via Denoising Score Matching (DSM) is only designed to optimize over non-drifted data. To train on drifted data, we propose to enforce a \emph{consistency} property which states that predictions of the model on its own generated data are consistent across time. Theoretically, we show that if the score is learned perfectly on some non-drifted points (via DSM) and if the consistency property is enforced everywhere, then the score is learned accurately everywhere. Empirically we show that our novel training objective yields state-of-the-art results for conditional and unconditional generation in CIFAR-10 and baseline improvements in AFHQ and FFHQ. We open-source our code and models: https://github.com/giannisdaras/cdm This paper introduces Consistent Diffusion Models (CDM), a novel method to mitigate sampling drift in diffusion models by enforcing consistency, a property ensuring model predictions on generated data remain consistent over time. Sampling drift, a discrepancy between training and sampling distributions due to imperfect score matching, is a major challenge in diffusion models. This drift leads to accumulated errors during the recursive generation process, impacting sample quality. CDM addresses this by improving score function accuracy, particularly in regions with low probability under the target distribution. The authors define a 'consistency property' based on the idea that a denoising function's output should match the expected value of the clean image generated using the learned reverse process. They then propose a new training objective that enforces this consistency property, encouraging the model to make self-consistent predictions across time. Theoretically, the paper proves that enforcing consistency, along with a weak form of score matching, suffices to learn the correct score function everywhere. Empirically, CDM achieves state-of-the-art results for conditional and unconditional generation on CIFAR-10, surpassing previous benchmarks. CDM also shows baseline improvements in image quality and reduced geometric inconsistencies on more challenging datasets like AFHQ and FFHQ. The proposed regularization in CDM increases training time by approximately 1.5x. The method does not explicitly address or enforce the conservativeness of the learned vector field, a key theoretical assumption. diffusion models, generative models, score matching, sampling drift, consistency regularization
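A simplified sketch of the consistency regularizer described above; the assumed model interface, the number of reverse steps, and the placement of stop-gradients are illustrative and may differ from the paper's estimator.

```python
import torch

def consistency_loss(model, x_t, t, s, reverse_sample):
    """Penalize drift between the model's prediction at time t and its own prediction
    after running the learned reverse process from t to an earlier time s (s < t).

    reverse_sample         : callable(model, x_t, t, s) -> x_s from the model's reverse process
    model.predict_x0(x, t) : the model's estimate of the clean image (assumed interface)
    """
    with torch.no_grad():
        x_s = reverse_sample(model, x_t, t, s)   # generate along the model's own trajectory
        target = model.predict_x0(x_s, s)
    pred = model.predict_x0(x_t, t)
    return ((pred - target) ** 2).mean()
```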
2302.08908 Report LayoutDiffuse: Adapting Foundational Diffusion Models for Layout-to-Image Generation Jiaxin Cheng, Xiao Liang, Xingjian Shi, Tong He, Tianjun Xiao, Mu Li Layout-to-image generation refers to the task of synthesizing photo-realistic images based on semantic layouts. In this paper, we propose LayoutDiffuse that adapts a foundational diffusion model pretrained on large-scale image or text-image datasets for layout-to-image generation. By adopting a novel neural adaptor based on layout attention and task-aware prompts, our method trains efficiently, generates images with both high perceptual quality and layout alignment, and needs less data. Experiments on three datasets show that our method significantly outperforms other 10 generative models based on GANs, VQ-VAE, and diffusion models. Presents LayoutDiffuse, a method for adapting pretrained foundational diffusion models (trained on image-text pairs or only images) for layout-conditioned image generation. Addresses the limitations of existing layout-to-image generation methods, such as the inability to handle complex layouts or the need for extensive training. Adapts pretrained diffusion models by incorporating layout information through layout attention and task-adaptive prompts, fine-tuning the model for efficient adaptation. Achieves state-of-the-art results on bounding box and mask layout-to-image generation benchmarks, outperforming GAN-based and other diffusion-based methods. Demonstrates time and data efficiency, requiring significantly less training time and data compared to training diffusion models from scratch. Generates high-quality images that are both perceptually plausible and well-aligned with the input layouts, as evidenced by quantitative metrics and human evaluation. The adapted model size is larger due to the addition of layout attention layers. Future work can explore identity-preserving image editing by combining LayoutDiffuse with textual inversion fine-tuning methods. layout-to-image generation, diffusion models, fine-tuning, layout attention, task-adaptive prompts
2302.08788 Report MixNeRF: Modeling a Ray with Mixture Density for Novel View Synthesis from Sparse Inputs Seunghyeon Seo, Donghoon Han, Yeonjin Chang, Nojun Kwak Neural Radiance Field (NeRF) has broken new ground in the novel view synthesis due to its simple concept and state-of-the-art quality. However, it suffers from severe performance degradation unless trained with a dense set of images with different camera poses, which hinders its practical applications. Although previous methods addressing this problem achieved promising results, they relied heavily on the additional training resources, which goes against the philosophy of sparse-input novel-view synthesis pursuing the training efficiency. In this work, we propose MixNeRF, an effective training strategy for novel view synthesis from sparse inputs by modeling a ray with a mixture density model. Our MixNeRF estimates the joint distribution of RGB colors along the ray samples by modeling it with mixture of distributions. We also propose a new task of ray depth estimation as a useful training objective, which is highly correlated with 3D scene geometry. Moreover, we remodel the colors with regenerated blending weights based on the estimated ray depth and further improves the robustness for colors and viewpoints. Our MixNeRF outperforms other state-of-the-art methods in various standard benchmarks with superior efficiency of training and inference. MixNeRF, a novel regularization-based neural radiance field (NeRF) training strategy for high-quality novel view synthesis from sparse inputs, addresses the limitations of previous methods that rely heavily on extra training resources, enhancing both training and inference efficiency. Existing NeRF models struggle with performance degradation when trained on sparse input views due to the difficulty in accurately estimating 3D geometry, which hinders their practical applications in domains like AR/VR and autonomous driving where dense training data is often unavailable. MixNeRF models the colors along a ray with a mixture density model, using the predicted weights as mixing coefficients for a mixture of Laplace distributions. It introduces ray depth estimation as an auxiliary task, utilizing the estimated depths to regenerate blending weights and remodel colors for enhanced robustness against viewpoint shifts. MixNeRF successfully learns 3D geometry from sparse views by leveraging a mixture density model, representing blending weight distributions more accurately than baselines. It introduces ray depth estimation as an effective auxiliary task, resulting in more precise depth maps compared to methods relying on depth smoothing strategies. MixNeRF outperforms state-of-the-art pre-training and regularization methods on LLFF, DTU, and Realistic Synthetic 360° datasets, demonstrating superior efficiency in both training and inference. MixNeRF may exhibit artifacts in rendered images under extremely sparse scenarios (e.g., 3-view) due to interference from non-object elements like backgrounds. Future work could focus on developing algorithms for distinguishing between object and non-object pixels to further mitigate artifacts, particularly in datasets like DTU. novel view synthesis, neural radiance fields (nerf), sparse input, mixture density model, depth estimation
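A minimal sketch of the mixture-of-Laplace likelihood for a single ray, assuming per-sample color means, Laplace scales, and volume-rendering blending weights have already been predicted; names are illustrative.

```python
import torch

def mixnerf_nll(rgb_gt, mu_rgb, beta, weights):
    """Negative log-likelihood of a ray's color under a mixture of Laplace distributions,
    with the rendering weights acting as mixing coefficients.

    rgb_gt  : (3,)   ground-truth pixel color
    mu_rgb  : (S, 3) predicted color of each of S samples along the ray
    beta    : (S, 3) predicted Laplace scale per sample
    weights : (S,)   blending weights from volume rendering
    """
    comp = torch.exp(-(rgb_gt - mu_rgb).abs() / beta) / (2 * beta)   # Laplace densities
    mix = (weights[:, None] * comp).sum(dim=0)                        # mixture per channel
    return -torch.log(mix.clamp(min=1e-10)).sum()
```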
2302.08510 Report Text-driven Visual Synthesis with Latent Diffusion Prior Ting-Hsuan Liao, Songwei Ge, Yiran Xu, Yao-Chih Lee, Badour AlBahar, Jia-Bin Huang There has been tremendous progress in large-scale text-to-image synthesis driven by diffusion models enabling versatile downstream applications such as 3D object synthesis from texts, image editing, and customized generation. We present a generic approach using latent diffusion models as powerful image priors for various visual synthesis tasks. Existing methods that utilize such priors fail to use these models' full capabilities. To improve this, our core ideas are 1) a feature matching loss between features from different layers of the decoder to provide detailed guidance and 2) a KL divergence loss to regularize the predicted latent features and stabilize the training. We demonstrate the efficacy of our approach on three different applications, text-to-3D, StyleGAN adaptation, and layered image editing. Extensive results show our method compares favorably against baselines. This paper introduces a novel approach that utilizes latent diffusion models as powerful image priors for various visual synthesis tasks, such as text-to-3D, StyleGAN adaptation, and layered image editing. Existing methods often lack a unified approach to leverage diffusion models for different visual synthesis tasks. This paper aims to address this gap and provide a more generic and effective solution. The proposed approach consists of two key components: (1) a feature matching loss for extracting finer-grained details from multiple decoder layers of the latent diffusion model, and (2) a KL divergence loss to regularize the predicted latent features and stabilize the training process. In text-to-3D synthesis, the method generates more detailed and visually appealing 3D models compared to baselines using CLIP or latent score distillation alone. For StyleGAN adaptation, the method achieves superior FID scores and competitive CLIP and LPIPS scores, indicating improved image quality and diversity. In layered image editing, the method demonstrates superior performance in manipulating image appearances and generating fine details compared to Text2LIVE and latent score distillation baseline. The method struggles to resolve the multiple faces issue in the text-to-3D task, a common limitation in current text-to-3D methods. Some cases exhibit color over-saturation or out-of-focus issues, despite using the KL loss for regularization. latent diffusion model, visual synthesis, text-to-3d, stylegan adaptation, image editing
2302.08453 Report T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, Xiaohu Qie The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated strong power of learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate controlling (e.g., color and structure) is needed. In this paper, we aim to "dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly. Specifically, we propose to learn simple and lightweight T2I-Adapters to align internal knowledge in T2I models with external control signals, while freezing the original large T2I models. In this way, we can train various adapters according to different conditions, achieving rich control and editing effects in the color and structure of the generation results. Further, the proposed T2I-Adapters have attractive properties of practical value, such as composability and generalization ability. Extensive experiments demonstrate that our T2I-Adapter has promising generation quality and a wide range of applications. This paper proposes T2I-Adapter, a lightweight model designed to enhance the controllability of pre-trained text-to-image diffusion models like Stable Diffusion by aligning internal model knowledge with external control signals. Existing T2I models struggle to generate images that accurately reflect complex or imaginative user intentions, especially regarding structure and color, solely relying on text prompts. T2I-Adapters are trained to extract guidance features from various conditions like sketches, color palettes, depth maps, etc., and inject them into the encoder of the diffusion model, providing additional control signals during image generation. T2I-Adapters demonstrate superior generation quality and alignment compared to existing methods, evidenced by qualitative and quantitative (FID, CLIP Score) evaluations on tasks like sketch-to-image and segmentation-to-image generation. The method supports flexible single-adapter control for various conditions, including imaginative scenarios, and exhibits promising image editing capabilities. T2I-Adapters are composable, allowing for multi-condition control without retraining, and exhibit generalizability, enabling their use on custom models fine-tuned from the same base T2I model. Multi-adapter control currently requires manual adjustment of guidance feature combinations. Future work will explore adaptive fusion of multi-modal guidance information. text-to-image synthesis, diffusion models, controllable image generation, adapter networks, multi-modal guidance
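A toy sketch of an adapter branch in the spirit of T2I-Adapter, assuming the frozen U-Net exposes its encoder features for additive injection; the channel sizes and plain convolutional stages are illustrative, not the released architecture.

```python
import torch.nn as nn

class TinyAdapter(nn.Module):
    """Small CNN turning a condition map (sketch, depth, pose, ...) into multi-scale
    features that are added to the frozen diffusion U-Net's encoder features."""

    def __init__(self, cond_channels=1, unet_channels=(320, 640, 1280, 1280)):
        super().__init__()
        chans = (64,) + tuple(unet_channels)
        self.stem = nn.Conv2d(cond_channels, 64, 3, padding=1)
        self.stages = nn.ModuleList([
            nn.Sequential(nn.Conv2d(chans[i], chans[i + 1], 3, stride=2, padding=1), nn.SiLU())
            for i in range(len(unet_channels))
        ])

    def forward(self, cond):
        feats, h = [], self.stem(cond)
        for stage in self.stages:
            h = stage(h)
            feats.append(h)   # during sampling: unet_enc_feat_i = unet_enc_feat_i + feats[i]
        return feats
```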
2302.08374 Report Efficiency 360: Efficient Vision Transformers Badri N. Patro, Vijay Srinivas Agneeswaran Transformers are widely used for solving tasks in natural language processing, computer vision, speech, and music domains. In this paper, we talk about the efficiency of transformers in terms of memory (the number of parameters), computation cost (number of floating points operations), and performance of models, including accuracy, the robustness of the model, and fair & bias-free features. We mainly discuss the vision transformer for the image classification task. Our contribution is to introduce an efficient 360 framework, which includes various aspects of the vision transformer, to make it more efficient for industrial applications. By considering those applications, we categorize them into multiple dimensions such as privacy, robustness, transparency, fairness, inclusiveness, continual learning, probabilistic models, approximation, computational complexity, and spectral complexity. We compare various vision transformer models based on their performance, the number of parameters, and the number of floating point operations (FLOPs) on multiple datasets. This paper presents a comprehensive analysis of efficient transformers in the vision domain, focusing on their memory usage, computational cost, and performance across various aspects such as accuracy, robustness, fairness, and bias. The paper addresses the challenge of designing efficient transformer models for industrial applications, particularly in computer vision, due to the growing size and computational demands of these models. The paper reviews various techniques employed to enhance the efficiency of vision transformers, categorizing them into dimensions like computational complexity, spectral complexity, robustness, privacy, approximation, efficient learning, transparency, fairness, and inclusiveness. WaveViT demonstrates superior efficiency in terms of accuracy and parameter count compared to other transformer models. CvT achieves comparable results on ImageNet benchmarks with a relatively small number of parameters. CMT exhibits promising performance on ImageNet with a small parameter count and low FLOPs, especially for higher resolution images (384x384). The paper primarily focuses on image classification tasks, leaving the exploration of efficient transformers for other vision tasks like object detection and segmentation for future work. Evaluating the latest vision transformer models on the Long Range Arena (LRA) benchmark, which focuses on long-range data contexts, is an open area for future research. vision transformers, efficient deep learning, computational complexity, model robustness, transfer learning
2302.08242 Report Tuning computer vision models with task rewards André Susano Pinto, Alexander Kolesnikov, Yuge Shi, Lucas Beyer, Xiaohua Zhai Misalignment between model predictions and intended usage can be detrimental for the deployment of computer vision models. The issue is exacerbated when the task involves complex structured outputs, as it becomes harder to design procedures which address this misalignment. In natural language processing, this is often addressed using reinforcement learning techniques that align models with a task reward. We adopt this approach and show its surprising effectiveness across multiple computer vision tasks, such as object detection, panoptic segmentation, colorization and image captioning. We believe this approach has the potential to be widely useful for better aligning models with a diverse range of computer vision tasks. This paper introduces a novel approach for fine-tuning computer vision models by directly optimizing task rewards using reinforcement learning, specifically the REINFORCE algorithm. This is important because traditional computer vision models often rely on optimizing differentiable loss functions that may not directly correlate with the desired task performance or involve complex and indirect optimization procedures. The methodology consists of two steps: (1) pretraining a model with maximum likelihood estimation (MLE) to learn data distribution and (2) fine-tuning the model to maximize a task-specific reward function using REINFORCE. Reward optimization significantly improves performance on object detection and panoptic segmentation tasks, achieving results comparable to state-of-the-art methods. It enables control over qualitative aspects of model outputs, as demonstrated by tuning colorization models to produce vivid and colorful images. The approach proves effective for image captioning, showing consistent improvements in CIDEr score compared to MLE pretrained models. Reward hacking is a potential limitation where the model might exploit weaknesses in reward definition instead of improving the intended task. Careful reward design is crucial and often non-trivial, requiring consideration of potential biases and unintended consequences. computer vision, reinforcement learning, reward optimization, task alignment, mle pretraining
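A minimal sketch of the second step (reward maximization with REINFORCE on top of an MLE-pretrained model); model.sample and task_reward are hypothetical stand-ins for the paper's models and rewards, and the mean-reward baseline is a common simplification rather than the paper's exact choice:

```python
import torch

def reinforce_step(model, images, task_reward, optimizer, num_samples=4):
    """One reward-tuning update: draw samples from the MLE-pretrained model,
    score them with the task reward, and increase the log-probability of
    samples that beat the per-image mean reward (a simple baseline)."""
    samples, log_probs = model.sample(images, num_samples)            # assumed API: samples (S, B, ...), log_probs (S, B)
    rewards = torch.stack([task_reward(images, s) for s in samples])  # (S, B)
    advantage = rewards - rewards.mean(dim=0, keepdim=True)           # baseline for variance reduction
    loss = -(advantage.detach() * log_probs).mean()                   # REINFORCE estimator
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return rewards.mean().item()
```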
2302.08113 Report MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation Omer Bar-Tal, Lior Yariv, Yaron Lipman, Tali Dekel Recent advances in text-to-image generation with diffusion models present transformative capabilities in image quality. However, user controllability of the generated image, and fast adaptation to new tasks still remains an open challenge, currently mostly addressed by costly and long re-training and fine-tuning or ad-hoc adaptations to specific image generation tasks. In this work, we present MultiDiffusion, a unified framework that enables versatile and controllable image generation, using a pre-trained text-to-image diffusion model, without any further training or finetuning. At the center of our approach is a new generation process, based on an optimization task that binds together multiple diffusion generation processes with a shared set of parameters or constraints. We show that MultiDiffusion can be readily applied to generate high quality and diverse images that adhere to user-provided controls, such as desired aspect ratio (e.g., panorama), and spatial guiding signals, ranging from tight segmentation masks to bounding boxes. Project webpage: https://multidiffusion.github.io Introduces MultiDiffusion, a unified framework for versatile and controllable image generation using a pre-trained text-to-image diffusion model without further training. Addresses the challenge of user controllability in text-to-image generation, enabling flexible adaptation to new tasks without costly retraining. Defines a new generation process that optimizes a shared set of parameters or constraints across multiple reference diffusion generation processes applied to different image regions. Generates high-quality, seamless panoramic images from text prompts. Enables text-to-image generation with user-provided spatial guidance, from bounding boxes to tight masks. Outperforms baselines in panorama generation quality and region-based generation accuracy. Quality heavily reliant on the generative prior of the reference diffusion model. Further exploration of more general optimization problems and constraints within the MultiDiffusion framework. image generation, diffusion models, controllable generation, text-to-image synthesis, multidiffusion
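The per-step fusion can be sketched as follows, under the simplifying assumptions that all regions are axis-aligned windows with equal weights and that denoise_step is a single reverse-diffusion step of the pretrained reference model; averaging overlapping pixels is then the closed-form minimizer of the per-step least-squares objective:

```python
import torch

def fused_denoising_step(latent, t, windows, denoise_step):
    """Denoise each window with the reference model, write results back into
    the full canvas, and resolve overlaps by per-pixel averaging."""
    acc = torch.zeros_like(latent)
    count = torch.zeros_like(latent)
    for (y0, y1, x0, x1) in windows:
        region = latent[..., y0:y1, x0:x1]
        acc[..., y0:y1, x0:x1] += denoise_step(region, t)  # assumed single-step denoiser
        count[..., y0:y1, x0:x1] += 1.0
    return acc / count.clamp(min=1.0)
```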
2302.08106 Report Towards Efficient Visual Adaption via Structural Re-parameterization Gen Luo, Minglang Huang, Yiyi Zhou, Xiaoshuai Sun, Guannan Jiang, Zhiyu Wang, Rongrong Ji Parameter-efficient transfer learning (PETL) is an emerging research area aimed at inexpensively adapting large-scale pre-trained models to downstream tasks. Recent advances have achieved great success in saving storage costs for various pre-trained models by updating a small number of parameters instead of full tuning. However, we notice that most existing PETL methods still incur non-negligible latency during inference. In this paper, we propose a parameter-efficient and computationally friendly adapter for giant vision models, called RepAdapter. Specifically, we first prove that common adaptation modules can also be seamlessly integrated into most giant vision models via our structural re-parameterization, thereby achieving zero-cost during inference. We then investigate the sparse design and effective placement of adapter structure, helping our RepAdapter obtain other advantages in terms of parameter efficiency and performance. To validate RepAdapter, we conduct extensive experiments on 27 benchmark datasets of three vision tasks, i.e., image and video classification and semantic segmentation. Experimental results show the superior performance and efficiency of RepAdapter over the state-of-the-art PETL methods. For instance, RepAdapter outperforms full tuning by +7.2% on average and saves up to 25% training time, 20% GPU memory, and 94.6% storage cost of ViT-B/16 on VTAB-1k. The generalization ability of RepAdapter is also well validated by a range of vision models. Our source code is released at https://github.com/luogen1996/RepAdapter. This paper proposes RepAdapter, a novel parameter-efficient transfer learning (PETL) method for adapting giant vision models to downstream tasks, which achieves zero inference cost via structural re-parameterization. Most existing PETL methods, while reducing storage costs, still lead to significant inference latency. This paper addresses the need for a PETL method that is both parameter-efficient and computationally friendly during inference. RepAdapter sequentially inserts lightweight, linear adapter networks into pre-trained models. After training, these adapters are re-parameterized into the nearby projection weights, enabling zero-cost inference. The paper also investigates a sparse adapter structure and effective placement strategies to further enhance parameter efficiency and performance. RepAdapter consistently outperforms state-of-the-art PETL methods on 27 benchmark datasets, including image and video classification, and semantic segmentation. It demonstrates superior efficiency, reducing training time and GPU memory consumption compared to full fine-tuning. The method exhibits strong generalization ability across various vision models like ConvNeXt, ViT, Swin-Transformer, and CLIP. The paper acknowledges that the exploration of sparse structures is limited to group-wise transformations. Future work could investigate applying RepAdapter to more complex vision tasks and exploring automated adapter placement strategies. Future work could also explore the theoretical aspects of why pre-inserting the adapter leads to better performance. parameter-efficient transfer learning, visual adapters, structural re-parameterization, vision transformer, inference efficiency
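The zero inference cost comes from the fact that a purely linear adapter placed before a projection can be folded into that projection after training; the toy check below illustrates the algebra only (shapes and the plain dense adapter are illustrative simplifications, not the paper's grouped/sparse variant):

```python
import torch

def merge_adapter(W, b, A, a_bias):
    """Given y = W @ (x + A @ x + a_bias) + b, return (W_merged, b_merged)
    such that y = W_merged @ x + b_merged, so the adapter is free at inference."""
    I = torch.eye(W.shape[1], dtype=W.dtype)
    return W @ (I + A), b + W @ a_bias

# Sanity check that the merged projection reproduces the adapted computation.
W, b = torch.randn(8, 16), torch.randn(8)
A, a_bias = 0.01 * torch.randn(16, 16), torch.zeros(16)
x = torch.randn(16)
W_m, b_m = merge_adapter(W, b, A, a_bias)
assert torch.allclose(W @ (x + A @ x + a_bias) + b, W_m @ x + b_m, atol=1e-5)
```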
2302.08063 Report MINOTAUR: Multi-task Video Grounding From Multimodal Queries Raghav Goyal, Effrosyni Mavroudi, Xitong Yang, Sainbayar Sukhbaatar, Leonid Sigal, Matt Feiszli, Lorenzo Torresani, Du Tran Video understanding tasks take many forms, from action detection to visual query localization and spatio-temporal grounding of sentences. These tasks differ in the type of inputs (only video, or video-query pair where query is an image region or sentence) and outputs (temporal segments or spatio-temporal tubes). However, at their core they require the same fundamental understanding of the video, i.e., the actors and objects in it, their actions and interactions. So far these tasks have been tackled in isolation with individual, highly specialized architectures, which do not exploit the interplay between tasks. In contrast, in this paper, we present a single, unified model for tackling query-based video understanding in long-form videos. In particular, our model can address all three tasks of the Ego4D Episodic Memory benchmark which entail queries of three different forms: given an egocentric video and a visual, textual or activity query, the goal is to determine when and where the answer can be seen within the video. Our model design is inspired by recent query-based approaches to spatio-temporal grounding, and contains modality-specific query encoders and task-specific sliding window inference that allow multi-task training with diverse input modalities and different structured outputs. We exhaustively analyze relationships among the tasks and illustrate that cross-task learning leads to improved performance on each individual task, as well as the ability to generalize to unseen tasks, such as zero-shot spatial localization of language queries. MINOTAUR, a unified Transformer-based model for grounding multimodal queries (visual, textual, activity) in long-form egocentric videos. Existing video understanding models often tackle tasks in isolation. This work proposes a unified approach to leverage the interplay between tasks and improve performance/generalization. The model encodes video and query using task-specific modules, fuses them with a Transformer, and decodes spatio-temporal responses using sliding window inference and a foreground frame prediction module. Multi-task learning surpasses single-task models on 9 out of 12 metrics across 3 episodic memory tasks. The model demonstrates zero-shot spatio-temporal grounding of language queries, not explicitly trained for. Ablation studies confirm the effectiveness of each component, including modality-specific encoders and multi-scale inference. The model's performance could benefit from larger-scale pre-training on extensive video-text datasets. Exploring alternative multi-task learning strategies might further enhance performance and generalization capabilities. video grounding, multimodal learning, egocentric vision, transformer, zero-shot learning
2302.07979 Report PRedItOR: Text Guided Image Editing with Diffusion Prior Hareesh Ravi, Sachin Kelkar, Midhun Harikumar, Ajinkya Kale Diffusion models have shown remarkable capabilities in generating high quality and creative images conditioned on text. An interesting application of such models is structure preserving text guided image editing. Existing approaches rely on text conditioned diffusion models such as Stable Diffusion or Imagen and require compute intensive optimization of text embeddings or fine-tuning the model weights for text guided image editing. We explore text guided image editing with a Hybrid Diffusion Model (HDM) architecture similar to DALLE-2. Our architecture consists of a diffusion prior model that generates CLIP image embedding conditioned on a text prompt and a custom Latent Diffusion Model trained to generate images conditioned on CLIP image embedding. We discover that the diffusion prior model can be used to perform text guided conceptual edits on the CLIP image embedding space without any finetuning or optimization. We combine this with structure preserving edits on the image decoder using existing approaches such as reverse DDIM to perform text guided image editing. Our approach, PRedItOR does not require additional inputs, fine-tuning, optimization or objectives and shows on par or better results than baselines qualitatively and quantitatively. We provide further analysis and understanding of the diffusion prior model and believe this opens up new possibilities in diffusion models research. PRedItOR: a novel method for text-guided image editing using a pre-trained Hybrid Diffusion Model (HDM) similar to DALLE-2, leveraging the Diffusion Prior for conceptual edits in CLIP image embedding space. Existing text-guided image editing techniques based on diffusion models often require base prompts, optimization of embeddings, or fine-tuning, which PRedItOR overcomes by using a pre-trained HDM and a novel two-step editing approach. PRedItOR uses the Diffusion Prior to perform a "conceptual edit" by manipulating the base image's CLIP embedding based on the edit text. This is followed by a "structural edit" using reverse DDIM on the HDM's decoder, conditioned on the edited embedding. Conceptual editing with the Diffusion Prior effectively captures the edit text's context while preserving information from the base image. PRedItOR achieves comparable or better qualitative results than existing baselines without requiring base prompts, optimization, or fine-tuning. Quantitative analysis shows that PRedItOR can achieve a balance between relevance to the edit text and fidelity to the base image's structure. The HDM used is trained on a smaller dataset compared to models used in some baselines, limiting the scope of comparable edits. PRedItOR relies on reverse DDIM, which, similar to SDEdit, can struggle with color-changing edits, leading to a trade-off between color accuracy and structure preservation. text-guided image editing, diffusion models, diffusion prior, clip embedding, hybrid diffusion model
2302.07864 Report Denoising Diffusion Probabilistic Models for Robust Image Super-Resolution in the Wild Hshmat Sahak, Daniel Watson, Chitwan Saharia, David Fleet Diffusion models have shown promising results on single-image super-resolution and other image-to-image translation tasks. Despite this success, they have not outperformed state-of-the-art GAN models on the more challenging blind super-resolution task, where the input images are out of distribution, with unknown degradations. This paper introduces SR3+, a diffusion-based model for blind super-resolution, establishing a new state-of-the-art. To this end, we advocate self-supervised training with a combination of composite, parameterized degradations, and noise-conditioning augmentation during training and testing. With these innovations, a large-scale convolutional architecture, and large-scale datasets, SR3+ greatly outperforms SR3. It outperforms Real-ESRGAN when trained on the same data, with a DRealSR FID score of 36.82 vs. 37.22, which further improves to FID of 32.37 with larger models, and further still with larger training sets. This paper introduces SR3+, a diffusion-based model for blind super-resolution that achieves state-of-the-art results by using self-supervised training with a combination of composite, parameterized degradations and noise-conditioning augmentation. Blind super-resolution, where input images have unknown degradations, is a challenging task where previous diffusion models fell short of state-of-the-art GAN models. SR3+ leverages a convolutional UNet architecture trained with self-supervision. The training process involves: 1) Applying a sequence of parameterized degradations to high-resolution images to mimic real-world degradations, 2) Noise conditioning augmentation during training and testing to improve robustness and generalization. SR3+ outperforms SR3 and Real-ESRGAN on FID-10K when trained on the same data. Noise conditioning augmentation at test time provides a trade-off between input alignment and realistic detail hallucination. Increasing model capacity and training set size leads to significant improvements in SR3+ performance. Potential failure modes, like gibberish text generation, may require more training steps or architectural improvements. Exploration of larger models and improved architectures is left for future work. super-resolution, diffusion models, blind image super-resolution, noise conditioning augmentation, self-supervised learning
2302.07848 Report One-Shot Face Video Re-enactment using Hybrid Latent Spaces of StyleGAN2 Trevine Oorloff, Yaser Yacoob While recent research has progressively overcome the low-resolution constraint of one-shot face video re-enactment with the help of StyleGAN's high-fidelity portrait generation, these approaches rely on at least one of the following: explicit 2D/3D priors, optical flow based warping as motion descriptors, off-the-shelf encoders, etc., which constrain their performance (e.g., inconsistent predictions, inability to capture fine facial details and accessories, poor generalization, artifacts). We propose an end-to-end framework for simultaneously supporting face attribute edits, facial motions and deformations, and facial identity control for video generation. It employs a hybrid latent-space that encodes a given frame into a pair of latents: Identity latent, $\mathcal{W}_{ID}$, and Facial deformation latent, $\mathcal{S}_F$, that respectively reside in the $W+$ and $SS$ spaces of StyleGAN2. Thereby, incorporating the impressive editability-distortion trade-off of $W+$ and the high disentanglement properties of $SS$. These hybrid latents employ the StyleGAN2 generator to achieve high-fidelity face video re-enactment at $1024^2$. Furthermore, the model supports the generation of realistic re-enactment videos with other latent-based semantic edits (e.g., beard, age, make-up, etc.). Qualitative and quantitative analyses performed against state-of-the-art methods demonstrate the superiority of the proposed approach. This paper presents a novel end-to-end framework for one-shot face video re-enactment at 1024x1024 resolution using a hybrid latent space approach with StyleGAN2. Existing methods for face video re-enactment either suffer from low resolution, rely on explicit 2D/3D priors that limit generalizability, or struggle to capture fine facial details. This work leverages the implicit priors and disentanglement properties of StyleGAN2's latent spaces to address these limitations. The framework employs an encoder-decoder architecture. The encoder maps an input frame to two latents: an Identity latent in StyleGAN2's W+ space and a Facial Deformation latent in the first 10 layers of the StyleSpace (SS). The decoder utilizes the pre-trained StyleGAN2 generator to synthesize re-enacted frames by combining these latents. A novel "Cyclic Manifold Adjustment" technique is introduced to improve identity reconstruction for out-of-domain subjects. The proposed method achieves state-of-the-art quantitative and qualitative results for both same-identity and cross-identity re-enactment at 1024x1024 resolution. The hybrid latent space approach, combining W+ and SS, is shown to be superior to using W+ alone, highlighting the importance of disentanglement for encoding facial deformations. The framework demonstrates robustness to variations in head pose and expression in source frames. The model inherits limitations from StyleGAN2, such as texture sticking and challenges in handling occlusions and backgrounds. The lack of high-resolution datasets for re-enactment is acknowledged. face video re-enactment, stylegan2, hybrid latent space, one-shot learning, cyclic manifold adjustment
2302.07685 Report Video Probabilistic Diffusion Models in Projected Latent Space Sihyun Yu, Kihyuk Sohn, Subin Kim, Jinwoo Shin Despite the remarkable progress in deep generative models, synthesizing high-resolution and temporally coherent videos still remains a challenge due to their high-dimensionality and complex temporal dynamics along with large spatial variations. Recent works on diffusion models have shown their potential to solve this challenge, yet they suffer from severe computation- and memory-inefficiency that limit the scalability. To handle this issue, we propose a novel generative model for videos, coined projected latent video diffusion models (PVDM), a probabilistic diffusion model which learns a video distribution in a low-dimensional latent space and thus can be efficiently trained with high-resolution videos under limited resources. Specifically, PVDM is composed of two components: (a) an autoencoder that projects a given video as 2D-shaped latent vectors that factorize the complex cubic structure of video pixels and (b) a diffusion model architecture specialized for our new factorized latent space and the training/sampling procedure to synthesize videos of arbitrary length with a single model. Experiments on popular video generation datasets demonstrate the superiority of PVDM compared with previous video synthesis methods; e.g., PVDM obtains the FVD score of 639.7 on the UCF-101 long video (128 frames) generation benchmark, which improves 1773.4 of the prior state-of-the-art. This paper proposes PVDM, a novel latent diffusion model for video generation that operates in a low-dimensional latent space, enabling efficient training with high-resolution videos. Synthesizing high-resolution and temporally coherent videos is challenging due to high dimensionality and complex temporal dynamics. Existing diffusion models, while promising, are computationally and memory intensive, limiting scalability. PVDM addresses these limitations. PVDM employs a two-stage framework: 1) An autoencoder projects videos into three 2D image-like latent vectors, factorizing the complex cubic structure. 2) A diffusion model tailored for this latent space synthesizes videos of arbitrary length using a joint training strategy for unconditional and frame-conditional generation. PVDM achieves state-of-the-art results on UCF-101 and SkyTimelapse datasets for video generation, outperforming baselines in both quantitative metrics (FVD and IS) and qualitative assessments. The proposed method demonstrates significant computational and memory efficiency compared to pixel-based video diffusion models, enabling training and generation of high-resolution videos with limited resources. PVDM excels in long video generation, effectively maintaining temporal coherency across extended timesteps even on challenging datasets like UCF-101. There is still room for improvement in bridging the gap between real and generated videos. Exploring better latent structures or designing more specialized diffusion model architectures for triplane latents could be beneficial. video generation, diffusion models, latent space, autoencoder, deep learning
2302.07577 Report Efficient Teacher: Semi-Supervised Object Detection for YOLOv5 Bowen Xu, Mingtao Chen, Wenlong Guan, Lulu Hu Semi-Supervised Object Detection (SSOD) has been successful in improving the performance of both R-CNN series and anchor-free detectors. However, one-stage anchor-based detectors lack the structure to generate high-quality or flexible pseudo labels, leading to serious inconsistency problems in SSOD. In this paper, we propose the Efficient Teacher framework for scalable and effective one-stage anchor-based SSOD training, consisting of Dense Detector, Pseudo Label Assigner, and Epoch Adaptor. Dense Detector is a baseline model that extends RetinaNet with dense sampling techniques inspired by YOLOv5. The Efficient Teacher framework introduces a novel pseudo label assignment mechanism, named Pseudo Label Assigner, which makes more refined use of pseudo labels from Dense Detector. Epoch Adaptor is a method that enables a stable and efficient end-to-end semi-supervised training schedule for Dense Detector. The Pseudo Label Assigner prevents the occurrence of bias caused by a large number of low-quality pseudo labels that may interfere with the Dense Detector during the student-teacher mutual learning mechanism, and the Epoch Adaptor utilizes domain and distribution adaptation to allow Dense Detector to learn globally distributed consistent features, making the training independent of the proportion of labeled data. Our experiments show that the Efficient Teacher framework achieves state-of-the-art results on VOC, COCO-standard, and COCO-additional using fewer FLOPs than previous methods. To the best of our knowledge, this is the first attempt to apply Semi-Supervised Object Detection to YOLOv5. Code is available: https://github.com/AlibabaResearch/efficientteacher This paper proposes Efficient Teacher, a novel framework for scalable and effective semi-supervised object detection (SSOD) training for one-stage anchor-based detectors. One-stage anchor-based detectors often struggle with SSOD due to limitations in generating high-quality pseudo labels and the inconsistency of these labels during training. This paper aims to address these challenges and improve the performance of SSOD in this detector category. The Efficient Teacher framework consists of three main components: Dense Detector (a RetinaNet-based detector enhanced with dense sampling techniques), Pseudo Label Assigner (PLA, for refined pseudo label assignment), and Epoch Adaptor (EA, for efficient and stable training). PLA categorizes pseudo labels into reliable and uncertain ones and utilizes soft loss for uncertain labels, while EA optimizes training by employing domain and distribution adaptation. Efficient Teacher achieves state-of-the-art results on VOC, COCO-standard, and COCO-additional datasets with fewer FLOPs compared to previous SSOD methods. Pseudo Label Assigner significantly improves performance by mitigating the negative impact of uncertain pseudo labels. Epoch Adaptor enables faster and more stable training through domain and distribution adaptation. The current implementation primarily focuses on object detection tasks; further research is needed to explore its applicability in instance segmentation tasks. The computational cost of online Mosaic data augmentation during distribution adaptation could be further reduced. semi-supervised object detection, pseudo label assignment, one-stage detectors, anchor-based detectors, domain adaptation
2302.07483 Report EdgeYOLO: An Edge-Real-Time Object Detector Shihan Liu, Junlin Zha, Jian Sun, Zhuo Li, Gang Wang This paper proposes an efficient, low-complexity and anchor-free object detector based on the state-of-the-art YOLO framework, which can be implemented in real time on edge computing platforms. We develop an enhanced data augmentation method to effectively suppress overfitting during training, and design a hybrid random loss function to improve the detection accuracy of small objects. Inspired by FCOS, a lighter and more efficient decoupled head is proposed, and its inference speed can be improved with little loss of precision. Our baseline model can reach the accuracy of 50.6% AP50:95 and 69.8% AP50 in MS COCO2017 dataset, 26.4% AP50:95 and 44.8% AP50 in VisDrone2019-DET dataset, and it meets real-time requirements (FPS>=30) on edge-computing device Nvidia Jetson AGX Xavier. We also designed lighter models with less parameters for edge computing devices with lower computing power, which also show better performances. Our source code, hyper-parameters and model weights are all available at https://github.com/LSH9832/edgeyolo. This paper proposes EdgeYOLO, an efficient and anchor-free object detector based on the YOLO framework, designed for real-time performance on edge computing platforms. Many state-of-the-art object detectors, while accurate, struggle to achieve real-time performance on edge devices due to their complexity. This work aims to bridge this gap by creating a model that balances high accuracy with real-time inference speed on resource-constrained hardware. The paper introduces several key innovations: 1) An enhanced data augmentation method combining Mosaic and Mixup to improve data richness and reduce overfitting. 2) A lightweight decoupled head design with reduced channels and layers, further optimized for inference speed using re-parameterization techniques. 3) A staged loss function utilizing Hybrid-Random Loss and cIOU loss to improve detection accuracy, particularly for small objects. EdgeYOLO achieves 50.6% AP on MS COCO2017 and 26.4% AP on VisDrone2019-DET, surpassing several state-of-the-art models in accuracy while maintaining real-time performance (FPS ≥ 30) on a Nvidia Jetson AGX Xavier. The lightweight decoupled head design provides a significant precision improvement without sacrificing inference speed compared to coupled or traditional decoupled heads. The staged loss function with Hybrid-Random Loss and cIOU loss demonstrably boosts the detection performance, particularly for small objects. The paper acknowledges that while the use of segmentation labels during training can improve accuracy, it is not strictly necessary and has a minor impact on the final result. Future work will focus on further enhancing the detection accuracy for small objects and exploring additional optimizations for edge devices. object detection, anchor-free, real-time, edge computing, yolo
2302.07319 Report Frustratingly Simple but Effective Zero-shot Detection and Segmentation: Analysis and a Strong Baseline Siddhesh Khandelwal, Anirudth Nambirajan, Behjat Siddiquie, Jayan Eledath, Leonid Sigal Methods for object detection and segmentation often require abundant instance-level annotations for training, which are time-consuming and expensive to collect. To address this, the task of zero-shot object detection (or segmentation) aims at learning effective methods for identifying and localizing object instances for the categories that have no supervision available. Constructing architectures for these tasks requires choosing from a myriad of design options, ranging from the form of the class encoding used to transfer information from seen to unseen categories, to the nature of the function being optimized for learning. In this work, we extensively study these design choices, and carefully construct a simple yet extremely effective zero-shot recognition method. Through extensive experiments on the MSCOCO dataset on object detection and segmentation, we highlight that our proposed method outperforms existing, considerably more complex, architectures. Our findings and method, which we propose as a competitive future baseline, point towards the need to revisit some of the recent design trends in zero-shot detection / segmentation. This paper proposes a simple yet effective method for zero-shot object detection and segmentation, achieved by carefully ablating and selecting optimal design choices for each model component. Current methods for object detection and segmentation require extensive instance-level annotations, which are costly and time-consuming. Zero-shot learning addresses this by transferring knowledge from seen categories to unseen categories without requiring annotations for the latter. The method uses a two-step training process: 1) Training a Faster R-CNN (detection) or Mask R-CNN (segmentation) on seen categories. 2) Fine-tuning a projection layer to map image features to a semantic embedding space using normalized category-name embeddings (GloVe, ConceptNet). Outperforms existing zero-shot detection methods by a significant margin on MSCOCO benchmark. Shows superior performance in zero-shot instance segmentation tasks compared to baselines. Demonstrates the importance of choosing appropriate semantic embeddings for optimal zero-shot learning performance. Limited exploration of more advanced semantic embedding techniques beyond GloVe and ConceptNet. Future work could explore the impact of different architectures and training paradigms on the proposed approach. zero-shot learning, object detection, instance segmentation, semantic embeddings, transfer learning
2302.07121 Report Universal Guidance for Diffusion Models Arpit Bansal, Hong-Min Chu, Avi Schwarzschild, Soumyadip Sengupta, Micah Goldblum, Jonas Geiping, Tom Goldstein Typical diffusion models are trained to accept a particular form of conditioning, most commonly text, and cannot be conditioned on other modalities without retraining. In this work, we propose a universal guidance algorithm that enables diffusion models to be controlled by arbitrary guidance modalities without the need to retrain any use-specific components. We show that our algorithm successfully generates quality images with guidance functions including segmentation, face recognition, object detection, and classifier signals. Code is available at https://github.com/arpitbansal297/Universal-Guided-Diffusion. This paper proposes a universal guidance algorithm that allows diffusion models to be controlled by arbitrary guidance modalities (e.g., segmentation, face recognition, object detection) without retraining. Existing diffusion models are typically limited to a single conditioning modality and require retraining for new modalities, which is computationally expensive. The algorithm leverages pre-trained guidance models on denoised images during the sampling process, closing the domain gap between noisy latent states and clean images. It incorporates forward guidance based on predicted clean images and backward guidance to optimize the image towards the prompt. The algorithm successfully generates high-quality images guided by various modalities, including CLIP text embeddings, segmentation maps, face recognition embeddings, and object detection outputs. It is effective with both unconditional diffusion models (ImageNet) and text-conditional models (Stable Diffusion). The method can effectively combine multiple guidance functions simultaneously, as demonstrated with segmentation-guided inpainting. The generation process using universal guidance is slower than standard conditional generation due to multiple denoising iterations and backward guidance optimization. Optimal hyperparameters for sampling need to be determined individually for each guidance network. diffusion models, guided image generation, universal guidance, multimodal conditioning, image synthesis
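A hedged sketch of the forward-guidance step described above (the full algorithm also adds backward guidance and per-step self-recurrence; guidance_net, loss_fn, and the exact scaling here are stand-ins, not the released code):

```python
import torch

def forward_guided_eps(x_t, t, alpha_bar_t, eps_model, guidance_net, loss_fn, target, scale):
    """Predict the clean image from the noisy latent, score it with an
    off-the-shelf guidance network, and nudge the noise prediction along the
    gradient of that score."""
    x_t = x_t.detach().requires_grad_(True)
    eps = eps_model(x_t, t)
    x0_hat = (x_t - (1 - alpha_bar_t).sqrt() * eps) / alpha_bar_t.sqrt()  # DDPM clean-image estimate
    loss = loss_fn(guidance_net(x0_hat), target)   # e.g. segmentation or face-ID loss
    grad = torch.autograd.grad(loss, x_t)[0]
    return eps + scale * (1 - alpha_bar_t).sqrt() * grad
```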
2302.06908 Report DiffFaceSketch: High-Fidelity Face Image Synthesis with Sketch-Guided Latent Diffusion Model Yichen Peng, Chunqi Zhao, Haoran Xie, Tsukasa Fukusato, Kazunori Miyata Synthesizing face images from monochrome sketches is one of the most fundamental tasks in the field of image-to-image translation. However, it is still challenging to (1) make models learn the high-dimensional face features such as geometry and color, and (2) take into account the characteristics of input sketches. Existing methods often use sketches as indirect inputs (or as auxiliary inputs) to guide the models, resulting in the loss of sketch features or the alteration of geometry information. In this paper, we introduce a Sketch-Guided Latent Diffusion Model (SGLDM), an LDM-based network architecture trained on the paired sketch-face dataset. We apply a Multi-Auto-Encoder (AE) to encode the different input sketches from different regions of a face from pixel space to a feature map in latent space, which enables us to reduce the dimension of the sketch input while preserving the geometry-related information of local face details. We build a sketch-face paired dataset based on the existing method that extracts the edge map from an image. We then introduce a Stochastic Region Abstraction (SRA), an approach to augment our dataset to improve the robustness of SGLDM to handle sketch input with arbitrary abstraction. The evaluation study shows that SGLDM can synthesize high-quality face images with different expressions, facial accessories, and hairstyles from various sketches with different abstraction levels. This paper introduces DiffFaceSketch (SGLDM), a Latent Diffusion Model (LDM) for synthesizing high-fidelity face images from sketches, enhancing control and detail preservation over existing sketch-to-image methods. Synthesizing face images from sketches is crucial for applications like character design but challenging due to the sparse nature of sketch data and the need for detailed geometry and color mapping. SGLDM uses a two-stage training process: 1) a Multi-Auto-Encoder (AE) encodes sketches into feature maps preserving local details, and 2) an LDM learns to generate faces conditioned on these encoded sketches. They also introduce Stochastic Region Abstraction (SRA) for data augmentation, improving robustness to different sketch abstraction levels. SGLDM generates more realistic faces with higher fidelity to input sketches compared to GAN-based methods like Pix2Pix and DeepFaceDrawing. Quantitative evaluation shows SGLDM achieves superior scores in FID and LPIPS metrics, indicating better image quality and consistency with real faces. A user study confirms that SGLDM-synthesized images are preferred for both visual quality and input consistency. The synthesis can be overly sensitive to sketch quality, leading to artifacts with poor sketches. The method, while using LDM for efficiency, is still computationally intensive compared to GAN-based approaches, especially during training and sampling. image synthesis, sketch-to-image translation, latent diffusion model, face generation, data augmentation
2302.06833 Report VQ3D: Learning a 3D-Aware Generative Model on ImageNet Kyle Sargent, Jing Yu Koh, Han Zhang, Huiwen Chang, Charles Herrmann, Pratul Srinivasan, Jiajun Wu, Deqing Sun Recent work has shown the possibility of training generative models of 3D content from 2D image collections on small datasets corresponding to a single object class, such as human faces, animal faces, or cars. However, these models struggle on larger, more complex datasets. To model diverse and unconstrained image collections such as ImageNet, we present VQ3D, which introduces a NeRF-based decoder into a two-stage vector-quantized autoencoder. Our Stage 1 allows for the reconstruction of an input image and the ability to change the camera position around the image, and our Stage 2 allows for the generation of new 3D scenes. VQ3D is capable of generating and reconstructing 3D-aware images from the 1000-class ImageNet dataset of 1.2 million training images. We achieve an ImageNet generation FID score of 16.8, compared to 69.8 for the next best baseline method. Presents VQ3D, a novel 3D-aware generative model trained on large and diverse 2D image collections (e.g., ImageNet) using a two-stage vector-quantized autoencoder with a NeRF-based decoder. Existing 3D generative models struggle with large, diverse datasets like ImageNet, limiting their ability to generate diverse and unconstrained 3D content. Combines a ViT-based encoder and a conditional NeRF decoder in a two-stage VQ-autoencoder framework. Employs a novel depth loss during training to supervise geometry learning using pseudo-GT depth and renders novel views for improved 3D consistency. Achieves state-of-the-art generation results on ImageNet with an FID score of 16.8, a significant improvement over the next best baseline (StyleNeRF at 69.8). Demonstrates competitive performance on the CompCars dataset, highlighting its ability to generalize to different datasets with a simple pose sampling scheme. Enables single-view 3D reconstruction and manipulation, allowing for novel view synthesis and image editing directly from a single RGB input. Large viewpoint manipulation is limited due to the autoencoder-based formulation. Reliance on a pre-trained depth network for geometry supervision may limit applicability to datasets or domains where accurate depth estimation is challenging. 3d generative models, nerf, vector quantization, imagenet, novel view synthesis
2302.06793 Report HR-NeuS: Recovering High-Frequency Surface Geometry via Neural Implicit Surfaces Erich Liang, Kenan Deng, Xi Zhang, Chun-Kai Wang Recent advances in neural implicit surfaces for multi-view 3D reconstruction primarily focus on improving large-scale surface reconstruction accuracy, but often produce over-smoothed geometries that lack fine surface details. To address this, we present High-Resolution NeuS (HR-NeuS), a novel neural implicit surface reconstruction method that recovers high-frequency surface geometry while maintaining large-scale reconstruction accuracy. We achieve this by utilizing (i) multi-resolution hash grid encoding rather than positional encoding at high frequencies, which boosts our model's expressiveness of local geometry details; (ii) a coarse-to-fine algorithmic framework that selectively applies surface regularization to coarse geometry without smoothing away fine details; (iii) a coarse-to-fine grid annealing strategy to train the network. We demonstrate through experiments on DTU and BlendedMVS datasets that our approach produces 3D geometries that are qualitatively more detailed and quantitatively of similar accuracy compared to previous approaches. This paper proposes HR-NeuS, a novel neural implicit surface reconstruction method that recovers high-frequency surface details while maintaining large-scale accuracy. Previous methods often produce over-smoothed geometries lacking fine details. This work addresses this limitation to achieve higher fidelity 3D reconstructions. The method leverages: (i) Multi-resolution hash grid encoding for enhanced local geometry detail. (ii) A coarse-to-fine framework applying surface regularization selectively to avoid over-smoothing. (iii) A coarse-to-fine grid annealing strategy for network training. Recovers finer surface details and textures compared to NeuS. Achieves similar or better reconstruction accuracy compared to NeuS and NeuralWarp on the DTU dataset. Ablation study demonstrates the individual contributions of each proposed component. Does not incorporate multi-view constraints used by some other methods. Does not explicitly address ambiguity between shading and surface normals. 3d reconstruction, neural implicit surfaces, multi-resolution hash encoding, surface regularization, coarse-to-fine training
2302.06733 Report Robust Unsupervised StyleGAN Image Restoration Yohan Poirier-Ginter, Jean-François Lalonde GAN-based image restoration inverts the generative process to repair images corrupted by known degradations. Existing unsupervised methods must be carefully tuned for each task and degradation level. In this work, we make StyleGAN image restoration robust: a single set of hyperparameters works across a wide range of degradation levels. This makes it possible to handle combinations of several degradations, without the need to retune. Our proposed approach relies on a 3-phase progressive latent space extension and a conservative optimizer, which avoids the need for any additional regularization terms. Extensive experiments demonstrate robustness on inpainting, upsampling, denoising, and deartifacting at varying degradations levels, outperforming other StyleGAN-based inversion techniques. Our approach also favorably compares to diffusion-based restoration by yielding much more realistic inversion results. Code is available at https://lvsn.github.io/RobustUnsupervised/. This paper proposes a robust unsupervised StyleGAN image restoration method that uses a single set of hyperparameters across a wide range of degradation levels and types. Existing unsupervised StyleGAN image restoration methods require careful hyperparameter tuning for each task and degradation level, making them impractical for handling combinations of degradations. This paper addresses this limitation by introducing a robust approach. The proposed method employs a 3-phase progressive latent space extension, starting with global optimization, then expanding to layer-wise, and finally filter-wise. It leverages a conservative normalized gradient descent (NGD) optimizer and a multi-resolution loss function. The method achieves state-of-the-art results on most scenarios, outperforming other StyleGAN-based inversion techniques even when they are optimized for each task/level individually. It demonstrates robustness to varying degradation levels across inpainting, upsampling, denoising, and deartifacting. The method effectively handles compositions of these tasks without requiring hyperparameter retuning. The method is limited to the domain learned by the GAN. It requires knowledge of an approximate degradation function. image restoration, stylegan, unsupervised learning, generative adversarial networks, robustness
2302.06608 Report 3D-aware Blending with Generative NeRFs Hyunsu Kim, Gayoung Lee, Yunjey Choi, Jin-Hwa Kim, Jun-Yan Zhu Image blending aims to combine multiple images seamlessly. It remains challenging for existing 2D-based methods, especially when input images are misaligned due to differences in 3D camera poses and object shapes. To tackle these issues, we propose a 3D-aware blending method using generative Neural Radiance Fields (NeRF), including two key components: 3D-aware alignment and 3D-aware blending. For 3D-aware alignment, we first estimate the camera pose of the reference image with respect to generative NeRFs and then perform 3D local alignment for each part. To further leverage 3D information of the generative NeRF, we propose 3D-aware blending that directly blends images on the NeRF's latent representation space, rather than raw pixel space. Collectively, our method outperforms existing 2D baselines, as validated by extensive quantitative and qualitative evaluations with FFHQ and AFHQ-Cat. The paper proposes a novel 3D-aware image blending method using generative Neural Radiance Fields (NeRFs), enabling seamless blending of unaligned images while preserving 3D consistency. Existing 2D image blending methods struggle to handle misaligned images with differences in camera poses and object shapes. This work addresses these limitations by leveraging 3D information. The proposed method involves 1) 3D-aware alignment: estimating camera poses and aligning objects in 3D using NeRFs, and 2) 3D-aware blending: blending images in the NeRF's latent space using image-blending and density-blending losses. Outperforms state-of-the-art 2D image blending methods in terms of photorealism and faithfulness. Enables disentanglement of color and geometric changes during blending. Produces multi-view consistent results, showcasing the 3D awareness of the method. Performance relies on the quality of GAN inversion, which can be a bottleneck. Real-time editing is limited due to the optimization-based approach. Future work could explore encoder-based solutions. image blending, generative neural radiance fields, 3d-aware image editing, gan inversion, multi-view consistency
2302.06586 Report Stitchable Neural Networks Zizheng Pan, Jianfei Cai, Bohan Zhuang The public model zoo containing enormous powerful pretrained model families (e.g., ResNet/DeiT) has reached an unprecedented scope, which significantly contributes to the success of deep learning. As each model family consists of pretrained models with diverse scales (e.g., DeiT-Ti/S/B), a fundamental question naturally arises: how to efficiently assemble these readily available models in a family for dynamic accuracy-efficiency trade-offs at runtime. To this end, we present Stitchable Neural Networks (SN-Net), a novel scalable and efficient framework for model deployment. It cheaply produces numerous networks with different complexity and performance trade-offs given a family of pretrained neural networks, which we call anchors. Specifically, SN-Net splits the anchors across the blocks/layers and then stitches them together with simple stitching layers to map the activations from one anchor to another. With only a few epochs of training, SN-Net effectively interpolates between the performance of anchors with varying scales. At runtime, SN-Net can instantly adapt to dynamic resource constraints by switching the stitching positions. Extensive experiments on ImageNet classification demonstrate that SN-Net can obtain on-par or even better performance than many individually trained networks while supporting diverse deployment scenarios. For example, by stitching Swin Transformers, we challenge hundreds of models in Timm model zoo with a single network. We believe this new elastic model framework can serve as a strong baseline for further research in wider communities. The paper introduces Stitchable Neural Networks (SN-Net), a novel framework that constructs a single scalable network by stitching together pre-trained models of varying sizes from a model family using simple stitching layers, enabling efficient model deployment and dynamic adaptation to resource constraints. Existing scalable deep learning methods like model compression and NAS are limited to single model design spaces and struggle to leverage the knowledge from pretrained model families. SN-Net aims to overcome these limitations by efficiently combining pretrained models for better flexibility and accuracy in diverse deployment scenarios. SN-Net strategically stitches together pre-trained models (anchors) from the same family using simple 1x1 convolutional layers. It employs a "Fast-to-Slow" stitching direction, connecting a smaller model's early layers to a larger model's later layers. It also utilizes a "nearest stitching" strategy, stitching only models with similar complexities. Training involves randomly sampling and training individual stitches, leveraging knowledge distillation for performance improvement. SN-Net achieves flexible accuracy-efficiency trade-offs, effectively interpolating performance between stitched models. A single SN-Net, trained on ImageNet, achieves comparable or superior performance to individually trained models while significantly reducing training cost and storage space. SN-Net demonstrates generalizability across different architectures, successfully stitching plain ViTs, hierarchical ViTs, CNNs, and even combining CNNs with ViTs. The random stitch sampling strategy during training might be suboptimal for very large stitching spaces, potentially requiring more training epochs. Future work can explore extending SN-Net to other tasks like NLP, dense prediction, and transfer learning. model stitching, elastic deep learning, model deployment, pre-trained models, resource constraints
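A stitching layer is essentially a learned linear map between the activation spaces of two anchors; the sketch below is illustrative only (the class name, shapes, and the least-squares initialization are assumptions in the spirit of the approach, not the released code):

```python
import torch
import torch.nn as nn

class StitchingLayer(nn.Module):
    """Maps token activations of a smaller anchor into the channel width expected
    by a larger anchor (a 1x1 transform over tokens)."""
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.proj = nn.Linear(dim_in, dim_out)

    @torch.no_grad()
    def init_least_squares(self, feats_small, feats_large):
        """Fit proj so that proj(feats_small) ~= feats_large on a few matched batches."""
        X = feats_small.flatten(0, -2)         # (N*T, dim_in)
        Y = feats_large.flatten(0, -2)         # (N*T, dim_out)
        W = torch.linalg.lstsq(X, Y).solution  # (dim_in, dim_out)
        self.proj.weight.copy_(W.T)
        self.proj.bias.zero_()

    def forward(self, x):
        return self.proj(x)
```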
2302.06235 Report A Simple Zero-shot Prompt Weighting Technique to Improve Prompt Ensembling in Text-Image Models James Urquhart Allingham, Jie Ren, Michael W Dusenberry, Xiuye Gu, Yin Cui, Dustin Tran, Jeremiah Zhe Liu, Balaji Lakshminarayanan Contrastively trained text-image models have the remarkable ability to perform zero-shot classification, that is, classifying previously unseen images into categories that the model has never been explicitly trained to identify. However, these zero-shot classifiers need prompt engineering to achieve high accuracy. Prompt engineering typically requires hand-crafting a set of prompts for individual downstream tasks. In this work, we aim to automate this prompt engineering and improve zero-shot accuracy through prompt ensembling. In particular, we ask "Given a large pool of prompts, can we automatically score the prompts and ensemble those that are most suitable for a particular downstream dataset, without needing access to labeled validation data?". We demonstrate that this is possible. In doing so, we identify several pathologies in a naive prompt scoring method where the score can be easily overconfident due to biases in pre-training and test data, and we propose a novel prompt scoring method that corrects for the biases. Using our proposed scoring method to create a weighted average prompt ensemble, our method outperforms equal average ensemble, as well as hand-crafted prompts, on ImageNet, 4 of its variants, and 11 fine-grained classification benchmarks, all while being fully automatic, optimization-free, and not requiring access to labeled validation data. This paper proposes Zero-shot Prompt Ensembling (ZPE), an automatic and optimization-free method for selecting and weighting prompts for zero-shot classification with text-image models, eliminating the need for manual prompt engineering. Hand-crafting prompts for zero-shot classification in text-image models is labor-intensive and often requires labeled validation data, limiting their general applicability. Automating this process broadens the usability of these models. ZPE scores prompts based on normalized maximum logits over a set of test images, addressing biases from word frequency in pre-training data and spurious concept frequency in test data. It then uses these scores for weighted averaging or to select a subset of prompts. ZPE consistently outperforms a naive max-logit scoring baseline. ZPE achieves higher accuracy than hand-crafted prompts on ImageNet, its variants, and a majority of tested fine-grained datasets, despite being fully automatic. ZPE proves robust to variations in model architecture, the size of the pool set, and the number of random/test images used for score estimation. ZPE relies on a large, diverse pool of high-quality prompts, which is currently lacking. The method scores prompts independently, potentially missing benefits from prompt combinations. zero-shot classification, prompt engineering, text-image models, prompt ensembling, bias correction
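The scoring idea can be sketched as follows; note that the paper's bias correction subtracts score estimates computed on generic images, which is simplified here to a max-minus-mean logit margin, and the temperature is an illustrative knob rather than a prescribed value:

```python
import torch

def zpe_weights(image_feats, class_text_feats, temperature=0.01):
    """image_feats: (N, D) unlabeled test images; class_text_feats: (P, C, D)
    for P prompts and C classes. Scores each prompt by its average confidence
    margin on the test set, then softmaxes scores into ensemble weights."""
    logits = torch.einsum('nd,pcd->pnc', image_feats, class_text_feats)  # (P, N, C)
    margin = logits.max(dim=-1).values - logits.mean(dim=-1)             # (P, N)
    scores = margin.mean(dim=1)                                          # (P,)
    return torch.softmax(scores / temperature, dim=0)                    # weights over prompts

def weighted_ensemble_logits(image_feats, class_text_feats, weights):
    """Weighted-average prompt ensemble: combine per-prompt logits with the ZPE-style weights."""
    logits = torch.einsum('nd,pcd->pnc', image_feats, class_text_feats)
    return (weights[:, None, None] * logits).sum(dim=0)                  # (N, C)
```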
2302.06112 Report How to Use Dropout Correctly on Residual Networks with Batch Normalization Bum Jun Kim, Hyeyeon Choi, Hyeonah Jang, Donggeon Lee, Sang Woo Kim For the stable optimization of deep neural networks, regularization methods such as dropout and batch normalization have been used in various tasks. Nevertheless, the correct position to apply dropout has rarely been discussed, and different positions have been employed depending on the practitioners. In this study, we investigate the correct position to apply dropout. We demonstrate that for a residual network with batch normalization, applying dropout at certain positions increases the performance, whereas applying dropout at other positions decreases the performance. Based on theoretical analysis, we provide the following guideline for the correct position to apply dropout: apply one dropout after the last batch normalization but before the last weight layer in the residual branch. We provide detailed theoretical explanations to support this claim and demonstrate them through module tests. In addition, we investigate the correct position of dropout in the head that produces the final prediction. Although the current consensus is to apply dropout after global average pooling, we prove that applying dropout before global average pooling leads to a more stable output. The proposed guidelines are validated through experiments using different datasets and models. This paper investigates the optimal position to apply dropout for improved deep neural network performance, particularly within residual networks with batch normalization. While dropout is a widely used regularization technique, its ideal placement within network architectures, especially alongside batch normalization, remains unclear. This lack of understanding can lead to suboptimal performance. The authors theoretically analyze the variance inconsistency introduced by dropout and how the order of operations (dropout, ReLU, weight layers, batch normalization, skip connections) affects this inconsistency. They leverage this analysis to propose guidelines for dropout placement. Applying dropout after the last batch normalization but before the last weight layer in a residual branch improves performance. Using dropout before global average pooling in the network head leads to more stable outputs compared to the common practice of applying it afterward. The proposed guidelines are validated through experiments on various datasets (CIFAR-10, CIFAR-100, Caltech-101, Oxford-IIIT Pet, ImageNet) and models (PreResNet, ResNetV1, MobileNetV2, EfficientNet, DenseNet). The analysis focuses on PreResNet (ResNetV2) architecture, and while applicable to other variants, further investigation is needed for broader generalization. The study primarily focuses on variance inconsistency as the main challenge posed by dropout, potentially overlooking other factors that might influence performance. dropout, batch normalization, residual networks, regularization, deep learning
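The guideline translates into a residual branch like the following sketch (a pre-activation block; the channel sizes, drop probability, and use of Dropout2d are illustrative choices, not the paper's exact configuration):

```python
import torch.nn as nn

class PreActBlock(nn.Module):
    """Pre-activation residual block with one dropout placed after the last
    BatchNorm but before the last weight layer of the branch."""
    def __init__(self, channels, p=0.1):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.drop = nn.Dropout2d(p)  # after the last BN ...
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)  # ... before the last weight layer
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv1(self.relu(self.bn1(x)))
        out = self.conv2(self.drop(self.relu(self.bn2(out))))
        return x + out
```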
2302.05905 Report Single Motion Diffusion Sigal Raab, Inbal Leibovitch, Guy Tevet, Moab Arar, Amit H. Bermano, Daniel Cohen-Or Synthesizing realistic animations of humans, animals, and even imaginary creatures, has long been a goal for artists and computer graphics professionals. Compared to the imaging domain, which is rich with large available datasets, the number of data instances for the motion domain is limited, particularly for the animation of animals and exotic creatures (e.g., dragons), which have unique skeletons and motion patterns. In this work, we present a Single Motion Diffusion Model, dubbed SinMDM, a model designed to learn the internal motifs of a single motion sequence with arbitrary topology and synthesize motions of arbitrary length that are faithful to them. We harness the power of diffusion models and present a denoising network explicitly designed for the task of learning from a single input motion. SinMDM is designed to be a lightweight architecture, which avoids overfitting by using a shallow network with local attention layers that narrow the receptive field and encourage motion diversity. SinMDM can be applied in various contexts, including spatial and temporal in-betweening, motion expansion, style transfer, and crowd animation. Our results show that SinMDM outperforms existing methods both in quality and time-space efficiency. Moreover, while current approaches require additional training for different applications, our work facilitates these applications at inference time. Our code and trained models are available at https://sinmdm.github.io/SinMDM-page. This paper introduces SinMDM, a novel single motion diffusion model for synthesizing diverse and realistic motions from a single input sequence. Motion data, especially for non-humanoid characters, is often scarce, making traditional data-driven methods challenging. SinMDM tackles this by effectively learning motion motifs from a single sequence, enabling diverse animation generation for arbitrary skeletons. SinMDM leverages a shallow UNet architecture with local attention layers (QnA) to learn from a single motion sequence. This design choice, coupled with a narrow receptive field, encourages motion diversity and prevents overfitting. SinMDM outperforms prior art, including Ganimator, in quantitative metrics on both HumanML3D and Mixamo benchmarks. The model effectively synthesizes long, high-quality motion sequences and demonstrates various motion manipulation capabilities, including in-betweening, style transfer, and crowd animation. SinMDM showcases the potential of diffusion models for learning from limited data, challenging the notion that they require large datasets. Like all single-instance learning methods, SinMDM has limited ability to synthesize out-of-distribution motions. The iterative nature of diffusion models results in relatively long inference times. motion synthesis, diffusion models, single-instance learning, character animation, computer graphics
2302.05872 Report I$^2$SB: Image-to-Image Schrödinger Bridge Guan-Horng Liu, Arash Vahdat, De-An Huang, Evangelos A. Theodorou, Weili Nie, Anima Anandkumar We propose Image-to-Image Schrödinger Bridge (I$^2$SB), a new class of conditional diffusion models that directly learn the nonlinear diffusion processes between two given distributions. These diffusion bridges are particularly useful for image restoration, as the degraded images are structurally informative priors for reconstructing the clean images. I$^2$SB belongs to a tractable class of Schrödinger bridge, the nonlinear extension to score-based models, whose marginal distributions can be computed analytically given boundary pairs. This results in a simulation-free framework for nonlinear diffusions, where the I$^2$SB training becomes scalable by adopting practical techniques used in standard diffusion models. We validate I$^2$SB in solving various image restoration tasks, including inpainting, super-resolution, deblurring, and JPEG restoration on ImageNet 256x256 and show that I$^2$SB surpasses standard conditional diffusion models with more interpretable generative processes. Moreover, I$^2$SB matches the performance of inverse methods that additionally require the knowledge of the corruption operators. Our work opens up new algorithmic opportunities for developing efficient nonlinear diffusion models on a large scale. Project page and codes: https://i2sb.github.io/ This paper proposes Image-to-Image Schrödinger Bridge (I$^2$SB), a new conditional diffusion model that learns nonlinear diffusion bridges directly between two given distributions, making it particularly suitable for image restoration. Existing diffusion models for image restoration typically start their generative denoising processes with Gaussian white noise, lacking structural information from the degraded images. I$^2$SB overcomes this limitation by directly leveraging the degraded images as informative priors, leading to more efficient and interpretable image restoration. I$^2$SB constructs tractable Schrödinger bridges between individual clean images and their corresponding degraded distributions. It leverages an analytic posterior given boundary pairs for training and utilizes standard DDPM for generation. The method avoids complex simulations typically required by standard Schrödinger bridge models, making it scalable to high-dimensional data. I$^2$SB surpasses standard conditional diffusion models like Palette and ADM in multiple image restoration tasks, including super-resolution, JPEG restoration, and inpainting. I$^2$SB achieves competitive performance to diffusion-based inverse models without requiring knowledge of the corruption operators. I$^2$SB exhibits more interpretable and efficient generation processes with smaller performance drops as the number of function evaluations decreases. The tractability of I$^2$SB relies on the availability of paired data during training, limiting its application in unpaired image translation tasks. Exploring simulation-free diffusion bridges under more flexible setups is an interesting future direction. image restoration, diffusion models, schrödinger bridge, conditional generation, image-to-image translation
2302.05499 Report CUDA: Curriculum of Data Augmentation for Long-Tailed Recognition Sumyeong Ahn, Jongwoo Ko, Se-Young Yun Class imbalance problems frequently occur in real-world tasks, and conventional deep learning algorithms are well known for performance degradation on imbalanced training datasets. To mitigate this problem, many approaches have aimed to balance among given classes by re-weighting or re-sampling training samples. These re-balancing methods increase the impact of minority classes and reduce the influence of majority classes on the output of models. However, the extracted representations may be of poor quality owing to the limited number of minority samples. To handle this restriction, several methods have been developed that increase the representations of minority samples by leveraging the features of the majority samples. Despite extensive recent studies, no deep analysis has been conducted on determining which classes to augment and how strong the augmentation should be. In this study, we first investigate the correlation between the degree of augmentation and class-wise performance, and find that the proper degree of augmentation must be allocated for each class to mitigate class imbalance problems. Motivated by this finding, we propose a simple and efficient novel curriculum, designed to find the appropriate per-class strength of data augmentation, called CUDA: CUrriculum of Data Augmentation for long-tailed recognition. CUDA can simply be integrated into existing long-tailed recognition methods. We present the results of experiments showing that CUDA effectively achieves better generalization performance compared to the state-of-the-art method on various imbalanced datasets such as CIFAR-100-LT, ImageNet-LT, and iNaturalist 2018. The paper proposes CUDA, a simple and efficient curriculum learning-based data augmentation method for long-tailed recognition, which adaptively finds the proper augmentation strength for each class. Class imbalance problems are common in real-world tasks, and traditional deep learning algorithms often perform poorly on imbalanced datasets. While existing methods address this issue by re-weighting or re-sampling training samples, they often fail to fully utilize the limited information available for minority classes. CUDA measures a Level-of-Learning (LoL) score for each class, reflecting the model's ability to classify augmented samples. Based on this score, it generates augmented samples with varying difficulties, gradually increasing the augmentation strength for classes the model learns well and decreasing it for those it struggles with. CUDA consistently outperforms state-of-the-art methods on multiple imbalanced datasets, including CIFAR-100-LT, ImageNet-LT, and iNaturalist 2018. Analysis shows that CUDA improves both the classifier's balance, reducing the variance in weight norms between classes, and the feature extractor's ability, leading to better feature alignment. The LoL score dynamics demonstrate that CUDA effectively adjusts augmentation strength throughout training, allowing the model to learn difficult samples without forgetting the original information. The impact of the number and type of predefined augmentation operations on CUDA's performance could be further investigated. Exploring the effectiveness of CUDA in other domains beyond image classification, such as natural language processing or time series analysis, would be valuable. class imbalance, long-tailed recognition, data augmentation, curriculum learning, deep learning
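A rough sketch of a per-class augmentation curriculum in the spirit of the LoL score described above: each class keeps a score that sets how strongly its samples are augmented, raised when the model handles that class's augmented samples well and lowered otherwise. The acceptance ratio, update rule, and strength sampling are illustrative assumptions, not the authors' code.

```python
import random

class PerClassCurriculum:
    """Keeps a Level-of-Learning (LoL) score per class that sets augmentation strength."""
    def __init__(self, num_classes: int, max_level: int = 10, accept_ratio: float = 0.6):
        self.levels = [0] * num_classes
        self.max_level = max_level
        self.accept_ratio = accept_ratio

    def strength(self, class_id: int) -> int:
        # Sample a strength up to the class's current level so easier
        # (less-augmented) samples stay in the mix.
        return random.randint(0, self.levels[class_id])

    def update(self, class_id: int, num_correct: int, num_total: int) -> None:
        # Raise the level when the model classifies the class's augmented
        # samples well enough; lower it otherwise.
        if num_total == 0:
            return
        if num_correct / num_total >= self.accept_ratio:
            self.levels[class_id] = min(self.levels[class_id] + 1, self.max_level)
        else:
            self.levels[class_id] = max(self.levels[class_id] - 1, 0)
```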
2302.05496 Report MaskSketch: Unpaired Structure-guided Masked Image Generation Dina Bashkirova, Jose Lezama, Kihyuk Sohn, Kate Saenko, Irfan Essa Recent conditional image generation methods produce images of remarkable diversity, fidelity and realism. However, the majority of these methods allow conditioning only on labels or text prompts, which limits their level of control over the generation result. In this paper, we introduce MaskSketch, an image generation method that allows spatial conditioning of the generation result using a guiding sketch as an extra conditioning signal during sampling. MaskSketch utilizes a pre-trained masked generative transformer, requiring no model training or paired supervision, and works with input sketches of different levels of abstraction. We show that intermediate self-attention maps of a masked generative transformer encode important structural information of the input image, such as scene layout and object shape, and we propose a novel sampling method based on this observation to enable structure-guided generation. Our results show that MaskSketch achieves high image realism and fidelity to the guiding structure. Evaluated on standard benchmark datasets, MaskSketch outperforms state-of-the-art methods for sketch-to-image translation, as well as unpaired image-to-image translation approaches. Introduces MaskSketch, a sketch-guided image generation method leveraging pre-trained masked generative transformers for realistic image synthesis with spatial control. Addresses limitations of existing methods that struggle with fine-grained spatial control in image generation, particularly in sketch-to-photo translation due to domain gaps. Utilizes self-attention maps of a pre-trained masked generative transformer to define structural similarity and guide image generation towards the desired layout specified by an input sketch. Demonstrates that self-attention maps encode structural information robust to domain shifts between sketches and photos. Achieves high realism and structure fidelity in sketch-to-photo translation without paired supervision. Outperforms state-of-the-art sketch-to-photo and general unpaired image translation methods in realism and structure preservation. Computational efficiency is a limitation due to multiple sampling iterations and rejection sampling. Limited by the coarse granularity of transformer attention maps and the flexibility of the pre-trained ImageNet model. image generation, sketch-to-photo translation, generative transformers, self-attention maps, structure-guided synthesis
2302.05486 Report RAFaRe: Learning Robust and Accurate Non-parametric 3D Face Reconstruction from Pseudo 2D&3D Pairs Longwei Guo, Hao Zhu, Yuanxun Lu, Menghua Wu, Xun Cao We propose a robust and accurate non-parametric method for single-view 3D face reconstruction (SVFR). While tremendous efforts have been devoted to parametric SVFR, a visible gap still lies between the resulting 3D shape and the ground truth. We believe there are two major obstacles: 1) the representation of the parametric model is limited to a certain face database; 2) 2D images and 3D shapes in the fitted datasets are distinctly misaligned. To resolve these issues, a large-scale pseudo 2D&3D dataset is created by first rendering the detailed 3D faces, then swapping the faces in in-the-wild images with the rendered faces. These pseudo 2D&3D pairs are created from publicly available datasets, which eliminates the gaps between 2D and 3D data while covering diverse appearances, poses, scenes, and illumination. We further propose a non-parametric scheme to learn a well-generalized SVFR model from the created dataset, and the proposed hierarchical signed distance function turns out to be effective in predicting middle-scale and small-scale 3D facial geometry. Our model outperforms previous methods on FaceScape-wild/lab and MICC benchmarks and is well generalized to various appearances, poses, expressions, and in-the-wild environments. The code is released at http://github.com/zhuhao-nju/rafare. This paper presents a novel non-parametric method for single-view 3D face reconstruction (SVFR) that surpasses previous parametric methods limited by 3DMMs and inaccurate training data. Achieving robust and accurate SVFR is crucial for various applications, including facial editing, animation, and VR/AR. The authors create a large-scale pseudo 2D&3D dataset with accurate alignment by swapping in-the-wild faces with precisely reconstructed faces. They then employ a hierarchical signed distance function to train a non-parametric SVFR model on this dataset. The method outperforms previous approaches on FaceScape-wild/lab and MICC benchmarks, demonstrating superior accuracy. It exhibits strong generalization to diverse appearances, poses, expressions, and in-the-wild environments. The hierarchical SDF proves effective in recovering detailed facial geometry at different scales. The non-uniform mesh topology requires an additional registration step for downstream applications. The performance on faces with large poses is relatively lower due to limited training data with extreme poses. 3d face reconstruction, single-view reconstruction, non-parametric method, hierarchical signed distance function, data augmentation
2302.05016 Report Is Multimodal Vision Supervision Beneficial to Language? Avinash Madasu, Vasudev Lal Vision (image and video) - Language (VL) pre-training is the recent popular paradigm that achieved state-of-the-art results on multi-modal tasks like image-retrieval, video-retrieval, visual question answering etc. These models are trained in an unsupervised way and greatly benefit from the complementary modality supervision. In this paper, we explore if the language representations trained using vision supervision perform better than vanilla language representations on Natural Language Understanding and commonsense reasoning benchmarks. We experiment with a diverse set of image-text models such as ALBEF, BLIP, METER and video-text models like ALPRO, Frozen-in-Time (FiT), VIOLET. We compare the performance of language representations of stand-alone text encoders of these models to the language representations of text encoders learnt through vision supervision. Our experiments suggest that vanilla language representations show superior performance on most of the tasks. These results shed light on the current drawbacks of the vision-language models. This paper investigates whether language representations trained with visual supervision from image-text and video-text models perform better than vanilla language representations on Natural Language Understanding (NLU) and commonsense reasoning tasks. Vision-language pre-training has shown success in multi-modal tasks, raising the question of its impact on language understanding capabilities. The study compares vanilla language models (BERT, RoBERTa, DistilBERT) pre-trained on text captions with their vision-supervised counterparts from models like ALBEF, BLIP, METER, ALPRO, FiT, and VIOLET. They are evaluated on GLUE, Superglue, and commonsense reasoning benchmarks. Vanilla language representations outperform vision-supervised counterparts on most NLU tasks (NLI, sentence similarity, reading comprehension). Similar trends are observed for commonsense reasoning benchmarks. However, vision-supervised models show improvements on specific tasks like WNLI (GLUE) and COPA (Superglue). The study primarily focuses on understanding language capabilities and doesn't evaluate multi-modal tasks. Future work can explore the impact of different pre-training objectives and data scales on the performance difference. vision-language pre-training, natural language understanding, commonsense reasoning, language representation learning, multi-modal learning
2302.04871 Report In-N-Out: Faithful 3D GAN Inversion with Volumetric Decomposition for Face Editing Yiran Xu, Zhixin Shu, Cameron Smith, Seoung Wug Oh, Jia-Bin Huang 3D-aware GANs offer new capabilities for view synthesis while preserving the editing functionalities of their 2D counterparts. GAN inversion is a crucial step that seeks the latent code to reconstruct input images or videos, subsequently enabling diverse editing tasks through manipulation of this latent code. However, a model pre-trained on a particular dataset (e.g., FFHQ) often has difficulty reconstructing images with out-of-distribution (OOD) objects such as faces with heavy make-up or occluding objects. We address this issue by explicitly modeling OOD objects from the input in 3D-aware GANs. Our core idea is to represent the image using two individual neural radiance fields: one for the in-distribution content and the other for the out-of-distribution object. The final reconstruction is achieved by optimizing the composition of these two radiance fields with carefully designed regularization. We demonstrate that our explicit decomposition alleviates the inherent trade-off between reconstruction fidelity and editability. We evaluate reconstruction accuracy and editability of our method on challenging real face images and videos and showcase favorable results against other baselines. This paper introduces a novel 3D-aware GAN inversion method for reconstructing and editing portrait images and videos containing out-of-distribution (OOD) objects, such as heavy makeup or accessories. Existing 3D GAN inversion techniques struggle to reconstruct and edit images with OOD objects due to the models being trained primarily on in-distribution data (e.g., natural faces). This limits their ability to handle challenging cases with complex textures or occlusions. The method decomposes the 3D representation into two neural radiance fields, one for the in-distribution face and another for the OOD object. This is achieved by leveraging the tri-plane representation of EG3D and employing a composite volume rendering scheme that combines both radiance fields for reconstruction. The approach achieves high-fidelity reconstruction of faces with OOD objects, outperforming existing methods on metrics such as LPIPS, PSNR, SSIM, and ID similarity. It preserves the editability of the pre-trained GAN, allowing for semantic manipulations like changing facial expressions while leaving the OOD component intact. The method enables 3D-aware applications such as novel view synthesis and OOD object removal. The method faces challenges in editing OOD regions directly, handling duplicate objects (like adding glasses to existing glasses), and dealing with extreme poses. The current implementation primarily focuses on single-frame editing and can suffer from temporal inconsistency in video editing. gan inversion, 3d-aware gans, out-of-distribution data, neural radiance fields, composite volume rendering
2302.04869 Report Reversible Vision Transformers Karttikeya Mangalam, Haoqi Fan, Yanghao Li, Chao-Yuan Wu, Bo Xiong, Christoph Feichtenhofer, Jitendra Malik We present Reversible Vision Transformers, a memory efficient architecture design for visual recognition. By decoupling the GPU memory requirement from the depth of the model, Reversible Vision Transformers enable scaling up architectures with efficient memory usage. We adapt two popular models, namely Vision Transformer and Multiscale Vision Transformers, to reversible variants and benchmark extensively across both model sizes and tasks of image classification, object detection and video classification. Reversible Vision Transformers achieve a reduced memory footprint of up to 15.5x at roughly identical model complexity, parameters and accuracy, demonstrating the promise of reversible vision transformers as an efficient backbone for hardware resource limited training regimes. Finally, we find that the additional computational burden of recomputing activations is more than overcome for deeper models, where throughput can increase up to 2.3x over their non-reversible counterparts. Full code and trained models are available at https://github.com/facebookresearch/slowfast. A simpler, easy to understand and modify version is also available at https://github.com/karttikeya/minREV This paper introduces Reversible Vision Transformers (Rev-ViT and Rev-MViT), memory-efficient versions of ViT and MViT that decouple memory usage from model depth by recomputing activations instead of storing them. The memory requirements of deep Vision Transformers often limit their scalability, especially in memory-intensive tasks like video recognition. Reversible architectures offer a solution by significantly reducing activation memory footprint. The authors adapt ViT and MViT to reversible architectures by employing reversible transformations, reconfiguring residual connections to improve stability in deep models, and developing training recipes tailored for the inherent regularization of reversible networks. Rev-ViT and Rev-MViT achieve comparable accuracy to their non-reversible counterparts across image classification, object detection, and video classification benchmarks. Reversible models exhibit significant memory savings, with Rev-ViT-L using 15.5x less memory and Rev-MViT-B using 4.5x less memory per image than their non-reversible versions. Deeper Rev-MViT models demonstrate up to 2.3x higher throughput compared to standard MViT models due to reduced memory bottlenecks. The stage-transition blocks in Rev-MViT, necessary for resolution changes, still require activation caching, somewhat limiting memory savings. Further research can explore asynchronous activation recomputation and parallelization strategies to further improve the training speed of reversible transformers. vision transformer, reversible architecture, memory efficiency, image classification, video classification, object detection
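A minimal sketch of the reversible two-stream coupling that such architectures build on, showing that block inputs can be recomputed exactly from block outputs, so activations need not be cached during training. Here `f` and `g` are placeholders standing in for the attention and MLP sub-blocks.

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """RevNet-style coupling: inputs are exactly recoverable from outputs."""
    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f, self.g = f, g

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

dim = 64
block = ReversibleBlock(
    f=nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim)),  # stand-in for attention
    g=nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim)),  # stand-in for the MLP
)
x1, x2 = torch.randn(2, dim), torch.randn(2, dim)
with torch.no_grad():
    y1, y2 = block(x1, x2)
    r1, r2 = block.inverse(y1, y2)
print(torch.allclose(r1, x1, atol=1e-5), torch.allclose(r2, x2, atol=1e-5))  # True True
```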
2302.04868 Report MEGANE: Morphable Eyeglass and Avatar Network Junxuan Li, Shunsuke Saito, Tomas Simon, Stephen Lombardi, Hongdong Li, Jason Saragih Eyeglasses play an important role in the perception of identity. Authentic virtual representations of faces can benefit greatly from their inclusion. However, modeling the geometric and appearance interactions of glasses and the face of virtual representations of humans is challenging. Glasses and faces affect each other's geometry at their contact points, and also induce appearance changes due to light transport. Most existing approaches do not capture these physical interactions since they model eyeglasses and faces independently. Others attempt to resolve interactions as a 2D image synthesis problem and suffer from view and temporal inconsistencies. In this work, we propose a 3D compositional morphable model of eyeglasses that accurately incorporates high-fidelity geometric and photometric interaction effects. To support the large variation in eyeglass topology efficiently, we employ a hybrid representation that combines surface geometry and a volumetric representation. Unlike volumetric approaches, our model naturally retains correspondences across glasses, and hence explicit modification of geometry, such as lens insertion and frame deformation, is greatly simplified. In addition, our model is relightable under point lights and natural illumination, supporting high-fidelity rendering of various frame materials, including translucent plastic and metal within a single morphable model. Importantly, our approach models global light transport effects, such as casting shadows between faces and glasses. Our morphable model for eyeglasses can also be fit to novel glasses via inverse rendering. We compare our approach to state-of-the-art methods and demonstrate significant quality improvements. Presents MEGANE, a 3D morphable and relightable model of eyeglasses that captures geometric and photometric interactions between eyeglasses and faces. Existing methods for synthesizing glasses on faces either lack 3D consistency, fail to model interactions, or are not relightable, limiting their realism. Combines a hybrid mesh-volumetric representation for glasses with a generative human head model, leveraging physics-inspired neural relighting and multi-view data with explicit geometry guidance. Accurately models geometric deformations of both glasses and faces at contact points. Achieves high-fidelity relighting under novel illuminations, supporting diverse materials including translucent plastic and metal. Enables few-shot reconstruction of novel glasses and supports lens insertion with realistic refraction and reflection. Initial glasses position and subtle motion due to expressions are entangled. Current relighting is per-point-light, limiting real-time applicability. neural rendering, generative model, 3d face, eyeglasses, relighting
2302.04867 Report UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, Jiwen Lu Diffusion probabilistic models (DPMs) have demonstrated a very promising ability in high-resolution image synthesis. However, sampling from a pre-trained DPM is time-consuming due to the multiple evaluations of the denoising network, making it more and more important to accelerate the sampling of DPMs. Despite recent progress in designing fast samplers, existing methods still cannot generate satisfying images in many applications where fewer steps (e.g., $<$10) are favored. In this paper, we develop a unified corrector (UniC) that can be applied after any existing DPM sampler to increase the order of accuracy without extra model evaluations, and derive a unified predictor (UniP) that supports arbitrary order as a byproduct. Combining UniP and UniC, we propose a unified predictor-corrector framework called UniPC for the fast sampling of DPMs, which has a unified analytical form for any order and can significantly improve the sampling quality over previous methods, especially in extremely few steps. We evaluate our methods through extensive experiments including both unconditional and conditional sampling using pixel-space and latent-space DPMs. Our UniPC can achieve 3.87 FID on CIFAR10 (unconditional) and 7.51 FID on ImageNet 256$\times$256 (conditional) with only 10 function evaluations. Code is available at https://github.com/wl-zhao/UniPC. This paper proposes UniPC, a unified predictor-corrector framework for fast sampling of diffusion probabilistic models (DPMs). Sampling from DPMs is computationally expensive due to many evaluations of the denoising network. UniPC enables faster sampling while maintaining high image quality, particularly in extremely few steps, which is crucial in applications like prompt design for text-to-image models. The framework is based on a novel unified corrector (UniC) that increases the order of accuracy without extra model evaluations and a unified predictor (UniP) that supports arbitrary order. It leverages the structure of exponential integrators with respect to half log-SNR for efficient computation. UniPC achieves superior sampling quality in few-step sampling on various datasets, including CIFAR10, LSUN Bedroom, FFHQ, ImageNet, and MS-COCO2014, outperforming state-of-the-art methods like DPM-Solver++. UniC consistently improves the sampling quality of existing DPM solvers with different updating methods and orders. UniPC allows for customizable order schedules and demonstrates promising results with both noise and data prediction models. UniPC, being a training-free method, still lags behind training-based approaches in performance. Further improvements are possible by exploring better choices for the function B(h), a more accurate estimation of epsilon(x_t, t), and optimal order schedules. diffusion probabilistic models, fast sampling, predictor-corrector, high-order solver, image synthesis
2302.04850 Report Robot Synesthesia: A Sound and Emotion Guided AI Painter Vihaan Misra, Peter Schaldenbrand, Jean Oh If a picture paints a thousand words, sound may voice a million. While recent robotic painting and image synthesis methods have achieved progress in generating visuals from text inputs, the translation of sound into images is vastly unexplored. Generally, sound-based interfaces and sonic interactions have the potential to expand accessibility and control for the user and provide a means to convey complex emotions and the dynamic aspects of the real world. In this paper, we propose an approach for using sound and speech to guide a robotic painting process, known here as robot synesthesia. For general sound, we encode the simulated paintings and input sounds into the same latent space. For speech, we decouple speech into its transcribed text and the tone of the speech. Whereas we use the text to control the content, we estimate the emotions from the tone to guide the mood of the painting. Our approach has been fully integrated with FRIDA, a robotic painting framework, adding sound and speech to FRIDA's existing input modalities, such as text and style. In two surveys, participants were able to correctly guess the emotion or natural sound used to generate a given painting more than twice as likely as random chance. On our sound-guided image manipulation and music-guided paintings, we discuss the results qualitatively. This paper introduces Robot Synesthesia, a novel approach that incorporates sound and speech inputs into the FRIDA robotic painting system, enabling a robot to generate paintings that reflect the semantic and emotional content of auditory cues. This research is important because it explores the underexplored area of translating sound into images, expanding the accessibility and control of robotic painting systems, and enabling a richer expression of human emotions in art created by robots. The methodology involves leveraging pre-trained audio-image encoders like CLIP_audio for natural sounds and decoupling speech into transcribed text (content) and tone (emotion) using Whisper and a Speech Emotion Recognition model. These features are then used to guide the robotic painting process in FRIDA. User studies showed that participants were able to correctly guess the emotion or natural sound used to generate a given painting more than twice as likely as random chance. Paintings generated from natural sounds like rain or thunder were recognizable by human observers. Emotion-guided paintings, even with abstract appearances, successfully conveyed the intended emotion to human viewers. The generalization of the generated content is limited by the training data used for the audio-image and image-emotion models. Evaluating the quality of generated artwork, especially abstract ones, remains a challenge in this field. robotic painting, sound-guided image generation, emotion in art, human-robot interaction, multimodal learning
2302.04841 Report Is This Loss Informative? Faster Text-to-Image Customization by Tracking Objective Dynamics Anton Voronov, Mikhail Khoroshikh, Artem Babenko, Max Ryabinin Text-to-image generation models represent the next step of evolution in image synthesis, offering a natural way to achieve flexible yet fine-grained control over the result. One emerging area of research is the fast adaptation of large text-to-image models to smaller datasets or new visual concepts. However, many efficient methods of adaptation have a long training time, which limits their practical applications, slows down experiments, and spends excessive GPU resources. In this work, we study the training dynamics of popular text-to-image personalization methods (such as Textual Inversion or DreamBooth), aiming to speed them up. We observe that most concepts are learned at early stages and do not improve in quality later, but standard training convergence metrics fail to indicate that. Instead, we propose a simple drop-in early stopping criterion that only requires computing the regular training objective on a fixed set of inputs for all training iterations. Our experiments on Stable Diffusion for 48 different concepts and three personalization methods demonstrate the competitive performance of our approach, which makes adaptation up to 8 times faster with no significant drops in quality. This paper proposes DVAR, a novel early stopping criterion to accelerate the adaptation of text-to-image models (e.g., Textual Inversion, DreamBooth) by leveraging a deterministic training loss calculated on a fixed input batch. Existing adaptation methods for text-to-image models often have long training times, hindering their practical application and efficient experimentation. The authors analyze the training dynamics of adaptation methods and identify that fixing random components in the loss function makes its convergence more interpretable. This observation leads to the development of DVAR, which monitors the stabilization of the deterministic loss for early stopping. DVAR significantly reduces training time (2-8x faster) for various adaptation methods on Stable Diffusion v1.5 without compromising image quality. The deterministic loss used in DVAR provides a more reliable convergence indicator than standard metrics like training loss or gradient norm. The adaptive nature of DVAR helps mitigate overfitting to training images, leading to better generalization to unseen prompts. The study primarily focuses on Stable Diffusion v1.5, and further validation on other models and datasets is needed. Exploring alternative early stopping criteria beyond variance-based methods could be beneficial. text-to-image generation, model personalization, early stopping, stable diffusion, training dynamics
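A hedged sketch of a variance-based early-stopping check in the spirit of DVAR: the same fixed batch (fixed images, noise, and timesteps) is re-scored every iteration, and training stops once the rolling variance of that deterministic loss becomes small relative to its variance over the whole run. The window size and threshold below are assumptions, not the paper's settings.

```python
from collections import deque
import statistics

class DeterministicLossEarlyStopper:
    """Stop when the rolling variance of the fixed-batch loss is small
    relative to its variance over the whole training run."""
    def __init__(self, window_size: int = 50, rel_threshold: float = 0.15):
        self.window = deque(maxlen=window_size)
        self.history = []
        self.rel_threshold = rel_threshold

    def should_stop(self, fixed_batch_loss: float) -> bool:
        self.window.append(fixed_batch_loss)
        self.history.append(fixed_batch_loss)
        if len(self.window) < self.window.maxlen:
            return False  # not enough observations yet
        recent_var = statistics.variance(self.window)
        total_var = statistics.variance(self.history)
        return total_var > 0 and recent_var / total_var < self.rel_threshold
```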
2302.04638 Report Better Diffusion Models Further Improve Adversarial Training Zekai Wang, Tianyu Pang, Chao Du, Min Lin, Weiwei Liu, Shuicheng Yan It has been recognized that the data generated by the denoising diffusion probabilistic model (DDPM) improves adversarial training. After two years of rapid development in diffusion models, a question naturally arises: can better diffusion models further improve adversarial training? This paper gives an affirmative answer by employing the most recent diffusion model which has higher efficiency ($\sim 20$ sampling steps) and image quality (lower FID score) compared with DDPM. Our adversarially trained models achieve state-of-the-art performance on RobustBench using only generated data (no external datasets). Under the $\ell_\infty$-norm threat model with $\epsilon=8/255$, our models achieve $70.69\%$ and $42.67\%$ robust accuracy on CIFAR-10 and CIFAR-100, respectively, i.e. improving upon previous state-of-the-art models by $+4.58\%$ and $+8.03\%$. Under the $\ell_2$-norm threat model with $\epsilon=128/255$, our models achieve $84.86\%$ on CIFAR-10 ($+4.44\%$). These results also beat previous works that use external data. We also provide compelling results on the SVHN and TinyImageNet datasets. Our code is available at https://github.com/wzekai99/DM-Improves-AT. This paper explores the impact of utilizing an advanced diffusion model, the elucidating diffusion model (EDM), to enhance adversarial training (AT) for improved robustness against adversarial attacks. The work is significant as it addresses the question of whether advancements in diffusion models can further improve the effectiveness of AT, a crucial defense against adversarial attacks. The authors generate data using the class-conditional EDM and incorporate it into the AT process, replacing the previously used DDPM-generated data. They conduct comprehensive experiments on CIFAR-10, CIFAR-100, SVHN, and TinyImageNet datasets, comparing their approach to state-of-the-art methods. Replacing DDPM-generated data with EDM-generated data leads to significant improvements in both clean and robust accuracy of adversarially trained models. The authors achieve state-of-the-art results on RobustBench without using any external data, surpassing even previous methods that rely on external datasets. The study reveals that using generated data with lower FID scores (indicating higher quality) consistently leads to enhanced model robustness. The study primarily focuses on ℓ∞ and ℓ2 norm-based attacks, leaving the exploration of other attack types for future work. The work highlights the need for more efficient utilization of diffusion models in adversarial learning to address the computational demands of generating large amounts of data. adversarial training, diffusion models, robustness, data augmentation, edm
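A generic sketch of one adversarial-training step on a batch that mixes real and generator-produced (image, label) pairs. This is plain l_inf PGD training under the common 8/255 setting, not the authors' exact training recipe; how the real and generated samples are ratioed is also left to the caller.

```python
import torch
import torch.nn.functional as F

def pgd_at_step(model, optimizer, x_real, y_real, x_gen, y_gen,
                eps=8 / 255, alpha=2 / 255, steps=10):
    """One adversarial-training update on a mixed real + generated batch."""
    x = torch.cat([x_real, x_gen])
    y = torch.cat([y_real, y_gen])
    # Inner maximization: l_inf PGD starting from a random point in the eps-ball.
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(torch.clamp(x + delta, 0, 1)), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    # Outer minimization on the resulting adversarial examples.
    optimizer.zero_grad()
    F.cross_entropy(model(torch.clamp(x + delta.detach(), 0, 1)), y).backward()
    optimizer.step()
```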
2302.04440 Report Feature Likelihood Divergence: Evaluating the Generalization of Generative Models Using Samples Marco Jiralerspong, Avishek Joey Bose, Ian Gemp, Chongli Qin, Yoram Bachrach, Gauthier Gidel The past few years have seen impressive progress in the development of deep generative models capable of producing high-dimensional, complex, and photo-realistic data. However, current methods for evaluating such models remain incomplete: standard likelihood-based metrics do not always apply and rarely correlate with perceptual fidelity, while sample-based metrics, such as FID, are insensitive to overfitting, i.e., inability to generalize beyond the training set. To address these limitations, we propose a new metric called the Feature Likelihood Divergence (FLD), a parametric sample-based metric that uses density estimation to provide a comprehensive trichotomic evaluation accounting for novelty (i.e., different from the training samples), fidelity, and diversity of generated samples. We empirically demonstrate the ability of FLD to identify overfitting problem cases, even when previously proposed metrics fail. We also extensively evaluate FLD on various image datasets and model classes, demonstrating its ability to match intuitions of previous metrics like FID while offering a more comprehensive evaluation of generative models. Code is available at https://github.com/marcojira/fld. Proposes Feature Likelihood Divergence (FLD), a sample-based metric for evaluating generative models that captures sample fidelity, diversity, and novelty. Existing sample-based metrics like FID, while correlating with sample quality and diversity, fail to detect overfitting (memorization of the training set), which is crucial for assessing generalization ability and addressing privacy concerns. FLD leverages a Mixture of Gaussians (MoG) density estimator in a perceptually meaningful feature space (e.g., DINOv2). It fits the MoG's variances to the training set such that overfit samples receive vanishingly small variances, negatively impacting the density estimation and resulting FLD score. FLD correlates strongly with sample fidelity, penalizing perceptually significant transformations more than minor ones. It effectively captures mode coverage and diversity, with scores improving as generated samples encompass more classes and avoid redundant copies. FLD demonstrates consistent detection of overfitting, even for subtly transformed copies of training data, outperforming FID, CT, and AuthPct metrics. The reliance on fixed feature spaces might not generalize to all datasets and modalities, necessitating exploration of alternative embeddings. Future work could explore extensions of FLD for evaluating conditional generative models and its applicability in other data modalities like text, audio, and time series. generative models, evaluation metrics, overfitting detection, sample fidelity, sample diversity
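A rough sketch of the kind of density estimate such a metric builds on: an isotropic Gaussian mixture with one component per generated sample, evaluated on held-out features in a perceptual feature space. The bandwidth fitting against the training set, which is what penalizes memorized samples in FLD, is omitted here; `sigma` is a placeholder, and the naive broadcast is O(M·N·d) in memory.

```python
import numpy as np
from scipy.special import logsumexp

def mog_log_likelihood(gen_feats: np.ndarray, test_feats: np.ndarray,
                       sigma: float = 0.5) -> float:
    """Mean log-likelihood of held-out features under an isotropic MoG whose
    components are centered at the generated-sample features."""
    n, d = gen_feats.shape
    # Squared distance from every test feature to every mixture center.
    sq_dists = ((test_feats[:, None, :] - gen_feats[None, :, :]) ** 2).sum(-1)
    log_comp = -0.5 * sq_dists / sigma**2 - 0.5 * d * np.log(2 * np.pi * sigma**2)
    log_mix = logsumexp(log_comp, axis=1) - np.log(n)  # uniform mixture weights
    return float(log_mix.mean())
```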
2302.04265 Report PFGM++: Unlocking the Potential of Physics-Inspired Generative Models Yilun Xu, Ziming Liu, Yonglong Tian, Shangyuan Tong, Max Tegmark, Tommi Jaakkola We introduce a new family of physics-inspired generative models termed PFGM++ that unifies diffusion models and Poisson Flow Generative Models (PFGM). These models realize generative trajectories for $N$ dimensional data by embedding paths in $N{+}D$ dimensional space while still controlling the progression with a simple scalar norm of the $D$ additional variables. The new models reduce to PFGM when $D{=}1$ and to diffusion models when $D{\to}\infty$. The flexibility of choosing $D$ allows us to trade off robustness against rigidity as increasing $D$ results in more concentrated coupling between the data and the additional variable norms. We dispense with the biased large batch field targets used in PFGM and instead provide an unbiased perturbation-based objective similar to diffusion models. To explore different choices of $D$, we provide a direct alignment method for transferring well-tuned hyperparameters from diffusion models ($D{\to} \infty$) to any finite $D$ values. Our experiments show that models with finite $D$ can be superior to previous state-of-the-art diffusion models on CIFAR-10/FFHQ $64{\times}64$ datasets, with FID scores of $1.91/2.43$ when $D{=}2048/128$. In class-conditional setting, $D{=}2048$ yields current state-of-the-art FID of $1.74$ on CIFAR-10. In addition, we demonstrate that models with smaller $D$ exhibit improved robustness against modeling errors. Code is available at https://github.com/Newbeeer/pfgmpp Presents PFGM++, a new family of physics-inspired generative models unifying diffusion models and Poisson Flow Generative Models (PFGM) by embedding data in higher dimensions and controlling generation with a scalar norm. Provides flexibility in balancing robustness and learning rigidity, potentially leading to improved generative models, particularly in resource-constrained settings. Expands PFGM's electrostatic view into higher dimensions, introduces a perturbation-based training objective, and proves equivalence to diffusion models as the augmentation dimension approaches infinity. Models with finite augmentation dimensions outperform state-of-the-art diffusion models on CIFAR-10/FFHQ 64x64 datasets. An optimal augmentation dimension exists that balances robustness and learning efficiency. Decreasing the augmentation dimension improves robustness against modeling errors like noise injection, large sampling steps, and quantization. Identifying the optimal augmentation dimension for various architectures and tasks requires further analysis. Developing stochastic samplers for PFGM++ is a promising direction. generative models, diffusion models, poisson flow generative models, robustness, image generation
2302.04233 Report SkyEye: Self-Supervised Bird's-Eye-View Semantic Mapping Using Monocular Frontal View Images Nikhil Gosala, Kürsat Petek, Paulo L. J. Drews-Jr, Wolfram Burgard, Abhinav Valada Bird's-Eye-View (BEV) semantic maps have become an essential component of automated driving pipelines due to the rich representation they provide for decision-making tasks. However, existing approaches for generating these maps still follow a fully supervised training paradigm and hence rely on large amounts of annotated BEV data. In this work, we address this limitation by proposing the first self-supervised approach for generating a BEV semantic map using a single monocular image from the frontal view (FV). During training, we overcome the need for BEV ground truth annotations by leveraging the more easily available FV semantic annotations of video sequences. Thus, we propose the SkyEye architecture that learns based on two modes of self-supervision, namely, implicit supervision and explicit supervision. Implicit supervision trains the model by enforcing spatial consistency of the scene over time based on FV semantic sequences, while explicit supervision exploits BEV pseudolabels generated from FV semantic annotations and self-supervised depth estimates. Extensive evaluations on the KITTI-360 dataset demonstrate that our self-supervised approach performs on par with the state-of-the-art fully supervised methods and achieves competitive results using only 1% of direct supervision in the BEV compared to fully supervised approaches. Finally, we publicly release both our code and the BEV datasets generated from the KITTI-360 and Waymo datasets. This paper introduces SkyEye, the first self-supervised framework for generating semantic bird's-eye-view (BEV) maps from single monocular front-view images. Generating BEV semantic maps is essential for autonomous driving, but existing methods rely on large amounts of annotated BEV data, which is difficult and expensive to obtain. SkyEye addresses this by utilizing more readily available front-view annotations and self-supervision. SkyEye leverages implicit supervision, enforcing spatial consistency over time using front-view semantic sequences, and explicit supervision, utilizing BEV pseudolabels generated from front-view annotations and self-supervised depth estimates. SkyEye achieves performance comparable to state-of-the-art fully supervised methods on the KITTI-360 dataset without using BEV ground truth annotations. The approach shows competitive results even when trained with only 1% of BEV pseudolabels compared to fully supervised approaches. SkyEye demonstrates superior generalization capabilities compared to baselines when pretrained on KITTI-360 and evaluated on Waymo. The model's reliance on temporal context can impact performance in highly dynamic scenes. Perspective distortion limits spatial observability for distant regions, a common limitation for camera-based methods. self-supervised learning, bev semantic mapping, autonomous driving, monocular vision, 3d representation learning
2302.03675 Report Auditing Gender Presentation Differences in Text-to-Image Models Yanzhe Zhang, Lu Jiang, Greg Turk, Diyi Yang Text-to-image models, which can generate high-quality images based on textual input, have recently enabled various content-creation tools. Despite significantly affecting a wide range of downstream applications, the distributions of these generated images are still not fully understood, especially when it comes to the potential stereotypical attributes of different genders. In this work, we propose a paradigm (Gender Presentation Differences) that utilizes fine-grained self-presentation attributes to study how gender is presented differently in text-to-image models. By probing gender indicators in the input text (e.g., "a woman" or "a man"), we quantify the frequency differences of presentation-centric attributes (e.g., "a shirt" and "a dress") through human annotation and introduce a novel metric: GEP. Furthermore, we propose an automatic method to estimate such differences. The automatic GEP metric based on our approach yields a higher correlation with human annotations than that based on existing CLIP scores, consistently across three state-of-the-art text-to-image models. Finally, we demonstrate the generalization ability of our metrics in the context of gender stereotypes related to occupations. This paper proposes GEP, a novel metric to quantify gender presentation differences in text-to-image models by analyzing the frequency of presentation-centric attributes (e.g., clothing) in images generated with different gender indicators. Understanding how gender is portrayed in generated images is crucial for identifying and mitigating potential biases and stereotypes perpetuated by text-to-image models. The authors define gender indicators, attributes, and contexts to construct prompts for image generation. They manually annotate the frequency of attributes in generated images and propose an automatic method using cross-modal classifiers trained on CLIP embeddings to estimate these frequencies. Significant attribute-wise differences are observed in generated images when prompting with different genders, both with and without explicit attribute mentions. The proposed automatic GEP metric based on cross-modal classifiers shows a stronger correlation with human annotations than using CLIP similarity scores alone. The GEP metric can be extended to reveal attribute-based gender stereotypes related to occupations. The study is limited by the selected set of attributes and contexts, which may not be exhaustive or representative of all scenarios. The lack of real-world distribution data for the studied attributes makes it difficult to determine if the observed differences are amplified compared to reality. text-to-image generation, gender bias, stereotype detection, clip, cross-modal classifiers
2302.03594 Report NICER-SLAM: Neural Implicit Scene Encoding for RGB SLAM Zihan Zhu, Songyou Peng, Viktor Larsson, Zhaopeng Cui, Martin R. Oswald, Andreas Geiger, Marc Pollefeys Neural implicit representations have recently become popular in simultaneous localization and mapping (SLAM), especially in dense visual SLAM. However, previous works in this direction either rely on RGB-D sensors, or require a separate monocular SLAM approach for camera tracking and do not produce high-fidelity dense 3D scene reconstruction. In this paper, we present NICER-SLAM, a dense RGB SLAM system that simultaneously optimizes for camera poses and a hierarchical neural implicit map representation, which also allows for high-quality novel view synthesis. To facilitate the optimization process for mapping, we integrate additional supervision signals including easy-to-obtain monocular geometric cues and optical flow, and also introduce a simple warping loss to further enforce geometry consistency. Moreover, to further boost performance in complicated indoor scenes, we also propose a local adaptive transformation from signed distance functions (SDFs) to density in the volume rendering equation. On both synthetic and real-world datasets we demonstrate strong performance in dense mapping, tracking, and novel view synthesis, even competitive with recent RGB-D SLAM systems. NICER-SLAM, a novel dense RGB SLAM system that uses a hierarchical neural implicit representation for end-to-end optimization of both scene representation and camera poses. To address limitations of existing dense SLAM systems that either rely on RGB-D sensors or separate tracking and mapping pipelines, hindering high-fidelity dense 3D reconstruction in challenging scenarios with monocular RGB input. The system leverages a hierarchical neural implicit representation for scene geometry and color, incorporating geometric and motion regularizations, including monocular cues and a novel warping loss. A locally adaptive SDF-to-density transformation enhances performance in complex indoor environments. NICER-SLAM achieves competitive 3D reconstruction and tracking accuracy compared to RGB-D SLAM methods, even without depth input. It demonstrates superior novel view synthesis quality, surpassing both traditional and implicit-based SLAM approaches. The system exhibits robustness in challenging scenarios with low-resolution images and motion blur. The current implementation is not yet real-time. Loop closure is not incorporated, limiting long-term tracking accuracy. slam, neural implicit representations, 3d reconstruction, novel view synthesis, monocular rgb
2302.03406 Report High-Resolution GAN Inversion for Degraded Images in Large Diverse Datasets Yanbo Wang, Chuming Lin, Donghao Luo, Ying Tai, Zhizhong Zhang, Yuan Xie The last decades are marked by massive and diverse image data, which shows increasingly high resolution and quality. However, some images we obtained may be corrupted, affecting the perception and the application of downstream tasks. A generic method for generating a high-quality image from the degraded one is in demand. In this paper, we present a novel GAN inversion framework that utilizes the powerful generative ability of StyleGAN-XL for this problem. To ease the inversion challenge with StyleGAN-XL, Clustering \& Regularize Inversion (CRI) is proposed. Specifically, the latent space is firstly divided into finer-grained sub-spaces by clustering. Instead of initializing the inversion with the average latent vector, we approximate a centroid latent vector from the clusters, which generates an image close to the input image. Then, an offset with a regularization term is introduced to keep the inverted latent vector within a certain range. We validate our CRI scheme on multiple restoration tasks (i.e., inpainting, colorization, and super-resolution) of complex natural images, and show preferable quantitative and qualitative results. We further demonstrate our technique is robust in terms of data and different GAN models. To our best knowledge, we are the first to adopt StyleGAN-XL for generating high-quality natural images from diverse degraded inputs. Code is available at https://github.com/Booooooooooo/CRI. This paper proposes CRI, a novel GAN inversion framework utilizing StyleGAN-XL to generate high-quality images from diverse degraded inputs (e.g., inpainted, colorized, or low-resolution images). Generating high-quality images from degraded images is crucial due to the uneven quality of online images and the need for high-quality images in various applications. CRI utilizes clustering to find a better starting point for optimization in the complex latent space of StyleGAN-XL and introduces a regularized offset to constrain the optimization process, ensuring high perceptual quality of generated images. CRI outperforms existing GAN inversion methods in image inpainting, colorization, and super-resolution tasks on ImageNet, CelebA-HQ, and out-of-domain datasets. CRI with StyleGAN-XL generates higher quality images than DGP with BigGAN, highlighting the benefit of using StyleGAN-XL for this task. Ablation studies demonstrate the effectiveness of clustering and the regularized offset in improving both quantitative and qualitative results. The clustering time increases with the number of clusters, posing a trade-off between performance and computational cost. Further exploration of applying CRI to other degradation types like noise and blur is left for future work. gan inversion, image restoration, stylegan-xl, image inpainting, image colorization, super-resolution
2302.03084 Report Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, Tomas Pfister In Composed Image Retrieval (CIR), a user combines a query image with text to describe their intended target. Existing methods rely on supervised learning of CIR models using labeled triplets consisting of the query image, text specification, and the target image. Labeling such triplets is expensive and hinders broad applicability of CIR. In this work, we propose to study an important task, Zero-Shot Composed Image Retrieval (ZS-CIR), whose goal is to build a CIR model without requiring labeled triplets for training. To this end, we propose a novel method, called Pic2Word, that requires only weakly labeled image-caption pairs and unlabeled image datasets to train. Unlike existing supervised CIR models, our model trained on weakly labeled or unlabeled datasets shows strong generalization across diverse ZS-CIR tasks, e.g., attribute editing, object composition, and domain conversion. Our approach outperforms several supervised CIR methods on the common CIR benchmark, CIRR and Fashion-IQ. Code will be made publicly available at https://github.com/google-research/composed_image_retrieval. This paper proposes a novel task, Zero-Shot Composed Image Retrieval (ZS-CIR), and introduces a method called Pic2Word to address it. Pic2Word enables CIR without requiring labeled triplets for training. Existing CIR methods are limited by the need for expensive labeled triplet data and often struggle to generalize to diverse CIR tasks. ZS-CIR aims to overcome these limitations by enabling CIR models to function without task-specific labeled data. Pic2Word leverages pre-trained vision-language contrastive learning models (e.g., CLIP) and learns a mapping network that converts image embeddings into pseudo language tokens. This allows for the composition of image and text queries within the language embedding space, effectively achieving early fusion. Pic2Word significantly outperforms zero-shot baselines on domain conversion, object composition, and scene manipulation tasks. On CIRR and Fashion-IQ datasets, Pic2Word achieves performance comparable to or better than several recent supervised CIR methods trained on labeled data. Analysis suggests that the learned pseudo language tokens effectively capture image information and that the method benefits from the early fusion strategy. The performance of Pic2Word on CIRR and Fashion-IQ highlights the potential dataset-specific bias in relative importance of image and text modalities. Future work can explore the use of multiple pseudo tokens to represent images with finer details. composed image retrieval, zero-shot learning, vision-language models, contrastive learning, early fusion
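An illustrative sketch (assumptions throughout) of the Pic2Word idea: a small mapping network turns a frozen image embedding into a single pseudo word-token embedding, which would then be spliced into a prompt such as "a photo of [*]" before the frozen text encoder. Only the mapper and a contrastive objective between image features and the composed text features are shown; the encoders and the token-splicing step are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Pic2WordMapper(nn.Module):
    """Maps a frozen image embedding to one pseudo word-token embedding."""
    def __init__(self, embed_dim: int = 512, hidden: int = 512, token_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, token_dim),
        )

    def forward(self, image_embedding: torch.Tensor) -> torch.Tensor:
        return self.net(image_embedding)

def contrastive_loss(image_feats: torch.Tensor, composed_text_feats: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Pull each image toward the text feature built from its own pseudo token."""
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(composed_text_feats, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0), device=img.device)
    return F.cross_entropy(logits, targets)
```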
2302.03024 Report AIM: Adapting Image Models for Efficient Video Action Recognition Taojiannan Yang, Yi Zhu, Yusheng Xie, Aston Zhang, Chen Chen, Mu Li Recent vision transformer based video models mostly follow the "image pre-training then finetuning" paradigm and have achieved great success on multiple video benchmarks. However, full finetuning such a video model could be computationally expensive and unnecessary, given that pre-trained image transformer models have demonstrated exceptional transferability. In this work, we propose a novel method to Adapt pre-trained Image Models (AIM) for efficient video understanding. By freezing the pre-trained image model and adding a few lightweight Adapters, we introduce spatial adaptation, temporal adaptation and joint adaptation to gradually equip an image model with spatiotemporal reasoning capability. We show that our proposed AIM can achieve competitive or even better performance than prior arts with substantially fewer tunable parameters on four video action recognition benchmarks. Thanks to its simplicity, our method is also generally applicable to different image pre-trained models, which has the potential to leverage more powerful image foundation models in the future. The project webpage is https://adapt-image-models.github.io/. This paper proposes AIM, a novel method to adapt pre-trained image models (e.g., ViT) for efficient video understanding by adding lightweight adapters and reusing spatial attention for temporal modeling. Full finetuning of video models is computationally expensive and potentially unnecessary given the strong transferability of pre-trained image models. The method introduces spatial, temporal, and joint adaptation modules. Spatial adaptation uses adapters after self-attention for spatial feature refinement. Temporal adaptation reuses image self-attention for temporal modeling and adds adapters for temporal feature tuning. Joint adaptation utilizes adapters in parallel to MLP layers for spatiotemporal reasoning. AIM achieves competitive or better performance than state-of-the-art methods on K400, K700, and Diving-48 with significantly fewer tunable parameters. AIM exhibits data efficiency, outperforming fully finetuned counterparts, especially in low-data regimes. The method is simple, generally applicable, and reduces training cost significantly compared to full finetuning. The reused spatial attention for temporal modeling might not be sufficient for complex temporal relationships. Future work could explore leveraging pre-trained weights from text or audio models for enhanced temporal adaptation. video understanding, action recognition, efficient finetuning, vision transformer, transfer learning
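A minimal sketch of the kind of bottleneck adapter such methods insert into a frozen ViT block. The bottleneck ratio and zero-initialization are illustrative choices; where the adapter is placed (after self-attention, in parallel with the MLP, etc.) follows the design space described above and is up to the caller.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter with a residual connection; the frozen backbone's
    behavior is preserved at initialization because the up-projection is zero."""
    def __init__(self, dim: int, bottleneck_ratio: float = 0.25):
        super().__init__()
        hidden = int(dim * bottleneck_ratio)
        self.down = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.up = nn.Linear(hidden, dim)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))
```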
2302.03011 Report Structure and Content-Guided Video Synthesis with Diffusion Models Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, Anastasis Germanidis Text-guided generative diffusion models unlock powerful image creation and editing tools. While these have been extended to video generation, current approaches that edit the content of existing footage while retaining structure require expensive re-training for every input or rely on error-prone propagation of image edits across frames. In this work, we present a structure and content-guided video diffusion model that edits videos based on visual or textual descriptions of the desired output. Conflicts between user-provided content edits and structure representations occur due to insufficient disentanglement between the two aspects. As a solution, we show that training on monocular depth estimates with varying levels of detail provides control over structure and content fidelity. Our model is trained jointly on images and videos which also exposes explicit control of temporal consistency through a novel guidance method. Our experiments demonstrate a wide variety of successes; fine-grained control over output characteristics, customization based on a few reference images, and a strong user preference towards results by our model. Presents a structure and content-guided video diffusion model that edits videos based on visual or textual descriptions, addressing limitations of prior methods relying on expensive retraining or error-prone propagation. Improves video editing by enabling intuitive content modification while retaining structure, addressing the challenge of balancing temporal consistency and spatial detail in existing video editing tools. Extends latent diffusion models to video generation using temporal layers in a pretrained image model, trained jointly on images and videos with conditioning on depth estimates (structure) and CLIP embeddings (content). Enables diverse video edits including style changes, environment modifications, and character replacements guided by text prompts or example images. Provides control over temporal consistency, content fidelity, and structure adherence through novel guidance methods and variable depth blurring. Demonstrates superior performance in user studies, with a strong preference for generated results compared to baseline methods. Reliance on depth maps as a structure representation limits the extent of content edits, particularly those involving significant changes in object shape. Potential for misuse of generative models for harmful purposes requires further research on mitigating abuse. video editing, diffusion models, generative ai, text-to-video, content and structure
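The structure condition is a monocular depth estimate whose level of detail controls how tightly the output follows the source; the sketch below only illustrates one plausible way to build such a coarse-to-fine depth signal (the paper's exact blur schedule is not reproduced here).

```python
import torch
import torch.nn.functional as F

def coarsen_depth(depth, level):
    """Reduce the detail of a depth map (B, 1, H, W) by repeated average pooling,
    then resize back; higher 'level' -> coarser structure conditioning.
    The schedule is an assumption, not the paper's exact operator."""
    h, w = depth.shape[-2:]
    for _ in range(level):
        depth = F.avg_pool2d(depth, kernel_size=2)
    return F.interpolate(depth, size=(h, w), mode="bilinear", align_corners=False)
```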
2302.02908 Report LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Retrieval Ziyang luo, Pu Zhao, Can Xu, Xiubo Geng, Tao Shen, Chongyang Tao, Jing Ma, Qingwen lin, Daxin Jiang Image-text retrieval (ITR) is a task to retrieve the relevant images/texts, given the query from another modality. The conventional dense retrieval paradigm relies on encoding images and texts into dense representations using dual-stream encoders, however, it faces challenges with low retrieval speed in large-scale retrieval scenarios. In this work, we propose the lexicon-weighting paradigm, where sparse representations in vocabulary space are learned for images and texts to take advantage of the bag-of-words models and efficient inverted indexes, resulting in significantly reduced retrieval latency. A crucial gap arises from the continuous nature of image data, and the requirement for a sparse vocabulary space representation. To bridge this gap, we introduce a novel pre-training framework, Lexicon-Bottlenecked Language-Image Pre-Training (LexLIP), that learns importance-aware lexicon representations. This framework features lexicon-bottlenecked modules between the dual-stream encoders and weakened text decoders, allowing for constructing continuous bag-of-words bottlenecks to learn lexicon-importance distributions. Upon pre-training with same-scale data, our LexLIP achieves state-of-the-art performance on two benchmark ITR datasets, MSCOCO and Flickr30k. Furthermore, in large-scale retrieval scenarios, LexLIP outperforms CLIP with a 5.5 ~ 221.3X faster retrieval speed and 13.2 ~ 48.8X less index storage memory. This paper presents LexLIP, a lexicon-weighting paradigm for large-scale image-text retrieval that leverages sparse representations in vocabulary space for faster retrieval and reduced storage compared to conventional dense retrieval methods. Large-scale image-text retrieval faces challenges with low retrieval speed and high storage requirements when using dense retrieval methods, limiting their practicality in real-world applications. LexLIP introduces a lexicon-bottlenecked pre-training framework with dual-stream encoders, lexicon-bottlenecked modules, and weakened text decoders to learn importance-aware lexicon representations for images and texts. This enables efficient retrieval using bag-of-words models and inverted indexes. LexLIP achieves state-of-the-art performance on MSCOCO and Flickr30k image-text retrieval benchmarks with smaller pre-training datasets compared to previous methods. In large-scale retrieval scenarios, LexLIP demonstrates significantly faster retrieval speed (5.5x-221.3x) and reduced index storage memory (13.2x-48.8x) compared to CLIP. Ablation studies highlight the contribution of each component in the LexLIP framework, particularly the contrastive learning objectives for aligning image and text representations in the vocabulary space. The large-scale benchmark is established by expanding Flickr30k with 1M random pairs from Conceptual Caption 12M, which may not fully represent real-world large-scale retrieval scenarios. Future work includes exploring alternative sparsification strategies and applying LexLIP to other cross-modal retrieval tasks beyond image-text retrieval. image-text retrieval, lexicon-weighting paradigm, lexicon-bottlenecked pre-training, large-scale retrieval, sparse representation
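A small sketch of the lexicon-weighting idea (a SPLADE-style pooling is assumed here; LexLIP's exact sparsification and bottleneck modules are not reproduced): per-token logits over the vocabulary are collapsed into an importance-weighted bag-of-words vector, so retrieval reduces to a sparse dot product that an inverted index can serve.

```python
import torch

def lexicon_vector(token_vocab_logits):
    """(B, L, V) per-token vocabulary logits -> (B, V) sparse, importance-aware
    lexicon representation (max pooling with log saturation; assumed activation)."""
    return torch.log1p(torch.relu(token_vocab_logits)).amax(dim=1)

def relevance(query_vecs, doc_vecs):
    """Bag-of-words relevance as a sparse dot product: (Bq, V) x (Bd, V) -> (Bq, Bd)."""
    return query_vecs @ doc_vecs.t()
```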
2302.02693 Report PatchDCT: Patch Refinement for High Quality Instance Segmentation Qinrou Wen, Jirui Yang, Xue Yang, Kewei Liang High-quality instance segmentation has shown emerging importance in computer vision. Without any refinement, DCT-Mask directly generates high-resolution masks by compressed vectors. To further refine masks obtained by compressed vectors, we propose for the first time a compressed vector based multi-stage refinement framework. However, the vanilla combination does not bring significant gains, because changes in some elements of the DCT vector will affect the prediction of the entire mask. Thus, we propose a simple and novel method named PatchDCT, which separates the mask decoded from a DCT vector into several patches and refines each patch by the designed classifier and regressor. Specifically, the classifier is used to distinguish mixed patches from all patches, and to correct previously mispredicted foreground and background patches. In contrast, the regressor is used for DCT vector prediction of mixed patches, further refining the segmentation quality at boundary locations. Experiments on COCO show that our method achieves 2.0%, 3.2%, 4.5% AP and 3.4%, 5.3%, 7.0% Boundary AP improvements over Mask-RCNN on COCO, LVIS, and Cityscapes, respectively. It also surpasses DCT-Mask by 0.7%, 1.1%, 1.3% AP and 0.9%, 1.7%, 4.2% Boundary AP on COCO, LVIS and Cityscapes. Besides, the performance of PatchDCT is also competitive with other state-of-the-art methods. Proposes PatchDCT, a compressed vector-based instance segmentation method that refines mask patches independently using patch DCT vectors, achieving high-quality masks with fine boundaries. Existing methods struggle to refine high-resolution instance masks due to limitations in low-resolution representations or difficulties in refining global DCT vectors. Divides masks into patches, classifies them into foreground, background, or mixed, and refines mixed patches with a regressor predicting short, informative DCT vectors. Achieves 2.0% AP and 3.4% Boundary AP improvement over Mask-RCNN on COCO. Outperforms DCT-Mask by 0.7% AP and 0.9% Boundary AP on COCO. Demonstrates competitive performance with state-of-the-art methods on COCO test-dev. May generate masks with holes in semantically ambiguous areas. Future work includes improving classification and regression for better handling such areas, and exploring applications in other challenging domains like aerial images. instance segmentation, dct, multi-stage refinement, patching, boundary attention
2302.02615 Report Rethinking Out-of-distribution (OOD) Detection: Masked Image Modeling is All You Need Jingyao Li, Pengguang Chen, Shaozuo Yu, Zexin He, Shu Liu, Jiaya Jia The core of out-of-distribution (OOD) detection is to learn the in-distribution (ID) representation, which is distinguishable from OOD samples. Previous work applied recognition-based methods to learn the ID features, which tend to learn shortcuts instead of comprehensive representations. In this work, we find surprisingly that simply using reconstruction-based methods could boost the performance of OOD detection significantly. We deeply explore the main contributors of OOD detection and find that reconstruction-based pretext tasks have the potential to provide a generally applicable and efficacious prior, which benefits the model in learning intrinsic data distributions of the ID dataset. Specifically, we take Masked Image Modeling as a pretext task for our OOD detection framework (MOOD). Without bells and whistles, MOOD outperforms previous SOTA of one-class OOD detection by 5.7%, multi-class OOD detection by 3.0%, and near-distribution OOD detection by 2.1%. It even defeats the 10-shot-per-class outlier exposure OOD detection, although we do not include any OOD samples for our detection This paper proposes MOOD, a novel out-of-distribution (OOD) detection framework leveraging Masked Image Modeling (MIM) as a pretext task to learn intrinsic data distributions of in-distribution data. Existing recognition-based methods for OOD detection often learn shortcuts instead of comprehensive representations, limiting their effectiveness. This work shows that reconstruction-based methods like MIM significantly improve OOD detection by learning real data distribution. The paper explores various factors influencing OOD detection, including pretext tasks (comparing MIM with contrastive learning), architectures (evaluating ViT, BiT, and MLP-Mixer), fine-tuning processes, and OOD detection metrics. It uses MIM pre-training on ImageNet-21k, followed by fine-tuning on the ID dataset, and employs Mahalanobis distance for OOD detection. MOOD outperforms SOTA on one-class OOD detection by 5.7%, achieving 94.9% AUROC. On multi-class OOD detection, MOOD surpasses SOTA by 3.0%, reaching 97.6% AUROC. For near-distribution OOD detection, MOOD achieves 98.3% AUROC, 2.1% higher than previous SOTA. The paper doesn't conduct experiments with intermediate fine-tuning on ImageNet-30 for one-class OOD detection, potentially limiting performance. Future work could explore the effectiveness of other reconstruction-based pretext tasks for OOD detection. out-of-distribution detection, masked image modeling, vision transformer, self-supervised learning, anomaly detection
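The detection step on top of the MIM-pretrained, fine-tuned backbone is simple; below is a hedged sketch of class-conditional Mahalanobis scoring with a tied covariance (labels are assumed to be integers 0..K-1).

```python
import torch

def fit_class_gaussians(features, labels):
    """Class means and a tied covariance on in-distribution features.
    features: (N, D); labels: (N,), assumed to be integers 0..K-1."""
    num_classes = int(labels.max().item()) + 1
    means = torch.stack([features[labels == c].mean(dim=0) for c in range(num_classes)])
    centered = features - means[labels]
    cov = centered.t() @ centered / features.shape[0]
    return means, torch.linalg.pinv(cov)

def mahalanobis_ood_score(feats, means, precision):
    """Negative minimum class-conditional Mahalanobis distance; higher = more ID-like."""
    diffs = feats.unsqueeze(1) - means.unsqueeze(0)              # (B, K, D)
    d2 = torch.einsum("bkd,de,bke->bk", diffs, precision, diffs)
    return -d2.min(dim=1).values
```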
2302.02550 Report Domain Re-Modulation for Few-Shot Generative Domain Adaptation Yi Wu, Ziqiang Li, Chaoyue Wang, Heliang Zheng, Shanshan Zhao, Bin Li, Dacheng Tao In this study, we delve into the task of few-shot Generative Domain Adaptation (GDA), which involves transferring a pre-trained generator from one domain to a new domain using only a few reference images. Inspired by the way human brains acquire knowledge in new domains, we present an innovative generator structure called Domain Re-Modulation (DoRM). DoRM not only meets the criteria of high quality, large synthesis diversity, and cross-domain consistency, which were achieved by previous research in GDA, but also incorporates memory and domain association, akin to how human brains operate. Specifically, DoRM freezes the source generator and introduces new mapping and affine modules (M&A modules) to capture the attributes of the target domain during GDA. This process resembles the formation of new synapses in human brains. Consequently, a linearly combinable domain shift occurs in the style space. By incorporating multiple new M&A modules, the generator gains the capability to perform high-fidelity multi-domain and hybrid-domain generation. Moreover, to maintain cross-domain consistency more effectively, we introduce a similarity-based structure loss. This loss aligns the auto-correlation map of the target image with its corresponding auto-correlation map of the source image during training. Through extensive experiments, we demonstrate the superior performance of our DoRM and similarity-based structure loss in few-shot GDA, both quantitatively and qualitatively. The code will be available at https://github.com/wuyi2020/DoRM. This paper presents DoRM, a novel generator structure for few-shot Generative Domain Adaptation (GDA) inspired by the human brain's learning mechanism, achieving high-quality, diverse, and cross-domain consistent image synthesis while enabling memory and domain association. Few-shot GDA aims to transfer a pre-trained generator to a new domain using limited data. Existing methods struggle with multi-domain generation and synthesizing images in unseen hybrid domains, limitations addressed by DoRM. DoRM freezes the source generator and introduces new mapping and affine modules to capture target domain attributes, enabling domain shift in the style space. A similarity-based structure loss is introduced to enhance cross-domain consistency. DoRM outperforms state-of-the-art methods in 10-shot GDA across various domains, demonstrating superior quality, diversity, and cross-domain consistency. DoRM enables efficient multi-domain generation with a single generator, significantly reducing storage requirements compared to methods requiring full generator updates. DoRM excels in hybrid-domain generation, effectively integrating learned domains to synthesize images in unseen hybrid domains, a capability not well-addressed by previous works. The strength of domain shift in DoRM currently requires manual adjustment. The domain association in DoRM can be further improved by incorporating a new M&A module and additional consistency loss. generative adversarial networks, domain adaptation, few-shot learning, image synthesis, domain association
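A toy sketch of the re-modulation idea: the source mapping network stays frozen, and each new M&A module adds a domain-specific, linearly combinable shift in style space; the additive form and the module interfaces are assumptions made for illustration.

```python
import torch

def remodulated_style(source_mapping, domain_modules, z, domain_weights):
    """w = w_source + sum_i alpha_i * delta_w_i: each new mapping/affine module
    contributes one domain shift, and mixing several alpha_i > 0 yields
    hybrid-domain generation (a sketch, not the paper's exact formulation)."""
    with torch.no_grad():                  # the source generator/mapping is frozen
        w = source_mapping(z)
    for alpha, module in zip(domain_weights, domain_modules):
        w = w + alpha * module(z)
    return w
```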
2302.02503 Report Leaving Reality to Imagination: Robust Classification via Generated Datasets Hritik Bansal, Aditya Grover Recent research on robustness has revealed significant performance gaps between neural image classifiers trained on datasets that are similar to the test set, and those that are from a naturally shifted distribution, such as sketches, paintings, and animations of the object categories observed during training. Prior work focuses on reducing this gap by designing engineered augmentations of training data or through unsupervised pretraining of a single large model on massive in-the-wild training datasets scraped from the Internet. However, the notion of a dataset is also undergoing a paradigm shift in recent years. With drastic improvements in the quality, ease-of-use, and access to modern generative models, generated data is pervading the web. In this light, we study the question: How do these generated datasets influence the natural robustness of image classifiers? We find that ImageNet classifiers trained on real data augmented with generated data achieve higher accuracy and effective robustness than standard training and popular augmentation strategies in the presence of natural distribution shifts. We analyze various factors influencing these results, including the choice of conditioning strategies and the amount of generated data. Additionally, we find that the standard ImageNet classifiers suffer a performance degradation of up to 20% on the generated data, indicating their fragility at accurately classifying the objects under novel variations. Lastly, we demonstrate that the image classifiers, which have been trained on real data augmented with generated data from the base generative model, exhibit greater resilience to natural distribution shifts compared to the classifiers trained on real data augmented with generated data from the finetuned generative model on the real data. The code, models, and datasets are available at https://github.com/Hritikbansal/generative-robustness. This paper investigates the impact of augmenting real image datasets with synthetic data generated by modern text-to-image models, specifically Stable Diffusion, on the robustness of image classifiers to natural distribution shifts. Improving the robustness of image classifiers to natural variations is crucial for real-world applications like autonomous driving and medical diagnosis, where models are often deployed in environments different from their training data. The authors generate synthetic datasets conditioned on ImageNet class labels using various Stable Diffusion conditioning strategies (text prompts, real images, and their combination). They train classifiers on real data, generated data, and their mixtures, evaluating their performance on ImageNet-1K and its natural distribution shift variants (ImageNet-Sketch, ImageNet-R, ImageNet-V2, ObjectNet). Classifiers trained on a mixture of real and generated data achieve higher accuracy and effective robustness on natural distribution shift datasets compared to those trained solely on real data or with standard augmentation techniques. Increasing the proportion of generated data in the training mix generally improves effective robustness but might come at the cost of accuracy on the original dataset. Standard ImageNet classifiers show significant performance degradation (up to 20%) on the generated data, highlighting their fragility to novel variations of objects. The study primarily focuses on ImageNet-1K, and the generalizability of the findings to other datasets and domains requires further investigation. The ethical implications of using generated data, particularly concerning bias amplification and privacy, need careful consideration. robustness, generative models, data augmentation, image classification, natural distribution shift
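The data-side recipe is straightforward; here is a hedged PyTorch sketch of sampling training batches from a real/generated mixture (the mixing ratio and the sampler choice are illustrative, not the paper's exact protocol).

```python
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def mixed_real_generated_loader(real_ds, gen_ds, gen_fraction=0.5, batch_size=256):
    """Build a loader over real images plus text-to-image-generated images so that,
    in expectation, a gen_fraction share of each batch is generated data."""
    full = ConcatDataset([real_ds, gen_ds])
    w_real = (1 - gen_fraction) / len(real_ds)
    w_gen = gen_fraction / len(gen_ds)
    weights = [w_real] * len(real_ds) + [w_gen] * len(gen_ds)
    sampler = WeightedRandomSampler(weights, num_samples=len(full), replacement=True)
    return DataLoader(full, batch_size=batch_size, sampler=sampler, num_workers=8)
```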
2302.02412 Report Mixture of Diffusers for scene composition and high resolution image generation Álvaro Barbero Jiménez Diffusion methods have been proven to be very effective to generate images while conditioning on a text prompt. However, and although the quality of the generated images is unprecedented, these methods seem to struggle when trying to generate specific image compositions. In this paper we present Mixture of Diffusers, an algorithm that builds over existing diffusion models to provide a more detailed control over composition. By harmonizing several diffusion processes acting on different regions of a canvas, it allows generating larger images, where the location of each object and style is controlled by a separate diffusion process. Presents Mixture of Diffusers, a method leveraging multiple diffusion models on a single canvas to achieve fine-grained composition control and generate high-resolution images with limited GPU memory. Existing text-conditioned diffusion models struggle to accurately represent complex image compositions and face limitations in generating high-resolution images due to memory constraints. The method combines multiple diffusion models, each operating on a specific canvas region with a unique text prompt and weight. Gaussian weights are employed to ensure smooth transitions between regions. It adapts to latent space models through an approximate pixel-to-latent region mapping and supports image conditioning for outpainting and iterative image generation. Demonstrates superior composition control compared to single diffusion models, accurately placing objects at user-specified locations. Enables high-resolution image generation (up to 4K) on limited memory GPUs by dividing the image into smaller regions. Successfully implements smooth style transitions across the image by varying text prompts for different regions. Current implementation limits diffusion models to rectangular regions. Further exploration of free-form region masking and integration of inpainting techniques. diffusion models, image generation, image composition, high-resolution images, outpainting
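A minimal sketch of the core blending step, assuming rectangular regions and a fixed Gaussian window (the sigma and the normalization are illustrative): each region runs its own diffusion process, and their noise predictions are merged on the shared canvas at every denoising step.

```python
import torch

def gaussian_weight(h, w, device=None):
    """Separable Gaussian window over a rectangular region (sigma is illustrative)."""
    ys = torch.linspace(-1, 1, h, device=device)
    xs = torch.linspace(-1, 1, w, device=device)
    return torch.exp(-ys ** 2 / 0.25)[:, None] * torch.exp(-xs ** 2 / 0.25)[None, :]

def merge_region_predictions(eps_regions, offsets, canvas_shape):
    """Blend per-region noise predictions into one canvas-sized prediction.
    eps_regions: list of (C, h, w) tensors, one diffusion process per region;
    offsets: list of (top, left) positions on the canvas; canvas_shape: (C, H, W)."""
    device = eps_regions[0].device
    eps = torch.zeros(canvas_shape, device=device)
    norm = torch.zeros(canvas_shape[1:], device=device)
    for e, (top, left) in zip(eps_regions, offsets):
        _, h, w = e.shape
        w_map = gaussian_weight(h, w, device=device)
        eps[:, top:top + h, left:left + w] += e * w_map
        norm[top:top + h, left:left + w] += w_map
    return eps / norm.clamp_min(1e-8)
```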
2302.02398 Report Diffusion Model for Generative Image Denoising Yutong Xie, Minne Yuan, Bin Dong, Quanzheng Li In supervised learning for image denoising, usually the paired clean images and noisy images are collected or synthesised to train a denoising model. L2 norm loss or other distance functions are used as the objective function for training. It often leads to an over-smooth result with less image details. In this paper, we regard the denoising task as a problem of estimating the posterior distribution of clean images conditioned on noisy images. We apply the idea of diffusion model to realize generative image denoising. According to the noise model in denoising tasks, we redefine the diffusion process such that it is different from the original one. Hence, the sampling of the posterior distribution is a reverse process of dozens of steps from the noisy image. We consider three types of noise model, Gaussian, Gamma and Poisson noise. With the guarantee of theory, we derive a unified strategy for model training. Our method is verified through experiments on three types of noise models and achieves excellent performance. This paper proposes a novel diffusion model specifically designed for generative image denoising, diverging from traditional supervised methods. Traditional supervised denoising methods, relying on L2 norm loss, often result in over-smoothed images, lacking fine details. This new method aims to estimate the posterior distribution of clean images given noisy images, leading to more realistic and detailed denoising results. The proposed diffusion model defines the diffusion process based on the specific noise model of the image (Gaussian, Gamma, or Poisson), allowing the reverse process to start directly from the noisy image. The training strategy employs a unified approach by minimizing the KL divergence, which is further simplified to minimizing L2 norm loss for all three noise models. The method generates visually pleasing denoised images with finer details compared to traditional supervised learning. Quantitative metrics (PSNR, SSIM) demonstrate comparable performance between the average of generated samples and supervised learning. The method effectively estimates the posterior distribution of clean images, even with fewer diffusion steps. There exists a gap in quantitative metrics between individual generated samples and supervised learning results, indicating potential for improvement. Future work aims to explore the method’s applicability to other noise models and diverse datasets. image denoising, diffusion model, generative model, posterior distribution, noise model
2302.02373 Report ShiftDDPMs: Exploring Conditional Diffusion Models by Shifting Diffusion Trajectories Zijian Zhang, Zhou Zhao, Jun Yu, Qi Tian Diffusion models have recently exhibited remarkable abilities to synthesize striking image samples since the introduction of denoising diffusion probabilistic models (DDPMs). Their key idea is to disrupt images into noise through a fixed forward process and learn its reverse process to generate samples from noise in a denoising way. For conditional DDPMs, most existing practices relate conditions only to the reverse process and fit it to the reversal of unconditional forward process. We find this will limit the condition modeling and generation in a small time window. In this paper, we propose a novel and flexible conditional diffusion model by introducing conditions into the forward process. We utilize extra latent space to allocate an exclusive diffusion trajectory for each condition based on some shifting rules, which will disperse condition modeling to all timesteps and improve the learning capacity of model. We formulate our method, which we call \textbf{ShiftDDPMs}, and provide a unified point of view on existing related methods. Extensive qualitative and quantitative experiments on image synthesis demonstrate the feasibility and effectiveness of ShiftDDPMs. The paper proposes ShiftDDPMs, a novel conditional diffusion model that introduces conditions into the forward process by shifting diffusion trajectories in latent space according to conditions. Existing conditional DDPM methods relate conditions only to the reverse process, limiting condition modeling to a small time window. Shifting diffusion trajectories disperses condition modeling across all timesteps, potentially improving model learning capacity. The method utilizes a shift coefficient schedule and a shift predictor to control the mean shift of diffusion trajectories. Different shift modes, such as Prior-Shift, Data-Normalization, and Quadratic-Shift, are explored with fixed and trainable shift predictors. ShiftDDPMs effectively perform conditional image synthesis, as demonstrated on MNIST and CIFAR-10 datasets. Both Prior-Shift and Quadratic-Shift outperform traditional conditional DDPMs, showing improved learning capacity. ShiftDDPMs successfully interpolate between different conditions and achieve competitive results on image inpainting and text-to-image synthesis. The choice of shift coefficient schedule (k_t) is flexible but lacks extensive empirical investigation. The paper primarily focuses on image synthesis, leaving exploration of other data modalities for future work. diffusion models, conditional image synthesis, generative models, deep learning, shiftddpms
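Only as an illustration of the mechanism (the paper's precise shifting rules and training objective differ by mode and are not reproduced here), a forward-process sample in which the condition displaces the trajectory mean might look like the sketch below; cond_shift and k_schedule are assumed inputs.

```python
import torch

def shifted_q_sample(x0, t, cond_shift, alphas_cumprod, k_schedule):
    """One shifted forward-process sample (illustration only):
        x_t = sqrt(abar_t) * x0 + k_t * E(c) + sqrt(1 - abar_t) * eps
    cond_shift = E(c) is the condition's shift in latent space; k_schedule[t] = k_t
    controls how strongly the trajectory is displaced at timestep t."""
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    k_t = k_schedule[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    return abar.sqrt() * x0 + k_t * cond_shift + (1.0 - abar).sqrt() * eps, eps
```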
2302.02284 Report Design Booster: A Text-Guided Diffusion Model for Image Translation with Spatial Layout Preservation Shiqi Sun, Shancheng Fang, Qian He, Wei Liu Diffusion models are able to generate photorealistic images in arbitrary scenes. However, when applying diffusion models to image translation, there exists a trade-off between maintaining spatial structure and high-quality content. Besides, existing methods are mainly based on test-time optimization or fine-tuning model for each input image, which are extremely time-consuming for practical applications. To address these issues, we propose a new approach for flexible image translation by learning a layout-aware image condition together with a text condition. Specifically, our method co-encodes images and text into a new domain during the training phase. In the inference stage, we can choose images/text or both as the conditions for each time step, which gives users more flexible control over layout and content. Experimental comparisons of our method with state-of-the-art methods demonstrate our model performs best in both style image translation and semantic image translation and took the shortest time. Proposes Design Booster, a novel diffusion-based method for flexible image translation that balances text descriptions with input image layout preservation. Addresses limitations of existing diffusion models for image translation, which often struggle to maintain spatial structure while achieving high-quality content generation. Introduces a jointly trained encoder to extract spatial information from input images and employs a flexible sampling strategy with multi-condition control (text and/or image) at each denoising step. Achieves superior performance in both style and semantic image translation compared to state-of-the-art methods. Preserves spatial layout of input images while enabling text-guided modifications to style and content. Offers fast inference speed, making it suitable for practical applications. Exhibits slightly weaker ability to change color under strong layout-preserving parameters. Future work could explore more complex and adaptive strategies for condition injection during sampling. image translation, diffusion models, layout preservation, text-guided synthesis, multi-condition control
2302.02272 Report Divide and Compose with Score Based Generative Models Sandesh Ghimire, Armand Comas, Davin Hill, Aria Masoomi, Octavia Camps, Jennifer Dy While score based generative models, or diffusion models, have found success in image synthesis, they are often coupled with text data or image label to be able to manipulate and conditionally generate images. Even though manipulation of images by changing the text prompt is possible, our understanding of the text embedding and our ability to modify it to edit images is quite limited. Towards the direction of having more control over image manipulation and conditional generation, we propose to learn image components in an unsupervised manner so that we can compose those components to generate and manipulate images in informed manner. Taking inspiration from energy based models, we interpret different score components as the gradient of different energy functions. We show how score based learning allows us to learn interesting components and we can visualize them through generation. We also show how this novel decomposition allows us to compose, generate and modify images in interesting ways akin to dreaming. We make our code available at https://github.com/sandeshgh/Score-based-disentanglement This paper proposes a novel method for decomposing images into interpretable score components within a score-based generative model framework, enabling controlled image manipulation and generation. Existing conditional score-based generative models lack interpretability and control over image generation, making targeted manipulation challenging. The authors leverage the connection between score functions and energy-based models, decomposing the score function into multiple components representing different energy functions. They train an autoencoder that learns to encode images into latent vectors representing these components, which can be individually manipulated to generate diverse and controlled variations. Score component decomposition allows for reconstruction with natural variations. Visualizing generated samples from individual components reveals their ability to capture distinct image attributes, like shape, color, or texture. Manipulating score components by interpolation with unconditional score functions enables controlled image editing, preserving certain features while varying others. The current method's ability to manipulate images is limited by the number of score components. Future work could explore scaling the approach to a higher number of components and guiding them toward more human-interpretable representations. score-based generative models, diffusion models, image manipulation, disentanglement, energy-based models
2302.02234 Report Revisiting Image Deblurring with an Efficient ConvNet Lingyan Ruan, Mojtaba Bemana, Hans-Peter Seidel, Karol Myszkowski, Bin Chen Image deblurring aims to recover the latent sharp image from its blurry counterpart and has a wide range of applications in computer vision. The Convolution Neural Networks (CNNs) have performed well in this domain for many years, and until recently an alternative network architecture, namely Transformer, has demonstrated even stronger performance. One can attribute its superiority to the multi-head self-attention (MHSA) mechanism, which offers a larger receptive field and better input content adaptability than CNNs. However, as MHSA demands high computational costs that grow quadratically with respect to the input resolution, it becomes impractical for high-resolution image deblurring tasks. In this work, we propose a unified lightweight CNN network that features a large effective receptive field (ERF) and demonstrates comparable or even better performance than Transformers while bearing less computational costs. Our key design is an efficient CNN block dubbed LaKD, equipped with a large kernel depth-wise convolution and spatial-channel mixing structure, attaining comparable or larger ERF than Transformers but with a smaller parameter scale. Specifically, we achieve +0.17dB / +0.43dB PSNR over the state-of-the-art Restormer on defocus / motion deblurring benchmark datasets with 32% fewer parameters and 39% fewer MACs. Extensive experiments demonstrate the superior performance of our network and the effectiveness of each module. Furthermore, we propose a compact and intuitive ERFMeter metric that quantitatively characterizes ERF, and shows a high correlation to the network performance. We hope this work can inspire the research community to further explore the pros and cons of CNN and Transformer architectures beyond image deblurring tasks. This paper proposes LaKDNet, a lightweight CNN for image deblurring that achieves comparable or better performance than Transformer-based methods while being more computationally efficient. Image deblurring is crucial for various computer vision tasks, but Transformer-based methods, while effective, are computationally expensive, especially for high-resolution images. This work explores the potential of efficient CNNs for this task. The authors propose the LaKD block, featuring large kernel depth-wise convolution and spatial-channel mixing to achieve a large effective receptive field (ERF) with low computational cost. They integrate this block into a U-Net architecture. Additionally, they introduce ERFMeter, a metric to quantify ERF and correlate it with network performance. LaKDNet achieves state-of-the-art results on defocus deblurring benchmarks, outperforming Restormer with 32% fewer parameters and 39% fewer MACs. For motion deblurring, LaKDNet shows competitive performance, exceeding Uformer and Restormer on GoPro dataset by up to +0.43dB PSNR while using significantly fewer computational resources. ERFMeter demonstrates a strong correlation (Pearson correlation coefficient r=0.8) with network performance, suggesting its potential for guiding network design. The network's generalization ability from synthetic to real blur is slightly weaker than Transformer-based methods, suggesting room for improvement. The ERFMeter metric primarily focuses on ERF and might not capture the impact of other factors contributing to network performance. image deblurring, convolutional neural networks, effective receptive field, lightweight model, erfmeter
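An illustrative LaKD-style block in PyTorch (kernel size, normalization, and expansion ratio are assumptions; the paper's exact block layout may differ): a depthwise large-kernel convolution supplies the large effective receptive field, and 1x1 convolutions perform channel mixing.

```python
import torch
import torch.nn as nn

class LaKDBlock(nn.Module):
    """Large-kernel depthwise block with spatial and channel mixing (a sketch)."""
    def __init__(self, dim, kernel_size=13, expansion=2):
        super().__init__()
        self.spatial = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim),  # depthwise
            nn.BatchNorm2d(dim),
        )
        self.channel = nn.Sequential(
            nn.Conv2d(dim, dim * expansion, 1), nn.GELU(),
            nn.Conv2d(dim * expansion, dim, 1),
        )

    def forward(self, x):
        x = x + self.spatial(x)   # large-kernel spatial mixing
        x = x + self.channel(x)   # pointwise channel mixing
        return x
```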
2302.02181 Report Model Stitching and Visualization How GAN Generators can Invert Networks in Real-Time Rudolf Herdt, Maximilian Schmidt, Daniel Otero Baguer, Jean Le'Clerc Arrastia, Peter Maass In this work, we propose a fast and accurate method to reconstruct activations of classification and semantic segmentation networks by stitching them with a GAN generator utilizing a 1x1 convolution. We test our approach on images of animals from the AFHQ wild dataset, ImageNet1K, and real-world digital pathology scans of stained tissue samples. Our results show comparable performance to established gradient descent methods but with a processing time that is two orders of magnitude faster, making this approach promising for practical applications. This paper presents a fast and accurate method to reconstruct the activations of deep neural networks used for classification and semantic segmentation. This method works by stitching the feature extractor network with a pretrained GAN generator using a 1x1 convolution. Reconstructing activations of deep networks is important for understanding the internal representations learned by these models and can aid in tasks like image generation and manipulation. Existing methods, such as gradient descent, are accurate but computationally expensive. This work offers a faster alternative with comparable accuracy. The method trains a 1x1 convolution layer to map the activations from a hidden layer of the feature extractor to a hidden layer of a pretrained GAN generator. During inference, this mapping enables the reconstruction of activations by propagating them through the stitched GAN, acting as a decoder. The GAN-based reconstruction method achieves comparable accuracy to gradient descent methods in terms of cosine similarity and L1 loss when evaluated on AFHQ wild, ImageNet1K, and digital pathology datasets. The proposed method is significantly faster than gradient descent, achieving a speedup of two orders of magnitude, making it suitable for real-time applications. The study suggests that the features learned by the GAN generator are compatible with the features learned by the feature extractor, even if they are trained independently. The method's performance depends on the ability of the GAN generator to understand the concepts learned by the feature extractor. If the GAN is not trained on similar data or concepts, the reconstruction might be inaccurate. The use of class-conditional GANs for reconstruction introduces challenges when stitching into deeper layers due to the reliance on class conditioning information. computer vision, deep learning, gan, network inversion, activation reconstruction
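The stitching layer itself is just a 1x1 convolution trained to align the two feature spaces; a hedged sketch follows (tensor shapes, the L1 objective, and the optimizer settings are assumptions).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fit_stitch_conv(layer_acts, gen_feats, epochs=200, lr=1e-3):
    """Fit a 1x1 convolution mapping classifier/segmenter activations (N, C_f, H, W)
    to the hidden features a pretrained GAN generator expects at the stitched
    layer (N, C_g, H, W)."""
    stitch = nn.Conv2d(layer_acts.shape[1], gen_feats.shape[1], kernel_size=1)
    opt = torch.optim.Adam(stitch.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        F.l1_loss(stitch(layer_acts), gen_feats).backward()
        opt.step()
    return stitch

# Inference is a single forward pass instead of per-image gradient descent:
#   reconstruction = generator_tail(stitch(hidden_activations_of_new_image))
```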
2302.02057 Report Semantic Diffusion Network for Semantic Segmentation Haoru Tan, Sitong Wu, Jimin Pi Precise and accurate predictions over boundary areas are essential for semantic segmentation. However, the commonly-used convolutional operators tend to smooth and blur local detail cues, making it difficult for deep models to generate accurate boundary predictions. In this paper, we introduce an operator-level approach to enhance semantic boundary awareness, so as to improve the prediction of the deep semantic segmentation model. Specifically, we first formulate the boundary feature enhancement as an anisotropic diffusion process. We then propose a novel learnable approach called semantic diffusion network (SDN) to approximate the diffusion process, which contains a parameterized semantic difference convolution operator followed by a feature fusion module. Our SDN aims to construct a differentiable mapping from the original feature to the inter-class boundary-enhanced feature. The proposed SDN is an efficient and flexible module that can be easily plugged into existing encoder-decoder segmentation models. Extensive experiments show that our approach can achieve consistent improvements over several typical and state-of-the-art segmentation baseline models on challenging public benchmarks. The code will be released soon. This paper introduces Semantic Diffusion Network (SDN), an operator-level approach to enhance semantic boundary awareness in semantic segmentation models by approximating an anisotropic diffusion process. Existing convolutional operators tend to smooth and blur local details, hindering accurate boundary prediction in semantic segmentation. SDN addresses this limitation by enhancing inter-class boundary features. SDN utilizes a learnable approach comprising a parameterized semantic difference convolution operator and a feature fusion module. The semantic difference convolution leverages semantic guidance features to enhance inter-class boundaries while suppressing intra-class ones. SDN achieves consistent mIoU improvements across various baseline models and datasets (ADE20K and Cityscapes). It significantly improves boundary quality, as demonstrated by higher F-scores in boundary regions. SDN exhibits good compatibility with other boundary-promoting methods, further enhancing segmentation performance. The paper only validates SDN's effectiveness on semantic segmentation; further exploration in other visual tasks is needed. While SDN has positive applications, its potential use in inhumane surveillance needs careful consideration and regulation. semantic segmentation, boundary awareness, deep learning, anisotropic diffusion, convolutional neural networks
2302.01872 Report MOSE: A New Dataset for Video Object Segmentation in Complex Scenes Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Philip H. S. Torr, Song Bai Video object segmentation (VOS) aims at segmenting a particular object throughout the entire video clip sequence. The state-of-the-art VOS methods have achieved excellent performance (e.g., 90+% J&F) on existing datasets. However, since the target objects in these existing datasets are usually relatively salient, dominant, and isolated, VOS under complex scenes has rarely been studied. To revisit VOS and make it more applicable in the real world, we collect a new VOS dataset called coMplex video Object SEgmentation (MOSE) to study the tracking and segmenting objects in complex environments. MOSE contains 2,149 video clips and 5,200 objects from 36 categories, with 431,725 high-quality object segmentation masks. The most notable feature of MOSE dataset is complex scenes with crowded and occluded objects. The target objects in the videos are commonly occluded by others and disappear in some frames. To analyze the proposed MOSE dataset, we benchmark 18 existing VOS methods under 4 different settings on the proposed MOSE dataset and conduct comprehensive comparisons. The experiments show that current VOS algorithms cannot well perceive objects in complex scenes. For example, under the semi-supervised VOS setting, the highest J&F by existing state-of-the-art VOS methods is only 59.4% on MOSE, much lower than their ~90% J&F performance on DAVIS. The results reveal that although excellent performance has been achieved on existing benchmarks, there are unresolved challenges under complex scenes and more efforts are desired to explore these challenges in the future. The proposed MOSE dataset has been released at https://henghuiding.github.io/MOSE. This paper introduces a new large-scale video object segmentation benchmark dataset called MOSE, specifically designed to study object tracking and segmentation in complex environments. Existing video object segmentation datasets often feature salient and isolated objects, while real-world scenarios frequently involve complex and occluded scenes. MOSE aims to bridge this gap and promote the development of more comprehensive and robust video object segmentation algorithms. The authors collected 2,149 high-resolution videos featuring crowded and occluded objects, many of which disappear and reappear throughout the video. They annotated these videos with 430,984 high-quality segmentation masks. The dataset was then used to benchmark 18 existing video object segmentation methods under 4 different settings. Existing video object segmentation algorithms perform significantly worse on MOSE compared to previous benchmark datasets, highlighting the difficulty of complex scenes. The highest \(\mathcal{J}\&\mathcal{F}\) achieved by current state-of-the-art methods under the semi-supervised setting is only 59.4% on MOSE, significantly lower than their ~90% performance on datasets like DAVIS. Heavy occlusions, crowds, small object size, and object disappearance/reappearance pose significant challenges to existing methods. MOSE currently focuses on object categories common in existing image segmentation datasets, potentially limiting its generalizability. Future work could explore incorporating more diverse object categories and even more challenging scenarios, such as extreme lighting conditions or fast camera movements. video object segmentation, dataset, complex scenes, occlusion, benchmark
2302.01721 Report TEXTure: Text-Guided Texturing of 3D Shapes Elad Richardson, Gal Metzer, Yuval Alaluf, Raja Giryes, Daniel Cohen-Or In this paper, we present TEXTure, a novel method for text-guided generation, editing, and transfer of textures for 3D shapes. Leveraging a pretrained depth-to-image diffusion model, TEXTure applies an iterative scheme that paints a 3D model from different viewpoints. Yet, while depth-to-image models can create plausible textures from a single viewpoint, the stochastic nature of the generation process can cause many inconsistencies when texturing an entire 3D object. To tackle these problems, we dynamically define a trimap partitioning of the rendered image into three progression states, and present a novel elaborated diffusion sampling process that uses this trimap representation to generate seamless textures from different views. We then show that one can transfer the generated texture maps to new 3D geometries without requiring explicit surface-to-surface mapping, as well as extract semantic textures from a set of images without requiring any explicit reconstruction. Finally, we show that TEXTure can be used to not only generate new textures but also edit and refine existing textures using either a text prompt or user-provided scribbles. We demonstrate that our TEXTuring method excels at generating, transferring, and editing textures through extensive evaluation, and further close the gap between 2D image generation and 3D texturing. TEXTure, a novel method for text-guided generation, editing, and transfer of textures for 3D shapes, leveraging pretrained depth-to-image diffusion models. Addresses the limitations of previous 3D texturing methods by providing a fast and efficient approach for generating high-quality and consistent textures on 3D models. An iterative painting scheme that renders the object from different viewpoints, applies a depth-based painting using a modified diffusion model, and projects the result back to the mesh vertices or atlas. Generates high-quality, realistic textures consistent on both local and global scales. Enables texture transfer from both painted meshes and sets of images to new, untextured meshes. Supports texture editing through both text prompts and user-provided scribbles. Potential inconsistencies on a global scale due to occlusions. Dependence on fixed viewpoints for painting, which may not be optimal for all geometries. text-guided synthesis, 3d texturing, diffusion models, texture transfer, texture editing
2302.01579 Report Semantic 3D-aware Portrait Synthesis and Manipulation Based on Compositional Neural Radiance Field Tianxiang Ma, Bingchuan Li, Qian He, Jing Dong, Tieniu Tan Recently 3D-aware GAN methods with neural radiance field have developed rapidly. However, current methods model the whole image as an overall neural radiance field, which limits the partial semantic editability of synthetic results. Since NeRF renders an image pixel by pixel, it is possible to split NeRF in the spatial dimension. We propose a Compositional Neural Radiance Field (CNeRF) for semantic 3D-aware portrait synthesis and manipulation. CNeRF divides the image by semantic regions and learns an independent neural radiance field for each region, and finally fuses them and renders the complete image. Thus we can manipulate the synthesized semantic regions independently, while fixing the other parts unchanged. Furthermore, CNeRF is also designed to decouple shape and texture within each semantic region. Compared to state-of-the-art 3D-aware GAN methods, our approach enables fine-grained semantic region manipulation, while maintaining high-quality 3D-consistent synthesis. The ablation studies show the effectiveness of the structure and loss function used by our method. In addition real image inversion and cartoon portrait 3D editing experiments demonstrate the application potential of our method. This paper introduces CNeRF, the first compositional neural radiance field for semantic 3D-aware portrait synthesis and manipulation. Current 3D-aware GAN methods with neural radiance fields lack semantic editability as they model the entire image as a single unit. CNeRF addresses this limitation. CNeRF divides the image into semantic regions and learns independent neural radiance fields for each region. It then fuses these fields to render the complete image. This method allows for individual manipulation of semantic regions using latent codes. CNeRF achieves high-quality 3D-consistent portrait synthesis comparable to state-of-the-art methods. The proposed method allows for fine-grained semantic region manipulation in generated portraits. CNeRF successfully decouples shape and texture within each semantic region, allowing for independent control over each attribute. Further improvements in 3D reconstruction quality are possible. Future work will explore combining CNeRF with advanced 3D-aware GANs like EG3D. generative adversarial networks, neural radiance fields, 3d-aware image synthesis, semantic manipulation, compositional rendering
2302.01532 Report INV: Towards Streaming Incremental Neural Videos Shengze Wang, Alexey Supikov, Joshua Ratcliff, Henry Fuchs, Ronald Azuma Recent works in spatiotemporal radiance fields can produce photorealistic free-viewpoint videos. However, they are inherently unsuitable for interactive streaming scenarios (e.g. video conferencing, telepresence) because they have an inevitable lag even if the training is instantaneous. This is because these approaches consume videos and thus have to buffer chunks of frames (often seconds) before processing. In this work, we take a step towards interactive streaming via a frame-by-frame approach naturally free of lag. Conventional wisdom believes that per-frame NeRFs are impractical due to prohibitive training costs and storage. We break this belief by introducing Incremental Neural Videos (INV), a per-frame NeRF that is efficiently trained and streamable. We designed INV based on two insights: (1) Our main finding is that MLPs naturally partition themselves into Structure and Color Layers, which store structural and color/texture information respectively. (2) We leverage this property to retain and improve upon knowledge from previous frames, thus amortizing training across frames and reducing redundant learning. As a result, with negligible changes to NeRF, INV can achieve good qualities (>28.6dB) in 8min/frame. It can also outperform prior SOTA in 19% less training time. Additionally, our Temporal Weight Compression reduces the per-frame size to 0.3MB/frame (6.6% of NeRF). More importantly, INV is free from buffer lag and is naturally fit for streaming. While this work does not achieve real-time training, it shows that incremental approaches like INV present new possibilities in interactive 3D streaming. Moreover, our discovery of natural information partition leads to a better understanding and manipulation of MLPs. Code and dataset will be released soon. This paper introduces Incremental Neural Videos (INV), a per-frame neural radiance field representation for efficient streaming of dynamic 3D scenes. Existing spatiotemporal radiance fields suffer from buffer lag, making them unsuitable for interactive streaming applications like telepresence. The authors leverage the discovery that MLPs naturally partition into Structure and Color Layers, enabling them to design INV which stores per-frame structure and a shared color representation. INV achieves state-of-the-art per-frame quality with less training than previous methods. Temporal Weight Compression reduces the per-frame size to a streamable 0.3MB. The paper provides evidence for the natural partitioning of information within MLPs, leading to a better understanding of these models. Visual stability is limited, especially for short training times. Future work includes achieving real-time training and handling large scene changes. neural radiance fields, 3d video streaming, incremental learning, mlp, temporal weight compression
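A conceptual sketch of per-frame training under the structure/color split (the data interface, the layer naming, and the choice to keep the color layers fixed every frame are assumptions made for illustration, not the paper's exact recipe).

```python
import copy
import torch
import torch.nn.functional as F

def train_next_frame(prev_nerf, frame_data, structure_prefixes, steps=1000, lr=5e-4):
    """Start from the previous frame's NeRF, update only the 'structure' layers on
    the new frame, and keep the 'color' layers shared across frames."""
    nerf = copy.deepcopy(prev_nerf)
    for name, p in nerf.named_parameters():
        p.requires_grad = any(name.startswith(prefix) for prefix in structure_prefixes)
    opt = torch.optim.Adam([p for p in nerf.parameters() if p.requires_grad], lr=lr)
    for _ in range(steps):
        rays, target_rgb = frame_data.sample_rays()   # assumed per-frame ray sampler
        loss = F.mse_loss(nerf(rays), target_rgb)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return nerf
```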
2302.01384 Report Energy-Inspired Self-Supervised Pretraining for Vision Models Ze Wang, Jiang Wang, Zicheng Liu, Qiang Qiu Motivated by the fact that forward and backward passes of a deep network naturally form symmetric mappings between input and output representations, we introduce a simple yet effective self-supervised vision model pretraining framework inspired by energy-based models (EBMs). In the proposed framework, we model energy estimation and data restoration as the forward and backward passes of a single network without any auxiliary components, e.g., an extra decoder. For the forward pass, we fit a network to an energy function that assigns low energy scores to samples that belong to an unlabeled dataset, and high energy otherwise. For the backward pass, we restore data from corrupted versions iteratively using gradient-based optimization along the direction of energy minimization. In this way, we naturally fold the encoder-decoder architecture widely used in masked image modeling into the forward and backward passes of a single vision model. Thus, our framework now accepts a wide range of pretext tasks with different data corruption methods, and permits models to be pretrained from masked image modeling, patch sorting, and image restoration, including super-resolution, denoising, and colorization. We support our findings with extensive experiments, and show the proposed method delivers comparable and even better performance with remarkably fewer epochs of training compared to the state-of-the-art self-supervised vision model pretraining methods. Our findings shed light on further exploring self-supervised vision model pretraining and pretext tasks beyond masked image modeling. This paper proposes a simple yet effective self-supervised vision model pretraining framework inspired by energy-based models (EBMs), where energy estimation and data restoration are modeled as the forward and backward passes of a single network. This approach eliminates the need for auxiliary components like decoders, heavy data augmentations, or modifications to the network structure, simplifying self-supervised vision model pretraining. The forward pass trains a network to fit an energy function, assigning low energy scores to in-distribution samples and high energy to others. The backward pass uses gradient-based optimization to restore data from corrupted versions, moving towards energy minimization. This effectively folds the encoder-decoder architecture into a single vision model. The proposed method achieves comparable or even better performance with remarkably fewer epochs of training compared to state-of-the-art self-supervised vision model pretraining methods. The framework's flexibility allows for a broader range of pretext tasks beyond masked image modeling, including patch sorting and image restoration (e.g., super-resolution, denoising, and colorization). The approach demonstrates good generalization across various network architectures, including ViT, ResNet, ConvNeXt, and Swin-Transformer. While achieving strong finetuning results, the method doesn't directly yield strongly linearly-separable features, resulting in lower linear probing accuracy compared to contrastive learning methods. Future work will focus on further exploring pretext tasks for self-supervised vision model pretraining and improving linear separability of the learned features. self-supervised learning, vision model pretraining, energy-based models, masked image modeling, image restoration
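The restoration direction is simply gradient descent on the model's own energy; a hedged sketch of the backward pass (the corruption type, step count, and step size are illustrative).

```python
import torch

def restore_by_energy_descent(energy_net, x_corrupted, steps=8, step_size=0.1):
    """Iteratively move a corrupted input along the direction of energy minimization,
    using the same network that scores energies in the forward pass."""
    x = x_corrupted.clone().requires_grad_(True)
    for _ in range(steps):
        energy = energy_net(x).sum()              # low energy <=> in-distribution
        grad, = torch.autograd.grad(energy, x)
        x = (x - step_size * grad).detach().requires_grad_(True)
    return x.detach()
```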
2302.01329 Report Dreamix: Video Diffusion Models are General Video Editors Eyal Molad, Eliahu Horwitz, Dani Valevski, Alex Rav Acha, Yossi Matias, Yael Pritch, Yaniv Leviathan, Yedid Hoshen Text-driven image and video diffusion models have recently achieved unprecedented generation realism. While diffusion models have been successfully applied for image editing, very few works have done so for video editing. We present the first diffusion-based method that is able to perform text-based motion and appearance editing of general videos. Our approach uses a video diffusion model to combine, at inference time, the low-resolution spatio-temporal information from the original video with new, high resolution information that it synthesized to align with the guiding text prompt. As obtaining high-fidelity to the original video requires retaining some of its high-resolution information, we add a preliminary stage of finetuning the model on the original video, significantly boosting fidelity. We propose to improve motion editability by a new, mixed objective that jointly finetunes with full temporal attention and with temporal attention masking. We further introduce a new framework for image animation. We first transform the image into a coarse video by simple image processing operations such as replication and perspective geometric projections, and then use our general video editor to animate it. As a further application, we can use our method for subject-driven video generation. Extensive qualitative and numerical experiments showcase the remarkable editing ability of our method and establish its superior performance compared to baseline methods. Dreamix, a novel method for text-based video editing, enabling motion and appearance edits in real-world videos using a text-conditioned video diffusion model. Existing text-based editing methods are primarily image-centric and struggle to maintain temporal consistency in videos. Dreamix leverages the power of video diffusion models to achieve high-quality video editing with strong fidelity to the original content. Dreamix finetunes a pre-trained cascaded video diffusion model on the input video with a mixed objective, combining full temporal attention and masked temporal attention. At inference, it corrupts the video, then uses the finetuned model to guide the generation towards the text prompt while preserving original video details. Dreamix enables unprecedented video editing capabilities, including modifying motion, appearance, adding objects, and changing backgrounds, all while maintaining temporal consistency. The proposed mixed finetuning significantly improves motion editing and background change scenarios compared to baselines. Dreamix enables new applications, such as text-guided image animation by converting images to coarse videos and applying video editing, and subject-driven video generation by finetuning on a collection of subject images. Hyperparameter selection, such as noise strength, is currently manual and could be automated for improved user experience. Automatic evaluation metrics for text-guided video editing are lacking and would benefit from further research to better align with human preference. video editing, diffusion models, text-guided generation, image animation, subject-driven video generation
2302.01327 Report Dual PatchNorm Manoj Kumar, Mostafa Dehghani, Neil Houlsby We propose Dual PatchNorm: two Layer Normalization layers (LayerNorms), before and after the patch embedding layer in Vision Transformers. We demonstrate that Dual PatchNorm outperforms the result of exhaustive search for alternative LayerNorm placement strategies in the Transformer block itself. In our experiments, incorporating this trivial modification, often leads to improved accuracy over well-tuned Vision Transformers and never hurts. This paper introduces Dual PatchNorm (DPN), a simple modification to Vision Transformers (ViTs) that involves adding two Layer Normalization layers before and after the patch embedding layer. The authors aim to explore LayerNorm placement strategies beyond the standard pre-LN approach in ViTs and demonstrate that DPN consistently improves performance across various vision tasks. The authors conduct extensive experiments on image classification (ImageNet-1k, ImageNet-21k, JFT), contrastive learning, semantic segmentation (ADE20K), and transfer learning (VTAB). They compare DPN with various other LayerNorm placement strategies, including exhaustive search within Transformer blocks. DPN consistently improves accuracy over well-tuned vanilla ViT baselines on image classification tasks, achieving an average gain of 1.4% on ImageNet-1k. DPN also shows benefits in contrastive learning and semantic segmentation, leading to improved zero-shot ImageNet accuracy and mIoU on ADE20K, respectively. Analysis of gradient norms suggests that DPN helps stabilize training by reducing the gradient norm of the embedding layer. While DPN consistently shows improvements, there are a few cases where it performs on par or slightly worse than the baseline. Future work can explore the theoretical underpinnings of DPN's effectiveness and investigate its applicability to other ViT variants. vision transformers, layer normalization, dual patchnorm, image classification, contrastive learning
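A minimal sketch of the Dual PatchNorm idea summarized above: one LayerNorm on the flattened raw patches before the embedding projection and one on the resulting patch embeddings. The patch size and embedding dimension are illustrative defaults, not the paper's tuned settings.

```python
import torch
import torch.nn as nn

class DualPatchNormEmbedding(nn.Module):
    def __init__(self, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        patch_dim = in_chans * patch_size * patch_size
        self.patch_size = patch_size
        self.norm_pre = nn.LayerNorm(patch_dim)    # LN on flattened raw patches
        self.proj = nn.Linear(patch_dim, embed_dim)
        self.norm_post = nn.LayerNorm(embed_dim)   # LN on patch embeddings

    def forward(self, x):  # x: (B, C, H, W)
        p = self.patch_size
        B, C, H, W = x.shape
        # Flatten non-overlapping patches to (B, num_patches, C*p*p)
        patches = x.unfold(2, p, p).unfold(3, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        return self.norm_post(self.proj(self.norm_pre(patches)))

emb = DualPatchNormEmbedding()
tokens = emb(torch.rand(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```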
2302.01162 Report Get3DHuman: Lifting StyleGAN-Human into a 3D Generative Model using Pixel-aligned Reconstruction Priors Zhangyang Xiong, Di Kang, Derong Jin, Weikai Chen, Linchao Bao, Shuguang Cui, Xiaoguang Han Fast generation of high-quality 3D digital humans is important to a vast number of applications ranging from entertainment to professional concerns. Recent advances in differentiable rendering have enabled the training of 3D generative models without requiring 3D ground truths. However, the quality of the generated 3D humans still has much room to improve in terms of both fidelity and diversity. In this paper, we present Get3DHuman, a novel 3D human framework that can significantly boost the realism and diversity of the generated outcomes by only using a limited budget of 3D ground-truth data. Our key observation is that the 3D generator can profit from human-related priors learned through 2D human generators and 3D reconstructors. Specifically, we bridge the latent space of Get3DHuman with that of StyleGAN-Human via a specially-designed prior network, where the input latent code is mapped to the shape and texture feature volumes spanned by the pixel-aligned 3D reconstructor. The outcomes of the prior network are then leveraged as the supervisory signals for the main generator network. To ensure effective training, we further propose three tailored losses applied to the generated feature volumes and the intermediate feature maps. Extensive experiments demonstrate that Get3DHuman greatly outperforms the other state-of-the-art approaches and can support a wide range of applications including shape interpolation, shape re-texturing, and single-view reconstruction through latent inversion. Presents Get3DHuman, a 3D human generation framework that leverages priors from 2D human generators and 3D reconstructors to synthesize high-fidelity clothed 3D humans with diverse shapes and textures. Generating diverse and realistic 3D humans is crucial for various applications, but current 3D generative models struggle with limited 3D data. This work overcomes these limitations by leveraging priors from well-established 2D and 3D domains. The framework employs a prior network (StyleGAN-Human + PIFuHD) to extract normal maps, depth maps, and shape/texture feature volumes as supervisory signals. A two-branch 3D generator (shape and texture) is trained with tailored losses, including latent prior loss and adversarial loss on feature volumes, to ensure high-quality and diverse results. A refinement module enhances the final textured mesh. Significantly outperforms state-of-the-art methods (EG3D, SDF-StyleGAN, GET3D) in generating high-fidelity clothed 3D humans, as evidenced by quantitative metrics (COV, MMD, FPD, FID, FID3D) and visual comparisons. Demonstrates strong capability in various applications, including shape interpolation, shape re-texturing, and single-view reconstruction through latent inversion. Successfully incorporates inductive bias from 2D and 3D priors, resulting in a more effective and efficient 3D human generation process. Currently limited to generating models in standing poses due to the constraints of the StyleGAN-Human prior. Reliance on manual filtering of training data might introduce bias. 3d human generation, generative adversarial networks, prior learning, differentiable rendering, shape and texture synthesis
2302.01133 Report SceneScape: Text-Driven Consistent Scene Generation Rafail Fridman, Amit Abecasis, Yoni Kasten, Tali Dekel We present a method for text-driven perpetual view generation -- synthesizing long-term videos of various scenes solely, given an input text prompt describing the scene and camera poses. We introduce a novel framework that generates such videos in an online fashion by combining the generative power of a pre-trained text-to-image model with the geometric priors learned by a pre-trained monocular depth prediction model. To tackle the pivotal challenge of achieving 3D consistency, i.e., synthesizing videos that depict geometrically-plausible scenes, we deploy an online test-time training to encourage the predicted depth map of the current frame to be geometrically consistent with the synthesized scene. The depth maps are used to construct a unified mesh representation of the scene, which is progressively constructed along the video generation process. In contrast to previous works, which are applicable only to limited domains, our method generates diverse scenes, such as walkthroughs in spaceships, caves, or ice castles. This paper introduces SceneScape, the first text-driven perpetual view generation method that synthesizes long-term videos of diverse scenes solely from text prompts and camera poses. This addresses limitations of previous methods restricted to specific domains and requiring large-scale training by leveraging the power of pre-trained text-to-image and depth prediction models for zero-shot scene generation. SceneScape combines pre-trained models with a unified 3D mesh representation, progressively constructing the scene while ensuring 3D consistency through test-time fine-tuning of depth prediction and image inpainting. SceneScape generates high-quality, diverse scenes with significant parallax and complex structures from text prompts. The method demonstrates superior 3D consistency compared to baselines like VideoFusion and GEN-1, evidenced by quantitative metrics and user studies. Test-time fine-tuning of depth prediction and image decoding proves crucial for achieving geometric plausibility and visual quality. The reliance on pre-trained models can introduce biases present in their training data. Representing scenes with triangular mesh limits the ability to depict dramatic depth discontinuities found in outdoor environments. text-driven generation, perpetual view generation, 3d scene synthesis, test-time optimization, zero-shot learning
2302.01056 Report Beyond Pretrained Features: Noisy Image Modeling Provides Adversarial Defense Zunzhi You, Daochang Liu, Bohyung Han, Chang Xu Recent advancements in masked image modeling (MIM) have made it a prevailing framework for self-supervised visual representation learning. The MIM pretrained models, like most deep neural network methods, remain vulnerable to adversarial attacks, limiting their practical application, and this issue has received little research attention. In this paper, we investigate how this powerful self-supervised learning paradigm can provide adversarial robustness to downstream classifiers. During the exploration, we find that noisy image modeling (NIM), a simple variant of MIM that adopts denoising as the pre-text task, reconstructs noisy images surprisingly well despite severe corruption. Motivated by this observation, we propose an adversarial defense method, referred to as De^3, by exploiting the pretrained decoder for denoising. Through De^3, NIM is able to enhance adversarial robustness beyond providing pretrained features. Furthermore, we incorporate a simple modification, sampling the noise scale hyperparameter from random distributions, and enable the defense to achieve a better and tunable trade-off between accuracy and robustness. Experimental results demonstrate that, in terms of adversarial robustness, NIM is superior to MIM thanks to its effective denoising capability. Moreover, the defense provided by NIM achieves performance on par with adversarial training while offering the extra tunability advantage. Source code and models are available at https://github.com/youzunzhi/NIM-AdvDef. This paper investigates Noisy Image Modeling (NIM), a variant of Masked Image Modeling (MIM) using denoising as a pretext task, and proposes De^3, a defense method leveraging NIM's denoising capability to enhance adversarial robustness. MIM models, while effective for representation learning, lack adversarial robustness, limiting their applicability in safety-critical tasks. This paper explores NIM's potential for enhancing robustness beyond pretrained features. The authors train NIM models, observe their strong denoising capability, and propose De^3. This method adds noise to adversarial examples during testing and uses the pretrained NIM decoder to denoise them, mitigating adversarial perturbations. They also propose randomizing the noise level during NIM pretraining for a tunable accuracy-robustness trade-off. NIM-pretrained classifiers, even without defense, exhibit better robustness than MIM counterparts. NIM with De^3 significantly improves robustness against various attacks, outperforming undefended MIM while offering a tunable accuracy-robustness trade-off. NIM's denoising capability is shown to be superior to MIM's reconstruction ability, contributing to enhanced robustness. The paper primarily focuses on demonstrating NIM's advantage over MIM for robustness, not achieving state-of-the-art defense. Exploration of alternative degradation methods beyond Gaussian noise in NIM is left for future work. adversarial robustness, self-supervised learning, masked image modeling, denoising, vision transformers
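A minimal sketch, under stated assumptions, of the De^3 test-time defense described above: add Gaussian noise to a possibly adversarial input, denoise it with the pretrained NIM encoder-decoder, then classify. The `nim_encoder`, `nim_decoder`, and `classifier` callables are placeholders for the pretrained components, and the noise scale is illustrative.

```python
import torch

def de3_defense(x_adv, nim_encoder, nim_decoder, classifier, sigma=0.25):
    """Noise-then-denoise defense applied at test time."""
    noise = sigma * torch.randn_like(x_adv)
    x_noisy = x_adv + noise                          # drown the adversarial perturbation
    x_denoised = nim_decoder(nim_encoder(x_noisy))   # restore a clean estimate
    return classifier(x_denoised)

# Example with identity stand-ins for the pretrained modules:
identity = lambda t: t
logits = de3_defense(torch.rand(1, 3, 224, 224), identity, identity,
                     lambda t: t.flatten(1)[:, :10])
print(logits.shape)  # torch.Size([1, 10])
```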
2302.00908 Report GANalyzer: Analysis and Manipulation of GANs Latent Space for Controllable Face Synthesis Ali Pourramezan Fard, Mohammad H. Mahoor, Sarah Ariel Lamer, Timothy Sweeny Generative Adversarial Networks (GANs) are capable of synthesizing high-quality facial images. Despite their success, GANs do not provide any information about the relationship between the input vectors and the generated images. Currently, facial GANs are trained on imbalanced datasets, which generate less diverse images. For example, more than 77% of 100K images that we randomly synthesized using the StyleGAN3 are classified as Happy, and only around 3% are Angry. The problem even becomes worse when a mixture of facial attributes is desired: less than 1% of the generated samples are Angry Woman, and only around 2% are Happy Black. To address these problems, this paper proposes a framework, called GANalyzer, for the analysis, and manipulation of the latent space of well-trained GANs. GANalyzer consists of a set of transformation functions designed to manipulate latent vectors for a specific facial attribute such as facial Expression, Age, Gender, and Race. We analyze facial attribute entanglement in the latent space of GANs and apply the proposed transformation for editing the disentangled facial attributes. Our experimental results demonstrate the strength of GANalyzer in editing facial attributes and generating any desired faces. We also create and release a balanced photo-realistic human face dataset. Our code is publicly available on GitHub. Proposes GANalyzer, a framework to analyze and manipulate the latent space of pre-trained GANs for controllable face synthesis, enabling both facial attribute editing (preserving identity) and feature-based synthesis (specifying attributes). Addresses the lack of control over facial attributes in GAN-generated images and the issue of imbalanced datasets leading to less diverse outputs. Analyzes facial attributes of synthesized images and their latent vectors using pre-trained classifiers. Defines a transformation function based on Eigenvectors of the Covariance matrix of latent vectors belonging to specific attributes. This function allows for manipulation of latent vectors to control facial features in generated images. Successfully edits single and multiple facial attributes like age, gender, race, and expression while preserving identity. Enables feature-based synthesis to generate faces with specific attributes, addressing dataset imbalance. Provides control over the intensity of the desired facial attribute in both editing and synthesis. Reliance on the performance of pre-trained classifiers for accurate attribute labeling. Potential limitations due to entanglement of facial attributes in the training data of the original GAN. generative adversarial networks, face synthesis, latent space manipulation, facial attribute editing, feature-based synthesis
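A minimal sketch of the eigenvector-based latent manipulation described above: collect latent codes that a classifier assigns to one attribute, take the principal eigenvector of their covariance, and shift a latent along that direction to control the attribute's intensity. The random latents, dimensionality, and strength are stand-ins for codes collected from a pretrained GAN and an attribute classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
latents_with_attribute = rng.normal(size=(1000, 512))   # e.g. latents classified as "Angry"

cov = np.cov(latents_with_attribute, rowvar=False)      # (512, 512) covariance
eigvals, eigvecs = np.linalg.eigh(cov)                  # ascending eigenvalues
direction = eigvecs[:, -1]                              # top eigenvector

def edit_latent(w, strength=3.0):
    """Push latent w along the attribute direction; strength controls intensity."""
    return w + strength * direction

w_new = edit_latent(rng.normal(size=512))
print(w_new.shape)  # (512,)
```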
2302.00833 Report RobustNeRF: Ignoring Distractors with Robust Losses Sara Sabour, Suhani Vora, Daniel Duckworth, Ivan Krasin, David J. Fleet, Andrea Tagliasacchi Neural radiance fields (NeRF) excel at synthesizing new views given multi-view, calibrated images of a static scene. When scenes include distractors, which are not persistent during image capture (moving objects, lighting variations, shadows), artifacts appear as view-dependent effects or 'floaters'. To cope with distractors, we advocate a form of robust estimation for NeRF training, modeling distractors in training data as outliers of an optimization problem. Our method successfully removes outliers from a scene and improves upon our baselines, on synthetic and real-world scenes. Our technique is simple to incorporate in modern NeRF frameworks, with few hyper-parameters. It does not assume a priori knowledge of the types of distractors, and is instead focused on the optimization problem rather than pre-processing or modeling transient objects. More results on our page https://robustnerf.github.io/public. This paper presents RobustNeRF, a novel method to address the issue of distractors (transient objects or effects) in neural radiance fields (NeRF) by treating them as outliers during optimization. Distractors are common in real-world scenes and can severely degrade the quality of NeRF reconstructions. Existing methods for handling distractors have limitations, such as requiring pre-trained segmentation models or complex loss balancing. RobustNeRF utilizes a trimmed least squares loss function combined with iterative re-weighted least squares (IRLS). It leverages spatial smoothness assumptions to distinguish distractors from high-frequency details, effectively ignoring distractors during training. RobustNeRF outperforms baselines like MipNeRF360 and DDNeRF in terms of reconstruction quality on both synthetic and real-world datasets. The method is robust to varying clutter levels and requires minimal hyperparameter tuning. Qualitative and quantitative evaluations demonstrate the efficacy of RobustNeRF in ignoring distractors and producing high-quality NeRF reconstructions. On clean datasets, RobustNeRF may exhibit slightly lower reconstruction quality and longer training times compared to methods like MipNeRF360 due to inherent statistical inefficiency. Future work will focus on handling very small distractors, learning neural weight functions for improved accuracy, and incorporating the robust loss into other NeRF frameworks. nerf, robust estimation, outlier rejection, 3d reconstruction, computer vision
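A minimal sketch of a trimmed robust photometric loss in the spirit of the method above: per-ray residuals above a quantile threshold are treated as distractor outliers and dropped from the objective. The trim fraction is an illustrative choice, and the actual method additionally smooths the inlier mask spatially and uses an IRLS-style schedule.

```python
import torch

def trimmed_mse(pred_rgb, target_rgb, trim_fraction=0.2):
    """Mean squared color error over rays, ignoring the highest-residual fraction."""
    residuals = (pred_rgb - target_rgb).pow(2).mean(dim=-1)        # (num_rays,)
    threshold = torch.quantile(residuals, 1.0 - trim_fraction)
    inlier_mask = (residuals <= threshold).float().detach()        # drop suspected distractors
    return (residuals * inlier_mask).sum() / inlier_mask.sum().clamp(min=1.0)

loss = trimmed_mse(torch.rand(4096, 3), torch.rand(4096, 3))
print(loss.item())
```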
2302.00190 Report Neural Wavelet-domain Diffusion for 3D Shape Generation, Inversion, and Manipulation Jingyu Hu, Ka-Hei Hui, Zhengzhe Liu, Ruihui Li, Chi-Wing Fu This paper presents a new approach for 3D shape generation, inversion, and manipulation, through a direct generative modeling on a continuous implicit representation in wavelet domain. Specifically, we propose a compact wavelet representation with a pair of coarse and detail coefficient volumes to implicitly represent 3D shapes via truncated signed distance functions and multi-scale biorthogonal wavelets. Then, we design a pair of neural networks: a diffusion-based generator to produce diverse shapes in the form of the coarse coefficient volumes and a detail predictor to produce compatible detail coefficient volumes for introducing fine structures and details. Further, we may jointly train an encoder network to learn a latent space for inverting shapes, allowing us to enable a rich variety of whole-shape and region-aware shape manipulations. Both quantitative and qualitative experimental results manifest the compelling shape generation, inversion, and manipulation capabilities of our approach over the state-of-the-art methods. This paper introduces a novel approach for 3D shape generation, inversion, and manipulation using a compact wavelet representation of implicit functions in the frequency domain. Existing 3D shape generation methods struggle to produce diverse and realistic shapes with fine details. This work addresses these limitations by proposing a compact wavelet representation and a diffusion-based generative model operating in the frequency domain. The method leverages biorthogonal wavelets to decompose the truncated signed distance field (TSDF) of a 3D shape into coarse and detail coefficient volumes. It then employs a diffusion-based generator to synthesize coarse coefficient volumes and a detail predictor to generate compatible details. An encoder network is jointly trained for shape inversion and manipulation. The method generates diverse and realistic 3D shapes exhibiting complex structures, fine details, and clean surfaces. It faithfully inverts unseen shapes into latent codes, enabling high-quality shape reconstruction and interpolation. It supports various region-aware manipulations, including part replacement, part-wise interpolation, and part-wise re-generation. The generated shapes, while visually plausible, may not always meet desired functionalities. The method requires a large number of shapes for training, limiting its effectiveness for categories with few training samples. 3d shape generation, shape manipulation, diffusion model, wavelet representation, implicit function
2301.13823 Report Grounding Language Models to Images for Multimodal Inputs and Outputs Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried We propose an efficient method to ground pretrained text-only language models to the visual domain, enabling them to process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images. Our method leverages the abilities of language models learnt from large scale text-only pretraining, such as in-context learning and free-form text generation. We keep the language model frozen, and finetune input and output linear layers to enable cross-modality interactions. This allows our model to process arbitrarily interleaved image-and-text inputs, and generate free-form text interleaved with retrieved images. We achieve strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue, and showcase compelling interactive abilities. Our approach works with any off-the-shelf language model and paves the way towards an effective, general solution for leveraging pretrained language models in visually grounded settings. This paper introduces FROMAGE, an efficient method to ground pre-trained text-only language models (LLMs) to the visual domain, allowing them to process and generate interleaved image-and-text data. This is important because it enables LLMs to leverage visual cues, improving their performance on visually grounded tasks like multimodal dialogue and contextual image retrieval, while retaining their existing text generation abilities. FROMAGE leverages a frozen LLM and a frozen visual encoder, training only linear mapping layers for image-to-text and text-to-image interactions, along with a new [RET] token for image retrieval. This allows for efficient training with a multi-task objective of image captioning and image-text retrieval. FROMAGE demonstrates strong zero-shot performance on contextual image retrieval, outperforming CLIP, particularly when provided with long and complex descriptions or multimodal context. The model exhibits competitive results on zero-shot Visual Dialogue, surpassing prior work in text-to-image retrieval within dialogue. FROMAGE showcases in-context learning abilities by generating coherent and relevant multimodal stories and responses in interactive settings. The model's reliance on image retrieval from a fixed set limits its ability to generate novel images or handle prompts unlikely to be found in natural images. While the introduced [RET] token enables image interleaving, further research is needed to encourage its natural generation during inference. vision-and-language, large language models, multimodal dialogue, contextual image retrieval, frozen model adaptation
2301.13721 Report DisDiff: Unsupervised Disentanglement of Diffusion Probabilistic Models Tao Yang, Yuwang Wang, Yan Lv, Nanning Zheng Targeting to understand the underlying explainable factors behind observations and modeling the conditional generation process on these factors, we connect disentangled representation learning to Diffusion Probabilistic Models (DPMs) to take advantage of the remarkable modeling ability of DPMs. We propose a new task, disentanglement of (DPMs): given a pre-trained DPM, without any annotations of the factors, the task is to automatically discover the inherent factors behind the observations and disentangle the gradient fields of DPM into sub-gradient fields, each conditioned on the representation of each discovered factor. With disentangled DPMs, those inherent factors can be automatically discovered, explicitly represented, and clearly injected into the diffusion process via the sub-gradient fields. To tackle this task, we devise an unsupervised approach named DisDiff, achieving disentangled representation learning in the framework of DPMs. Extensive experiments on synthetic and real-world datasets demonstrate the effectiveness of DisDiff. This paper introduces a novel task: disentanglement of diffusion probabilistic models (DPMs). The goal is to automatically discover and explicitly represent inherent factors of variation within pre-trained DPMs, without any factor annotations. This is achieved by disentangling the DPM's gradient fields into sub-gradient fields, each conditioned on the representation of a discovered factor. Disentangling DPMs offers two main advantages: 1) It enables unsupervised control over image generation by uncovering inherent semantic factors, extending the possibilities for DPM conditioning beyond supervised methods. 2) DPMs, with their strong image generation quality and natural affinity for inversion, offer a more suitable framework for disentangled representation learning compared to VAEs or GANs. The authors propose DisDiff, an unsupervised approach that learns disentangled representations for each factor and their corresponding disentangled conditional sub-gradient fields. It utilizes an encoder to learn factor representations and a decoder to learn the sub-gradient fields. A novel Disentangling Loss function encourages the learned representations to satisfy disentanglement requirements while still allowing for accurate input image reconstruction. DisDiff significantly outperforms existing VAE-based and GAN-based disentanglement methods on benchmark datasets like Shapes3D, MPI3D, and Cars3D, as measured by FactorVAE score and DCI. Qualitative results demonstrate DisDiff's ability to effectively disentangle factors and enable image editing by swapping factor representations. DisDiff allows for partial condition sampling, generating images conditioned on a specific subset of factors, both in controlled and real-world datasets like CelebA. The unsupervised nature of DisDiff might lead to learned disentangled representations on natural image sets that are not easily interpretable by humans, requiring further exploration of methods like CLIP for guidance. As a diffusion-based method, DisDiff's generation speed is slower compared to VAE-based and GAN-based methods, a common limitation for DPM-based approaches. disentangled representation learning, diffusion probabilistic models, unsupervised learning, image generation, image editing
2301.13622 Report Learning Data Representations with Joint Diffusion Models Kamil Deja, Tomasz Trzcinski, Jakub M. Tomczak Joint machine learning models that allow synthesizing and classifying data often offer uneven performance between those tasks or are unstable to train. In this work, we depart from a set of empirical observations that indicate the usefulness of internal representations built by contemporary deep diffusion-based generative models not only for generating but also predicting. We then propose to extend the vanilla diffusion model with a classifier that allows for stable joint end-to-end training with shared parameterization between those objectives. The resulting joint diffusion model outperforms recent state-of-the-art hybrid methods in terms of both classification and generation quality on all evaluated benchmarks. On top of our joint training approach, we present how we can directly benefit from shared generative and discriminative representations by introducing a method for visual counterfactual explanations. This paper introduces a joint diffusion model, combining a diffusion model and a classifier through shared parameterization, to enhance both data generation and classification tasks. Joint models that synthesize and classify data often suffer from uneven performance or training instability. This work leverages the representational power of diffusion models to improve both aspects within a single model. The authors analyze the usefulness of internal representations learned by diffusion models for prediction tasks. They propose a joint training approach where a classifier shares the encoder part of the diffusion model's UNet architecture, leading to shared representations for both generative and discriminative objectives. Additionally, they introduce a conditional sampling algorithm that optimizes internal diffusion representations using the classifier. The joint diffusion model outperforms stand-alone classifiers and previous joint models in classification accuracy across multiple datasets. It demonstrates superior generative capabilities compared to vanilla diffusion models and other hybrid methods, as evidenced by improved FID scores. The model effectively generates visual counterfactual explanations by identifying minimal changes in input images required to alter the classifier's decision. The conditional sampling method, while effective, relies on a step size parameter that requires tuning for optimal precision and diversity of generated samples. Further exploration of more sophisticated domain adaptation techniques could enhance the model's performance in domain transfer scenarios. deep generative models, diffusion models, joint models, conditional sampling, counterfactual explanations
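A minimal sketch of the joint objective described above: a shared encoder feeds both a noise-prediction decoder (diffusion term) and a classifier head (cross-entropy term). The tiny modules and the single-step noising are placeholders for the paper's UNet and full diffusion schedule.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder modules: shared encoder, noise-prediction decoder, classifier head.
encoder = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.SiLU())
decoder = nn.Conv2d(32, 3, 3, padding=1)                 # predicts the added noise
classifier = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10))

x, labels = torch.rand(8, 3, 32, 32), torch.randint(0, 10, (8,))
noise = torch.randn_like(x)
alpha = 0.7                                              # placeholder noise level
x_noisy = alpha ** 0.5 * x + (1 - alpha) ** 0.5 * noise  # simplified one-step forward diffusion

feats = encoder(x_noisy)                                 # shared representation
loss = F.mse_loss(decoder(feats), noise) + F.cross_entropy(classifier(feats), labels)
loss.backward()
print(loss.item())
```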
2301.13188 Report Extracting Training Data from Diffusion Models Nicholas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramèr, Borja Balle, Daphne Ippolito, Eric Wallace Image diffusion models such as DALL-E 2, Imagen, and Stable Diffusion have attracted significant attention due to their ability to generate high-quality synthetic images. In this work, we show that diffusion models memorize individual images from their training data and emit them at generation time. With a generate-and-filter pipeline, we extract over a thousand training examples from state-of-the-art models, ranging from photographs of individual people to trademarked company logos. We also train hundreds of diffusion models in various settings to analyze how different modeling and data decisions affect privacy. Overall, our results show that diffusion models are much less private than prior generative models such as GANs, and that mitigating these vulnerabilities may require new advances in privacy-preserving training. This paper demonstrates that state-of-the-art diffusion models memorize and regenerate individual training examples, posing privacy risks. This is important because it challenges the assumption that diffusion models generate novel images and raises concerns about data privacy, copyright infringement, and the potential for misuse with sensitive data. The authors devise a two-stage data extraction attack: (1) generate numerous images from pre-trained diffusion models (Stable Diffusion and Imagen) using diverse prompts and (2) identify memorized training examples by detecting near-identical generations. They also train hundreds of diffusion models on CIFAR-10 to analyze the factors influencing memorization. The authors extract over a thousand training examples from Stable Diffusion and Imagen, including personally identifiable information and copyrighted material. Diffusion models are found to be less private than GANs, with stronger diffusion models exhibiting higher vulnerability to memorization. Existing defenses like data deduplication and differentially-private training provide limited protection or cause training instability. The definition of memorization based on pixel-level similarity might overlook more nuanced forms of data copying. Differentially-private training for diffusion models requires further investigation to address training instability. diffusion models, memorization, privacy, data extraction, generative models
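A minimal sketch of the second, filtering stage of the attack described above: flag pairs of generations that are near-identical under a normalized Euclidean distance, a proxy for memorized training images. The arrays and threshold are illustrative stand-ins for real generations and the paper's exact similarity measure.

```python
import numpy as np

def find_near_duplicates(images, threshold=0.1):
    """Return (i, j, distance) for pairs of images that are nearly identical."""
    flat = images.reshape(len(images), -1).astype(np.float64)
    pairs = []
    for i in range(len(flat)):
        for j in range(i + 1, len(flat)):
            dist = np.linalg.norm(flat[i] - flat[j]) / np.sqrt(flat.shape[1])
            if dist < threshold:
                pairs.append((i, j, dist))
    return pairs

generations = np.random.rand(50, 64, 64, 3)
generations[1] = generations[0] + 0.01 * np.random.rand(64, 64, 3)  # planted near-duplicate
print(find_near_duplicates(generations))  # expect only the (0, 1, ...) pair
```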
2301.13173 Report Shape-aware Text-driven Layered Video Editing Yao-Chih Lee, Ji-Ze Genevieve Jang, Yi-Ting Chen, Elizabeth Qiu, Jia-Bin Huang Temporal consistency is essential for video editing applications. Existing work on layered representation of videos allows propagating edits consistently to each frame. These methods, however, can only edit object appearance rather than object shape changes due to the limitation of using a fixed UV mapping field for texture atlas. We present a shape-aware, text-driven video editing method to tackle this challenge. To handle shape changes in video editing, we first propagate the deformation field between the input and edited keyframe to all frames. We then leverage a pre-trained text-conditioned diffusion model as guidance for refining shape distortion and completing unseen regions. The experimental results demonstrate that our method can achieve shape-aware consistent video editing and compare favorably with the state-of-the-art. This paper introduces a novel shape-aware, text-driven video editing method that enables changes to both object appearance and shape in a video, ensuring temporal consistency. Existing video editing methods based on layered representations are limited to appearance editing due to fixed UV mapping. This method addresses the challenge of achieving consistent shape changes in videos, expanding the possibilities for creative video editing. The method leverages a pre-trained NLA model to decompose the video into layers, then uses a text-to-image diffusion model to edit a keyframe. By estimating semantic correspondence between the input and edited keyframes, the method generates per-frame deformation fields. Finally, a pre-trained diffusion model guides the optimization of atlas texture and deformation, completing unseen regions and refining shape details. The method successfully achieves shape-aware consistent video editing, as demonstrated through visual comparisons with baseline methods. Ablation studies confirm the importance of both UV deformation and atlas optimization for achieving high-quality results. The proposed approach allows for shape interpolation, expanding creative possibilities for video editing. The method relies on the accuracy of NLA mapping, which may fail in complex motion scenarios leading to artifacts. Inaccurate semantic correspondence initialization between different objects can hinder the optimization process, suggesting potential for user-guided improvements. video editing, shape editing, text-driven editing, neural layered atlas, diffusion models
2301.13156 Report SeaFormer: Squeeze-enhanced Axial Transformer for Mobile Semantic Segmentation Qiang Wan, Zilong Huang, Jiachen Lu, Gang Yu, Li Zhang Since the introduction of Vision Transformers, the landscape of many computer vision tasks (e.g., semantic segmentation), which has been overwhelmingly dominated by CNNs, recently has significantly revolutionized. However, the computational cost and memory requirement render these methods unsuitable on the mobile device, especially for the high-resolution per-pixel semantic segmentation task. In this paper, we introduce a new method squeeze-enhanced Axial TransFormer (SeaFormer) for mobile semantic segmentation. Specifically, we design a generic attention block characterized by the formulation of squeeze Axial and detail enhancement. It can be further used to create a family of backbone architectures with superior cost-effectiveness. Coupled with a light segmentation head, we achieve the best trade-off between segmentation accuracy and latency on the ARM-based mobile devices on the ADE20K and Cityscapes datasets. Critically, we beat both the mobile-friendly rivals and Transformer-based counterparts with better performance and lower latency without bells and whistles. Beyond semantic segmentation, we further apply the proposed SeaFormer architecture to image classification problem, demonstrating the potentials of serving as a versatile mobile-friendly backbone. This paper introduces SeaFormer, a mobile-friendly Transformer-based model for semantic segmentation, featuring a squeeze-enhanced Axial attention block for efficient global context modeling. Vision Transformers are computationally expensive and memory-intensive, making them unsuitable for mobile semantic segmentation, especially with high-resolution images. SeaFormer uses a squeeze-enhanced Axial attention mechanism, squeezing feature maps for efficient global context aggregation and enhancing local details with a convolution kernel. SeaFormer outperforms mobile-friendly networks (e.g., MobileNetV3) and Transformer-based models (e.g., TopFormer) on ADE20K and Cityscapes datasets. SeaFormer-Base achieves +7.9% mIoU improvement over MobileNetV3 with lower latency on an ARM-based mobile device. Ablation studies demonstrate the effectiveness of each component in SeaFormer, especially the squeeze-enhanced Axial attention block. The system's performance might be limited due to the lack of exhaustive evaluation and testing in real-world deployments. Future work includes extending the mobile-friendly approach to more downstream tasks and exploring its potential on GPU systems. semantic segmentation, vision transformer, mobile-friendly, axial attention, edge computing
2301.12959 Report GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis Ming Tao, Bing-Kun Bao, Hao Tang, Changsheng Xu Synthesizing high-fidelity complex images from text is challenging. Based on large pretraining, the autoregressive and diffusion models can synthesize photo-realistic images. Although these large models have shown notable progress, there remain three flaws. 1) These models require tremendous training data and parameters to achieve good performance. 2) The multi-step generation design slows the image synthesis process heavily. 3) The synthesized visual features are difficult to control and require delicately designed prompts. To enable high-quality, efficient, fast, and controllable text-to-image synthesis, we propose Generative Adversarial CLIPs, namely GALIP. GALIP leverages the powerful pretrained CLIP model both in the discriminator and generator. Specifically, we propose a CLIP-based discriminator. The complex scene understanding ability of CLIP enables the discriminator to accurately assess the image quality. Furthermore, we propose a CLIP-empowered generator that induces the visual concepts from CLIP through bridge features and prompts. The CLIP-integrated generator and discriminator boost training efficiency, and as a result, our model only requires about 3% training data and 6% learnable parameters, achieving comparable results to large pretrained autoregressive and diffusion models. Moreover, our model achieves 120 times faster synthesis speed and inherits the smooth latent space from GAN. The extensive experimental results demonstrate the excellent performance of our GALIP. Code is available at https://github.com/tobran/GALIP. This paper introduces GALIP, a novel text-to-image generation framework that integrates the pretrained CLIP model in both the discriminator and generator, enabling high-quality, efficient, fast, and controllable text-to-image synthesis. Existing large pretrained autoregressive and diffusion models, while impressive, require tremendous training data and parameters, have slow multi-step generation, and lack intuitive control over visual features. GALIP addresses these limitations. GALIP leverages a CLIP-based discriminator with a frozen CLIP image encoder and a learnable mate-discriminator to accurately assess image quality. It also uses a CLIP-empowered generator with a frozen CLIP encoder and a learnable mate-generator to induce visual concepts from CLIP via bridge features and prompts. GALIP achieves comparable synthesis quality to large pretrained models with significantly smaller trainable parameters and training data. It enables ~120x faster synthesis speed compared to diffusion models like LDM. GALIP inherits the smooth latent space from GANs, allowing for more controllable synthesis. The CLIP text encoder in GALIP might be improved by using more advanced large language models like T5. Increasing the model size and pretraining dataset size could further enhance the synthesis ability, particularly for imaginary images. text-to-image synthesis, generative adversarial networks (gans), clip, image generation, deep learning
2301.12914 Report PromptMix: Text-to-image diffusion models enhance the performance of lightweight networks Arian Bakhtiarnia, Qi Zhang, Alexandros Iosifidis Many deep learning tasks require annotations that are too time consuming for human operators, resulting in small dataset sizes. This is especially true for dense regression problems such as crowd counting which requires the location of every person in the image to be annotated. Techniques such as data augmentation and synthetic data generation based on simulations can help in such cases. In this paper, we introduce PromptMix, a method for artificially boosting the size of existing datasets, that can be used to improve the performance of lightweight networks. First, synthetic images are generated in an end-to-end data-driven manner, where text prompts are extracted from existing datasets via an image captioning deep network, and subsequently introduced to text-to-image diffusion models. The generated images are then annotated using one or more high-performing deep networks, and mixed with the real dataset for training the lightweight network. By extensive experiments on five datasets and two tasks, we show that PromptMix can significantly increase the performance of lightweight networks by up to 26%. Introduces PromptMix, a method to improve the performance of lightweight deep neural networks by augmenting training datasets with synthetic data generated using text-to-image diffusion models. Lightweight networks are crucial for deploying deep learning models on resource-constrained devices, but they often suffer from reduced accuracy compared to their heavyweight counterparts. PromptMix addresses this issue by generating additional training data, which is particularly beneficial for tasks where data collection and annotation are costly. 1. **Prompt Generation:** Extract text descriptions (prompts) from existing datasets using image captioning or manually define them. 2. **Prompt Modification:** Enhance prompts with prefixes/suffixes to guide image generation. 3. **Image Generation:** Generate synthetic images using a text-to-image diffusion model (Stable Diffusion) based on the modified prompts. 4. **Image Filtering:** Filter out low-quality synthetic images based on the agreement between multiple heavyweight models' annotations. 5. **Data Mixing:** Combine a subset of the filtered synthetic data with the real dataset during training. PromptMix consistently enhances the performance of lightweight networks across different tasks (crowd counting, depth estimation), datasets, and architectures. ResCSRNet, an ultra-lightweight architecture introduced in the paper, achieves comparable or even superior results to heavyweight models when trained with PromptMix. The paper provides insights into PromptMix's hyperparameters, demonstrating that a wide range of settings leads to improvements over baseline training. PromptMix involves several hyperparameters that need to be tuned, although the ablation study shows its robustness to different configurations. Generating high-quality synthetic images with faces in crowds remains a challenge due to limitations in current text-to-image diffusion models. lightweight deep learning, data augmentation, text-to-image synthesis, diffusion models, crowd counting, monocular depth estimation
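A minimal sketch of the filtering step described above: a synthetic image is kept only if two heavyweight annotators agree closely on its pseudo-label (a crowd count in this example), and their average becomes the label used for mixing. The `model_a`/`model_b` callables and the tolerance are hypothetical placeholders.

```python
import numpy as np

def filter_synthetic(images, model_a, model_b, rel_tol=0.2):
    """Keep synthetic images whose two pseudo-annotations agree within rel_tol."""
    kept = []
    for img in images:
        count_a, count_b = model_a(img), model_b(img)
        denom = max(count_a, count_b, 1e-6)
        if abs(count_a - count_b) / denom <= rel_tol:      # annotators agree
            kept.append((img, 0.5 * (count_a + count_b)))  # averaged pseudo-label
    return kept

rng = np.random.default_rng(0)
images = [rng.random((256, 256, 3)) for _ in range(5)]
noisy_counter = lambda img: float(img.mean() * 100 + rng.normal(scale=2.0))
print(len(filter_synthetic(images, noisy_counter, noisy_counter)))
```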
2301.12686 Report GibbsDDRM: A Partially Collapsed Gibbs Sampler for Solving Blind Inverse Problems with Denoising Diffusion Restoration Naoki Murata, Koichi Saito, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon Pre-trained diffusion models have been successfully used as priors in a variety of linear inverse problems, where the goal is to reconstruct a signal from noisy linear measurements. However, existing approaches require knowledge of the linear operator. In this paper, we propose GibbsDDRM, an extension of Denoising Diffusion Restoration Models (DDRM) to a blind setting in which the linear measurement operator is unknown. GibbsDDRM constructs a joint distribution of the data, measurements, and linear operator by using a pre-trained diffusion model for the data prior, and it solves the problem by posterior sampling with an efficient variant of a Gibbs sampler. The proposed method is problem-agnostic, meaning that a pre-trained diffusion model can be applied to various inverse problems without fine-tuning. In experiments, it achieved high performance on both blind image deblurring and vocal dereverberation tasks, despite the use of simple generic priors for the underlying linear operators. This paper introduces GibbsDDRM, a novel method for blind linear inverse problems that utilizes a pre-trained diffusion model as a data prior and a partially collapsed Gibbs sampler for efficient posterior sampling. Many real-world inverse problems are blind, meaning the measurement process is unknown, requiring estimation of both the original signal and the linear operator parameters, which poses a significant challenge. GibbsDDRM constructs a joint distribution of data, measurements, and linear operator parameters. It then uses a partially collapsed Gibbs sampler, alternately sampling data/latent variables and linear operator parameters, leveraging the diffusion model's representational power for accurate estimation. GibbsDDRM achieves high performance on blind image deblurring, surpassing competing methods in perceptual quality (LPIPS) despite using a simple prior for the blur kernel. In vocal dereverberation, GibbsDDRM demonstrates superior performance in terms of signal quality (SI-SDR), perceptual quality (FAD), and reverberation removal (SRMR). The method's efficacy is demonstrated even with large measurement noise and simple priors for the linear operator, highlighting its robustness and generalizability. GibbsDDRM's reliance on SVD computations might limit its applicability to problems involving large-scale linear operators. Future research could explore extending GibbsDDRM to handle non-linear inverse problems. diffusion models, inverse problems, gibbs sampling, blind deblurring, dereverberation
2301.12597 Report BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions. BLIP-2, a new vision-language pre-training method that bootstraps from frozen pre-trained image encoders and large language models (LLMs), achieving state-of-the-art performance on various vision-language tasks while being computationally efficient. Vision-language pre-training (VLP) is becoming computationally expensive, and BLIP-2 leverages readily available unimodal models to improve efficiency and performance. BLIP-2 uses a lightweight Querying Transformer (Q-Former) to bridge the modality gap between frozen image encoders and LLMs. It employs a two-stage pre-training strategy: (1) vision-language representation learning with a frozen image encoder and (2) vision-to-language generative learning with a frozen LLM. BLIP-2 achieves state-of-the-art performance on zero-shot VQAv2, outperforming Flamingo80B by 8.7% with 54x fewer trainable parameters. It demonstrates strong generalization ability to out-of-domain images, achieving impressive results on image captioning tasks. BLIP-2 enables instructed zero-shot image-to-text generation, demonstrating capabilities like visual knowledge reasoning, visual conversation, etc. BLIP-2 does not currently benefit from in-context learning with LLMs due to the limitation of the pre-training dataset. Generated image-to-text outputs may be inaccurate due to limitations of the LLM's knowledge or reasoning abilities. vision-language pre-training, image captioning, visual question answering, image-text retrieval, large language models
2301.12429 Report Debiased Fine-Tuning for Vision-language Models by Prompt Regularization Beier Zhu, Yulei Niu, Saeil Lee, Minhoe Hur, Hanwang Zhang We present a new paradigm for fine-tuning large-scale vision-language pre-trained models on downstream task, dubbed Prompt Regularization (ProReg). Different from traditional fine-tuning which easily overfits to the downstream task data, ProReg uses the prediction by prompting the pretrained model to regularize the fine-tuning. The motivation is: by prompting the large model "a photo of a [CLASS]", the fill-in answer is only dependent on the pretraining encyclopedic knowledge while independent of the task data distribution, which is usually biased. Specifically, given a training sample prediction during fine-tuning, we first calculate its Kullback-Leibler loss of the prompt prediction and Cross-Entropy loss of the ground-truth label, and then combine them with a proposed sample-wise adaptive trade-off weight, which automatically adjusts the transfer between the pretrained and downstream domains. On various out-of-distribution benchmarks, we show the consistently strong performance of ProReg compared with conventional fine-tuning, zero-shot prompt, prompt tuning, and other state-of-the-art methods. This paper introduces Prompt Regularization (ProReg), a novel fine-tuning paradigm for large-scale vision-language pre-trained models that leverages prompt-based predictions as regularization to mitigate overfitting to downstream task data and improve out-of-distribution generalization. Traditional fine-tuning methods often overfit to biased downstream data, while zero-shot prompt methods struggle with domain-specific generalization. ProReg addresses these limitations by effectively transferring knowledge from both pre-trained and downstream domains. ProReg combines a cross-entropy loss from ground-truth labels with a Kullback-Leibler loss between fine-tuned and prompt-based predictions. It introduces a sample-wise adaptive weight to dynamically balance the contribution of task-specific and pre-trained knowledge during training. ProReg consistently outperforms zero-shot prompt, conventional fine-tuning, and prompt tuning across various out-of-distribution benchmarks for image classification and visual question answering. ProReg effectively mitigates biases from both pre-trained and downstream domains, achieving compelling performance in both out-of-distribution and in-distribution settings. Ablation studies demonstrate the effectiveness of the sample-wise adaptive weight and the limitations of traditional knowledge distillation and model ensemble approaches. The performance of ProReg may be sensitive to the choice of prompt template. The computational cost of ProReg is slightly higher than conventional fine-tuning due to the additional prompt prediction. prompt learning, fine-tuning, out-of-distribution generalization, vision-language models, knowledge distillation
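A minimal sketch of the ProReg objective described above: cross-entropy on the ground-truth label combined with a KL term toward the frozen prompt prediction, weighted per sample. The agreement-based weight used here is a simple illustrative heuristic and may differ from the paper's exact trade-off formulation.

```python
import torch
import torch.nn.functional as F

def proreg_loss(logits_finetuned, logits_prompt, labels):
    """Combine ground-truth CE with KL toward the zero-shot prompt prediction."""
    log_p = F.log_softmax(logits_finetuned, dim=-1)
    q = F.softmax(logits_prompt, dim=-1).detach()           # frozen prompt prediction
    ce = F.cross_entropy(logits_finetuned, labels, reduction="none")
    kl = F.kl_div(log_p, q, reduction="none").sum(-1)
    # Sample-wise trade-off: trust the prompt more when it is confident on the label.
    w = q.gather(1, labels.unsqueeze(1)).squeeze(1)
    return ((1 - w) * ce + w * kl).mean()

logits_ft = torch.randn(16, 100, requires_grad=True)
logits_zs = torch.randn(16, 100)
labels = torch.randint(0, 100, (16,))
proreg_loss(logits_ft, logits_zs, labels).backward()
```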
2301.12276 Report ProtoSeg: Interpretable Semantic Segmentation with Prototypical Parts Mikołaj Sacha, Dawid Rymarczyk, Łukasz Struski, Jacek Tabor, Bartosz Zieliński We introduce ProtoSeg, a novel model for interpretable semantic image segmentation, which constructs its predictions using similar patches from the training set. To achieve accuracy comparable to baseline methods, we adapt the mechanism of prototypical parts and introduce a diversity loss function that increases the variety of prototypes within each class. We show that ProtoSeg discovers semantic concepts, in contrast to standard segmentation models. Experiments conducted on Pascal VOC and Cityscapes datasets confirm the precision and transparency of the presented method. Presents ProtoSeg, a model for interpretable semantic segmentation using prototypical object parts. Most current semantic segmentation methods lack interpretability of their predictions, even though interpretability is important for applications like autonomous vehicles and medical image analysis. The method uses a DeepLabv2 backbone with a novel Prototype Diversity Loss to ensure that prototypes learned for different object parts are semantically distinct. The model is evaluated on Cityscapes and PASCAL VOC 2012 datasets. The model achieves interpretable segmentation by identifying prototypical parts of objects. The Prototype Diversity Loss successfully encourages diversity in learned prototypes. While the method provides interpretability, it achieves lower mIOU compared to the baseline DeepLabv2 model. The model's precision is currently lower than state-of-the-art non-interpretable methods. Future work includes improving precision, exploring different backbones (e.g., U-Net), and applying the method to other segmentation tasks. semantic segmentation, interpretability, prototypical parts, deep learning, computer vision
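A minimal sketch of a prototype diversity loss in the spirit of the method above: within each class, pairs of prototype vectors are penalized for high cosine similarity so that they specialize to different object parts. The shapes and the exact penalty form are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def prototype_diversity_loss(prototypes):
    """prototypes: (num_classes, protos_per_class, dim)"""
    p = F.normalize(prototypes, dim=-1)
    sim = torch.matmul(p, p.transpose(1, 2))                # (C, K, K) cosine similarities
    k = p.shape[1]
    off_diag = sim - torch.eye(k, device=p.device)          # zero out self-similarity
    # Penalize only positive similarity between distinct prototypes of the same class.
    return off_diag.clamp(min=0).sum() / (p.shape[0] * k * (k - 1))

protos = torch.randn(21, 10, 64, requires_grad=True)        # e.g. 21 PASCAL VOC classes
prototype_diversity_loss(protos).backward()
```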
2301.12257 Report Few-shot Face Image Translation via GAN Prior Distillation Ruoyu Zhao, Mingrui Zhu, Xiaoyu Wang, Nannan Wang Face image translation has made notable progress in recent years. However, when training on limited data, the performance of existing approaches significantly declines. Although some studies have attempted to tackle this problem, they either failed to achieve the few-shot setting (less than 10) or can only get suboptimal results. In this paper, we propose GAN Prior Distillation (GPD) to enable effective few-shot face image translation. GPD contains two models: a teacher network with GAN Prior and a student network that fulfills end-to-end translation. Specifically, we adapt the teacher network trained on large-scale data in the source domain to the target domain with only a few samples, where it can learn the target domain's knowledge. Then, we can achieve few-shot augmentation by generating source domain and target domain images simultaneously with the same latent codes. We propose an anchor-based knowledge distillation module that can fully use the difference between the training and the augmented data to distill the knowledge of the teacher network into the student network. The trained student network achieves excellent generalization performance with the absorption of additional knowledge. Qualitative and quantitative experiments demonstrate that our method achieves superior results than state-of-the-art approaches in a few-shot setting. This paper proposes GAN Prior Distillation (GPD), a novel framework for few-shot face image translation that leverages knowledge distillation from a teacher network pre-trained on large-scale datasets. Existing face image translation methods struggle with limited training data, especially in few-shot settings (less than 10 image pairs). GPD addresses this challenge by efficiently transferring knowledge from a pre-trained GAN to a smaller, faster translation network. GPD employs two main modules: (1) a few-shot generative augmentation module that adapts a pre-trained GAN to generate augmented image pairs for the target domain, and (2) an anchor-based knowledge distillation module that leverages the differences in realism between training data and augmented data to effectively distill knowledge into the student network. GPD significantly outperforms existing few-shot image translation methods in terms of visual quality and evaluation metrics, particularly in capturing complex styles. The few-shot generative augmentation module effectively expands limited training data, leading to improved structural integrity and detail in translated images. The anchor-based knowledge distillation module effectively mitigates overfitting and promotes generalization by strategically leveraging both training and augmented data. GPD's current reliance on StyleGAN, predominantly trained on face datasets, limits its applicability to other domains. Future work will focus on extending GPD to broader image domains by exploring novel few-shot image generation models for diverse data types. face image translation, few-shot learning, generative adversarial networks, knowledge distillation, data augmentation
2301.12247 Report SEGA: Instructing Text-to-Image Models using Semantic Guidance Manuel Brack, Felix Friedrich, Dominik Hintersdorf, Lukas Struppek, Patrick Schramowski, Kristian Kersting Text-to-image diffusion models have recently received a lot of interest for their astonishing ability to produce high-fidelity images from text only. However, achieving one-shot generation that aligns with the user's intent is nearly impossible, yet small changes to the input prompt often result in very different images. This leaves the user with little semantic control. To put the user in control, we show how to interact with the diffusion process to flexibly steer it along semantic directions. This semantic guidance (SEGA) generalizes to any generative architecture using classifier-free guidance. More importantly, it allows for subtle and extensive edits, changes in composition and style, as well as optimizing the overall artistic conception. We demonstrate SEGA's effectiveness on both latent and pixel-based diffusion models such as Stable Diffusion, Paella, and DeepFloyd-IF using a variety of tasks, thus providing strong evidence for its versatility, flexibility, and improvements over existing methods. This paper introduces Semantic Guidance (SEGA), a novel method to exert fine-grained semantic control over image generation in diffusion models. Current text-to-image diffusion models lack granular control; small prompt tweaks yield drastically different images. SEGA addresses this by enabling subtle and extensive edits, compositional and stylistic changes, and artistic optimization. SEGA leverages classifier-free guidance, manipulating the noise estimates of diffusion models based on user-defined textual prompts. It identifies semantic directions within the noise-estimate space and guides the generation along these vectors. SEGA robustly incorporates concepts into images, demonstrated by successfully adding 'glasses' to diverse portraits. Guidance vectors are unique and transferable, allowing a single calculated vector to be applied across multiple images. The strength of semantic guidance scales monotonically with the magnitude of the desired effect, offering intuitive control over the generation. Transferring guidance vectors across vastly different image compositions requires separate calculations. The paper acknowledges potential biases inherited from the underlying diffusion model's training data. diffusion models, text-to-image generation, semantic control, image editing, generative ai
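A minimal sketch of guidance composition in the spirit of SEGA: start from classifier-free guidance and add terms that push the noise estimate along (or away from) edit-concept directions. The tensors stand in for a diffusion UNet's noise predictions, and the actual method additionally applies element-wise gating and warm-up that this sketch omits.

```python
import torch

def guided_noise(eps_uncond, eps_text, concept_eps, guidance_scale=7.5):
    """Compose classifier-free guidance with additional semantic edit directions."""
    eps = eps_uncond + guidance_scale * (eps_text - eps_uncond)
    for eps_concept, strength, sign in concept_eps:          # sign=+1 adds, -1 removes a concept
        eps = eps + sign * strength * (eps_concept - eps_uncond)
    return eps

shape = (1, 4, 64, 64)                                       # latent-space noise shape
eps_u, eps_t, eps_c = torch.randn(shape), torch.randn(shape), torch.randn(shape)
eps = guided_noise(eps_u, eps_t, [(eps_c, 5.0, +1)])
print(eps.shape)  # torch.Size([1, 4, 64, 64])
```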
2301.12141 Report What Decreases Editing Capability? Domain-Specific Hybrid Refinement for Improved GAN Inversion Pu Cao, Lu Yang, Dongxv Liu, Xiaoya Yang, Tianrui Huang, Qing Song Recently, inversion methods have focused on additional high-rate information in the generator (e.g., weights or intermediate features) to refine inversion and editing results from embedded latent codes. Although these techniques gain reasonable improvement in reconstruction, they decrease editing capability, especially on complex images (e.g., containing occlusions, detailed backgrounds, and artifacts). The vital crux is to refine inversion results while avoiding degradation of editing capability. To tackle this problem, we introduce Domain-Specific Hybrid Refinement (DHR), which draws on the respective strengths and weaknesses of two mainstream refinement techniques to maintain editing ability while improving fidelity. Specifically, we first propose Domain-Specific Segmentation to segment images into two parts: in-domain and out-of-domain parts. The refinement process aims to maintain editability for in-domain areas and improve the fidelity of both domains. We refine these two parts by weight modulation and feature modulation, which we call Hybrid Modulation Refinement. Our proposed method is compatible with all latent code embedding methods. Extensive experiments demonstrate that our approach achieves state-of-the-art results in real image inversion and editing. Code is available at https://github.com/caopulan/Domain-Specific_Hybrid_Refinement_Inversion. This paper introduces Domain-Specific Hybrid Refinement (DHR), a novel GAN inversion method that refines image inversion and editing by leveraging a hybrid approach of weight and feature modulation, addressing the issue of editing capability degradation in existing refinement techniques. Existing refinement methods for GAN inversion, though they improve reconstruction fidelity, often sacrifice editing capability, especially for images with complex features. This paper addresses this by proposing DHR, a method that selectively refines different image domains to maintain a good balance between fidelity and editability. The proposed DHR method consists of two components: Domain-Specific Segmentation (DSS) and Hybrid Modulation Refinement (HMR). DSS automatically segments images into easy-to-invert 'in-domain' parts and challenging 'out-of-domain' parts without requiring data annotation. HMR then applies weight modulation to the 'in-domain' areas for better editing capability and feature modulation to 'out-of-domain' areas for accurate detail reconstruction. DHR achieves state-of-the-art performance in quantitative metrics, surpassing existing methods in MSE, LPIPS, and identity similarity. Qualitative results demonstrate DHR's ability to preserve image details during both inversion and editing, leading to more faithful and photorealistic results. User studies confirm the superiority of DHR, with users showing a strong preference for DHR results over existing methods in terms of both inversion quality and editing realism. The method currently focuses on the face domain, and future work could explore its generalization to other image domains. The runtime of DHR, though significantly faster than some baselines, can be further improved for real-time applications. gan inversion, image editing, weight modulation, feature modulation, domain-specific segmentation
2301.12025 Report Cross-Architectural Positive Pairs improve the effectiveness of Self-Supervised Learning Pranav Singh, Jacopo Cirrone Existing self-supervised techniques have extreme computational requirements and suffer a substantial drop in performance with a reduction in batch size or pretraining epochs. This paper presents Cross Architectural - Self Supervision (CASS), a novel self-supervised learning approach that leverages Transformer and CNN simultaneously. Compared to the existing state-of-the-art self-supervised learning approaches, we empirically show that CASS-trained CNNs and Transformers across four diverse datasets gained an average of 3.8% with 1% labeled data, 5.9% with 10% labeled data, and 10.13% with 100% labeled data while taking 69% less time. We also show that CASS is much more robust to changes in batch size and training epochs than existing state-of-the-art self-supervised learning approaches. We have open-sourced our code at https://github.com/pranavsinghps1/CASS. This paper introduces CASS (Cross-Architectural Self-Supervision), a new self-supervised learning approach that uses both CNNs and Transformers to learn better data representations, particularly beneficial for medical image analysis where data is often limited. Existing self-supervised methods require large datasets and significant computational resources, hindering their application in medical imaging where data and computational power are often limited. CASS aims to address these limitations. CASS leverages the inherent architectural differences between CNNs and Transformers to create positive pairs from the same input image. It minimizes the cosine similarity loss between the logits of the two architectures, encouraging them to learn from each other. CASS outperforms the state-of-the-art self-supervised method DINO on four medical imaging datasets, achieving an average improvement of 3.8% with 1% labeled data, 5.9% with 10% labeled data, and 10.13% with 100% labeled data. CASS is more robust to changes in batch size and training epochs compared to DINO. CASS is computationally more efficient than DINO, taking 69% less time to train on the same hardware. CASS's performance has not been extensively evaluated on large-scale natural image datasets. At inference time, without ground-truth labels, it's unclear whether to choose the CNN or the Transformer arm of CASS for optimal performance. self-supervised learning, medical image analysis, cnn, transformer, limited data
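A minimal sketch of the cross-architectural objective described above (negative cosine similarity between the CNN and Transformer outputs for the same images); the function name and the omission of any projection heads are simplifying assumptions:

```python
import torch
import torch.nn.functional as F

def cass_loss(cnn_logits, vit_logits):
    """Negative-cosine-similarity loss between the two branches.

    cnn_logits, vit_logits: (batch, dim) outputs of a CNN and a Transformer
    for the *same* batch of images, forming cross-architectural positive pairs.
    Minimizing this pulls the two architectures' representations together.
    """
    cnn = F.normalize(cnn_logits, dim=-1)
    vit = F.normalize(vit_logits, dim=-1)
    return -(cnn * vit).sum(dim=-1).mean()

# illustrative training step: loss = cass_loss(resnet(x), vit(x)); loss.backward()
```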
2301.11706 Report Input Perturbation Reduces Exposure Bias in Diffusion Models Mang Ning, Enver Sangineto, Angelo Porrello, Simone Calderara, Rita Cucchiara Denoising Diffusion Probabilistic Models have shown an impressive generation quality, although their long sampling chain leads to high computational costs. In this paper, we observe that a long sampling chain also leads to an error accumulation phenomenon, which is similar to the exposure bias problem in autoregressive text generation. Specifically, we note that there is a discrepancy between training and testing, since the former is conditioned on the ground truth samples, while the latter is conditioned on the previously generated results. To alleviate this problem, we propose a very simple but effective training regularization, consisting in perturbing the ground truth samples to simulate the inference time prediction errors. We empirically show that, without affecting the recall and precision, the proposed input perturbation leads to a significant improvement in the sample quality while reducing both the training and the inference times. For instance, on CelebA 64$\times$64, we achieve a new state-of-the-art FID score of 1.27, while saving 37.5% of the training time. The code is publicly available at https://github.com/forever208/DDPM-IP The paper proposes DDPM-IP, a novel training regularization method for Denoising Diffusion Probabilistic Models (DDPMs) using input perturbation to address the exposure bias problem. Existing DDPMs suffer from exposure bias due to a discrepancy between training (conditioned on ground truth) and inference (conditioned on previous predictions), leading to error accumulation and suboptimal generation quality. DDPM-IP perturbs the ground truth input during training with Gaussian noise, simulating inference-time prediction errors and encouraging the model to learn a smoother prediction function. The method requires no changes to network architecture or loss function. DDPM-IP consistently outperforms the baseline ADM model in terms of FID and sFID scores across various datasets (CIFAR10, ImageNet 32x32, LSUN tower 64x64, CelebA 64x64, FFHQ 128x128). The method demonstrates significant acceleration in both training (converging faster) and inference (achieving comparable or better results with fewer sampling steps). DDPM-IP does not negatively impact sample diversity, as evidenced by similar recall and precision scores compared to the baseline. Experiments are limited to datasets with small resolution images due to the computational demands of training DDPMs. Future work includes exploring the effectiveness of DDPM-IP on higher resolution images and investigating dataset-specific tuning of the noise hyperparameter. denoising diffusion probabilistic models, exposure bias, input perturbation, image generation, generative models
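The proposed regularization reduces to adding a second, scaled Gaussian noise to the network input while leaving the training target unchanged. A minimal sketch under the standard DDPM parameterization (the helper name and default gamma are assumptions; the summary indicates the perturbation coefficient is a small scalar, on the order of 0.1):

```python
import torch

def ddpm_ip_training_input(x0, alphas_cumprod, t, gamma=0.1):
    """Input perturbation for DDPM training (DDPM-IP style).

    x0: clean images (B, C, H, W); alphas_cumprod: (T,) schedule; t: (B,) timesteps.
    Returns the perturbed noisy input y_t and the target noise eps; the loss
    is still ||eps - eps_theta(y_t, t)||^2, so only the input changes.
    """
    eps = torch.randn_like(x0)   # target noise
    xi = torch.randn_like(x0)    # extra perturbation simulating inference-time prediction error
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    y_t = a.sqrt() * x0 + (1 - a).sqrt() * (eps + gamma * xi)
    return y_t, eps
```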
2301.11699 Report Image Restoration with Mean-Reverting Stochastic Differential Equations Ziwei Luo, Fredrik K. Gustafsson, Zheng Zhao, Jens Sjölund, Thomas B. Schön This paper presents a stochastic differential equation (SDE) approach for general-purpose image restoration. The key construction consists in a mean-reverting SDE that transforms a high-quality image into a degraded counterpart as a mean state with fixed Gaussian noise. Then, by simulating the corresponding reverse-time SDE, we are able to restore the origin of the low-quality image without relying on any task-specific prior knowledge. Crucially, the proposed mean-reverting SDE has a closed-form solution, allowing us to compute the ground truth time-dependent score and learn it with a neural network. Moreover, we propose a maximum likelihood objective to learn an optimal reverse trajectory that stabilizes the training and improves the restoration results. The experiments show that our proposed method achieves highly competitive performance in quantitative comparisons on image deraining, deblurring, and denoising, setting a new state-of-the-art on two deraining datasets. Finally, the general applicability of our approach is further demonstrated via qualitative results on image super-resolution, inpainting, and dehazing. Code is available at https://github.com/Algolzw/image-restoration-sde. This paper introduces Image Restoration Stochastic Differential Equation (IR-SDE), a novel approach for image restoration leveraging a mean-reverting SDE to model image degradation as a diffusion process. Current diffusion models for image restoration often struggle to accurately restore ground truth details, especially when initialized with high-variance noise. IR-SDE tackles this by directly modeling the degradation process, leading to improved fidelity. The method uses a mean-reverting SDE with a closed-form solution to represent the degradation from high-quality to low-quality images. A neural network learns the time-dependent score function of this SDE using a proposed maximum likelihood objective for improved training stability. IR-SDE achieves state-of-the-art performance on deraining benchmarks, outperforming existing methods in both quantitative metrics and perceptual quality. The method demonstrates strong performance on deblurring, surpassing GAN-based methods in perceptual scores while maintaining consistency with ground truths. The authors further showcase the versatility of IR-SDE by successfully applying it to super-resolution, inpainting, and dehazing tasks. The exponential term in the variance calculation leads to overly smooth changes in the final steps, potentially hindering learning in that stage. Future work will explore alternative schedules to mitigate this. The iterative nature of reverse SDE simulation increases computational cost during inference. Exploring optimization techniques for faster inference is a potential future direction. image restoration, stochastic differential equations, diffusion models, mean-reverting process, maximum likelihood
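The mean-reverting construction can be summarized by an Ornstein-Uhlenbeck-style forward SDE and its reverse-time counterpart. The equations below are a reconstruction from the description above (the high-quality image x drifts toward the low-quality image mu with speed theta_t and noise level sigma_t), so the exact coefficient parameterization should be checked against the paper:

```latex
% Forward (degradation) SDE: x drifts toward the low-quality image \mu,
% which acts as the mean state of the process.
\mathrm{d}x = \theta_t \,(\mu - x)\,\mathrm{d}t + \sigma_t\,\mathrm{d}w
% Reverse-time SDE simulated for restoration, using a learned score:
\mathrm{d}x = \left[\theta_t (\mu - x) - \sigma_t^{2}\,\nabla_x \log p_t(x)\right]\mathrm{d}t + \sigma_t\,\mathrm{d}\hat{w}
```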
2301.11558 Report Accelerating Guided Diffusion Sampling with Splitting Numerical Methods Suttisak Wizadwongsa, Supasorn Suwajanakorn Guided diffusion is a technique for conditioning the output of a diffusion model at sampling time without retraining the network for each specific task. One drawback of diffusion models, however, is their slow sampling process. Recent techniques can accelerate unguided sampling by applying high-order numerical methods to the sampling process when viewed as differential equations. On the contrary, we discover that the same techniques do not work for guided sampling, and little has been explored about its acceleration. This paper explores the culprit of this problem and provides a solution based on operator splitting methods, motivated by our key finding that classical high-order numerical methods are unsuitable for the conditional function. Our proposed method can re-utilize the high-order methods for guided sampling and can generate images with the same quality as a 250-step DDIM baseline using 32-58% less sampling time on ImageNet256. We also demonstrate usage on a wide variety of conditional generation tasks, such as text-to-image generation, colorization, inpainting, and super-resolution. This paper proposes a solution based on operator splitting methods for accelerating guided diffusion sampling, which was previously difficult to achieve with high-order numerical methods. Guided diffusion models are powerful but suffer from slow sampling speed. Accelerating guided sampling enables faster generation of high-quality images for various conditional generation tasks. The authors first analyze the guided ODE and identify that the conditional term is the culprit behind the failure of classical high-order methods. They then apply splitting methods, namely Lie-Trotter and Strang splitting, to separate the diffusion and condition subproblems and solve them with different numerical methods. Only the conditional subproblem is incompatible with classical high-order numerical methods. Strang splitting combined with 4th-order PLMS for diffusion and 1st-order PLMS for condition achieves the best performance. The proposed method is 32-58% faster than a 250-step DDIM baseline while maintaining similar image quality on various tasks like text-to-image generation, inpainting, colorization, and super-resolution. The paper's findings are based on existing models and a specific sigma schedule; further investigation is needed for other schedules and models. Improving the behavior of the conditional function itself might be a promising future direction. guided diffusion, sampling acceleration, operator splitting, numerical methods, conditional image generation
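Schematically, a Strang-split update alternates a cheap low-order step on the condition (guidance) term with a high-order step on the diffusion term. The step functions below are placeholders for whichever solvers are paired (e.g. a 1st-order step for the condition and a 4th-order multistep solver for the diffusion), and the time bookkeeping is deliberately simplified:

```python
def strang_guided_step(x, t, dt, diffusion_step, guidance_step):
    """One Strang-split update x(t) -> x(t + dt) for a guided sampling ODE.

    diffusion_step(x, t, dt): advances the plain diffusion ODE
        (can use a high-order solver, e.g. a 4th-order linear multistep method).
    guidance_step(x, t, dt):  advances only the condition/guidance term
        (kept low-order, e.g. a single explicit Euler step).
    """
    x = guidance_step(x, t, dt / 2)        # half step on the condition term
    x = diffusion_step(x, t, dt)           # full step on the diffusion term
    x = guidance_step(x, t + dt, dt / 2)   # closing half step on the condition term
    return x
```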
2301.11326 Report Unsupervised Volumetric Animation Aliaksandr Siarohin, Willi Menapace, Ivan Skorokhodov, Kyle Olszewski, Jian Ren, Hsin-Ying Lee, Menglei Chai, Sergey Tulyakov We propose a novel approach for unsupervised 3D animation of non-rigid deformable objects. Our method learns the 3D structure and dynamics of objects solely from single-view RGB videos, and can decompose them into semantically meaningful parts that can be tracked and animated. Using a 3D autodecoder framework, paired with a keypoint estimator via a differentiable PnP algorithm, our model learns the underlying object geometry and parts decomposition in an entirely unsupervised manner. This allows it to perform 3D segmentation, 3D keypoint estimation, novel view synthesis, and animation. We primarily evaluate the framework on two video datasets: VoxCeleb $256^2$ and TEDXPeople $256^2$. In addition, on the Cats $256^2$ image dataset, we show it even learns compelling 3D geometry from still images. Finally, we show our model can obtain animatable 3D objects from a single or few images. Code and visual results available on our project website, see https://snap-research.github.io/unsupervised-volumetric-animation . This paper presents the first unsupervised method for 3D animation of non-rigid objects from single-view videos. Unsupervised 3D animation enables animating arbitrary objects in a 3D-consistent manner without requiring expensive 3D supervision or limiting animation to predefined object categories. The proposed approach learns a canonical 3D volumetric representation of objects and their segmentation into parts, using a differentiable PnP algorithm to estimate part poses from 2D keypoints, enabling deformation and animation. The method learns high-quality 3D geometry and part decompositions from unconstrained videos, outperforming 3D-GANs in novel view synthesis quality without requiring camera supervision. Quantitative and qualitative comparisons on face and body datasets demonstrate superior novel view synthesis and comparable animation quality to state-of-the-art unsupervised 2D animation methods. The approach generalizes to unseen objects, enabling animation and novel view synthesis from a single or a few images, highlighting its potential for diverse applications. Limitations include a fixed voxel grid resolution that may lead to artifacts during large camera movements and reliance on an optimization-based embedding procedure during inference. Future work includes exploring alternative 3D representations for higher resolution and efficiency, and incorporating techniques for improving the quality and speed of inference for unseen objects. unsupervised learning, 3d animation, novel view synthesis, volumetric rendering, computer vision
2301.11116 Report Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring Ruyang Liu, Jingjia Huang, Ge Li, Jiashi Feng, Xinglong Wu, Thomas H. Li Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs, thus attracting increasing attention for their potential to improve visual representation learning in the video domain. In this paper, based on the CLIP model, we revisit temporal modeling in the context of image-to-video knowledge transferring, which is the key point for extending image-text pretrained models to the video domain. We find that current temporal modeling mechanisms are tailored to either high-level semantic-dominant tasks (e.g., retrieval) or low-level visual pattern-dominant tasks (e.g., recognition), and fail to work on the two cases simultaneously. The key difficulty lies in modeling temporal dependency while taking advantage of both high-level and low-level knowledge in CLIP model. To tackle this problem, we present Spatial-Temporal Auxiliary Network (STAN) -- a simple and effective temporal modeling mechanism extending CLIP model to diverse video tasks. Specifically, to realize both low-level and high-level knowledge transferring, STAN adopts a branch structure with decomposed spatial-temporal modules that enable multi-level CLIP features to be spatial-temporally contextualized. We evaluate our method on two representative video tasks: Video-Text Retrieval and Video Recognition. Extensive experiments demonstrate the superiority of our model over the state-of-the-art methods on various datasets, including MSR-VTT, DiDeMo, LSMDC, MSVD, Kinetics-400, and Something-Something-V2. Codes will be available at https://github.com/farewellthree/STAN This paper revisits temporal modeling in CLIP-based image-to-video knowledge transferring and proposes Spatial-Temporal Auxiliary Network (STAN), a new branch structure for effective knowledge transfer to diverse video tasks. Extending image-text pretrained models like CLIP to the video domain is important but challenging. Existing temporal modeling methods fail to effectively transfer both high-level semantic and low-level visual knowledge from CLIP. STAN augments video frame features with spatial-temporal contexts at different CLIP output levels without altering CLIP's structure. It utilizes a branch structure with decomposed spatial-temporal modules for multi-level feature contextualization, and explores self-attention and 3D convolution based cross-frame modules. STAN outperforms state-of-the-art methods on video-text retrieval benchmarks like MSR-VTT, DiDeMo, and LSMDC. STAN achieves competitive performance on video recognition tasks, including Kinetics-400 and Something-Something-V2. Ablation studies confirm the contribution of each component in STAN and demonstrate the importance of multi-level feature learning and the branch structure. The performance of STAN on small-scale video-text retrieval datasets with limited training data is less competitive. Exploring the compatibility of STAN with other advanced techniques like hierarchical video-text interaction and hard sample modeling is left for future work. video-text retrieval, video recognition, clip, knowledge transfer, temporal modeling
2301.10972 Report On the Importance of Noise Scheduling for Diffusion Models Ting Chen We empirically study the effect of noise scheduling strategies for denoising diffusion generative models. There are three findings: (1) the noise scheduling is crucial for the performance, and the optimal one depends on the task (e.g., image sizes), (2) when increasing the image size, the optimal noise scheduling shifts towards a noisier one (due to increased redundancy in pixels), and (3) simply scaling the input data by a factor of $b$ while keeping the noise schedule function fixed (equivalent to shifting the logSNR by $\log b$) is a good strategy across image sizes. This simple recipe, when combined with recently proposed Recurrent Interface Network (RIN), yields state-of-the-art pixel-based diffusion models for high-resolution images on ImageNet, enabling single-stage, end-to-end generation of diverse and high-fidelity images at 1024$\times$1024 resolution (without upsampling/cascades). This paper investigates the impact of noise scheduling strategies on the performance of denoising diffusion generative models for image generation. Noise scheduling is crucial for diffusion model performance, impacting the distribution of noise levels learned by the model. Optimal scheduling varies across tasks and image resolutions due to differing data redundancy and information density. The authors systematically explore two noise scheduling strategies: 1) Adjusting the noise schedule function (cosine, sigmoid, linear) and 2) Scaling the input data. They train and evaluate their methods on class-conditional ImageNet image generation at varying resolutions, using FID and Inception Score as metrics. Optimal noise scheduling is task- and resolution-dependent. Scaling the input data is a simple yet effective strategy for adjusting noise scheduling, outperforming adjustments to the schedule function. Combining input scaling with the Recurrent Interface Network (RIN) architecture achieves state-of-the-art single-stage, high-resolution image generation on ImageNet. The study primarily focuses on pixel-based diffusion models and hasn't been evaluated on latent diffusion models. Further exploration of hyperparameter tuning for high-resolution images may yield additional performance improvements. denoising diffusion models, noise scheduling, image generation, recurrent interface network, high-resolution images
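The input-scaling recipe amounts to a one-line change in the diffusion forward process. The sketch below also re-normalizes the noised input to roughly unit variance before it is fed to the denoiser, which is an assumed implementation detail rather than something stated in the summary:

```python
import torch

def noisy_input_with_scaling(x0, gamma_t, b=0.7):
    """Forward diffusion with an input scaling factor b (smaller b => effectively noisier schedule).

    x0: data tensor (assumed to have roughly unit variance).
    gamma_t: tensor of \bar{alpha}_t values broadcastable to x0.
    """
    eps = torch.randn_like(x0)
    x_t = torch.sqrt(gamma_t) * (b * x0) + torch.sqrt(1.0 - gamma_t) * eps
    # re-normalize to approximately unit variance before feeding the network
    x_t = x_t / torch.sqrt(b ** 2 * gamma_t + (1.0 - gamma_t))
    return x_t, eps
```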
2301.10941 Report GeCoNeRF: Few-shot Neural Radiance Fields via Geometric Consistency Min-seop Kwak, Jiuhn Song, Seungryong Kim We present a novel framework to regularize Neural Radiance Field (NeRF) in a few-shot setting with a geometry-aware consistency regularization. The proposed approach leverages a rendered depth map at unobserved viewpoint to warp sparse input images to the unobserved viewpoint and impose them as pseudo ground truths to facilitate learning of NeRF. By encouraging such geometry-aware consistency at a feature-level instead of using pixel-level reconstruction loss, we regularize the NeRF at semantic and structural levels while allowing for modeling view dependent radiance to account for color variations across viewpoints. We also propose an effective method to filter out erroneous warped solutions, along with training strategies to stabilize training during optimization. We show that our model achieves competitive results compared to state-of-the-art few-shot NeRF models. Project page is available at https://ku-cvlab.github.io/GeCoNeRF/. This paper introduces GeCoNeRF, a novel framework that utilizes geometric consistency to enhance the performance of Neural Radiance Fields (NeRF) in few-shot novel view synthesis. NeRF struggles in few-shot settings due to overfitting sparse input images and failing to reconstruct accurate geometry. This work addresses this limitation by introducing geometric constraints to regularize NeRF. GeCoNeRF warps sparse input images to novel viewpoints guided by the depth rendered by NeRF. By enforcing consistency between warped images and those rendered at novel viewpoints, GeCoNeRF regularizes both geometry and appearance. It utilizes feature-level regularization to handle view-dependent radiance effects and employs occlusion masking to filter out erroneous warpings. GeCoNeRF achieves competitive results compared to state-of-the-art few-shot NeRF models, as demonstrated on synthetic and real datasets. The proposed method effectively captures fine details and reduces artifacts in few-shot scenarios. Ablation studies validate the contribution of each component, including feature-level consistency loss, occlusion masking, and progressive training strategies. The thresholding technique used for occlusion handling can be sensitive to different scenes and datasets. The assumption of surface coverage by multiple viewpoints may not always hold, leading to unnecessary computational costs. neural radiance fields, nerf, few-shot learning, novel view synthesis, geometric consistency
2301.10916 Report ITstyler: Image-optimized Text-based Style Transfer Yunpeng Bai, Jiayue Liu, Chao Dong, Chun Yuan Text-based style transfer is a newly-emerging research topic that uses text information instead of style image to guide the transfer process, significantly extending the application scenario of style transfer. However, previous methods require extra time for optimization or text-image paired data, leading to limited effectiveness. In this work, we achieve a data-efficient text-based style transfer method that does not require optimization at the inference stage. Specifically, we convert text input to the style space of the pre-trained VGG network to realize a more effective style swap. We also leverage CLIP's multi-modal embedding space to learn the text-to-style mapping with the image dataset only. Our method can transfer arbitrary new styles of text input in real-time and synthesize high-quality artistic images. ITstyler, a data-efficient text-based style transfer method that utilizes the style representation capability of VGG features and the multi-modal embedding space of CLIP, enabling real-time transfer of arbitrary new styles from text input to artistic images. Existing text-based style transfer methods suffer from limitations like requiring extra optimization time, relying on text-image paired data, or lacking effectiveness in finding a suitable style representation space. The method adapts CLIP's embedding space to the style space of a pre-trained VGG network. It trains a mapping network to convert text embeddings from CLIP's text encoder into style representations in VGG's feature space. This allows for efficient style swapping using a modified AdaIN operation. ITstyler generates stylized images that better match the text description and are more artistically pleasing compared to previous methods. User study confirms a strong preference for ITstyler's results. The method exhibits superior speed, enabling real-time performance for arbitrary style transfer. The method might not effectively convert specific words (e.g., names) into artistic styles. Future work can explore incorporating more complex language understanding to further improve the mapping from text to style representations. style transfer, text-based style transfer, clip, adain, vgg
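The style swap described above boils down to an AdaIN-style operation in VGG feature space, with the text-derived style vector supplying the target channel statistics; splitting that vector into mean/std halves is an illustrative assumption:

```python
import torch

def adain_from_text_style(content_feat, style_vec, eps=1e-5):
    """Re-normalize VGG content features with channel-wise statistics
    predicted from a text embedding (AdaIN-style transfer).

    content_feat: (B, C, H, W) VGG features of the content image.
    style_vec:    (B, 2C) output of the text-to-style mapping network,
                  interpreted here as [target_mean | target_std] per channel.
    """
    B, C, _, _ = content_feat.shape
    mu_s, sigma_s = style_vec[:, :C], style_vec[:, C:]

    mu_c = content_feat.mean(dim=(2, 3), keepdim=True)
    sigma_c = content_feat.std(dim=(2, 3), keepdim=True) + eps

    normalized = (content_feat - mu_c) / sigma_c
    return sigma_s.view(B, C, 1, 1) * normalized + mu_s.view(B, C, 1, 1)
```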
2301.10670 Report Towards Arbitrary Text-driven Image Manipulation via Space Alignment Yunpeng Bai, Zihan Zhong, Chao Dong, Weichen Zhang, Guowei Xu, Chun Yuan Recent GAN inversion methods have been able to successfully invert a real image input to the corresponding editable latent code in StyleGAN. By combining these with the language-vision model CLIP, several text-driven image manipulation methods have been proposed. However, these methods incur extra costs to perform optimization for each specific image or new attribute editing mode. To achieve a more efficient editing method, we propose a new Text-driven image Manipulation framework via Space Alignment (TMSA). The Space Alignment module aims to align the same semantic regions in CLIP and StyleGAN spaces. Then, the text input can be mapped directly into the StyleGAN space and used to find the semantic shift corresponding to the text description. The framework supports arbitrary image editing modes without additional cost. Our work provides the user with an interface to control the attributes of a given image according to text input and get the result in real time. Extensive experiments demonstrate our superior performance over prior works. This paper presents TMSA, a novel text-driven image manipulation framework that aligns semantic regions between CLIP and StyleGAN latent spaces, allowing arbitrary text input to control image attributes in real-time. Existing text-driven image manipulation methods require expensive optimization or training for each image or attribute, limiting flexibility and efficiency. TMSA aims to overcome these limitations. A mapping network is trained to align image embeddings from CLIP with corresponding latent codes in StyleGAN's W+ space. The network is further fine-tuned using generated images for in-domain adjustment and adapted for specific inversion encoders. TMSA enables accurate attribute editing from arbitrary text descriptions while preserving image identity, outperforming previous methods qualitatively and quantitatively. The Space Alignment module effectively maps text embeddings to precise semantic locations in the StyleGAN latent space. TMSA is adaptable to different inversion methods and can be combined with them to achieve high-fidelity image editing. The reliance on pre-trained CLIP and StyleGAN models might limit the scope of editable attributes. Future work can explore extending TMSA for more complex manipulation tasks, such as multi-attribute editing or text-guided image generation. image manipulation, text-driven editing, generative adversarial networks, clip, stylegan
2301.10241 Report K-Planes: Explicit Radiance Fields in Space, Time, and Appearance Sara Fridovich-Keil, Giacomo Meanti, Frederik Warburg, Benjamin Recht, Angjoo Kanazawa We introduce k-planes, a white-box model for radiance fields in arbitrary dimensions. Our model uses d choose 2 planes to represent a d-dimensional scene, providing a seamless way to go from static (d=3) to dynamic (d=4) scenes. This planar factorization makes adding dimension-specific priors easy, e.g. temporal smoothness and multi-resolution spatial structure, and induces a natural decomposition of static and dynamic components of a scene. We use a linear feature decoder with a learned color basis that yields similar performance as a nonlinear black-box MLP decoder. Across a range of synthetic and real, static and dynamic, fixed and varying appearance scenes, k-planes yields competitive and often state-of-the-art reconstruction fidelity with low memory usage, achieving 1000x compression over a full 4D grid, and fast optimization with a pure PyTorch implementation. For video results and code, please see https://sarafridov.github.io/K-Planes. This paper introduces k-planes, an explicit, interpretable, and memory-efficient model for representing radiance fields in arbitrary dimensions. Existing methods for dynamic radiance fields, which require representing 4D volumes, are either memory inefficient or rely on black-box components like MLPs. K-planes offers a solution that is both memory efficient and interpretable. K-planes factorizes a d-dimensional scene into (d choose 2) planes, representing every pair of dimensions. Each plane stores features that are then combined using the Hadamard product and decoded into color and density using either a linear decoder with a learned color basis (explicit model) or an MLP (hybrid model). K-planes achieves competitive and often state-of-the-art reconstruction fidelity on a variety of static and dynamic scenes, including those with varying appearance. The model is compact, achieving 1000x compression over a full 4D grid. K-planes allows for fast optimization using a pure PyTorch implementation. The resolution of reconstructions for scenes with varying appearance is slightly lower than NeRF-W. Future work could explore extending the planar factorization to efficiently incorporate higher-order interactions between dimensions. radiance fields, neural rendering, planar factorization, dynamic scenes, explicit representation
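The planar factorization is compact to state in code: bilinearly sample each of the (d choose 2) planes at the projected coordinates and combine per-plane features with an element-wise (Hadamard) product. The data layout and helper below are simplified assumptions:

```python
import itertools
import torch
import torch.nn.functional as F

def kplanes_features(planes, coords):
    """Query a k-planes field at normalized coordinates.

    planes: dict mapping an axis pair (i, j) -> feature plane of shape (1, F, R, R).
            For a dynamic scene with d=4 axes (x, y, z, t) there are 6 such planes.
    coords: (N, d) points with each coordinate in [-1, 1].
    Returns (N, F) features, the Hadamard product over all planes.
    """
    d = coords.shape[1]
    feats = None
    for (i, j) in itertools.combinations(range(d), 2):
        # project points onto the (i, j) plane and sample bilinearly
        grid = coords[:, [i, j]].view(1, -1, 1, 2)                 # (1, N, 1, 2)
        f = F.grid_sample(planes[(i, j)], grid, align_corners=True)  # (1, F, N, 1)
        f = f.view(planes[(i, j)].shape[1], -1).t()                 # (N, F)
        feats = f if feats is None else feats * f                   # Hadamard product
    return feats
```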
2301.09879 Report Data Augmentation Alone Can Improve Adversarial Training Lin Li, Michael Spratling Adversarial training suffers from the issue of robust overfitting, which seriously impairs its generalization performance. Data augmentation, which is effective at preventing overfitting in standard training, has been observed by many previous works to be ineffective in mitigating overfitting in adversarial training. This work proves that, contrary to previous findings, data augmentation alone can significantly boost accuracy and robustness in adversarial training. We find that the hardness and the diversity of data augmentation are important factors in combating robust overfitting. In general, diversity can improve both accuracy and robustness, while hardness can boost robustness at the cost of accuracy within a certain limit and degrade them both over that limit. To mitigate robust overfitting, we first propose a new crop transformation, Cropshift, which has improved diversity compared to the conventional one (Padcrop). We then propose a new data augmentation scheme, based on Cropshift, with much improved diversity and well-balanced hardness. Empirically, our augmentation method achieves the state-of-the-art accuracy and robustness for data augmentations in adversarial training. Furthermore, when combined with weight averaging it matches, or even exceeds, the performance of the best contemporary regularization methods for alleviating robust overfitting. Code is available at: https://github.com/TreeLLi/DA-Alone-Improves-AT. This paper demonstrates that data augmentation alone, contrary to prior belief, can significantly improve the accuracy and robustness of adversarial training by carefully managing the hardness and diversity of augmentations. Robust overfitting, a major issue in adversarial training, limits generalization performance. Previous attempts to leverage data augmentation for this problem have been largely ineffective. The paper analyzes the impact of augmentation hardness and diversity, proposing Cropshift, a diversity-enhancing crop operation, and IDBH, a new augmentation scheme that balances hardness and maximizes diversity. The authors conduct experiments on CIFAR-10, SVHN, and Tiny ImageNet datasets, comparing IDBH with existing augmentation and regularization techniques. IDBH achieves state-of-the-art accuracy and robustness among data augmentation methods in adversarial training, significantly surpassing the baseline augmentation with early stopping. IDBH matches or exceeds the performance of state-of-the-art regularization methods designed to alleviate robust overfitting. The study reveals that while diversity consistently improves accuracy and robustness, hardness should be carefully balanced to avoid compromising accuracy. The study lacked sufficient computational resources to conduct a more extensive automatic augmentation search for optimal hyperparameters. The proposed hardness measure has limitations, as demonstrated by the exceptional behavior of certain transformations. adversarial training, robust overfitting, data augmentation, cropshift, idbh
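A rough sketch of a Cropshift-style transform as characterized above (crop a region, then shift its placement within the canvas); the exact sampling ranges and padding behavior are guesses for illustration, not the paper's specification:

```python
import random
import torch

def cropshift(img, max_crop=8):
    """Simplified Cropshift-style augmentation: crop the image by a random
    amount on each side, then paste the crop at a random offset on a blank
    (zero-padded) canvas of the original size. Ranges are illustrative.
    """
    _, h, w = img.shape
    top, bottom = random.randint(0, max_crop), random.randint(0, max_crop)
    left, right = random.randint(0, max_crop), random.randint(0, max_crop)
    crop = img[:, top:h - bottom, left:w - right]

    ch, cw = crop.shape[1], crop.shape[2]
    y = random.randint(0, h - ch)
    x = random.randint(0, w - cw)
    canvas = torch.zeros_like(img)
    canvas[:, y:y + ch, x:x + cw] = crop
    return canvas
```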
2301.09637 Report InfiniCity: Infinite-Scale City Synthesis Chieh Hubert Lin, Hsin-Ying Lee, Willi Menapace, Menglei Chai, Aliaksandr Siarohin, Ming-Hsuan Yang, Sergey Tulyakov Toward infinite-scale 3D city synthesis, we propose a novel framework, InfiniCity, which constructs and renders an unconstrainedly large and 3D-grounded environment from random noises. InfiniCity decomposes the seemingly impractical task into three feasible modules, taking advantage of both 2D and 3D data. First, an infinite-pixel image synthesis module generates arbitrary-scale 2D maps from the bird's-eye view. Next, an octree-based voxel completion module lifts the generated 2D map to 3D octrees. Finally, a voxel-based neural rendering module texturizes the voxels and renders 2D images. InfiniCity can thus synthesize arbitrary-scale and traversable 3D city environments, and allow flexible and interactive editing from users. We quantitatively and qualitatively demonstrate the efficacy of the proposed framework. Project page: https://hubert0527.github.io/infinicity/ InfiniCity, a novel framework for synthesizing infinite-scale, realistic, and navigable 3D city environments from random noises. Existing methods are limited to bounded environments, constrained camera movements, or lack 3D consistency, highlighting the need for a scalable and realistic 3D environment generation approach. A three-stage pipeline: 1) Infinite-pixel satellite image synthesis generates large-scale 2D maps. 2) Octree-based voxel completion converts 2D maps to watertight 3D voxel environments. 3) Voxel-based neural rendering adds textures to the voxel world. Generates arbitrary-scale, coherent, and diverse city maps with plausible structures across multiple modalities. Successfully lifts 2D maps to 3D voxel environments, ensuring watertight structures while retaining surface details. Outperforms baseline methods in quantitative evaluations (FID, KID, P-FID) and exhibits better visual quality and 3D consistency. The quality of the final rendering is currently limited by the neural rendering stage. Future work includes exploring advanced neural rendering techniques to improve visual fidelity and address convergence and efficiency challenges. 3d city synthesis, infinite-scale generation, neural rendering, voxel completion, generative modeling
2301.09595 Report Zorro: the masked multimodal transformer Adrià Recasens, Jason Lin, João Carreira, Drew Jaegle, Luyu Wang, Jean-baptiste Alayrac, Pauline Luc, Antoine Miech, Lucas Smaira, Ross Hemsley, Andrew Zisserman Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network - thus requiring very little fusion engineering. The resulting representations are however fully entangled throughout the network, which may not always be desirable: in learning, contrastive audio-visual self-supervised learning requires independent audio and visual features to operate, otherwise learning collapses; in inference, evaluation of audio-visual models should be possible on benchmarks having just audio or just video. In this paper, we introduce Zorro, a technique that uses masks to control how inputs from each modality are routed inside Transformers, keeping some parts of the representation modality-pure. We apply this technique to three popular transformer-based architectures (ViT, Swin and HiP) and show that with contrastive pre-training Zorro achieves state-of-the-art results on most relevant benchmarks for multimodal tasks (AudioSet and VGGSound). Furthermore, the resulting models are able to perform unimodal inference on both video and audio benchmarks such as Kinetics-400 or ESC-50. Introduces Zorro, a multimodal Transformer architecture using masks to control information flow between modalities, enabling unimodal and multimodal training and inference within a single model. Addresses limitations of existing multimodal models that entangle representations, hindering contrastive learning and unimodal inference. Preserves modality-specific representations within the Transformer by applying masks to self-attention and decoding cross-attention operations. Extends the approach to ViT, Swin, and HiP architectures. Achieves state-of-the-art results on AudioSet and VGGSound benchmarks with contrastive pre-training. Enables unimodal inference on Kinetics-400 (video) and ESC-50 (audio) with a model trained on multimodal data. Shows superior performance compared to alternative masking configurations and fusion positions. Unimodal performance on Kinetics-400, while strong, does not yet surpass specialized video Transformers. Exploration of alternative self-supervised learning methods beyond contrastive learning. multimodal learning, transformers, contrastive learning, self-supervised learning, audio-visual recognition
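The central mechanism is a binary attention mask that keeps the audio and video streams modality-pure while dedicated fusion tokens attend to everything; the token ordering below is an assumed layout for illustration:

```python
import torch

def zorro_attention_mask(n_audio, n_video, n_fusion):
    """Build a (T, T) boolean mask where mask[q, k] = True means query q may
    attend to key k. Audio queries see only audio keys, video queries only
    video keys, and fusion queries see all tokens, so the unimodal streams
    stay free of cross-modal information.
    """
    T = n_audio + n_video + n_fusion
    mask = torch.zeros(T, T, dtype=torch.bool)
    a = slice(0, n_audio)
    v = slice(n_audio, n_audio + n_video)
    f = slice(n_audio + n_video, T)
    mask[a, a] = True   # audio -> audio only
    mask[v, v] = True   # video -> video only
    mask[f, :] = True   # fusion tokens attend everywhere
    return mask
```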
2301.09451 Report A Simple Recipe for Competitive Low-compute Self supervised Vision Models Quentin Duval, Ishan Misra, Nicolas Ballas Self-supervised methods in vision have been mostly focused on large architectures as they seem to suffer from a significant performance drop for smaller architectures. In this paper, we propose a simple self-supervised distillation technique that can train high performance low-compute neural networks. Our main insight is that existing joint-embedding based SSL methods can be repurposed for knowledge distillation from a large self-supervised teacher to a small student model. Thus, we call our method Replace one Branch (RoB) as it simply replaces one branch of the joint-embedding training with a large teacher model. RoB is widely applicable to a number of architectures such as small ResNets, MobileNets and ViT, and pretrained models such as DINO, SwAV or iBOT. When pretraining on the ImageNet dataset, RoB yields models that compete with supervised knowledge distillation. When applied to MSN, RoB produces students with strong semi-supervised capabilities. Finally, our best ViT-Tiny models improve over prior SSL state-of-the-art on ImageNet by $2.3\%$ and are on par or better than a supervised distilled DeiT on five downstream transfer tasks (iNaturalist, CIFAR, Clevr/Count, Clevr/Dist and Places). We hope RoB enables practical self-supervision at smaller scale. This paper introduces "Replace one Branch" (RoB), a simple self-supervised distillation technique for training high-performance, low-compute neural networks by adapting existing joint-embedding based self-supervised learning methods. Existing self-supervised learning methods often underperform with smaller architectures, limiting their practical application in contexts where computational resources are constrained. RoB addresses this limitation, enabling the benefits of self-supervision in low-compute settings. RoB replaces one branch of a joint-embedding self-supervised method (e.g., DINO, SwAV, iBOT, MSN) with a pretrained teacher network. It removes the regularization terms used to prevent collapse in the original method and utilizes identical-view predictions instead of cross-view predictions. RoB produces state-of-the-art self-supervised models for ViT-Tiny, ResNet18, and ResNet34, demonstrating significant improvements over previous methods. The distilled students outperform their supervised counterparts on transfer learning tasks and achieve competitive performance with models trained via supervised distillation. RoB-trained models exhibit strong semi-supervised learning capabilities, particularly for low-shot image classification. The performance improvement with RoB is more significant for students that struggle with traditional SSL training. Future work could explore alternative student head designs and distillation techniques to further improve RoB's performance. self-supervised learning, knowledge distillation, low-compute vision models, transfer learning, semi-supervised learning
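Replace one Branch essentially runs a frozen self-supervised teacher and trains the small student to predict the teacher's output on the same view. The DINO-style cross-entropy form and the temperatures below are assumptions about one possible head, not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def rob_distillation_loss(student_out, teacher_out, temp_s=0.1, temp_t=0.04):
    """Distillation loss: the frozen SSL teacher's sharpened output
    distribution supervises the small student on the *same* view.
    Temperatures are illustrative defaults.
    """
    t = F.softmax(teacher_out.detach() / temp_t, dim=-1)  # teacher is frozen
    log_s = F.log_softmax(student_out / temp_s, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()

# illustrative training step: loss = rob_distillation_loss(student(x), teacher(x))
```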
2301.09430 Report Rethinking Real-world Image Deraining via An Unpaired Degradation-Conditioned Diffusion Model Yiyang Shen, Mingqiang Wei, Yongzhen Wang, Xueyang Fu, Jing Qin Recent diffusion models have exhibited great potential in generative modeling tasks. Part of their success can be attributed to the ability of training stable on huge sets of paired synthetic data. However, adapting these models to real-world image deraining remains difficult for two aspects. First, collecting a large-scale paired real-world clean/rainy dataset is unavailable while regular conditional diffusion models heavily rely on paired data for training. Second, real-world rain usually reflects real-world scenarios with a variety of unknown rain degradation types, which poses a significant challenge for the generative modeling process. To meet these challenges, we propose RainDiff, the first real-world image deraining paradigm based on diffusion models, serving as a new standard bar for real-world image deraining. We address the first challenge by introducing a stable and non-adversarial unpaired cycle-consistent architecture that can be trained, end-to-end, with only unpaired data for supervision; and the second challenge by proposing a degradation-conditioned diffusion model that refines the desired output via a diffusive generative process conditioned by learned priors of multiple rain degradations. Extensive experiments confirm the superiority of our RainDiff over existing unpaired/semi-supervised methods and show its competitive advantages over several fully-supervised ones. This paper presents RainDiff, a novel unpaired learning paradigm for real-world image deraining leveraging a degradation-conditioned diffusion model. Real-world image deraining is challenging due to the lack of paired training data and the diverse degradation types in real rain. RainDiff utilizes a non-adversarial unpaired cycle-consistent architecture with a degradation-conditioned diffusion model. It learns degradation priors to guide the rain removal process. RainDiff outperforms state-of-the-art unpaired and semi-supervised methods on both synthetic and real-world datasets. The degradation-conditioned diffusion model effectively handles multiple rain degradation types. The proposed method achieves competitive performance against fully-supervised deraining approaches. Further research is needed to assess the performance of RainDiff on diverse weather conditions beyond rain. Similar to other diffusion models, RainDiff has a longer runtime compared to single-pass image restoration models. image deraining, diffusion models, unpaired learning, degradation-conditioned, cycle-consistent
2301.09376 Report Crowd3D: Towards Hundreds of People Reconstruction from a Single Image Hao Wen, Jing Huang, Huili Cui, Haozhe Lin, YuKun Lai, Lu Fang, Kun Li Image-based multi-person reconstruction in wide-field large scenes is critical for crowd analysis and security alert. However, existing methods cannot deal with large scenes containing hundreds of people, which encounter the challenges of large number of people, large variations in human scale, and complex spatial distribution. In this paper, we propose Crowd3D, the first framework to reconstruct the 3D poses, shapes and locations of hundreds of people with global consistency from a single large-scene image. The core of our approach is to convert the problem of complex crowd localization into pixel localization with the help of our newly defined concept, Human-scene Virtual Interaction Point (HVIP). To reconstruct the crowd with global consistency, we propose a progressive reconstruction network based on HVIP by pre-estimating a scene-level camera and a ground plane. To deal with a large number of persons and various human sizes, we also design an adaptive human-centric cropping scheme. Besides, we contribute a benchmark dataset, LargeCrowd, for crowd reconstruction in a large scene. Experimental results demonstrate the effectiveness of the proposed method. The code and datasets will be made public. This supplementary document provides additional details and analysis for the paper 'Crowd3D: Towards Hundreds of People Reconstruction from a Single Image', focusing on large-scale crowd reconstruction from images. The work addresses the limitations of existing methods that struggle with large-scene images containing many people, aiming to improve the accuracy and robustness of 3D human pose and shape estimation in crowded scenes. The document details the adaptive human-centric cropping scheme for dividing large images into manageable patches, the process of applying and adapting other methods for comparison, and additional ablation studies of key modules. The method achieves state-of-the-art results on the large-scale 'LargeCrowd' dataset, demonstrating its effectiveness. Evaluation on small-scene datasets ('Panoptic', 'MuPoTS') shows that the method generalizes well and maintains high performance. Ablation studies confirm the positive impact of adaptive cropping and accurate ground plane estimation on overall performance. Automatically obtaining cropping parameters can be sensitive to false pose detections and computationally expensive. Future work could explore weakly-supervised or unsupervised methods to reduce reliance on ground-truth annotations for ground plane estimation. crowd reconstruction, 3d human pose estimation, large-scale scene understanding, adaptive cropping, depth estimation
2301.09264 Report Efficient Training Under Limited Resources Mahdi Zolnouri, Dounia Lakhmiri, Christophe Tribes, Eyyüb Sari, Sébastien Le Digabel Training time budget and size of the dataset are among the factors affecting the performance of a Deep Neural Network (DNN). This paper shows that Neural Architecture Search (NAS), Hyperparameter Optimization (HPO), and Data Augmentation help DNNs perform much better while these two factors are limited. However, searching for an optimal architecture and the best hyperparameter values besides a good combination of data augmentation techniques under low resources requires many experiments. We present our approach to achieving such a goal in three steps: reducing training epoch time by compressing the model while maintaining the performance compared to the original model, preventing model overfitting when the dataset is small, and performing the hyperparameter tuning. We used NOMAD, which is a blackbox optimization software based on a derivative-free algorithm, to do NAS and HPO. Our work achieved an accuracy of 86.0% on a tiny subset of Mini-ImageNet at the ICLR 2021 Hardware Aware Efficient Training (HAET) Challenge and won second place in the competition. The competition results can be found at haet2021.github.io/challenge and our source code can be found at github.com/DouniaLakhmiri/ICLR_HAET2021. This paper presents a method for improving Deep Neural Network (DNN) performance under limited training time and data size constraints by combining Neural Architecture Search (NAS), Hyperparameter Optimization (HPO), and Data Augmentation (DA). Training time budget and dataset size significantly affect DNN performance. This paper addresses the challenge of finding optimal architectures, hyperparameters, and data augmentation techniques under limited resources. The methodology involves: 1) Reducing training time by compressing the model via NAS while maintaining performance. 2) Using DA to mitigate overfitting on small datasets. 3) Performing HPO using the blackbox optimization software NOMAD to fine-tune hyperparameters. SENet-18 and ResNet-18 were identified as strong baseline models on CIFAR-10 subsets. NAS produced smaller variants of SENet-18 and ResNet-18 with comparable accuracy. The final model, NOMAD-NAS-SENet-18, achieved 86.0% accuracy on a Mini-ImageNet subset, securing second place in the ICLR 2021 HAET Challenge. The study primarily used CIFAR-10 as a proxy dataset, potentially limiting generalizability to the Mini-ImageNet evaluation dataset. An undocumented image resizing step in the evaluation process impacted the final performance. neural architecture search, hyperparameter optimization, data augmentation, model compression, efficient training
2301.09121 Report Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision Jilan Xu, Junlin Hou, Yuejie Zhang, Rui Feng, Yi Wang, Yu Qiao, Weidi Xie In this paper, we consider the problem of open-vocabulary semantic segmentation (OVS), which aims to segment objects of arbitrary classes instead of pre-defined, closed-set categories. The main contributions are as follows: First, we propose a transformer-based model for OVS, termed OVSegmentor, which only exploits web-crawled image-text pairs for pre-training without using any mask annotations. OVSegmentor assembles the image pixels into a set of learnable group tokens via a slot-attention based binding module, and aligns the group tokens to the corresponding caption embedding. Second, we propose two proxy tasks for training, namely masked entity completion and cross-image mask consistency. The former aims to infer all masked entities in the caption given the group tokens, which enables the model to learn fine-grained alignment between visual groups and text entities. The latter enforces consistent mask predictions between images that contain shared entities, which encourages the model to learn visual invariance. Third, we construct the CC4M dataset for pre-training by filtering CC12M with frequently appearing entities, which significantly improves training efficiency. Fourth, we perform zero-shot transfer on three benchmark datasets, PASCAL VOC 2012, PASCAL Context, and COCO Object. Our model achieves superior segmentation results over the state-of-the-art method by using only 3% of the data (4M vs 134M) for pre-training. Code and pre-trained models will be released for future research. This paper proposes OVSegmentor, a transformer-based model for open-vocabulary semantic segmentation that uses only image-caption pairs for pre-training and can segment arbitrary object classes. Existing semantic segmentation approaches suffer from costly annotations and limitations to pre-defined classes. This work explores open-vocabulary segmentation by leveraging freely available image-caption pairs and enables zero-shot transfer to unseen classes. OVSegmentor introduces learnable group tokens to cluster image patches and aligns them with caption embeddings. Two novel proxy tasks, masked entity completion and cross-image mask consistency, are introduced to learn entity-specific and visually invariant group semantics. OVSegmentor achieves superior results compared to methods using supervised finetuning on PASCAL VOC, showcasing the effectiveness of training with image-caption pairs. The model outperforms state-of-the-art open-vocabulary segmentation approaches on PASCAL VOC while using only 3% of their pre-training data, demonstrating high training efficiency. Ablation studies validate the contribution of each component, especially the proposed proxy tasks, in improving segmentation performance. The model shows limitations in recognizing stuff classes and struggles to separate co-occurring objects into distinct groups. Future work can explore incorporating fine-grained descriptions and leveraging external knowledge bases to further enhance segmentation accuracy. open-vocabulary semantic segmentation, vision-language pre-training, zero-shot transfer learning, masked entity completion, cross-image mask consistency
2301.08898 Report Recurrent Generic Contour-based Instance Segmentation with Progressive Learning Hao Feng, Keyi Zhou, Wengang Zhou, Yufei Yin, Jiajun Deng, Qi Sun, Houqiang Li Contour-based instance segmentation has been actively studied, thanks to its flexibility and elegance in processing visual objects within complex backgrounds. In this work, we propose a novel deep network architecture, i.e., PolySnake, for generic contour-based instance segmentation. Motivated by the classic Snake algorithm, the proposed PolySnake achieves superior and robust segmentation performance with an iterative and progressive contour refinement strategy. Technically, PolySnake introduces a recurrent update operator to estimate the object contour iteratively. It maintains a single estimate of the contour that is progressively deformed toward the object boundary. At each iteration, PolySnake builds a semantic-rich representation for the current contour and feeds it to the recurrent operator for further contour adjustment. Through the iterative refinements, the contour progressively converges to a stable status that tightly encloses the object instance. Beyond the scope of general instance segmentation, extensive experiments are conducted to validate the effectiveness and generalizability of our PolySnake in two additional specific task scenarios, including scene text detection and lane detection. The results demonstrate that the proposed PolySnake outperforms the existing advanced methods on several multiple prevalent benchmarks across the three tasks. The codes and pre-trained models are available at https://github.com/fh2019ustc/PolySnake Proposes PolySnake, a deep network architecture for generic contour-based instance segmentation, which uses an iterative and progressive contour refinement strategy inspired by the classic Snake algorithm. Addresses limitations of existing instance segmentation methods that rely on inaccurate bounding boxes or complex contour learning strategies, aiming for more accurate and efficient instance segmentation. PolySnake initializes a coarse contour and iteratively deforms it using a recurrent update operator that leverages multi-scale features. A shape loss is introduced to encourage accurate object boundary adherence. The model is trained in two stages, optimizing initial contour generation and iterative deformation, followed by multi-scale contour refinement. PolySnake outperforms state-of-the-art contour-based methods on SBD, Cityscapes, COCO, and KINS datasets for instance segmentation. Achieves outstanding performance in scene text detection on CTW1500, demonstrating its generalizability. Exhibits superior accuracy in lane detection on CULane, surpassing both segmentation-based and anchor-based methods, particularly in challenging categories like dazzle and night. Performance improvement on KINS dataset (amodal instance segmentation) with multi-scale refinement is marginal, possibly due to limitations in leveraging vertex features from occluded parts. Future work includes integrating PolySnake with state-of-the-art instance segmentation methods and exploring its application to other polygon or curve estimation tasks. instance segmentation, contour-based segmentation, progressive learning, recurrent networks, scene text detection, lane detection
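The iterative refinement can be summarized as repeatedly sampling image features at the current vertices and predicting per-vertex offsets with a shared recurrent operator; all module names below are placeholders for illustration:

```python
import torch

def refine_contour(contour, feature_map, update_op, sample_features, n_iters=8):
    """Iterative contour deformation in the spirit of snake-style refinement.

    contour:         (B, N, 2) vertex coordinates of the current contour estimate.
    feature_map:     (B, C, H, W) image features.
    update_op:       recurrent module predicting per-vertex offsets (and its own
                     hidden state) from sampled vertex features.
    sample_features: callable that bilinearly samples feature_map at the vertices.
    """
    hidden = None
    for _ in range(n_iters):
        vert_feats = sample_features(feature_map, contour)  # (B, N, C)
        offsets, hidden = update_op(vert_feats, hidden)      # (B, N, 2)
        contour = contour + offsets                          # progressive deformation
    return contour
```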
2301.08455 Report Spatial Steerability of GANs via Self-Supervision from Discriminator Jianyuan Wang, Lalit Bhagat, Ceyuan Yang, Yinghao Xu, Yujun Shen, Hongdong Li, Bolei Zhou Generative models have made huge progress in photorealistic image synthesis in recent years. To enable humans to steer the image generation process and customize the output, many works explore the interpretable dimensions of the latent space in GANs. Existing methods edit the attributes of the output image, such as orientation or color scheme, by varying the latent code along certain directions. However, these methods usually require additional human annotations for each pretrained model, and they mostly focus on editing global attributes. In this work, we propose a self-supervised approach to improve the spatial steerability of GANs without searching for steerable directions in the latent space or requiring extra annotations. Specifically, we design randomly sampled Gaussian heatmaps to be encoded into the intermediate layers of generative models as spatial inductive bias. Along with training the GAN model from scratch, these heatmaps are aligned with the emerging attention of the GAN's discriminator in a self-supervised learning manner. During inference, users can interact with the spatial heatmaps in an intuitive manner, enabling them to edit the output image by adjusting the scene layout, moving, or removing objects. Moreover, we incorporate DragGAN into our framework, which facilitates fine-grained manipulation within a reasonable time and supports a coarse-to-fine editing process. Extensive experiments show that the proposed method not only enables spatial editing over human faces, animal faces, outdoor scenes, and complicated multi-object indoor scenes but also brings improvement in synthesis quality. Code, models, and demo video are available at https://genforce.github.io/SpatialGAN/. This paper introduces SpatialGAN, a self-supervised approach to enhance the spatial steerability of GANs, enabling users to manipulate scene layouts, move or remove objects, and change local appearances without the need for extra annotations or searching for steerable directions in the latent space. Existing methods for controlling GAN outputs often require expensive annotations or struggle to provide fine-grained spatial control. SpatialGAN addresses these limitations by introducing a novel self-supervised framework that leverages the inherent spatial attention of the discriminator to guide the generator. The method encodes randomly sampled Gaussian heatmaps into the intermediate layers of the generator as spatial inductive bias. These heatmaps are then aligned with the emerging attention maps of the discriminator during training through a self-supervised learning process. For complex indoor scenes, a multi-object heatmap sampling and encoding method is introduced. Moreover, the integration of DragGAN with SpatialGAN is proposed for more efficient and flexible manipulations. SpatialGAN enables various spatial manipulations like moving and removing objects by altering the encoded heatmaps. The proposed method consistently improves synthesis quality across different datasets, including LSUN Cat, FFHQ, LSUN Church, and LSUN Bedroom, outperforming the baseline StyleGAN2 and its conference version. The integration of DragGAN with SpatialGAN allows for precise object manipulation while significantly reducing computation time compared to using DragGAN alone. The spatial encoding operation may occasionally result in blurring at heatmap boundaries, impacting the visual quality of manipulations. In some cases, altering one sub-heatmap may unintentionally affect the appearance of distant, unrelated areas, a phenomenon not intended by the design. generative adversarial networks, spatial editing, self-supervision, image synthesis, interpretability
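As a rough illustration of the spatial inductive bias SpatialGAN describes, the sketch below samples one random Gaussian heatmap of the kind that would be encoded into intermediate generator layers; the resolution and bandwidth values are assumptions.

```python
# Hedged sketch: sample a spatial Gaussian heatmap at a random location.
import numpy as np

def sample_gaussian_heatmap(res=64, sigma=8.0, rng=np.random.default_rng()):
    cy, cx = rng.uniform(0, res, size=2)          # random object location
    ys, xs = np.mgrid[0:res, 0:res]
    heatmap = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
    return heatmap.astype(np.float32)             # (res, res), peak near (cy, cx)
```

At inference time, moving the peak (or adding and removing sub-heatmaps) corresponds to moving or removing an object in the generated image.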
2301.07969 Report Fast Inference in Denoising Diffusion Models via MMD Finetuning Emanuele Aiello, Diego Valsesia, Enrico Magli Denoising Diffusion Models (DDMs) have become a popular tool for generating high-quality samples from complex data distributions. These models are able to capture sophisticated patterns and structures in the data, and can generate samples that are highly diverse and representative of the underlying distribution. However, one of the main limitations of diffusion models is the complexity of sample generation, since a large number of inference timesteps is required to faithfully capture the data distribution. In this paper, we present MMD-DDM, a novel method for fast sampling of diffusion models. Our approach is based on the idea of using the Maximum Mean Discrepancy (MMD) to finetune the learned distribution with a given budget of timesteps. This allows the finetuned model to significantly improve the speed-quality trade-off, by substantially increasing fidelity in inference regimes with few steps or, equivalently, by reducing the required number of steps to reach a target fidelity, thus paving the way for a more practical adoption of diffusion models in a wide range of applications. We evaluate our approach on unconditional image generation with extensive experiments across the CIFAR-10, CelebA, ImageNet and LSUN-Church datasets. Our findings show that the proposed method is able to produce high-quality samples in a fraction of the time required by widely-used diffusion models, and outperforms state-of-the-art techniques for accelerated sampling. Code is available at: https://github.com/diegovalsesia/MMD-DDM. Presents MMD-DDM, a technique for fast sampling of diffusion models by finetuning the learned distribution using Maximum Mean Discrepancy (MMD) with a limited number of timesteps. Addresses the limitation of slow sample generation in Denoising Diffusion Models (DDMs) by significantly improving the speed-quality trade-off. Finetunes a pretrained DDM by minimizing the MMD between real and generated samples in a perceptually-relevant feature space (Inception-V3 or CLIP), backpropagating through the sampling chain with reparametrization and gradient checkpointing. Significantly reduces the number of timesteps required to achieve a target fidelity, outperforming state-of-the-art methods. Demonstrates effectiveness on CIFAR-10, CelebA, ImageNet, and LSUN-Church datasets using FID and other metrics. Shows improved visual quality, including sharper details and clarity, especially when using CLIP feature space for MMD. Memory requirements can be high when finetuning over a large number of timesteps. Future work could explore more advanced timestep selection and optimization techniques, as well as integration with conditional DDMs. denoising diffusion models, generative models, fast inference, maximum mean discrepancy, image generation
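The finetuning objective described for MMD-DDM can be sketched as an unbiased MMD² estimate between features of real and generated samples; the RBF kernel and bandwidth below are assumptions (the paper computes MMD in a perceptual feature space such as Inception-V3 or CLIP and backpropagates it through the few-step sampling chain).

```python
# Hedged sketch: unbiased squared Maximum Mean Discrepancy with an RBF kernel.
import torch

def mmd2(feat_real, feat_fake, bandwidth=10.0):
    """feat_real, feat_fake: (N, D) and (M, D) feature vectors of real and generated samples."""
    def rbf(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * bandwidth ** 2))
    k_rr, k_ff, k_rf = rbf(feat_real, feat_real), rbf(feat_fake, feat_fake), rbf(feat_real, feat_fake)
    n, m = feat_real.size(0), feat_fake.size(0)
    return (k_rr.sum() - k_rr.diag().sum()) / (n * (n - 1)) \
         + (k_ff.sum() - k_ff.diag().sum()) / (m * (m - 1)) \
         - 2 * k_rf.mean()
```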
2301.07870 Report Fast-BEV: Towards Real-time On-vehicle Bird's-Eye View Perception Bin Huang, Yangguang Li, Enze Xie, Feng Liang, Luya Wang, Mingzhu Shen, Fenggang Liu, Tianqi Wang, Ping Luo, Jing Shao Recently, the pure camera-based Bird's-Eye-View (BEV) perception removes expensive Lidar sensors, making it a feasible solution for economical autonomous driving. However, most existing BEV solutions either suffer from modest performance or require considerable resources to execute on-vehicle inference. This paper proposes a simple yet effective framework, termed Fast-BEV, which is capable of performing real-time BEV perception on the on-vehicle chips. Towards this goal, we first empirically find that the BEV representation can be sufficiently powerful without expensive view transformation or depth representation. Starting from M2BEV baseline, we further introduce (1) a strong data augmentation strategy for both image and BEV space to avoid over-fitting (2) a multi-frame feature fusion mechanism to leverage the temporal information (3) an optimized deployment-friendly view transformation to speed up the inference. Through experiments, we show Fast-BEV model family achieves considerable accuracy and efficiency on edge. In particular, our M1 model (R18@256x704) can run over 50FPS on the Tesla T4 platform, with 47.0% NDS on the nuScenes validation set. Our largest model (R101@900x1600) establishes a new state-of-the-art 53.5% NDS on the nuScenes validation set. The code is released at: https://github.com/Sense-GVT/Fast-BEV. This paper proposes Fast-BEV, a simple yet effective fully convolutional framework capable of real-time Bird's-Eye-View (BEV) perception on resource-constrained on-vehicle chips. Existing BEV solutions either suffer from limited performance or are too resource-intensive for real-time on-vehicle inference, hindering economical autonomous driving. Building upon the M2BEV baseline, Fast-BEV introduces strong image and BEV augmentation, a multi-frame feature fusion mechanism, and an optimized deployment-friendly view transformation for efficient inference. Fast-BEV achieves state-of-the-art 53.5% NDS on the nuScenes validation set with its largest model (R101@900x1600). The efficient M1 model (R18@256x704) achieves a considerable 47.0% NDS while running over 50FPS on the Tesla T4 platform. The proposed optimized view transformation achieves orders of magnitude speedup on CPU compared to the M2BEV baseline. The study primarily focuses on the nuScenes dataset, potentially limiting generalizability to other datasets. Further investigation into more advanced temporal fusion techniques beyond simple concatenation could potentially yield additional performance gains. Future work will explore the generalization of Fast-BEV to other autonomous driving datasets and investigate more sophisticated temporal modeling techniques. autonomous driving, bev perception, 3d object detection, real-time inference, on-vehicle deployment
2301.07584 Report Joint Representation Learning for Text and 3D Point Cloud Rui Huang, Xuran Pan, Henry Zheng, Haojun Jiang, Zhifeng Xie, Shiji Song, Gao Huang Recent advancements in vision-language pre-training (e.g. CLIP) have shown that vision models can benefit from language supervision. While many models using language modality have achieved great success on 2D vision tasks, the joint representation learning of 3D point cloud with text remains under-explored due to the difficulty of 3D-Text data pair acquisition and the irregularity of 3D data structure. In this paper, we propose a novel Text4Point framework to construct language-guided 3D point cloud models. The key idea is utilizing 2D images as a bridge to connect the point cloud and the language modalities. The proposed Text4Point follows the pre-training and fine-tuning paradigm. During the pre-training stage, we establish the correspondence of images and point clouds based on the readily available RGB-D data and use contrastive learning to align the image and point cloud representations. Together with the well-aligned image and text features achieved by CLIP, the point cloud features are implicitly aligned with the text embeddings. Further, we propose a Text Querying Module to integrate language information into 3D representation learning by querying text embeddings with point cloud features. For fine-tuning, the model learns task-specific 3D representations under informative language guidance from the label set without 2D images. Extensive experiments demonstrate that our model shows consistent improvement on various downstream tasks, such as point cloud semantic segmentation, instance segmentation, and object detection. The code will be available here: https://github.com/LeapLabTHU/Text4Point The paper proposes Text4Point, a novel framework that bridges the gap between text and 3D point cloud representations by leveraging 2D images as an intermediate link, enabling language-guided 3D point cloud models. Joint representation learning of 3D point clouds and text remains under-explored due to challenges in data acquisition and the irregularity of 3D data structures. This work aims to address these challenges and utilize language information for improved 3D representation learning. The method uses contrastive learning to align image and point cloud representations during pre-training, leveraging readily available RGB-D data. A Text Querying Module integrates language information by querying text embeddings with point cloud features, guiding 3D representation learning. Fine-tuning adapts the model for specific tasks with language guidance from label sets. Text4Point achieves state-of-the-art performance on the S3DIS dataset for both semantic and instance segmentation tasks. The method significantly outperforms previous pre-training methods for 3D object detection on SUN RGB-D and ScanNet datasets. Ablation studies confirm the importance of each component, particularly language modality, in enhancing 3D representation learning. The reliance on 2D images as a bridge introduces potential limitations if the image-point cloud correspondence is inaccurate. Further exploration of alternative methods for aligning text and point cloud representations could be beneficial. 3d point cloud, joint representation learning, vision-language pre-training, contrastive learning, text querying
2301.07581 Report Blur Invariants for Image Recognition Jan Flusser, Matej Lebl, Matteo Pedone, Filip Sroubek, Jitka Kostkova Blur is an image degradation that is difficult to remove. Invariants with respect to blur offer an alternative way of a description and recognition of blurred images without any deblurring. In this paper, we present an original unified theory of blur invariants. Unlike all previous attempts, the new theory does not require any prior knowledge of the blur type. The invariants are constructed in the Fourier domain by means of orthogonal projection operators and moment expansion is used for efficient and stable computation. It is shown that all blur invariants published earlier are just particular cases of this approach. Experimental comparison to concurrent approaches shows the advantages of the proposed theory. This paper presents a unified theory of blur invariants for image recognition, using orthogonal projection operators and moment expansion. Blur invariants offer a robust alternative to deblurring for image recognition, and the proposed theory provides a general framework for their construction regardless of the blur type. The method defines blur invariants in the Fourier domain as the ratio of the image's Fourier transform to the Fourier transform of its projection onto the blur subspace. It then utilizes moment expansion in the image domain to enable efficient and stable computation of these invariants. A general theorem (GTBI) for constructing blur invariants for arbitrary blur subspaces closed under convolution and correlation is presented. The completeness theorem demonstrates that the proposed invariants can distinguish any two images belonging to different blur-equivalence classes. A method for calculating blur invariants directly in the image domain using moments is derived, eliminating the need for explicit deconvolution. The theory primarily focuses on linear orthogonal projection operators, while exploring non-linear and non-orthogonal projectors is left for future work. Extending the framework to 3D images and investigating the fusion of blur invariants with deep learning are identified as promising research directions. blur invariants, image recognition, projection operators, moment expansion, image blur
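In our notation (an assumption, not the paper's exact formulation), the Fourier-domain construction described above can be sketched as:

```latex
% \mathcal{F} is the Fourier transform, P_S the orthogonal projection onto the blur subspace S.
I(f)(u) \;=\; \frac{\mathcal{F}(f)(u)}{\mathcal{F}\!\left(P_S f\right)(u)}
```

Roughly, blurring the image by any kernel in S affects numerator and denominator in the same way, so the ratio is unchanged; the moment expansion then allows this ratio to be evaluated stably in the image domain without explicit deconvolution.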
2301.07464 Report CLIPTER: Looking at the Bigger Picture in Scene Text Recognition Aviad Aberdam, David Bensaïd, Alona Golts, Roy Ganz, Oren Nuriel, Royee Tichauer, Shai Mazor, Ron Litman Reading text in real-world scenarios often requires understanding the context surrounding it, especially when dealing with poor-quality text. However, current scene text recognizers are unaware of the bigger picture as they operate on cropped text images. In this study, we harness the representative capabilities of modern vision-language models, such as CLIP, to provide scene-level information to the crop-based recognizer. We achieve this by fusing a rich representation of the entire image, obtained from the vision-language model, with the recognizer word-level features via a gated cross-attention mechanism. This component gradually shifts to the context-enhanced representation, allowing for stable fine-tuning of a pretrained recognizer. We demonstrate the effectiveness of our model-agnostic framework, CLIPTER (CLIP TExt Recognition), on leading text recognition architectures and achieve state-of-the-art results across multiple benchmarks. Furthermore, our analysis highlights improved robustness to out-of-vocabulary words and enhanced generalization in low-data regimes. CLIPTER, a novel framework that integrates image-level context into crop-based scene text recognizers using vision-language models like CLIP. Current scene text recognizers lack scene context as they operate on cropped text images, hindering their performance on poor-quality text where context is crucial. The framework extracts rich image representations using a frozen vision-language model and fuses them with word-level features of the recognizer through a gated cross-attention mechanism. CLIPTER consistently improves the performance of leading text recognizers, including TRBA, ViTSTR, ABINet, and PARSeq, achieving state-of-the-art results on multiple benchmarks. It enhances robustness to out-of-vocabulary words and improves generalization in low-data regimes. End-to-end evaluation demonstrates a marginal latency increase while surpassing the performance of both two-stage and existing end-to-end text spotting methods. The optimal integration point for fusing image and word-level features is architecture-dependent, requiring empirical search. The computational cost of late fusion increases significantly for autoregressive decoders. scene text recognition, vision-language models, clip, contextual information, out-of-vocabulary words
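A minimal PyTorch sketch of a gated cross-attention fusion block of the kind CLIPTER describes for injecting scene-level CLIP features into word-level recognizer features; the dimensions, single-layer design, and zero-initialized gate are assumptions.

```python
# Hedged sketch: word-level features attend over scene-level CLIP tokens, scaled by a learned gate.
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # starts closed, so fine-tuning begins from the pretrained recognizer

    def forward(self, word_feats, scene_feats):
        """word_feats: (B, T, D) recognizer features; scene_feats: (B, S, D) scene-level tokens."""
        ctx, _ = self.attn(word_feats, scene_feats, scene_feats)
        return word_feats + torch.tanh(self.gate) * ctx   # gate gradually opens during training
```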
2301.07389 Report Towards Models that Can See and Read Roy Ganz, Oren Nuriel, Aviad Aberdam, Yair Kittenplon, Shai Mazor, Ron Litman Visual Question Answering (VQA) and Image Captioning (CAP), which are among the most popular vision-language tasks, have analogous scene-text versions that require reasoning from the text in the image. Despite their obvious resemblance, the two are treated independently and, as we show, yield task-specific methods that can either see or read, but not both. In this work, we conduct an in-depth analysis of this phenomenon and propose UniTNT, a Unified Text-Non-Text approach, which grants existing multimodal architectures scene-text understanding capabilities. Specifically, we treat scene-text information as an additional modality, fusing it with any pretrained encoder-decoder-based architecture via designated modules. Thorough experiments reveal that UniTNT leads to the first single model that successfully handles both task types. Moreover, we show that scene-text understanding capabilities can boost vision-language models' performance on general VQA and CAP by up to 2.69% and 0.6 CIDEr, respectively. This paper presents UniTNT, a unified text-non-text model that grants existing multimodal architectures scene-text understanding capabilities, enabling them to excel in both general and scene-text-based VQA and CAP tasks. Existing vision-language models often struggle to reason jointly from visual and scene-text information, limiting their performance on tasks requiring understanding of both. UniTNT introduces scene-text as an additional modality, fusing it with pretrained encoder-decoder architectures via a dedicated encoder and a gated cross-attention-based mechanism. It also employs scene-text-related intermediate supervision to encourage leveraging the added information. UniTNT enables a single model to successfully handle both general and scene-text VQA and CAP tasks, outperforming methods trained separately for each. Integrating scene-text understanding boosts performance on general VQA (e.g., improving BLIP by 2.69% on VQAv2) and CAP (e.g., enhancing BLIP by 0.6 CIDEr on COCO Captions). Analysis reveals the importance of combined training on datasets containing both visual and scene-text elements and highlights the need for benchmarks focusing on questions requiring reasoning over both modalities. The intrinsic tradeoff between TextCaps and COCO Captions, due to their different nature of ground truth captions, requires further investigation. Further research is needed to improve models' performance on challenging questions requiring reasoning over both scene-text and visual information simultaneously. vision-language, scene-text understanding, visual question answering, image captioning, multimodal learning
2301.07301 Report PTA-Det: Point Transformer Associating Point cloud and Image for 3D Object Detection Rui Wan, Tianyun Zhao, Wei Zhao In autonomous driving, 3D object detection based on multi-modal data has become an indispensable approach when facing complex environments around the vehicle. During multi-modal detection, LiDAR and camera are simultaneously applied for capturing and modeling. However, due to the intrinsic discrepancies between the LiDAR point and camera image, the fusion of the data for object detection encounters a series of problems. Most multi-modal detection methods perform even worse than LiDAR-only methods. In this investigation, we propose a method named PTA-Det to improve the performance of multi-modal detection. Accompanied by PTA-Det, a Pseudo Point Cloud Generation Network is proposed, which can convert image information, including texture and semantic features, into pseudo points. Thereafter, through a transformer-based Point Fusion Transition (PFT) module, the features of LiDAR points and pseudo points from the image can be deeply fused under a unified point-based representation. The combination of these modules can conquer the major obstacle in feature fusion across modalities and realizes a complementary and discriminative representation for proposal generation. Extensive experiments on the KITTI dataset show that PTA-Det achieves a competitive result and support its effectiveness. This paper proposes PTA-Det, a multi-modal 3D object detection method that leverages pseudo points as an intermediate modality between images and point clouds. Multi-modal 3D object detection is crucial for autonomous driving, but existing methods struggle to effectively fuse image and point cloud data. PTA-Det employs a Pseudo Point Cloud Generation Network to transform images into pseudo points representing image features. A two-stream transformer-based network learns intra- and inter-modal features. A Point Fusion Transition (PFT) module fuses features across modalities. PTA-Det achieves competitive results on the KITTI dataset, particularly for car detection (77.88% mAP). The method outperforms LiDAR-only methods with the same number of input points. Ablation studies demonstrate the effectiveness of each module, especially the PFT module and the keypoint-based pseudo point sampling strategy. PTA-Det's performance on smaller objects like cyclists and pedestrians is hindered by the reduced number of input points due to transformer memory constraints. Future work includes optimizing the model for efficiency and exploring data augmentation techniques for multi-modal scenarios. 3d object detection, multi-modal fusion, point cloud, pseudo point cloud, transformer
2301.07093 Report GLIGEN: Open-Set Grounded Text-to-Image Generation Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, Yong Jae Lee Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated mechanism. Our model achieves open-world grounded text2img generation with caption and bounding box condition inputs, and the grounding ability generalizes well to novel spatial configurations and concepts. GLIGEN's zero-shot performance on COCO and LVIS outperforms that of existing supervised layout-to-image baselines by a large margin. This paper introduces GLIGEN, a novel method that incorporates grounding inputs into pre-trained text-to-image diffusion models, enabling versatile controllability over object placement, style, and composition. Existing text-to-image generation models often lack precise controllability beyond textual descriptions. GLIGEN addresses this by enabling grounding with bounding boxes, keypoints, reference images, and spatially-aligned maps. GLIGEN freezes pre-trained model weights and introduces trainable gated Transformer layers to inject grounding information. This approach preserves existing knowledge while enabling the integration of new conditions like bounding boxes, image prompts, keypoints, and spatially-aligned maps. Achieves open-world grounded text-to-image generation, enabling the synthesis of novel localized concepts. Outperforms existing supervised layout-to-image generation baselines on COCO and LVIS benchmarks, highlighting the effectiveness of building upon large pre-trained generative models. Demonstrates generalization to unseen object categories and diverse grounding conditions. The generalization ability for keypoint grounding across different object categories is limited. Further research can explore incorporating more complex grounding conditions and refining the interplay between text and grounding inputs. text-to-image generation, diffusion models, grounding, open-world learning, controllable image synthesis
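A minimal PyTorch sketch of the gated injection mechanism described for GLIGEN: a new trainable attention layer over the concatenation of visual and grounding tokens, scaled by a zero-initialized gate so the frozen pretrained behavior is preserved at the start of training. Token shapes and attention settings are assumptions.

```python
# Hedged sketch: gated attention layer that injects grounding tokens into a frozen block's visual tokens.
import torch
import torch.nn as nn

class GatedGroundingLayer(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gamma = nn.Parameter(torch.zeros(1))   # zero gate: identical to the pretrained model at initialization

    def forward(self, visual_tokens, grounding_tokens):
        """visual_tokens: (B, N, D); grounding_tokens: (B, M, D), e.g. from box + phrase embeddings."""
        x = torch.cat([visual_tokens, grounding_tokens], dim=1)
        out, _ = self.attn(x, x, x)
        return visual_tokens + torch.tanh(self.gamma) * out[:, : visual_tokens.size(1)]
```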
2301.06958 Report RILS: Masked Visual Reconstruction in Language Semantic Space Shusheng Yang, Yixiao Ge, Kun Yi, Dian Li, Ying Shan, Xiaohu Qie, Xinggang Wang Both masked image modeling (MIM) and natural language supervision have facilitated the progress of transferable visual pre-training. In this work, we seek the synergy between two paradigms and study the emerging properties when MIM meets natural language supervision. To this end, we present a novel masked visual Reconstruction In Language semantic Space (RILS) pre-training framework, in which sentence representations, encoded by the text encoder, serve as prototypes to transform the vision-only signals into patch-sentence probabilities as semantically meaningful MIM reconstruction targets. The vision models can therefore capture useful components with structured information by predicting proper semantic of masked tokens. Better visual representations could, in turn, improve the text encoder via the image-text alignment objective, which is essential for the effective MIM target transformation. Extensive experimental results demonstrate that our method not only enjoys the best of previous MIM and CLIP but also achieves further improvements on various tasks due to their mutual benefits. RILS exhibits advanced transferability on downstream classification, detection, and segmentation, especially for low-shot regimes. Code will be made available at https://github.com/hustvl/RILS. This paper introduces RILS, a novel masked visual Reconstruction In Language semantic Space pre-training framework, that combines the strengths of masked image modeling (MIM) and natural language supervision for improved visual representation learning. This work addresses the limitations of independently using MIM or natural language supervision, aiming to leverage their synergistic potential for more transferable and scalable visual pre-training. RILS utilizes a dual-encoder architecture (vision and language) with an asymmetric encoder-decoder design for the vision model. It leverages text representations as prototypes to map masked visual tokens into probability distributions within the language semantic space. RILS employs both image-text contrastive loss and a novel masked visual reconstruction loss in the language space. RILS achieves state-of-the-art performance on various downstream tasks, including image classification, object detection, and semantic segmentation. RILS exhibits strong transferability, particularly in low-shot learning scenarios, demonstrating its ability to learn from limited labeled data. RILS shows superior zero-shot image classification and image-text retrieval performance, highlighting its capacity to capture rich semantic information. The paper acknowledges the potential for further scaling up RILS in terms of both model and data size. Future work could explore incorporating more sophisticated techniques like multi-crop augmentation and exponential moving average (EMA) into RILS for potential performance gains. masked image modeling, natural language supervision, vision-language pre-training, transfer learning, zero-shot learning
2301.06871 Report Denoising Diffusion Probabilistic Models as a Defense against Adversarial Attacks Lars Lien Ankile, Anna Midgley, Sebastian Weisshaar Neural Networks are infamously sensitive to small perturbations in their inputs, making them vulnerable to adversarial attacks. This project evaluates the performance of Denoising Diffusion Probabilistic Models (DDPM) as a purification technique to defend against adversarial attacks. This works by adding noise to an adversarial example before removing it through the reverse process of the diffusion model. We evaluate the approach on the PatchCamelyon data set for histopathologic scans of lymph node sections and find an improvement of the robust accuracy by up to 88% of the original model's accuracy, constituting a considerable improvement over the vanilla model and our baselines. The project code is located at https://github.com/ankile/Adversarial-Diffusion. This paper evaluates the effectiveness of Denoising Diffusion Probabilistic Models (DDPMs) as a purification technique to defend against adversarial attacks on image classification models, particularly in histopathology. Robustness against adversarial attacks is crucial for reliable deployment of deep learning models, especially in sensitive domains like medical image analysis, where small perturbations can lead to misdiagnosis. The study uses a DDPM to add noise to adversarial examples generated from the PatchCamelyon histopathology dataset and then removes the noise through the reverse diffusion process, aiming to purify the image and allow for correct classification by a ResNet model. The approach is compared to baseline methods including Gaussian noise addition and adversarial training. DDPM purification significantly improves robust accuracy against adversarial attacks, recovering up to 88% of the original model's accuracy on adversarial examples. The method outperforms baseline defenses, including simple noise addition and adversarial training. The study identifies a trade-off between standard accuracy and robust accuracy for adversarial training, highlighting the challenge of balancing performance on clean and adversarial data. The chosen noise level, while enabling faster inference, could be further optimized to potentially achieve even better robust accuracy. The study focuses on a specific dataset (PatchCamelyon) and a single attack type, and future work could explore generalizability to other datasets, attack strategies, and medical imaging modalities. adversarial attacks, diffusion models, image classification, histopathology, robustness
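A hedged sketch of the purification loop described above: partially noise the (possibly adversarial) input, then run the reverse diffusion back to t = 0 before classifying. `ddpm.q_sample` and `ddpm.p_sample` are hypothetical helpers standing in for a trained DDPM's forward-noising and one-step reverse calls; the choice of t_star trades purification strength against inference time.

```python
# Hedged sketch: diffusion purification of an adversarial batch before classification.
import torch

def purify(x_adv, ddpm, t_star=100):
    t = torch.full((x_adv.size(0),), t_star, device=x_adv.device, dtype=torch.long)
    x_t = ddpm.q_sample(x_adv, t)                 # forward process: drown the adversarial perturbation in noise
    for step in reversed(range(t_star)):          # reverse process: denoise back to a clean-looking image
        ts = torch.full_like(t, step)
        x_t = ddpm.p_sample(x_t, ts)
    return x_t                                    # purified image, passed to the downstream classifier
```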
2301.06782 Report A Large-Scale Outdoor Multi-modal Dataset and Benchmark for Novel View Synthesis and Implicit Scene Reconstruction Chongshan Lu, Fukun Yin, Xin Chen, Tao Chen, Gang YU, Jiayuan Fan Neural Radiance Fields (NeRF) has achieved impressive results in single object scene reconstruction and novel view synthesis, which have been demonstrated on many single modality and single object focused indoor scene datasets like DTU, BMVS, and NeRF Synthetic. However, the study of NeRF on large-scale outdoor scene reconstruction is still limited, as there is no unified outdoor scene dataset for large-scale NeRF evaluation due to expensive data acquisition and calibration costs. In this paper, we propose a large-scale outdoor multi-modal dataset, OMMO dataset, containing complex land objects and scenes with calibrated images, point clouds and prompt annotations. Meanwhile, a new benchmark for several outdoor NeRF-based tasks is established, such as novel view synthesis, surface reconstruction, and multi-modal NeRF. To create the dataset, we capture and collect a large number of real fly-view videos and select high-quality and high-resolution clips from them. Then we design a quality review module to refine images, remove low-quality frames and fail-to-calibrate scenes through a learning-based automatic evaluation plus manual review. Finally, a number of volunteers are employed to add the text descriptions for each scene and key-frame to meet the potential multi-modal requirements in the future. Compared with existing NeRF datasets, our dataset contains abundant real-world urban and natural scenes with various scales, camera trajectories, and lighting conditions. Experiments show that our dataset can benchmark most state-of-the-art NeRF methods on different tasks. We will release the dataset and model weights very soon. The paper introduces OMMO, a large-scale outdoor multi-modal dataset for benchmarking Neural Radiance Fields (NeRF) in complex outdoor scenes. Existing NeRF datasets are limited to single objects, indoor scenes, or small-scale outdoor environments, hindering the development and evaluation of NeRF methods for large-scale outdoor scene reconstruction. The OMMO dataset is created by collecting and curating real fly-view videos from various sources, including YouTube and drone captures. It includes quality review, manual annotation, and scene representation generation using methods like Mega-NeRF and Colmap. OMMO dataset contains 33 diverse outdoor scenes with over 14K calibrated images, surpassing existing datasets in quantity and diversity. Benchmarks for novel view synthesis demonstrate the dataset's ability to support various NeRF methods, with Mip-NeRF 360 showing superior performance. Analysis of scene representation benchmarks highlights the challenges in reconstructing large-scale outdoor scenes with high fidelity, indicating an area for future research. The dataset currently has a limited number of scenes with low-light, rain, and fog conditions. Future work involves expanding the dataset with more diverse scenes and exploring advanced reconstruction methods. neural radiance fields, nerf, outdoor dataset, novel view synthesis, scene representation
2301.06281 Report DPE: Disentanglement of Pose and Expression for General Video Portrait Editing Youxin Pang, Yong Zhang, Weize Quan, Yanbo Fan, Xiaodong Cun, Ying Shan, Dong-ming Yan One-shot video-driven talking face generation aims at producing a synthetic talking video by transferring the facial motion from a video to an arbitrary portrait image. Head pose and facial expression are always entangled in facial motion and transferred simultaneously. However, the entanglement sets up a barrier for these methods to be used in video portrait editing directly, where it may be necessary to modify the expression only while maintaining the pose unchanged. One challenge of decoupling pose and expression is the lack of paired data, such as the same pose but different expressions. Only a few methods attempt to tackle this challenge with the aid of 3D Morphable Models (3DMMs) for explicit disentanglement. But 3DMMs are not accurate enough to capture facial details due to the limited number of Blendshapes, which has side effects on motion transfer. In this paper, we introduce a novel self-supervised disentanglement framework to decouple pose and expression without 3DMMs and paired data, which consists of a motion editing module, a pose generator, and an expression generator. The editing module projects faces into a latent space where pose motion and expression motion can be disentangled, and the pose or expression transfer can be performed in the latent space conveniently via addition. The two generators render the modified latent codes to images, respectively. Moreover, to guarantee the disentanglement, we propose a bidirectional cyclic training strategy with well-designed constraints. Evaluations demonstrate our method can control pose or expression independently and be used for general video editing. This paper presents a novel self-supervised disentanglement framework for decoupling pose and expression in facial motion for general video portrait editing. Existing one-shot talking face generation methods struggle to transfer pose and expression independently due to their entanglement, limiting their application in video editing tasks where modifying expression while maintaining pose is crucial. The framework consists of a motion editing module, a pose generator, and an expression generator. It employs a bidirectional cyclic training strategy with self-reconstruction constraints to disentangle pose and expression motion in a latent space, enabling independent editing via addition. The method achieves independent control over pose and expression, enabling seamless video portrait editing by pasting edited faces back into original videos. It outperforms 3DMM-based methods in preserving facial expression details and identity during editing. The method shows comparable performance to state-of-the-art one-shot talking face generation approaches. Pose preservation performance is slightly worse than the 3DMM-based method, PIRender. Expression motion, being more local and subtle, is harder to learn than pose motion, demanding further refinement. talking face generation, video portrait editing, disentanglement, self-supervised learning, motion transfer
2301.06267 Report Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models Zhiqiu Lin, Samuel Yu, Zhiyi Kuang, Deepak Pathak, Deva Ramanan The ability to quickly learn a new task with minimal instruction - known as few-shot learning - is a central aspect of intelligent agents. Classical few-shot benchmarks make use of few-shot samples from a single modality, but such samples may not be sufficient to characterize an entire concept class. In contrast, humans use cross-modal information to learn new concepts efficiently. In this work, we demonstrate that one can indeed build a better visual dog classifier by reading about dogs and listening to them bark. To do so, we exploit the fact that recent multimodal foundation models such as CLIP are inherently cross-modal, mapping different modalities to the same representation space. Specifically, we propose a simple cross-modal adaptation approach that learns from few-shot examples spanning different modalities. By repurposing class names as additional one-shot training samples, we achieve SOTA results with an embarrassingly simple linear classifier for vision-language adaptation. Furthermore, we show that our approach can benefit existing methods such as prefix tuning, adapters, and classifier ensembling. Finally, to explore other modalities beyond vision and language, we construct the first (to our knowledge) audiovisual few-shot benchmark and use cross-modal training to improve the performance of both image and audio classification. This paper proposes a cross-modal adaptation approach for few-shot learning, leveraging multimodal models like CLIP. It utilizes examples from different modalities (e.g., text labels, audio) as additional training samples, effectively converting an "n-shot" problem into an "(n+1)-shot" problem. The method addresses the ambiguity inherent in traditional few-shot learning by incorporating cross-modal information, mimicking how humans utilize multiple senses for concept learning. The approach leverages pre-trained multimodal models (CLIP and AudioCLIP) that map different modalities to the same representation space. It treats examples from additional modalities as supplementary training data, jointly optimizing a linear classifier with data from all modalities. Cross-modal adaptation achieves state-of-the-art results on 11 image classification benchmarks using a simple linear classifier. The method improves the performance of existing few-shot adaptation methods like prompting, adapters, and robust finetuning. Audiovisual adaptation is explored, showing improvement in both image and audio classification tasks on a newly created ImageNet-ESC benchmark. Cross-modal adaptation may be less effective when model representations are not well-aligned or sufficiently trained (e.g. limited audio data). The paper primarily focuses on uni-modal inference tasks (e.g. image classification), leaving exploration of multimodal test sets (e.g. video) for future work. cross-modal learning, few-shot learning, multimodal models, clip, audioclip
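A minimal sketch of the cross-modal adaptation recipe described above, using the open-source CLIP package: class-name text embeddings are appended to the few-shot image embeddings as extra training samples for a single linear classifier in the shared embedding space. The prompt template, preprocessing, and normalization choices are assumptions.

```python
# Hedged sketch: build a cross-modal training set (image shots + class-name text "shots") for a linear probe.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

def build_cross_modal_trainset(few_shot_images, few_shot_labels, class_names):
    """few_shot_images: (N, 3, 224, 224) already preprocessed; few_shot_labels: (N,) long tensor."""
    with torch.no_grad():
        img_feats = clip_model.encode_image(few_shot_images.to(device)).float()
        tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
        txt_feats = clip_model.encode_text(tokens).float()
    feats = torch.cat([img_feats, txt_feats], dim=0)             # text rows act as one extra "shot" per class
    labels = torch.cat([few_shot_labels.to(device),
                        torch.arange(len(class_names), device=device)])
    feats = feats / feats.norm(dim=-1, keepdim=True)             # L2-normalize before the linear probe
    return feats, labels                                          # fit e.g. a logistic-regression head on these
```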
2301.06052 Report T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Shaoli Huang, Yong Zhang, Hongwei Zhao, Hongtao Lu, Xi Shen In this work, we investigate a simple and well-known conditional generative framework based on Vector Quantised-Variational AutoEncoder (VQ-VAE) and Generative Pre-trained Transformer (GPT) for human motion generation from textual descriptions. We show that a simple CNN-based VQ-VAE with commonly used training recipes (EMA and Code Reset) allows us to obtain high-quality discrete representations. For GPT, we incorporate a simple corruption strategy during the training to alleviate training-testing discrepancy. Despite its simplicity, our T2M-GPT shows better performance than competitive approaches, including recent diffusion-based approaches. For example, on HumanML3D, which is currently the largest dataset, we achieve comparable performance on the consistency between text and generated motion (R-Precision), but with an FID of 0.116, largely outperforming MotionDiffuse at 0.630. Additionally, we conduct analyses on HumanML3D and observe that the dataset size is a limitation of our approach. Our work suggests that VQ-VAE still remains a competitive approach for human motion generation. This paper proposes T2M-GPT, a simple yet effective two-stage text-to-motion generation framework based on VQ-VAE and GPT that utilizes discrete representations. Generating human motion from text is crucial for various applications like gaming and animation but remains challenging due to the modality gap between language and motion. The framework first learns a mapping between motion data and discrete code sequences using a CNN-based VQ-VAE with EMA and Code Reset techniques. Subsequently, a GPT-like model generates code indices from text embeddings, which are then decoded into motions. T2M-GPT achieves state-of-the-art performance on HumanML3D and KIT-ML datasets, showing competitive or superior results compared to diffusion-based methods. The study demonstrates that VQ-VAE with proper training strategies remains a strong approach for motion generation. Analysis suggests that model performance can be further improved with larger datasets. The model may miss details in excessively long text descriptions. Generated motions can exhibit slight jittering, requiring further refinement in VQ-VAE architecture. text-to-motion generation, vq-vae, gpt, human motion synthesis, discrete representation learning
2301.06018 Report CMAE-V: Contrastive Masked Autoencoders for Video Action Recognition Cheng-Ze Lu, Xiaojie Jin, Zhicheng Huang, Qibin Hou, Ming-Ming Cheng, Jiashi Feng Contrastive Masked Autoencoder (CMAE), as a new self-supervised framework, has shown its potential of learning expressive feature representations in visual image recognition. This work shows that CMAE also trivially generalizes well on video action recognition without modifying the architecture and the loss criterion. By directly replacing the original pixel shift with the temporal shift, our CMAE for visual action recognition, CMAE-V for short, can generate stronger feature representations than its counterpart based on pure masked autoencoders. Notably, CMAE-V, with a hybrid architecture, can achieve 82.2% and 71.6% top-1 accuracy on the Kinetics-400 and Something-something V2 datasets, respectively. We hope this report could provide some informative inspiration for future works. This paper proposes CMAE-V, a novel approach for video action recognition leveraging the Contrastive Masked Autoencoder (CMAE) framework. The authors demonstrate that CMAE can be effectively adapted for video action recognition without architectural modifications or loss function adjustments, achieving strong performance in self-supervised representation learning. The key innovation lies in replacing the spatial pixel shift in the original CMAE with a temporal shift for generating correlated augmented views. This adaptation effectively captures temporal correlations in videos, enabling the model to learn temporally invariant and semantically meaningful representations. CMAE-V achieves state-of-the-art performance on Kinetics-400 and Something-Something V2 datasets, outperforming previous self-supervised methods. Replacing the vanilla ViT encoder with a hybrid convolutional ViT further boosts CMAE-V's performance, setting new benchmarks for both datasets. The results highlight the effectiveness of incorporating contrastive learning within a masked autoencoder framework for video action recognition. The paper acknowledges the potential limitation of using a relatively simple temporal shift augmentation strategy and suggests exploring more sophisticated augmentation techniques. Future work could focus on extending CMAE-V to other video understanding tasks beyond action recognition. video action recognition, contrastive learning, masked autoencoders, self-supervised learning, computer vision
2301.06015 Report Diffusion-based Generation, Optimization, and Planning in 3D Scenes Siyuan Huang, Zan Wang, Puhao Li, Baoxiong Jia, Tengyu Liu, Yixin Zhu, Wei Liang, Song-Chun Zhu We introduce SceneDiffuser, a conditional generative model for 3D scene understanding. SceneDiffuser provides a unified model for solving scene-conditioned generation, optimization, and planning. In contrast to prior works, SceneDiffuser is intrinsically scene-aware, physics-based, and goal-oriented. With an iterative sampling strategy, SceneDiffuser jointly formulates the scene-aware generation, physics-based optimization, and goal-oriented planning via a diffusion-based denoising process in a fully differentiable fashion. Such a design alleviates the discrepancies among different modules and the posterior collapse of previous scene-conditioned generative models. We evaluate SceneDiffuser with various 3D scene understanding tasks, including human pose and motion generation, dexterous grasp generation, path planning for 3D navigation, and motion planning for robot arms. The results show significant improvements compared with previous models, demonstrating the tremendous potential of SceneDiffuser for the broad community of 3D scene understanding. This paper proposes SceneDiffuser, a novel conditional generative model for 3D scene understanding that unifies scene-conditioned generation, physics-based optimization, and goal-oriented planning. Existing methods suffer from posterior collapse in generation and lack a unified framework to address discrepancies among generation, optimization, and planning. SceneDiffuser leverages a diffusion model for scene-conditioned generation and integrates physics-based and goal-oriented objectives into an iterative guided-sampling framework. SceneDiffuser generates significantly more plausible human poses and motions in 3D scenes compared to CVAE-based methods. It generates diverse and successful dexterous grasps for unseen objects, outperforming baselines in success rate and collision avoidance. It demonstrates superior performance in path planning for 3D navigation and motion planning for robot arms, exhibiting better generalization and efficiency in long-horizon tasks. SceneDiffuser suffers from slow training and test speed compared to non-diffusion-based generative models. The optimization and planning modules heavily rely on objective design, demanding significant effort in hyperparameter tuning. 3d scene understanding, conditional generation, motion planning, diffusion models, physics-based optimization
2301.05957 Report Towards Spatial Equilibrium Object Detection Zhaohui Zheng, Yuming Chen, Qibin Hou, Xiang Li, Ming-Ming Cheng Semantic objects are unevenly distributed over images. In this paper, we study the spatial disequilibrium problem of modern object detectors and propose to quantify this "spatial bias" by measuring the detection performance over zones. Our analysis surprisingly shows that the spatial imbalance of objects has a great impact on the detection performance, limiting the robustness of detection applications. This motivates us to design a more generalized measurement, termed Spatial equilibrium Precision (SP), to better characterize the detection performance of object detectors. Furthermore, we also present a spatial equilibrium label assignment (SELA) to alleviate the spatial disequilibrium problem by injecting the prior spatial weight into the optimization process of detectors. Extensive experiments on PASCAL VOC, MS COCO, and 3 application datasets on face mask/fruit/helmet images demonstrate the advantages of our method. Our findings challenge the conventional sense of object detectors and show the indispensability of spatial equilibrium. We hope these discoveries would stimulate the community to rethink how an excellent object detector should be. All the source code, evaluation protocols, and the tutorials are publicly available at https://github.com/Zzh-tju/ZoneEval This paper reveals the spatial disequilibrium problem in object detection, where detectors perform inconsistently across different image zones due to photographer's bias in datasets (objects concentrated in the center). Traditional metrics like Average Precision (AP) are inflated by central object performance, masking poor detection in outer regions. This spatial bias limits robustness in real-world applications. The paper introduces zone evaluation, dividing images into zones and calculating metrics (ZP) for each. It proposes Spatial equilibrium Precision (SP), weighted by zone area, for a more comprehensive assessment. Significant performance gaps exist between central and outer zones across various detectors and datasets. Traditional AP is inflated, failing to reflect poor outer zone detection. Even excluding a small central area drastically lowers performance. Proposed Spatial Equilibrium Label Assignment (SELA) improves SP and reduces performance variance across zones by re-balancing sampling. Current zone division is preliminary, exploring other designs is needed. SELA is a first step, other solutions like data augmentation or skew spatial weights warrant investigation. object detection, spatial bias, zone evaluation, spatial equilibrium precision (sp), spatial equilibrium label assignment (sela)
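A minimal sketch of an area-weighted Spatial equilibrium Precision consistent with the description above; the zone layout and the exact weighting scheme are assumptions.

```python
# Hedged sketch: combine per-zone AP values into a single area-weighted score.
def spatial_equilibrium_precision(zone_aps, zone_area_fractions):
    """zone_aps: per-zone AP values; zone_area_fractions: each zone's share of image area (sums to 1)."""
    assert abs(sum(zone_area_fractions) - 1.0) < 1e-6
    return sum(ap * w for ap, w in zip(zone_aps, zone_area_fractions))

# e.g. spatial_equilibrium_precision([0.42, 0.38, 0.30], [0.25, 0.35, 0.40]) -> 0.358
```

Under this weighting, a detector that only performs well in the small central zone scores lower than one with uniform performance, which is the behavior the SP metric is meant to reward.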
2301.05586 Report YOLOv6 v3.0: A Full-Scale Reloading Chuyi Li, Lulu Li, Yifei Geng, Hongliang Jiang, Meng Cheng, Bo Zhang, Zaidan Ke, Xiaoming Xu, Xiangxiang Chu The YOLO community has been in high spirits since our first two releases! By the advent of Chinese New Year 2023, which sees the Year of the Rabbit, we refurnish YOLOv6 with numerous novel enhancements on the network architecture and the training scheme. This release is identified as YOLOv6 v3.0. For a glimpse of performance, our YOLOv6-N hits 37.5% AP on the COCO dataset at a throughput of 1187 FPS tested with an NVIDIA Tesla T4 GPU. YOLOv6-S strikes 45.0% AP at 484 FPS, outperforming other mainstream detectors at the same scale (YOLOv5-S, YOLOv8-S, YOLOX-S and PPYOLOE-S). Whereas, YOLOv6-M/L also achieve better accuracy performance (50.0%/52.8% respectively) than other detectors at a similar inference speed. Additionally, with an extended backbone and neck design, our YOLOv6-L6 achieves the state-of-the-art accuracy in real-time. Extensive experiments are carefully conducted to validate the effectiveness of each improving component. Our code is made available at https://github.com/meituan/YOLOv6. This paper introduces YOLOv6 v3.0, an enhanced object detection framework with improvements to network architecture and training schemes. The YOLO series is important for real-time object detection in industrial applications due to its balance between speed and accuracy. This work aims to push the boundaries of real-time object detection performance. The authors propose: 1) Bi-directional Concatenation (BiC) module and SimCSPSPPF block for improved neck design. 2) Anchor-aided training (AAT) strategy to leverage benefits of both anchor-based and anchor-free paradigms. 3) Extended backbone and neck with an extra stage for high-resolution inputs. 4) New self-distillation strategies for small and large models. YOLOv6-N achieves 37.5% AP at 1187 FPS on COCO, outperforming peers at the same scale. YOLOv6-L6 achieves state-of-the-art accuracy in real-time object detection. Extensive experiments validate the effectiveness of individual components like BiC, AAT, and self-distillation. The paper primarily focuses on speed and accuracy, with limited discussion on other aspects like model robustness. Future work could explore adapting YOLOv6 for specific tasks and datasets beyond COCO. object detection, yolo, real-time, deep learning, computer vision
2301.05499 Report CLIP the Gap: A Single Domain Generalization Approach for Object Detection Vidit Vidit, Martin Engilberge, Mathieu Salzmann Single Domain Generalization (SDG) tackles the problem of training a model on a single source domain so that it generalizes to any unseen target domain. While this has been well studied for image classification, the literature on SDG object detection remains almost non-existent. To address the challenges of simultaneously learning robust object localization and representation, we propose to leverage a pre-trained vision-language model to introduce semantic domain concepts via textual prompts. We achieve this via a semantic augmentation strategy acting on the features extracted by the detector backbone, as well as a text-based classification loss. Our experiments evidence the benefits of our approach, outperforming by 10% the only existing SDG object detection method, Single-DGOD [49], on their own diverse weather-driving benchmark. This paper proposes a novel single domain generalization approach for object detection that leverages a pre-trained vision-language model (CLIP) to introduce semantic domain concepts via textual prompts. SDG for object detection is a nascent topic and poses additional challenges compared to image classification due to the need for robust object localization. The method uses textual prompts related to potential target domain concepts to perform semantic augmentations on image features extracted by the detector backbone. It also incorporates a text-based classification loss during training to further leverage the vision-language model. Outperforms the only existing SDG object detection method, Single-DGOD, by 10% on a diverse weather-driving benchmark. Demonstrates consistent improvements across various target domains including day-foggy, night-clear, dusk-rainy, and night-rainy. Shows the effectiveness of semantic augmentation with relevant prompts compared to random or no augmentation. The method assumes some prior knowledge about the potential domain gap to generate relevant textual prompts. Future work includes exploring techniques to learn the prompts automatically, further enhancing generalization capabilities. single domain generalization, object detection, vision-language models, clip, semantic augmentation
2301.05496 Report Learning Transformations To Reduce the Geometric Shift in Object Detection Vidit Vidit, Martin Engilberge, Mathieu Salzmann The performance of modern object detectors drops when the test distribution differs from the training one. Most of the methods that address this focus on object appearance changes caused by, e.g., different illumination conditions, or gaps between synthetic and real images. Here, by contrast, we tackle geometric shifts emerging from variations in the image capture process, or due to the constraints of the environment causing differences in the apparent geometry of the content itself. We introduce a self-training approach that learns a set of geometric transformations to minimize these shifts without leveraging any labeled data in the new domain, nor any information about the cameras. We evaluate our method on two different shifts, i.e., a camera's field of view (FoV) change and a viewpoint change. Our results evidence that learning geometric transformations helps detectors to perform better in the target domains. This paper proposes a self-training approach to handle geometric shifts in object detection for unsupervised domain adaptation. The method learns a set of geometric transformations, modeled as multiple homographies, to reduce the domain gap without requiring labeled target data or camera information. Object detectors often suffer performance drops when tested on data with geometric differences from the training set. Existing domain adaptation methods mainly focus on appearance changes but neglect geometric shifts caused by variations in camera viewpoint, field-of-view, or object scale. The proposed method uses an aggregator block within a FasterRCNN architecture to combine features from multiple homography-transformed images. The model is trained in three steps: source-only detector training, aggregator training on source data with random transformations, and joint optimization of transformations using a Mean Teacher strategy on source and unlabeled target data. The approach achieves state-of-the-art results on car detection for field-of-view adaptation between Cityscapes and KITTI datasets, outperforming methods relying on camera information. It generalizes well to different degrees of field-of-view changes, demonstrating consistent improvement over baselines. The method effectively handles viewpoint adaptation for pedestrian detection, showing its applicability to diverse geometric shifts. The computational cost increases with the number of homographies used. Current implementation optimizes transformations at the dataset level; image-specific adaptation could be beneficial. unsupervised domain adaptation, object detection, geometric shift, homography, mean teacher
2301.05225 Report Domain Expansion of Image Generators Yotam Nitzan, Michaël Gharbi, Richard Zhang, Taesung Park, Jun-Yan Zhu, Daniel Cohen-Or, Eli Shechtman Can one inject new concepts into an already trained generative model, while respecting its existing structure and knowledge? We propose a new task - domain expansion - to address this. Given a pretrained generator and novel (but related) domains, we expand the generator to jointly model all domains, old and new, harmoniously. First, we note the generator contains a meaningful, pretrained latent space. Is it possible to minimally perturb this hard-earned representation, while maximally representing the new domains? Interestingly, we find that the latent space offers unused, "dormant" directions, which do not affect the output. This provides an opportunity: By "repurposing" these directions, we can represent new domains without perturbing the original representation. In fact, we find that pretrained generators have the capacity to add several - even hundreds - of new domains! Using our expansion method, one "expanded" model can supersede numerous domain-specific models, without expanding the model size. Additionally, a single expanded generator natively supports smooth transitions between domains, as well as composition of domains. Code and project page available at https://yotamnitzan.github.io/domain-expansion/. Introduces domain expansion, a novel task aiming to augment the image generation space of a pre-trained model with new, related domains, without overriding its original generation capabilities. Addresses the limitations of domain adaptation, which typically erases the original domain after adapting to a new one. Domain expansion allows a single generator to model multiple domains harmoniously, enabling applications like domain composition and fine-grained control over image generation. Structures the latent space by identifying dormant directions that don't affect image generation. These directions are repurposed to represent new domains by applying domain adaptation methods only to specific subspaces associated with those domains. Regularization techniques ensure the preservation of the original domain. Repurposed latent directions successfully encode new domains, enabling smooth transitions and extrapolations beyond trained concepts. Expanded generator maintains high-quality generation for both original and new domains, comparable to domain-specific generators. Disentangled representation allows for meaningful composition of multiple domains, even those learned from different adaptation tasks. The number of domains that can be expanded might be ultimately limited by model capacity. Current method requires roughly linear increase in training time with the number of domains. generative models, domain adaptation, latent space, disentanglement, compositionality
2301.05187 Report WIRE: Wavelet Implicit Neural Representations Vishwanath Saragadam, Daniel LeJeune, Jasper Tan, Guha Balakrishnan, Ashok Veeraraghavan, Richard G. Baraniuk Implicit neural representations (INRs) have recently advanced numerous vision-related areas. INR performance depends strongly on the choice of the nonlinear activation function employed in its multilayer perceptron (MLP) network. A wide range of nonlinearities have been explored, but, unfortunately, current INRs designed to have high accuracy also suffer from poor robustness (to signal noise, parameter variation, etc.). Inspired by harmonic analysis, we develop a new, highly accurate and robust INR that does not exhibit this tradeoff. Wavelet Implicit neural REpresentation (WIRE) uses a continuous complex Gabor wavelet activation function that is well-known to be optimally concentrated in space-frequency and to have excellent biases for representing images. A wide range of experiments (image denoising, image inpainting, super-resolution, computed tomography reconstruction, image overfitting, and novel view synthesis with neural radiance fields) demonstrate that WIRE defines the new state of the art in INR accuracy, training time, and robustness. Introduces WIRE (Wavelet Implicit neural REpresentation), a novel INR employing a continuous complex Gabor wavelet activation function for superior signal representation. Addresses limitations of existing INRs, such as lack of robustness to noise, slow training times, and limitations in accuracy for fine details, especially in high-dimensional data. Leverages the optimal space-frequency concentration of Gabor wavelets, offering advantages over purely sinusoidal (SIREN) or Gaussian activations, with a focus on visual signal representation. WIRE excels in diverse applications including image denoising/inpainting, super-resolution, CT reconstruction, and NeRF. Demonstrates faster training, higher accuracy (e.g., PSNR, SSIM), and improved robustness to noise compared to existing INR techniques. Proposes a multi-dimensional extension of WIRE further enhancing performance in tasks like denoising and super-resolution. Limited exploration of WIRE for non-visual signals, focusing primarily on image-based tasks. Computational cost of complex-valued operations, although mitigated by reducing hidden features, might still pose challenges for real-time applications. Future work will explore applications in areas like audio processing and time series analysis. implicit neural representations, gabor wavelets, inverse problems, image processing, neural radiance fields
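As a concrete illustration of the Gabor-wavelet activation described above, the following is a minimal PyTorch sketch (illustrative, not the authors' implementation); the fixed omega0/scale0 hyperparameters and layer names are assumptions.

```python
# Minimal sketch of a complex Gabor wavelet activation for an INR layer.
import torch
import torch.nn as nn

class GaborActivation(nn.Module):
    def __init__(self, omega0: float = 20.0, scale0: float = 10.0):
        super().__init__()
        self.omega0, self.scale0 = omega0, scale0

    def forward(self, x):
        # psi(x) = exp(i * omega0 * x) * exp(-(scale0 * x)^2): oscillation times a Gaussian window.
        return torch.exp(1j * self.omega0 * x - (self.scale0 * x) ** 2)

class WireLayer(nn.Module):
    """One INR layer: linear map followed by the complex Gabor nonlinearity."""
    def __init__(self, in_feats: int, out_feats: int, is_first: bool = False):
        super().__init__()
        dtype = torch.float if is_first else torch.cfloat  # later layers carry complex features
        self.linear = nn.Linear(in_feats, out_feats, dtype=dtype)
        self.act = GaborActivation()

    def forward(self, x):
        return self.act(self.linear(x))
```

In such a sketch the network's final output would typically take the real part of the last layer's features before comparing against the target signal.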
2301.05065 Report Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks Xinsong Zhang, Yan Zeng, Jipeng Zhang, Hang Li Foundation models or pre-trained models have substantially improved the performance of various language, vision, and vision-language understanding tasks. However, existing foundation models can only perform the best in one type of tasks, namely language, vision, or vision-language. It is still an open question whether it is possible to construct a foundation model performing the best for all the understanding tasks, which we call a general foundation model. In this paper, we propose a new general foundation model, X-FM (the X-Foundation Model). X-FM has one language encoder, one vision encoder, and one fusion encoder, as well as a new training method. The training method includes two new techniques for learning X-FM from text, image, and image-text pair data. One is to stop gradients from the vision-language training when learning the language encoder. The other is to leverage the vision-language training to guide the learning of the vision encoder. Extensive experiments on benchmark datasets show that X-FM can significantly outperform existing general foundation models and perform better than or comparable to existing foundation models specifically for language, vision, or vision-language understanding. Code and pre-trained models are released at https://github.com/zhangxinsong-nlp/XFM. This paper introduces X-FM (the X-Foundation Model), a novel general foundation model designed to excel in language, vision, and vision-language understanding tasks. Existing foundation models typically specialize in a single modality, making it challenging to achieve state-of-the-art performance across all understanding tasks with a single model. X-FM employs three encoders (language, vision, and fusion) and leverages two innovative training techniques: (1) stopping gradients from vision-language training to the language encoder and (2) using vision-language training to guide masked image modeling for the vision encoder. X-FM significantly outperforms previous general foundation models on 23 language, vision, and vision-language understanding tasks. X-FM achieves comparable or even superior performance to state-of-the-art models specifically designed for language, vision, or vision-language tasks. Ablation studies demonstrate the effectiveness of the proposed training techniques in enhancing both uni-modal and multi-modal understanding capabilities. The training process is computationally expensive, requiring optimization for efficiency. Future work will investigate scalability by exploring larger model sizes and datasets. foundation models, multimodality, vision-language understanding, masked image modeling, transfer learning
2301.04650 Report Geometry-biased Transformers for Novel View Synthesis Naveen Venkat, Mayank Agarwal, Maneesh Singh, Shubham Tulsiani We tackle the task of synthesizing novel views of an object given a few input images and associated camera viewpoints. Our work is inspired by recent 'geometry-free' approaches where multi-view images are encoded as a (global) set-latent representation, which is then used to predict the color for arbitrary query rays. While this representation yields (coarsely) accurate images corresponding to novel viewpoints, the lack of geometric reasoning limits the quality of these outputs. To overcome this limitation, we propose 'Geometry-biased Transformers' (GBTs) that incorporate geometric inductive biases in the set-latent representation-based inference to encourage multi-view geometric consistency. We induce the geometric bias by augmenting the dot-product attention mechanism to also incorporate 3D distances between rays associated with tokens as a learnable bias. We find that this, along with camera-aware embeddings as input, allows our models to generate significantly more accurate outputs. We validate our approach on the real-world CO3D dataset, where we train our system over 10 categories and evaluate its view-synthesis ability for novel objects as well as unseen categories. We empirically validate the benefits of the proposed geometric biases and show that our approach significantly improves over prior works. This paper proposes Geometry-biased Transformers (GBTs) for novel view synthesis, which incorporate geometric inductive biases into set-latent representation-based inference to improve multi-view consistency. Existing geometry-free methods, while able to capture global context, lack precision and struggle to render high-quality details. This work aims to address this limitation by introducing geometric reasoning into the process. GBTs leverage a ray-distance-based bias within the attention mechanism of Transformer layers. This bias guides both scene encoding and ray decoding stages to prioritize geometrically relevant context. The model uses a CNN for patch-level feature extraction, a GBT encoder for global scene representation, and a GBT decoder for pixel color prediction. GBTs outperform previous state-of-the-art methods in novel view synthesis on the CO3D dataset, demonstrating superior quality in rendering details and consistency. The method exhibits strong generalization capabilities, achieving good performance on unseen object categories. Analysis reveals that the geometric bias in the attention mechanism leads to more concentrated focus on relevant regions, resulting in finer details and improved rendering quality. Set-latent representation methods, including GBTs, still lag behind projection-based methods in predicting precise details, presenting an area for future improvement. The reliance on camera viewpoints for inference might restrict the applicability of GBTs in real-world scenarios with unknown camera poses. novel view synthesis, transformers, geometric reasoning, set-latent representation, attention mechanism
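A rough sketch of the ray-distance-biased attention described above; the class name, the single learnable per-head bias scale, and the sign convention are assumptions rather than the paper's exact formulation.

```python
# Dot-product attention whose logits are offset by a learnable function of the 3D
# distance between the rays associated with query and key tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometryBiasedAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads  # dim assumed divisible
        self.qkv = nn.Linear(dim, dim * 3)
        self.out = nn.Linear(dim, dim)
        # One learnable scale per head turns ray distance into an attention bias.
        self.gamma = nn.Parameter(torch.zeros(num_heads))

    def forward(self, tokens, ray_dist):
        # tokens: (B, N, dim); ray_dist: (B, N, N) pairwise 3D distances between token rays.
        B, N, _ = tokens.shape
        q, k, v = self.qkv(tokens).chunk(3, dim=-1)
        q, k, v = (t.view(B, N, self.num_heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        logits = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        logits = logits - self.gamma.view(1, -1, 1, 1) * ray_dist.unsqueeze(1)  # distant rays attend less
        attn = F.softmax(logits, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.out(out)
```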
2301.04647 Report EXIF as Language: Learning Cross-Modal Associations Between Images and Camera Metadata Chenhao Zheng, Ayush Shrivastava, Andrew Owens We learn a visual representation that captures information about the camera that recorded a given photo. To do this, we train a multimodal embedding between image patches and the EXIF metadata that cameras automatically insert into image files. Our model represents this metadata by simply converting it to text and then processing it with a transformer. The features that we learn significantly outperform other self-supervised and supervised features on downstream image forensics and calibration tasks. In particular, we successfully localize spliced image regions "zero shot" by clustering the visual embeddings for all of the patches within an image. The paper introduces a method for learning a visual representation that captures camera properties from image patches by associating them with EXIF metadata using a contrastive learning framework. The metadata is treated as a language-like modality and processed using a transformer. Understanding the imaging properties of an image is crucial for various tasks, including image forensics, 3D reconstruction, and image generation, complementing the understanding of semantic content. The method involves training a joint embedding between image patches and EXIF metadata, which is converted to text and processed using a transformer. The model learns to associate visual features with camera properties described in the metadata. Camera metadata serves as effective supervision for self-supervised representation learning. Image-metadata embeddings prove valuable for image forensics and camera understanding tasks, outperforming alternative features. Image manipulations, such as splicing, can be detected 'zero shot' by identifying inconsistencies in patch embeddings derived from the learned representation. The model's performance might be limited to cameras and metadata present in the training dataset (YFCC100M). The method's reliance on large patches might limit its effectiveness in detecting small splices. self-supervised learning, multimodal learning, image forensics, camera metadata, exif
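The patch-metadata training objective lends itself to a compact sketch; the CLIP-style symmetric InfoNCE loss below is an illustration under assumed encoder outputs, not the paper's code.

```python
# Contrastive loss between image-patch embeddings and embeddings of their EXIF
# metadata rendered as text (encoders are placeholders producing (B, D) tensors).
import torch
import torch.nn.functional as F

def exif_contrastive_loss(patch_emb, exif_emb, tau: float = 0.07):
    """Row i of patch_emb and exif_emb come from the same photo."""
    patch_emb = F.normalize(patch_emb, dim=-1)
    exif_emb = F.normalize(exif_emb, dim=-1)
    logits = patch_emb @ exif_emb.t() / tau                     # (B, B) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)
    # Symmetric cross-entropy: match patches to their metadata and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

Zero-shot splice localization then amounts to embedding every patch of a test image with the trained patch encoder and clustering the resulting vectors, as described in the entry above.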
2301.04634 Report Street-View Image Generation from a Bird's-Eye View Layout Alexander Swerdlow, Runsheng Xu, Bolei Zhou Bird's-Eye View (BEV) Perception has received increasing attention in recent years as it provides a concise and unified spatial representation across views and benefits a diverse set of downstream driving applications. At the same time, data-driven simulation for autonomous driving has been a focal point of recent research but with few approaches that are both fully data-driven and controllable. Instead of using perception data from real-life scenarios, an ideal model for simulation would generate realistic street-view images that align with a given HD map and traffic layout, a task that is critical for visualizing complex traffic scenarios and developing robust perception models for autonomous driving. In this paper, we propose BEVGen, a conditional generative model that synthesizes a set of realistic and spatially consistent surrounding images that match the BEV layout of a traffic scenario. BEVGen incorporates a novel cross-view transformation with spatial attention design which learns the relationship between cameras and map views to ensure their consistency. We evaluate the proposed model on the challenging NuScenes and Argoverse 2 datasets. After training, BEVGen can accurately render road and lane lines, as well as generate traffic scenes with diverse different weather conditions and times of day. This paper introduces BEVGen, a novel generative model that synthesizes realistic and spatially consistent street-view images from a given Bird's-Eye View (BEV) layout. This work addresses the unexplored area of generative BEV perception, with applications in synthetic data generation for perception models, visualization of safety-critical situations, and editing of traffic scenes for autonomous driving development. BEVGen uses an autoregressive transformer with VQ-VAE encoders for images and BEV layouts. Spatial embeddings align image and BEV tokens, while a pairwise camera bias ensures image consistency and correspondence. BEVGen generates high-quality, diverse scenes including intersections, parking lots, and boulevards, with consistent weather and time of day across views. Quantitative evaluation on NuScenes and Argoverse 2 shows superior performance over baselines in terms of FID score, road/vehicle mIoU, and View Consistency Score (VSC). The model demonstrates promising applications in data augmentation for BEV segmentation and 3D detection, and in generating images from simulated BEV layouts for safety-critical scenario testing and sim2real applications. The model exhibits limitations in synthesizing small objects like vehicles, impacting downstream perception tasks. Future work can explore improvements in small object synthesis, alternative architectures like diffusion models, and integration with scenario generation methods. generative model, bev perception, autonomous driving, data augmentation, scene synthesis
2301.04628 Report Face Attribute Editing with Disentangled Latent Vectors Yusuf Dalva, Hamza Pehlivan, Cansu Moran, Öykü Irmak Hatipoğlu, Ayşegül Dündar We propose an image-to-image translation framework for facial attribute editing with disentangled interpretable latent directions. Facial attribute editing task faces the challenges of targeted attribute editing with controllable strength and disentanglement in the representations of attributes to preserve the other attributes during edits. For this goal, inspired by the latent space factorization works of fixed pretrained GANs, we design the attribute editing by latent space factorization, and for each attribute, we learn a linear direction that is orthogonal to the others. We train these directions with orthogonality constraints and disentanglement losses. To project images to semantically organized latent spaces, we set an encoder-decoder architecture with attention-based skip connections. We extensively compare with previous image translation algorithms and editing with pretrained GAN works. Our extensive experiments show that our method significantly improves over the state-of-the-arts. Project page: https://yusufdalva.github.io/vecgan This paper presents VecGAN++, an image-to-image translation framework for facial attribute editing that uses disentangled and interpretable latent directions. Facial attribute editing is challenging because it requires targeted modifications without affecting other attributes, and existing methods often struggle with disentanglement, controllability, or reconstruction quality. VecGAN++ uses an encoder-decoder architecture with latent space manipulation. It learns a linear direction for each attribute and performs translations via vector arithmetic in this space. The framework incorporates orthogonality constraints, disentanglement losses, and attention-based skip connections to improve attribute separation and image reconstruction. VecGAN++ achieves state-of-the-art results on facial attribute editing benchmarks, outperforming both end-to-end image translation methods and StyleGAN inversion-based editing techniques. The learned latent directions provide controllable attribute editing, allowing for adjustments to the intensity of the modifications. Analysis of the latent space reveals that the learned directions successfully capture semantic information, with projected style codes effectively classifying attributes like smile presence and hair color. While VecGAN++ demonstrates strong performance on pre-defined attributes, it requires training for each specific attribute, limiting its flexibility compared to StyleGAN inversion methods that can leverage pre-trained models for diverse edits. The separation of hair color attributes, while improved, still presents challenges due to its continuous nature, suggesting an area for future exploration. image translation, generative adversarial networks, facial attribute editing, latent space manipulation, disentanglement
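A minimal sketch of the core mechanism described above: attribute edits as vector arithmetic along learned directions, with an orthogonality penalty so the directions stay disentangled. Class and method names are illustrative assumptions.

```python
# Learned attribute directions with an orthogonality constraint.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeDirections(nn.Module):
    def __init__(self, latent_dim: int, num_attrs: int):
        super().__init__()
        self.dirs = nn.Parameter(torch.randn(num_attrs, latent_dim) * 0.02)

    def edit(self, w, attr_idx: int, alpha: float):
        # Move the latent code w (B, latent_dim) along one attribute direction by strength alpha.
        d = F.normalize(self.dirs[attr_idx], dim=-1)
        return w + alpha * d

    def orthogonality_loss(self):
        # Penalize overlap between directions so editing one attribute leaves the others untouched.
        D = F.normalize(self.dirs, dim=-1)
        gram = D @ D.t()
        off_diag = gram - torch.eye(len(D), device=D.device)
        return (off_diag ** 2).mean()
```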
2301.04604 Report LinkGAN: Linking GAN Latents to Pixels for Controllable Image Synthesis Jiapeng Zhu, Ceyuan Yang, Yujun Shen, Zifan Shi, Bo Dai, Deli Zhao, Qifeng Chen This work presents an easy-to-use regularizer for GAN training, which helps explicitly link some axes of the latent space to a set of pixels in the synthesized image. Establishing such a connection facilitates a more convenient local control of GAN generation, where users can alter the image content only within a spatial area simply by partially resampling the latent code. Experimental results confirm four appealing properties of our regularizer, which we call LinkGAN. (1) The latent-pixel linkage is applicable to either a fixed region (i.e., same for all instances) or a particular semantic category (i.e., varying across instances), like the sky. (2) Two or multiple regions can be independently linked to different latent axes, which further supports joint control. (3) Our regularizer can improve the spatial controllability of both 2D and 3D-aware GAN models, barely sacrificing the synthesis performance. (4) The models trained with our regularizer are compatible with GAN inversion techniques and maintain editability on real images. Proposes LinkGAN, a regularizer for GAN training that explicitly links specific latent space axes to image pixels, enabling precise local control over generated images by resampling the linked latent codes. Addresses limitations of existing GAN manipulation methods that rely on posterior discovery of latent semantics, which are often unstable, inaccurate, and inflexible. Enables more convenient and reliable control for image editing applications. Introduces a regularizer that minimizes the impact of perturbing a subset of latent codes on designated out-of-region pixels, while encouraging changes within the linked region. This effectively disentangles the influence of specific latent axes on chosen image areas. Successfully links arbitrary image regions to latent axes, enabling independent control over multiple areas. Demonstrates effectiveness for both fixed regions and semantically defined areas (e.g., sky, car). Improves controllability of 2D and 3D-aware GAN models without significantly sacrificing synthesis quality. The built linkage is not perfect, sometimes resulting in slight changes outside the targeted region or image inconsistencies after resampling. Future work includes exploring methods to further improve linkage accuracy and address the inconsistency issue, potentially through incorporating stronger priors or adversarial training strategies. generative adversarial networks, image synthesis, local image editing, disentanglement, latent space manipulation
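The regularizer described above can be sketched roughly as follows, assuming a generator callable, an index tensor of linked latent axes, and a binary region mask; the loss weighting and exact formulation are assumptions rather than the paper's.

```python
# Resampling the linked latent axes should change pixels inside the chosen region
# and leave the rest of the image untouched.
import torch

def link_regularizer(generator, z, linked_axes, region_mask):
    """z: (B, latent_dim); linked_axes: index tensor of the latent dims tied to the region;
    region_mask: (B, 1, H, W) with 1 inside the linked region."""
    img = generator(z)
    z_perturbed = z.clone()
    z_perturbed[:, linked_axes] = torch.randn_like(z[:, linked_axes])  # resample linked axes only
    img_perturbed = generator(z_perturbed)
    diff = (img - img_perturbed).abs().mean(dim=1, keepdim=True)       # (B, 1, H, W)
    outside = (diff * (1 - region_mask)).sum() / (1 - region_mask).sum().clamp(min=1)
    inside = (diff * region_mask).sum() / region_mask.sum().clamp(min=1)
    # Keep pixels outside the region fixed while encouraging change inside it.
    return outside - 0.1 * inside
```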
2301.04075 Report Benchmarking Robustness in Neural Radiance Fields Chen Wang, Angtian Wang, Junbo Li, Alan Yuille, Cihang Xie Neural Radiance Field (NeRF) has demonstrated excellent quality in novel view synthesis, thanks to its ability to model 3D object geometries in a concise formulation. However, current approaches to NeRF-based models rely on clean images with accurate camera calibration, which can be difficult to obtain in the real world, where data is often subject to corruption and distortion. In this work, we provide the first comprehensive analysis of the robustness of NeRF-based novel view synthesis algorithms in the presence of different types of corruptions. We find that NeRF-based models are significantly degraded in the presence of corruption, and are more sensitive to a different set of corruptions than image recognition models. Furthermore, we analyze the robustness of the feature encoder in generalizable methods, which synthesize images using neural features extracted via convolutional neural networks or transformers, and find that it only contributes marginally to robustness. Finally, we reveal that standard data augmentation techniques, which can significantly improve the robustness of recognition models, do not help the robustness of NeRF-based models. We hope that our findings will attract more researchers to study the robustness of NeRF-based approaches and help to improve their performance in the real world. This paper presents the first comprehensive benchmark for evaluating the robustness of Neural Radiance Field (NeRF) based novel view synthesis models to visual corruptions. NeRF models are often applied to real-world scenarios where data corruption is common, yet their robustness to such corruption has not been previously studied. The authors construct two datasets, LLFF-C and Blender-C, by adding various types of corruptions to existing NeRF datasets. They benchmark seven representative NeRF models on these datasets, evaluating their performance using PSNR, SSIM, and LPIPS metrics. All benchmarked NeRF models exhibit significant performance degradation across all types of corruptions. Scene-specific NeRF models are generally more robust than generalizable ones. Standard image data augmentation techniques, while effective for image recognition, do not improve the robustness of NeRF models. The study primarily focuses on novel view synthesis and could be extended to other NeRF applications. Future work could investigate the development of more robust NeRF training and pose estimation techniques. neural radiance fields, novel view synthesis, robustness, benchmarking, data corruption
2301.03992 Report Vision Transformers Are Good Mask Auto-Labelers Shiyi Lan, Xitong Yang, Zhiding Yu, Zuxuan Wu, Jose M. Alvarez, Anima Anandkumar We propose Mask Auto-Labeler (MAL), a high-quality Transformer-based mask auto-labeling framework for instance segmentation using only box annotations. MAL takes box-cropped images as inputs and conditionally generates their mask pseudo-labels. We show that Vision Transformers are good mask auto-labelers. Our method significantly reduces the gap between auto-labeling and human annotation regarding mask quality. Instance segmentation models trained using the MAL-generated masks can nearly match the performance of their fully-supervised counterparts, retaining up to 97.4% performance of fully supervised models. The best model achieves 44.1% mAP on COCO instance segmentation (test-dev 2017), outperforming state-of-the-art box-supervised methods by significant margins. Qualitative results indicate that masks produced by MAL are, in some cases, even better than human annotations. This paper proposes Mask Auto-Labeler (MAL), a Transformer-based mask auto-labeling framework for instance segmentation that only requires bounding box annotations. Creating large-scale instance segmentation datasets with mask annotations is expensive and time-consuming. MAL offers a solution to train these models without expensive mask annotations by generating high-quality masks from bounding boxes. MAL uses a two-phase framework: 1) Training a Vision Transformer to generate mask pseudo-labels from box-cropped images. 2) Training instance segmentation models using the generated masks. The framework utilizes techniques like box expansion, attention-based decoding, and a teacher network to achieve high-quality masks. MAL significantly reduces the gap between auto-labeling and human annotation quality. Instance segmentation models trained on MAL-generated masks achieve up to 97.4% of their fully supervised performance on COCO and LVIS. MAL demonstrates strong open-vocabulary generalization by labeling novel categories not seen during training. MAL faces challenges in occlusion situations where human annotations outperform it. The authors observed saturation problems when scaling the model from ViT-Base to ViT-Large. instance segmentation, weakly supervised learning, mask auto-labeling, vision transformers, box-supervised learning
2301.03786 Report DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation Shuai Shen, Wenliang Zhao, Zibin Meng, Wanhua Li, Zheng Zhu, Jie Zhou, Jiwen Lu Talking head synthesis is a promising approach for the video production industry. Recently, a lot of effort has been devoted in this research area to improve the generation quality or enhance the model generalization. However, there are few works able to address both issues simultaneously, which is essential for practical applications. To this end, in this paper, we turn attention to the emerging powerful Latent Diffusion Models, and model the Talking head generation as an audio-driven temporally coherent denoising process (DiffTalk). More specifically, instead of employing audio signals as the single driving factor, we investigate the control mechanism of the talking face, and incorporate reference face images and landmarks as conditions for personality-aware generalized synthesis. In this way, the proposed DiffTalk is capable of producing high-quality talking head videos in synchronization with the source audio, and more importantly, it can be naturally generalized across different identities without any further fine-tuning. Additionally, our DiffTalk can be gracefully tailored for higher-resolution synthesis with negligible extra computational cost. Extensive experiments show that the proposed DiffTalk efficiently synthesizes high-fidelity audio-driven talking head videos for generalized novel identities. For more video results, please refer to https://sstzal.github.io/DiffTalk/. Presents DiffTalk, a novel conditional diffusion model for high-quality and generalized talking head synthesis that leverages audio and reference images to generate synchronized and personalized talking videos across multiple identities without fine-tuning. Addresses the limitations of existing talking head synthesis methods that struggle to achieve both high generation quality and strong model generalization, essential for practical applications like animation and virtual avatars. Models talking head generation as an audio-driven denoising process using Latent Diffusion Models (LDMs). It incorporates smooth audio features as conditions to guide lip movements and utilizes reference images, landmarks, and masked ground-truth images to ensure identity preservation and pose control. Achieves high-fidelity talking head video synthesis with accurate audio-lip synchronization across diverse identities without fine-tuning. Significantly outperforms 2D-based methods in terms of generated image quality and surpasses 3D-based methods in model generalization ability. Demonstrates the capacity for higher-resolution image generation by adjusting the downsampling factor of the image encoder and decoder. The iterative denoising process of DiffTalk demands more time for frame synthesis compared to GAN-based methods. Cross-identity audio driving poses a challenge, leading to relatively less accurate audio-lip synchronization than in self-driven scenarios. talking head synthesis, diffusion models, generative models, audio-visual synchronization, identity preservation
2301.03580 Report Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling Keyu Tian, Yi Jiang, Qishuai Diao, Chen Lin, Liwei Wang, Zehuan Yuan We identify and overcome two key obstacles in extending the success of BERT-style pre-training, or the masked image modeling, to convolutional networks (convnets): (i) convolution operation cannot handle irregular, random-masked input images; (ii) the single-scale nature of BERT pre-training is inconsistent with convnet's hierarchical structure. For (i), we treat unmasked pixels as sparse voxels of 3D point clouds and use sparse convolution to encode. This is the first use of sparse convolution for 2D masked modeling. For (ii), we develop a hierarchical decoder to reconstruct images from multi-scale encoded features. Our method called Sparse masKed modeling (SparK) is general: it can be used directly on any convolutional model without backbone modifications. We validate it on both classical (ResNet) and modern (ConvNeXt) models: on three downstream tasks, it surpasses both state-of-the-art contrastive learning and transformer-based masked modeling by similarly large margins (around +1.0%). Improvements on object detection and instance segmentation are more substantial (up to +3.5%), verifying the strong transferability of features learned. We also find its favorable scaling behavior by observing more gains on larger models. All this evidence reveals a promising future of generative pre-training on convnets. Codes and models are released at https://github.com/keyu-tian/SparK. This paper introduces SparK, a novel BERT-style pre-training method specifically designed for convolutional networks (convnets), addressing the limitations of traditional masked image modeling approaches when applied to convnets. Extending the success of BERT-style pre-training, highly effective in NLP and vision transformers, to convnets remained a challenge. This paper tackles this gap by overcoming convnets' limitations in processing irregular masked images and leveraging their hierarchical structure. SparK leverages sparse convolution to encode unmasked image patches as a 3D point cloud, effectively handling irregularly masked inputs. It also employs a hierarchical decoder that utilizes multi-scale features to reconstruct the image, harnessing convnets' inherent strengths. SparK-pretrained convnets outperform state-of-the-art contrastive learning methods and transformer-based masked modeling on ImageNet classification by a significant margin. The improvements are even more pronounced on COCO object detection and instance segmentation tasks, highlighting SparK's ability to learn highly transferable representations. SparK exhibits favorable scaling behavior, with larger models showing more significant gains, suggesting its potential to boost a wide range of convnet architectures. The current implementation utilizes a simple, fixed decoder architecture for all encoder models and could benefit from exploring task-specific decoder designs. Future work may involve investigating the integration of SparK with other pre-training techniques, like contrastive learning, to further enhance representation learning in convnets. self-supervised learning, masked image modeling, convolutional networks, sparse convolution, hierarchical pre-training
2301.03396 Report Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation Michał Stypułkowski, Konstantinos Vougioukas, Sen He, Maciej Zięba, Stavros Petridis, Maja Pantic Talking face generation has historically struggled to produce head movements and natural facial expressions without guidance from additional reference videos. Recent developments in diffusion-based generative models allow for more realistic and stable data synthesis and their performance on image and video generation has surpassed that of other generative models. In this work, we present an autoregressive diffusion model that requires only one identity image and audio sequence to generate a video of a realistic talking human head. Our solution is capable of hallucinating head movements, facial expressions, such as blinks, and preserving a given background. We evaluate our model on two different datasets, achieving state-of-the-art results on both of them. Presents Diffused Heads, a novel autoregressive diffusion model for generating realistic talking head videos using only a single identity image and an audio sequence. Addresses limitations of GAN-based talking face generation methods, which often suffer from training instability, require additional guidance (limiting originality), and can produce distortions, especially with large head movements. Utilizes a diffusion model conditioned on an identity frame, motion frames (past frames for smoother motion), and audio embeddings (for lip sync and expressions). Introduces a lip sync loss to enhance mouth movement accuracy and uses grayscale motion frames during sampling to prioritize motion information. Achieves state-of-the-art results on LRW and CREMA datasets in terms of visual quality (FID, FVD), smoothness (OFM, F-MSE), and expressiveness (blinks). Outperforms existing methods in a human perception Turing test, demonstrating realistic and believable talking head synthesis. Demonstrates strong generalization ability, effectively generating videos with out-of-distribution identity images and audio recordings. Limited to generating shorter video sequences (8-9 seconds) due to the autoregressive frame generation process. Suffers from long generation times compared to GAN-based approaches, hindering real-time applications. talking face generation, diffusion models, speech-driven animation, one-shot learning, video synthesis
2301.03110 Report RobArch: Designing Robust Architectures against Adversarial Attacks ShengYun Peng, Weilin Xu, Cory Cornelius, Kevin Li, Rahul Duggal, Duen Horng Chau, Jason Martin Adversarial Training is the most effective approach for improving the robustness of Deep Neural Networks (DNNs). However, compared to the large body of research in optimizing the adversarial training process, there are few investigations into how architecture components affect robustness, and they rarely constrain model capacity. Thus, it is unclear where robustness precisely comes from. In this work, we present the first large-scale systematic study on the robustness of DNN architecture components under fixed parameter budgets. Through our investigation, we distill 18 actionable robust network design guidelines that empower model developers to gain deep insights. We demonstrate these guidelines' effectiveness by introducing the novel Robust Architecture (RobArch) model that instantiates the guidelines to build a family of top-performing models across parameter capacities against strong adversarial attacks. RobArch achieves the new state-of-the-art AutoAttack accuracy on the RobustBench ImageNet leaderboard. The code is available at https://github.com/ShengYun-Peng/RobArch. This paper conducts a large-scale systematic study of the robustness of DNN architecture components under fixed parameter budgets, resulting in 18 actionable guidelines for designing robust networks. Architecture design plays a crucial role in deep learning, but its impact on robustness against adversarial attacks is not well understood. This paper fills this gap by systematically examining how architecture components contribute to robustness. The authors conduct a comprehensive study on ImageNet, controlling for model capacity to isolate the effects of individual architecture components. They train over 150 models, exploring network depth, width, stage-level modifications, and block-level designs, under both Fast-AT and Standard-AT. Deepening a network is generally more effective than widening it for robustness, until catastrophic overfitting occurs. Specific modifications, like adding Squeeze-and-Excitation (SE) blocks, removing the first normalization layer in a block, and reducing downsampling in the stem stage, boost robustness. Architectural choices that harm robustness include inverted bottlenecks, large dilation factors, Instance Normalization (IN), parametric activation functions, and reducing activation layers. The study is limited to ResNet-style architectures and ImageNet. Future work could explore interactions between architectural components and optimize training recipes for robust models. adversarial robustness, deep neural networks, architecture design, imagenet, adversarial training
2301.02657 Report TarViS: A Unified Approach for Target-based Video Segmentation Ali Athar, Alexander Hermans, Jonathon Luiten, Deva Ramanan, Bastian Leibe The general domain of video segmentation is currently fragmented into different tasks spanning multiple benchmarks. Despite rapid progress in the state-of-the-art, current methods are overwhelmingly task-specific and cannot conceptually generalize to other tasks. Inspired by recent approaches with multi-task capability, we propose TarViS: a novel, unified network architecture that can be applied to any task that requires segmenting a set of arbitrarily defined 'targets' in video. Our approach is flexible with respect to how tasks define these targets, since it models the latter as abstract 'queries' which are then used to predict pixel-precise target masks. A single TarViS model can be trained jointly on a collection of datasets spanning different tasks, and can hot-swap between tasks during inference without any task-specific retraining. To demonstrate its effectiveness, we apply TarViS to four different tasks, namely Video Instance Segmentation (VIS), Video Panoptic Segmentation (VPS), Video Object Segmentation (VOS) and Point Exemplar-guided Tracking (PET). Our unified, jointly trained model achieves state-of-the-art performance on 5/7 benchmarks spanning these four tasks, and competitive performance on the remaining two. Code and model weights are available at: https://github.com/Ali2500/TarViS TarViS, a novel unified architecture for target-based video segmentation, allowing a single model to be jointly trained for and perform multiple video segmentation tasks by specifying the target as queries. Existing video segmentation methods are task-specific and cannot generalize to other tasks. This work addresses the fragmentation by proposing a unified model for tasks requiring segmenting arbitrarily defined targets in videos. TarViS uses a temporal neck for spatiotemporal feature interaction and a transformer decoder to refine target queries. It encodes task-specific targets, such as object instances or semantic classes, as queries and jointly trains on datasets spanning different tasks like VIS, VPS, VOS, and PET. TarViS achieves state-of-the-art results for VIS on YouTube-VIS 2021 and OVIS, and for VPS on KITTI-STEP and VIPSeg. The model performs competitively for VOS on DAVIS, outperforming several space-time correspondence-based methods. For PET on BURST, TarViS significantly outperforms baselines by encoding objects as queries, effectively handling both point and mask object guidance. Training on multiple datasets might not always improve performance on all benchmarks due to potential class bias. The approach of encoding objects as queries could lead to a loss of fine-grained object details. video segmentation, multi-task learning, transformer, query-based model, target segmentation
2301.02280 Report Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training Filip Radenovic, Abhimanyu Dubey, Abhishek Kadian, Todor Mihaylov, Simon Vandenhende, Yash Patel, Yi Wen, Vignesh Ramanathan, Dhruv Mahajan Vision-language models trained with contrastive learning on large-scale noisy data are becoming increasingly popular for zero-shot recognition problems. In this paper we improve the following three aspects of the contrastive pre-training pipeline: dataset noise, model initialization and the training objective. First, we propose a straightforward filtering strategy titled Complexity, Action, and Text-spotting (CAT) that significantly reduces dataset size, while achieving improved performance across zero-shot vision-language tasks. Next, we propose an approach titled Concept Distillation to leverage strong unimodal representations for contrastive training that does not increase training complexity while outperforming prior work. Finally, we modify the traditional contrastive alignment objective, and propose an importance-sampling approach to up-sample the importance of hard-negatives without adding additional complexity. On an extensive zero-shot benchmark of 29 tasks, our Distilled and Hard-negative Training (DiHT) approach improves on 20 tasks compared to the baseline. Furthermore, for few-shot linear probing, we propose a novel approach that bridges the gap between zero-shot and few-shot performance, substantially improving over prior work. Models are available at https://github.com/facebookresearch/diht. The paper proposes Distilled and Hard-negative Training (DiHT), a new vision-language pre-training method combining dataset filtering, concept distillation, and a hard-negative contrastive objective. Existing contrastive pre-training methods suffer from noisy datasets, suboptimal model initialization, and inefficient use of negative samples. DiHT addresses these issues to improve zero-shot performance on various vision-language tasks. DiHT uses the Complexity, Action, and Text-spotting (CAT) filter to clean large-scale datasets, distills object and attribute concepts from a pre-trained teacher model, and employs a hard-negative contrastive loss to focus on more informative negative samples. DiHT improves zero-shot performance on 20 out of 29 vision-language tasks compared to the CLIP baseline trained on LAION-2B. Training DiHT on the smaller PMD dataset yields better performance than CLIP on 28 out of 29 tasks. DiHT bridges the gap between zero-shot and few-shot performance, showing significant improvements in few-shot linear probing. The effectiveness of the hard-negative loss in very noisy settings needs further exploration. Extending DiHT to more computationally expensive but performant encoder/decoder architectures is a promising future direction. vision-language pre-training, contrastive learning, dataset filtering, concept distillation, hard negative mining
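The hard-negative objective can be illustrated with an importance-weighted InfoNCE in which more similar negatives receive larger weight in the denominator; this sketch follows the general reweighting idea with assumed names and a concentration parameter beta, not necessarily DiHT's exact loss.

```python
# Hard-negative reweighted contrastive loss (image-to-text direction shown).
import torch
import torch.nn.functional as F

def hard_negative_nce(img_emb, txt_emb, tau=0.07, beta=0.5):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sim = img_emb @ txt_emb.t() / tau                        # (B, B) scaled similarities
    B = sim.size(0)
    pos = sim.diagonal()                                     # positive pair similarities
    neg_mask = ~torch.eye(B, dtype=torch.bool, device=sim.device)
    # Importance weights: harder (more similar) negatives count more in the denominator.
    weights = torch.softmax(beta * sim.masked_fill(~neg_mask, float('-inf')), dim=1) * (B - 1)
    neg_term = (weights * sim.exp() * neg_mask).sum(dim=1)
    loss_i2t = -(pos - torch.log(pos.exp() + neg_term)).mean()
    return loss_i2t  # in practice the symmetric text-to-image term is added as well
```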
2301.02240 Report Skip-Attention: Improving Vision Transformers by Paying Less Attention Shashanka Venkataramanan, Amir Ghodrati, Yuki M. Asano, Fatih Porikli, Amirhossein Habibian This work aims to improve the efficiency of vision transformers (ViT). While ViTs use computationally expensive self-attention operations in every layer, we identify that these operations are highly correlated across layers -- a key redundancy that causes unnecessary computations. Based on this observation, we propose SkipAt, a method to reuse self-attention computation from preceding layers to approximate attention at one or more subsequent layers. To ensure that reusing self-attention blocks across layers does not degrade the performance, we introduce a simple parametric function, which outperforms the baseline transformer's performance while running computationally faster. We show the effectiveness of our method in image classification and self-supervised learning on ImageNet-1K, semantic segmentation on ADE20K, image denoising on SIDD, and video denoising on DAVIS. We achieve improved throughput at the same-or-higher accuracy levels in all these tasks. This paper introduces Skip-Attention (SkipAt), a method to improve the efficiency of Vision Transformers (ViT) by reusing self-attention computations from previous layers. Self-attention operations in ViTs are computationally expensive and exhibit high correlation across layers, leading to redundancy. SkipAt introduces a parametric function inspired by ResNeXt that reuses and refines attention information from preceding layers to approximate attention in subsequent layers. SkipAt achieves superior accuracy-efficiency trade-off compared to baseline ViT and other state-of-the-art efficiency-focused methods on ImageNet-1K image classification. SkipAt demonstrates strong generalization by improving performance on various tasks like semantic segmentation, image denoising, and video denoising. The paper provides extensive ablations to analyze the impact of different components and configurations of SkipAt. The paper primarily focuses on skipping attention in the encoder part of the models and suggests exploring its application in decoders as future work. Investigating the effectiveness of applying the parametric function directly to the self-attention map instead of the entire MSA block is left for future exploration. vision transformer, self-attention, efficiency, image classification, semantic segmentation
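A toy sketch of the attention-reuse idea follows; the refinement function here is a simple MLP stand-in, so both it and the block structure are assumptions rather than SkipAt's actual parametric function.

```python
# A transformer block that skips its own self-attention and instead refines the MSA
# output cached from an earlier block with a cheap parametric function.
import torch.nn as nn

class SkipAttentionBlock(nn.Module):
    def __init__(self, dim: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Cheap stand-in for the skipped quadratic self-attention.
        self.reuse = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x, prev_attn_out):
        # prev_attn_out: MSA output from a preceding block, reused instead of recomputed.
        x = x + self.reuse(self.norm1(prev_attn_out))
        x = x + self.mlp(self.norm2(x))
        return x
```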
2301.02239 Report Robust Dynamic Radiance Fields Yu-Lun Liu, Chen Gao, Andreas Meuleman, Hung-Yu Tseng, Ayush Saraf, Changil Kim, Yung-Yu Chuang, Johannes Kopf, Jia-Bin Huang Dynamic radiance field reconstruction methods aim to model the time-varying structure and appearance of a dynamic scene. Existing methods, however, assume that accurate camera poses can be reliably estimated by Structure from Motion (SfM) algorithms. These methods, thus, are unreliable as SfM algorithms often fail or produce erroneous poses on challenging videos with highly dynamic objects, poorly textured surfaces, and rotating camera motion. We address this robustness issue by jointly estimating the static and dynamic radiance fields along with the camera parameters (poses and focal length). We demonstrate the robustness of our approach via extensive quantitative and qualitative experiments. Our results show favorable performance over the state-of-the-art dynamic view synthesis methods. This paper introduces RoDynRF, an algorithm for reconstructing dynamic radiance fields from casual monocular videos without requiring known camera poses and camera intrinsics as input. Existing dynamic radiance field reconstruction methods rely on accurate camera poses typically derived from SfM algorithms, which are prone to failure in challenging videos with dynamic objects, textureless surfaces, and complex camera motion. This work addresses this robustness issue. The method jointly estimates camera poses, focal length, and two separate radiance fields for static and dynamic elements. It employs a coarse-to-fine strategy for static scene reconstruction, late viewing direction conditioning for improved geometry estimation, and incorporates auxiliary losses (reprojection, disparity, monocular depth) for both static and dynamic components. The dynamic reconstruction utilizes a deformation MLP to handle temporal information and a scene flow MLP for motion modeling. RoDynRF achieves state-of-the-art results on dynamic view synthesis benchmarks, including the Dynamic Scene and iPhone datasets. The method demonstrates superior camera pose estimation accuracy compared to existing NeRF-based methods and performs favorably against learning-based visual odometry techniques on the MPI Sintel dataset. Qualitative results showcase RoDynRF's ability to reconstruct high-fidelity dynamic scenes and synthesize novel views from challenging videos where traditional SfM techniques fail. The method assumes a fixed focal length throughout the video, limiting its applicability in scenarios with zooming effects. Fast camera motion or inaccurate flow estimation can still lead to failure in pose estimation. Future work could explore handling dynamic camera intrinsics and improving robustness in extreme motion scenarios. dynamic radiance fields, view synthesis, camera pose estimation, neural rendering, computer vision
2301.01802 Report MonoEdge: Monocular 3D Object Detection Using Local Perspectives Minghan Zhu, Lingting Ge, Panqu Wang, Huei Peng We propose a novel approach for monocular 3D object detection by leveraging local perspective effects of each object. While the global perspective effect shown as size and position variations has been exploited for monocular 3D detection extensively, the local perspectives has long been overlooked. We design a local perspective module to regress a newly defined variable named keyedge-ratios as the parameterization of the local shape distortion to account for the local perspective, and derive the object depth and yaw angle from it. Theoretically, this module does not rely on the pixel-wise size or position in the image of the objects, therefore independent of the camera intrinsic parameters. By plugging this module in existing monocular 3D object detection frameworks, we incorporate the local perspective distortion with global perspective effect for monocular 3D reasoning, and we demonstrate the effectiveness and superior performance over strong baseline methods in multiple datasets. Presents MonoEdge, a novel monocular 3D object detection approach leveraging local perspective effects within objects to estimate depth and yaw angle. Exploits previously overlooked local perspective cues, offering camera-intrinsics-free depth and direct global yaw angle estimation, unlike common allocentric angle-based methods. Introduces 'keyedge-ratios' to quantify local shape distortions, deriving depth and yaw angle independent of camera intrinsics. Employs camera-centric keyedge indexing and grouped regression heads for effective learning, and integrates uncertainty-based depth fusion. Achieves consistent improvements across all evaluation metrics on KITTI and nuScenes datasets over baseline methods. Demonstrates the value of incorporating local perspective distortion with existing approaches for enhanced 3D object detection. Shows robustness in handling objects with varying viewpoints, including those with minimal apparent local distortion. Relies on combining with existing methods based on visual size and position for optimal performance. Limited effectiveness for distant objects with diminished local perspective distortion. 3d object detection, monocular vision, local perspective, keyedge-ratios, camera-intrinsics-free
2301.01795 Report PACO: Parts and Attributes of Common Objects Vignesh Ramanathan, Anmol Kalia, Vladan Petrovic, Yi Wen, Baixue Zheng, Baishan Guo, Rui Wang, Aaron Marquez, Rama Kovvuri, Abhishek Kadian, Amir Mousavi, Yiwen Song, Abhimanyu Dubey, Dhruv Mahajan Object models are gradually progressing from predicting just category labels to providing detailed descriptions of object instances. This motivates the need for large datasets which go beyond traditional object masks and provide richer annotations such as part masks and attributes. Hence, we introduce PACO: Parts and Attributes of Common Objects. It spans 75 object categories, 456 object-part categories and 55 attributes across image (LVIS) and video (Ego4D) datasets. We provide 641K part masks annotated across 260K object boxes, with roughly half of them exhaustively annotated with attributes as well. We design evaluation metrics and provide benchmark results for three tasks on the dataset: part mask segmentation, object and part attribute prediction and zero-shot instance detection. Dataset, models, and code are open-sourced at https://github.com/facebookresearch/paco. The paper introduces PACO, a large-scale dataset for common objects with annotations for part masks, object attributes, and part attributes, aiming to enable research on fine-grained object understanding beyond category-level labels. Existing object datasets lack comprehensive annotations for parts and attributes, limiting research on tasks requiring detailed object understanding such as open-vocabulary detection, visual question answering, and referring expressions. The dataset is constructed from LVIS and Ego4D, with careful selection of 75 object categories, 456 object-part categories, and 55 attributes. The annotation pipeline involves object bounding box and mask annotation, part mask annotation, object and part attribute annotation, and instance ID annotation, ensuring high quality through user studies and manual curation. Part segmentation results show lower AP for object-parts compared to objects due to smaller size, with larger backbones improving performance. Attribute prediction is more challenging than object detection, with larger models showing better performance, and a significant gap between lower and upper bounds highlighting room for improvement. Zero-shot instance detection performance suggests a trade-off between query complexity and compounded errors from multiple attribute predictions, with object and part attributes both contributing significantly to performance. The study primarily focuses on visual attributes and does not explicitly consider shape attributes due to annotation challenges. Future work includes exploring more sophisticated models for joint object, part, and attribute detection, as well as investigating the role of shape attributes. computer vision, object detection, part segmentation, attribute prediction, zero-shot learning
2301.01413 Report Attribute-Centric Compositional Text-to-Image Generation Yuren Cong, Martin Renqiang Min, Li Erran Li, Bodo Rosenhahn, Michael Ying Yang Despite the recent impressive breakthroughs in text-to-image generation, generative models have difficulty in capturing the data distribution of underrepresented attribute compositions while over-memorizing overrepresented attribute compositions, which raises public concerns about their robustness and fairness. To tackle this challenge, we propose ACTIG, an attribute-centric compositional text-to-image generation framework. We present an attribute-centric feature augmentation and a novel image-free training scheme, which greatly improves model's ability to generate images with underrepresented attributes. We further propose an attribute-centric contrastive loss to avoid overfitting to overrepresented attribute compositions. We validate our framework on the CelebA-HQ and CUB datasets. Extensive experiments show that the compositional generalization of ACTIG is outstanding, and our framework outperforms previous works in terms of image quality and text-image consistency. This paper proposes ACTIG, an attribute-centric compositional text-to-image generation framework that excels in generating high-fidelity images with accurate attribute compositions even for underrepresented combinations. Current text-to-image models struggle to capture the data distribution of underrepresented attribute combinations, leading to biased or inaccurate generations, which raises concerns about robustness and fairness. ACTIG introduces: (1) attribute-centric feature augmentation to generate training data with underrepresented attributes, (2) image-free training using augmented text features, and (3) attribute-centric contrastive loss to disentangle attribute distributions and prevent overfitting to popular combinations. ACTIG achieves state-of-the-art results on CelebA-HQ and CUB datasets in terms of image quality (FID) and text-image consistency (R-Precision). ACTIG effectively generates images matching complex attribute compositions, including those not seen during training. User studies confirm ACTIG's superiority in generating high-quality images with accurate attribute representations compared to other state-of-the-art models. The image features from the text-to-image mapping network used in image-free training might not perfectly represent visual appearance, potentially limiting image quality. The attribute parser based on dependency matching may not accurately extract attributes from complex sentences, requiring further improvement. text-to-image generation, compositional generalization, attribute-centric, image-free training, contrastive learning
2301.01296 Report TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models Sucheng Ren, Fangyun Wei, Zheng Zhang, Han Hu Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications cannot or only marginally benefit from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distilling targets, losses, input, network regularization, sequential distillation, etc., revealing that: 1) Distilling token relations is more effective than CLS token- and feature-based distillation; 2) An intermediate layer of the teacher network as the target performs better than using the last layer when the depth of the student mismatches that of the teacher; 3) Weak regularization is preferred; etc. With these findings, we achieve significant fine-tuning accuracy improvements over the scratch MIM pre-training on ImageNet-1K classification, using all the ViT-Tiny, ViT-Small, and ViT-base models, with +4.2%/+2.4%/+1.4% gains, respectively. Our TinyMIM model of base size achieves 52.2 mIoU in ADE20K semantic segmentation, which is +4.1 higher than the MAE baseline. Our TinyMIM model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image classification, which sets a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way for developing small vision Transformer models, that is, by exploring better training methods rather than introducing inductive biases into architectures as in most previous works. Code is available at https://github.com/OliverRensu/TinyMIM. This paper presents TinyMIM, a method that leverages knowledge distillation to enable masked image modeling (MIM) pre-training for small vision transformers (ViTs), significantly improving their performance on downstream tasks. While MIM pre-training has proven effective for large ViTs, small ViTs struggle to benefit from this approach due to their limited capacity. This hinders their applicability in real-world scenarios where efficiency is crucial. TinyMIM distills knowledge from a larger, MIM pre-trained teacher ViT to a smaller student ViT. The paper systematically investigates various design choices for distillation, including targets (token relations, features, CLS token), input (raw or masked images), network regularization, and sequential distillation. Distilling token relations is more effective than using CLS token or feature-based distillation. Intermediate teacher layers as targets often outperform the last layer, particularly when student and teacher depths mismatch. TinyMIM significantly boosts the accuracy of small ViTs on ImageNet-1K classification and ADE20K segmentation, setting new records for their size and computational budget. The study primarily focuses on transferring knowledge from MIM pre-trained teachers, leaving other pre-training methods unexplored for distillation. Future work could explore different teacher architectures or pre-training tasks to further enhance TinyMIM's performance. knowledge distillation, masked image modeling, vision transformers, self-supervised learning, model compression
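For TinyMIM's key finding that distilling token relations works best, a minimal sketch is to match teacher and student query-key relation maps with a KL term. The relation choice, temperature, and function names below are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def relation_maps(q, k, temperature=1.0):
    """Softmax-normalized token-to-token relation map, shape (B, heads, N, N)."""
    scale = q.shape[-1] ** -0.5
    return F.log_softmax((q @ k.transpose(-2, -1)) * scale / temperature, dim=-1)

def relation_distill_loss(q_s, k_s, q_t, k_t, temperature=1.0):
    """KL divergence pulling the student's Q-K relation map toward the teacher's."""
    log_p_s = relation_maps(q_s, k_s, temperature)
    with torch.no_grad():
        log_p_t = relation_maps(q_t, k_t, temperature)
    return F.kl_div(log_p_s, log_p_t, log_target=True, reduction="batchmean")

# toy usage: batch of 2 images, 4 heads, 197 tokens, head dim 64
B, H, N, D = 2, 4, 197, 64
q_s = torch.randn(B, H, N, D, requires_grad=True)
k_s, q_t, k_t = (torch.randn(B, H, N, D) for _ in range(3))
relation_distill_loss(q_s, k_s, q_t, k_t).backward()
```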
2301.01206 Report Speed up the inference of diffusion models via shortcut MCMC sampling Gang Chen Diffusion probabilistic models have generated high quality image synthesis recently. However, one pain point is the notorious inference to gradually obtain clear images with thousands of steps, which is time consuming compared to other generative models. In this paper, we present a shortcut MCMC sampling algorithm, which balances training and inference, while keeping the generated data's quality. In particular, we add the global fidelity constraint with shortcut MCMC sampling to combat the local fitting from diffusion models. We do some initial experiments and show very promising results. Our implementation is available at https://github.com//vividitytech/diffusion-mcmc.git. Presents a shortcut MCMC sampling algorithm for diffusion models that balances training and inference time while maintaining generated data quality by adding a global fidelity constraint to combat local fitting. Diffusion models, while effective for high-quality image synthesis, suffer from slow inference times compared to other generative models due to the need for thousands of steps to obtain clear images. Introduces shortcut MCMC sampling and incorporates a fidelity term in the loss function to match synthesized images with original data, acting as a global constraint for quality control during shortcut generation. The approach achieves fast convergence and better reconstruction compared to traditional diffusion models. Generates high-quality samples with significantly fewer inference steps. Demonstrates superior performance on synthetic datasets, showcasing its potential for fast and accurate image synthesis. Currently explored on synthetic datasets; further validation on more complex and diverse datasets is needed. The impact of varying the number of shortcut steps (K) on the balance between inference speed and generated image quality requires further investigation. diffusion models, mcmc sampling, image synthesis, generative models, fast inference
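As a rough, non-authoritative sketch of pairing the usual denoising objective with a global fidelity constraint: below, the paper's K-step shortcut sample is approximated by a one-step x0 estimate from the predicted noise, and the model, schedule, and weighting are placeholders.

```python
import torch
import torch.nn.functional as F

def diffusion_loss_with_fidelity(model, x0, alphas_cumprod, lambda_fid=0.1):
    B, T = x0.shape[0], alphas_cumprod.shape[0]
    t = torch.randint(0, T, (B,), device=x0.device)
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

    eps_pred = model(x_t, t)                        # standard noise prediction
    loss_simple = F.mse_loss(eps_pred, noise)

    # cheap stand-in for a K-step shortcut sample: a direct x0 estimate
    x0_hat = (x_t - (1 - a_bar).sqrt() * eps_pred) / a_bar.sqrt()
    loss_fidelity = F.mse_loss(x0_hat, x0)          # global fidelity constraint
    return loss_simple + lambda_fid * loss_fidelity

# toy usage with a trivial "model" that ignores the timestep
toy_model = lambda x, t: torch.zeros_like(x)
print(diffusion_loss_with_fidelity(toy_model, torch.randn(4, 3, 32, 32),
                                   torch.linspace(0.9999, 0.01, 1000)))
```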
2301.01156 Report Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation Yue Han, Jiangning Zhang, Zhucun Xue, Chao Xu, Xintian Shen, Yabiao Wang, Chengjie Wang, Yong Liu, Xiangtai Li Few Shot Instance Segmentation (FSIS) requires models to detect and segment novel classes with limited several support examples. In this work, we explore a simple yet unified solution for FSIS as well as its incremental variants, and introduce a new framework named Reference Twice (RefT) to fully explore the relationship between support/query features based on a Transformer-like framework. Our key insights are two folds: Firstly, with the aid of support masks, we can generate dynamic class centers more appropriately to re-weight query features. Secondly, we find that support object queries have already encoded key factors after base training. In this way, the query features can be enhanced twice from two aspects, i.e., feature-level and instance-level. In particular, we firstly design a mask-based dynamic weighting module to enhance support features and then propose to link object queries for better calibration via cross-attention. After the above steps, the novel classes can be improved significantly over our strong baseline. Additionally, our new framework can be easily extended to incremental FSIS with minor modification. When benchmarking results on the COCO dataset for FSIS, gFSIS, and iFSIS settings, our method achieves a competitive performance compared to existing approaches across different shots, e.g., we boost nAP by noticeable +8.2/+9.4 over the current state-of-the-art FSIS method for 10/30-shot. We further demonstrate the superiority of our approach on Few Shot Object Detection. Code and model will be available. This paper proposes RefT, a simple and unified Transformer-based framework for Few-Shot Instance Segmentation (FSIS) and its variants, by leveraging support set information on both feature and instance levels. FSIS is challenging as it requires both detection and segmentation of novel instances with limited data. Existing methods often under-explore cues from support data. This paper aims to address this issue with a more powerful framework and unify FSIS, gFSIS, and iFSIS. RefT uses a two-stage Meta-Learning pipeline with two key novelties: 1) a mask-based dynamic prototype generation for feature-level enhancement, and 2) cross-attention linking of query and support object queries for instance-level guidance. RefT significantly outperforms previous FSIS methods, achieving state-of-the-art results on COCO benchmarks across different shots. The method generalizes well to gFSIS and iFSIS, consistently outperforming recent state-of-the-art approaches. Ablation studies demonstrate the effectiveness of each component, highlighting the importance of both feature-level and instance-level enhancements. The model doesn't perform well in a one-shot setting. Future work will focus on improving performance in the one-shot setting. few-shot instance segmentation, few-shot learning, vision transformer, object detection, image segmentation
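A minimal sketch of RefT's first insight, turning support masks into dynamic class centers that re-weight query features; the shapes and the cosine-based re-weighting rule are assumptions rather than the paper's exact modules.

```python
import torch
import torch.nn.functional as F

def masked_class_center(support_feats, support_masks):
    """support_feats: (K, C, H, W), support_masks: (K, 1, H', W') in {0, 1}.
    Returns a single prototype pooled over the K support shots, shape (C,)."""
    masks = F.interpolate(support_masks.float(), size=support_feats.shape[-2:], mode="nearest")
    num = (support_feats * masks).sum(dim=(0, 2, 3))
    den = masks.sum(dim=(0, 2, 3)).clamp(min=1e-6)
    return num / den

def reweight_query(query_feats, center):
    """Scale query features (B, C, H, W) by cosine similarity to the class center."""
    center = F.normalize(center, dim=0).view(1, -1, 1, 1)
    sim = (F.normalize(query_feats, dim=1) * center).sum(dim=1, keepdim=True)
    return query_feats * (1 + sim)

center = masked_class_center(torch.randn(5, 256, 32, 32), torch.rand(5, 1, 64, 64) > 0.5)
print(reweight_query(torch.randn(2, 256, 32, 32), center).shape)  # (2, 256, 32, 32)
```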
2301.01146 Report Rethinking Mobile Block for Efficient Attention-based Models Jiangning Zhang, Xiangtai Li, Jian Li, Liang Liu, Zhucun Xue, Boshen Zhang, Zhengkai Jiang, Tianxin Huang, Yabiao Wang, Chengjie Wang This paper focuses on developing modern, efficient, lightweight models for dense predictions while trading off parameters, FLOPs, and performance. Inverted Residual Block (IRB) serves as the infrastructure for lightweight CNNs, but no counterpart has been recognized by attention-based studies. This work rethinks lightweight infrastructure from efficient IRB and effective components of Transformer from a unified perspective, extending CNN-based IRB to attention-based models and abstracting a one-residual Meta Mobile Block (MMB) for lightweight model design. Following simple but effective design criterion, we deduce a modern Inverted Residual Mobile Block (iRMB) and build a ResNet-like Efficient MOdel (EMO) with only iRMB for down-stream tasks. Extensive experiments on ImageNet-1K, COCO2017, and ADE20K benchmarks demonstrate the superiority of our EMO over state-of-the-art methods, e.g., EMO-1M/2M/5M achieve 71.5, 75.1, and 78.4 Top-1 that surpass equal-order CNN-/Attention-based models, while trading-off the parameter, efficiency, and accuracy well: running 2.8-4.0x faster than EdgeNeXt on iPhone14. This paper introduces the Meta Mobile Block (MMB), a novel one-residual block design for lightweight attention-based models, and presents the Efficient MOdel (EMO), built solely with MMBs. Existing efficient models either struggle to achieve high accuracy (CNN-based) or require complex structures and high computational costs (attention-based). This work aims to bridge this gap by creating a simple yet effective lightweight model design. The authors extend the concept of Inverted Residual Blocks (IRBs) used in lightweight CNNs to attention-based models. They abstract a unified MMB that can be instantiated into IRB, MHSA, and FFN by adjusting expansion ratio and efficient operator. By deducing MMB with specific components, they propose the Inverted Residual Mobile Block (iRMB) and build a ResNet-like EMO with only iRMBs. EMO outperforms state-of-the-art lightweight attention-based models on ImageNet-1K, COCO2017, and ADE20K benchmarks. EMO achieves a better balance between accuracy, parameters, and FLOPs compared to counterparts, running 2.8-4.0x faster than EdgeNeXt on iPhone14. Ablation studies demonstrate the effectiveness of the iRMB design and the importance of component choices and configurations. Exploration of more complex and potentially more effective operators within the iRMB structure is left for future work. Further performance improvements could be achieved by utilizing higher resolution input, NAS, knowledge distillation, larger datasets, and stronger training strategies. lightweight model, efficient architecture, attention mechanism, meta mobile block, inverted residual block
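A loose sketch of an inverted-residual-style block whose efficient operator combines self-attention with a depthwise convolution, in the spirit of the iRMB; the layer choices and ordering here are simplified assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class InvertedResidualAttnBlock(nn.Module):
    def __init__(self, dim, expansion=4, heads=4):
        super().__init__()
        hidden = dim * expansion
        self.norm = nn.LayerNorm(dim)
        self.expand = nn.Linear(dim, hidden)                  # inverted expansion
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.project = nn.Linear(hidden, dim)                 # project back down

    def forward(self, x):                                     # x: (B, H, W, C)
        B, H, W, C = x.shape
        tokens = self.expand(self.norm(x)).reshape(B, H * W, -1)
        tokens, _ = self.attn(tokens, tokens, tokens)
        feat = tokens.reshape(B, H, W, -1).permute(0, 3, 1, 2)
        feat = self.dwconv(feat).permute(0, 2, 3, 1)
        return x + self.project(feat)                         # one-residual structure

print(InvertedResidualAttnBlock(64)(torch.randn(2, 14, 14, 64)).shape)  # (2, 14, 14, 64)
```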
2301.00950 Report Class-Continuous Conditional Generative Neural Radiance Field Jiwook Kim, Minhyeok Lee The 3D-aware image synthesis focuses on conserving spatial consistency besides generating high-resolution images with fine details. Recently, Neural Radiance Field (NeRF) has been introduced for synthesizing novel views with low computational cost and superior performance. While several works investigate a generative NeRF and show remarkable achievement, they cannot handle conditional and continuous feature manipulation in the generation procedure. In this work, we introduce a novel model, called Class-Continuous Conditional Generative NeRF ($\text{C}^{3}$G-NeRF), which can synthesize conditionally manipulated photorealistic 3D-consistent images by projecting conditional features to the generator and the discriminator. The proposed $\text{C}^{3}$G-NeRF is evaluated with three image datasets, AFHQ, CelebA, and Cars. As a result, our model shows strong 3D-consistency with fine details and smooth interpolation in conditional feature manipulation. For instance, $\text{C}^{3}$G-NeRF exhibits a Fr\'echet Inception Distance (FID) of 7.64 in 3D-aware face image synthesis with a $\text{128}^{2}$ resolution. Additionally, we provide FIDs of generated 3D-aware images of each class of the datasets as it is possible to synthesize class-conditional images with $\text{C}^{3}$G-NeRF. This paper introduces C³G-NeRF, a novel model for conditional and continuous feature manipulation in 3D-aware image generation using Neural Radiance Fields (NeRF). Existing generative NeRF methods lack the ability to handle conditional and continuous feature control during generation, limiting their application in fields like avatar customization in the metaverse. C³G-NeRF projects conditional features onto both the generator and discriminator, enabling fine-grained control over image synthesis. It utilizes a generative neural feature field for each object and the background, composed using a density-weighted mean-based composition operator. The model employs volume rendering to generate feature images and a neural rendering network to produce the final high-resolution image. Residual modules are incorporated to improve training efficiency and performance. C³G-NeRF achieves state-of-the-art results in conditional 3D-aware image generation, outperforming baseline models in terms of FID and KID scores across various datasets (AFHQ, CelebA, Cars). The model exhibits robust 3D consistency, preserving spatial coherence under object rotations, translations, and additions. C³G-NeRF enables smooth interpolation and extrapolation of conditional input values, allowing for continuous feature manipulation in generated images. The evaluation of FID scores for different view degrees across datasets might not be directly comparable. Future work could explore incorporating higher-resolution feature images or alternative neural rendering techniques to further enhance generation quality. generative adversarial networks (gans), neural radiance fields (nerf), 3d-aware image synthesis, conditional image generation, continuous feature manipulation
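The composition of per-object (and background) feature fields via a density-weighted mean can be sketched as below; this follows the general compositional-NeRF recipe, and the paper's exact operator may differ in detail.

```python
import torch

def compose_fields(sigmas, feats, eps=1e-8):
    """sigmas: (N, R, S) densities from N fields, feats: (N, R, S, C) features.
    Returns the composed density (R, S) and density-weighted mean feature (R, S, C)."""
    sigma = sigmas.sum(dim=0)
    feat = (sigmas[..., None] * feats).sum(dim=0) / (sigma[..., None] + eps)
    return sigma, feat

sigma, feat = compose_fields(torch.rand(3, 1024, 64), torch.rand(3, 1024, 64, 32))
print(sigma.shape, feat.shape)  # (1024, 64) (1024, 64, 32)
```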
2301.00808 Report ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie Driven by improved architectures and better representation learning frameworks, the field of visual recognition has enjoyed rapid modernization and performance boost in the early 2020s. For example, modern ConvNets, represented by ConvNeXt, have demonstrated strong performance in various scenarios. While these models were originally designed for supervised learning with ImageNet labels, they can also potentially benefit from self-supervised learning techniques such as masked autoencoders (MAE). However, we found that simply combining these two approaches leads to subpar performance. In this paper, we propose a fully convolutional masked autoencoder framework and a new Global Response Normalization (GRN) layer that can be added to the ConvNeXt architecture to enhance inter-channel feature competition. This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets on various recognition benchmarks, including ImageNet classification, COCO detection, and ADE20K segmentation. We also provide pre-trained ConvNeXt V2 models of various sizes, ranging from an efficient 3.7M-parameter Atto model with 76.7% top-1 accuracy on ImageNet, to a 650M Huge model that achieves a state-of-the-art 88.9% accuracy using only public training data. This paper introduces ConvNeXt V2, an improved ConvNeXt model designed for enhanced performance with masked autoencoders (MAE). It features a fully convolutional masked autoencoder framework and a novel Global Response Normalization (GRN) layer. This co-design of self-supervised learning techniques and architecture boosts the performance of pure ConvNets in image recognition tasks, exceeding previous ConvNeXt versions and rivaling transformer-based models. The authors develop a fully convolutional MAE framework with sparse convolutions for ConvNets. They analyze feature collapse in ConvNeXt with MAE pre-training and address it by incorporating the GRN layer to enhance feature diversity. ConvNeXt V2 significantly outperforms prior ConvNeXt models and achieves state-of-the-art accuracy on ImageNet classification with public data (88.9%). The proposed method demonstrates consistent improvement across a wide range of model sizes, from efficient (3.7M parameters) to high-capacity (650M parameters) variants. ConvNeXt V2 excels in transfer learning, surpassing Swin transformer-based models in object detection and semantic segmentation tasks on COCO and ADE20K. The largest model shows a slight performance gap compared to ViT in the huge model regime, potentially due to ViT benefiting more from self-supervised pre-training. The efficiency of sparse convolution libraries can be further optimized for modern hardware. image recognition, convolutional neural networks, self-supervised learning, masked autoencoders, transfer learning
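The GRN layer admits a compact sketch in channels-last layout: spatial L2 aggregation per channel, divisive normalization across channels, and a learnable affine with an identity shortcut. Treat this as an illustration of the formulation described above rather than the official implementation.

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))

    def forward(self, x):                                   # x: (B, H, W, C)
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)   # global per-channel response
        nx = gx / (gx.mean(dim=-1, keepdim=True) + 1e-6)    # compete across channels
        return self.gamma * (x * nx) + self.beta + x        # identity at initialization

print(GRN(96)(torch.randn(2, 56, 56, 96)).shape)  # (2, 56, 56, 96)
```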
2301.00805 Report Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation Jianzong Wu, Xiangtai Li, Henghui Ding, Xia Li, Guangliang Cheng, Yunhai Tong, Chen Change Loy In this work, we focus on open vocabulary instance segmentation to expand a segmentation model to classify and segment instance-level novel categories. Previous approaches have relied on massive caption datasets and complex pipelines to establish one-to-one mappings between image regions and words in captions. However, such methods build noisy supervision by matching non-visible words to image regions, such as adjectives and verbs. Meanwhile, context words are also important for inferring the existence of novel objects as they show high inter-correlations with novel categories. To overcome these limitations, we devise a joint Caption Grounding and Generation (CGG) framework, which incorporates a novel grounding loss that only focuses on matching object nouns to improve learning efficiency. We also introduce a caption generation head that enables additional supervision and contextual modeling as a complementation to the grounding loss. Our analysis and results demonstrate that grounding and generation components complement each other, significantly enhancing the segmentation performance for novel classes. Experiments on the COCO dataset with two settings: Open Vocabulary Instance Segmentation (OVIS) and Open Set Panoptic Segmentation (OSPS) demonstrate the superiority of the CGG. Specifically, CGG achieves a substantial improvement of 6.8% mAP for novel classes without extra data on the OVIS task and 15% PQ improvements for novel classes on the OSPS benchmark. This paper presents a joint Caption Grounding and Generation (CGG) framework to address the challenge of open vocabulary instance segmentation, where the goal is to enable segmentation models to classify and segment novel object categories not seen during training. Existing methods often rely on noisy supervision from matching non-visual words in captions to image regions or struggle to effectively leverage contextual information for novel object inference. CGG addresses these limitations by employing a novel grounding loss focused on matching object nouns and introducing a caption generation head for contextual modeling and additional supervision. CGG uses a Mask2Former baseline and incorporates two main components: (1) a caption grounding module that extracts object nouns from captions and aligns them with object queries in the segmentation model using a dedicated loss function and (2) a caption generation module that leverages multi-modal embeddings from the segmentation model to generate image captions, providing additional supervision and contextual understanding. CGG outperforms previous state-of-the-art methods on the OVIS benchmark, achieving a significant improvement of 6.8% mAP for novel classes without using additional data. On the OSPS benchmark, CGG demonstrates superior performance, achieving a 15% PQ improvement for novel classes compared to previous approaches. Ablation studies confirm the effectiveness of both the caption grounding and generation components, highlighting their complementary roles in enhancing the model's ability to identify and segment novel object categories. The study is limited by computational resources, preventing pre-training on larger caption datasets or using VLMs like CLIP for distillation/supervision. Future work will focus on exploring these avenues and evaluating CGG on more extensive datasets such as LVIS and Open Images. open vocabulary instance segmentation, caption grounding, caption generation, mask2former, open set panoptic segmentation
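One way to realize a noun-only grounding signal is to ask each object-noun embedding from the caption to be explained by its best-matching object query. The surrogate below is an assumption for illustration, not the paper's exact grounding loss.

```python
import torch
import torch.nn.functional as F

def noun_grounding_loss(query_embs, noun_embs, tau=0.07):
    """query_embs: (Q, D) decoder object queries, noun_embs: (N, D) caption noun embeddings."""
    q = F.normalize(query_embs, dim=-1)
    n = F.normalize(noun_embs, dim=-1)
    sim = (n @ q.t()) / tau                                   # (N, Q)
    # each noun should be claimed confidently by some query (max over queries)
    return -F.log_softmax(sim, dim=1).max(dim=1).values.mean()

print(noun_grounding_loss(torch.randn(100, 256), torch.randn(6, 256)))
```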
2301.00592 Report Edge Enhanced Image Style Transfer via Transformers Chiyu Zhang, Jun Yang, Zaiyan Dai, Peng Cao In recent years, arbitrary image style transfer has attracted more and more attention. Given a pair of content and style images, a stylized one is hoped that retains the content from the former while catching style patterns from the latter. However, it is difficult to simultaneously keep well the trade-off between the content details and the style features. To stylize the image with sufficient style patterns, the content details may be damaged and sometimes the objects of images can not be distinguished clearly. For this reason, we present a new transformer-based method named STT for image style transfer and an edge loss which can enhance the content details apparently to avoid generating blurred results for excessive rendering on style features. Qualitative and quantitative experiments demonstrate that STT achieves comparable performance to state-of-the-art image style transfer methods while alleviating the content leak problem. This paper proposes STT, a novel Transformer-based image style transfer network that generates high-quality stylized images while preserving fine content details. Existing image style transfer methods struggle to balance content details and style features, often resulting in blurred outputs with indistinguishable objects. STT utilizes a Transformer-based encoder-transfer-decoder architecture with a content-aware positional encoding (Conv PE). It incorporates a novel edge loss to enhance content details and prevent blurred stylizations. STT demonstrates superior performance in preserving content details and transferring style features compared to state-of-the-art methods. The proposed Conv PE effectively encodes positional information, outperforming traditional functional and parametric approaches. The edge loss significantly improves the clarity of stylized images, particularly in cases where the original results are blurred. The edge loss in STT is only applied when the initial results are noticeably blurred. Further research could explore the integration of the content-aware positional encoding (CAPE) within the STT framework. image style transfer, transformer, edge enhancement, content leak, deep learning
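A hedged sketch of an edge loss: compare edge maps of the content image and the stylized output so that fine structure survives heavy stylization. The Sobel extractor and L1 comparison are assumptions, not necessarily STT's exact formulation.

```python
import torch
import torch.nn.functional as F

def sobel_edges(img):
    """img: (B, 3, H, W) -> gradient magnitude of a grayscale version, (B, 1, H, W)."""
    gray = img.mean(dim=1, keepdim=True)
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]], device=img.device)
    ky = kx.t()
    gx = F.conv2d(gray, kx.view(1, 1, 3, 3), padding=1)
    gy = F.conv2d(gray, ky.view(1, 1, 3, 3), padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)

def edge_loss(stylized, content):
    return F.l1_loss(sobel_edges(stylized), sobel_edges(content))

stylized = torch.rand(2, 3, 256, 256, requires_grad=True)
edge_loss(stylized, torch.rand(2, 3, 256, 256)).backward()
```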
2301.00527 Report Diffusion Probabilistic Models for Scene-Scale 3D Categorical Data Jumin Lee, Woobin Im, Sebin Lee, Sung-Eui Yoon In this paper, we learn a diffusion model to generate 3D data on a scene-scale. Specifically, our model crafts a 3D scene consisting of multiple objects, while recent diffusion research has focused on a single object. To realize our goal, we represent a scene with discrete class labels, i.e., categorical distribution, to assign multiple objects into semantic categories. Thus, we extend discrete diffusion models to learn scene-scale categorical distributions. In addition, we validate that a latent diffusion model can reduce computation costs for training and deploying. To the best of our knowledge, our work is the first to apply discrete and latent diffusion for 3D categorical data on a scene-scale. We further propose to perform semantic scene completion (SSC) by learning a conditional distribution using our diffusion model, where the condition is a partial observation in a sparse point cloud. In experiments, we empirically show that our diffusion models not only generate reasonable scenes, but also perform the scene completion task better than a discriminative model. Our code and models are available at https://github.com/zoomin-lee/scene-scale-diffusion This paper introduces the first application of discrete and latent diffusion models for generating scene-scale 3D semantic segmentation maps. Existing 3D diffusion models focus on single object generation, while this model aims to generate entire 3D scenes with multiple objects which has broader applications like semantic scene completion. The authors extend discrete diffusion models to handle 3D categorical voxel data, representing scenes with discrete class labels. They also validate the use of latent diffusion models to reduce computation costs during training and deployment. Both discrete and latent diffusion models successfully generate diverse and plausible 3D scenes. Latent diffusion significantly reduces training and sampling time compared to discrete diffusion. The proposed method outperforms a discriminative model in the semantic scene completion task, demonstrating its ability to complete scenes from partial observations. The performance of VQ-VAE in latent diffusion can be limited by codebook size and resolution. Future work can explore more sophisticated network architectures and training strategies specifically designed for 3D scene generation. diffusion models, 3d scene generation, semantic scene completion, latent diffusion, categorical data
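A uniform-transition categorical forward step on semantic voxel labels, the kind of discrete diffusion the paper scales to scenes, can be sketched as follows; the schedule and shapes are illustrative assumptions.

```python
import torch

def q_sample_categorical(x0, alpha_bar_t, num_classes):
    """x0: integer label volume (B, D, H, W); keep each label with prob alpha_bar_t,
    otherwise resample it uniformly over the num_classes semantic categories."""
    keep = torch.rand_like(x0, dtype=torch.float) < alpha_bar_t
    return torch.where(keep, x0, torch.randint_like(x0, num_classes))

x0 = torch.randint(0, 20, (1, 32, 128, 128))       # e.g. 20 semantic classes
x_t = q_sample_categorical(x0, alpha_bar_t=0.7, num_classes=20)
print((x_t == x0).float().mean())                  # roughly 0.7 + 0.3 / 20
```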
2301.00411 Report Detachable Novel Views Synthesis of Dynamic Scenes Using Distribution-Driven Neural Radiance Fields Boyu Zhang, Wenbo Xu, Zheng Zhu, Guan Huang Representing and synthesizing novel views in real-world dynamic scenes from casual monocular videos is a long-standing problem. Existing solutions typically approach dynamic scenes by applying geometry techniques or utilizing temporal information between several adjacent frames without considering the underlying background distribution in the entire scene or the transmittance over the ray dimension, limiting their performance on static and occlusion areas. Our approach $\textbf{D}$istribution-$\textbf{D}$riven neural radiance fields offers high-quality view synthesis and a 3D solution to $\textbf{D}$etach the background from the entire $\textbf{D}$ynamic scene, which is called $\text{D}^4$NeRF. Specifically, it employs a neural representation to capture the scene distribution in the static background and a 6D-input NeRF to represent dynamic objects, respectively. Each ray sample is given an additional occlusion weight to indicate the transmittance lying in the static and dynamic components. We evaluate $\text{D}^4$NeRF on public dynamic scenes and our urban driving scenes acquired from an autonomous-driving dataset. Extensive experiments demonstrate that our approach outperforms previous methods in rendering texture details and motion areas while also producing a clean static background. Our code will be released at https://github.com/Luciferbobo/D4NeRF. This paper proposes D⁴NeRF, a novel method using Distribution-Driven Neural Radiance Fields for Detachable Novel Views Synthesis of Dynamic Scenes from casual monocular videos. Existing methods often neglect the underlying background distribution and ray transmittance in dynamic scenes, limiting their performance on static and occlusion areas. This work addresses these limitations. The method uses a parallel structure with a background pipeline capturing the scene distribution and a 6D-input NeRF representing dynamic objects. An occlusion weight module is introduced to learn the transmittance between static and dynamic components, and multiple regularization losses optimize the training. D⁴NeRF outperforms state-of-the-art methods in novel view synthesis quality, achieving higher PSNR and SSIM and lower LPIPS. The method effectively decouples static backgrounds from dynamic scenes in a self-supervised manner. Quantitative and qualitative evaluations on NVIDIA dynamic scenes and a new urban driving scenes dataset demonstrate the effectiveness of D⁴NeRF. The performance relies on accurate estimations of camera poses and optical flow. Future work includes exploring high-quality decomposition and editor models for 3D dynamic scenes, and potentially integrating deformable latent code for finer dynamic object representation. novel view synthesis, neural radiance fields, dynamic scenes, monocular videos, occlusion handling
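The occlusion-weighted blending of static and dynamic branches before standard volume rendering can be sketched roughly as below; the blending rule and names are assumptions loosely following the description above.

```python
import torch

def blend_and_render(rgb_s, sigma_s, rgb_d, sigma_d, w, deltas):
    """Per-ray-sample tensors: rgb_* (R, S, 3); sigma_*, w, deltas (R, S);
    w gates how much each sample is explained by the dynamic branch."""
    sigma = w * sigma_d + (1 - w) * sigma_s
    rgb = w[..., None] * rgb_d + (1 - w[..., None]) * rgb_s
    alpha = 1 - torch.exp(-sigma * deltas)
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1 - alpha + 1e-10], dim=1), dim=1)[:, :-1]
    return ((alpha * trans)[..., None] * rgb).sum(dim=1)     # composited color (R, 3)

R, S = 1024, 64
out = blend_and_render(torch.rand(R, S, 3), torch.rand(R, S),
                       torch.rand(R, S, 3), torch.rand(R, S),
                       torch.rand(R, S), torch.full((R, S), 0.01))
print(out.shape)  # (1024, 3)
```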
2301.00184 Report Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? Wenhao Wu, Haipeng Luo, Bo Fang, Jingdong Wang, Wanli Ouyang Most existing text-video retrieval methods focus on cross-modal matching between the visual content of videos and textual query sentences. However, in real-world scenarios, online videos are often accompanied by relevant text information such as titles, tags, and even subtitles, which can be utilized to match textual queries. This insight has motivated us to propose a novel approach to text-video retrieval, where we directly generate associated captions from videos using zero-shot video captioning with knowledge from web-scale pre-trained models (e.g., CLIP and GPT-2). Given the generated captions, a natural question arises: what benefits do they bring to text-video retrieval? To answer this, we introduce Cap4Video, a new framework that leverages captions in three ways: i) Input data: video-caption pairs can augment the training data. ii) Intermediate feature interaction: we perform cross-modal feature interaction between the video and caption to produce enhanced video representations. iii) Output score: the Query-Caption matching branch can complement the original Query-Video matching branch for text-video retrieval. We conduct comprehensive ablation studies to demonstrate the effectiveness of our approach. Without any post-processing, Cap4Video achieves state-of-the-art performance on four standard text-video retrieval benchmarks: MSR-VTT (51.4%), VATEX (66.6%), MSVD (51.8%), and DiDeMo (52.0%). The code is available at https://github.com/whwu95/Cap4Video . Proposes Cap4Video, a novel framework leveraging automatically generated captions to enhance text-video retrieval. Existing methods focus on visual-textual matching, neglecting the valuable textual information often associated with videos. 1. Generates captions from videos using zero-shot video captioning (CLIP+GPT-2). 2. Leverages captions for: Data augmentation with video-caption pairs, Feature interaction between video and caption representations, Output score fusion of query-video and query-caption matching. Achieves state-of-the-art performance on MSR-VTT (51.4%), VATEX (66.6%), MSVD (51.8%), and DiDeMo (52.0%). Demonstrates significant improvements over baselines, particularly in retrieving precise ground-truth videos. Shows effectiveness of caption-based data augmentation, feature interaction, and score fusion through ablation studies. Relies on the quality of generated captions, which can be improved with more advanced captioning methods. Current implementation focuses on global caption embedding; exploring fine-grained caption information could be beneficial. text-video retrieval, video captioning, cross-modal learning, clip, gpt-2
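The output-level use of captions (iii) reduces to fusing a query-video similarity matrix with a query-caption one; here is a minimal sketch with placeholder encoders and fusion weight.

```python
import torch
import torch.nn.functional as F

def fused_retrieval_scores(query_emb, video_emb, caption_emb, alpha=0.5):
    """query_emb: (Q, D); video_emb, caption_emb: (V, D), where the v-th caption
    embedding summarizes the captions generated for video v."""
    q = F.normalize(query_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    c = F.normalize(caption_emb, dim=-1)
    return alpha * (q @ v.t()) + (1 - alpha) * (q @ c.t())   # (Q, V) retrieval scores

scores = fused_retrieval_scores(torch.randn(8, 512), torch.randn(100, 512), torch.randn(100, 512))
print(scores.argmax(dim=1))  # best-matching video index per query
```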
2301.00182 Report Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang Vision-language models (VLMs) pre-trained on large-scale image-text pairs have demonstrated impressive transferability on various visual tasks. Transferring knowledge from such powerful VLMs is a promising direction for building effective video recognition models. However, current exploration in this field is still limited. We believe that the greatest value of pre-trained VLMs lies in building a bridge between visual and textual domains. In this paper, we propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge: i) We introduce the Video Attribute Association mechanism, which leverages the Video-to-Text knowledge to generate textual auxiliary attributes for complementing video recognition. ii) We also present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner, leading to enhanced video representation. Extensive studies on six popular video datasets, including Kinetics-400 & 600, UCF-101, HMDB-51, ActivityNet and Charades, show that our method achieves state-of-the-art performance in various recognition scenarios, such as general, zero-shot, and few-shot video recognition. Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model. The code is available at https://github.com/whwu95/BIKE . This paper proposes BIKE, a novel framework that utilizes bidirectional cross-modal knowledge from pre-trained VLMs for enhanced video recognition. Existing methods utilizing VLMs for video recognition often only leverage unidirectional video-to-text matching, not fully exploiting VLMs' potential. BIKE comprises two branches: 1) Attributes branch: employs Video-Attributes Association to generate textual attributes from videos, complementing recognition. 2) Video branch: utilizes Video Concept Spotting to generate temporal saliency from category descriptions, enhancing video representation. BIKE achieves state-of-the-art accuracy on Kinetics-400 (88.6%) using CLIP, outperforming methods with larger pre-training datasets. It also demonstrates superior performance on ActivityNet, UCF-101, and HMDB-51, showing strong generalization ability. BIKE exhibits promising results in few-shot and zero-shot video recognition settings, showcasing its effectiveness in data-scarce scenarios. The complementary effect of the Attributes branch diminishes with larger backbones. Future work can explore automatically generating better lexicon for attribute generation. video recognition, vision-language models, cross-modal learning, temporal saliency, attribute generation
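The parameter-free Temporal Concept Spotting idea can be sketched as saliency-weighted pooling of frame features against the category text embedding; the temperature and shapes below are assumptions.

```python
import torch
import torch.nn.functional as F

def saliency_pooled_video_feature(frame_feats, text_feat, tau=0.01):
    """frame_feats: (T, D) per-frame embeddings, text_feat: (D,) category embedding."""
    f = F.normalize(frame_feats, dim=-1)
    t = F.normalize(text_feat, dim=-1)
    saliency = F.softmax((f @ t) / tau, dim=0)         # (T,) temporal saliency weights
    return (saliency[:, None] * frame_feats).sum(dim=0)

print(saliency_pooled_video_feature(torch.randn(8, 512), torch.randn(512)).shape)  # (512,)
```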
2301.00157 Report Ponder: Point Cloud Pre-training via Neural Rendering Di Huang, Sida Peng, Tong He, Honghui Yang, Xiaowei Zhou, Wanli Ouyang We propose a novel approach to self-supervised learning of point cloud representations by differentiable neural rendering. Motivated by the fact that informative point cloud features should be able to encode rich geometry and appearance cues and render realistic images, we train a point-cloud encoder within a devised point-based neural renderer by comparing the rendered images with real images on massive RGB-D data. The learned point-cloud encoder can be easily integrated into various downstream tasks, including not only high-level tasks like 3D detection and segmentation, but low-level tasks like 3D reconstruction and image synthesis. Extensive experiments on various tasks demonstrate the superiority of our approach compared to existing pre-training methods. This paper introduces Ponder, a novel self-supervised point cloud representation learning framework that leverages differentiable neural rendering. Learning effective 3D point cloud representations is crucial for various applications, but existing pre-training methods have limitations such as reliance on contrastive learning or difficulty in handling point cloud irregularity. Ponder addresses these limitations by connecting 2D and 3D data through rendering, enabling the learning of rich geometry and appearance cues. Ponder takes RGB-D images as input, constructs point clouds via back-projection, encodes point features, and organizes them into a 3D feature volume. It then reconstructs a neural scene representation using SDF and utilizes differentiable rendering to generate color and depth images. The network is trained by minimizing the difference between rendered and real images. Ponder significantly outperforms existing pre-training methods on 3D object detection and semantic segmentation tasks. It demonstrates strong transfer learning ability to low-level 3D tasks, including scene reconstruction and image synthesis from point clouds, which is a first for pre-training methods. The pre-trained Ponder model can be directly applied to 3D reconstruction and image synthesis from sparse point clouds, producing high-fidelity meshes and realistic images. The current Ponder model could be improved by integrating more recent advancements in neural representations for better rendering quality. The flexible architecture design of Ponder presents potential for expansion to other self-supervised learning areas, like 2D image backbone pre-training, and different downstream tasks. self-supervised learning, point cloud representation, neural rendering, 3d object detection, 3d scene reconstruction
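The first step of Ponder's pipeline, back-projecting RGB-D frames into colored point clouds, can be sketched with pinhole intrinsics; the intrinsic values below are illustrative.

```python
import torch

def backproject_rgbd(depth, rgb, fx, fy, cx, cy):
    """depth: (H, W) metric depth, rgb: (H, W, 3) -> points (H*W, 3), colors (H*W, 3)."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    z = depth.flatten()
    x = (u.flatten() - cx) * z / fx
    y = (v.flatten() - cy) * z / fy
    return torch.stack([x, y, z], dim=1), rgb.reshape(-1, 3)

points, colors = backproject_rgbd(torch.rand(480, 640) * 3.0, torch.rand(480, 640, 3),
                                  fx=525.0, fy=525.0, cx=319.5, cy=239.5)
print(points.shape)  # (307200, 3)
```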
2301.00135 Report TeViS:Translating Text Synopses to Video Storyboards Xu Gu, Yuchong Sun, Feiyue Ni, Shizhe Chen, Xihua Wang, Ruihua Song, Boyuan Li, Xiang Cao A video storyboard is a roadmap for video creation which consists of shot-by-shot images to visualize key plots in a text synopsis. Creating video storyboards, however, remains challenging which not only requires cross-modal association between high-level texts and images but also demands long-term reasoning to make transitions smooth across shots. In this paper, we propose a new task called Text synopsis to Video Storyboard (TeViS) which aims to retrieve an ordered sequence of images as the video storyboard to visualize the text synopsis. We construct a MovieNet-TeViS dataset based on the public MovieNet dataset. It contains 10K text synopses each paired with keyframes manually selected from corresponding movies by considering both relevance and cinematic coherence. To benchmark the task, we present strong CLIP-based baselines and a novel VQ-Trans. VQ-Trans first encodes text synopsis and images into a joint embedding space and uses vector quantization (VQ) to improve the visual representation. Then, it auto-regressively generates a sequence of visual features for retrieval and ordering. Experimental results demonstrate that VQ-Trans significantly outperforms prior methods and the CLIP-based baselines. Nevertheless, there is still a large gap compared to human performance suggesting room for promising future work. The code and data are available at: https://ruc-aimind.github.io/projects/TeViS/ This paper introduces TeViS, a novel task focused on automatically generating video storyboards from text synopses by retrieving and ordering relevant images. The task addresses the challenge faced by amateur video creators in translating their creative ideas into professional-looking visual sequences. The authors build a dataset, MovieNet-TeViS, derived from the MovieNet dataset, containing 10k text synopses paired with manually selected keyframes. They propose VQ-Trans, a decoder-only model that uses CLIP for text and image encoding and leverages vector quantization to improve visual representation for sequence generation. VQ-Trans significantly outperforms several baselines, including those based on CLIP and existing story-to-image retrieval models. Employing Vector Quantization for image discretization and a decoder-only architecture shows substantial improvement in ordering accuracy. A considerable gap still exists between VQ-Trans and human performance, indicating a large space for future development. The current dataset and model don't fully encapsulate intricate cinematic styles like camera angles and movements, which limits the professional quality of generated storyboards. Future work could focus on incorporating these nuanced elements and explore more advanced generative models to further bridge the gap to human performance. video storyboarding, text-to-image retrieval, sequence generation, vector quantization, movienet